[ACCEPTED]-pandas rolling sum of last five minutes-time-series
In general, if the dates are completely arbitrary, I think you would be forced to use a Python for-loop over the rows or use df.apply (which, under the hood, also uses a Python loop).
However, if your Dates share a common frequency, as is the case above, then there is a trick which should be much quicker than using df.apply: expand the time series according to the common frequency (in this case, 1 minute), fill in the NaNs with zeros, and then call rolling_sum:
In [279]: pd.rolling_sum(df.set_index(['Date']).asfreq('1T').fillna(0), window=5, min_periods=1).reindex(df['Date'])
Out[279]:
                      A
Date
2014-11-21 11:00:00   1
2014-11-21 11:03:00   5
2014-11-21 11:04:00   6
2014-11-21 11:05:00   7
2014-11-21 11:07:00  11
2014-11-21 11:08:00   8
2014-11-21 11:12:00   2
2014-11-21 11:13:00   3
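Note that pd.rolling_sum was deprecated and later removed from pandas, so the line above only runs on old releases. Below is a minimal sketch of the same expand, fill and sum trick using the method-based .rolling(...).sum() API. The helper name rolling_sum_asfreq and the inline sample frame (rebuilt from the values shown in this answer) are my own illustration, and '1min' is used in place of '1T', which newer releases warn about.

```python
import pandas as pd

# Sample data rebuilt from the Date / A values shown in this answer.
df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2014-11-21 11:00:00', '2014-11-21 11:03:00',
        '2014-11-21 11:04:00', '2014-11-21 11:05:00',
        '2014-11-21 11:07:00', '2014-11-21 11:08:00',
        '2014-11-21 11:12:00', '2014-11-21 11:13:00',
    ]),
    'A': [1, 4, 1, 2, 4, 1, 1, 2],
})

def rolling_sum_asfreq(df, freq='1min', window=5):
    """Hypothetical helper: rolling sum over `window` steps of `freq`."""
    # Expand to a regular grid; minutes without data become NaN, then 0.
    dense = df.set_index('Date')['A'].asfreq(freq).fillna(0)
    # Sum over the last `window` grid points, including the current one.
    summed = dense.rolling(window=window, min_periods=1).sum()
    # Keep only the timestamps that were present in the original frame.
    return summed.reindex(df['Date'])

result = rolling_sum_asfreq(df)
print(result)
```

The result comes back as floats because asfreq introduces NaNs before they are filled with zeros.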
Of course, any time series has a common frequency if you are willing to accept a small enough granularity, but the required size of df.asfreq(...) may make this trick impractical.
Here is an example of the more general approach using df.apply. Note that calling searchsorted relies on df['Date'] being in sorted order.
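import numpy as np
import pandas as pd

# A regex separator requires the Python parser engine; the raw string avoids an invalid-escape warning
df = pd.read_csv('data', parse_dates=[0], sep=r',\s*', engine='python')

# For each row, find the first position whose Date is later than (Date - 5 minutes)
start_dates = df['Date'] - pd.Timedelta(minutes=5)
df['start_index'] = df['Date'].values.searchsorted(start_dates, side='right')
df['end_index'] = np.arange(len(df))

def sum_window(row):
    # Sum A over positions start_index..end_index (inclusive)
    return df['A'].iloc[row['start_index']:row['end_index']+1].sum()

df['rolling_sum'] = df.apply(sum_window, axis=1)
print(df[['Date', 'A', 'rolling_sum']])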
yields
                 Date  A  rolling_sum
0 2014-11-21 11:00:00  1            1
1 2014-11-21 11:03:00  4            5
2 2014-11-21 11:04:00  1            6
3 2014-11-21 11:05:00  2            7
4 2014-11-21 11:07:00  4           11
5 2014-11-21 11:08:00  1            8
6 2014-11-21 11:12:00  1            2
7 2014-11-21 11:13:00  2            3
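The same searchsorted boundaries can also be combined with a cumulative sum, which avoids the per-row Python call entirely. This is only a sketch of that variation, not part of the approach above: the names csum, start and end are mine, the sample frame is rebuilt from the table just shown, and it still assumes the dates are sorted.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the Date / A values shown in this answer.
df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2014-11-21 11:00:00', '2014-11-21 11:03:00',
        '2014-11-21 11:04:00', '2014-11-21 11:05:00',
        '2014-11-21 11:07:00', '2014-11-21 11:08:00',
        '2014-11-21 11:12:00', '2014-11-21 11:13:00',
    ]),
    'A': [1, 4, 1, 2, 4, 1, 1, 2],
})

dates = df['Date'].to_numpy()  # must already be in sorted order

# First position whose date is strictly later than (date - 5 minutes).
start = np.searchsorted(dates, dates - np.timedelta64(5, 'm'), side='right')
# Exclusive end position: one past the current row.
end = np.arange(len(df)) + 1

# csum[k] holds the sum of the first k values of A, so csum[0] == 0.
csum = np.concatenate(([0], df['A'].to_numpy().cumsum()))

# The sum over positions start..end-1 is a difference of two prefix sums.
df['rolling_sum'] = csum[end] - csum[start]
print(df[['Date', 'A', 'rolling_sum']])
```

With integer data the prefix-sum differences are exact; with floats, rounding error can accumulate over long series, so the per-window sum may be the safer choice there.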
Here is a benchmark comparing the df.asfreq trick versus calling df.apply:
import numpy as np
import pandas as pd

# A regex separator requires the Python parser engine; the raw string avoids an invalid-escape warning
df = pd.read_csv('data', parse_dates=[0], sep=r',\s*', engine='python')

def big_df(df):
    # Double the frame 7 times, shifting each copy past the end of the previous dates
    df = df.copy()
    for i in range(7):
        dates = df['Date'] + pd.Timedelta(df.iloc[-1]['Date']-df.iloc[0]['Date']) + pd.Timedelta('1 minute')
        df2 = pd.DataFrame({'Date': dates, 'A': df['A']})
        df = pd.concat([df, df2])
        df = df.reset_index(drop=True)
    return df

def using_apply():
    start_dates = df['Date'] - pd.Timedelta(minutes=5)
    df['start_index'] = df['Date'].values.searchsorted(start_dates, side='right')
    df['end_index'] = np.arange(len(df))
    def sum_window(row):
        return df['A'].iloc[row['start_index']:row['end_index']+1].sum()
    df['rolling_sum'] = df.apply(sum_window, axis=1)
    return df[['Date', 'rolling_sum']]

def using_asfreq():
    # pd.rolling_sum is the API of the pandas version timed here;
    # later releases spell this .rolling(window=5, min_periods=1).sum()
    result = (pd.rolling_sum(
        df.set_index(['Date']).asfreq('1T').fillna(0),
        window=5, min_periods=1).reindex(df['Date']))
    return result
In [364]: df = big_df(df)
In [367]: %timeit using_asfreq()
1000 loops, best of 3: 1.21 ms per loop
In [368]: %timeit using_apply()
1 loops, best of 3: 208 ms per loop
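For what it is worth, newer pandas releases (0.19 and later, as far as I know) also accept a time offset as the rolling window, which handles irregular timestamps without expanding the series or looping over rows. The following is a sketch under that assumption, not something timed in the benchmark above; the sample frame is again rebuilt from the values shown in this answer.

```python
import pandas as pd

# Sample data rebuilt from the Date / A values shown in this answer.
df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2014-11-21 11:00:00', '2014-11-21 11:03:00',
        '2014-11-21 11:04:00', '2014-11-21 11:05:00',
        '2014-11-21 11:07:00', '2014-11-21 11:08:00',
        '2014-11-21 11:12:00', '2014-11-21 11:13:00',
    ]),
    'A': [1, 4, 1, 2, 4, 1, 1, 2],
})

# Offset windows need a sorted (monotonic) DatetimeIndex.
s = df.sort_values('Date').set_index('Date')['A']

# By default a '5min' window covers (t - 5 minutes, t]: the left edge is
# excluded and the current row is included, matching the results above.
rolling_sum = s.rolling('5min').sum()
print(rolling_sum)
```

The default interval is closed on the right only, which is why the row at 11:05 does not include the value from 11:00.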