# [ACCEPTED]-pandas rolling sum of last five minutes-time-series

Score: 21

In general, if the dates are completely arbitrary, I think you would be forced to use a Python `for` loop over the rows or use `df.apply` (which, under the hood, also uses a Python loop).
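To make the loop idea concrete, here is a minimal sketch. It assumes the sample data can be rebuilt from the tables in this answer, and the names `window`, `in_window` and `loop_sums` are only illustrative. It compares every row against every other row, so it is quadratic and only sensible for small frames:

```python
import pandas as pd

# Sample data rebuilt from the tables in this answer (an assumption).
df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2014-11-21 11:00:00', '2014-11-21 11:03:00',
        '2014-11-21 11:04:00', '2014-11-21 11:05:00',
        '2014-11-21 11:07:00', '2014-11-21 11:08:00',
        '2014-11-21 11:12:00', '2014-11-21 11:13:00']),
    'A': [1, 4, 1, 2, 4, 1, 1, 2],
})

window = pd.Timedelta(minutes=5)
loop_sums = []
for t in df['Date']:
    # Keep the rows whose Date lies in the half-open interval (t - 5 minutes, t].
    in_window = (df['Date'] > t - window) & (df['Date'] <= t)
    loop_sums.append(int(df.loc[in_window, 'A'].sum()))

print(loop_sums)
```

Because the mask is rebuilt for each timestamp, this version does not need the dates to be sorted.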

However, if your Dates share a common frequency, as is the case above, then there is a trick which should be much quicker than using `df.apply`: expand the timeseries according to the common frequency (in this case, 1 minute), fill in the NaNs with zeros, and then call `rolling_sum`:

``````
In [279]: pd.rolling_sum(df.set_index(['Date']).asfreq('1T').fillna(0), window=5, min_periods=1).reindex(df['Date'])
Out[279]:
                      A
Date
2014-11-21 11:00:00   1
2014-11-21 11:03:00   5
2014-11-21 11:04:00   6
2014-11-21 11:05:00   7
2014-11-21 11:07:00  11
2014-11-21 11:08:00   8
2014-11-21 11:12:00   2
2014-11-21 11:13:00   3
``````
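`pd.rolling_sum` was removed from later pandas releases in favour of the `.rolling(...)` method. Here is a sketch of the same trick in that spelling, assuming the sample frame rebuilt from the tables in this answer. The names `expanded` and `modern` are only illustrative, and `'1min'` is used in place of the `'1T'` alias:

```python
import pandas as pd

# Sample data rebuilt from the tables in this answer (an assumption).
df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2014-11-21 11:00:00', '2014-11-21 11:03:00',
        '2014-11-21 11:04:00', '2014-11-21 11:05:00',
        '2014-11-21 11:07:00', '2014-11-21 11:08:00',
        '2014-11-21 11:12:00', '2014-11-21 11:13:00']),
    'A': [1, 4, 1, 2, 4, 1, 1, 2],
})

# Expand to a regular 1-minute grid; minutes with no observation become 0.
expanded = df.set_index('Date').asfreq('1min').fillna(0)

# A window of 5 rows on a 1-minute grid covers the last five minutes.
modern = expanded.rolling(window=5, min_periods=1).sum().reindex(df['Date'])

print(modern)
```

The sums come back as floats, since `asfreq` introduces NaNs before they are filled.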

Of course, any time series has a common frequency if you are willing to accept a small enough granularity, but the required size of `df.asfreq(...)` may make this trick impractical.

Here is an example of the more general approach using `df.apply`. Note that calling `searchsorted` relies on `df['Date']` being in sorted order.

``````
import numpy as np
import pandas as pd

# For each row, locate the first row whose Date lies inside its 5-minute window.
# searchsorted requires df['Date'] to be sorted.
start_dates = df['Date'] - pd.Timedelta(minutes=5)
df['start_index'] = df['Date'].values.searchsorted(start_dates, side='right')
df['end_index'] = np.arange(len(df))

def sum_window(row):
    # Sum A over rows start_index..end_index (inclusive).
    return df['A'].iloc[row['start_index']:row['end_index']+1].sum()

df['rolling_sum'] = df.apply(sum_window, axis=1)

print(df[['Date', 'A', 'rolling_sum']])
``````

yields

``````
                 Date  A  rolling_sum
0 2014-11-21 11:00:00  1            1
1 2014-11-21 11:03:00  4            5
2 2014-11-21 11:04:00  1            6
3 2014-11-21 11:05:00  2            7
4 2014-11-21 11:07:00  4           11
5 2014-11-21 11:08:00  1            8
6 2014-11-21 11:12:00  1            2
7 2014-11-21 11:13:00  2            3
``````
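As a different technique from the two shown here, pandas versions that accept an offset string as the rolling window can compute the same sums directly on the irregular timestamps. This is a sketch under that assumption, again using the sample frame rebuilt from the table above. The name `offset_sums` is illustrative. Like `searchsorted`, it needs the `Date` column to be sorted:

```python
import pandas as pd

# Sample data rebuilt from the table above (an assumption).
df = pd.DataFrame({
    'Date': pd.to_datetime([
        '2014-11-21 11:00:00', '2014-11-21 11:03:00',
        '2014-11-21 11:04:00', '2014-11-21 11:05:00',
        '2014-11-21 11:07:00', '2014-11-21 11:08:00',
        '2014-11-21 11:12:00', '2014-11-21 11:13:00']),
    'A': [1, 4, 1, 2, 4, 1, 1, 2],
})

# A '5min' offset window covers (t - 5 minutes, t] for each row's Date t.
offset_sums = df.rolling('5min', on='Date')['A'].sum()

print(offset_sums)
```

For this data the window boundaries match the `searchsorted(..., side='right')` choice above, so the result should agree with the `rolling_sum` column.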

Here is a benchmark comparing the `df.asfreq` trick versus calling `df.apply`:

``````
import numpy as np
import pandas as pd

def big_df(df):
    # Double the frame 7 times, shifting each copy past the end of the previous data.
    df = df.copy()
    for i in range(7):
        dates = df['Date'] + pd.Timedelta(df.iloc[-1]['Date']-df.iloc[0]['Date']) + pd.Timedelta('1 minute')
        df2 = pd.DataFrame({'Date': dates, 'A': df['A']})
        df = pd.concat([df, df2])
    df = df.reset_index(drop=True)
    return df

def using_apply():
    # Both functions operate on the module-level df.
    start_dates = df['Date'] - pd.Timedelta(minutes=5)
    df['start_index'] = df['Date'].values.searchsorted(start_dates, side='right')
    df['end_index'] = np.arange(len(df))

    def sum_window(row):
        return df['A'].iloc[row['start_index']:row['end_index']+1].sum()

    df['rolling_sum'] = df.apply(sum_window, axis=1)
    return df[['Date', 'rolling_sum']]

def using_asfreq():
    # pd.rolling_sum is the older API; later pandas versions spell this
    # .rolling(window=5, min_periods=1).sum()
    result = (pd.rolling_sum(
        df.set_index(['Date']).asfreq('1T').fillna(0),
        window=5, min_periods=1).reindex(df['Date']))
    return result
``````

``````
In [364]: df = big_df(df)

In [367]: %timeit using_asfreq()
1000 loops, best of 3: 1.21 ms per loop

In [368]: %timeit using_apply()
1 loops, best of 3: 208 ms per loop
``````
