Time Series Aggregation in Pandas

A Comprehensive Guide Using Groupby, Grouper, and Resample

If you are working with time series data, you may need to group the records by various frequencies and perform some kind of aggregation or computation. These frequencies can range from months and weeks down to minutes. Pandas provides more than one way to group the records in a DataFrame.

Dataset: https://raw.githubusercontent.com/kapilgarg/datasets/main/abc.csv

This is one day's stock price data for the hypothetical abc Inc, collected at one-second intervals.

First, load the dataset:

>>> import pandas as pd
>>> df = pd.read_csv("https://raw.githubusercontent.com/kapilgarg/datasets/main/abc.csv", parse_dates=['time'])
>>> df.head()
        price   volume                time
0     3665.85  3037260 2024-01-04 15:08:02
1     3666.15  3037479 2024-01-04 15:08:03
2     3665.85  3037479 2024-01-04 15:08:03
3     3665.85  3037713 2024-01-04 15:08:04
4     3666.10  3037713 2024-01-04 15:08:04

Say we want to recreate this dataset at a 1-minute interval (instead of every second), with the price taken as the mean over each minute's window.

With groupby (only)

You could use the dt accessor to extract the minute component of time. This method is not very flexible: it is the same as creating another column with the minute component and grouping on that. It does not really treat the data as a time series.

>>> df.groupby(df.time.dt.minute).agg({'price':'mean','volume':'sum'})
            price     volume
time
8     3664.440206  295258900
9     3665.055556  275862638
10    3665.539506  250020518
11    3666.129878  254552164
13    3666.443158  299523531
14    3665.159524  267358683
15    3666.131410  250088741
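
One caveat with this approach: because only the minute component is used as the key, records from different hours that share a minute value fall into the same group. A minimal synthetic sketch (hypothetical data, separate from the abc dataset):

>>> demo = pd.DataFrame({
...     'time': pd.to_datetime(['2024-01-04 15:08:10', '2024-01-04 16:08:30']),
...     'price': [100.0, 200.0],
... })
>>> # both rows land in minute-group 8 even though they are an hour apart
>>> demo.groupby(demo.time.dt.minute)['price'].mean()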

With groupby and Grouper

Pandas also provides the Grouper object, which is much more flexible. It is used along with groupby.

If your DataFrame already has a datetime-like column as its index, you can just give Grouper a frequency and it will group the data on that index. If you do not want to set a datetime-like column as the index, pass the column name as the key parameter instead.

The aggregated values are the same as before; the index now shows the start of each window.

>>> df.groupby(pd.Grouper(key='time',freq='1min')).agg({'price':'mean','volume':'sum'})
                           price     volume
time
2024-01-04 15:08:00  3664.440206  295258900
2024-01-04 15:09:00  3665.055556  275862638
2024-01-04 15:10:00  3665.539506  250020518
2024-01-04 15:11:00  3666.129878  254552164

In this example, for every 1-minute window, it aggregates the price as the mean of all prices and the volume as the sum of all volumes in that window.
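
If the time column is already set as the index, the equivalent call needs only the frequency, no key (a sketch against the same df):

>>> df.set_index('time').groupby(pd.Grouper(freq='1min')).agg({'price':'mean','volume':'sum'})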

The freq parameter controls how the data is grouped. You can provide any number of minutes, seconds, weeks, etc. and the grouping works accordingly (provided you have the data). See the pandas timeseries documentation in the references for the full list of frequency aliases.

>>> df.groupby(pd.Grouper(key='time',freq='5min')).agg({'price':'mean','volume':'sum'})
                           price      volume
time
2024-01-04 15:05:00  3664.736364   571121538
2024-01-04 15:10:00  3666.000230  1359137902
2024-01-04 15:15:00  3664.740525  1428817569
2024-01-04 15:20:00  3665.729691  1543825803
2024-01-04 15:25:00  3666.779657  1442752971

Above is another example, this time with a 5-minute grouping window.
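
Any supported frequency alias works the same way; for instance, a 30-second window (a sketch against the same df):

>>> df.groupby(pd.Grouper(key='time',freq='30s')).agg({'price':'mean','volume':'sum'})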

origin - You can also control the starting point from which the grouping windows are created.

By default, the windows of the given frequency are aligned to the start of the day, but if you need to change that, use the origin parameter.

If your data starts at, say, 15:08 but you still want windows aligned to 15:00, origin controls that: it determines which window a particular record falls into.

>>> df.groupby(pd.Grouper(key='time',freq='5min',origin='2024-01-04 15:00:00')).agg({'price':'mean','volume':'sum'})
                           price      volume
time
2024-01-04 15:05:00  3664.736364   571121538
2024-01-04 15:10:00  3666.000230  1359137902
2024-01-04 15:15:00  3664.740525  1428817569
2024-01-04 15:20:00  3665.729691  1543825803
2024-01-04 15:25:00  3666.779657  1442752971
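
Here the output happens to match the default alignment, because 15:00 already sits on a 5-minute boundary relative to the start of the day. To see origin actually shift the bins, pick an offset that does not, say 15:02; the bin edges would then fall at 15:02, 15:07, 15:12, and so on (a sketch):

>>> df.groupby(pd.Grouper(key='time',freq='5min',origin='2024-01-04 15:02:00')).agg({'price':'mean','volume':'sum'})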

With resample

resample is another API on DataFrame with which you can convert data to another frequency. With resample, you can either set a datetime-like column as the index or pass the column name via the on parameter. The first argument is the resampling frequency.

>>> df.resample('5min',on='time').agg({'price':'mean','volume':'sum'})
                           price      volume
time
2024-01-04 15:05:00  3664.736364   571121538
2024-01-04 15:10:00  3666.000230  1359137902
2024-01-04 15:15:00  3664.740525  1428817569
2024-01-04 15:20:00  3665.729691  1543825803
2024-01-04 15:25:00  3666.779657  1442752971

The result of resample is identical to groupby with Grouper.
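
A quick way to check the equivalence (a sketch reusing the two 5-minute aggregations from above):

>>> g1 = df.groupby(pd.Grouper(key='time',freq='5min')).agg({'price':'mean','volume':'sum'})
>>> g2 = df.resample('5min',on='time').agg({'price':'mean','volume':'sum'})
>>> g1.equals(g2)  # expected to be True: same bins, same labels, same aggregations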

With resample, you can also upsample the data, for example converting the second-level dataset to the 100-millisecond level. Once you do that, you will have records with NaN, and you can fill those values using the various methods available.

>>> # removing duplicate records on time since we need to set it as the index
>>> df.drop_duplicates('time').set_index('time').resample('100ms').asfreq()
                           price     volume
time
2024-01-04 15:08:02.000  3665.85  3037260.0
2024-01-04 15:08:02.100      NaN        NaN
2024-01-04 15:08:02.200      NaN        NaN
2024-01-04 15:08:02.300      NaN        NaN
2024-01-04 15:08:02.400      NaN        NaN
...                          ...        ...
2024-01-04 15:58:13.600      NaN        NaN
2024-01-04 15:58:13.700      NaN        NaN
2024-01-04 15:58:13.800      NaN        NaN
2024-01-04 15:58:13.900      NaN        NaN

In the example above, we have upsampled to 100 milliseconds and get NaN wherever no data is available. This can be handled with an appropriate strategy; here I'll just use forward fill.

>>> df.drop_duplicates('time').set_index('time').resample('100ms').ffill()
                           price   volume
time
2024-01-04 15:08:02.000  3665.85  3037260
2024-01-04 15:08:02.100  3665.85  3037260
2024-01-04 15:08:02.200  3665.85  3037260
2024-01-04 15:08:02.300  3665.85  3037260
2024-01-04 15:08:02.400  3665.85  3037260
...                          ...      ...
2024-01-04 15:58:13.600  3666.80  3598132
2024-01-04 15:58:13.700  3666.80  3598132
2024-01-04 15:58:13.800  3666.80  3598132
2024-01-04 15:58:13.900  3666.80  3598132
2024-01-04 15:58:14.000  3666.80  3598144
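
Forward fill is just one strategy; the resampled frame also supports backward fill and interpolation, among others (a sketch):

>>> # linear interpolation between the known points
>>> df.drop_duplicates('time').set_index('time').resample('100ms').interpolate()
>>> # or take the next known value
>>> df.drop_duplicates('time').set_index('time').resample('100ms').bfill()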

The best method really depends on the use case, familiarity and preference.

Thanks for reading...


References:

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Grouper.html#pandas-grouper

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html