Time Series Aggregation in Pandas
A Comprehensive Guide Using Groupby, Grouper, and Resample
If you are working with time series data, you may need to group the records by various frequencies and perform some kind of aggregation or computation. These frequencies may range from months and weeks down to minutes. Pandas provides more than one way to group the records in a DataFrame.
dataset - https://raw.githubusercontent.com/kapilgarg/datasets/main/abc.csv
This is one day's stock price data for a hypothetical abc Inc, collected at one-second intervals.
First, load the dataset:
>>> import pandas as pd
>>> df = pd.read_csv(r"https://raw.githubusercontent.com/kapilgarg/datasets/main/abc.csv", parse_dates=['time'])
>>> df.head()
price volume time
0 3665.85 3037260 2024-01-04 15:08:02
1 3666.15 3037479 2024-01-04 15:08:03
2 3665.85 3037479 2024-01-04 15:08:03
3 3665.85 3037713 2024-01-04 15:08:04
4 3666.10 3037713 2024-01-04 15:08:04
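Since everything below relies on time being a real datetime column, it is worth a quick check that parse_dates worked; these dtypes are what I'd expect for this file:
>>> df.dtypes
price            float64
volume             int64
time      datetime64[ns]
dtype: object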
Say we want to recreate this dataset at a 1-minute interval (instead of every second), with price as the mean and volume as the sum over each minute's window.
With groupby (only)
You could use the dt accessor to get the minute component of time. This method is not very flexible: it is the same as creating another column holding the minute component and grouping on that. It does not really treat the data as a time series; in particular, records from the same minute of different hours would land in the same group (see the toy example after the output below).
>>> df.groupby(df.time.dt.minute).agg({'price':'mean','volume':'sum'})
price volume
time
8 3664.440206 295258900
9 3665.055556 275862638
10 3665.539506 250020518
11 3666.129878 254552164
13 3666.443158 299523531
14 3665.159524 267358683
15 3666.131410 250088741
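To make that pitfall concrete, here is a minimal sketch with a toy frame (hypothetical values): rows from 10:08 and 11:08 collapse into one group because only the minute component is compared.
>>> toy = pd.DataFrame({
...     'time': pd.to_datetime(['2024-01-04 10:08:30', '2024-01-04 11:08:45']),
...     'price': [100.0, 200.0],
... })
>>> # both rows share minute 8, so two different hours end up in one group
>>> toy.groupby(toy.time.dt.minute)['price'].mean()
time
8    150.0
Name: price, dtype: float64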
With groupby and Grouper
Pandas also provides the Grouper object, which is much more flexible. It is used along with groupby. If your DataFrame already has a datetime-like column as its index, you can just give Grouper a frequency and it will group the data on the index. If you do not want to set a datetime-like column as the index, pass its name as the key parameter. The result is the same either way.
>>> df.groupby(pd.Grouper(key='time',freq='1min')).agg({'price':'mean','volume':'sum'})
price volume
time
2024-01-04 15:08:00 3664.440206 295258900
2024-01-04 15:09:00 3665.055556 275862638
2024-01-04 15:10:00 3665.539506 250020518
2024-01-04 15:11:00 3666.129878 254552164
In this example, for every 1-minute window, it aggregates price as the mean of all prices and volume as the sum of all volumes in that window.
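As mentioned above, when the datetime column is the index you can drop key entirely; a sketch of the equivalent call (output matches the table above):
>>> # same grouping, with time as the index and no key argument
>>> df.set_index('time').groupby(pd.Grouper(freq='1min')).agg({'price':'mean','volume':'sum'})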
The freq parameter controls how the data is grouped. You can provide any number of minutes, seconds, weeks, etc. and it groups accordingly (provided you have the data). More on freq in the pandas time series user guide linked in the references below.
>>> df.groupby(pd.Grouper(key='time',freq='5min')).agg({'price':'mean','volume':'sum'})
price volume
time
2024-01-04 15:05:00 3664.736364 571121538
2024-01-04 15:10:00 3666.000230 1359137902
2024-01-04 15:15:00 3664.740525 1428817569
2024-01-04 15:20:00 3665.729691 1543825803
2024-01-04 15:25:00 3666.779657 1442752971
The example above uses a 5-minute grouping window.
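Any valid offset alias works the same way; for example (outputs omitted, assuming the same dataset):
>>> df.groupby(pd.Grouper(key='time',freq='30s')).agg({'price':'mean','volume':'sum'})  # 30-second windows
>>> df.groupby(pd.Grouper(key='time',freq='1h')).agg({'price':'mean','volume':'sum'})   # hourly windows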
origin - You can also control the starting point from which the grouping windows are created. By default, windows of the given frequency are laid out from the start of the day; the origin parameter changes that starting point and thereby determines which window a particular record falls into. Say your data starts at 15:08 but you still want windows aligned to 15:00: pass origin accordingly. Note that in this case the output matches the default, because 5-minute windows counted from midnight already fall on 15:00, 15:05, and so on; an origin that is not aligned this way actually shifts the windows, as sketched after the output below.
>>> df.groupby(pd.Grouper(key='time',freq='5min',origin='2024-01-04 15:00:00')).agg({'price':'mean','volume':'sum'})
price volume
time
2024-01-04 15:05:00 3664.736364 571121538
2024-01-04 15:10:00 3666.000230 1359137902
2024-01-04 15:15:00 3664.740525 1428817569
2024-01-04 15:20:00 3665.729691 1543825803
2024-01-04 15:25:00 3666.779657 1442752971
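For instance, anchoring origin on the data's first minute (15:08, taken from the head of the dataset above) shifts every window boundary; a sketch, output omitted:
>>> # windows now begin at 15:08, 15:13, 15:18, ... instead of 15:05, 15:10, ...
>>> df.groupby(pd.Grouper(key='time',freq='5min',origin='2024-01-04 15:08:00')).agg({'price':'mean','volume':'sum'})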
With resample
resample is another API on DataFrame for converting data to another frequency. With resample, you can either set a datetime-like column as the index or pass that column's name via the on parameter. The first parameter is the frequency for resampling. (Recent pandas versions prefer the 'min' alias over the older, now-deprecated 'T'.)
>>> df.resample('5min',on='time').agg({'price':'mean','volume':'sum'})
price volume
time
2024-01-04 15:05:00 3664.736364 571121538
2024-01-04 15:10:00 3666.000230 1359137902
2024-01-04 15:15:00 3664.740525 1428817569
2024-01-04 15:20:00 3665.729691 1543825803
2024-01-04 15:25:00 3666.779657 1442752971
The result of using resample or groupby with Grouper is the same.
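Equivalently, if time is already set as the index, the on parameter can be dropped; a sketch (same output as above):
>>> # same aggregation, using the index instead of the on parameter
>>> df.set_index('time').resample('5min').agg({'price':'mean','volume':'sum'})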
With resample, you could also upsample the data, for example converting the per-second dataset to 100-millisecond resolution. Once you do that, you will have records with NaN, and you can fill those values using the various methods available.
>>> # removing duplicate records from time since we need to set it as index
>>> df.drop_duplicates('time').set_index('time').resample('100ms').asfreq()
price volume
time
2024-01-04 15:08:02.000 3665.85 3037260.0
2024-01-04 15:08:02.100 NaN NaN
2024-01-04 15:08:02.200 NaN NaN
2024-01-04 15:08:02.300 NaN NaN
2024-01-04 15:08:02.400 NaN NaN
... ... ...
2024-01-04 15:58:13.600 NaN NaN
2024-01-04 15:58:13.700 NaN NaN
2024-01-04 15:58:13.800 NaN NaN
2024-01-04 15:58:13.900 NaN NaN
In the example above, we have upsampled to 100 milliseconds and get NaN wherever no data is available. This can be handled with an appropriate strategy; here I'll just use forward fill.
>>> df.drop_duplicates('time').set_index('time').resample('100ms').ffill()
price volume
time
2024-01-04 15:08:02.000 3665.85 3037260
2024-01-04 15:08:02.100 3665.85 3037260
2024-01-04 15:08:02.200 3665.85 3037260
2024-01-04 15:08:02.300 3665.85 3037260
2024-01-04 15:08:02.400 3665.85 3037260
... ... ...
2024-01-04 15:58:13.600 3666.80 3598132
2024-01-04 15:58:13.700 3666.80 3598132
2024-01-04 15:58:13.800 3666.80 3598132
2024-01-04 15:58:13.900 3666.80 3598132
2024-01-04 15:58:14.000 3666.80 3598144
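Forward fill is only one option. As a sketch of an alternative, you could interpolate linearly between the known points instead (note that this also interpolates volume, which may not be meaningful for a cumulative counter):
>>> # alternative: fill gaps by linear interpolation rather than carrying values forward
>>> df.drop_duplicates('time').set_index('time').resample('100ms').interpolate(method='linear')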
The best method really depends on the use case, familiarity and preference.
Thanks for reading...
References:
https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Grouper.html#pandas-grouper
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html