Pandas date parsing performance

Dates and times are an endless source of hassles for anyone who works with them. In this post I'll look at a performance pitfall I ran into while parsing dates in pandas. Conclusion: create DatetimeIndexes by parsing the data with to_datetime(my_dates, format='my_format').

In [1]:
import numpy as np
import pandas as pd
import random
import time

from datetime import datetime

To get some data to test on, I create a list of dictionaries with strings representing a datetime and an observed value.

In [2]:
# Generate some fake timestamped data
num_rows = int(1e6)
data = np.random.randn(num_rows)
records = []
for i in range(num_rows):
    # Random Unix timestamp between the epoch and now
    timestamp = np.random.randint(0, int(time.time()))
    date_string = datetime.fromtimestamp(timestamp).strftime('%m/%d/%y %H:%M')
    records.append({'timestamp': date_string, 'value': data[i]})
print(records[0])
{'timestamp': '10/12/98 18:03', 'value': 0.49671415301123267}
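As an aside, the same kind of fake data can be built without a Python-level loop by letting pandas convert raw epoch seconds directly. A sketch (the seed and sizes here are arbitrary, not the ones used above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Random Unix timestamps (seconds since the epoch), generated vectorized
ts = rng.integers(0, 1_600_000_000, size=1000)
dates = pd.to_datetime(ts, unit='s')            # interpret ints as seconds
date_strings = dates.strftime('%m/%d/%y %H:%M') # render in the same format
values = rng.standard_normal(1000)
records = [{'timestamp': s, 'value': v} for s, v in zip(date_strings, values)]
```

Note that `unit='s'` skips string parsing entirely, which is faster still when the raw data is already numeric.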

Now I'll parse the random data into a DataFrame.

In [3]:
df = pd.DataFrame(records).set_index('timestamp')
df.head(3)
                   value
timestamp
10/12/98 18:03  0.496714
11/16/88 08:49 -0.138264
03/21/78 21:13  0.647689


I want to create a datetime index so I can calculate the sum of each value by day. Here are two different methods with very different performance.

Method 1: Use the DatetimeIndex constructor on the strings (very slow!):

In [4]:
df_dt_constructor = df.copy()
df_dt_constructor.index = pd.DatetimeIndex(df_dt_constructor.index)
1 loops, best of 3: 2min 31s per loop

Method 2: Parse the datetimes using to_datetime and assign the index (much faster!):

In [5]:
df_parse_with_format = df.copy()
df_parse_with_format.index = pd.to_datetime(df_parse_with_format.index, 
                                            format='%m/%d/%y %H:%M')
1 loops, best of 3: 4.79 s per loop

Parsing the dates with to_datetime and an explicit format runs about 30x faster. Even without a format argument, to_datetime still beats the DatetimeIndex constructor, though only by about 2x; the upside of omitting the format is that inference lets to_datetime handle data with inconsistent date formats.

Method 3: Parse without format (flexible but slow):

In [6]:
df_parse_no_format = df.copy()
df_parse_no_format.index = pd.to_datetime(df_parse_no_format.index)
1 loops, best of 3: 1min 27s per loop

Pandas makes it easy to aggregate the data into different time periods. This code computes the sum of the values for each day.

In [7]:
df_parse_first = df.copy()
df_parse_first.index = pd.to_datetime(df_parse_first.index,
                                      format='%m/%d/%y %H:%M')
daily_sum = df_parse_first.resample('D').sum()
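On a tiny hand-made index the daily aggregation is easy to verify by eye; a minimal sketch (the dates and values here are made up):

```python
import pandas as pd

idx = pd.to_datetime(['01/01/20 00:30', '01/01/20 12:00', '01/02/20 09:15'],
                     format='%m/%d/%y %H:%M')
s = pd.Series([1.0, 2.0, 3.0], index=idx)
# 'D' buckets observations by calendar day before summing
daily = s.resample('D').sum()
# Two days: Jan 1 sums to 1.0 + 2.0 = 3.0, Jan 2 to 3.0
```

Swapping 'D' for 'W' or 'M' gives weekly or monthly sums with no other changes.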

Although we collapsed many records into a few, we can still recover some properties of the original data set. Because each underlying value is standard normal, the sum over a day containing n observations has variance n; the variance of the daily sums therefore estimates the average number of observations per day, and multiplying it by the number of days estimates the total number of observations.

In [8]:
variance = daily_sum['value'].var()
num_days = len(daily_sum)
est_num_original_records = variance * num_days
error_pct = 100 * (est_num_original_records - num_rows) / num_rows
print('Estimated there were {:,.2f} original records'.format(est_num_original_records))
print('Error: {0:.2f}%'.format(error_pct))
Estimated there were 998,176.47 original records
Error: -0.18%
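The variance argument can be sanity-checked in isolation: the sum of n independent standard normals has variance n, so the sample variance of many such sums, times the number of sums, recovers the total observation count. A quick sketch with equal-sized buckets (the bucket sizes are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_bucket, n_buckets = 250, 4000
# Each row is one "day" of standard normal observations; sum each row
sums = rng.standard_normal((n_buckets, n_per_bucket)).sum(axis=1)
# Var(row sum) ~= n_per_bucket, so variance * n_buckets ~= total observations
estimate = sums.var(ddof=1) * n_buckets
```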
