Cleaning, reshaping, and plotting BART time series data with pandas

Introduction

I recently starting collecting data from the BART API, specifically estimated time to departure for trains at the two stations I use most frequently. In this notebook I'll show how I parsed the data from a csv file, reshaped it to fit the questions at hand, and made a few plots. Download notebook.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import datetime

def prettify_axis(ax, ylabel='', xlabel=''):
    label_format_dict = dict(fontsize=20, fontweight='bold')
    tick_format_dict = dict(labelsize=16, direction='out', top='off', right='off', 
                            length=4, width=1)
    ax.set_xlabel(xlabel, label_format_dict)
    ax.set_ylabel(ylabel, label_format_dict)
    ax.tick_params(**tick_format_dict)

The file plza.csv contains the data obtained from the BART API for trains leaving the El Cerrito Plaza station with the timestamp, train destination, train direction, number of cars, and estimated departure time.

In [2]:
!head data/plza.csv
time,dest,dir,len,etd
1417108837.85,Fremont,South,6,3
1417108837.85,Richmond,North,6,10
1417108921.12,Fremont,South,6,2
1417108921.12,Richmond,North,6,9
1417108981.46,Fremont,South,6,Leaving
1417108981.46,Richmond,North,6,7
1417109041.84,Fremont,South,6,20
1417109041.84,Richmond,North,6,6
1417109101.18,Fremont,South,6,19

Parsing the data

To convert the timestamp into a pandas datetime I wrote a custom date parsing funcion, which also converts to Pacific time.

In [3]:
def parse_time(timestamp):
    try:
        dt = pd.to_datetime(float(timestamp), unit='s')
        return dt.tz_localize('UTC').tz_convert('US/Pacific')
    except AttributeError, ValueError:
        return pd.NaT

Here I read in the data using the date parser defined above. I also do a bit of cleanup, replacing data points where the train is "Leaving" with an estimated departure time of 0 minutes.

In [4]:
df = pd.read_csv('data/plza.csv', parse_dates=['time'], date_parser=parse_time)
df['etd'] = df['etd'].replace('Leaving', 0).astype(np.float) 
df.head()
Out[4]:
time dest dir len etd
0 2014-11-27 09:20:37.850000-08:00 Fremont South 6 3
1 2014-11-27 09:20:37.850000-08:00 Richmond North 6 10
2 2014-11-27 09:22:01.120000-08:00 Fremont South 6 2
3 2014-11-27 09:22:01.120000-08:00 Richmond North 6 9
4 2014-11-27 09:23:01.460000-08:00 Fremont South 6 0

I take the Millbrae train to work, so I'm most interested in its data. Here I filter the DataFrame to keep only data for trains going to Millbrae.

In [5]:
df_mill = df[df['dest'] == 'Millbrae']
df_mill.head(3)
Out[5]:
time dest dir len etd
1992 2014-11-28 03:29:01.120000-08:00 Millbrae South 5 52
1994 2014-11-28 03:30:01.460000-08:00 Millbrae South 5 51
1996 2014-11-28 03:31:01.810000-08:00 Millbrae South 5 50

Reshaping the data

To investigate the daily variability in estimated departure times, I want each date be a column of data with the time of day as the index. To do this I'll do a few transformations on the time column to extract the dates and the time of day, then use a pivot table to reshape the data.

In [6]:
df_mill['time_of_day'] = df['time'].apply(lambda x: datetime.time(x.time().hour, x.time().minute))
df_mill['date'] = df['time'].apply(lambda x: x.date())
mill_pivot = df_mill.pivot(index='time_of_day', columns='date', values='etd')
mill_pivot.ix[:3, :4]
Out[6]:
date 2014-11-28 2014-12-01 2014-12-02 2014-12-03
time_of_day
03:29:00 52 52 52 52
03:30:00 51 51 51 51
03:31:00 50 50 50 50

Plots

Now I'll plot a line for each day. If the trains are on schedule each line should overlap in a sawtooth pattern. If BART's estimated times are correct the lines should have a slope of 1 everywhere they are differentiable (the estimated time to departure should decrease 1 minute per minute). In the weeks covered by this data the there are a few days where significant deviations occured.

In [7]:
PLOT_ARGS = dict(figsize=(10, 10), cmap='Paired', lw=2, alpha=.5, ylim=[0, 40],
                 xlim=[datetime.time(5, 0), datetime.time(18, 0)])
X_TICKS = [datetime.time(hour, minute) 
            for minute in xrange(0, 60, 20) 
            for hour in xrange(5, 18)]
ax = mill_pivot.plot(**PLOT_ARGS)
prettify_axis(ax, xlabel='Time of day', ylabel='ETD of next train (minutes)')
plt.xticks(X_TICKS, rotation=90);

To focus on the dates where something went wrong, this function emphasizes the line for a given date.

In [8]:
def emphasize_date(pivot_table, date, plot_args=PLOT_ARGS, xticks=X_TICKS):
    ax = pivot_table.plot(**plot_args)
    prettify_axis(ax, xlabel='Time of Day', ylabel='ETD of next train (minutes)')
    ax.set_title(date, fontsize=24)
    for lines in zip(ax.get_lines(), ax.get_legend().get_lines()):
        if lines[0].get_label() == str(date):
            plt.setp(lines, lw=2, zorder=10, alpha=1, color='r')
        else:
            plt.setp(lines, alpha=.5)
    plt.xticks(xticks, rotation=90);
In [9]:
emphasize_date(mill_pivot, datetime.date(2014, 11, 28))

The Millbrae trains are scheduled to come every 15 minutes, below is a histogram of the actual time between trains. In general much of the data is within a few minutes of this ideal, although there are times when the trains are spread out or clustered.

In [10]:
leaving_times = df_mill[df_mill['etd'] == 0]['time']
arrival_diff_minutes = pd.to_datetime(leaving_times).diff() / np.timedelta64(1, 'm')
sele = (arrival_diff_minutes > 2) & (arrival_diff_minutes < 30) 
ax = arrival_diff_minutes[sele].hist(figsize=(8, 6), bins=np.arange(.5, 30, 1))
prettify_axis(ax, xlabel='Minutes between trains', ylabel='# of trains')

Similar Posts



Comments

Links

Social