Writing and debugging joins can be especially difficult when dealing with data from text files. In some cases there is no resulting data, or (much harder to notice!) a few lines that should be included are dropped. Here I'll go into an example of a failed join in pandas, and how to fix it.
Consider trying to join the following two csv files on the city_id column.
$ cat city-names.csv city_id,city_name 1,Dublin 2,Pleasanton 3,Millbrae 4,Richmond $ cat city-populations.csv city_id,city_population 1,52105 2,74110 3,22424 4,107571 Unknown,116768
Loading the data
Read the data into pandas with
In : import pandas as pd In : city_names_df = pd.read_csv('city-names.csv') In : city_pop_df = pd.read_csv('city-populations.csv') In : city_names_df Out: city_id city_name 0 1 Dublin 1 2 Pleasanton 2 3 Millbrae 3 4 Richmond In : city_pop_df Out: city_id city_population 0 1 52105 1 2 74110 2 3 22424 3 4 107571 4 Unknown 116768
Attempting the join
If we attempt to join the two DataFrames on their shared columns (
city_id in this case), the result is empty, although we would expect ids 1-4 to match:
In : city_names_df.merge(city_pop_df) Out: Empty DataFrame Columns: [city_id, city_name, city_population] Index: 
What went wrong
What's going on? The problem is that the value "Unknown" in
city-populations.csv forces that column to be parsed as string values, which then don't equate with the their matching values in
city-names.csv. This can be seen by inspecting the data types of the DataFrames.
In : city_names_df.dtypes Out: city_id int64 city_name object dtype: object In : city_pop_df.dtypes Out: city_id object city_population int64 dtype: object
Notice that the
city_id column has a numeric type (
int64) in one DataFrame and
object in the other. Looking in more detail:
In : city_names_df.ix[0, 'city_id'] Out: 1 In : type(city_names_df.ix[0, 'city_id']) Out: numpy.int64 In : city_pop_df.ix[0, 'city_id'] Out: '1' In : type(city_pop_df.ix[0, 'city_id']) Out: str In : city_names_df.ix[0, 'city_id'] == city_pop_df.ix[0, 'city_id'] Out: False
Which option is best will depend on the specifics of how your data is dirty.
Option 1. Apply a parsing function to parse the data:
import numpy as np def parse(x): try: return int(x) except ValueError: return np.nan In : city_pop_df['city_id'] = city_pop_df['city_id'].apply(parse) In : city_pop_df.dtypes Out: city_id float64 city_population int64 dtype: object In : city_names_df.merge(city_pop_df) Out: city_id city_name city_population 0 1 Dublin 52105 1 2 Pleasanton 74110 2 3 Millbrae 22424 3 4 Richmond 107571
Option 2. Search for non-number string and replace when with NaN:
In : import re In : replaced_with_nan = city_pop_df['city_id'].replace(re.compile('\D+'), np.nan) In : city_pop_df['city_id'] = replaced_with_nan.astype(np.float) In : city_names_df.merge(city_pop_df) Out: city_id city_name city_population 0 1 Dublin 52105 1 2 Pleasanton 74110 2 3 Millbrae 22424 3 4 Richmond 107571
Option 3. Make a new file with only the good rows:
$ # keep the column headers $ head -n1 city-populations.csv > cleaned-city-populations.csv $ grep -E '^[0-9]+,' city-populations.csv >> cleaned-city-populations.csv
Load new the file:
In : clean_pop_df = pd.read_csv('cleaned-city-populations.csv') In : city_names_df.merge(clean_pop_df) Out: city_id city_name city_population 0 1 Dublin 52105 1 2 Pleasanton 74110 2 3 Millbrae 22424 3 4 Richmond 107571
Option 4. Make a new file using csvkit
csvkit provides a familiar command line interface designed to work with csv files. The grep command in option 3 above would be difficult to change to apply to the 31st column in a 50 column wide csv file, but
csvgrep allows columns to be identified by name or number.
$ csvgrep -r '[0-9]+' -c city_id city-populations.csv > cleaned-city-populations2.csv
Load the new file:
In : clean_pop_df = pd.read_csv('cleaned-city-populations2.csv') In : city_names_df.merge(clean_pop_df) Out: city_id city_name city_population 0 1 Dublin 52105 1 2 Pleasanton 74110 2 3 Millbrae 22424 3 4 Richmond 107571
A note on good data science
If you do use shell commands to alter the files, make sure you either keep the commands in a shell script file or call them directly from your python script so that your work can be exactly reproduced given the raw data. Shell commands can be called from IPython either by prefacing the line with
! or with the
%%bash cell magic function.
- Pandas date parsing performance, Score: 0.979
- Pandas 0.16.0 released, Score: 0.953
- Saving time and space by working with gzip and bzip2 compressed files in python, Score: 0.945
- Using sed to make specific text lowercase in place, Score: 0.865
- First Look at AWS Machine Learning, Score: 0.860