Saving time and space by working with gzip and bzip2 compressed files in Python

File compression tools like gzip and bzip2 can compress text files to a fraction of their original size, often to as little as 20%. Data files often come compressed to save storage space and network bandwidth. A typical workflow is to uncompress the file before analysis, but it can be more convenient to leave the file in its compressed form, especially if the uncompressed file would take up a significant amount of space. In this post I'll show how to work directly with compressed files in Python.

Compression Ratios

Let's look at a small CSV file containing data on National Parks (originally from Wikipedia). The uncompressed file is 3.1 kB.

In [1]:
!head -n4 data/nationalparks.csv
Name,Location,Year Established,Area
Acadia National Park,Maine,1919,48876.58
National Park of American Samoa,American Samoa,1988,8256.67
Arches National Park,Utah,1971,76678.98
In [2]:
!ls -lh data/nationalparks.csv
-rw-r--r--    1 Frank    Administ     3.1k Mar  7 21:56 data/nationalparks.csv

These commands compress the file so we can see the difference in size. The -k option tells bzip2 to keep the original file; only recent versions of gzip support this option, so the gzip command is written a bit differently, using redirection instead.

In [3]:
# Remove any earlier compressed copy so bzip2 doesn't refuse to overwrite it
!rm -f data/nationalparks.csv.bz2
In [4]:
%%bash
gzip < data/nationalparks.csv > data/nationalparks.csv.gz
bzip2 -k data/nationalparks.csv
ls -lh data/nationalparks*
-rw-r--r--    1 Frank    Administ     3.1k Mar  7 21:56 data/nationalparks.csv
-rw-r--r--    1 Frank    Administ     1.2k Mar  7 21:56 data/nationalparks.csv.bz2
-rw-r--r--    1 Frank    Administ     1.3k Mar 15 15:07 data/nationalparks.csv.gz
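
The same comparison can be made from within Python using the standard library's zlib and bz2 modules; zlib implements the DEFLATE algorithm that gzip uses, minus the gzip file header. A minimal sketch:

import bz2
import zlib

with open('data/nationalparks.csv', 'rb') as f:
    data = f.read()

# zlib level 9 corresponds to gzip --best; bz2.compress defaults to
# its highest compression setting
print 'original:', len(data), 'bytes'
print 'zlib:    ', len(zlib.compress(data, 9)), 'bytes'
print 'bz2:     ', len(bz2.compress(data)), 'bytes'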

In general, bzip2 compresses slightly better than gzip but is significantly slower, so for everyday use I find gzip preferable. Now on to Python! The function below computes the total area of all the National Parks from the uncompressed file.

In [5]:
import csv

def sum_area(f):
    reader = csv.reader(f)
    next(reader)  # skip the header line
    # stream rows rather than reading the whole file into memory
    return sum(float(row[3]) for row in reader)

def total_area_uncompressed(filename):
    with open(filename) as f:
        return sum_area(f)

total = total_area_uncompressed('data/nationalparks.csv')
print 'Total National Park Area = {:,} acres'.format(total)
Total National Park Area = 52,096,299.61 acres

To accomplish the same thing with compressed files, we can use the gzip and bz2 modules from the standard library:

In [6]:
import bz2
import gzip

def total_area_gzip(filename):
    with gzip.GzipFile(filename) as f:
        return sum_area(f)

def total_area_bz2(filename):
    with bz2.BZ2File(filename) as f:
        return sum_area(f)

print 'Total National Park Area = {:,} acres'.format(
    total_area_gzip('data/nationalparks.csv.gz')
)
print 'Total National Park Area = {:,} acres'.format(
    total_area_bz2('data/nationalparks.csv.bz2')
)
Total National Park Area = 52,096,299.61 acres
Total National Park Area = 52,096,299.61 acres
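
Both modules can write compressed files too, so results can go straight to disk in compressed form without an uncompressed copy ever existing. A minimal sketch (the output filename is just for illustration):

import gzip

# Write a gzip-compressed copy of the file directly from Python
with open('data/nationalparks.csv', 'rb') as src:
    with gzip.GzipFile('data/nationalparks-copy.csv.gz', 'wb') as dst:
        dst.write(src.read())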

Now our code operates on the compressed files directly. The downside is that working with the compressed files takes longer (for larger files the difference will be more significant). Keep in mind, though, that the cost of decompressing the file has to be paid at some point, either before or during the analysis, so as long as you're not running the analysis many times the overhead is minimal.

In [7]:
%%timeit
total_area_uncompressed('data/nationalparks.csv')
1000 loops, best of 3: 285 µs per loop
In [8]:
%%timeit
total_area_gzip('data/nationalparks.csv.gz')
1000 loops, best of 3: 510 µs per loop
In [9]:
%%timeit
total_area_bz2('data/nationalparks.csv.bz2')
1000 loops, best of 3: 347 µs per loop

We could also write a single function that opens the file appropriately based on its extension, saving us from juggling three separate functions.

In [10]:
def opener(filename):
    if filename.endswith('.gz'):
        return gzip.GzipFile(filename)
    elif filename.endswith('.bz2'):
        return bz2.BZ2File(filename)
    else:
        return open(filename)

for extension in ['', '.gz', '.bz2']:
    filename = 'data/nationalparks.csv' + extension
    print 'Reading {}'.format(filename)
    with opener(filename) as f:
        print 'Total National Park Area = {:,} acres'.format(sum_area(f))
Reading data/nationalparks.csv
Total National Park Area = 52,096,299.61 acres
Reading data/nationalparks.csv.gz
Total National Park Area = 52,096,299.61 acres
Reading data/nationalparks.csv.bz2
Total National Park Area = 52,096,299.61 acres
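
If more formats need to be supported, a dictionary mapping extensions to openers scales a bit more cleanly than an if/elif chain. A sketch of the same idea (the OPENERS name is mine, not from any library):

import os

# Map file extensions to opener callables; fall back to plain open()
OPENERS = {'.gz': gzip.GzipFile, '.bz2': bz2.BZ2File}

def opener(filename):
    ext = os.path.splitext(filename)[1]
    return OPENERS.get(ext, open)(filename)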

Of course, using library functions is preferable when possible. Happily, pandas supports reading compressed files via the compression= parameter of read_csv().

In [11]:
import pandas as pd

npdf = pd.read_csv('data/nationalparks.csv.bz2', compression='bz2')
npdf.head()
Out[11]:
                              Name        Location  Year Established       Area
0             Acadia National Park           Maine              1919   48876.58
1  National Park of American Samoa  American Samoa              1988    8256.67
2             Arches National Park            Utah              1971   76678.98
3           Badlands National Park    South Dakota              1978  242755.94
4           Big Bend National Park           Texas              1944  801163.21
In [12]:
print 'Total National Park Area = {:,} acres'.format(npdf['Area'].sum())
Total National Park Area = 52,096,299.61 acres
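
The gzip file works the same way with compression='gzip', and recent pandas versions default to compression='infer', which picks the decompressor from the file extension. A quick sketch (npdf_gz is just an illustrative name):

# gzip works the same way as bz2 in read_csv
npdf_gz = pd.read_csv('data/nationalparks.csv.gz', compression='gzip')
print 'Total National Park Area = {:,} acres'.format(npdf_gz['Area'].sum())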
