Reading gzip files in Python – fast!

I have been doing lots of parsing of Apache log files recently with Python. Speed is of the essence, so I was interested to know if the speed issues mentioned in http://bugs.python.org/issue7471 are still relevant. So I set out to time it, under Python (v2.6.4 and v3.1.2, both on Fedora 13).

My test data consists of 950,000 lines in 5 gzipped Apache log files. I compared the times using the Python gzip module, and the zcat method suggested in http://bugs.python.org/issue7471.

A quick summary of results: yes, the issue is still relevant. The average times are:

For Python 2.6.4

  • gzip.open – 9.5 seconds
  • zcat and pipe – 2.3 seconds

For Python 3.1.2

  • gzip.open – 11.6 seconds
  • zcat and pipe – 2.7 seconds

Below are the details. First, the source code to time the tests:

import os
import sys

if sys.version.startswith("3"):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO

import gzip
import subprocess
import time

dirname = "test"
fl = os.listdir(dirname)

fl.sort()

ttime = [0, 0]
runs = 5

for i in range(2 * runs):
    st = time.time()

    for fn in fl:
        if not fn.endswith(".gz"):
            continue

        cc = 0
        lc = 0
        sz = 0

        fullfn = os.path.join(dirname, fn)
        sz += os.stat(fullfn)[6]
        if i % 2 == 0:
            fh = gzip.GzipFile(fullfn, "r")
        else:
            p = subprocess.Popen(["zcat", fullfn], stdout = subprocess.PIPE)
            fh = io_method(p.communicate()[0])
            assert p.returncode == 0

        for line in fh:
            lc += 1
            cc += len(line)

    et = time.time()
    dt = et - st

    ttime[i % 2] += dt
    print("time-taken = %0.2f seconds, 1000 characters per second = %0.0f, file size per second = %0.0f, character count=%s, line count=%s file size = %s" % (dt, 0.001 * cc / dt, 0.001 * sz / dt, cc, lc, sz))

print("\nAverages")
print("  gzip.open - %0.1f seconds" % (ttime[0] / runs))
print("  zcat and pipe - %0.1f seconds" % (ttime[1] / runs))

Timings for Python 2.6.4:

time-taken = 9.71 seconds, 1000 characters per second = 5237, file size per second = 480, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.28 seconds, 1000 characters per second = 22300, file size per second = 2044, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.48 seconds, 1000 characters per second = 5363, file size per second = 492, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.24 seconds, 1000 characters per second = 22750, file size per second = 2085, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.44 seconds, 1000 characters per second = 5389, file size per second = 494, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.30 seconds, 1000 characters per second = 22156, file size per second = 2031, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.42 seconds, 1000 characters per second = 5397, file size per second = 495, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.26 seconds, 1000 characters per second = 22516, file size per second = 2064, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.61 seconds, 1000 characters per second = 5293, file size per second = 485, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.31 seconds, 1000 characters per second = 22033, file size per second = 2019, character count=50859394, line count=194151 file size = 4661207

Averages
  gzip.open - 9.5 seconds
  zcat and pipe - 2.3 seconds

Timings for Python 3.1.2:

time-taken = 11.65 seconds, 1000 characters per second = 4364, file size per second = 400, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.87 seconds, 1000 characters per second = 17691, file size per second = 1621, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.66 seconds, 1000 characters per second = 4362, file size per second = 400, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.56 seconds, 1000 characters per second = 19834, file size per second = 1818, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.54 seconds, 1000 characters per second = 4408, file size per second = 404, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.64 seconds, 1000 characters per second = 19295, file size per second = 1768, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.58 seconds, 1000 characters per second = 4393, file size per second = 403, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.59 seconds, 1000 characters per second = 19659, file size per second = 1802, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.68 seconds, 1000 characters per second = 4354, file size per second = 399, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.97 seconds, 1000 characters per second = 17123, file size per second = 1569, character count=50859394, line count=194151 file size = 4661207

Averages
  gzip.open - 11.6 seconds
  zcat and pipe - 2.7 seconds
About these ads
This entry was posted in Python, Software. Bookmark the permalink.

7 Responses to Reading gzip files in Python – fast!

  1. insider says:

    cool, i’m doing some researches in parsing syslog logs, do you know any syslog parse module like in Perl Parse::Syslog?

  2. me says:

    I’d:

    – avoid reading the entire file into memory using subprocess.communicate() and cStringIO
    – initialize the cc, lc and sz outside of the internal loop

  3. “me”, thanks for pointing me to subprocess.communicate.

  4. Pingback: How to test a directory of files for gzip and uncompress gzipped files in Python using zcat? | BlogoSfera

  5. devboell says:

    very interesting. How would this work for a remote .gz file? would you call curl from Python?

  6. Note, when you do zcat and pipe (zcat file | python program.py or subprocess) you will use two independent processes and therefore two processor cores will be used if available, while running Python program with gzip.open() will be bound to single core.
    Of course using two independent processes can not explain 3-4 fold time difference ;)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s