I have been doing a lot of Apache log file parsing recently in Python. Speed is of the essence, so I was interested to know whether the speed issues mentioned in http://bugs.python.org/issue7471 are still relevant. So I set out to time it, under Python v2.6.4 and v3.1.2, both on Fedora 13.
My test data consists of 950,000 lines across 5 gzipped Apache log files. I compared the times for the Python gzip module against the zcat method suggested in http://bugs.python.org/issue7471.
A quick summary of results: yes, the issue is still relevant. The average times are:
For Python 2.6.4
- gzip.open – 9.5 seconds
- zcat and pipe – 2.3 seconds
For Python 3.1.2
- gzip.open – 11.6 seconds
- zcat and pipe – 2.7 seconds
Below are the details. First, the source code to time the tests:
import os
import sys

if sys.version.startswith("3"):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO
import gzip
import subprocess
import time

dirname = "test"
fl = os.listdir(dirname)
fl.sort()
ttime = [0, 0]
runs = 5
for i in range(2 * runs):
    st = time.time()
    for fn in fl:
        if not fn.endswith(".gz"):
            continue
        cc = 0
        lc = 0
        sz = 0
        fullfn = os.path.join(dirname, fn)
        sz += os.stat(fullfn)[6]
        if i % 2 == 0:
            fh = gzip.GzipFile(fullfn, "r")
        else:
            p = subprocess.Popen(["zcat", fullfn], stdout=subprocess.PIPE)
            fh = io_method(p.communicate()[0])
            assert p.returncode == 0
        for line in fh:
            lc += 1
            cc += len(line)
    et = time.time()
    dt = et - st
    ttime[i % 2] += dt
    print("time-taken = %0.2f seconds, 1000 characters per second = %0.0f, file size per second = %0.0f, character count=%s, line count=%s file size = %s" % (dt, 0.001 * cc / dt, 0.001 * sz / dt, cc, lc, sz))
print("\nAverages")
print(" gzip.open - %0.1f seconds" % (ttime[0] / runs))
print(" zcat and pipe - %0.1f seconds" % (ttime[1] / runs))
Timings for Python 2.6.4:
time-taken = 9.71 seconds, 1000 characters per second = 5237, file size per second = 480, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.28 seconds, 1000 characters per second = 22300, file size per second = 2044, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.48 seconds, 1000 characters per second = 5363, file size per second = 492, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.24 seconds, 1000 characters per second = 22750, file size per second = 2085, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.44 seconds, 1000 characters per second = 5389, file size per second = 494, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.30 seconds, 1000 characters per second = 22156, file size per second = 2031, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.42 seconds, 1000 characters per second = 5397, file size per second = 495, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.26 seconds, 1000 characters per second = 22516, file size per second = 2064, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.61 seconds, 1000 characters per second = 5293, file size per second = 485, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.31 seconds, 1000 characters per second = 22033, file size per second = 2019, character count=50859394, line count=194151 file size = 4661207

Averages
 gzip.open - 9.5 seconds
 zcat and pipe - 2.3 seconds
Timings for Python 3.1.2:
time-taken = 11.65 seconds, 1000 characters per second = 4364, file size per second = 400, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.87 seconds, 1000 characters per second = 17691, file size per second = 1621, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.66 seconds, 1000 characters per second = 4362, file size per second = 400, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.56 seconds, 1000 characters per second = 19834, file size per second = 1818, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.54 seconds, 1000 characters per second = 4408, file size per second = 404, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.64 seconds, 1000 characters per second = 19295, file size per second = 1768, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.58 seconds, 1000 characters per second = 4393, file size per second = 403, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.59 seconds, 1000 characters per second = 19659, file size per second = 1802, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.68 seconds, 1000 characters per second = 4354, file size per second = 399, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.97 seconds, 1000 characters per second = 17123, file size per second = 1569, character count=50859394, line count=194151 file size = 4661207

Averages
 gzip.open - 11.6 seconds
 zcat and pipe - 2.7 seconds
Cool, I’m doing some research into parsing syslog logs. Do you know of any syslog parsing module for Python, like Perl’s Parse::Syslog?
I’d:
– avoid reading the entire file into memory via subprocess.communicate() and cStringIO
– initialize cc, lc and sz outside of the inner loop
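The first suggestion can be sketched like this (my sketch, not code from the original post; count_lines_zcat is a hypothetical helper, and it assumes zcat is on the PATH): instead of buffering the whole decompressed output with communicate(), iterate over the pipe’s stdout directly, so only one line needs to be held in memory at a time.

```python
import subprocess

def count_lines_zcat(path):
    """Stream a gzipped file through zcat, counting lines without
    buffering the whole decompressed output in memory."""
    proc = subprocess.Popen(["zcat", path], stdout=subprocess.PIPE)
    lc = 0
    for line in proc.stdout:  # read the pipe incrementally, line by line
        lc += 1
    proc.stdout.close()
    assert proc.wait() == 0, "zcat failed"
    return lc
```

This keeps memory use constant regardless of the size of the decompressed file, which matters for the multi-gigabyte logs discussed below.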
“me”, thanks for pointing me to subprocess.communicate.
However, I just read the following:
http://docs.python.org/2/library/subprocess.html#subprocess.Popen.communicate
“The data read is buffered in memory, so do not use this method if the data size is large or unlimited.”
Instead I opted for this solution by Spencer Rathbun:
http://superuser.com/questions/381394/unix-split-a-huge-gz-file-by-line
Very interesting. How would this work for a remote .gz file? Would you call curl from Python?
Note: when you use zcat and a pipe (zcat file | python program.py, or subprocess), you get two independent processes, so two processor cores can be used if available, while a Python program using gzip.open() is bound to a single core.
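The zcat file | python program.py pattern mentioned above might look roughly like this (my sketch, not the original poster’s program; tally is a hypothetical helper): zcat decompresses in its own process, and the Python side just reads already-decompressed text from stdin.

```python
import sys

def tally(stream):
    """Count lines and characters in an already-decompressed text stream."""
    lc = cc = 0
    for line in stream:
        lc += 1
        cc += len(line)
    return lc, cc

if __name__ == "__main__":
    # run as: zcat access.log.gz | python program.py
    # decompression happens in zcat's process; Python only counts
    lc, cc = tally(sys.stdin)
    print("line count = %d, character count = %d" % (lc, cc))
```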
Of course, using two independent processes cannot explain a 3-4 fold time difference 😉