Linear Algebra Review and numpy

I’ve signed up for the Machine Learning course from Stanford. One of the first week’s subject areas is a Linear Algebra Review, and the recommended software is GNU Octave. However, I’d prefer to use numpy and Python. Here’s my notes so far.

After installing numpy, you can define a matrix in one of the following ways:

from numpy import matrix

A = matrix( ( (3,4),(2,16)))
# or
A = matrix("3 4; 2 16")
# or even
A = matrix( (3,4,2,16) )
A.resize( (2,2) )

You can produce the transpose of A, written AT with:

print("transpose of A = \n%s" % A.transpose() )

producing output:

transpose of A =
[[ 3  2]
 [ 4 16]]

or if you already have a numpy.array, you can create it from that. We’ll see later why you need to use numpy.matrix and not numpy.array:

from numpy import matrix, array

A2 = array( ( (3,4),(2,16)))
A = matrix(A2)

print("A[1,0]=%s" % A[1,0])

This produces

A[1,0]=2

Note that, when referring to individual elements of the matrix, numpy is 0-based, so A[1,0] is the 2nd row, 1st column.

Adding two matrices and multiplying a matrix by scalar is straightforward. Taking the inverse of a matrix is a little less obvious. The example uses matrix A as defined above:

from numpy import linalg
inverseOfA = linalg.inv(A)
print("inverseOfA =\n%s" % inverseOfA)
print("A * inverseOfA =\n%s" % (A * inverseOfA) )

shows the inverse, written A-1, and shows that multiplying A and A-1 produces the identity matrix:

inverseOfA =
[[ 0.4   -0.1  ]
 [-0.05   0.075]]
A * inverseOfA =
[[ 1.  0.]
 [ 0.  1.]]

Finally, matrix multiplication is the reason to use numpy.matrix, and numpy.array. Here’s an example:

arrayA = array( ( (3,4),(2,16)))
matrixA = matrix(arrayA)

print "Multiplication as array = \n%s" % (arrayA * arrayA)
print "Multiplication as matrix = \n%s" % (matrixA * matrixA)

outputs:

Multiplication as array =
[[  9  16]
 [  4 256]]
Multiplication as matrix =
[[ 17  76]
 [ 38 264]]

In the array multiplication, Resultij = Aij * Aij, which is not what is needed for matrix multiplication!

Posted in Python | 5 Comments

Tuple Trivia

A couple of minor points regarding tuples. When you take a slice of a tuple, there are 3 parameters:
– the start index
– the end index
– the step

aTuple = (1, 2, 3, 4, 5)
print (aTuple[1:3])
print (aTuple[0:7])
print(aTuple[3:1:-1])

which produces:

(2, 3)
(1, 2, 3, 4, 5)
(4, 3)

It doesn’t matter that, in the second instance, the end index is beyond the last index of the string. Also, the end index is excluded.

If you want to go to the start of the tuple, then you need to set the end index to -1:

print(aTuple[3:-1:-1])

But this doesn’t produce the expected result.

( )

This is because negative indexes are from the end of the tuple. So -1 is the last element in the tuple.

print(aTuple[2:-1])

produces:

(3, 4)

Again, this excludes the final index. To include either the start or the end of the tuple as the final element of the result, the end index should be left empty:

print(aTuple[2:])
print(aTuple[2::-1])

giving:

(3, 4, 5)
(3, 2, 1)

Finally, in most cases where you can use a tuple, you can use a list instead:

aList = list(aTuple)

print (aList[1:3])
print (aList[0:7])
print(aList[3:1:-1])
print(aList[3:-1:-1])
print(aList[2:-1])
print(aList[2:])
print(aList[2::-1])

producing similar results as before, but lists instead of tuples:

[2, 3]
[1, 2, 3, 4, 5]
[4, 3]
[]
[3, 4]
[3, 4, 5]
[3, 2, 1]

But this doesn’t work for the % operator:

print("aTuple[0]=%s, aTuple[1]=%s, aTuple[2]=%s, aTuple[3]=%s, aTuple[4]=%s" % aTuple)
print("aList[0]=%s, aList[1]=%s, aList[2]=%s, aList[3]=%s, aList[4]=%s" % aList)
aTuple[0]=1, aTuple[1]=2, aTuple[2]=3, aTuple[3]=4, aTuple[4]=5
Traceback (most recent call last):
  File "c:/prr/cgibin/data/prr/codebright/tuple.py", line 29, in <module>
    print("aList[0]=%s, aList[1]=%s, aList[2]=%s, aList[3]=%s, aList[4]=%s" % aList)
TypeError: not enough arguments for format string
Posted in Python, Software | Leave a comment

globals and __main__: a Python gotcha

One of the good features is the lack of gotchas. I define gotchas as traps for the unwary programmer, where something unexpected happens. Thankfully these are rare in Python. But the following is one that is worth knowing about. Consider two Python modules, named main_module.py and sub_module.py.

First main_module.py:

g_main_value = None

def get_main_value():
    return g_main_value


def main_test():
    print("sub_value = %s" % get_sub_value())


from sub_module import get_sub_value, sub_test, sub_init

def main_init():
    global g_main_value

    g_main_value = 23


if __name__ == "__main__":
    main_init()
    sub_init()

    main_test()
    sub_test()

    print("main_value (in main_module) = %s" % get_main_value() )

Then sub_module.py:

g_sub_value = None

def get_sub_value():
    return g_sub_value


def sub_test():
    print("main_value (in sub_module) = %s" % get_main_value())


def sub_init():
    global g_sub_value

    g_sub_value = 31


from main_module import get_main_value

The output is:

sub_value = 31
main_value (in sub_module) = None
main_value (in main_module) = 23

What’s happening? A global with 2 different values? Add the following code to the bottom of main_module:

def show_globals(module_name):
    mod = __import__(module_name)
    print(module_name)
    for k in dir(mod):
        if k.startswith("g_"):
            print("  %s: %s" % (k, getattr(mod, k)))
    print("")


if __name__ == "__main__":
    print("")
    show_globals("__main__")
    show_globals("main_module")
    show_globals("sub_module")

The output is:

__main__
  g_main_value: 23

main_module
  g_main_value: None

sub_module
  g_sub_value: 31

This gives us a clue as to what’s happening: the __main__ module may be loaded twice if used by another module. If you wanted to set a global in the __main__ module that is usable by other modules, then one way to do it is:

Add the following to main_module.py

mod = __import__("main_module")
setattr(mod, "g_main_value2", 34)

if __name__ == "__main__":
    print("")
    show_globals("__main__")
    show_globals("main_module")
    show_globals("sub_module")

And the output will be


__main__
  g_main_value: 23
  g_main_value2: None

main_module
  g_main_value: None
  g_main_value2: 34

sub_module
  g_sub_value: 31
Posted in Python, Software | Leave a comment

Edge case for split / explode

During recent coding, using PHP, I found that the explode function behaves in a non-ideal way (for me) when trying to split an empty string:

<?php

$res = explode(",", "");
print_r($res);

?>

produces:

Array
(
    [0] =>
)

I was expecting (needing?) an empty array. After thinking about it a bit more, it is clear that returning a single empty string is a logical return value. However, different languages have different returns for this special case. Python (2.6.4 and 3.1.2) are both like PHP (5.3.5), but Perl (5.10.1) behaves differently:

For PHP:

<?php
function test_explode($txt)
{
    print("Exploding '$txt' to give [" );
    $sep = "";
    $res = explode(",", $txt);
    foreach ($res as $elt)
    {
        print "$sep'$elt'";
        $sep = ", ";
    }
    print "]\n";

}

test_explode("a,b");
test_explode(",");
test_explode("");

?>

produces:

Exploding 'a,b' to give ['a', 'b']
Exploding ',' to give ['', '']
Exploding '' to give ['']

For Python:

def test_split(txt):
    print("Splitting '%s' to give %s" % (txt, txt.split(",")))

test_split("a,b")
test_split(",")
test_split("")

produces:

Splitting 'a,b' to give ['a', 'b']
Splitting ',' to give ['', '']
Splitting '' to give ['']

For Perl:


sub test_split
{
    my $txt = shift;
    my @res = split(",", $txt);
    print("Splitting '$txt' to give [");
    my $sep = "";
    foreach my $elt (@res)
    {
        print "$sep'$elt'";
        $sep = ", ";
    }
    print "]\n";
}


test_split("a,b");
test_split(",");
test_split("");

produces:

Splitting 'a,b' to give ['a', 'b']
Splitting ',' to give []
Splitting '' to give []
Posted in Python, Software | Leave a comment

Reading gzip files in Python – fast!

I have been doing lots of parsing of Apache log files recently with Python. Speed is of the essence, so I was interested to know if the speed issues mentioned in http://bugs.python.org/issue7471 are still relevant. So I set out to time it, under Python (v2.6.4 and v3.1.2, both on Fedora 13).

My test data consists of 950,000 lines in 5 gzipped Apache log files. I compared the times using the Python gzip module, and the zcat method suggested in http://bugs.python.org/issue7471.

A quick summary of results: yes, the issue is still relevant. The average times are:

For Python 2.6.4

  • gzip.open – 9.5 seconds
  • zcat and pipe – 2.3 seconds

For Python 3.1.2

  • gzip.open – 11.6 seconds
  • zcat and pipe – 2.7 seconds

Below are the details. First, the source code to time the tests:

import os
import sys

if sys.version.startswith("3"):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO

import gzip
import subprocess
import time

dirname = "test"
fl = os.listdir(dirname)

fl.sort()

ttime = [0, 0]
runs = 5

for i in range(2 * runs):
    st = time.time()

    for fn in fl:
        if not fn.endswith(".gz"):
            continue

        cc = 0
        lc = 0
        sz = 0

        fullfn = os.path.join(dirname, fn)
        sz += os.stat(fullfn)[6]
        if i % 2 == 0:
            fh = gzip.GzipFile(fullfn, "r")
        else:
            p = subprocess.Popen(["zcat", fullfn], stdout = subprocess.PIPE)
            fh = io_method(p.communicate()[0])
            assert p.returncode == 0

        for line in fh:
            lc += 1
            cc += len(line)

    et = time.time()
    dt = et - st

    ttime[i % 2] += dt
    print("time-taken = %0.2f seconds, 1000 characters per second = %0.0f, file size per second = %0.0f, character count=%s, line count=%s file size = %s" % (dt, 0.001 * cc / dt, 0.001 * sz / dt, cc, lc, sz))

print("\nAverages")
print("  gzip.open - %0.1f seconds" % (ttime[0] / runs))
print("  zcat and pipe - %0.1f seconds" % (ttime[1] / runs))

Timings for Python 2.6.4:

time-taken = 9.71 seconds, 1000 characters per second = 5237, file size per second = 480, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.28 seconds, 1000 characters per second = 22300, file size per second = 2044, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.48 seconds, 1000 characters per second = 5363, file size per second = 492, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.24 seconds, 1000 characters per second = 22750, file size per second = 2085, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.44 seconds, 1000 characters per second = 5389, file size per second = 494, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.30 seconds, 1000 characters per second = 22156, file size per second = 2031, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.42 seconds, 1000 characters per second = 5397, file size per second = 495, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.26 seconds, 1000 characters per second = 22516, file size per second = 2064, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.61 seconds, 1000 characters per second = 5293, file size per second = 485, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.31 seconds, 1000 characters per second = 22033, file size per second = 2019, character count=50859394, line count=194151 file size = 4661207

Averages
  gzip.open - 9.5 seconds
  zcat and pipe - 2.3 seconds

Timings for Python 3.1.2:

time-taken = 11.65 seconds, 1000 characters per second = 4364, file size per second = 400, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.87 seconds, 1000 characters per second = 17691, file size per second = 1621, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.66 seconds, 1000 characters per second = 4362, file size per second = 400, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.56 seconds, 1000 characters per second = 19834, file size per second = 1818, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.54 seconds, 1000 characters per second = 4408, file size per second = 404, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.64 seconds, 1000 characters per second = 19295, file size per second = 1768, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.58 seconds, 1000 characters per second = 4393, file size per second = 403, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.59 seconds, 1000 characters per second = 19659, file size per second = 1802, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.68 seconds, 1000 characters per second = 4354, file size per second = 399, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.97 seconds, 1000 characters per second = 17123, file size per second = 1569, character count=50859394, line count=194151 file size = 4661207

Averages
  gzip.open - 11.6 seconds
  zcat and pipe - 2.7 seconds
Posted in Python, Software | 7 Comments

Python Standard Library

There are some gems in the Python standard library which, once found, get regular use – at least by me. Here are some of them.

1. enumerate

A built-in function, which allows you to replace code like this:

day_list = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
for day_index in range(len(day_list)):
    day_name = day_list[day_index]
    print(day_index, day_name)

with the more succinct:

day_list = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
for day_index, day_name in enumerate(day_list):
    print(day_index, day_name)

 

2. gzip.GzipFile

I used this whilst processing Apache log files last week. It offers compression to less than 10% of the original uncompressed size, but seems to take the same length of time to read/write, implying a good trade-off between processor use and hard-disk activity. Having changed the open code from

fh = open(<logfile>, "r")

to:

import gzip
...
fh = gzip.GzipFile(<logfile>, "r")

the rest of the code runs unchanged.

Note: if you are running on Linux, then the issue of speed, documented at http://bugs.python.org/issue7471, is relevant, and the zcat approach is faster.
 

3. collections.defaultdict

This provides a dictionary, which initialises a key on first read using a method.

For instance:

import collections
...
zerodict = collections.defaultdict(int)
print(zerodict["key"], zerodict[4])

produces:

0 0
Posted in Python, Software | Leave a comment

Python regular expression surprise

I came across this regular expression anomaly recently in Python. The regular expression was more complex, but this example illustrates the idea.

Try and predict the outcome (should work with Python 2.5 – Python 3.1)

import re

abcd = "abcd"
ad = "ad"

cre = re.compile(r"^(a)(bc)?(d)$")

mos1 = cre.match(abcd)
print(cre.sub(r"\1, \2, \3", abcd))

mos2 = cre.match(ad)
print(cre.sub(r"\1, \2, \3", ad))

This produces:

a, bc, d
Traceback (most recent call last): 
  File "<filename>", line <linenumber>, in <module>
    print(cre.sub(r"\1, \2, \3", ad))
  File "C:\programs\Python25\lib\re.py", line 266, in filter
    return sre_parse.expand_template(template, match)
  File "C:\programs\Python25\lib\sre_parse.py", line 793, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group

(Line numbers vary with different versions of Python – this is Python 2.5)

Adding the following code illustrates the problem:

print(mos1.group(1), mos1.group(2), mos1.group(3) )
print(mos2.group(1), mos2.group(2), mos2.group(3) )

For Python 2.5, this outputs

('a', 'bc', 'd')
('a', None, 'd')

So now it is clear why it is failing, what is the solution. First fix was:

cre2 = re.compile(r"^(a)((bc)?)(d)$")
print(cre2.sub(r"\1, \2, \4", ad))

This fix works, but it has side-effects – the last group number has changed. So a better solution might be:

cre3 = re.compile(r"^(a)((?:bc)?)(d)$")
print(cre3.sub(r"\1, \2, \3", ad))

Using the non-capturing group notation, (?:…) the problem is fixed without side-effects.

I’m not entirely happy that the problem occurs in the first place, but it does.

Posted in Python, Software | 1 Comment