Linear Algebra Review and numpy

I’ve signed up for the Machine Learning course from Stanford. One of the first week’s subject areas is a Linear Algebra Review, and the recommended software is GNU Octave. However, I’d prefer to use numpy and Python. Here’s my notes so far.

After installing numpy, you can define a matrix in one of the following ways:

from numpy import matrix

A = matrix( ( (3,4),(2,16)))
# or
A = matrix("3 4; 2 16")
# or even
A = matrix( (3,4,2,16) )
A.resize( (2,2) )

You can produce the transpose of A, written AT with:

print("transpose of A = \n%s" % A.transpose() )

producing output:

transpose of A =
[[ 3  2]
 [ 4 16]]

or if you already have a numpy.array, you can create it from that. We’ll see later why you need to use numpy.matrix and not numpy.array:

from numpy import matrix, array

A2 = array( ( (3,4),(2,16)))
A = matrix(A2)

print("A[1,0]=%s" % A[1,0])

This produces

A[1,0]=2

Note that, when referring to individual elements of the matrix, numpy is 0-based, so A[1,0] is the 2nd row, 1st column.

Adding two matrices and multiplying a matrix by scalar is straightforward. Taking the inverse of a matrix is a little less obvious. The example uses matrix A as defined above:

from numpy import linalg
inverseOfA = linalg.inv(A)
print("inverseOfA =\n%s" % inverseOfA)
print("A * inverseOfA =\n%s" % (A * inverseOfA) )

shows the inverse, written A-1, and shows that multiplying A and A-1 produces the identity matrix:

inverseOfA =
[[ 0.4   -0.1  ]
 [-0.05   0.075]]
A * inverseOfA =
[[ 1.  0.]
 [ 0.  1.]]

Finally, matrix multiplication is the reason to use numpy.matrix, and numpy.array. Here’s an example:

arrayA = array( ( (3,4),(2,16)))
matrixA = matrix(arrayA)

print "Multiplication as array = \n%s" % (arrayA * arrayA)
print "Multiplication as matrix = \n%s" % (matrixA * matrixA)

outputs:

Multiplication as array =
[[  9  16]
 [  4 256]]
Multiplication as matrix =
[[ 17  76]
 [ 38 264]]

In the array multiplication, Resultij = Aij * Aij, which is not what is needed for matrix multiplication!

Posted in Python | 5 Comments

Tuple Trivia

A couple of minor points regarding tuples. When you take a slice of a tuple, there are 3 parameters:
– the start index
– the end index
– the step

aTuple = (1, 2, 3, 4, 5)
print (aTuple[1:3])
print (aTuple[0:7])
print(aTuple[3:1:-1])

which produces:

(2, 3)
(1, 2, 3, 4, 5)
(4, 3)

It doesn’t matter that, in the second instance, the end index is beyond the last index of the string. Also, the end index is excluded.

If you want to go to the start of the tuple, then you need to set the end index to -1:

print(aTuple[3:-1:-1])

But this doesn’t produce the expected result.

( )

This is because negative indexes are from the end of the tuple. So -1 is the last element in the tuple.

print(aTuple[2:-1])

produces:

(3, 4)

Again, this excludes the final index. To include either the start or the end of the tuple as the final element of the result, the end index should be left empty:

print(aTuple[2:])
print(aTuple[2::-1])

giving:

(3, 4, 5)
(3, 2, 1)

Finally, in most cases where you can use a tuple, you can use a list instead:

aList = list(aTuple)

print (aList[1:3])
print (aList[0:7])
print(aList[3:1:-1])
print(aList[3:-1:-1])
print(aList[2:-1])
print(aList[2:])
print(aList[2::-1])

producing similar results as before, but lists instead of tuples:

[2, 3]
[1, 2, 3, 4, 5]
[4, 3]
[]
[3, 4]
[3, 4, 5]
[3, 2, 1]

But this doesn’t work for the % operator:

print("aTuple[0]=%s, aTuple[1]=%s, aTuple[2]=%s, aTuple[3]=%s, aTuple[4]=%s" % aTuple)
print("aList[0]=%s, aList[1]=%s, aList[2]=%s, aList[3]=%s, aList[4]=%s" % aList)
aTuple[0]=1, aTuple[1]=2, aTuple[2]=3, aTuple[3]=4, aTuple[4]=5
Traceback (most recent call last):
  File "c:/prr/cgibin/data/prr/codebright/tuple.py", line 29, in <module>
    print("aList[0]=%s, aList[1]=%s, aList[2]=%s, aList[3]=%s, aList[4]=%s" % aList)
TypeError: not enough arguments for format string
Posted in Python, Software | Leave a comment

globals and __main__: a Python gotcha

One of the good features is the lack of gotchas. I define gotchas as traps for the unwary programmer, where something unexpected happens. Thankfully these are rare in Python. But the following is one that is worth knowing about. Consider two Python modules, named main_module.py and sub_module.py.

First main_module.py:

g_main_value = None

def get_main_value():
    return g_main_value


def main_test():
    print("sub_value = %s" % get_sub_value())


from sub_module import get_sub_value, sub_test, sub_init

def main_init():
    global g_main_value

    g_main_value = 23


if __name__ == "__main__":
    main_init()
    sub_init()

    main_test()
    sub_test()

    print("main_value (in main_module) = %s" % get_main_value() )

Then sub_module.py:

g_sub_value = None

def get_sub_value():
    return g_sub_value


def sub_test():
    print("main_value (in sub_module) = %s" % get_main_value())


def sub_init():
    global g_sub_value

    g_sub_value = 31


from main_module import get_main_value

The output is:

sub_value = 31
main_value (in sub_module) = None
main_value (in main_module) = 23

What’s happening? A global with 2 different values? Add the following code to the bottom of main_module:

def show_globals(module_name):
    mod = __import__(module_name)
    print(module_name)
    for k in dir(mod):
        if k.startswith("g_"):
            print("  %s: %s" % (k, getattr(mod, k)))
    print("")


if __name__ == "__main__":
    print("")
    show_globals("__main__")
    show_globals("main_module")
    show_globals("sub_module")

The output is:

__main__
  g_main_value: 23

main_module
  g_main_value: None

sub_module
  g_sub_value: 31

This gives us a clue as to what’s happening: the __main__ module may be loaded twice if used by another module. If you wanted to set a global in the __main__ module that is usable by other modules, then one way to do it is:

Add the following to main_module.py

mod = __import__("main_module")
setattr(mod, "g_main_value2", 34)

if __name__ == "__main__":
    print("")
    show_globals("__main__")
    show_globals("main_module")
    show_globals("sub_module")

And the output will be


__main__
  g_main_value: 23
  g_main_value2: None

main_module
  g_main_value: None
  g_main_value2: 34

sub_module
  g_sub_value: 31
Posted in Python, Software | Leave a comment

Edge case for split / explode

During recent coding, using PHP, I found that the explode function behaves in a non-ideal way (for me) when trying to split an empty string:

<?php

$res = explode(",", "");
print_r($res);

?>

produces:

Array
(
    [0] =>
)

I was expecting (needing?) an empty array. After thinking about it a bit more, it is clear that returning a single empty string is a logical return value. However, different languages have different returns for this special case. Python (2.6.4 and 3.1.2) are both like PHP (5.3.5), but Perl (5.10.1) behaves differently:

For PHP:

<?php
function test_explode($txt)
{
    print("Exploding '$txt' to give [" );
    $sep = "";
    $res = explode(",", $txt);
    foreach ($res as $elt)
    {
        print "$sep'$elt'";
        $sep = ", ";
    }
    print "]\n";

}

test_explode("a,b");
test_explode(",");
test_explode("");

?>

produces:

Exploding 'a,b' to give ['a', 'b']
Exploding ',' to give ['', '']
Exploding '' to give ['']

For Python:

def test_split(txt):
    print("Splitting '%s' to give %s" % (txt, txt.split(",")))

test_split("a,b")
test_split(",")
test_split("")

produces:

Splitting 'a,b' to give ['a', 'b']
Splitting ',' to give ['', '']
Splitting '' to give ['']

For Perl:


sub test_split
{
    my $txt = shift;
    my @res = split(",", $txt);
    print("Splitting '$txt' to give [");
    my $sep = "";
    foreach my $elt (@res)
    {
        print "$sep'$elt'";
        $sep = ", ";
    }
    print "]\n";
}


test_split("a,b");
test_split(",");
test_split("");

produces:

Splitting 'a,b' to give ['a', 'b']
Splitting ',' to give []
Splitting '' to give []
Posted in Python, Software | Leave a comment

Reading gzip files in Python – fast!

I have been doing lots of parsing of Apache log files recently with Python. Speed is of the essence, so I was interested to know if the speed issues mentioned in http://bugs.python.org/issue7471 are still relevant. So I set out to time it, under Python (v2.6.4 and v3.1.2, both on Fedora 13).

My test data consists of 950,000 lines in 5 gzipped Apache log files. I compared the times using the Python gzip module, and the zcat method suggested in http://bugs.python.org/issue7471.

A quick summary of results: yes, the issue is still relevant. The average times are:

For Python 2.6.4

  • gzip.open – 9.5 seconds
  • zcat and pipe – 2.3 seconds

For Python 3.1.2

  • gzip.open – 11.6 seconds
  • zcat and pipe – 2.7 seconds

Below are the details. First, the source code to time the tests:

import os
import sys

if sys.version.startswith("3"):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO

import gzip
import subprocess
import time

dirname = "test"
fl = os.listdir(dirname)

fl.sort()

ttime = [0, 0]
runs = 5

for i in range(2 * runs):
    st = time.time()

    for fn in fl:
        if not fn.endswith(".gz"):
            continue

        cc = 0
        lc = 0
        sz = 0

        fullfn = os.path.join(dirname, fn)
        sz += os.stat(fullfn)[6]
        if i % 2 == 0:
            fh = gzip.GzipFile(fullfn, "r")
        else:
            p = subprocess.Popen(["zcat", fullfn], stdout = subprocess.PIPE)
            fh = io_method(p.communicate()[0])
            assert p.returncode == 0

        for line in fh:
            lc += 1
            cc += len(line)

    et = time.time()
    dt = et - st

    ttime[i % 2] += dt
    print("time-taken = %0.2f seconds, 1000 characters per second = %0.0f, file size per second = %0.0f, character count=%s, line count=%s file size = %s" % (dt, 0.001 * cc / dt, 0.001 * sz / dt, cc, lc, sz))

print("\nAverages")
print("  gzip.open - %0.1f seconds" % (ttime[0] / runs))
print("  zcat and pipe - %0.1f seconds" % (ttime[1] / runs))

Timings for Python 2.6.4:

time-taken = 9.71 seconds, 1000 characters per second = 5237, file size per second = 480, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.28 seconds, 1000 characters per second = 22300, file size per second = 2044, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.48 seconds, 1000 characters per second = 5363, file size per second = 492, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.24 seconds, 1000 characters per second = 22750, file size per second = 2085, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.44 seconds, 1000 characters per second = 5389, file size per second = 494, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.30 seconds, 1000 characters per second = 22156, file size per second = 2031, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.42 seconds, 1000 characters per second = 5397, file size per second = 495, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.26 seconds, 1000 characters per second = 22516, file size per second = 2064, character count=50859394, line count=194151 file size = 4661207
time-taken = 9.61 seconds, 1000 characters per second = 5293, file size per second = 485, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.31 seconds, 1000 characters per second = 22033, file size per second = 2019, character count=50859394, line count=194151 file size = 4661207

Averages
  gzip.open - 9.5 seconds
  zcat and pipe - 2.3 seconds

Timings for Python 3.1.2:

time-taken = 11.65 seconds, 1000 characters per second = 4364, file size per second = 400, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.87 seconds, 1000 characters per second = 17691, file size per second = 1621, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.66 seconds, 1000 characters per second = 4362, file size per second = 400, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.56 seconds, 1000 characters per second = 19834, file size per second = 1818, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.54 seconds, 1000 characters per second = 4408, file size per second = 404, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.64 seconds, 1000 characters per second = 19295, file size per second = 1768, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.58 seconds, 1000 characters per second = 4393, file size per second = 403, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.59 seconds, 1000 characters per second = 19659, file size per second = 1802, character count=50859394, line count=194151 file size = 4661207
time-taken = 11.68 seconds, 1000 characters per second = 4354, file size per second = 399, character count=50859394, line count=194151 file size = 4661207
time-taken = 2.97 seconds, 1000 characters per second = 17123, file size per second = 1569, character count=50859394, line count=194151 file size = 4661207

Averages
  gzip.open - 11.6 seconds
  zcat and pipe - 2.7 seconds
Posted in Python, Software | 7 Comments

Python Standard Library

There are some gems in the Python standard library which, once found, get regular use – at least by me. Here are some of them.

1. enumerate

A built-in function, which allows you to replace code like this:

day_list = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
for day_index in range(len(day_list)):
    day_name = day_list[day_index]
    print(day_index, day_name)

with the more succinct:

day_list = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
for day_index, day_name in enumerate(day_list):
    print(day_index, day_name)

 

2. gzip.GzipFile

I used this whilst processing Apache log files last week. It offers compression to less than 10% of the original uncompressed size, but seems to take the same length of time to read/write, implying a good trade-off between processor use and hard-disk activity. Having changed the open code from

fh = open(<logfile>, "r")

to:

import gzip
...
fh = gzip.GzipFile(<logfile>, "r")

the rest of the code runs unchanged.

Note: if you are running on Linux, then the issue of speed, documented at http://bugs.python.org/issue7471, is relevant, and the zcat approach is faster.
 

3. collections.defaultdict

This provides a dictionary, which initialises a key on first read using a method.

For instance:

import collections
...
zerodict = collections.defaultdict(int)
print(zerodict["key"], zerodict[4])

produces:

0 0
Posted in Python, Software | Leave a comment

Python regular expression surprise

I came across this regular expression anomaly recently in Python. The regular expression was more complex, but this example illustrates the idea.

Try and predict the outcome (should work with Python 2.5 – Python 3.1)

import re

abcd = "abcd"
ad = "ad"

cre = re.compile(r"^(a)(bc)?(d)$")

mos1 = cre.match(abcd)
print(cre.sub(r"\1, \2, \3", abcd))

mos2 = cre.match(ad)
print(cre.sub(r"\1, \2, \3", ad))

This produces:

a, bc, d
Traceback (most recent call last): 
  File "<filename>", line <linenumber>, in <module>
    print(cre.sub(r"\1, \2, \3", ad))
  File "C:\programs\Python25\lib\re.py", line 266, in filter
    return sre_parse.expand_template(template, match)
  File "C:\programs\Python25\lib\sre_parse.py", line 793, in expand_template
    raise error, "unmatched group"
sre_constants.error: unmatched group

(Line numbers vary with different versions of Python – this is Python 2.5)

Adding the following code illustrates the problem:

print(mos1.group(1), mos1.group(2), mos1.group(3) )
print(mos2.group(1), mos2.group(2), mos2.group(3) )

For Python 2.5, this outputs

('a', 'bc', 'd')
('a', None, 'd')

So now it is clear why it is failing, what is the solution. First fix was:

cre2 = re.compile(r"^(a)((bc)?)(d)$")
print(cre2.sub(r"\1, \2, \4", ad))

This fix works, but it has side-effects – the last group number has changed. So a better solution might be:

cre3 = re.compile(r"^(a)((?:bc)?)(d)$")
print(cre3.sub(r"\1, \2, \3", ad))

Using the non-capturing group notation, (?:…) the problem is fixed without side-effects.

I’m not entirely happy that the problem occurs in the first place, but it does.

Posted in Python, Software | 1 Comment

PHP and Python

My new job involves a lot more PHP, which seems closely related to Perl. There are a number of observations that I would make in the comparison between PHP and Python.

1. PHP is more verbose. In particular, PHP requires braces around code blocks, semi-colons, $ on variable names. Python requires only the colon at the end of certain lines.

2. Both have good libraries. However, the ability of Python to easily interface to shared libraries written in any language, and not specifically written for Python, is a big advantage, in my opinion.

3. PHP has more unexpected gotchas. I hadn’t noticed this when writing in Python (the absence of a problem is less noticeable than the presence of the same problem). One example I came across today:

is_file() tests for the existence of a file, as you would expect by the name. But it is cached. So if you are waiting for a file to disappear, and you repeatedly call is_file to test this, then you will wait for a long time. You need to call clearstatcache() between each call to get an honest answer.

Of course, the documentation states this. But it also states that file_exists (an equivalent function) is also cached. But it appears not to be (PHP v5.3.3):

<?php

print "Waiting for test.txt to appear\n";
while (!is_file("test.txt"))
{
}

print "Waiting for test.txt to disappear\n";
while (True)
{
    sleep(.2);
    if (!is_file("test.txt"))
    {
        print "is_file detected disappearance\n";
        exit(0);
    }
    elseif (!file_exists("test.txt"))
    {
        print "file_exists detected disappearance\n";
        exit(0);
    }
}

?>

Running this code (and creating, then deleting file test.txt) invariably results in the detection by file_exists, even though I have biased it to let is_file test it first.

So ideally you need accurate documentation. But I think even better is to write in a way that requires minimal documentation.

Posted in Python, Uncategorized | Leave a comment

SendKeys in Linux

SendKeys is a useful function in the Windows API which enables a program to send keystrokes. I was looking for an equivalent in Linux, first in Ubuntu at home, and then in Fedora 13 at work.

For Ubuntu (Hardy Heron, 8.04), I found xautomation, which worked fine. For many (but not all) of the non alpha-numeric keystrokes, you need to use the name of the key – there is a helpful list here.

I tried xautomation with Fedora 13, but it was not happy. So I had a look around for an alternative. One possibility (I found them hard to find) was xdotool. Of course, the syntax for sending keystrokes is slightly different! It is slightly more consistent in that non-alphanumerics all need to be sent using the name of the key, so the helpful list (above) is still helpful!

I’ll try to post some examples at a later date.

Posted in Software | 1 Comment

Strange behaviour using ctypes in IronPython

I came across some slightly unexpected behaviour with ctypes in IronPython. Maybe this is the wrong way to look at it – I should be amazed that ctypes works at all in IronPython. But I thought I would document it here.

This deals with DLLs, and so is Windows-centric. I need to find out how / if ctypes works on Linux.

Start with a trivial DLL. It contains a single function, nAdd, which takes an array of integers, and a length of the array, and returns the sum of the array elements:

#ifdef __cplusplus
extern &quot;C&quot;
{
#endif


int nAdd(int * pnData, int nDataLen)
{
    int nTotal = 0;
    for (int i = 0; i &lt; nDataLen; ++i)
    {
        nTotal = nTotal + pnData[i];
    }

    return nTotal;
}


#ifdef __cplusplus
}
#endif

and build it using MSYS / MinGW:

gcc -c strange.cpp
gcc -fPIC -shared -o strange.dll strange.o

Now lets call the function in the DLL, using ctypes:

import ctypes
import sys

# define an array of 3 integers 
array3type = ctypes.c_int * 3
array3 = array3type()
array3[0] = 4
array3[1] = 5
array3[2] = 21

# define an array of 2 integers 
array2type = ctypes.c_int * 2
array2 = array2type()
array2[0] = 65
array2[1] = 34


def call3then2_noTypes():
    print(&quot;call3then2_noTypes&quot;)
    dll = ctypes.cdll.LoadLibrary(&quot;strange.dll&quot;)

    dll.nAdd.restype = ctypes.c_int
    #dll.nAdd.argtypes = (ctypes.POINTER(ctypes.c_int), ctypes.c_int)

    try:
        print(&quot;  %d&quot; % dll.nAdd(array3, 3))
        print(&quot;  %d&quot; % dll.nAdd(array2, 2))
    except Exception:
        e = sys.exc_info()[1]
        print(&quot;  %s&quot; % e)
    print(&quot;&quot;)


def call2then3_noTypes():
    print(&quot;call2then3_noTypes&quot;)
    dll = ctypes.cdll.LoadLibrary(&quot;strange.dll&quot;)

    #dll.nAdd.restype = ctypes.c_int
    #dll.nAdd.argtypes = (ctypes.POINTER(ctypes.c_int), ctypes.c_int)

    try:
        print(&quot;  %d&quot; % dll.nAdd(array2, 2))
        print(&quot;  %d&quot; % dll.nAdd(array3, 3))
    except Exception:
        e = sys.exc_info()[1]
        print(&quot;  %s&quot; % e)
    print(&quot;&quot;)
    
    
call3then2_noTypes()
call2then3_noTypes()

For Python 2.5+ and Python 3.x, this correctly prints the following:

call3then2_noTypes
  30
  99

call2then3_noTypes
  99
  30

So the question is, what will it do in IronPython 2.6?

The answer is unexpected (at least, to me):

call3then2_noTypes
  30
  expected c_long_Array_3, got c_long_Array_2

call2then3_noTypes
  99
  expected c_long_Array_2, got c_long_Array_3

It seems to be trying to learn the signature of the function from the first call, but unfortunately, in this case, getting it wrong. Note that the first call to the function was successful in each case – it is the second call that is failing.

Two of the following three methods work for Python 2.5+, Python 3.x and IronPython 2.6:

import ctypes
import sys

# define an array of 3 integers 
array3type = ctypes.c_int * 3
array3 = array3type()
array3[0] = 4
array3[1] = 5
array3[2] = 21

# define an array of 2 integers 
array2type = ctypes.c_int * 2
array2 = array2type()
array2[0] = 65
array2[1] = 34

def call():
    print(&quot;call&quot;)
    dll = ctypes.cdll.LoadLibrary(&quot;strange.dll&quot;)

    dll.nAdd.restype = ctypes.c_int
    dll.nAdd.argtypes = (ctypes.POINTER(ctypes.c_int), ctypes.c_int)

    try:
        print(&quot;  %d&quot; % dll.nAdd(array3, 3))
        print(&quot;  %d&quot; % dll.nAdd(array2, 2))
    except Exception:
        e = sys.exc_info()[1]
        print(&quot;  %s&quot; % e)
    print(&quot;&quot;)


def call_voidPointer():
    print(&quot;call_voidPointer&quot;)
    dll = ctypes.cdll.LoadLibrary(&quot;strange.dll&quot;)

    dll.nAdd.restype = ctypes.c_int
    dll.nAdd.argtypes = (ctypes.c_void_p, ctypes.c_int)

    print(&quot;  %d&quot; % dll.nAdd(array3, 3))
    print(&quot;  %d&quot; % dll.nAdd(array2, 2))
    print(&quot;&quot;)


def call_voidPointer_addressOf():
    print(&quot;call_voidPointer_addressOf&quot;)
    dll = ctypes.cdll.LoadLibrary(&quot;strange.dll&quot;)

    dll.nAdd.restype = ctypes.c_int
    dll.nAdd.argtypes = (ctypes.c_void_p, ctypes.c_int)

    print(&quot;  %d&quot; % dll.nAdd(ctypes.addressof(array3), 3))
    print(&quot;  %d&quot; % dll.nAdd(ctypes.addressof(array2), 2))
    print(&quot;&quot;)


call()
call_voidPointer()
call_voidPointer_addressOf()

Extra hint: the output for IronPython 2.6 is as follows:

call3then2
  expected c_long, got c_long_Array_3

call_voidPointer
  30
  99

call_voidPointer_addressOf
  30
  99
Posted in Python, Uncategorized | Leave a comment