Why is consistency so difficult to achieve?

    Python 2.5.2 (r252:60911, Oct  5 2008, 19:29:17)
    [GCC 4.3.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.

    >>> def f(*args):
    ...   print ' '.join([str(a) for a in args])
    ...
    >>> f(1,2,3)
    1 2 3
    >>> f(1,*(2,3))
    1 2 3
    >>> f(*(1,2),3)
      File "<stdin>", line 1
        f(*(1,2),3)
                 ^
    SyntaxError: invalid syntax

Sigh.
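
For what it's worth, the rejected call can be rewritten so that only a single unpacking is involved. The snippet below is a minimal sketch, written for Python 3 (print is a function there, and f returns the string instead of printing it so that the result can be checked). Interpreters from Python 3.5 onwards also accept the f(*(1,2), 3) form directly.

```python
from itertools import chain


def f(*args):
    """Join all positional arguments with single spaces."""
    return ' '.join(str(a) for a in args)


# Concatenate the pieces into one tuple and unpack it once.
print(f(*((1, 2) + (3,))))

# Or chain arbitrary iterables together and unpack the result.
print(f(*chain((1, 2), [3])))
```

Both calls pass the same three arguments as f(1, 2, 3).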

Minor scriptutil enhancements

I have cleaned up the documentation for the scriptutil module, which is now available on the web. If you happen to run Ubuntu you can also install it as a package straight from my PPA.

Please have a look at this tutorial in case you’re interested in scriptutil usage examples.

Enjoy!

Text filtering with erlang

Introduction

After a long break I picked up the Erlang book again and my appetite for writing some erlang code was soon kindled.

A small Python component I produced at work seemed like a good candidate for my (sequential) erlang exercises. It is a fairly simple component that removes user/password data embedded in URLs.

Just so you know where I am coming from:

  • my main/favourite programming language is Python
  • my exercises are mainly about sequential, non-distributed and non-telecoms-related problems whereas erlang’s main strength and appeal lies in the area of parallel/distributed telecoms/networking systems
  • I have played with erlang a little bit before (ring benchmark, XML parsing) and liked it in general, although IMHO it is severely lacking when it comes to the availability and quality of standard library components.

Now that my particular set of preconceptions is clear and in the open, let’s look at the stuff below 🙂

File processing with erlang’s regexp module

The initial implementation of the URL filter in erlang used its regexp library.

    -module(regex).
    -export([main/1]).

    isalphanum(C) when C > 47, C < 58; C > 64, C < 91; C > 96, C < 123 -> true;
    isalphanum(_) -> false.

    %% Generate a temporary file name of length N
    genname(0, L) -> L;
    genname(N, L) ->
        R = random:uniform(123),
        case isalphanum(R) of
            true -> genname(N-1, [R|L]);
            false -> genname(N, L)
        end.

    %% Returns a randomly generated temporary file path where the basename is
    %% of length N
    mktemppath(Prefix, N) -> Prefix ++ "/" ++ genname(N, []).

Please note that the code above implements functionality (generating a temporary file name) that is absent from the standard library.

    %% Removes passwords embedded in URLs from a log file.
    scrub_file(Tmpdir, F) ->
        %% make a temporary directory if it does not exist yet.
        case file:make_dir(Tmpdir) of
            ok -> ok;
            {error,eexist} -> ok;
            _ -> exit({error, failed_to_make_tmpdir})
        end,

        %% Move the original file out of the way.
        T = mktemppath(Tmpdir, 16),
        case file:rename(F, T) of
            ok -> ok;
            _ -> exit({error, failed_to_move_file})
        end,

        %% Now open it for reading (crash early if that fails).
        {ok, In} = file:open(T, [read]),
        %% Open the original path for writing.
        {ok, Out} = file:open(F, [write]),

        %% Call the function that will scrub the lines.
        scrub_lines(In, Out),

        %% Close the file handles and return the path the original file
        %% was moved to.
        file:close(Out),
        file:close(In),
        T.

The code that scrubs the URLs is below; the scrub_lines() function is tail recursive.

    %% This is where the log file is actually read linewise and where
    %% the scrubbing function is invoked for lines that contain URLs.
    scrub_lines(In, Out) ->
        L = io:get_line(In, ''),
        case L of
            eof -> ok;
            _ ->
                %% Does the line contain URLs?
                case string:str(L, "://") of
                    0 -> io:format(Out, "~s", [L]);
                    _ ->
                        case regexp:gsub(L, "://[^:]+:[^@]+@", "://") of
                            {ok, S, _} -> io:format(Out, "~s", [S]);
                            %% gsub reports failures as a 2-tuple.
                            {error, E} -> io:format("Failed: ~p~n", [E])
                        end
                end,
                %% Continue with next line.
                scrub_lines(In, Out)
        end.

    %% Main function.
    main([A]) ->
        {A1,A2,A3} = now(),
        random:seed(A1, A2, A3),

        %% A single argument (the name of the file to be scrubbed) is expected.
        F = atom_to_list(A),
        T = scrub_file("tmp", F),

        %% The scrubbed file content will be written to a new file that's
        %% in the place of the original file. Where was the latter moved to?
        io:format("~s~n", [T]),

        init:stop().
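
For comparison, the substitution at the heart of the filter is a one-liner in Python. The sketch below is not the component from work (that code is not shown here); it merely reuses the pattern from the erlang code above, and the names scrub_line and CREDENTIALS are made up for illustration. It is written for Python 3.

```python
import re

# Same pattern as in the erlang version:
# "://", a user name, ":", a password and a trailing "@".
CREDENTIALS = re.compile(r"://[^:]+:[^@]+@")


def scrub_line(line):
    """Return the line with any user:password@ part removed from URLs."""
    # Cheap substring test first, mirroring the string:str/2 check.
    if "://" not in line:
        return line
    return CREDENTIALS.sub("://", line)


print(scrub_line("GET http://user:secret@example.com/index.html"))
```

Note that the pattern is deliberately simple and can over-match on lines that hold several URLs, so treat it as a starting point only.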

The cursory benchmarks (run on log files of varying size) with the Python and the erlang code confirmed other people’s experiences with erlang’s regex performance (but see also this interesting “rebuttal”).

Log file size | Python times | Erlang times
--------------+--------------+--------------
1 MB          | 0m0.230s     | 0m1.896s
10 MB         | 0m1.510s     | 0m8.766s
100 MB        | 0m14.793s    | 1m17.662s
1 GB          | 2m55.012s    | 13m54.588s

The do-it-yourself construction

Curious to learn whether performance could be improved by abstaining from regular expressions, I came up with an alternative implementation that does not use regexp.

As you can see below, the do-it-yourself construction does perform slightly better on the larger files (it is marginally slower on the 1 MB file), at the expense of being very specialized and requiring 60% more code.

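Log file size | Python times | Erlang regexp | Erlang do-it-yourself
--------------+--------------+---------------+-----------------------
1 MB          | 0m0.230s     | 0m1.896s      | 0m1.969s
10 MB         | 0m1.510s     | 0m8.766s      | 0m8.459s
100 MB        | 0m14.793s    | 1m17.662s     | 1m12.448s
1 GB          | 2m55.012s    | 13m54.588s    | 13m3.360s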
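
The erlang source of the do-it-yourself variant is not reproduced here. To give an idea of what scanning without regular expressions can look like, here is a hypothetical Python 3 sketch. The function name and the rule that the host part of a URL ends at the first slash or whitespace character are assumptions made for this illustration, not a transcription of the erlang code.

```python
def scrub_line_manual(line):
    """Remove user:password@ from URLs using plain string searches only."""
    out = []
    pos = 0
    while True:
        start = line.find("://", pos)
        if start == -1:
            # No further URL: copy the rest of the line and stop.
            out.append(line[pos:])
            break
        auth_start = start + 3
        out.append(line[pos:auth_start])
        # Assumption: the host part ends at the first '/' or whitespace.
        end = auth_start
        while end < len(line) and line[end] not in "/ \t\r\n":
            end += 1
        at = line.rfind("@", auth_start, end)
        colon = line.find(":", auth_start, at) if at != -1 else -1
        if at != -1 and colon != -1:
            pos = at + 1      # skip over "user:password@"
        else:
            pos = auth_start  # nothing to remove in this URL
    return "".join(out)


print(scrub_line_manual("GET http://user:secret@example.com/index.html"))
```

Even this small version shows the trade-off mentioned above: the logic is tied to one very specific input shape and takes considerably more code than a single substitution call.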

In conclusion

Every couple of months or so I develop a euphoria towards erlang which is consistently dampened by using the language to tackle problems for which it admittedly was not designed in the first place.

I guess most people start using a language for simple programming exercises first as opposed to building something like a Jabber/XMPP instant messaging server straightaway.

I hate to repeat myself but improving the standard library (by adding common functionality and making sure it performs decently) would do a lot to attract fresh talent to the erlang community and I hear that a certain rate of influx of “fresh blood” is a necessary prerequisite for success.

Ah, and no, you were not supposed to grok the sentence above unless you read it three times 🙂

Fun with PostgreSQL, Psycopg2 and Bytea arrays

Introduction

Despite being a great DBMS, PostgreSQL has a few wrinkles that can cause quite a bit of pain 🙂

One such wrinkle is the insertion of database rows with Bytea arrays.

If you’re not dealing with PostgreSQL or don’t need to wrangle with Bytea arrays feel free to skip this article. If you are, however, this article may save you a lot of time and frustration.

The code

Since the correct number of backslashes in the code below is important but unlikely to be displayed properly in your web browser please view it syntax highlighted here or in plain format here.

The code below consists of roughly two sections:

  • the actual code of interest dealing with the backslash plague and the quoting orgy (prepareByteaString(), lines 49-82)
  • a test harness (lines 84-118) merely facilitating the testing of the function of interest

The problem at hand is that the PostgreSQL Array Value Input syntax requires that the Bytea array literal be enclosed in single quotes. Psycopg2, however, quotes individual byte strings using single quotes as well.

Please note: a byte string or array corresponds to one Bytea variable. A Bytea array is hence an array of byte arrays or strings.

The resulting Bytea array literals are a mess and cause syntax errors when used in INSERT statements etc. The approach chosen is to “re-quote” the Bytea literals returned by the Psycopg2 Binary() function from single to double quotes. For more detail please see the comments on lines 68-80.

  36 # created: Thu Oct 25 21:35:50 2007
  37 __version__ = "$Id:$"
  38 # $HeadURL $
  39 
  40 import psycopg2 as ps2
  41 
  42 class PGT(object):
  43     """Utility functions for using a PostgreSQL database with python"""
  44 
  45     def __init__(self):
  46         """initialiser"""
  47         # super(NEW_CLASS, self).__init__()
  48 
  49     @staticmethod
  50     def prepareByteaString(byteaSeq):
  51         """
  52         Given a sequence of byte arrays this function prepares a properly
  53         quoted string to be used for inserting database rows with Bytea
  54         arrays.
  55 
  56         Given e.g. a table like the following
  57             Create table baex(byteaa Bytea ARRAY[16]);
  58         the resulting string ('bas') can be used as follows:
  59             cursor.execute("INSERT INTO baex(byteaa) values('{%s}')" % bas)
  60 
  61         Parameters:
  62         - byteaSeq: a sequence of byte arrays each corresponding to a Bytea
  63                     value in the database
  64         Returns:
  65         string: containing all the byte arrays from the 'byteaSeq' properly
  66                 quoted for utilisation in an INSERT statement
  67         """
  68         # in a first step
  69         #   1. quote all the byte arrays (using the psycopg2 Binary()
  70         #      function)
  71         #   2. strip away the single quotes on the left and right side
  72         #   3. escape any double quote characters with a backslash
  73         # The last step is necessary because we will use the double quote as
  74         # the quote/delimiter for the byte arrays
  75         baSeq = [str(ps2.Binary(ba))[1:-1].replace(r'"', r'\"') for ba in byteaSeq]
  76         # join the prep'ed byte arrays into a single string
  77         bas = "\"%s\"" % '","'.join(baSeq)
  78         # double the number of backslashes (needed because we're inserting a
  79         # Bytea array as opposed to a single Bytea)
  80         bas = bas.replace('\\', '\\\\')
  81         # done!
  82         return(bas)
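
To look at the string juggling on lines 75-80 in isolation, the following Python 3 sketch applies the same three transformations to literals that are already wrapped in single quotes. The function name requote_for_array is hypothetical, and the input strings are hand-written stand-ins, not real Binary() output, so that the snippet runs without psycopg2 or a database.

```python
def requote_for_array(quoted_literals):
    """Turn single-quoted literals into one double-quoted, comma-separated
    string, following the steps of prepareByteaString()."""
    # Strip the outer single quotes and escape embedded double quotes.
    stripped = [lit[1:-1].replace('"', '\\"') for lit in quoted_literals]
    # Wrap every element in double quotes and join them with commas.
    joined = '"%s"' % '","'.join(stripped)
    # Double all backslashes for the enclosing array literal.
    return joined.replace('\\', '\\\\')


# Stand-in input: two literals, the second one containing a double quote.
print(requote_for_array(["'ab'", "'c\"d'"]))
print(requote_for_array([]))
```

Note that an empty input sequence still produces a pair of double quotes, which is consistent with the {""} shown for the empty sequence in the second test run below.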

The code below is only executed when you invoke the Python file directly but will not run if you import it.

  84 if __name__ == '__main__':
  85     ### TEST code ********************************************************
  86     import os, sys
  87     from random import random as rand
  88 
  89     # connect to test database
  90     db = ps2.connect("dbname='test' user='postgres'")
  91     cursor = db.cursor()
  92     # create test table
  93     cursor.execute('Create Table public.baex(id Serial, byteaa Bytea ARRAY[16])')

For test purposes I am connecting to my test database (line 90) and creating a test table (baex, line 93).

  95     sys.stdout.flush()
  96     print "\n******************** Bytea data generated: ********************"

Then I generate Bytea data for three database rows and insert it into the database (loop on lines 98-109).

  97     # generate 3 byte array sequences
  98     for rowId in range(1, 4):
  99         byteaSeq = []

Each row is populated with a Bytea array holding up to six elements (the range() call on line 101 yields between zero and six iterations). I am using the random() function to generate the bytes.

 100         # generate 0-6 random byte arrays
 101         for numOfSstrings in range(1, int(rand()*8)):
 102             bytea = ''.join([chr(int(rand()*256)) for x in range(int(rand()*5))])
 103             byteaSeq.append(bytea)
 104         print byteaSeq
 105         # get the INSERT string for the byte array sequence generated
 106         bas = PGT.prepareByteaString(byteaSeq)
 107         # insert the row into the table
 108         cursor.execute("INSERT INTO public.baex(id, byteaa) VALUES(%s, '{%s}')" \
 109                        % (rowId, bas))
 110     db.commit()
 111     sys.stdout.flush()

Once the data is inserted into the database I run psql to check whether everything worked properly..

 113     print "\n******************** Bytea data inserted: ********************"
 114     sys.stdout.flush()
 115     # show the data inserted
 116     os.system("psql -d test -U postgres -c 'SELECT * FROM public.baex'")

.. and finally drop the test table.

 117     cursor.execute('DROP TABLE public.baex')
 118     db.commit()

A few test runs

I have selected the two test runs below because they show some interesting (edge) cases.

The first one shows that single quotes are dealt with properly (last byte in the last byte string of the first row and the first byte string of the third row).

******************** Bytea data generated: ********************
['R\xb7', '\xd7\xcd\xbb', 'u', "\xa3[\x97'"]
['\xfbT\x84g', '\xfa', '']
["'", '\x8d', '', '\xc5\n5', 'A', '\xa4P*']

******************** Bytea data inserted: ********************
 id |                    byteaa
----+-----------------------------------------------
  1 | {"R\\267","\\327\\315\\273",u,"\\243[\\227'"}
  2 | {"\\373T\\204g","\\372",""}
  3 | {',"\\215","","\\305\125",A,"\\244P*"}
(3 rows)

The second test run demonstrates that double quotes are handled correctly (first byte of the first byte string in the first row).

******************** Bytea data generated: ********************
['"\xa5\x13', '\x9a`', '', '\xf4\x98']
[]
['u', '?7', '1\xff', '.\x16\xcf', '\xcbe}', '']

******************** Bytea data inserted: ********************
 id |                   byteaa
----+--------------------------------------------
  1 | {"\"\\245\23","\\232`","","\\364\\230"}
  2 | {""}
  3 | {u,?7,"1\\377",".\26\\317","\\313e}",""}
(3 rows)

Conclusion

Once you bite the bullet and invest the time to think about the problem, code the function and test it, it’s simple 🙂

More syntactic sugar please!

Despite enjoying my hacking experience in Python 🙂 there are a few rare constructs from other languages (mostly falling into the syntactic sugar category) that I am missing on occasion.

One such facility is Perl’s family of quote operators.

I hate typing too much 🙂 and Perl’s qw operator relieves me of typing all the quotes and commas normally needed to instantiate a list of words (see line 2).

  1 bbox33:mhr $ perl -e '
  2 > @list = qw(The quick brown fox jumps over the lazy dog);
  3 > print join('_', @list), "\n";
  4 > '
  5 The_quick_brown_fox_jumps_over_the_lazy_dog

Conversely, initialising a list of strings in Python is messy since I have to type all that cruft (see line 10).

  6 bbox33:mhr $ python
  7 Python 2.5.1 Stackless 3.1b3 060516 (python-2.51:55546, May 24 2007, 08:50:09)
  8 [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
  9 Type "help", "copyright", "credits" or "license" for more information.
 10  >>> list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
 11  >>> print '_'.join(list)
 12 The_quick_brown_fox_jumps_over_the_lazy_dog

The best shortcut I have found so far is to type all the words needed in a long string and split it (see line 13).

 13  >>> list2 = 'The quick brown fox jumps over the lazy dog'.split()
 14  >>> print '_'.join(list2)
 15 The_quick_brown_fox_jumps_over_the_lazy_dog
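
If the split() shortcut is needed often, it can be wrapped in a tiny helper that reads a little like Perl's operator. This is only a sketch for Python 3; the name qw is borrowed from Perl for illustration and is not part of Python or its standard library.

```python
def qw(words):
    """Split a whitespace-separated string into a list of words."""
    return words.split()


# A triple-quoted string lets the word list span several lines.
word_list = qw("""
    The quick brown
    fox jumps over
    the lazy dog
""")
print('_'.join(word_list))
```

Because split() without arguments collapses any run of whitespace, the line breaks and indentation inside the string do not end up in the list.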

Just wondering: did I miss some piece of Python magic here? Are there any other/better ways to initialise sequences of strings in Python?