My new favourite book

I attended the UDS in Mountain View last week and it proved to be one of those fascinating, albeit somewhat exhausting, events (please see either of these resources for UDS reports and commentary).

Anyway, while there I managed to get my hands on a paper copy of Real World Haskell. Being busy with the UDS, I only started reading it on the plane back to Frankfurt. Despite being very, very tired I enjoyed it thoroughly and have to say it’s one of the best technical books I have ever perused.

All successful (technical) projects have a vibrant community and great documentation. This book makes Haskell so accessible that it may well be the last piece needed for a great breakthrough for Haskell and functional programming in general!

Whether you are interested in functional programming or merely seeking to broaden your horizons, don’t delay, go out and grab a copy of this book. You won’t be disappointed.

Text filtering with Erlang

Introduction

After a long break I picked up the Erlang book again and my appetite for writing some Erlang code was soon kindled.

A small Python component I produced at work seemed like a good candidate for my (sequential) Erlang exercises. It is a fairly simple component that removes user/password data embedded in URLs.
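
For orientation, here is a minimal sketch of the idea in Python (not the actual component I wrote at work), using the same substitution pattern that appears in the Erlang code further below:

    import re
    import sys

    # Strip "user:password@" credentials embedded in URLs; same pattern
    # as the Erlang regexp:gsub() call shown later in this article.
    CRED_RE = re.compile(r'://[^:]+:[^@]+@')

    def scrub_line(line):
        return CRED_RE.sub('://', line)

    if __name__ == '__main__':
        for line in open(sys.argv[1]):
            sys.stdout.write(scrub_line(line))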

Just so you know where I am coming from:

  • my main/favourite programming language is Python
  • my exercises are mainly about sequential, non-distributed and non-telecoms-related problems, whereas Erlang’s main strength and appeal lies in the area of parallel/distributed telecoms/networking systems
  • I have played with Erlang a little bit before (ring benchmark, XML parsing) and liked it in general, although IMHO it is severely lacking when it comes to the availability and quality of standard library components.

Now that my particular set of preconceptions is clear and in the open, let’s look at the stuff below :-)

File processing with Erlang’s regexp module

The initial implementation of the URL filter in Erlang used the regexp module from the standard library.

    -module(regex).
    -export([main/1]).

    %% Is character code C in the ASCII alphanumeric range?
    isalphanum(C) when C > 47, C < 58; C > 64, C < 91; C > 96, C < 123 -> true;
    isalphanum(_) -> false.

    %% Generate a temporary file name of length N
    genname(0, L) -> L;
    genname(N, L) ->
        R = random:uniform(123),
        case isalphanum(R) of
            true -> genname(N-1, [R|L]);
            false -> genname(N, L)
        end.

    %% Returns a randomly generated temporary file path where the basename is
    %% of length N
    mktemppath(Prefix, N) -> Prefix ++ "/" ++ genname(N, []).

Please note how the code above has to implement functionality (checking for alphanumeric characters, generating a temporary file name) that is absent from the standard library.
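
For comparison, Python’s standard library covers this with a single call; a quick illustration (hypothetical, and obviously not part of the Erlang solution):

    import os
    import tempfile

    # Roughly what genname/mktemppath hand-roll above: create a uniquely
    # named temporary file under a given directory and return its path.
    if not os.path.isdir('tmp'):
        os.mkdir('tmp')
    fd, path = tempfile.mkstemp(dir='tmp')
    os.close(fd)
    print path

With that aside, back to the Erlang module: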

    %% Removes passwords embedded in URLs from a log file.
    scrub_file(Tmpdir, F) ->
        %% make a temporary directory if it does not exist yet.
        case file:make_dir(Tmpdir) of
            ok -> ok;
            {error,eexist} -> ok;
            _ -> exit({error, failed_to_make_tmpdir})
        end,

        %% Move the original file out of the way.
        T = mktemppath(Tmpdir, 16),
        case file:rename(F, T) of
            ok -> ok;
            _ -> exit({error, failed_to_move_file})
        end,

        %% Now open it for reading.
        {ok, In} = file:open(T, [read]),
        %% Open the original path for writing.
        {ok, Out} = file:open(F, [write]),

        %% Call the function that will scrub the lines.
        scrub_lines(In, Out),

        %% Close the file handles and return the path to the original file.
        file:close(Out),
        file:close(In),
        T.

The code that scrubs the URLs is shown below; note that the scrub_lines() function is tail recursive.

    %% This is where the log file is actually read linewise and where
    %% the scrubbing function is invoked for lines that contain URLs.
    scrub_lines(In, Out) ->
        L = io:get_line(In, ''),
        case L of
            eof -> ok;
            _ ->
                %% Does the line contain URLs?
                case string:str(L, "://") of
                    0 -> io:format(Out, "~s", [L]);
                    _ ->
                        case regexp:gsub(L, "://[^:]+:[^@]+@", "://") of
                            {ok, S, _} -> io:format(Out, "~s", [S]);
                            {R, S, _} -> io:format("Failed: {~p,~p}", [R,S])
                        end
                end,
                %% Continue with next line.
                scrub_lines(In, Out)
        end.

    %% Main function.
    main([A]) ->
        {A1,A2,A3} = now(),
        random:seed(A1, A2, A3),

        %% A single argument (the name of the file to be scrubbed) is expected.
        F = atom_to_list(A),
        T = scrub_file("tmp", F),

        %% The scrubbed file content will be written to a new file that's
        %% in the place of the original file. Where was the latter moved to?
        io:format("~s~n", [T]),

        init:stop().

The cursory benchmarks performed (on log files of varying size) using the Python and the Erlang code confirmed other people’s experiences with Erlang’s regex performance (but see also this interesting “rebuttal”).

Log file size   Python times   Erlang times
1 MB            0m0.230s       0m1.896s
10 MB           0m1.510s       0m8.766s
100 MB          0m14.793s      1m17.662s
1 GB            2m55.012s      13m54.588s

The do-it-yourself construction

Curious to learn whether performance could be improved by abstaining from regular expressions, I came up with an alternative implementation that does not use the regexp module.
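
The Erlang do-it-yourself code is not reproduced here, but to give a flavour of the approach, here is a rough sketch of the same idea in Python (purely illustrative; it is not the version that was benchmarked):

    def scrub_line(line):
        # Hand-rolled equivalent of gsub(L, "://[^:]+:[^@]+@", "://"):
        # scan for "://", then splice out any "user:password@" that follows.
        out, pos = [], 0
        while True:
            i = line.find('://', pos)
            if i == -1:
                out.append(line[pos:])
                return ''.join(out)
            j = i + 3
            colon = line.find(':', j)
            at = line.find('@', j)
            out.append(line[pos:j])
            if colon != -1 and at != -1 and j < colon < at - 1:
                pos = at + 1        # skip over "user:password@"
            else:
                pos = j             # no credentials here, carry on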

As you can see below, the do-it-yourself construction does indeed perform slightly better, at the expense of being very specialised and requiring 60% more code.

Log file size   Python times   Erlang (regexp)   Erlang (do-it-yourself)
1 MB            0m0.230s       0m1.896s          0m1.969s
10 MB           0m1.510s       0m8.766s          0m8.459s
100 MB          0m14.793s      1m17.662s         1m12.448s
1 GB            2m55.012s      13m54.588s        13m3.360s

In conclusion

Every couple of months or so I develop a certain euphoria towards Erlang, which is consistently dampened when I use the language to tackle problems it admittedly was not designed for in the first place.

I guess most people start out using a language for simple programming exercises, as opposed to building something like a Jabber/XMPP instant messaging server straight away.

I hate to repeat myself, but improving the standard library (by adding common functionality and making sure it performs decently) would do a lot to attract fresh talent to the Erlang community, and I hear that a certain rate of influx of “fresh blood” is a necessary prerequisite for success.

Ah, and no, you were not supposed to grok the sentence above unless you read it three times :-)

Learn the Python standard library

One of the big benefits of making code available publicly is the feedback received. Sometimes the feedback points to Python standard library modules I was unaware of.

A case in point is a post on Linux Questions in which the author points out the fileinput module.

Python indeed comes with the batteries included :-)

I will have a more thorough look at the fileinput module and probably refactor the scriptutil.freplace() function (described here) to make use of it.
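
To give an idea of what such a refactoring might look like, here is a hypothetical sketch built on fileinput (the name freplace_sketch and the assumed (pattern, replacement) form of 'regexl' are mine, not the current scriptutil code):

    import fileinput
    import re
    import sys

    def freplace_sketch(paths, regexl, bext='.bak'):
        """In-place search & replace using the standard fileinput module.
        'regexl' is assumed to be a sequence of (pattern, replacement) pairs."""
        compiled = [(re.compile(pat), repl) for pat, repl in regexl]
        # inplace=True redirects stdout into the file being processed;
        # backup=bext keeps a copy of the original under that suffix.
        for line in fileinput.input(paths, inplace=True, backup=bext):
            for rx, repl in compiled:
                line = rx.sub(repl, line)
            sys.stdout.write(line)
        fileinput.close()

    # Example (hypothetical):
    # freplace_sketch(['README'], [('distribution', '**package**')], '.bakk')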

Python: find files using Unix shell-style wildcards

Introduction

In the article that follows I will show how the scriptutil.py module (syntax highlighted code here) can be used

  • to find files using Unix shell-style wildcards
  • to search inside the found files and to perform in-place search & substitute operations on them

The examples below all operate on the Django project’s source code tree.

Finding files

In the following example I use the scriptutil.ffind() function to find files whose names match either 'README*' or 'AUTH*' (line 6 below). On the subsequent line a helper function is invoked to pretty-print the search results, which are then displayed on lines 8-17.

  1 Python 2.4.4 (#1, May  9 2007, 11:05:23)
  2 [GCC 4.0.1 (Apple Computer, Inc. build 5367)] on darwin
  3 Type "help", "copyright", "credits" or "license" for more information.
  4  >>> import scriptutil as SU
  5  >>> import re
  6  >>> flist = SU.ffind('.', shellglobs=('README*', 'AUTH*'))
  7  >>> SU.printr(flist)
  8 ./.svn/text-base/AUTHORS.svn-base
  9 ./.svn/text-base/README.svn-base
 10 ./AUTHORS
 11 ./README
 12 ./django/contrib/flatpages/.svn/text-base/README.TXT.svn-base
 13 ./django/contrib/flatpages/README.TXT
 14 ./django/contrib/redirects/.svn/text-base/README.TXT.svn-base
 15 ./django/contrib/redirects/README.TXT
 16 ./extras/.svn/text-base/README.TXT.svn-base
 17 ./extras/README.TXT

In most cases I will not be interested in any files that are internal to the Subversion revision control system (lines 8, 9, 12, 14 and 16). Hence the filter function on line 19 (below) that rids me of these.

 18  >>> flist = SU.ffind('.', shellglobs=('README*', 'AUTH*'),
 19  ...                   namefs=(lambda s: '.svn' not in s,))
 20  >>> flist
 21 ['./README', './AUTHORS', './django/contrib/flatpages/README.TXT',
 22  './django/contrib/redirects/README.TXT', './extras/README.TXT']
 23  >>> SU.printr(flist)
 24 ./AUTHORS
 25 ./README
 26 ./django/contrib/flatpages/README.TXT
 27 ./django/contrib/redirects/README.TXT
 28 ./extras/README.TXT

As we can see on lines 24-28, the Subversion-internal files are no longer part of the result set.

The example above also shows that shell-style wildcards operate on file names, whereas the filter functions passed via the 'namefs' parameter match on the full file path.
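
To make that distinction concrete, here is a tiny illustration (the fnmatch call is my assumption about how the glob matching is presumably done; scriptutil's internals may differ):

    import fnmatch
    import os.path

    path = './django/contrib/flatpages/README.TXT'
    # The shell glob is matched against the file name only ...
    print fnmatch.fnmatch(os.path.basename(path), 'README*')    # True
    # ... whereas a 'namefs' filter function receives the full path.
    print (lambda s: '.svn' not in s)(path)                      # True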

Please note: this article provides slightly more detail and additional scriptutil.ffind() examples you may want to explore.

Finding files and searching inside them

The brief scriptutil.ffindgrep() example below shows how one can search inside the files found.

 29  >>> flist = SU.ffindgrep('.', shellglobs=('README*', 'AUTH*'),
 30  ...                      namefs=(lambda s: '.svn' not in s,),
 31  ...                      regexl=(('Django', re.I), 'dist'))
 32  >>> flist
 33 {'./django/contrib/redirects/README.TXT':
 34  '    * The file django/docs/redirects.txt in the Django distribution',
 35  './django/contrib/flatpages/README.TXT':
 36  '    * The file docs/flatpages.txt in the Django distribution'}
 37  >>> SU.printr(flist)
 38     * The file docs/flatpages.txt in the Django distribution
 39 ./django/contrib/flatpages/README.TXT
 40     * The file django/docs/redirects.txt in the Django distribution
 41 ./django/contrib/redirects/README.TXT

The 'regexl' parameter (see line 31 above) contains two search items:

  1. the string ‘Django’, to be searched for in a case-insensitive fashion
  2. the string ‘dist’, to be searched for as is (i.e. case sensitively)
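
Purely as an illustration of these two forms (this is a guess at how such mixed entries might be normalised internally, not scriptutil's actual code):

    import re

    regexl = (('Django', re.I), 'dist')
    compiled = []
    for entry in regexl:
        if isinstance(entry, tuple):
            pattern, flags = entry
        else:
            pattern, flags = entry, 0
        # 'Django' will now match case-insensitively, 'dist' exactly as given.
        compiled.append(re.compile(pattern, flags))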

The results returned by the function are displayed on lines 33-36 and pretty-printed on lines 38-41 respectively. For more detail on the scriptutil.ffindgrep() function please see one of my previous articles.

In-place search/substitute on files

Last but not least here’s an example of how the scriptutil.freplace() function can be utilised to search for strings in files and substitute them.

The 'regexl' parameter (passed on line 44 below) is a sequence of 3-tuples, each having the following elements:

  • search string (Python regex syntax)
  • replace string (re.sub() replacement syntax)
  • regex compilation flags or ‘None’ (re.compile syntax)

The 'bext' parameter specified on the subsequent line is the file name suffix to be used for backup copies of the modified files.

 42  >>> flist = SU.freplace('.', shellglobs=('README*',),
 43  ...                     namefs=(lambda s: '.svn' not in s,),
 44  ...                     regexl=(('distribution', '**package**', None),),
 45  ...                     bext='.bakk')

The function call above will search for all occurrences of the string 'distribution' and replace them with the string '**package**'. Please note that only files that pass the name filters (specified on lines 42-43) will be considered.

By searching for the backup files (line 46) we can see that the function call above resulted in two modified files.

 46  >>> flist = SU.ffind('.', shellglobs=('*.bakk',))
 47  >>> SU.printr(flist)
 48 ./django/contrib/flatpages/README.TXT.bakk
 49 ./django/contrib/redirects/README.TXT.bakk

Finally, I am searching for the replacement string '**package**' (line 50) to check that the substitution worked.

 50  >>> flist = SU.ffindgrep('.', regexl=('\\*\\*package\\*\\*',))
 51  >>> SU.printr(flist)
 52     * The file docs/flatpages.txt in the Django **package**
 53 ./django/contrib/flatpages/README.TXT
 54     * The file django/docs/redirects.txt in the Django **package**
 55 ./django/contrib/redirects/README.TXT

In conclusion

Again, I hope you liked this (brief) overview of the scriptutil.py module. In case more detailed documentation is required, I would like to mention that the functions presented above are documented quite extensively through docstrings. Please check these out in the syntax-highlighted source.