Text filtering with erlang

April 30, 2008 by muharem

Introduction

After a long break I picked up the Erlang book again and my appetite for writing some erlang code was soon kindled.

A small Python component I produced at work seemed like a good candidate for my (sequential) erlang exercises. It is a fairly simple component that removes user/password data embedded in URLs.

Just so you know where I am coming from:

  • my main/favourite programming language is Python
  • my exercises are mainly about sequential, non-distributed and non-telecoms-related problems whereas erlang’s main strength and appeal lies in the area of parallel/distributed telecoms/networking systems
  • I have played with erlang a little bit before (ring benchmark, XML parsing) and liked it in general although IMHO it lacks severely when it comes to the availability and quality of standard library components.

Now that my particular set of preconceptions is clear and in the open, let’s look at the stuff below :-)

File processing with erlang’s regexp module

The initial implementation of the URL filter in erlang used its regexp library.

  1 -module(regex).
  2 -export([main/1]).
  3 
  4 isalphanum(C) when C > 47, C < 58; C > 64, C < 91; C > 96, C < 123 -> true;
  5 isalphanum(_) -> false.
  6 
  7 %% Generate a temporary file name of length N
  8 genname(0, L) -> L;
  9 genname(N, L) ->
 10     R = random:uniform(123),
 11     case isalphanum(R) of
 12         true -> genname(N-1, [R|L]);
 13         false -> genname(N, L)
 14     end.
 15 
 16 %% Returns a randomly generated temporary file path where the basename is
 17 %% of length N
 18 mktemppath(Prefix, N) -> Prefix ++ “/” ++ genname(N, []).
 19 

Please note how I had to implement functionality absent from the standard library above.

 20 %% Removes passwords embedded in URLs from a log file.
 21 scrub_file(Tmpdir, F) ->
 22     %% make a temporary directory if it does not exist yet.
 23     case file:make_dir(Tmpdir) of
 24         ok -> ok;
 25         {error,eexist} -> ok;
 26         _ -> exit({error, failed_to_make_tmpdir})
 27     end,
 28 
 29     %% Move the original file out of the way.
 30     T = mktemppath(Tmpdir, 16),
 31     case file:rename(F, T) of
 32         ok -> ok;
 33         _ -> exit({error, failed_to_move_file})
 34     end,
 35 
 36     %% Now open it for reading.
 37     {_, In} = file:open([T], read),
 38     %% Open the original path for writing.
 39     {_, Out} = file:open([F], write),
 40 
 41     %% Call the function that will scrub the lines.
 42     scrub_lines(In, Out),
 43 
 44     %% Close the file handles and return the path to the original file.
 45     file:close(Out),
 46     file:close(In),
 47     T.
 48 

The code that scrubs the URLs is below, the scrub_lines() function is tail recursive.

 49 %% This is where the log file is actually read linewise and where
 50 %% the scrubbing function is invoked for lines that contain URLs.
 51 scrub_lines(In, Out) ->
 52     L = io:get_line(In, ),
 53     case L of
 54         eof -> ok;
 55         _ ->
 56             %% Does the line contain URLs?
 57             case string:str(L, “://”) of
 58                 0 -> io:format(Out, ~s, [L]);
 59                 _ ->
 60                     case regexp:gsub(L, “://[^:]+:[^@]+@”, “://”) of
 61                         {ok, S, _} -> io:format(Out, ~s, [S]);
 62                         {R, S, _} -> io:format(“Failed: {~p,~p}”, [R,S])
 63                     end
 64             end,
 65             %% Continue with next line.
 66             scrub_lines(In, Out)
 67     end.
 68 
 69 %% Main function.
 70 main([A]) ->
 71     {A1,A2,A3} = now(),
 72     random:seed(A1, A2, A3),
 73 
 74     %% A single argument (the name of the file to be scrubbed) is expected.
 75     F = atom_to_list(A),
 76     T = scrub_file(“tmp”, F),
 77 
 78     %% The scrubbed file content will be written to a new file that’s
 79     %% in the place of the original file. Where was the latter moved to?
 80     io:format(~s~n, [T]),
 81 
 82     init:stop().

The cursory benchmarks performed (on log files of varying size) using the python and the erlang code confirmed other people’s experiences with erlang’s regex performance (but see also this interesting “rebuttal”).

Log file size Python times Erlang times
1 MB 0m0.230s 0m1.896s
10 MB 0m1.510s 0m8.766s
100 MB 0m14.793s 1m17.662s
1 GB 2m55.012s 13m54.588s

The do-it-yourself construction

Curious to learn whether the performance can be improved by abstaining from regular expressions I came up with an alternative implementation that does not use regexp.

As you can see below the do-it-yourself construction is indeed performing slightly better at the expense of being very specialized and requiring 60% more code.

Log file size Python times Erlang regexp Erlang do-it-yourself
1 MB 0m0.230s 0m1.896s 0m1.969s
10 MB 0m1.510s 0m8.766s 0m8.459s
100 MB 0m14.793s 1m17.662s 1m12.448s
1 GB 2m55.012s 13m54.588s 13m3.360s

In conclusion

Every couple of months or so I develop a euphoria towards erlang which is consistently dampened by using the language to tackle problems for which the language admittedly was not designed in first place.

I guess most people start using a language for simple programming exercises first as opposed to building something like a Jabber/XMPP instant messaging server straightaway.

I hate to repeat myself but improving the standard library (by adding common functionality and making sure it performs decently) would do a lot to attract fresh talent to the erlang community and I hear that a certain rate of influx of “fresh blood” is a necessary prerequisite for success.

Ah, and no, you were not supposed to grok the sentence above unless you read it three times :-)

My new “baby laptop”

January 26, 2008 by muharem

I am using a MacBook Pro for roughly 10 months now and I am generally very happy with it. Nevertheless, I like experimenting and playing with other operating systems, primarily with various linux distributions and with members of the *BSD family.

Running Ubuntu or FreeBSD in a virtual machine (using VMWare’s fusion or Parallels desktop) is a good way to get a first impression of such a system but at some point you will want real hardware. I reached that point with Ubuntu when I took an interest in hard disk encryption (why you should encrypt your hard disk).

It just so happened that around that time Aldi (a German retailer) had a nice 12 inch subnotebook on offer. Aldi was targeting the female clientele (the notebook was even adorned with rhinestones etc.) which resulted in a 100 Euro price markup. The system was still a very good value for the money, however, and I even managed to get my hands on the black model :-)

I booted the system up (it had Windows Vista Home Premium pre-installed) in order to take a backup of the system in case I had to restore the original state (prior to sending in it for service for example).

Vista was atrocious, totally crippling the poor thing. I mean, it was horrible, even the simplest interactions or commands took a ridiculous amount of time to complete.
Anyway, I managed to take a snapshot of the hard disk, wiped it clean and installed Ubuntu-7.10.

After running Vista on the box for a few hours, Ubuntu was a relief :-) it felt very snappy and was generally performing very well. I could do all my development work on the notebook without any problems.

I thought that was very cool :-) I finally had a second box, a “baby laptop”, I could use for my experiments. Having two laptops may sound a little bit excessive unless you tamper frequently with all sorts of operating systems. You can keep one machine in a “stable” condition and do all the proper (money generating) work on it while totally rebuilding the other.

I am a slightly paranoid person :-) and like to have as much security on my machines as I can get. I hence wanted to install Ubuntu with full hard disk encryption but more on that topic in a forthcoming article..

MacFUSE

January 3, 2008 by muharem

Hello there,

for anybody owning a Mac and having an interest in file systems I just wanted to point to MacFUSE (a FUSE-Compliant File System Implementation Mechanism for Mac OS X) which is another brilliant piece of work done by Amit Singh (of “Mac OS X Internals” fame).

There is also a technical presentation given by Amit on the MacFUSE project (YouTube video).

Fun with PostgreSQL, Psycopg2 and Bytea arrays

October 27, 2007 by muharem

Introduction

Despite being a great DBMS, PostgreSQL has a few wrinkles that can cause quite a bit of pain :-)

One such wrinkle is the insertion of database rows with Bytea arrays.

If you’re not dealing with PostgreSQL or don’t need to wrangle with Bytea arrays feel free to skip this article. If you are, however, this article may save you a lot of time and frustration.

The code

Since the correct number of backslashes in the code below is important but unlikely to be displayed properly in your web browser please view it syntax highlighted here or in plain format here.

The code below consists of roughly two sections:

  • the actual code of interest dealing with the backslash plague and the quoting orgy (prepareByteaString(), lines 49-82)
  • a test harness (lines 84-11 8) merely facilitating the testing of the function of interest

The problem on hand is that the PostgreSQL Array Value Input syntax requires that the Bytea array literal be enclosed in single quotes. Psycopg2 however quotes individual byte strings using single quotes as well.

Please note: a byte string or array corresponds to one Bytea variable. A Bytea array is hence an array of byte arrays or strings.

The resulting Bytea array literals are a mess and cause syntax errors when used in INSERT statements etc. The approach chosen is to “re-quote” the Bytea literals returned by the Psycopg2 Binary() function from single to double quotes, For more detail please see the comments on lines 68-80.

  36 # created: Thu Oct 25 21:35:50 2007
  37 __version__ = “$Id:$ 38 # $HeadURL $
  39 
  40 import psycopg2 as ps2
  41 
  42 class PGT(object):
  43     “”"Utility functions for using a PostgreSQL database with python“”"
  44 
  45     def __init__(self):
  46         “”"initialiser“”"
  47         # super(NEW_CLASS, self).__init__()
  48 
  49     @staticmethod
  50     def prepareByteaString(byteaSeq):
  51         “”"
  52         Given a sequence of byte arrays this function prepares a properly
  53         quoted string to be used for inserting database rows with Bytea
  54         arrays.
  55 
  56         Given e.g. a table like the following
  57             Create table baex(byteaa Bytea ARRAY[16]);
  58         the resulting string (’bas’) can be used as follows:
  59             cursor.execute(”INSERT INTO baex(byteaa) values(’{%s}’)” % bas))
  60 
  61         Parameters:
  62         - byteaSeq: a sequence of byte arrays each corresponding to a Bytea
  63                     value in the database
  64         Returns:
  65         string: containing all the byte arrays from the ‘byteaSeq’ properly
  66                 quoted for utilisation in an INSERT statement
  67         “”"
  68         # in a first step
  69         #   1. quote all the byte arrays (using the psycopg2 Binary()
  70         #      function)
  71         #   2. strip away the single quotes on the left and right side
  72         #   3. escape any double quote characters with a backslash
  73         # The last step is necessary because we will use the double quote as
  74         # the quote/delimiter for the byte arrays
  75         baSeq = [str(ps2.Binary(ba))[1:-1].replace(r’‘, r’\”‘) for ba in byteaSeq]
  76         # join the prep’ed byte arrays into a single string
  77         bas = “\”%s\”” % ‘“,”‘.join(baSeq)
  78         # double the number of backslashes (needed because we’re inserting a
  79         # Bytea array as opposed to a single Bytea)
  80         bas = bas.replace(’\\‘, ‘\\\\‘)
  81         # done!
  82         return(bas)

The code below is only executed when you invoke the Python file directly but will not run if you import it.

  84 if __name__ == ‘__main__‘:
  85     ### TEST code ********************************************************
  86     import os, sys
  87     from random import random as rand
  88 
  89     # connect to test database
  90     db = ps2.connect(”dbname=’test’ user=’postgres’“)
  91     cursor = db.cursor()
  92     # create test table
  93     cursor.execute(’Create Table public.baex(id Serial, byteaa Bytea ARRAY[16])‘)

For test purposes I am connecting to my test database (line 90) and creating a test table (baex, line 93).

  95     sys.stdout.flush()
  96     print\n******************** Bytea data generated: ********************

Then I generate Bytea data for three database rows and insert it into the database (loop on lines 98-109).

  97     # generate 3 byte array sequences
  98     for rowId in range(1, 4):
  99         byteaSeq = []

Each row is populated with a Bytea array holding of up to seven elements. I am using the random() function to generate the bytes.

 100         # generate 1-7 random byte arrays
 101         for numOfSstrings in range(1, int(rand()*8)):
 102             bytea = ”.join([chr(int(rand()*256)) for x in range(int(rand()*5))])
 103             byteaSeq.append(bytea)
 104         print byteaSeq
 105         # get the INSERT string for the byte array sequence generated
 106         bas = PGT.prepareByteaString(byteaSeq)
 107         # insert the row into the table
 108         cursor.execute(”INSERT INTO public.baex(id, byteaa) VALUES(%s, ‘{%s}’)\
 109                        % (rowId, bas))
 110     db.commit()
 111     sys.stdout.flush()

Once the data is inserted into the database I run psql to check whether everything worked properly..

 113     print\n******************** Bytea data inserted: ********************114     sys.stdout.flush()
 115     # show the data inserted
 116     os.system(”psql -d test -U postgres -c ‘SELECT * FROM public.baex’“)

.. and finally drop the test table.

 117     cursor.execute(’DROP TABLE public.baex‘)
 118     db.commit()

A few test runs

I have selected the two test runs below because they show some interesting (edge) cases.

The first one shows that single quotes are dealt with properly (last byte in last byte string of first row and first byte string of the third row)

******************** Bytea data generated: ********************
['R\xb7', '\xd7\xcd\xbb', 'u', "\xa3[\x97'"]
['\xfbT\x84g', '\xfa', '']
["'", '\x8d', '', '\xc5\n5', 'A', '\xa4P*']

******************** Bytea data inserted: ********************
 id |                    byteaa
—-+———————————————–
  1 | {”R\\267″,”\\327\\315\\273″,u,”\\243[\\227'"}
  2 | {"\\373T\\204g","\\372",""}
  3 | {',"\\215","","\\305\125",A,"\\244P*"}
(3 rows)

The second test run demonstrates that double quotes are handled correctly (fist byte of first byte string in first row)

******************** Bytea data generated: ********************
['"\xa5\x13', '\x9a`', '', '\xf4\x98']
[]
['u', '?7', '1\xff', '.\x16\xcf', '\xcbe}', '']

******************** Bytea data inserted: ********************
 id |                   byteaa
—-+——————————————–
  1 | {”\”\\245\23″,”\\232`”,”",”\\364\\230″}
  2 | {”"}
  3 | {u,?7,”1\\377″,”.\26\\317″,”\\313e}”,”"}
(3 rows)

Conclusion

Once you bite the bullet and invest the time to think about the problem, code the function and test it, it’s simple :-)

Scrape the web with ruby

September 4, 2007 by muharem

Introduction

In the last few months I have taken some time to play with a number of dynamic languages. My experiments were mostly in the “web hacks” category e.g. fetching files from the web and extracting data of interest from these. For my most recent hack (get wordpress weblog statistics) I used Ruby.

The task at hand

The task at hand consists of fetching the weblog statistics for my wordpress weblog and displaying them in the terminal window.
This includes the handling of possible redirections to the wordpress.com login page, the parsing of the HTML file to be obtained and the extraction of the various weblog statistics.

The tools used

After briefly surveying the tools and libraries available in Ruby-land I settled for WWW::Mechanize a Ruby implementation of Perl’s venerable WWW-Mechanize CPAN module.

Under the hood WWW::Mechanize uses the Hpricot HTML parser.

The approach

The worpress.com weblog statistics pages have a URL with the following structure:

    http://#{user}.wordpress.com/wp-admin/index.php?page=stats

They contain the following statistics for today and yesterday respectively:

  • Referrers: people clicked links from these pages to get to your weblog
  • Top posts: these posts on your weblog got the most traffic
  • Search engine terms: these are terms people used to find your weblog
  • Clicks: your visitors clicked these links on your weblog

Each of the above are structured as follows:

<div class="statsdiv">
<h3><a href="7-day page URL">statistics type</a></h3>
<p>..explanatory text..</p>
<h4>Today</h4>
  <table class="statsDay">
    <tr><th>..</th><th class="views">..</th></tr>
    <tr>
      <td class="label">URL or term</td>
      <td class="views">number of views</td>
    </tr>
    ...
  </table>
<h4>Yesterday</h4>
  <table class="statsDay">
    <tr><th>..</th><th class="views">..</th></tr>
    <tr>
      <td class="label">URL or term</td>
      <td class="views">number of views</td>
    </tr>
    ...
  </table>
</div>

The Ruby code below first finds the <div class="statsdiv"> sub-trees and then extracts today’s data from them.

The code

  1 #!/usr/bin/env ruby
  2 
  3 require 'rubygems'
  4 require 'mechanize'
  5 
  6 HELP_STRING =<<EOS
  7 
  8 Tool for fetching wordpress.com weblog statistics. Usage:
  9 
 10     wls.rb [username] [pwd]
 11 
 12 where 'user' is your wordpress user name and 'pwd' is your
 13 password respectively.
 14 
 15 EOS
 16 
 17 if not ARGV.grep(/-h|–help/).empty?
 18     puts HELP_STRING
 19     exit(0)
 20 end
 21 
 22 # try to access the weblog statistics page
 23 user = 'muharem'
 24 password = nil  # set your password here if you dislike being prompted for it
 25 
 26 if ARGV[0]
 27     user = ARGV[0]
 28 end
 29 if ARGV[1]
 30     password = ARGV[1]
 31 end
 32 
 33 stats_url = "http://#{user}.wordpress.com/wp-admin/index.php?page=stats"
 34 
 35 # instantiate/initialise web agent ..
 36 agent = WWW::Mechanize.new
 37 agent.user_agent_alias = 'Mac Safari'
 38 # .. and get the weblog statistics page
 39 page = agent.get(stats_url)
 40 
 41 # did we get back the login form?
 42 if (page.title.strip.split[-1] == 'Login')
 43     # yes, fill it in and submit it
 44     loginf = page.form('loginform')
 45     loginf.log = user
 46     if not password
 47         print "Enter your wordpress.com password: "
 48         password = $stdin.gets.chomp
 49     end
 50     loginf.pwd = password
 51     agent.submit(loginf, loginf.buttons.first)
 52 end
 53 
 54 # now get the actual weblog statistics page
 55 page = agent.get_file(stats_url)
 56 # parse it!
 57 doc = Hpricot(page)
 58 
 59 # search for the div elements that contain the statistics data
 60 stats_divs = doc.search("//div[@class='statsdiv']")
 61 stats_divs.each do |div|
 62     heading = div.search("h3/a/text()")
 63     # we are only interested in the statistics for today
 64     day = div.search("h4/text()").first
 65     if (heading and day)
 66         heading = "==== #{heading} (#{day.inner_text.downcase}) ====".center(50)
 67         puts "\\n#{heading}\\n"
 68         # find the table with today's statistics data
 69         tab = div.search("table").first
 70         if tab
 71             # extract the statistics data from the <tr> elements
 72             tab.search("tr").each do |tr|
 73                 what = tr.search("td[@class='label']")
 74                 views = tr.search("td[@class='views']")
 75                 whats = what.inner_text.strip()
 76                 if not whats.empty?
 77                     views = views.inner_text.strip()
 78                     printf("%s — %5s\\n", whats.center(45), views)
 79                 end
 80             end
 81         end
 82     end
 83 end
 84 # grab the div with the general (weblog level) statistics data
 85 gbdiv = doc.search("//div[@id='generalblog']")
 86 # find the <p> element with the number of views today
 87 vtoday =  gbdiv.search("p").find { |p| p.inner_text.index('Views today') }
 88 if vtoday
 89     printf("\\n%s\\n\\n", "=> #{vtoday.inner_text.strip} <=".center(45))
 90 else
 91     puts "\\n\\n!! No weblog statistics data found."
 92     puts "   Did you enter a wrong user name and/or password?"
 93 end

Example output

  1            ==== Referrers (today) ====
  2    stumbleupon.com/refer.php?url=http%3A?     —    16
  3    stumbleupon.com/refer.php?url=http%3A?     —     3
  4    planeterlang.org/story.php?title=Erla?     —     2
  5    linuxquestions.org/questions/showthre?     —     2
  6        del.icio.us/jdkimball/stackless        —     1
  7    rodenas.org/blog/2007/08/27/erlang-ri?     —     1
  8    intertwingly.net/blog/2007/08/14/Long?     —     1
  9              ozone.wordpress.com              —     1
 10    programming.reddit.com/search?q=erlan?     —     1
 11 
 12            ==== Top Posts (today) ====
 13           Processing XML in Erlang            —    22
 14   Erlang vs. Stackless python: a first ben    —    18
 15   Python: file find, grep and in-line repl    —     4
 16   Python decorator mini-study (part 1 of 3    —     2
 17   Code refactoring with python's functoo      —     2
 18   Python: find files using Unix shell-styl    —     2
 19   Determine order of execution by (re-)seq    —     2
 20            A first look at Groovy             —     1
 21   Python decorator mini-study (part 2 of 3    —     1
 22    Turn on line numbers while searching in    —     1
 23 
 24       ==== Search Engine Terms (today) ====
 25               erlang benchmark                —     3
 26              stackless vs erlang              —     2
 27          python decorators argument           —     2
 28       source code of an execution path        —     2
 29           python functools partial            —     2
 30                  python grep                  —     2
 31          python string replace 2.4.4          —     1
 32               erlang vs C speed               —     1
 33     erlang command line arguments getopt      —     1
 34    Python + parsing command line arguments    —     1
 35 
 36              ==== Clicks (today) ====
 37         hpcwire.com/hpc/1295541.html          —     2
 38    pragmaticprogrammer.com/titles/jaerla?     —     1
 39    hrnjad.net/src/6/scriptutil.py.html#f?     —     1
 40 
 41             => Views today: 68 <=

Conclusion

I am a total Ruby beginner but have a lot of experience with Python and Perl. It took approximately 2 hours (and frequent look-ups in the pickaxe book) to write the tool above and I enjoyed it :-)

Being the kind of person that stays away from all things over-hyped I ignored Ruby for the last two years or so but I have to say it’s a cool language after all.

Click here to download the code.

Processing XML in Erlang

August 21, 2007 by muharem

Introduction

This is my second stab at Erlang (see the ring benchmark article for my first take on it). This time around I wanted to get a sense of how well Erlang and its libraries support more mundane tasks like e.g. XML parsing.

I am not a big fan of XML but it is the lingua franca of the web and any language that aspires to become “mainstream” has to support it in an efficient manner.

In order to get a feeling for how well Erlang is doing in this respect I am going to repeat my recent XML processing experiments with Groovy but this time using Erlang.

Example

I’ll be doing some basic processing of RSS files. For a full example of what these look like see e.g. the MacBreak’s weekly RSS file. Here’s an excerpt (abridged for the sake of clarity):

  1 <?xml version=“1.0″ encoding=“utf-8″?>
  2 <rss version=“2.0″>
  3   <channel>
  4     <title>MacBreak Weekly</title>
  5     <item>
  6       <title>MacBreak Weekly 53: Bill In A Box</title>
  7       <link>http://www.podtrac.com/twit.cachefly.net/MBW-053.mp3</link>
  8       <pubDate>Wed, 15 Aug 2007 12:58:11 -0700</pubDate>
  9       <enclosure url=http://www.podtrac.com/twit.cachefly.net/MBW-053.mp3 />
 10     </item>
 11   </channel>
 12 </rss>

Each of the potentially many <item> tags keeps the data pertaining to a single audiocast episode. What we want to extract is:

  • the audiocast episode title (<title> tag)
  • the publication date (<pubDate> tag)
  • the URL pointing to the MP3 file (<link> tag)

Depending on the publisher the format of the RSS file may vary slightly. The publication date is e.g. sometimes buried in a <dc:date> tag.
Likewise, the MP3 URL is sometimes not contained in a <link> tag but in the url attribute of the <enclosure> tag.

The RSS files I will be using for testing are as follows:

 1 bbox33:audiocasts $ pwd
 2 /Users/mhr/dl/audio/audiocasts
 3 bbox33:audiocasts $ find . -type f -name \*.rss
 4 ./metadata/Cc_zwei.rss
 5 ./metadata/Elrep.rss
 6 ./metadata/Security_now_.rss
 7 ./metadata/Technometria.rss
 8 ./metadata/This_week_in_tech.rss
 9 ./metadata/Windows_weekly.rss

Opening remarks

Unfortunately, the documentation of Erlang’s XML parsing library is pretty scant and — apart from the xmerl User’s Guide — I could not find any tutorials on xmerl on the web.

Again, a language aspiring to widespread adoption should have more material covering these kinds of basics.

There were a few choices as to how to go about the parsing business:

I chose the first approach because it was better documented. Only after finishing the first cut of the parsing code did I find xmerl_xpath usage examples on a mailing list and played with it.

The code

The Erlang code that finds the RSS files listed above and parses them is as follows:

  1 -module(xml2).
  2 -export([main/1]).
  3 -include_lib(“xmerl/include/xmerl.hrl”).
  4 
  5 parseAll(D) ->
  6     % find all RSS files underneath D
  7     FL = filelib:fold_files(D, “.+.rss$”, true, fun(F, L) -> [F|L] end, []),
  8     [ parse(F) || F <- FL ].
  9 
 10 parse(FName) ->
 11     % parses a single RSS file
 12     {R,_} = xmerl_scan:file(FName),
 13     % extract episode titles, publication dates and MP3 URLs
 14     L = lists:reverse(extract(R, [])),
 15     % print channel title and data for first two episodes
 16     io:format(~n>> ~p~n, [element(1,lists:split(3,L))]),
 17     L.
 18 
 19 % handle ‘xmlElement’ tags
 20 extract(R, L) when is_record(R, xmlElement) ->
 21     case R#xmlElement.name of
 22         enclosure ->
 23             if element(1, hd(R#xmlElement.parents)) == item ->
 24                     FFunc = fun(X) -> X#xmlAttribute.name == url end,
 25                     U = hd(lists:filter(FFunc, R#xmlElement.attributes)),
 26                     [ {url, U#xmlAttribute.value} | L ];
 27                 true -> L
 28             end;
 29         channel ->
 30             lists:foldl(fun extract/2, L, R#xmlElement.content);
 31         item ->
 32             ItemData = lists:foldl(fun extract/2, [], R#xmlElement.content),
 33             [ ItemData | L ];
 34         _ -> % for any other XML elements, simply iterate over children
 35             lists:foldl(fun extract/2, L, R#xmlElement.content)
 36     end;
 37 
 38 extract(#xmlText{parents=[{title,_},{channel,2},_], value=V}, L) ->
 39     [{channel, V}|L]; % extract channel/audiocast title
 40 
 41 extract(#xmlText{parents=[{title,_},{item,_},_,_], value=V}, L) ->
 42     [{title, V}|L]; % extract episode title
 43 
 44 extract(#xmlText{parents=[{link,_},{item,_},_,_], value=V}, L) ->
 45     [{link, V}|L]; % extract episode link
 46 
 47 extract(#xmlText{parents=[{pubDate,_},{item,_},_,_], value=V}, L) ->
 48     [{pubDate, V}|L]; % extract episode publication date (’pubDate’ tag)
 49 
 50 extract(#xmlText{parents=[{'dc:date',_},{item,_},_,_], value=V}, L) ->
 51     [{pubDate, V}|L]; % extract episode publication date (’dc:date’ tag)
 52 
 53 extract(#xmlText{}, L) -> L.  % ignore any other text data
 54 
 55 % ‘main’ function (invoked from shell, receives command line arguments)
 56 main(A) ->
 57     D = atom_to_list(hd(A)),
 58     parseAll(D).

Conclusion

Erlang’s filelib:fold_files() function is cool and a good example of how easy things should be.

On the other hand:

  • it took quite a bit of time and effort to write the code above (perhaps due to my lack of experience in functional programming) and it was not fun :-(
  • beauty is in the eye of the beholder as they say and the promise of being able to write clean and attractive code is a major reason to pick up a new programming language. Again, maybe it’s just me but I did not find the resulting Erlang code to be particularly attractive.
  • the XML parser chokes reproducibly on XML files with non-ASCII character sets (try the code e.g. with the following RSS file (containing german characters))
  • The XPath implementation appears to be incomplete: I could not use the | operator in an XPath expression to select several paths for example.

Anyway, that’s just my $ 0.02 on XML parsing with Erlang. I am not an expert by any stretch of the imagination so feel free to point to anything I may have missed.

A first look at Groovy

August 15, 2007 by muharem

Introduction

Recently I started playing with Groovy, a dynamic language that — according to the first chapter of the Groovy in Action book — is Python-inspired.

One of the reasons why I find Groovy attractive is that it can be compiled to Java byte code (despite being a dynamic language) i.e. you gain access to all the Java libraries and the capability to deploy on the Java platform without actually having to write Java code.

Groovy is even supported by the Spring Framework. So, if you’re “forced” to work in a Java project in order to make a living but would rather be using a dynamic language then Groovy is definitely worth a look.

Example

Some time ago I wrote a small tool (using Python) that manages my audiocasts. Based on the RSS feeds I am interested in it synchronises

  1. the MP3 files on my MP3 player with the ones available on the local hard disk of my computer
  2. the MP3 files on my computer (local hard disk) with the ones available on the web

It is also cognizant of the date of an audiocast and will only consider MP3 files that were published on a particular channel N days ago.

I thought building a similar tool in Groovy would be a nice exercise and a good way to get to know the language.

I started playing with the code that parses the RSS (XML) files first and was pleasantly surprised how quickly I was able to get my hands on the data required.

For an example of how an RSS file looks like see e.g. the MacBreak’s weekly RSS file. In essence the top level tag is <channel> with a number of embedded <item> tags. Each item is an audiocast and the data I needed for my purpose is as follows:

  • audiocast episode title (<title> tag)
  • publication date (<pubDate> tag)
  • the URL pointing to the MP3 file (<link> tag)

Depending on the publisher the format of the RSS file may vary slightly. The publication date is e.g. sometimes buried in a <dc:date> tag (where dc is an XML name space pointing to http://purl.org/dc/elements/1.1/).

Likewise, the MP3 file URL is sometimes not contained in a <link> tag but in the url attribute of the <enclosure> tag.

The RSS files on my system are here:

 1 bbox33:audiocasts $ pwd
 2 /Users/mhr/dl/audio/audiocasts
 3 bbox33:audiocasts $ find . -type f -name \\*.rss
 4 ./metadata/Cc_zwei.rss
 5 ./metadata/Elrep.rss
 6 ./metadata/Security_now_.rss
 7 ./metadata/Technometria.rss
 8 ./metadata/This_week_in_tech.rss
 9 ./metadata/Windows_weekly.rss

The code

The Groovy code that finds the RSS files listed above and parses them is as follows:

  1 def audioDir = new File("/Users/mhr/dl/audio/audiocasts")
  2 
  3 // initialise RSS files list
  4 files = []
  5 // find RSS files underneath 'audioDir'
  6 audioDir.eachFileRecurse { if (it =~ /.*\.(xml|rss)$/) { files << it } }
  7 
  8 println("\\\n——– RSS files ——–")
  9 println files.join('\\\n')
 10 
 11 // iterate over RSS files found
 12 for (rssf in files) {
 13     println("\\\n——– $rssf ——–")
 14     // parse the RSS file
 15     def d = new XmlSlurper().parse(rssf)
 16     d.declareNamespace(dc:"http://purl.org/dc/elements/1.1/")
 17 
 18     // iterate over item tags in RSS file (take only the first two)
 19     d.channel.item[0..1].each {
 20         println "==> ${it.title}"
 21         if (it.pubDate.toString().trim()) {
 22             println "pubDate: ${it.pubDate}"
 23         } else {
 24             println "dc:date: ${it.'dc:date'}"
 25         }
 26         if (it.link.toString().trim()) {
 27             println "link: ${it.link}"
 28         } else {
 29             println "url: ${it.enclosure.@url}"
 30         }
 31     }
 32 }

Here’s the output generated by the code above:

  1 bbox33:groovy $ groovy xml.groovy
  2 
  3 ——– RSS files ——–
  4 /Users/mhr/dl/audio/audiocasts/metadata/Cc_zwei.rss
  5 /Users/mhr/dl/audio/audiocasts/metadata/Elrep.rss
  6 /Users/mhr/dl/audio/audiocasts/metadata/Security_now_.rss
  7 /Users/mhr/dl/audio/audiocasts/metadata/Technometria.rss
  8 /Users/mhr/dl/audio/audiocasts/metadata/This_week_in_tech.rss
  9 /Users/mhr/dl/audio/audiocasts/metadata/Windows_weekly.rss
 10 
 11 ——– /Users/mhr/dl/audio/audiocasts/metadata/Cc_zwei.rss ——–
 12 ==> CC2 - 62. Folge
 13 pubDate: Mon, 13 Aug 2007 20:00:00 +0200 +0200
 14 url: http://www.media01-live.de/CC-Zwei-62.mp3
 15 ==> CC2 - 61. Folge
 16 pubDate: Mon, 06 Aug 2007 20:00:00 +0200 +0200
 17 url: http://www.media01-live.de/CC-Zwei-61.mp3
 18 
 19 ——– /Users/mhr/dl/audio/audiocasts/metadata/Elrep.rss ——–
 20 ==> 36: Lutz Schmitt ^_ber Machinima
 21 dc:date: 2007-08-05T21:30:00+01:00
 22 link: http://www.elektrischer-reporter.de/index.php/site/film/48/
 23 ==> 35: Peter Schaar ^_ber Vorratsdatenspeicherung und Online-Durchsuchungen
 24 dc:date: 2007-07-22T21:30:00+01:00
 25 link: http://www.elektrischer-reporter.de/index.php/site/film/47/
 26 
 27 ——– /Users/mhr/dl/audio/audiocasts/metadata/Security_now_.rss ——–
 28 ==> Security Now 104: SteveOs Questions, Your Answers 22 - sponsored by Astaro Corp.
 29 pubDate: Thu, 09 Aug 2007 08:22:49 -0700
 30 link: http://www.podtrac.com/pts/redirect.mp3/aolradio.podcast.aol.com/sn/SN-104.mp3
 31 ==> Security Now 103: Paypal Security Key - sponsored by Astaro Corp.
 32 pubDate: Thu, 02 Aug 2007 15:17:58 -0700
 33 link: http://www.podtrac.com/pts/redirect.mp3/aolradio.podcast.aol.com/sn/SN-103.mp3
 34 
 35 ——– /Users/mhr/dl/audio/audiocasts/metadata/Technometria.rss ——–
 36 ==> Scott Lemon, Ben Galbraith - Technometria
 37 pubDate: Tue, 14 Aug 2007 00:00:00 CDT
 38 link: http://www.itconversations.com/shows/detail1892.html
 39 ==> Drew Major - Technometria
 40 pubDate: Tue, 7 Aug 2007 00:00:00 CDT
 41 link: http://www.itconversations.com/shows/detail1886.html
 42 
 43 ——– /Users/mhr/dl/audio/audiocasts/metadata/This_week_in_tech.rss ——–
 44 ==> TWiT 109: The Numinous From The Quotidian
 45 pubDate: Sun, 12 Aug 2007 22:39:47 -0700
 46 link: http://www.podtrac.com/pts/redirect.mp3/aolradio.podcast.aol.com/twit/TWiT0109H.mp3
 47 ==> TWiT 108: The Crash of 2007
 48 pubDate: Sun, 05 Aug 2007 20:48:59 -0700
 49 link: http://www.podtrac.com/pts/redirect.mp3/aolradio.podcast.aol.com/twit/TWiT0108H.mp3
 50 
 51 ——– /Users/mhr/dl/audio/audiocasts/metadata/Windows_weekly.rss ——–
 52 ==> Windows Weekly 32: Neener Neener Neener
 53 pubDate: Fri, 27 Jul 2007 18:49:31 -0700
 54 link: http://www.podtrac.com/pts/redirect.mp3/twit.cachefly.net/WW-032.mp3
 55 ==> Windows Weekly 31: Computing In The Clouds
 56 pubDate: Fri, 20 Jul 2007 08:57:52 -0700
 57 link: http://www.podtrac.com/pts/redirect.mp3/twit.cachefly.net/WW-031.mp3

Conclusion

I have not written any substantial code in Groovy yet but my first impressions of the language are favourable. Groovy should be an interesting addition to your arsenal (particularly if you know the Java libraries fairly well).

Renovating Jython: a strategic imperative

August 9, 2007 by muharem

I always found Java boring, .. never really understood how people could get all fired up about it :-)

When it comes to programming, however, Java is the undisputed present day “king of the hill”. It has a huge community and a lot of cool tools (e.g. Lucene and ANTLR just to mention a few) that I would love to use in my own projects.

Problem is: I don’t fancy the prospect of writing Java code. Having used Python almost exclusively for the last two years I came to appreciate dynamic languages and loathe going back to statically typed programming.

Maybe Groovy can fill the gap: it appears to be dynamic in nature while allowing easy access to all the Java goodies. It has some potential and I am still exploring it.

All things considered Jython would clearly be the best contender if it were not for the fact that it is quite outdated (when compared to CPython).

My sense is that a lot of developers and companies would embrace Jython (and thus Python) without any hesitation if provided with a modern and more up to date implementation. The Jython renovation effort is thus a no-brainer and a strategic imperative for the entire Python community.

We should not allow others to outrun us in this important “market segment”. Let’s get our act together and lend the Jython folks a helping hand!

Erlang vs. Stackless python: a first benchmark

July 31, 2007 by muharem

Introduction

I obtained a copy of Joe Armstrong’s Erlang book recently. After reading through the first half I have found it to be a very commendable work, making a strong case for message passing concurrency — a different approach to building large scale and highly concurrent systems.

After finishing chapter 8, I came across the following exercise:

Write a ring benchmark. Create N processes in a ring. Send a message round the ring M times so that a total of N*M messages get sent. Time how long this takes [..]

Write a similar program in some other programming language you are familiar with. Compare the results. Write a blog, and publish the results on the internet!

I liked it, it looked like a great opportunity to get my feet wet with Erlang and message passing concurrency :-)

So, here we are, that’s precisely what this article is about.

The “other programming language” is of course Python, in this particular case Stackless Python featuring tasklets and channels to counter Erlang’s concurrency machinery.

The source code

Both the Erlang and the Stackless python code set up a ring of N processes so that each of the latter knows its successor. The resulting “ring” has the shape of an open necklace:

  • there is a first and a last bead in the chain
  • each message
    • is sent to the first bead
    • is relayed by beads number [1 .. N-1] to their immediate neighbour
    • stops at bead number N

The Erlang program

This is my very first Erlang program, so please bear with me. If there’s anything that could be done better, feel free to comment on it (in a polite way).

  1 % sets up a ring of N processes, a message is passed M times around the ring.
  2 -module(oringb).
  3 -export([run_benchmark/2, loop/2, mloop/1, main/1]).
  4 % builds a ring of N processes and send a message around the ring M times.
  5 run_benchmark(N, M) ->
  6     io:format(“>> Erlang R11B-5 here (N=~w, M=~w)!~n~n, [N, M]),
  7     % spawn the other N-1 processes needed for the ring
  8     LastP = lists:foldl(fun(S, Pid) -> spawn(?MODULE, loop, [N-S, Pid]) end,
  9                         self(), lists:seq(1, N-1)),
 10     spawn(fun() -> [ LastP ! R || R <- lists:reverse(lists:seq(0, M-1)) ] end),
 11     mloop(N).

The function above sets up the ring. The lists:foldl construct is equivalent to Python’s built-in reduce() function.

The process spawned on line 10 sends the M messages around the ring. The messages are just integers taken from the [M-1 .. 0] sequence.

The main process acts as process number N and invokes the function mloop() thus entering the appropriate receive loop on line 11.

 12 % receive loop executed by the first N-1 processes in the ring
 13 %   - ‘S’ is the sequence number allowing us to keep track of a
 14 %     process’ position in the ring
 15 %   - ‘NextP’ is the next process in the ring
 16 %   - ‘R’ is the round/message number, zero indicates that the processes
 17 %      should terminate;
 18 loop(S, NextP) ->
 19     receive
 20         % the message number is above zero => forward the message to the next
 21         % process in the ring
 22         R when R > 0 -> NextP ! R,
 23             io:format(“: Proc: ~8w, Seq#: ~w, Msg#: ~w ..~n, [self(), S, R]),
 24             loop(S, NextP);
 25         % the message number is zero => forward message and terminate
 26         R when R =:= 0 -> NextP ! R,
 27             io:format(“* Proc: ~8w, Seq#: ~w, Msg#: terminate!~n, [self(), S]);
 28         % error: the message number is below zero => raise exception
 29         R when R < 0 ->
 30             erlang:error({“internal error”, “invalid message number”})
 31     end.

The loop() function is executed by the first N-1 processes in the ring. These know their position in the ring (S) as well as their immediate neighbour (NextP) to which they forward all messages received.

The messages (R) are just integers. Values above zero keep the ring going, a zero value (the last message) results in the termination of the ring.

 32 % receive loop executed by the last (Nth) process;
 33 % it won’t forward any messages
 34 mloop(S) ->
 35     receive
 36         R when R > 0 ->
 37             io:format(“> Proc: ~8w, Seq#: ~w, Msg#: ~w ..~n, [self(), S, R]),
 38             mloop(S);
 39         0 ->
 40             io:format(“@ Proc: ~8w, Seq#: ~w, ring terminated.~n, [self(), S])
 41     end.

The mloop() function acts as the receive loop of the process number N. It “terminates” the messages that go around the ring. Its main purpose is to produce diagnostic output.

 42 % ‘main’ function allowing the invocation from the shell as well as the
 43 % passing of command line arguments
 44 main(A) ->
 45     Args = [ list_to_integer(Litem) ||
 46              Litem <- [ atom_to_list(Atom) || Atom <- A ]],
 47     [N, M] = Args,
 48     run_benchmark(N, M).

What follows is an example invocation of the oringb module (a ring of 4 processes with 3 messages going around):

  1 bbox33:ring $ erl -noshell -s oringb main 4 3 -s init stop
  2 >> Erlang R11B-5 here (N=4, M=3)!
  3 
  4 : Proc: <0.29.0>, Seq#: 1, Msg#: 2 ..
  5 : Proc: <0.28.0>, Seq#: 2, Msg#: 2 ..
  6 : Proc: <0.27.0>, Seq#: 3, Msg#: 2 ..
  7 : Proc: <0.29.0>, Seq#: 1, Msg#: 1 ..
  8 : Proc: <0.28.0>, Seq#: 2, Msg#: 1 ..
  9 > Proc:  <0.1.0>, Seq#: 4, Msg#: 2 ..
 10 : Proc: <0.27.0>, Seq#: 3, Msg#: 1 ..
 11 * Proc: <0.29.0>, Seq#: 1, Msg#: terminate!
 12 * Proc: <0.28.0>, Seq#: 2, Msg#: terminate!
 13 > Proc:  <0.1.0>, Seq#: 4, Msg#: 1 ..
 14 * Proc: <0.27.0>, Seq#: 3, Msg#: terminate!
 15 @ Proc:  <0.1.0>, Seq#: 4, ring terminated.

Lines 4-6,9 are generated by the first message, lines 7,8,10 and 13 show the second message going around the ring — lines 11,12,14 and 15 are the result of the zero-valued (terminate ring!) message.

The Stackless Python program

I have made an attempt to keep the python code as similar as possible to its Erlang counterpart. The run_benchmark() function below builds the ring of N tasklets (lines 9-1 8) and sends the M messages around it (lines 19-21).

One minor difference to note: the main process/tasklet here is not the last process in the ring but sends the messages (after spawning the last process on lines 16-18).

  1 #!/Library/Frameworks/Python.framework/Versions/2.5/bin/python
  2 # encoding: utf-8
  3 import sys
  4 import stackless as SL
  5 
  6 def run_benchmark(n, m):
  7     print(”>> Python 2.5.1, stackless 3.1b3 here (N=%d, M=%d)!\\n” % (n, m))
  8     firstP = cin = SL.channel()
  9     for s in xrange(1, n):
 10         seqn = s
 11         cout = SL.channel()
 12         # print(”*> s = %d” % (seqn, ))
 13         t = SL.tasklet(loop)(seqn, cin, cout)
 14         cin = cout
 15     else:
 16         seqn = s+1
 17         # print(”$> s = %d” % (seqn, ))
 18         t = SL.tasklet(mloop)(seqn, cin)
 19     for r in xrange(m-1, -1, -1):
 20         # print(”+ sending Msg#  %d” % r)
 21         firstP.send(r)
 22     SL.schedule()

The loop() function is executed by the first N-1 processes in the ring as in the Erlang code. Apart from relaying the messages received, it mainly produces diagnostic output.

Each tasklet knows its position in the ring (S) and is given an input channel from which to read its messages (cin) as well as an output channel to which to write all the messages received (cout).

 23 def loop(s, cin, cout):
 24     while True:
 25         r = cin.receive()
 26         cout.send(r)
 27         if r > 0:
 28             print(”: Proc: <%s>, Seq#: %s, Msg#: %s ..” % (pid(), s, r))
 29         else:
 30             print(”* Proc: <%s>, Seq#: %s, Msg#: terminate!” % (pid(), s))
 31             break

The last tasklet runs the mloop() function which mainly produces diagnostic output.

 32 def mloop(s, cin):
 33     while True:
 34         r = cin.receive()
 35         if r > 0:
 36             print(”> Proc: <%s>, Seq#: %s, Msg#: %s ..” % (pid(), s, r))
 37         else:
 38             print(”@ Proc: <%s>, Seq#: %s, ring terminated.” % (pid(), s))
 39             break
 40 
 41 def pid(): return repr(SL.getcurrent()).split()[-1][2:-1]
 42 
 43 if __name__ == ‘__main__‘:
 44     run_benchmark(int(sys.argv[1]), int(sys.argv[2]))

The following is an example invocation of the python code (a chain of 4 processes with 3 messages going around (like above)):

  1 bbox33:ring $ ./oringb.py 4 3
  2 >> Python 2.5.1, stackless 3.1b3 here (N=4, M=3)!
  3 
  4 > Proc: <6b870>, Seq#: 4, Msg#: 2 ..
  5 : Proc: <6b730>, Seq#: 1, Msg#: 2 ..
  6 : Proc: <6b770>, Seq#: 2, Msg#: 2 ..
  7 : Proc: <6b7f0>, Seq#: 3, Msg#: 2 ..
  8 > Proc: <6b870>, Seq#: 4, Msg#: 1 ..
  9 : Proc: <6b730>, Seq#: 1, Msg#: 1 ..
 10 : Proc: <6b770>, Seq#: 2, Msg#: 1 ..
 11 : Proc: <6b7f0>, Seq#: 3, Msg#: 1 ..
 12 @ Proc: <6b870>, Seq#: 4, ring terminated.
 13 * Proc: <6b730>, Seq#: 1, Msg#: terminate!
 14 * Proc: <6b770>, Seq#: 2, Msg#: terminate!
 15 * Proc: <6b7f0>, Seq#: 3, Msg#: terminate!

Line 4-7 show the first message going around, lines 8-11 stem from the second message and lines 12-1