Text filtering with erlang

Introduction

After a long break I picked up the Erlang book again and my appetite for writing some erlang code was soon kindled.

A small Python component I produced at work seemed like a good candidate for my (sequential) erlang exercises. It is a fairly simple component that removes user/password data embedded in URLs.

Just so you know where I am coming from:

  • my main/favourite programming language is Python
  • my exercises are mainly about sequential, non-distributed and non-telecoms-related problems whereas erlang’s main strength and appeal lies in the area of parallel/distributed telecoms/networking systems
  • I have played with erlang a little bit before (ring benchmark, XML parsing) and liked it in general although IMHO it lacks severely when it comes to the availability and quality of standard library components.

Now that my particular set of preconceptions is clear and in the open, let’s look at the stuff below :-)

File processing with erlang’s regexp module

The initial implementation of the URL filter in erlang used its regexp library.

  1 -module(regex).
  2 -export([main/1]).
  3 
  4 isalphanum(C) when C > 47, C < 58; C > 64, C < 91; C > 96, C < 123 -> true;
  5 isalphanum(_) -> false.
  6 
  7 %% Generate a temporary file name of length N
  8 genname(0, L) -> L;
  9 genname(N, L) ->
 10     R = random:uniform(123),
 11     case isalphanum(R) of
 12         true -> genname(N-1, [R|L]);
 13         false -> genname(N, L)
 14     end.
 15 
 16 %% Returns a randomly generated temporary file path where the basename is
 17 %% of length N
 18 mktemppath(Prefix, N) -> Prefix ++ "/" ++ genname(N, []).
 19 

Please note how I had to implement functionality absent from the standard library above.

 20 %% Removes passwords embedded in URLs from a log file.
 21 scrub_file(Tmpdir, F) ->
 22     %% make a temporary directory if it does not exist yet.
 23     case file:make_dir(Tmpdir) of
 24         ok -> ok;
 25         {error,eexist} -> ok;
 26         _ -> exit({error, failed_to_make_tmpdir})
 27     end,
 28 
 29     %% Move the original file out of the way.
 30     T = mktemppath(Tmpdir, 16),
 31     case file:rename(F, T) of
 32         ok -> ok;
 33         _ -> exit({error, failed_to_move_file})
 34     end,
 35 
 36     %% Now open it for reading.
 37     {_, In} = file:open([T], read),
 38     %% Open the original path for writing.
 39     {_, Out} = file:open([F], write),
 40 
 41     %% Call the function that will scrub the lines.
 42     scrub_lines(In, Out),
 43 
 44     %% Close the file handles and return the path to the original file.
 45     file:close(Out),
 46     file:close(In),
 47     T.
 48 

The code that scrubs the URLs is below, the scrub_lines() function is tail recursive.

 49 %% This is where the log file is actually read linewise and where
 50 %% the scrubbing function is invoked for lines that contain URLs.
 51 scrub_lines(In, Out) ->
 52     L = io:get_line(In, ''),
 53     case L of
 54         eof -> ok;
 55         _ ->
 56             %% Does the line contain URLs?
 57             case string:str(L, "://") of
 58                 0 -> io:format(Out, "~s", [L]);
 59                 _ ->
 60                     case regexp:gsub(L, "://[^:]+:[^@]+@", "://") of
 61                         {ok, S, _} -> io:format(Out, "~s", [S]);
 62                         {R, S, _} -> io:format("Failed: {~p,~p}", [R,S])
 63                     end
 64             end,
 65             %% Continue with next line.
 66             scrub_lines(In, Out)
 67     end.
 68 
 69 %% Main function.
 70 main([A]) ->
 71     {A1,A2,A3} = now(),
 72     random:seed(A1, A2, A3),
 73 
 74     %% A single argument (the name of the file to be scrubbed) is expected.
 75     F = atom_to_list(A),
 76     T = scrub_file("tmp", F),
 77 
 78     %% The scrubbed file content will be written to a new file that's
 79     %% in the place of the original file. Where was the latter moved to?
 80     io:format("~s~n", [T]),
 81 
 82     init:stop().

The cursory benchmarks performed (on log files of varying size) using the python and the erlang code confirmed other people’s experiences with erlang’s regex performance (but see also this interesting “rebuttal”).

Log file size Python times Erlang times
1 MB 0m0.230s 0m1.896s
10 MB 0m1.510s 0m8.766s
100 MB 0m14.793s 1m17.662s
1 GB 2m55.012s 13m54.588s

The do-it-yourself construction

Curious to learn whether the performance can be improved by abstaining from regular expressions I came up with an alternative implementation that does not use regexp.

As you can see below the do-it-yourself construction is indeed performing slightly better at the expense of being very specialized and requiring 60% more code.

Log file size Python times Erlang regexp Erlang do-it-yourself
1 MB 0m0.230s 0m1.896s 0m1.969s
10 MB 0m1.510s 0m8.766s 0m8.459s
100 MB 0m14.793s 1m17.662s 1m12.448s
1 GB 2m55.012s 13m54.588s 13m3.360s

In conclusion

Every couple of months or so I develop a euphoria towards erlang which is consistently dampened by using the language to tackle problems for which the language admittedly was not designed in first place.

I guess most people start using a language for simple programming exercises first as opposed to building something like a Jabber/XMPP instant messaging server straightaway.

I hate to repeat myself but improving the standard library (by adding common functionality and making sure it performs decently) would do a lot to attract fresh talent to the erlang community and I hear that a certain rate of influx of “fresh blood” is a necessary prerequisite for success.

Ah, and no, you were not supposed to grok the sentence above unless you read it three times :-)

Processing XML in Erlang

Introduction

This is my second stab at Erlang (see the ring benchmark article for my first take on it). This time around I wanted to get a sense of how well Erlang and its libraries support more mundane tasks like e.g. XML parsing.

I am not a big fan of XML but it is the lingua franca of the web and any language that aspires to become “mainstream” has to support it in an efficient manner.

In order to get a feeling for how well Erlang is doing in this respect I am going to repeat my recent XML processing experiments with Groovy but this time using Erlang.

Example

I’ll be doing some basic processing of RSS files. For a full example of what these look like see e.g. the MacBreak’s weekly RSS file. Here’s an excerpt (abridged for the sake of clarity):

  1 <?xml version="1.0" encoding="utf-8"?>
  2 <rss version="2.0">
  3   <channel>
  4     <title>MacBreak Weekly</title>
  5     <item>
  6       <title>MacBreak Weekly 53: Bill In A Box</title>
  7       <link>http://www.podtrac.com/twit.cachefly.net/MBW-053.mp3</link>
  8       <pubDate>Wed, 15 Aug 2007 12:58:11 -0700</pubDate>
  9       <enclosure url="http://www.podtrac.com/twit.cachefly.net/MBW-053.mp3" />
 10     </item>
 11   </channel>
 12 </rss>

Each of the potentially many <item> tags keeps the data pertaining to a single audiocast episode. What we want to extract is:

  • the audiocast episode title (<title> tag)
  • the publication date (<pubDate> tag)
  • the URL pointing to the MP3 file (<link> tag)

Depending on the publisher the format of the RSS file may vary slightly. The publication date is e.g. sometimes buried in a <dc:date> tag.
Likewise, the MP3 URL is sometimes not contained in a <link> tag but in the url attribute of the <enclosure> tag.

The RSS files I will be using for testing are as follows:

 1 bbox33:audiocasts $ pwd
 2 /Users/mhr/dl/audio/audiocasts
 3 bbox33:audiocasts $ find . -type f -name \*.rss
 4 ./metadata/Cc_zwei.rss
 5 ./metadata/Elrep.rss
 6 ./metadata/Security_now_.rss
 7 ./metadata/Technometria.rss
 8 ./metadata/This_week_in_tech.rss
 9 ./metadata/Windows_weekly.rss

Opening remarks

Unfortunately, the documentation of Erlang’s XML parsing library is pretty scant and — apart from the xmerl User’s Guide — I could not find any tutorials on xmerl on the web.

Again, a language aspiring to widespread adoption should have more material covering these kinds of basics.

There were a few choices as to how to go about the parsing business:

I chose the first approach because it was better documented. Only after finishing the first cut of the parsing code did I find xmerl_xpath usage examples on a mailing list and played with it.

The code

The Erlang code that finds the RSS files listed above and parses them is as follows:

  1 -module(xml2).
  2 -export([main/1]).
  3 -include_lib("xmerl/include/xmerl.hrl").
  4 
  5 parseAll(D) ->
  6     % find all RSS files underneath D
  7     FL = filelib:fold_files(D, ".+.rss$", true, fun(F, L) -> [F|L] end, []),
  8     [ parse(F) || F <- FL ].
  9 
 10 parse(FName) ->
 11     % parses a single RSS file
 12     {R,_} = xmerl_scan:file(FName),
 13     % extract episode titles, publication dates and MP3 URLs
 14     L = lists:reverse(extract(R, [])),
 15     % print channel title and data for first two episodes
 16     io:format("~n>> ~p~n", [element(1,lists:split(3,L))]),
 17     L.
 18 
 19 % handle 'xmlElement' tags
 20 extract(R, L) when is_record(R, xmlElement) ->
 21     case R#xmlElement.name of
 22         enclosure ->
 23             if element(1, hd(R#xmlElement.parents)) == item ->
 24                     FFunc = fun(X) -> X#xmlAttribute.name == url end,
 25                     U = hd(lists:filter(FFunc, R#xmlElement.attributes)),
 26                     [ {url, U#xmlAttribute.value} | L ];
 27                 true -> L
 28             end;
 29         channel ->
 30             lists:foldl(fun extract/2, L, R#xmlElement.content);
 31         item ->
 32             ItemData = lists:foldl(fun extract/2, [], R#xmlElement.content),
 33             [ ItemData | L ];
 34         _ -> % for any other XML elements, simply iterate over children
 35             lists:foldl(fun extract/2, L, R#xmlElement.content)
 36     end;
 37 
 38 extract(#xmlText{parents=[{title,_},{channel,2},_], value=V}, L) ->
 39     [{channel, V}|L]; % extract channel/audiocast title
 40 
 41 extract(#xmlText{parents=[{title,_},{item,_},_,_], value=V}, L) ->
 42     [{title, V}|L]; % extract episode title
 43 
 44 extract(#xmlText{parents=[{link,_},{item,_},_,_], value=V}, L) ->
 45     [{link, V}|L]; % extract episode link
 46 
 47 extract(#xmlText{parents=[{pubDate,_},{item,_},_,_], value=V}, L) ->
 48     [{pubDate, V}|L]; % extract episode publication date ('pubDate' tag)
 49 
 50 extract(#xmlText{parents=[{'dc:date',_},{item,_},_,_], value=V}, L) ->
 51     [{pubDate, V}|L]; % extract episode publication date ('dc:date' tag)
 52 
 53 extract(#xmlText{}, L) -> L.  % ignore any other text data
 54 
 55 % 'main' function (invoked from shell, receives command line arguments)
 56 main(A) ->
 57     D = atom_to_list(hd(A)),
 58     parseAll(D).

Conclusion

Erlang’s filelib:fold_files() function is cool and a good example of how easy things should be.

On the other hand:

  • it took quite a bit of time and effort to write the code above (perhaps due to my lack of experience in functional programming) and it was not fun :-(
  • beauty is in the eye of the beholder as they say and the promise of being able to write clean and attractive code is a major reason to pick up a new programming language. Again, maybe it’s just me but I did not find the resulting Erlang code to be particularly attractive.
  • the XML parser chokes reproducibly on XML files with non-ASCII character sets (try the code e.g. with the following RSS file (containing german characters))
  • The XPath implementation appears to be incomplete: I could not use the | operator in an XPath expression to select several paths for example.

Anyway, that’s just my $ 0.02 on XML parsing with Erlang. I am not an expert by any stretch of the imagination so feel free to point to anything I may have missed.

Erlang vs. Stackless python: a first benchmark

Change of Heart

Mon Aug 17 17:51:58 CEST 2009

PLEASE NOTE:

I have had more time to study erlang since writing the article below and
have come to realize that the conclusions I drew are skewed.

Erlang

  - is a language and a high-quality runtime system optimized for
    concurrent and distributed execution. The examples below do not
    really exercise any of these strengths.
  - has proven itself in quite a number of high-profile projects and
    deployments (ejabberd, CouchDb and RabbitMQ just to mention a few).
  - has a strong, vibrant and competent community that's driving the
    erlang system and improving it continuously.

For all these reasons stackless python -- although being an interesting
piece of technology in its own right -- is certainly no match for
erlang.

Please see http://pseudogreen.org/blog/erlang_vs_stackless_vs_multitask.html
for what appears to be a more even-handed comparison.

Introduction

I obtained a copy of Joe Armstrong’s Erlang book recently. After reading through the first half I have found it to be a very commendable work, making a strong case for message passing concurrency — a different approach to building large scale and highly concurrent systems.

After finishing chapter 8, I came across the following exercise:

Write a ring benchmark. Create N processes in a ring. Send a message round the ring M times so that a total of N*M messages get sent. Time how long this takes [..]

Write a similar program in some other programming language you are familiar with. Compare the results. Write a blog, and publish the results on the internet!

I liked it, it looked like a great opportunity to get my feet wet with Erlang and message passing concurrency :-)

So, here we are, that’s precisely what this article is about.

The “other programming language” is of course Python, in this particular case Stackless Python featuring tasklets and channels to counter Erlang’s concurrency machinery.

The source code

Both the Erlang and the Stackless python code set up a ring of N processes so that each of the latter knows its successor. The resulting “ring” has the shape of an open necklace:

  • there is a first and a last bead in the chain
  • each message
    • is sent to the first bead
    • is relayed by beads number [1 .. N-1] to their immediate neighbour
    • stops at bead number N

The Erlang program

This is my very first Erlang program, so please bear with me. If there’s anything that could be done better, feel free to comment on it (in a polite way).

  1 % sets up a ring of N processes, a message is passed M times around the ring.
  2 -module(oringb).
  3 -export([run_benchmark/2, loop/2, mloop/1, main/1]).
  4 % builds a ring of N processes and send a message around the ring M times.
  5 run_benchmark(N, M) ->
  6     io:format(">> Erlang R11B-5 here (N=~w, M=~w)!~n~n", [N, M]),
  7     % spawn the other N-1 processes needed for the ring
  8     LastP = lists:foldl(fun(S, Pid) -> spawn(?MODULE, loop, [N-S, Pid]) end,
  9                         self(), lists:seq(1, N-1)),
 10     spawn(fun() -> [ LastP ! R || R <- lists:reverse(lists:seq(0, M-1)) ] end),
 11     mloop(N).

The function above sets up the ring. The lists:foldl construct is equivalent to Python’s built-in reduce() function.

The process spawned on line 10 sends the M messages around the ring. The messages are just integers taken from the [M-1 .. 0] sequence.

The main process acts as process number N and invokes the function mloop() thus entering the appropriate receive loop on line 11.

 12 % receive loop executed by the first N-1 processes in the ring
 13 %   - 'S' is the sequence number allowing us to keep track of a
 14 %     process' position in the ring
 15 %   - 'NextP' is the next process in the ring
 16 %   - 'R' is the round/message number, zero indicates that the processes
 17 %      should terminate;
 18 loop(S, NextP) ->
 19     receive
 20         % the message number is above zero => forward the message to the next
 21         % process in the ring
 22         R when R > 0 -> NextP ! R,
 23             io:format(": Proc: ~8w, Seq#: ~w, Msg#: ~w ..~n", [self(), S, R]),
 24             loop(S, NextP);
 25         % the message number is zero => forward message and terminate
 26         R when R =:= 0 -> NextP ! R,
 27             io:format("* Proc: ~8w, Seq#: ~w, Msg#: terminate!~n", [self(), S]);
 28         % error: the message number is below zero => raise exception
 29         R when R < 0 ->
 30             erlang:error({"internal error", "invalid message number"})
 31     end.

The loop() function is executed by the first N-1 processes in the ring. These know their position in the ring (S) as well as their immediate neighbour (NextP) to which they forward all messages received.

The messages (R) are just integers. Values above zero keep the ring going, a zero value (the last message) results in the termination of the ring.

 32 % receive loop executed by the last (Nth) process;
 33 % it won't forward any messages
 34 mloop(S) ->
 35     receive
 36         R when R > 0 ->
 37             io:format("> Proc: ~8w, Seq#: ~w, Msg#: ~w ..~n", [self(), S, R]),
 38             mloop(S);
 39         0 ->
 40             io:format("@ Proc: ~8w, Seq#: ~w, ring terminated.~n", [self(), S])
 41     end.

The mloop() function acts as the receive loop of the process number N. It “terminates” the messages that go around the ring. Its main purpose is to produce diagnostic output.

 42 % 'main' function allowing the invocation from the shell as well as the
 43 % passing of command line arguments
 44 main(A) ->
 45     Args = [ list_to_integer(Litem) ||
 46              Litem <- [ atom_to_list(Atom) || Atom <- A ]],
 47     [N, M] = Args,
 48     run_benchmark(N, M).

What follows is an example invocation of the oringb module (a ring of 4 processes with 3 messages going around):

  1 bbox33:ring $ erl -noshell -s oringb main 4 3 -s init stop
  2 >> Erlang R11B-5 here (N=4, M=3)!
  3 
  4 : Proc: <0.29.0>, Seq#: 1, Msg#: 2 ..
  5 : Proc: <0.28.0>, Seq#: 2, Msg#: 2 ..
  6 : Proc: <0.27.0>, Seq#: 3, Msg#: 2 ..
  7 : Proc: <0.29.0>, Seq#: 1, Msg#: 1 ..
  8 : Proc: <0.28.0>, Seq#: 2, Msg#: 1 ..
  9 > Proc:  <0.1.0>, Seq#: 4, Msg#: 2 ..
 10 : Proc: <0.27.0>, Seq#: 3, Msg#: 1 ..
 11 * Proc: <0.29.0>, Seq#: 1, Msg#: terminate!
 12 * Proc: <0.28.0>, Seq#: 2, Msg#: terminate!
 13 > Proc:  <0.1.0>, Seq#: 4, Msg#: 1 ..
 14 * Proc: <0.27.0>, Seq#: 3, Msg#: terminate!
 15 @ Proc:  <0.1.0>, Seq#: 4, ring terminated.

Lines 4-6,9 are generated by the first message, lines 7,8,10 and 13 show the second message going around the ring — lines 11,12,14 and 15 are the result of the zero-valued (terminate ring!) message.

The Stackless Python program

I have made an attempt to keep the python code as similar as possible to its Erlang counterpart. The run_benchmark() function below builds the ring of N tasklets (lines 9-18) and sends the M messages around it (lines 19-21).

One minor difference to note: the main process/tasklet here is not the last process in the ring but sends the messages (after spawning the last process on lines 16-18).

  1 #!/Library/Frameworks/Python.framework/Versions/2.5/bin/python
  2 # encoding: utf-8
  3 import sys
  4 import stackless as SL
  5 
  6 def run_benchmark(n, m):
  7     print(">> Python 2.5.1, stackless 3.1b3 here (N=%d, M=%d)!\\n" % (n, m))
  8     firstP = cin = SL.channel()
  9     for s in xrange(1, n):
 10         seqn = s
 11         cout = SL.channel()
 12         # print("*> s = %d" % (seqn, ))
 13         t = SL.tasklet(loop)(seqn, cin, cout)
 14         cin = cout
 15     else:
 16         seqn = s+1
 17         # print("$> s = %d" % (seqn, ))
 18         t = SL.tasklet(mloop)(seqn, cin)
 19     for r in xrange(m-1, -1, -1):
 20         # print("+ sending Msg#  %d" % r)
 21         firstP.send(r)
 22     SL.schedule()

The loop() function is executed by the first N-1 processes in the ring as in the Erlang code. Apart from relaying the messages received, it mainly produces diagnostic output.

Each tasklet knows its position in the ring (S) and is given an input channel from which to read its messages (cin) as well as an output channel to which to write all the messages received (cout).

 23 def loop(s, cin, cout):
 24     while True:
 25         r = cin.receive()
 26         cout.send(r)
 27         if r > 0:
 28             print(": Proc: <%s>, Seq#: %s, Msg#: %s .." % (pid(), s, r))
 29         else:
 30             print("* Proc: <%s>, Seq#: %s, Msg#: terminate!" % (pid(), s))
 31             break

The last tasklet runs the mloop() function which mainly produces diagnostic output.

 32 def mloop(s, cin):
 33     while True:
 34         r = cin.receive()
 35         if r > 0:
 36             print("> Proc: <%s>, Seq#: %s, Msg#: %s .." % (pid(), s, r))
 37         else:
 38             print("@ Proc: <%s>, Seq#: %s, ring terminated." % (pid(), s))
 39             break
 40 
 41 def pid(): return repr(SL.getcurrent()).split()[-1][2:-1]
 42 
 43 if __name__ == '__main__':
 44     run_benchmark(int(sys.argv[1]), int(sys.argv[2]))

The following is an example invocation of the python code (a chain of 4 processes with 3 messages going around (like above)):

  1 bbox33:ring $ ./oringb.py 4 3
  2 >> Python 2.5.1, stackless 3.1b3 here (N=4, M=3)!
  3 
  4 > Proc: <6b870>, Seq#: 4, Msg#: 2 ..
  5 : Proc: <6b730>, Seq#: 1, Msg#: 2 ..
  6 : Proc: <6b770>, Seq#: 2, Msg#: 2 ..
  7 : Proc: <6b7f0>, Seq#: 3, Msg#: 2 ..
  8 > Proc: <6b870>, Seq#: 4, Msg#: 1 ..
  9 : Proc: <6b730>, Seq#: 1, Msg#: 1 ..
 10 : Proc: <6b770>, Seq#: 2, Msg#: 1 ..
 11 : Proc: <6b7f0>, Seq#: 3, Msg#: 1 ..
 12 @ Proc: <6b870>, Seq#: 4, ring terminated.
 13 * Proc: <6b730>, Seq#: 1, Msg#: terminate!
 14 * Proc: <6b770>, Seq#: 2, Msg#: terminate!
 15 * Proc: <6b7f0>, Seq#: 3, Msg#: terminate!

Line 4-7 show the first message going around, lines 8-11 stem from the second message and lines 12-15 make the last (terminating) message visible (I guess the funny “order of execution” above arises from the fact that the Stackless python scheduler by default prefers the message receiver tasklets).

The first benchmarks

Now for the actual benchmarks, all run on a machine with the following specs:

  Model Name:	MacBook Pro 15"
  Model Identifier:	MacBookPro2,2
  Processor Name:	Intel Core 2 Duo
  Processor Speed:	2.33 GHz
  Number Of Processors:	1
  Total Number Of Cores:	2
  L2 Cache (per processor):	4 MB
  Memory:	2 GB
  Bus Speed:	667 MHz

I ran two kind of benchmarks:

  1. fixed number of processes (ring size N = 100) and varying numbers of messages (10000 <= M <= 90000 stepped up in 10K increments)
  2. fixed number of messages (M = 100) and varying numbers of processes (10000 <= N <= 90000 stepped up in 10K increments)

In both cases the resulting total number of messages sent was between 1 and 9 million.

I expected that Erlang would clearly outperform Stackless python because this was the former’s “sweet spot” scenario after all.

I was pleasantly surprised by python not only holding up to the challenge but winning it!

As can be seen from the diagrams depicting the results of the two benchmarks, Stackless Python performed much better than Erlang!

100 processes varying message numbers
100 messages varying number of processes

The second round

Eventually I came to believe that Erlang’s I/O library is very slow and that that might be the reason for its sub-optimal performance in the benchmarks above.

I hence modified both sources (Erlang here, Stackless python here) and commented out all of the output.

Following that, I reran the benchmarks with the following results.

100 processes varying message numbers (no output)
100 messages varying number of processes (no output)

Now Erlang performed much more like a system that was built with message passing concurrency in mind BUT with Stackless python still close on its heels.

In conclusion

Even I as a Python aficionado am sometimes baffled in light of the gems that can be found in the Python treasure trove.

The Python community should stop worrying about the language falling out of fashion and/or being overtaken by RoR/Ruby or some other contender of the day.

Python’s potential is so huge, all it takes is attracting notice to it e.g. to the fact that Python is ready for Our Manycore Future.

Last but not least, I find it’s quite a shame that Stackless python seems to be treated like the “unloved stepchild” by the community.

Highly concurrent systems are the future of software and stackless facilitates the construction of such systems today.

Isn’t such a capability worthy of being maintained, improved and expanded upon as opposed to building yet another web framework or whatever ..?