Text filtering with Erlang


After a long break I picked up the Erlang book again, and my appetite for writing some Erlang code was soon kindled.

A small Python component I produced at work seemed like a good candidate for my (sequential) Erlang exercises. It is a fairly simple component that removes user/password data embedded in URLs.

Just so you know where I am coming from:

  • my main/favourite programming language is Python
  • my exercises are mainly about sequential, non-distributed and non-telecoms-related problems, whereas Erlang’s main strength and appeal lie in the area of parallel/distributed telecoms/networking systems
  • I have played with Erlang a little bit before (ring benchmark, XML parsing) and liked it in general, although IMHO it is severely lacking when it comes to the availability and quality of standard library components.

Now that my particular set of preconceptions is clear and in the open, let’s look at the stuff below 🙂

File processing with Erlang’s regexp module

The initial implementation of the URL filter in Erlang used its regexp library.

    -module(regex).
    -export([main/1]).

    isalphanum(C) when C > 47, C < 58; C > 64, C < 91; C > 96, C < 123 -> true;
    isalphanum(_) -> false.

    %% Generate a temporary file name of length N
    genname(0, L) -> L;
    genname(N, L) ->
        R = random:uniform(123),
        case isalphanum(R) of
            true -> genname(N-1, [R|L]);
            false -> genname(N, L)
        end.

    %% Returns a randomly generated temporary file path where the basename is
    %% of length N
    mktemppath(Prefix, N) -> Prefix ++ "/" ++ genname(N, []).

Please note how, in the code above, I had to implement functionality that is absent from the standard library.
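For comparison, the same rejection-sampling idea can be sketched in Python 3: draw random character codes and keep only the alphanumeric ones. This is only an illustration that mirrors the Erlang functions above; the names `ALNUM`, `genname` and `mktemppath` are reused for readability and are not taken from the Python component mentioned earlier.

```python
import random
import string

# Assumption: "alphanumeric" means ASCII letters and digits, as in isalphanum/1.
ALNUM = set(string.ascii_letters + string.digits)


def genname(n):
    """Return a random alphanumeric name of length n."""
    name = []
    while len(name) < n:
        # Same range as random:uniform(123): integers from 1 to 123.
        c = chr(random.randint(1, 123))
        if c in ALNUM:
            name.append(c)  # keep alphanumerics, silently retry otherwise
    return "".join(name)


def mktemppath(prefix, n):
    """Return prefix + "/" + a random basename of length n."""
    return prefix + "/" + genname(n)
```

In everyday Python code one would more likely reach for the `tempfile` module instead of hand-rolling the name.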

    %% Removes passwords embedded in URLs from a log file.
    scrub_file(Tmpdir, F) ->
        %% make a temporary directory if it does not exist yet.
        case file:make_dir(Tmpdir) of
            ok -> ok;
            {error,eexist} -> ok;
            _ -> exit({error, failed_to_make_tmpdir})
        end,

        %% Move the original file out of the way.
        T = mktemppath(Tmpdir, 16),
        case file:rename(F, T) of
            ok -> ok;
            _ -> exit({error, failed_to_move_file})
        end,

        %% Now open it for reading. Matching on ok makes a failed open
        %% crash here instead of further down.
        {ok, In} = file:open(T, [read]),
        %% Open the original path for writing.
        {ok, Out} = file:open(F, [write]),

        %% Call the function that will scrub the lines.
        scrub_lines(In, Out),

        %% Close the file handles and return the path to the original file.
        file:close(Out),
        file:close(In),
        T.
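The file shuffling in scrub_file (move the original aside, read from the moved copy, write the cleaned lines to the original path) can be sketched in Python 3 as follows. This is a minimal sketch and not the Python component from work: the `scrub_line` parameter is a hypothetical callback that cleans one line, and `tempfile.mkstemp` stands in for the hand-written name generator.

```python
import os
import tempfile


def scrub_file(tmpdir, path, scrub_line):
    """Rewrite the file at `path` line by line; return where the original went."""
    # Make the temporary directory if it does not exist yet.
    os.makedirs(tmpdir, exist_ok=True)

    # Reserve a unique name and move the original file out of the way.
    # Assumption: tmpdir and path live on the same filesystem.
    fd, moved = tempfile.mkstemp(dir=tmpdir)
    os.close(fd)
    os.replace(path, moved)

    # Read the moved file and write the scrubbed lines to the original path.
    with open(moved) as src, open(path, "w") as dst:
        for line in src:
            dst.write(scrub_line(line))
    return moved
```

Passing the line filter in as an argument keeps the file handling separate from the scrubbing itself.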

The code that scrubs the URLs is below. Note that the scrub_lines() function is tail recursive.

    %% This is where the log file is actually read linewise and where
    %% the scrubbing function is invoked for lines that contain URLs.
    scrub_lines(In, Out) ->
        L = io:get_line(In, ''),
        case L of
            eof -> ok;
            _ ->
                %% Does the line contain URLs?
                case string:str(L, "://") of
                    0 -> io:format(Out, "~s", [L]);
                    _ ->
                        case regexp:gsub(L, "://[^:]+:[^@]+@", "://") of
                            {ok, S, _} -> io:format(Out, "~s", [S]);
                            %% gsub reports failure as a two-element tuple.
                            {error, E} -> io:format("Failed: ~p~n", [E])
                        end
                end,
                %% Continue with next line.
                scrub_lines(In, Out)
        end.

    %% Main function.
    main([A]) ->
        {A1,A2,A3} = now(),
        random:seed(A1, A2, A3),

        %% A single argument (the name of the file to be scrubbed) is expected.
        F = atom_to_list(A),
        T = scrub_file("tmp", F),

        %% The scrubbed file content will be written to a new file that's
        %% in the place of the original file. Where was the latter moved to?
        io:format("~s~n", [T]),

        init:stop().
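The Python side of the comparison is not shown in this post. As a rough idea of what a line filter using the same pattern could look like, here is a minimal Python 3 sketch; the names `CREDENTIALS` and `scrub_line` are made up for illustration, and the pattern is simply the one from the Erlang listing above.

```python
import re

# Same pattern as in the Erlang code: "://user:password@" collapses to "://".
CREDENTIALS = re.compile(r"://[^:]+:[^@]+@")


def scrub_line(line):
    """Return `line` with user/password data removed from embedded URLs."""
    # Cheap substring test first, like string:str/2 in the Erlang version.
    if "://" not in line:
        return line
    return CREDENTIALS.sub("://", line)


if __name__ == "__main__":
    print(scrub_line("GET http://bob:secret@example.com/x"))
```

Note that the pattern is deliberately simple: any line with a colon and an @ somewhere after a "://" will match, whether or not the text in between is really a credential.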

The cursory benchmarks performed (on log files of varying size) using the Python and the Erlang code confirmed other people’s experiences with Erlang’s regex performance: the Erlang version was roughly five to eight times slower than the Python one (but see also this interesting “rebuttal”).

Log file size   Python times   Erlang times
1 MB            0m0.230s       0m1.896s
10 MB           0m1.510s       0m8.766s
100 MB          0m14.793s      1m17.662s
1 GB            2m55.012s      13m54.588s

The do-it-yourself construction

Curious to learn whether the performance could be improved by abstaining from regular expressions, I came up with an alternative implementation that does not use regexp.

As you can see below, the do-it-yourself construction does perform slightly better on the larger files (a few percent faster from 10 MB upwards, marginally slower on the 1 MB file), at the expense of being very specialized and requiring 60% more code.

Log file size   Python times   Erlang regexp   Erlang do-it-yourself
1 MB            0m0.230s       0m1.896s        0m1.969s
10 MB           0m1.510s       0m8.766s        0m8.459s
100 MB          0m14.793s      1m17.662s       1m12.448s
1 GB            2m55.012s      13m54.588s      13m3.360s
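The do-it-yourself Erlang code is not reproduced here. To give an idea of what scrubbing without regular expressions involves, here is a minimal Python 3 sketch of one possible approach using plain substring searches. It is an illustration only, not the benchmarked implementation, and the function name `scrub_line` is an assumption.

```python
def scrub_line(line):
    """Remove "user:password@" after each "://" using substring searches only."""
    out = []
    pos = 0
    while True:
        start = line.find("://", pos)
        if start == -1:
            # No further URL marker: copy the rest of the line and finish.
            out.append(line[pos:])
            return "".join(out)

        # Copy everything up to and including "://".
        auth_start = start + 3
        out.append(line[pos:auth_start])
        pos = auth_start

        # A credential needs at least one character, then ":", then at
        # least one character, then "@". The comparisons also reject -1.
        colon = line.find(":", auth_start)
        if colon > auth_start:
            at = line.find("@", colon + 1)
            if at > colon + 1:
                pos = at + 1  # skip "user:password@"


if __name__ == "__main__":
    print(scrub_line("GET http://bob:secret@example.com/x"))
```

Even this small version shows the trade-off: the matching rules that the regular expression states in one line are spread over several index comparisons.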

In conclusion

Every couple of months or so I develop a euphoria towards Erlang, which is consistently dampened by using the language to tackle problems for which it admittedly was not designed in the first place.

I guess most people start using a language for simple programming exercises first as opposed to building something like a Jabber/XMPP instant messaging server straightaway.

I hate to repeat myself but improving the standard library (by adding common functionality and making sure it performs decently) would do a lot to attract fresh talent to the erlang community and I hear that a certain rate of influx of “fresh blood” is a necessary prerequisite for success.

Ah, and no, you were not supposed to grok the sentence above unless you read it three times 🙂