Text filtering with erlang
Introduction
After a long break I picked up the Erlang book again and my appetite for writing some erlang code was soon kindled.
A small Python component I produced at work seemed like a good candidate for my (sequential) erlang exercises. It is a fairly simple component that removes user/password data embedded in URLs.
Just so you know where I am coming from:
- my main/favourite programming language is Python
- my exercises are mainly about sequential, non-distributed and non-telecoms-related problems whereas erlang’s main strength and appeal lies in the area of parallel/distributed telecoms/networking systems
- I have played with erlang a little bit before (ring benchmark, XML parsing) and liked it in general although IMHO it lacks severely when it comes to the availability and quality of standard library components.
Now that my particular set of preconceptions is clear and in the open, let’s look at the stuff below
File processing with erlang’s regexp module
The initial implementation of the URL filter in erlang used its regexp library.
1 -module(regex). 2 -export([main/1]). 3 4 isalphanum(C) when C > 47, C < 58; C > 64, C < 91; C > 96, C < 123 -> true; 5 isalphanum(_) -> false. 6 7 %% Generate a temporary file name of length N 8 genname(0, L) -> L; 9 genname(N, L) -> 10 R = random:uniform(123), 11 case isalphanum(R) of 12 true -> genname(N-1, [R|L]); 13 false -> genname(N, L) 14 end. 15 16 %% Returns a randomly generated temporary file path where the basename is 17 %% of length N 18 mktemppath(Prefix, N) -> Prefix ++ “/” ++ genname(N, []). 19
Please note how I had to implement functionality absent from the standard library above.
20 %% Removes passwords embedded in URLs from a log file. 21 scrub_file(Tmpdir, F) -> 22 %% make a temporary directory if it does not exist yet. 23 case file:make_dir(Tmpdir) of 24 ok -> ok; 25 {error,eexist} -> ok; 26 _ -> exit({error, failed_to_make_tmpdir}) 27 end, 28 29 %% Move the original file out of the way. 30 T = mktemppath(Tmpdir, 16), 31 case file:rename(F, T) of 32 ok -> ok; 33 _ -> exit({error, failed_to_move_file}) 34 end, 35 36 %% Now open it for reading. 37 {_, In} = file:open([T], read), 38 %% Open the original path for writing. 39 {_, Out} = file:open([F], write), 40 41 %% Call the function that will scrub the lines. 42 scrub_lines(In, Out), 43 44 %% Close the file handles and return the path to the original file. 45 file:close(Out), 46 file:close(In), 47 T. 48
The code that scrubs the URLs is below, the scrub_lines() function is tail recursive.
49 %% This is where the log file is actually read linewise and where 50 %% the scrubbing function is invoked for lines that contain URLs. 51 scrub_lines(In, Out) -> 52 L = io:get_line(In, ”), 53 case L of 54 eof -> ok; 55 _ -> 56 %% Does the line contain URLs? 57 case string:str(L, “://”) of 58 0 -> io:format(Out, “~s“, [L]); 59 _ -> 60 case regexp:gsub(L, “://[^:]+:[^@]+@”, “://”) of 61 {ok, S, _} -> io:format(Out, “~s“, [S]); 62 {R, S, _} -> io:format(“Failed: {~p,~p}”, [R,S]) 63 end 64 end, 65 %% Continue with next line. 66 scrub_lines(In, Out) 67 end. 68 69 %% Main function. 70 main([A]) -> 71 {A1,A2,A3} = now(), 72 random:seed(A1, A2, A3), 73 74 %% A single argument (the name of the file to be scrubbed) is expected. 75 F = atom_to_list(A), 76 T = scrub_file(“tmp”, F), 77 78 %% The scrubbed file content will be written to a new file that’s 79 %% in the place of the original file. Where was the latter moved to? 80 io:format(“~s~n“, [T]), 81 82 init:stop().
The cursory benchmarks performed (on log files of varying size) using the python and the erlang code confirmed other people’s experiences with erlang’s regex performance (but see also this interesting “rebuttal”).
| Log file size | Python times | Erlang times |
|---|---|---|
| 1 MB | 0m0.230s | 0m1.896s |
| 10 MB | 0m1.510s | 0m8.766s |
| 100 MB | 0m14.793s | 1m17.662s |
| 1 GB | 2m55.012s | 13m54.588s |
The do-it-yourself construction
Curious to learn whether the performance can be improved by abstaining from regular expressions I came up with an alternative implementation that does not use regexp.
As you can see below the do-it-yourself construction is indeed performing slightly better at the expense of being very specialized and requiring 60% more code.
| Log file size | Python times | Erlang regexp | Erlang do-it-yourself |
|---|---|---|---|
| 1 MB | 0m0.230s | 0m1.896s | 0m1.969s |
| 10 MB | 0m1.510s | 0m8.766s | 0m8.459s |
| 100 MB | 0m14.793s | 1m17.662s | 1m12.448s |
| 1 GB | 2m55.012s | 13m54.588s | 13m3.360s |
In conclusion
Every couple of months or so I develop a euphoria towards erlang which is consistently dampened by using the language to tackle problems for which the language admittedly was not designed in first place.
I guess most people start using a language for simple programming exercises first as opposed to building something like a Jabber/XMPP instant messaging server straightaway.
I hate to repeat myself but improving the standard library (by adding common functionality and making sure it performs decently) would do a lot to attract fresh talent to the erlang community and I hear that a certain rate of influx of “fresh blood” is a necessary prerequisite for success.
Ah, and no, you were not supposed to grok the sentence above unless you read it three times ![]()
Tags: functional programming
April 30, 2008 at 5:39 pm
Well, Erlang was never really meant to do this sort of thing, and it’s not a language meant to *replace* anything as dynamic as Python. Python is good at a lot of things at the same time, including parsing text, but you must also remember that regexps in Python are being handled in C. Erlang has its own VM and yadda yadda — bottom line is that you can’t expect the same kind of performance.
If you were to stack Python and Erlang up against each other in a task that Erlang was meant for, the benchmarks would look a little different. Try doing 10.000 asynchronous socket connections in Python. Or try building a distributed, fault tolerant system.
I’m not really saying anything that hasn’t been said a dozen times before. Just.. keep Erlang in your arsenal of tools and know its place.
April 30, 2008 at 7:27 pm
io:get_line performance is known to be a problem. Use the binary read methods instead. Here’s some discussion:
http://www.trapexit.org/forum/viewtopic.php?p=31894
May 2, 2008 at 6:18 am
If the order of the log content does not matter, why not try building a more parallel version? Have one process read from the input file, one process for writing out the output file, and have N copies of the regex match/find replace in additional processes. If the string matching/processing is the bottleneck, you can at least improve performance by handling multiple lines at a time. If having just a single output log file didn’t matter, you could probably parallelize the writing out to multiple files too.
May 3, 2008 at 10:14 pm
1. regexp:parse instead of passing string to gsub
2. io:get_line is bad, use binaries, or at least read
3. handle block-read requests in one process, regex everything in second one (or a pool, but I guess that won’t be needed here), write in third
That should improve your time A LOT!
(check widefinder solution for dispatching lines & reading in binary for reference and ideas)
May 28, 2008 at 8:23 pm
Take a look at
http://paste.lisp.org/display/61384
I wonder if you can share your data file to make some tests.