Erlang vs. Stackless Python: a first benchmark

Change of Heart

Mon Aug 17 17:51:58 CEST 2009

PLEASE NOTE:

I have had more time to study Erlang since writing the article below and
have come to realize that the conclusions I drew are skewed.

Erlang

  - is a language and a high-quality runtime system optimized for
    concurrent and distributed execution. The examples below do not
    really exercise any of these strengths.
  - has proven itself in quite a number of high-profile projects and
    deployments (ejabberd, CouchDb and RabbitMQ just to mention a few).
  - has a strong, vibrant and competent community that's driving the
    Erlang system and improving it continuously.

For all these reasons Stackless Python, although an interesting piece of
technology in its own right, is certainly no match for Erlang.

Please see http://pseudogreen.org/blog/erlang_vs_stackless_vs_multitask.html
for what appears to be a more even-handed comparison.

Introduction

I obtained a copy of Joe Armstrong’s Erlang book recently. After reading through the first half I have found it to be a very commendable work, making a strong case for message passing concurrency — a different approach to building large scale and highly concurrent systems.

After finishing chapter 8, I came across the following exercise:

Write a ring benchmark. Create N processes in a ring. Send a message round the ring M times so that a total of N*M messages get sent. Time how long this takes [..]

Write a similar program in some other programming language you are familiar with. Compare the results. Write a blog, and publish the results on the internet!

I liked it; it looked like a great opportunity to get my feet wet with Erlang and message passing concurrency :-)

So, here we are, that’s precisely what this article is about.

The “other programming language” is of course Python, in this particular case Stackless Python featuring tasklets and channels to counter Erlang’s concurrency machinery.
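
For readers who have not seen Stackless before, here is a minimal, illustrative sketch (not part of the benchmark code) of the two primitives used throughout this article: tasklets (cheap, cooperatively scheduled micro-threads) and channels (rendezvous points for sending and receiving messages). The names pinger and ponger are mine.

    import stackless as SL

    def pinger(ch):
        ch.send("ping")         # blocks until some tasklet is ready to receive

    def ponger(ch):
        print(ch.receive())     # receives "ping" and prints it

    ch = SL.channel()
    SL.tasklet(pinger)(ch)      # create and schedule the two tasklets
    SL.tasklet(ponger)(ch)
    SL.run()                    # run the scheduler until all tasklets are done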

The source code

Both the Erlang and the Stackless Python code set up a ring of N processes so that each process knows its successor. The resulting “ring” has the shape of an open necklace (a small sketch of the message flow follows the list below):

  • there is a first and a last bead in the chain
  • each message
    • is sent to the first bead
    • is relayed by beads number [1 .. N-1] to their immediate neighbour
    • stops at bead number N
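
To make the topology concrete, here is a tiny, purely sequential Python sketch of that flow; it involves no concurrency at all and simply mirrors the numbering used by both benchmark programs (the names relay, N and M are illustrative only):

    N, M = 4, 3

    def relay(bead, msg):
        if bead < N:
            return relay(bead + 1, msg)   # beads 1 .. N-1 forward to their neighbour
        return msg                        # bead N terminates the message

    for msg in range(M - 1, -1, -1):      # the same [M-1 .. 0] message sequence
        relay(1, msg)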

The Erlang program

This is my very first Erlang program, so please bear with me. If there’s anything that could be done better, feel free to comment on it (in a polite way).

  1 % sets up a ring of N processes, a message is passed M times around the ring.
  2 -module(oringb).
  3 -export([run_benchmark/2, loop/2, mloop/1, main/1]).
  4 % builds a ring of N processes and send a message around the ring M times.
  5 run_benchmark(N, M) ->
  6     io:format(">> Erlang R11B-5 here (N=~w, M=~w)!~n~n", [N, M]),
  7     % spawn the other N-1 processes needed for the ring
  8     LastP = lists:foldl(fun(S, Pid) -> spawn(?MODULE, loop, [N-S, Pid]) end,
  9                         self(), lists:seq(1, N-1)),
 10     spawn(fun() -> [ LastP ! R || R <- lists:reverse(lists:seq(0, M-1)) ] end),
 11     mloop(N).

The function above sets up the ring. The lists:foldl construct is equivalent to Python’s built-in reduce() function.
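
For those more at home in Python, a rough (and purely illustrative) counterpart of that lists:foldl call might look as follows. Note the argument order: foldl passes (Item, Accumulator) to its function, while reduce() passes (accumulator, item). spawn_in_ring() is a hypothetical stand-in for spawn(?MODULE, loop, [N-S, Pid]); no process is actually created here.

    n = 4

    def spawn_in_ring(pid, s):
        # record the successor instead of actually spawning a process
        return ("proc", n - s, pid)

    # reduce() is a builtin in Python 2.x (functools.reduce in Python 3)
    last_p = reduce(spawn_in_ring, range(1, n), "self()")
    print(last_p)   # ('proc', 1, ('proc', 2, ('proc', 3, 'self()')))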

The process spawned on line 10 sends the M messages around the ring. The messages are just integers taken from the [M-1 .. 0] sequence.

The main process acts as process number N and invokes the function mloop(), thus entering the appropriate receive loop on line 11.

 12 % receive loop executed by the first N-1 processes in the ring
 13 %   - 'S' is the sequence number allowing us to keep track of a
 14 %     process' position in the ring
 15 %   - 'NextP' is the next process in the ring
 16 %   - 'R' is the round/message number, zero indicates that the processes
 17 %      should terminate;
 18 loop(S, NextP) ->
 19     receive
 20         % the message number is above zero => forward the message to the next
 21         % process in the ring
 22         R when R > 0 -> NextP ! R,
 23             io:format(": Proc: ~8w, Seq#: ~w, Msg#: ~w ..~n", [self(), S, R]),
 24             loop(S, NextP);
 25         % the message number is zero => forward message and terminate
 26         R when R =:= 0 -> NextP ! R,
 27             io:format("* Proc: ~8w, Seq#: ~w, Msg#: terminate!~n", [self(), S]);
 28         % error: the message number is below zero => raise exception
 29         R when R < 0 ->
 30             erlang:error({"internal error", "invalid message number"})
 31     end.

The loop() function is executed by the first N-1 processes in the ring. These know their position in the ring (S) as well as their immediate neighbour (NextP) to which they forward all messages received.

The messages (R) are just integers. Values above zero keep the ring going; a zero value (the last message) terminates the ring.

 32 % receive loop executed by the last (Nth) process;
 33 % it won't forward any messages
 34 mloop(S) ->
 35     receive
 36         R when R > 0 ->
 37             io:format("> Proc: ~8w, Seq#: ~w, Msg#: ~w ..~n", [self(), S, R]),
 38             mloop(S);
 39         0 ->
 40             io:format("@ Proc: ~8w, Seq#: ~w, ring terminated.~n", [self(), S])
 41     end.

The mloop() function acts as the receive loop of process number N. It “terminates” the messages that go around the ring. Its main purpose is to produce diagnostic output.

 42 % 'main' function allowing the invocation from the shell as well as the
 43 % passing of command line arguments
 44 main(A) ->
 45     Args = [ list_to_integer(Litem) ||
 46              Litem <- [ atom_to_list(Atom) || Atom <- A ]],
 47     [N, M] = Args,
 48     run_benchmark(N, M).

What follows is an example invocation of the oringb module (a ring of 4 processes with 3 messages going around):

  1 bbox33:ring $ erl -noshell -s oringb main 4 3 -s init stop
  2 >> Erlang R11B-5 here (N=4, M=3)!
  3 
  4 : Proc: <0.29.0>, Seq#: 1, Msg#: 2 ..
  5 : Proc: <0.28.0>, Seq#: 2, Msg#: 2 ..
  6 : Proc: <0.27.0>, Seq#: 3, Msg#: 2 ..
  7 : Proc: <0.29.0>, Seq#: 1, Msg#: 1 ..
  8 : Proc: <0.28.0>, Seq#: 2, Msg#: 1 ..
  9 > Proc:  <0.1.0>, Seq#: 4, Msg#: 2 ..
 10 : Proc: <0.27.0>, Seq#: 3, Msg#: 1 ..
 11 * Proc: <0.29.0>, Seq#: 1, Msg#: terminate!
 12 * Proc: <0.28.0>, Seq#: 2, Msg#: terminate!
 13 > Proc:  <0.1.0>, Seq#: 4, Msg#: 1 ..
 14 * Proc: <0.27.0>, Seq#: 3, Msg#: terminate!
 15 @ Proc:  <0.1.0>, Seq#: 4, ring terminated.

Lines 4-6 and 9 are generated by the first message, lines 7, 8, 10 and 13 show the second message going around the ring, and lines 11, 12, 14 and 15 are the result of the zero-valued (terminate ring!) message.

The Stackless Python program

I have tried to keep the Python code as similar as possible to its Erlang counterpart. The run_benchmark() function below builds the ring of N tasklets (lines 9-18) and sends the M messages around it (lines 19-21).

One minor difference to note: here the main process/tasklet is not the last process in the ring; instead, it sends the messages itself (after spawning the last tasklet on lines 16-18).

  1 #!/Library/Frameworks/Python.framework/Versions/2.5/bin/python
  2 # encoding: utf-8
  3 import sys
  4 import stackless as SL
  5 
  6 def run_benchmark(n, m):
  7     print(">> Python 2.5.1, stackless 3.1b3 here (N=%d, M=%d)!\\n" % (n, m))
  8     firstP = cin = SL.channel()
  9     for s in xrange(1, n):
 10         seqn = s
 11         cout = SL.channel()
 12         # print("*> s = %d" % (seqn, ))
 13         t = SL.tasklet(loop)(seqn, cin, cout)
 14         cin = cout
 15     else:
 16         seqn = s+1
 17         # print("$> s = %d" % (seqn, ))
 18         t = SL.tasklet(mloop)(seqn, cin)
 19     for r in xrange(m-1, -1, -1):
 20         # print("+ sending Msg#  %d" % r)
 21         firstP.send(r)
 22     SL.schedule()

The loop() function is executed by the first N-1 processes in the ring as in the Erlang code. Apart from relaying the messages received, it mainly produces diagnostic output.

Each tasklet knows its position in the ring (S) and is given an input channel from which to read its messages (cin) as well as an output channel to which to write all the messages received (cout).

 23 def loop(s, cin, cout):
 24     while True:
 25         r = cin.receive()
 26         cout.send(r)
 27         if r > 0:
 28             print(": Proc: <%s>, Seq#: %s, Msg#: %s .." % (pid(), s, r))
 29         else:
 30             print("* Proc: <%s>, Seq#: %s, Msg#: terminate!" % (pid(), s))
 31             break

The last tasklet runs the mloop() function which mainly produces diagnostic output.

 32 def mloop(s, cin):
 33     while True:
 34         r = cin.receive()
 35         if r > 0:
 36             print("> Proc: <%s>, Seq#: %s, Msg#: %s .." % (pid(), s, r))
 37         else:
 38             print("@ Proc: <%s>, Seq#: %s, ring terminated." % (pid(), s))
 39             break
 40 
 41 def pid(): return repr(SL.getcurrent()).split()[-1][2:-1]  # hex address of the current tasklet, used as a pseudo-pid
 42 
 43 if __name__ == '__main__':
 44     run_benchmark(int(sys.argv[1]), int(sys.argv[2]))

The following is an example invocation of the Python code (a chain of 4 processes with 3 messages going around, as above):

  1 bbox33:ring $ ./oringb.py 4 3
  2 >> Python 2.5.1, stackless 3.1b3 here (N=4, M=3)!
  3 
  4 > Proc: <6b870>, Seq#: 4, Msg#: 2 ..
  5 : Proc: <6b730>, Seq#: 1, Msg#: 2 ..
  6 : Proc: <6b770>, Seq#: 2, Msg#: 2 ..
  7 : Proc: <6b7f0>, Seq#: 3, Msg#: 2 ..
  8 > Proc: <6b870>, Seq#: 4, Msg#: 1 ..
  9 : Proc: <6b730>, Seq#: 1, Msg#: 1 ..
 10 : Proc: <6b770>, Seq#: 2, Msg#: 1 ..
 11 : Proc: <6b7f0>, Seq#: 3, Msg#: 1 ..
 12 @ Proc: <6b870>, Seq#: 4, ring terminated.
 13 * Proc: <6b730>, Seq#: 1, Msg#: terminate!
 14 * Proc: <6b770>, Seq#: 2, Msg#: terminate!
 15 * Proc: <6b7f0>, Seq#: 3, Msg#: terminate!

Lines 4-7 show the first message going around, lines 8-11 stem from the second message and lines 12-15 make the last (terminating) message visible. (I guess the peculiar “order of execution” above arises from the fact that the Stackless Python scheduler by default prefers the receiving tasklets.)
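
As an aside, and assuming the documented channel.preference attribute behaves as described in the Stackless docs (-1, the default, prefers the receiver; 1 prefers the sender; 0 is neutral), the interleaving could presumably be influenced like this. This is an untested sketch, not part of the benchmark code:

    import stackless as SL

    ch = SL.channel()
    ch.preference = 1   # keep running the sender after send() instead of
                        # switching to the blocked receiver immediately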

The first benchmarks

Now for the actual benchmarks, all run on a machine with the following specs:

  Model Name:	MacBook Pro 15"
  Model Identifier:	MacBookPro2,2
  Processor Name:	Intel Core 2 Duo
  Processor Speed:	2.33 GHz
  Number Of Processors:	1
  Total Number Of Cores:	2
  L2 Cache (per processor):	4 MB
  Memory:	2 GB
  Bus Speed:	667 MHz

I ran two kinds of benchmarks:

  1. fixed number of processes (ring size N = 100) and varying numbers of messages (10000 <= M <= 90000 stepped up in 10K increments)
  2. fixed number of messages (M = 100) and varying numbers of processes (10000 <= N <= 90000 stepped up in 10K increments)

In both cases the resulting total number of messages sent was between 1 and 9 million.
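
The post does not include the scripts used to time these sweeps; a hypothetical driver for the first benchmark could look roughly like the sketch below (the paths and the module name oringb are the ones used above, everything else is an assumption):

    import subprocess, time

    N = 100
    for M in range(10000, 90001, 10000):
        start = time.time()
        subprocess.call(["./oringb.py", str(N), str(M)])
        print("stackless  N=%d M=%d  %.2fs" % (N, M, time.time() - start))

        start = time.time()
        subprocess.call(["erl", "-noshell", "-s", "oringb", "main",
                         str(N), str(M), "-s", "init", "stop"])
        print("erlang     N=%d M=%d  %.2fs" % (N, M, time.time() - start))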

I expected Erlang to clearly outperform Stackless Python, since this was, after all, the former's “sweet spot” scenario.

I was pleasantly surprised by Python not only holding up to the challenge but winning it!

As can be seen from the diagrams depicting the results of the two benchmarks, Stackless Python performed much better than Erlang!

100 processes, varying number of messages
100 messages, varying number of processes

The second round

Eventually I came to believe that Erlang's I/O library is very slow and that this might be the reason for its sub-optimal performance in the benchmarks above.

I hence modified both sources (Erlang here, Stackless Python here) and commented out all of the output.

Following that, I reran the benchmarks with the following results.

100 processes, varying number of messages (no output)
100 messages, varying number of processes (no output)

Now Erlang performed much more like a system that was built with message passing concurrency in mind, but Stackless Python was still close on its heels.

In conclusion

Even I, as a Python aficionado, am sometimes amazed by the gems that can be found in the Python treasure trove.

The Python community should stop worrying about the language falling out of fashion and/or being overtaken by RoR/Ruby or some other contender of the day.

Python's potential is huge; all it takes is drawing attention to it, e.g. to the fact that Python is ready for Our Manycore Future.

Last but not least, I find it quite a shame that Stackless Python seems to be treated like the “unloved stepchild” by the community.

Highly concurrent systems are the future of software, and Stackless facilitates the construction of such systems today.

Isn’t such a capability worthy of being maintained, improved and expanded upon as opposed to building yet another web framework or whatever ..?