Wrap-up: mergesort in haskell

I have to admit that I didn’t fully understand the apfelmus example code. Nevertheless, I made an effort to address both of his criticisms:

  1. The merge() function does not use an accumulator argument any more and is indeed much simpler now.
  2. The recursive mergesort_ function now does not use the haskell list length operator any more. Instead, the length of the list to be sorted is passed down the recursive chain.

Please see the optimised mergesort implementation for details.

These improvements reduced the RAM utilisation and improved the run-time performance by another 20% respectively.

Nothing to scoff at, eh?

Haskell byte strings to the rescue!

After my (naive) mergesort implementation from yesterday used around 730 MB of RAM to sort a (26 MB) file containing approx. 400,000 strings I consulted the good folks on the #haskell IRC channel.

Their advice was to use byte strings as opposed to normal strings since the former perform much better.

I tried that and observed that the RAM utilisation and run-time went down by approximately 85% !

The reduced RAM utilisation was attributed to the more efficient byte strings and the improved run-time performance to the reduced garbage collection overhead respectively.

The difference between the source files (mergesort.hs, Scaffolding.hs) is minimal and switching over to byte strings was facilitated by the fact that they expose the same interface as normal strings.

Nice!

Finger exercises in haskell

I have been reading about haskell for a while now and felt that it’s time to “get my hands dirty” by (re-)implementing some of the well known algorithms in computer science. For my first exercise I chose merge sort.

The resulting code (sans command line parameter handling, auxiliary I/O functions etc.) is quite neat and was written in little time. Haskell is very expressive and its attractiveness immediately obvious when formulating pseudo-mathematical problem solutions.

  1 module Main(main) where
  2 
  3 import qualified IO
  4 import System(getArgs)
  5 import Monad(mapM_)
  6 import qualified Scaffolding
  7 
  8 mergesort l =
  9     if (length l) <= 1 then l -- The list is already sorted.
 10     else -- Split the list into two halves and sort these.
 11         merge (mergesort lpart) (mergesort rpart) []
 12         where llen_half = (length l) `div` 2
 13               (lpart, rpart) = splitAt llen_half l
 14 
 15 -- Case #1: the right list is empty; just append the left list to the
 16 -- accumulator. The latter is reversed because we were prepending
 17 -- to it in case #3.
 18 merge lpart [] acc = (reverse acc) ++ lpart
 19 
 20 -- Case #2: the left list is empty; just append the right list to the
 21 -- (reversed) accumulator list.
 22 merge [] rpart acc = (reverse acc) ++ rpart
 23 
 24 -- Case #3: neither of the left/right lists to be merged is empty;
 25 -- prepend the lesser head element to the accumulator list.
 26 merge (l:ls) (r:rs) acc =
 27     if l <= r then merge ls (r:rs) (l:acc)
 28     else merge (l:ls) rs (r:acc)
 29 
 30 -- The main method, handles command line arguments, input and output.
 31 main = do
 32     args <- getArgs
 33     fileh <- case head args of
 34                 -- The text to be sorted is to be read from stdin.
 35                 "-" -> do return (Just IO.stdin)
 36                 -- The text to be sorted is to be read from the file
 37                 -- specified.
 38                 "-f" -> do Scaffolding.openFile (head (tail args))
 39                 -- Just sort the command line parameters.
 40                 _ -> do let sorted_args = mergesort args
 41                         IO.putStrLn ("mergesort(" ++ show args ++
 42                                      ") = " ++ show sorted_args)
 43                         return Nothing
 44     case fileh of
 45         -- Read text from file handle, sort it and print to stdout.
 46         Just fileh -> do text_to_sort <- Scaffolding.readLines fileh []
 47                          mapM_ IO.putStrLn (mergesort text_to_sort)
 48         -- We failed to open the file specified (if any).
 49         Nothing -> do return ()

I also liked the fact that the haskell standard library is rich and nicely documented.

Just to get an idea of how my naive mergesort implementation performs I created a file listing all the files on my Dell D630 laptop running Hardy Heron (please ignore the recursive nature of this problem :-) ).

 1 u804: haskell $ ghc --version
 2 The Glorious Glasgow Haskell Compilation System, version 6.8.2
 3 u804: haskell $ ls -l all-files.txt
 4 -rw-r--r-- 1 mhr mhr 26245382 2008-05-31 17:18 all-files.txt
 5 u804: haskell $ wc -l all-files.txt
 6 399754 all-files.txt
 7 u804: haskell $ uname -a
 8 Linux u804 2.6.24-18-generic #1 SMP Wed May 28 20:27:26 UTC 2008 i686 GNU/Linux

I then let loose the compiled haskell binary on that file (all-files.txt) and compared how it performed against /usr/bin/sort.

  1 u804: haskell $ time ./mergesort -f all-files.txt >/dev/null
  2 
  3 real    0m12.994s
  4 user    0m12.349s
  5 sys 0m0.632s
  6 u804: haskell $ time sort all-files.txt >/dev/null
  7 
  8 real    0m10.675s
  9 user    0m10.613s
 10 sys 0m0.052s
 11 u804: haskell $ time sort all-files.txt >/dev/null

The difference in performance was moderate (22%) given that I did not tune the haskell implementation in any way.

However, I gasped when I observed the RAM utilisation while the haskell program was running: it used 726 MB of RAM!

Compare this to 34 MB that were used by the standard sort program.

I guess it’s time to get acquainted with ghc -prof and friends :-)

Last but not least I’d like to point to a great haskell resource, the Real World Haskell beta book.