Revisiting Guile sorting
Guix, Guile, performance?
Benchmarking is really difficult! Do not take everything as-is: the numbers are one data point from a few experiments on one machine. There is no guarantee that my observations also apply to you. Let me know!
What is nice about an in-person^{1} event such as FOSDEM is the boost you get. In mid-December, I played with Guile and the `sort` procedure… it led to messy notes and snippets of code. Then, from a chat with Rob Browning in the FOSDEM cafeteria to discussions with Guile co-maintainer Ludovic Courtès, here we are: a blog post that explores improvements to the `sort` procedure.
What is Guile? Guile is an implementation of the Scheme programming language, the one that currently powers Guix. Hence, improving Guile makes Guix better! In my intensive day-to-day use of Guix, some operations appear slower than expected. In this post, I take a look at one classical algorithmic piece: sorting lists. Diving into Guile's plumbing is intimidating; here is a shy, naive attempt.
Guile is an optimizing Scheme compiler and thus a complex piece of software. In this post, I am just running trivial examples, and the conclusion is: Guile needs more love. Namely,
- the good news: still room for improving the performance of Guile,
- the bad news: still room for improving the performance of Guile.
The bottleneck, when speaking of software optimization, is often the underlying algorithm or the data structure. Since we are considering sorting lists, it seems hard to do better than the current implementation: merge sort. We expect the performance (complexity) of the builtin sorting algorithm to be about \(O(n~\log~n)\) in time, where \(n\) is the length of the list. What does that mean? It means that the time for sorting a list will be dominated by \(a~n~\log~n + b\), where \(a\) and \(b\) are two constants – constants theoretically independent of the number \(n\) of list items. On my recent machine, I get these numbers for the real time (in seconds),
real time (s) | 10^{4} | 10^{5} | 10^{6} | 10^{7} | 10^{8} |
---|---|---|---|---|---|
meas. | 0.007216 | 0.091417 | 1.093392 | 12.512091 | 140.828519 |
est. | ref | 0.0902 | 1.0824 | 12.628 | 144.32 |
where the estimation (est.) reads \((0.007216 / (10^{4}~\log~10^{4}))~n~\log~n\): a back-of-the-envelope estimation of the constant \(a\), neglecting \(b\), calibrated on the \(10^{4}\) measurement (hence "ref").
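As a sanity check, the est. row can be reproduced with a few lines of Scheme – the constant and the formula are exactly the ones above, nothing more:

```scheme
;;; Back-of-the-envelope estimate: calibrate a on the 10^4 measurement,
;;; a = 0.007216 / (10^4 log 10^4), then est(n) = a * n * log n.
(define (estimate n)
  (let ((a (/ 0.007216 (* 1e4 (log 1e4)))))
    (* a n (log n))))

(estimate 1e5)  ; => ~0.0902
(estimate 1e8)  ; => ~144.32
```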
Told you! That's the optimal bound. Done, isn't it? Really? What about the two hidden constants, especially \(a\)? Well, can we do better? If I am writing a post about it, the answer is perhaps yes. Implementation matters!
On a side note, we are comparing wall-clock time (real time) because that is what the user faces. And today most, if not all, machines are multi-core and exploit multiple threads; user and/or system times would be worth considering for a finer analysis. That is left as an exercise for the reader; please share your findings and point out my potential mistakes.
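For what it is worth, here is a minimal sketch of how such wall-clock numbers can be collected in Guile; the actual benchmark harness is not shown in this post, and `real-time-of` is a made-up helper:

```scheme
;;; Hypothetical helper: wall-clock seconds spent running THUNK.
(define (real-time-of thunk)
  (let ((t0 (get-internal-real-time)))
    (thunk)
    (exact->inexact (/ (- (get-internal-real-time) t0)
                       internal-time-units-per-second))))

;; e.g. time the builtin sort of a reversed list of 10^6 integers
(display (real-time-of (lambda () (sort (reverse (iota 1000000)) <))))
(newline)
```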
Back to the basics: merge sorting
For sorting lists, Guile implements the classical merge sort algorithm in the C programming language; have a look at the details of the builtin `sort`.
Could we do better with an implementation written in Scheme rather than in C? Let's scratch our itch.
The merge sort algorithm is pretty straightforward; in plain English it sounds like: recursively sort each half of the list – a list of one element being obviously sorted – and merge the two sorted halves. Using only Scheme primitives – namely `pair?`, `car`, `cdr`, `even?` and `if`, `let`, etc. – a first naive implementation reads,
```scheme
(define (sort-merge-naive lst less?)
  (let sort-loop ((lst lst))
    (if (pair? lst)
        (if (pair? (cdr lst))
            (let* ((n (len lst))
                   (n (if (even? n) n (- n 1)))
                   (mid (/ n 2)))
              (merge (sort-loop (take lst mid))
                     (sort-loop (drop lst mid))
                     less?))
            lst)
        '())))
```
Nothing fancy. One important piece here is the procedure `merge` – the attentive reader also notes that the procedures `len`, `take` and `drop` are not builtin procedures; see the next section. Because we are trying to improve performance, and because we are challenging the Guile compiler, let's implement this `merge` procedure in Scheme.
```scheme
;;; Re-implement merge from libguile/sort.c
;;; merge is NOT part of the Scheme standard
(define (merge l1 l2 less?)
  (let merge-loop ((l1 l1) (l2 l2))
    (if (pair? l1)
        (if (pair? l2)
            (if (less? (car l1) (car l2))
                (cons (car l1) (merge-loop (cdr l1) l2))
                (cons (car l2) (merge-loop l1 (cdr l2))))
            l1)
        l2)))
```
Re-implementation of other builtin procedures
Here, since our aim is to challenge the Guile Scheme compiler, instead of the `length` builtin we re-implement the length computation of a list in plain Scheme,
```scheme
(define (fold proc init lst)
  (let fold-loop ((lst lst) (accu init))
    (if (pair? lst)
        (fold-loop (cdr lst) (proc (car lst) accu))
        accu)))

;;; Re-implement length from libguile/list.c
;;; length is part of the Scheme standard
(define (len lst)
  (fold (lambda (_ n) (+ 1 n)) 0 lst))
```
Similarly, `take` and `drop` are Scheme re-implementations of `list-head` and `list-tail`,
```scheme
;;; Re-implement list-tail from libguile/list.c
;;; list-tail is part of the Scheme standard
;;; Note: module/srfi/srfi-1.scm defines drop re-using list-tail
(define (drop lst k)
  (let loop ((lst lst) (k k))
    (if (or (zero? k) (null? lst))
        lst
        (loop (cdr lst) (- k 1)))))

;;; Re-implement list-head from libguile/list.c
;;; list-head is NOT part of the Scheme standard
;;; Note: module/srfi/srfi-1.scm defines take re-using list-head
(define (take lst k)
  (let loop ((lst lst) (k k))
    (if (or (zero? k) (null? lst))
        '()
        (cons (car lst) (loop (cdr lst) (- k 1))))))
```
Testing `sort-merge-naive`
The structure of the list to sort should not matter much for merge sort – although we have not done intensive tests. And merge sort is expected to be stable; we have not intensively challenged this naive implementation yet, but it should be stable. Comparing the two, the Scheme-implemented `sort-merge-naive` against the C-implemented builtin `sort`, the numbers read:
Guile | 10^{4} | 10^{5} | 10^{6} | 10^{7} | 10^{8} |
---|---|---|---|---|---|
builtin | 0.007216 | 0.091417 | 1.093392 | 12.512091 | 140.828519 |
naive | 0.008441 | 0.080331 | 1.006181 | 10.202522 | 111.302971 |
The good news: the pure Scheme implementation is competitive with the C builtin implementation. Good job, Guile! The bad news: the first Scheme implementation at hand is comparable to the builtin… Hum?!
Still, could we do better than this naive Scheme implementation?
Less naive merge sorting Scheme implementation
If we scrutinize the implementation of the `merge` procedure, we cannot be happy with the recursive call under `cons`: the `cons` happens after the recursive call, so the call is not in tail position and misses tail-call optimization. It seems better to have an accumulator collecting items while looping; however, the collected items would then be in reverse order. We would need to return `(append (reverse rest) accumulator)`, and `reverse` leads to yet another unnecessary allocation. Let's implement our own `append-reverse`,
```scheme
;;; Re-implement append-reverse from libguile/srfi-1.c
(define (rev-append l1 l2)
  (fold cons l2 l1))
```
Instead of `merge`, let's implement this reverse merge,
```scheme
(define (rev-merge l1 l2 less?)
  (let loop ((l1 l1) (l2 l2) (accu '()))
    (if (pair? l1)
        (if (pair? l2)
            (if (less? (car l1) (car l2))
                (loop (cdr l1) l2 (cons (car l1) accu))
                (loop l1 (cdr l2) (cons (car l2) accu)))
            (rev-append l1 accu))
        (rev-append l2 accu))))
```
Nothing fancy. In addition, while looping during sorting, instead of descending to the empty list and only then applying merge to do the effective sorting, we can cut at lists of two elements – or lists of three elements for the odd case. It saves some cycles.
Last, the calls to the `take` and `drop` procedures are not optimal either, because they allocate when it could be avoided. Instead, let's maintain the length of the list to sort and pass the whole list around – thanks to Sylvain Conchon and Jean-Christophe Filliâtre, who exposed me to that trick in their book « Apprendre à programmer avec OCaml ».
The whole implementation of `sort-merge` reads as below – indeed, less concise!
```scheme
(define (sort-merge lst less?)
  (define (rev-merge l1 l2 less?)
    (let loop ((l1 l1) (l2 l2) (accu '()))
      (if (pair? l1)
          (if (pair? l2)
              (if (less? (car l1) (car l2))
                  (loop (cdr l1) l2 (cons (car l1) accu))
                  (loop l1 (cdr l2) (cons (car l2) accu)))
              (rev-append l1 accu))
          (rev-append l2 accu))))
  (define (sorting-loop n lst less?)
    (cond
     ((= 2 n)
      (let ((x1 (car lst))
            (x2 (cadr lst))
            (tl (cddr lst)))
        (if (less? x1 x2)
            (cons (list x1 x2) tl)
            (cons (list x2 x1) tl))))
     ((= 3 n)
      (let ((x1 (car lst))
            (x2 (cadr lst))
            (x3 (caddr lst))
            (tl (cdddr lst)))
        (if (less? x1 x2)
            (if (less? x2 x3)
                (cons (list x1 x2 x3) tl)
                (if (less? x1 x3)
                    (cons (list x1 x3 x2) tl)
                    (cons (list x3 x1 x2) tl)))
            (if (less? x1 x3)
                (cons (list x2 x1 x3) tl)
                (if (less? x2 x3)
                    (cons (list x2 x3 x1) tl)
                    (cons (list x3 x2 x1) tl))))))
     (else
      (let* ((n1 (if (even? n)          ;How to bit-shift-right of 1?
                     (/ n 2)
                     (/ (- n 1) 2)))
             (n2 (- n n1))
             (s1 (sorting-loop n1 lst (lambda (x y) (less? y x))))
             (s2 (sorting-loop n2 (cdr s1) (lambda (x y) (less? y x)))))
        (cons (rev-merge (car s1) (car s2) (lambda (x y) (less? y x)))
              (cdr s2))))))
  (let ((n (len lst)))
    (if (< n 2)
        lst
        (car (sorting-loop n lst less?)))))
```
On a side note, as the comment mentions, the implementation relies on a call to the `even?` procedure, which is implemented in C and does more than we need here, when we could save some cycles by directly using a bit shift.
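For instance, Guile provides `ash` for arithmetic shifts; halving a non-negative integer, which is all the `even?` dance above computes, could then read:

```scheme
;;; (ash n -1) shifts n right by one bit, i.e. floor(n/2) for a
;;; non-negative n -- no even?/odd? test needed.
(ash 7 -1)   ; => 3
(ash 10 -1)  ; => 5
```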
Testing `sort-merge`
Guile | 10^{4} | 10^{5} | 10^{6} | 10^{7} | 10^{8} |
---|---|---|---|---|---|
builtin | 0.007216 | 0.091417 | 1.093392 | 12.512091 | 140.828519 |
naive | 0.008441 | 0.080331 | 1.006181 | 10.202522 | 111.302971 |
better | 0.004509 | 0.047166 | 0.568273 | 6.096391 | 62.762179 |
Bang! Almost a 2x speed-up. Remember the hidden constants behind the \(O(n~\log~n)\) complexity notation? We just divided these constants by almost two.
Calm down: we are looking at a special case, because all the elements of the list are integers. Nonetheless, it is always nice to beat C. Great job, Guile! What about lists whose elements have a user-defined type?
List of records
Let's consider a record such as, for instance,
```scheme
(define-record-type <elem>
  (make-elem num stuff)
  elem?
  (num elem-num)
  (stuff elem-stuff))

(define (>=-elem? e1 e2)
  (let ((n1 (+ (elem-num e1)
               (vector-fold (lambda (i r x) (+ x r)) 0 (elem-stuff e1))))
        (n2 (+ (elem-num e2)
               (vector-fold (lambda (i r x) (+ x r)) 0 (elem-stuff e2)))))
    (>= n1 n2)))
```
where `num` stores an integer and `stuff` stores some vector. The comparison table provides less impressive numbers,
Guile | 10^{4} | 10^{5} | 10^{6} | 10^{7} | 10^{8} |
---|---|---|---|---|---|
builtin | 0.018883 | 0.235678 | 2.985717 | 38.168515 | 452.859714 |
better | 0.017849 | 0.228074 | 3.152514 | 42.460664 | 528.662832 |
Still! The Scheme implementation of merge sort seems a decent alternative to the current C builtin. Well, my machine is multi-core, and Guile exploits it better for the Scheme implementation than for the C builtin one. It would be interesting to test this Scheme implementation of merge sort on various hardware.
Since sorting is an essential component, maybe we could help the Guile compiler and see if we could save some cycles, again.
Yet another better Scheme implementation (the best?)
The attentive reader notes the closure `(lambda (x y) (less? y x))` in the recursive call to `sorting-loop` above. Well, a lambda is not necessarily a closure, as Guile co-maintainer Andy Wingo said. Instead of these `lambda`s, let's duplicate the code to remove them and see if that helps the compiler. The recursive procedure `sorting-loop` above is replaced by two mutually recursive procedures (`letrec`). Credit for the inspiration: the merge sort `stable_sort` in the standard library of OCaml.
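Here is a rough, runnable sketch of that shape – a reconstruction, not the exact code, and simplified: the base cases on two and three elements are collapsed to one element, so it is less optimized than the real thing. All the names (`sort-merge-best`, `sort-asc`, `sort-dsc`, …) are made up for this sketch:

```scheme
;;; Sketch of the lambda-free structure: two mutually recursive
;;; procedures, one producing an ascending run and one a descending
;;; run, so no comparator-flipping closure is allocated per level.
(define (sort-merge-best lst less?)
  (define (rev-append l1 l2)
    (if (pair? l1) (rev-append (cdr l1) (cons (car l1) l2)) l2))
  ;; merge two ascending runs into one descending run
  (define (rev-merge-asc l1 l2)
    (let loop ((l1 l1) (l2 l2) (accu '()))
      (cond ((null? l1) (rev-append l2 accu))
            ((null? l2) (rev-append l1 accu))
            ((less? (car l1) (car l2)) (loop (cdr l1) l2 (cons (car l1) accu)))
            (else (loop l1 (cdr l2) (cons (car l2) accu))))))
  ;; merge two descending runs into one ascending run
  (define (rev-merge-dsc l1 l2)
    (let loop ((l1 l1) (l2 l2) (accu '()))
      (cond ((null? l1) (rev-append l2 accu))
            ((null? l2) (rev-append l1 accu))
            ((less? (car l2) (car l1)) (loop (cdr l1) l2 (cons (car l1) accu)))
            (else (loop l1 (cdr l2) (cons (car l2) accu))))))
  (letrec ((sort-asc                 ; returns (ascending-run . rest)
            (lambda (n lst)
              (if (= n 1)
                  (cons (list (car lst)) (cdr lst))
                  (let* ((n1 (ash n -1))
                         (s1 (sort-dsc n1 lst))
                         (s2 (sort-dsc (- n n1) (cdr s1))))
                    (cons (rev-merge-dsc (car s1) (car s2)) (cdr s2))))))
           (sort-dsc                 ; returns (descending-run . rest)
            (lambda (n lst)
              (if (= n 1)
                  (cons (list (car lst)) (cdr lst))
                  (let* ((n1 (ash n -1))
                         (s1 (sort-asc n1 lst))
                         (s2 (sort-asc (- n n1) (cdr s1))))
                    (cons (rev-merge-asc (car s1) (car s2)) (cdr s2)))))))
    (if (null? lst) '() (car (sort-asc (length lst) lst)))))

(sort-merge-best '(3 1 2) <)  ; => (1 2 3)
```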
It leads to these numbers for a list of integers,
Guile | 10^{4} | 10^{5} | 10^{6} | 10^{7} | 10^{8} |
---|---|---|---|---|---|
better | 0.004509 | 0.047166 | 0.568273 | 6.096391 | 62.762179 |
best | 0.004129 | 0.039957 | 0.560440 | 5.719116 | 59.779422 |
Well, that’s not nothing. Faster by almost 5%. What about a list of records?
Guile | 10^{4} | 10^{5} | 10^{6} | 10^{7} | 10^{8} |
---|---|---|---|---|---|
builtin | 0.018883 | 0.235678 | 2.985717 | 38.168515 | 452.859714 |
best | 0.015103 | 0.181854 | 2.610716 | 32.605146 | 379.456376 |
Yeah! On my machine, the real time of this optimized Scheme implementation beats the current builtin `sort` procedure. That's on my recent laptop, and any feedback on various hardware is very welcome.
My Scheme implementation for sorting lists is here.
What about (ice-9 match)?
What if, instead of the plain `car` and `cdr` primitives, we implemented everything using `(ice-9 match)` pattern matching?
Wow! To be honest, I was expecting a slowdown, but not an unaffordable result. Initially, I wrote some `fold*` as,
```scheme
(define (fold* proc init lst)
  (let fold-loop ((lst lst) (accu init))
    (match lst
      ((x tl ...) (fold-loop tl (proc x accu)))
      (_ accu))))
```
and the computation for a list of \(10^{6}\) elements takes more than 500 seconds, compared to less than 0.01 seconds when computed equivalently using `car` and `cdr`. Well, `match` is defined by a macro; maybe that is the source of the bottleneck. I do not know where Guile performs poorly here, nor whether such poor performance is expected. Maybe the problem sits between my chair and my keyboard?
If we compare different implementations running on Racket, we get:
Racket | 10^{4} | 10^{5} | 10^{6} | 10^{7} | 10^{8} |
---|---|---|---|---|---|
len-car/cdr | - | - | - | 0.012 | 0.106 |
len-match | - | - | - | 0.414 | 4.686 |
best | - | 0.009 | 0.117 | 2.205 | 44.955 |
best-match | - | 0.043 | 0.555 | 7.267 | 107.726 |
builtin | - | 0.006 | 0.076 | 1.088 | 13.605 |
where a dash means that the `time` procedure from Racket does not collect anything. However, the time for starting Racket and then running the compiled code is much longer, and it does not appear here – that does not matter much for the discussion, since our aim is to compare the performance of compiled code.
Well, I will not speak about the damned good performance of the Racket builtin sorting procedure. Maybe the builtin sorting strategy implemented by Racket exploits the pre-sorted structure of the list, as Knuth's natural merge sort or the highly-tuned timsort strategy do.
We see that `match`, and thus macro expansion, comes with a cost whatever the Scheme compiler, although my tests appear much worse with Guile. Hum, since Guix relies heavily on `match` (560+ matches of `#:use-module (ice-9 match)`), one question could be: does this `match` contribute to the feeling of slowness?
TODO for future me: investigate this penalty when using `(ice-9 match)`.
See Ludo's answer:
Re ‘match’ penalty: when using ellipses in patterns, the generated code checks for “proper lists”, which is O(n). The trick is to instead match a pair:
```scheme
✔ (match lst ((head . tail) …))
❎ (match lst ((head tail …) …))
```
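Applying that trick to the `fold*` above, the ellipsis pattern `(x tl ...)` simply becomes a pair pattern:

```scheme
(use-modules (ice-9 match))

;;; Same fold* as before, but matching a pair instead of an ellipsis
;;; pattern: no O(n) proper-list check in the generated code.
(define (fold* proc init lst)
  (let fold-loop ((lst lst) (accu init))
    (match lst
      ((x . tl) (fold-loop tl (proc x accu)))
      (_ accu))))

(fold* + 0 '(1 2 3))  ; => 6
```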
Comparing Scheme compilers: Guile vs Racket vs Chez Scheme
Take the rest of this post with a grain of salt.
The Scheme code that we have is portable enough to run on different Scheme compilers. The aim here is to get a rough idea of Guile's performance compared to Racket – an engaging Scheme platform – and to Chez Scheme – the fastest Scheme compiler around, to my knowledge. These days, Racket re-uses part of the Chez Scheme compilation tower and runtime, as explained for instance in this video.
We re-ran the experiments; that is why the timings differ a bit from the ones above.
naive | 10^{4} | 10^{5} | 10^{6} | 10^{7} | 10^{8} |
---|---|---|---|---|---|
guile | 0.006367 | 0.0805 | 0.881 | 10.04 | 111.4 |
racket | - | 0.017 | 0.363 | 6.28 | 156.4 |
chez | 0.001855 | 0.0219 | 0.280 | 4.55 | 98.1 |
Wow! The naive Scheme implementation shown above and compiled with Chez Scheme runs 1.4x faster than the C Guile builtin `sort` procedure.
better | |||||
guile | 0.005499 | 0.0402 | 0.509 (0.353) | 5.63 (2.84) | 60.2 (24.1) |
racket | - | 0.010 | 0.170 (0.085) | 2.46 (1.57) | 48.5 (39.1) |
chez | 0.001116 | 0.0128 | 0.145 (0.063) | 2.05 (1.21) | 46.5 (34.3) |
best | |||||
guile | 0.004005 | 0.0370 | 0.457 (0.280) | 5.44 (2.76) | 57.66 (23.88) |
racket | - | 0.010 | 0.129 (0.064) | 2.46 (1.65) | 44.67 (37.67) |
chez | 0.002239 | 0.0108 | 0.132 (0.062) | 1.98 (1.23) | 42.92 (32.59) |
where the numbers between parentheses are the time spent garbage collecting during the sorting only.
Here I see two things. 1. Guile is slower than Racket or Chez Scheme for the exact same piece of code, and the slowdown is heavily significant; the memory model and garbage-collection strategy are probably the bottleneck here. 2. The memory models and garbage-collection strategies of Racket and Chez Scheme scale poorly for a number of items that the British Library claims to hold. It would be worth re-playing with different types as list elements.
Well, let's revisit this topic about garbage-collection strategy when Whippet lands in Guile – if it already has, then I have missed it.
Opinionated conclusion
After playing with Guile, my feelings are mixed. On one hand, the Scheme compiler is improving. And there is still room to improve some internal aspects. For instance, SRFI-13 about strings contains this comment:
```c
/* FIXME::martin: This should definitely get implemented more
   efficiently -- maybe with Knuth-Morris-Pratt, like in the
   reference implementation.  */
```
and a Scheme implementation of the Knuth-Morris-Pratt algorithm is illustrated here. Please note that the Guile manual says: « The SRFI-13 procedures are always available ».
On the other hand, is it possible to improve Guile in such a way that it reaches Racket or Chez Scheme performance? What are the bottlenecks? Is it the deep design of Guile? Is it an inefficient implementation? Or maybe a bit of both?
Do not get me wrong. The main message here: improving Guile improves Guix!
On a side note, moving builtin procedures from C implementations to Scheme implementations should help code interacting with concurrency, as with Fibers – some days ago, Chris Baines provided an example while we were dining – or with WebAssembly, as with Hoot.
Quick note on benchmarking
When running `length` then `len` on a list of integers and on a list of records, on two different machines, I see for the real time:
machine | proc | integers | records |
---|---|---|---|
machine A | `length` | 0.190704 | 6.164389 |
 | `len` | 0.891214 | 3.029818 |
machine B | `length` | 0.248980 | 5.607738 |
 | `len` | 0.528639 | 5.392267 |
I am confused by two things: 1. Why is the real time so different on one machine between the C builtin `length` and the Scheme custom `len`? And 2. Why is the best for a list of integers not the best for a list of records?
My opinionated conclusion: Guile needs love!
Footnotes:
In-person events… although I keep my eyes closed to the environmental costs (CO2, travel, electricity, extra waste, etc.) of such crowded in-person events. A typical extension of the no-free-lunch theorem.