Revisiting Guile sorting

Guix, Guile, performances?

Benchmarking is really difficult! Do not take all as-is: the numbers are one data point from few experiments on one hardware. There is no guarantee that my observations also apply for you. Let me know!

What is nice with in-person¹ event as FOSDEM is the boost you get. On mid-December, I played with Guile and the sort procedure… it led to messy notes and snippets of code. Then from a chat with Rob Browning in the FOSDEM cafeteria to discussions with Guile co-maintainer Ludovic Courtès, here we are: a blog post that explores the improvements of the sort procedure.

What is Guile? Guile is an implementation of the Scheme programming language, the one that currently powers Guix. Hence, improving Guile makes Guix better! On my intensive day-to-day use of Guix, some operations appear to me slower than expected. In this post, I give a look to one classical algorithmic piece: sorting list. Diving in Guile plumbing is intimidating, here is a shy naive attempt.

Guile is a optimized Scheme compiler and thus it a complex piece of software. In this post, I am just running trivial examples and the conclusion is: Guile needs more love. Namely,

the good news: still room for improving the performance of Guile,
the bad news: still room for improving the performance of Guile.

The bottleneck when speaking of the software optimization is often the underlying algorithm or the data structure. Since we are considering sorting list, it seems hard to do better than the current implementation: merge sort. We expect the performance (complexity) of the builtin sorting algorithm will be about \(O(n~\log~n)\) in time, where \(n\) is the length of the list. What does it mean? It mean that the time for sorting a list will be dominated by: \(a~n~\log~n + b\), where \(a\) and \(b\) are two constants – constant theoretically independent of the number \(n\) of list items. On my recent machine, I get these numbers for real time,

	10⁴	10⁵	10⁶	10⁷	10⁸
meas.	0.007216	0.091417	1.093392	12.512091	140.828519
est.	ref	0.0902	1.0824	12.628	144.32

where the estimation (est.) reads: \((0.007216/ (10^{4} \log 10^{4}))~n~\log~n\); back to the envelop estimation for the constant \(a\) neglecting \(b\).

Told you! That’s the optimal bound. Done, isn’t it? Really? What about the both hidden constants, especially \(a\)? Well, can we do better? If I write a post about it, the answer is perhaps yes. Implementation matters!

On a side note, we are comparing wall clock time (real time) because that is what the user is facing. And today, most, if not all, machines are multi-cores exploiting multi-threads; user and/or system times would be worth to consider for a finer analysis. Let as an exercise for the reader; please share your findings and point me my potential mistakes.

Back to the basics: merge sorting

For sorting list, Guile implements in C programming language the classical merge sort algorithm; give a look to the details of the builtin sort. Could we do better with an implementation using Scheme rather than an implementation using C? Let scratch our itch.

The merge sorting algorithm is pretty straightforward, in plain English it sounds as: recursively sort each half of the list – the list of one element being obviously sorted – and merge these both sorted halves. Using only Scheme primitives – namely pair?, car, cdr, even? and if, let etc. – a first naive implementation reads,

(define (sort-merge-naive lst less?)
  (let sort-loop ((lst lst))
    (if (pair? lst)
        (if (pair? (cdr lst))
            (let* ((n (len lst))
                   (n (if (even? n) n (- n 1)))
                   (mid (/ n 2)))
              (merge (sort-loop (take lst mid))
                     (sort-loop (drop lst mid))
                     less?))
            lst)
        '())))

Nothing fancy. One important piece here is the procedure merge – attentive reader also notes that the procedure len, take and drop are not builtin procedures, see next section. Because we are trying to improve the performances, and because we are challenging Guile compiler, let implement in Scheme this merge procedure.

;;; Re-implement merge from libguile/sort.c
;;; merge is NOT part of Scheme standard
(define (merge l1 l2 less?)
  (let merge-loop ((l1 l1) (l2 l2))
    (if (pair? l1)
        (if (pair? l2)
            (if (less? (car l1) (car l2))
                (cons (car l1)
                      (merge-loop (cdr l1) l2))
                (cons (car l2)
                      (merge-loop l1 (cdr l2))))
            l1)
        l2)))

Re-implementation of other builtin procedures

Here, since our aim is to challenge Guile Scheme compiler, instead of the length builtin we re-implement the length computation of a list in plain Scheme,

(define (fold proc init lst)
  (let fold-loop ((lst lst) (accu init))
    (if (pair? lst)
        (fold-loop (cdr lst)
                   (proc (car lst) accu))
        accu)))

;;; Re-implement length from libguile/list.c
;;; length is part of Scheme standard
(define (len lst)
  (fold (lambda (_ n) (+ 1 n)) 0 lst))

Similarly, take and drop are Scheme re-implementation of list-head and list-tail,

;;; Re-implement list-tail from libguile/list.c
;;; list-tail is part of Scheme standard
;;; Note: module/srfi/srfi-1.scm defines drop re-using list-tail
(define (drop lst k)
  (let loop ((lst lst) (k k))
    (if (or (zero? k)
            (null? lst))
        lst
        (loop (cdr lst) (- k 1)))))

;;; Re-implement list-head from libguile/list.c
;;; list-head is NOT part of Scheme standard
;;; Note: module/srfi/srfi-1.scm defines take re-using list-head
(define (take lst k)
  (let loop ((lst lst) (k k))
    (if (or (zero? k)
            (null? lst))
        '()
        (cons (car lst)
              (loop (cdr lst) (- k 1))))))

Testing `sort-merge-naive`

The structure of the list to sort should not matter much for merge sorting – although we have not done intensive tests. And the merge sort is expected to be stable but we have not intensively challenged this naive implementation yet; it should be stable. Comparing both, Scheme implemented sort-merge-naive against builtin C implemented sort, the number reads:

Guile	10⁴	10⁵	10⁶	10⁷	10⁸
builtin	0.007216	0.091417	1.093392	12.512091	140.828519
naive	0.008441	0.080331	1.006181	10.202522	111.302971

The good news: the pure Scheme implementation is competitive with the C builtin implementation. Good job Guile! The bad news: the first Scheme implementation at hand is comparable to the builtin… Hum?!

Still, could we do better than this naive Scheme implementation?

Less naive merge sorting Scheme implementation

If we scrutinize the implementation of the merge procedure, then we cannot be happy with the loop over cons; it misses some tail-call optimization. Instead, it seems better to have an accumulator collecting while looping, however it would mean the collected items would be in reverse. Therefore, we would need to return (append (reverse rest) accumulator), and still reverse leads to another unnecessary allocation. Let implement our own append-reverse,

;;; Re-implement append-reverse from libguile/srfi-1.c
(define (rev-append l1 l2)
  (fold cons l2 l1))

Instead of merge, let implement this reverse merge,

(define (rev-merge l1 l2 less?)
  (let loop ((l1 l1) (l2 l2) (accu '()))
    (if (pair? l1)
        (if (pair? l2)
            (if (less? (car l1) (car l2))
                (loop (cdr l1) l2
                      (cons (car l1) accu))
                (loop l1 (cdr l2)
                      (cons (car l2) accu)))
            (rev-append l1 accu))
        (rev-append l2 accu))))

Nothing fancy. In addition, while looping when sorting, instead of descending to empty list and then apply merge for making the effective sorting, we could cut at the list of two elements; or at one list of three elements for the odd case. It saves some cycles.

Last, the call of both take and drop procedures is also not optimal because they also allocate when it could be avoided. Instead, let maintain the length of the list to sort and pass the whole – thanks to Sylvain Conchon and Jean-Christophe Filliâtre who exposed me to that in their book « Apprendre à programmer avec OCaml ».

The whole implementation of sort-merge reads as below – indeed, less concise!

(define (sort-merge lst less?)

  (define (rev-merge l1 l2 less?)
    (let loop ((l1 l1) (l2 l2) (accu '()))
      (if (pair? l1)
          (if (pair? l2)
              (if (less? (car l1) (car l2))
                  (loop (cdr l1) l2
                        (cons (car l1) accu))
                  (loop l1 (cdr l2)
                        (cons (car l2) accu)))
              (rev-append l1 accu))
          (rev-append l2 accu))))

  (define (sorting-loop n lst less?)
    (cond ((= 2 n)
           (let ((x1 (car lst))
                 (x2 (cadr lst))
                 (tl (cddr lst)))
             (if (less? x1 x2)
                 (cons
                  (list x1 x2) tl)
                 (cons
                  (list x2 x1) tl))))
          ((= 3 n)
           (let ((x1 (car lst))
                 (x2 (cadr lst))
                 (x3 (caddr lst))
                 (tl (cdddr lst)))
             (if (less? x1 x2)
                 (if (less? x2 x3)
                     (cons
                      (list x1 x2 x3) tl)
                     (if (less? x1 x3)
                         (cons
                          (list x1 x3 x2) tl)
                         (cons
                          (list x3 x1 x2) tl)))
                 (if (less? x1 x3)
                     (cons
                      (list x2 x1 x3) tl)
                     (if (less? x2 x3)
                         (cons
                          (list x2 x3 x1) tl)
                         (cons
                          (list x3 x2 x1) tl))))))
          (else
           (let* ((n1 (if (even? n) ;How to bit-shift-right of 1?
                          (/ n 2)
                          (/ (- n 1) 2)))
                  (n2 (- n n1))
                  (s1 (sorting-loop n1 lst
                                    (lambda (x y) (less? y x))))
                  (s2 (sorting-loop n2 (cdr s1)
                                    (lambda (x y) (less? y x)))))
             (cons
              (rev-merge (car s1) (car s2)
                         (lambda (x y) (less? y x)))
              (cdr s2))))))

  (let  ((n (len lst)))
    (if (< n 2)
        lst
        (car (sort n lst less?)))))

On a side note, as the comment mentions, the implementation relies on the call ]of even? procedure implemented in C and doing more than we need here; when we could save some cycle by directly using bit shift.

Testing `sort-merge`

Guile	10⁴	10⁵	10⁶	10⁷	10⁸
builtin	0.007216	0.091417	1.093392	12.512091	140.828519
naive	0.008441	0.080331	1.006181	10.202522	111.302971
better	0.004509	0.047166	0.568273	6.096391	62.762179

Bang! Almost 2x speed-up. Remember hidden constants behind \(O(n~\log~n)\) complexity notation? We just divide these constants by almost two.

Calm down, we are seeing a special case because all the elements of the list are integers. Nonetheless, that’s always nice to beat C. Great job Guile!

What about list of user defined type elements?

List of `record`

Let consider a record as, for instance,

(define-record-type <elem>
  (make-elem num stuff)
  elem?
  (num elem-num)
  (stuff elem-stuff))

(define (>=-elem? e1 e2)
  (let ((n1 (+ (elem-num e1) (vector-fold (lambda (i r x) (+ x r))
                                           0
                                           (elem-stuff e1))))
        (n2 (+ (elem-num e2) (vector-fold (lambda (i r x) (+ x r))
                                           0
                                           (elem-stuff e2)))))
    (>= n1 n2)))

where num stores an integer and stuff stores some vector. The comparison table provides less impressive numbers,

Guile	10⁴	10⁵	10⁶	10⁷	10⁸
builtin	0.018883	0.235678	2.985717	38.168515	452.859714
better	0.017849	0.228074	3.152514	42.460664	528.662832

Still! The Scheme implementation of merge sort seems a decent alternative for the current C builtin. Well, my machine is multi-cores and Guile exploits it better for the Scheme implementation than for the C builtin one. It would be interesting to test this Scheme implementation of merge sorting on various hardware.

Since sorting is an essential component, maybe we could help the Guile compiler and see if we could save some cycles, again.

Yet another better Scheme implementation (the best?)

The attentive reader notes the closure (lambda (x y) (less? y x)) in the recursive call sorting-loop above. Well, a lambda is not necessarily a closure, Guile co-maintainer Andy Wingo said. Instead of these lambda, let duplicate the code for removing them and see if it helps the compiler. The recursive procedure sorting-loop above is replaced by two mutual recursive procedures (letrec). Credit the inspiration: the merge sorting stable_sort in the standard library of OCaml.

It leads to these numbers for a list of integers,

better	0.004509	0.047166	0.568273	6.096391	62.762179
best	0.004129	0.039957	0.560440	5.719116	59.779422

Well, that’s not nothing. Faster by almost 5%. What about a list of records?

builtin	0.018883	0.235678	2.985717	38.168515	452.859714
best	0.015103	0.181854	2.610716	32.605146	379.456376

Yeah! On my machine, the real time of this optimized Scheme implementation is faster than the current builtin sort procedure. That’s on my recent laptop and any feedback on various hardware is very welcome.

My Scheme implementation for sorting lists is here.

What about `(ice-9 match)`?

What if instead of plain car and cdr primitives, we would implement all using (ice-9 match) pattern matching?

Wow! To be honest, I was expecting a slowdown but not a not-affordable result. Initially, I wrote some fold* as,

(define (fold* proc init lst)
  (let fold-loop ((lst lst) (accu init))
    (match lst
      ((x tl ...)
       (fold-loop tl
                  (proc x accu)))
      (_
       accu))))

and the computation for a list of \(10^{6}\) elements takes more than 500 seconds; compared to less than 0.01 seconds when equivalently computed using car and cdr. Well, match is defined by a macro, maybe the source of bottleneck. I do not know where Guile performs poorly here and if such poor performance is expected. Maybe between my chair and my keyboard?

If we compare different implementations running on Racket, we get:

Racket	10⁴	10⁵	10⁶	10⁷	10⁸
len-car/cdr	-	-	-	0.012	0.106
len-match	-	-	-	0.414	4.686
best	-	0.009	0.117	2.205	44.955
best-match	-	0.043	0.555	7.267	107.726
builtin	-	0.006	0.076	1.088	13.605

where dash means that the time procedure from Racket does not collect anything. However, the time for starting Racket and then run the compiled code is much longer and it does not appear here – it does not matter much for the discussion since our aim is to compare the performance of compiled code.

Well, I will not speak about the damned really good Racket builtin sorting procedure performances. Maybe the builtin sorting strategy implemented by Racket exploits the pre-sorted structure of the list; as Knuth’s natural sorting or highly-tuned timsort strategy.

We see that match and thus macro-expansion comes with a cost, whatever the Scheme compiler, although my tests appear to me much worse with Guile. Hum, since Guix relies heavily on match (560+ matches of #:use-module (ice-9 match)), one question could be: could this match participate to the feeling of slowness?

TODO for future myself: investigate this penalty when using (ice-9 match).

See Ludo's answer:

Re ‘match’ penalty: when using ellipses in patterns, the generated code checks for “proper lists”, which is O(n). The trick is to instead match a pair:

✔ (match lst ((head . tail) …)) ❎ (match lst ((head tail …) …))

Comparing Scheme compilers: Guile vs Racket vs Chez Scheme

Take this rest of the post with some salt.

The Scheme code that we have is portable enough to run it on different Scheme compilers. The aim here is have a rough idea about Guile performances compared to Racket – an engaging Scheme platform – and compared to Chez Scheme – the fastest Scheme compiler around, to my knowledge. These days, Racket re-uses part of Chez Scheme compiling tower and run-time, for instance as explained in this video.

We re-run the experiments, that’s why timings differ a bit of the ones from above.

	10⁴	10⁵	10⁶	10⁷	10⁸
naive
guile	0.006367	0.0805	0.881	10.04	111.4
racket	-	0.017	0.363	6.28	156.4
chez	0.001855	0.0219	0.280	4.55	98.1

Wow! the naive Scheme implementation showed above and compiled with Chez Scheme runs 1.4x faster than the C Guile builtin sort procedure.

better
guile	0.005499	0.0402	0.509 (0.353)	5.63 (2.84)	60.2 (24.1)
racket	-	0.010	0.170 (0.085)	2.46 (1.57)	48.5 (39.1)
chez	0.001116	0.0128	0.145 (0.063)	2.05 (1.21)	46.5 (34.3)

best
guile	0.004005	0.0370	0.457 (0.280)	5.44 (2.76)	57.66 (23.88)
racket	-	0.010	0.129 (0.064)	2.46 (1.65)	44.67 (37.67)
chez	0.002239	0.0108	0.132 (0.062)	1.98 (1.23)	42.92 (32.59)

where (numbers) between parenthesis is the time spent for garbage collecting the sorting only.

Here I see two things. 1. Guile is slower than Racket or Chez Scheme for the exact same piece code. And the slowdown is heavily significant. The memory model and garbage collection strategy is probably the bottleneck here. Then 2. the memory model and garbage collection strategy of Racket and Chez Scheme scale poorly for a number of items that the British Library claims to hold. It would be worth to re-play using different type as list element.

Well, let revisit this topic about garbage collection strategy when Whippet would land to Guile – if it is already, then I have missed it.

Opinionated conclusion

After playing with Guile, my feelings are mixed. On one hand, the Scheme compiler is improving. And there is room to still improving some internal aspects. For instance, SRFI-13 about strings contains this comment:

/* FIXME::martin: This should definitely get implemented more
   efficiently -- maybe with Knuth-Morris-Pratt, like in the reference
   implementation.  */

and a Scheme implementation of Knuth-Morris-Pratt algorithm is illustrated here. Please note that the Guile manual says: « The SRFI-13 procedures are always available ».

On the other hand, is it possible to improve Guile in such way that then it reaches Racket or Chez Scheme performances? What are the bottlenecks? Is it deep design of Guile? Or is it inefficient implementation? Or maybe a bit of both?

Do not take me wrong. The main message here: improving Guile improves Guix!

On a side note, moving from C implementation to Scheme implementation the builtin procedures should help with code interacting with concurrency as Fibers – some days ago, Chris Baines provided an example when dining – or with WebAssembly as Hoot.

Quick note on benchmarking

When running length then len on a list of integers and on a list of records on two different machines, I see for the real time:

		integers	records
machine A	`length`	0.190704	6.164389
	`len`	0.891214	3.029818
machine B	`length`	0.248980	5.607738
	`len`	0.528639	5.392267

I am confused by two things: 1. Why is the real time so different on one machine between C builtin length and Scheme custom len? And 2. Why is the best for a list of integers not the best for a list of records?

My opinionated conclusion: Guile needs love!

Join the fun, join Guix, join Guile!

Footnotes:

In-person event… although I keep the eyes closed about the environmental costs (CO2, travel, electricity, extra waste, etc.) of such crowdy in-person events. A typical extension of the no free lunch theorem.