From 414696f9962097eeade51f94314cfd29e365e628 Mon Sep 17 00:00:00 2001 From: David Thompson Date: Thu, 22 Feb 2024 11:39:48 -0500 Subject: Add optimizing guile post. --- posts/2024-02-25-optimizing-guile.md | 986 +++++++++++++++++++++++++++++++++++ 1 file changed, 986 insertions(+) create mode 100644 posts/2024-02-25-optimizing-guile.md diff --git a/posts/2024-02-25-optimizing-guile.md b/posts/2024-02-25-optimizing-guile.md new file mode 100644 index 0000000..f250e6f --- /dev/null +++ b/posts/2024-02-25-optimizing-guile.md @@ -0,0 +1,986 @@ +title: Optimizing Guile Scheme +date: 2024-02-26 08:30:00 +tags: guile, scheme +summary: An overview of how I optimize Guile code with examples +--- + +[Guile](https://gnu.org/software/guile) is a rather niche language +that I love dearly. Guile is a Scheme dialect that features an +advanced optimizing bytecode compiler, a JIT compiler, and a modest +set of developer tools for inspecting and debugging. Through my time +spent developing [Chickadee](/projects/chickadee.html), a game +programming library, I have gotten quite familiar with how to get the +most out of Guile in terms of performance. Every now and then I share +a tip or two with someone on IRC or the fediverse and think “I should +blog about this” so now I’m finally doing that. These tips are quite +simple and apply to optimizing any dynamic language. The only +difference is that there isn’t much in the way of helpful examples +specifically for Guile… until now. + +Scheme is a dynamic language, which means that there is a limited +amount of compile-time information that can be used by Guile to +optimize the resulting bytecode. When we put on our optimizer hat, +our job is to give the compiler a hand so the optimization passes can +do their thing. I should stress that the level of code scrutiny we’re +about to get into is usually unnecessary and the result doesn’t always +look like the beautiful, functional Scheme you may be used to. +However, most programs have some core loop or kernel, a small piece of +the larger program, that would be benefit from being optimized to its +fullest. In Chickadee, the most performance sensitive code is in the +graphics layer, where lots of floating point math happens. + +### Rule 1: Don’t allocate + +If you can avoid allocation, you will probably have at least decent +throughput without doing much else. Some allocations are explicit; +`(vector 1 2 3)` clearly allocates a vector. Other allocations are +implicit; `(+ x 1)` may or may not allocate depending on the value of +`x`. + +If `x` is `42` then there is no allocation because the result, `43`, +is in the fixnum range (which is `[-2^63, 2^63)` on 64-bit machines) +which Guile stores as an “immediate” value which is not heap +allocated. However, if `x` is `42.0` then Guile will allocate a float +on the heap to store the result `43.0`. Did you know that floats were +heap allocated in Guile? I didn’t when I was getting started! All +numbers besides fixnums are heap allocated. + +Now that you know the hard truth about Guile’s floats, you might think +that math is doomed to be slow on Guile; that any realtime graphics +program will be a stuttery mess. Keep reading and I will explain why +this isn’t the case! + +### Rule 2: Prefer monomorphic over polymorphic + +The base Scheme environment mostly provides monomorphic procedures; +`append` is for lists, `string-append` is for strings, etc. The big +exception to this rule is the numeric tower. While beautiful, it can +be a hinderance to performant code. All of the arithmetic operators +are polymorphic; `+` adds any two numbers together and there are many +types of numbers. + +Compiled as-is, it means that multiple dispatch on the operands needs +to happen at runtime to determine that which specialized “add $type-a +and $type-b” routine needs to be called. + +The R6RS specification introduced monomorphic procedures for fixnums +and floats such as `fx+` and `fl+`. These procedures remove the +overhead of generic dispatching, but they don't help with the +allocation problem; Without a sufficiently advanced compiler, `(fx* +(fx+ x y) z)` will allocate a new float to hold the intermediate +result of `fx+` that gets thrown away after the `fx*` call. But I +wouldn’t be writing this if Guile *didn’t* have a sufficiently +advanced compiler! + +### Why not both? + +We can write numeric code that is both specialized and allocates +minimally. Guile’s compiler performs a *type inference* pass on our +code and will specialize numeric operations wherever possible. For +example, if Guile can prove that all the variables involved in `(* (+ +x y) z)` are floats, it will optimize the resulting bytecode so that: + +* The floats within `x`, `y`, and `z` are used directly. +* `+` and `*` are compiled to specialized `fadd` and `fmul` primitives. +* The intermediate result of `(+ x y)` does not allocate a new heap + object. + +This is called *unboxing*. Imagine every Scheme value as an object +stored inside a little box. Unboxing means removing some objects from +their respective boxes, performing some sequence of operations on them +*without* storing each intermediate result in a throwaway box, and +then putting the final result into a new box. Unboxing is how we we +can satisfy both of our optimization rules for numeric code. + +Unboxed floating point math is what allows Chickadee to do things like +render thousands of sprites at 60 frames per second without constant +GC-related stutter. + +### The tools + +To optimize effectively, we need tools to help us identify problematic +code and tools to validate that our changes are improving things. The +most essential tools I use are accessible via REPL commands: + +* `,profile`: Evaluate an expression in the context of `statprof` and + print the results. +* `,disassemble`: Print the bytecode disassembly of a procedure. + +An additional tool that does not have it’s own REPL command is +`gcprof`, which is a profiler that can help identify code that most +frequently triggers garbage collection. I won’t be using it here but +you should know it exists. + +Now, let’s get into some examples and walk through optimizing each +one. + +### Example 1: Variadic arguments + +It’s common in Scheme for procedures to handle an arbitrary number of +arguments. For example, the `map` procedure can process as many lists +as you throw at it; `(map + '(1 2 3) '(4 5 6) '(7 8 9))` produces the +result `(12 15 18)`. + +Supporting an arbitrary number of arguments makes for flexible +interfaces, but a naive implementation will cause excessive GC churn +in the common case where only a few arguments are passed. + +Let’s analyze a contrived example. The following procedure computes +the average of all arguments: + +```scheme +(use-modules (srfi srfi-1)) + +(define (average . args) + (/ (fold + 0 args) (length args))) +``` + +Let's profile it and see how well it performs: + +```scheme +scheme@(guile-user)> ,profile (let lp ((i 0)) + (when (< i 100000000) + (average 1 2 3) + (lp (+ i 1)))) +% cumulative self +time seconds seconds procedure + 31.99 13.68 4.43 :1918:16:average + 23.43 7.94 3.25 srfi/srfi-1.scm:452:2:fold + 22.73 3.15 3.15 + + 8.22 1.14 1.14 length + 5.94 0.82 0.82 list? + 5.24 0.73 0.73 procedure? + 1.22 13.85 0.17 :1979:9 + 1.22 0.17 0.17 %after-gc-thunk + 0.00 0.17 0.00 anon #x19675c0 +--- +Sample count: 572 +Total time: 13.853321979 seconds (6.297763116 seconds in GC) +``` + +Nearly half of our time was spent in GC. Let's find out why by taking +a look at the disassembly: + +``` +scheme@(guile-user)> ,disassemble average +Disassembly of # at #x1a9cbd0: + + 0 (instrument-entry 240) at (unknown file):1918:16 + 2 (assert-nargs-ge 1) + 3 (bind-rest 1) ;; 2 slots + 4 (alloc-frame 9) ;; 9 slots + 5 (static-ref 8 189) ;; # at (unknown file):1919:6 + 7 (immediate-tag=? 8 7 0) ;; heap-object? + 9 (je 9) ;; -> L1 + 10 (static-ref 8 162) ;; # + 12 (static-ref 6 192) ;; fold + 14 (call-scm<-scm-scm 8 8 6 111) ;; lookup-bound + 16 (static-set! 8 178) ;; # +L1: + 18 (scm-ref/immediate 8 8 1) + 19 (static-ref 6 187) ;; #> at (unknown file):1919:11 + 21 (immediate-tag=? 6 7 0) ;; heap-object? + 23 (je 7) ;; -> L2 + 24 (call-scm<-scmn-scmn 6 194 198 113);; lookup-bound-private + 28 (static-set! 6 178) ;; #> +L2: + 30 (scm-ref/immediate 2 6 1) + 31 (make-immediate 1 2) ;; 0 at (unknown file):1919:13 + 32 (mov 3 8) at (unknown file):1919:5 + 33 (mov 0 7) + 34 (handle-interrupts) + 35 (call 5 4) + 37 (receive 0 5 9) + 39 (static-ref 6 191) ;; #> at (unknown file):1919:21 + 41 (immediate-tag=? 6 7 0) ;; heap-object? + 43 (je 7) ;; -> L3 + 44 (call-scm<-scmn-scmn 6 174 188 113);; lookup-bound-private + 48 (static-set! 6 182) ;; #> +L3: + 50 (scm-ref/immediate 4 6 1) + 51 (mov 3 7) + 52 (handle-interrupts) + 53 (call 4 2) + 55 (receive 1 4 9) + 57 (call-scm<-scm-scm 8 8 7 5) ;; div at (unknown file):1919:2 + 59 (reset-frame 1) ;; 1 slot + 60 (handle-interrupts) + 61 (return-values) +``` + +Note instruction 3, `bind-rest`. The Guile manual says: + +> Instruction: bind-rest f24:DST +> +> Collect any arguments at or above DST into a list, and store that +> list at DST. + +So, for each call, a sequence of pairs is allocated to hold all of the +arguments. That's probably where a lot of our allocation is coming +from. To optimize this, let’s first assume that `average` is +typically called with 3 arguments or less. It would be great if we +could make these common cases fast while still allowing the +flexibility of passing an arbitrary number of arguments. To do this, +we’ll use `case-lambda`: + +```scheme +(define average + (case-lambda + (() 0) + ((x) x) + ((x y) (/ (+ x y) 2)) + ((x y z) (/ (+ x y z) 3)) + ;; ... and so on, add as many cases as you'd like! + (args + (/ (fold + 0 args) (length args))))) +``` + +Let’s re-run the profiler to see if this is actually better: + +``` +% cumulative self +time seconds seconds procedure + 76.47 0.63 0.63 :2055:2:average + 23.53 0.82 0.19 :2073:9 +--- +Sample count: 51 +Total time: 0.82462725 seconds (0.0 seconds in GC) +``` + +I'd say that nearly 17x faster with no GC is an improvement! + +Let’s see what's changed in the disassembly: + +``` +scheme@(guile-user)> ,disassemble average +Disassembly of # at #x1ab4c70: + + 0 (instrument-entry 278) at (unknown file):2055:2 + 2 (arguments<=? 1) + 3 (jne 6) ;; -> L1 + 4 (alloc-frame 9) ;; 9 slots + 5 (make-immediate 8 2) ;; 0 at (unknown file):2056:8 + 6 (reset-frame 1) ;; 1 slot + 7 (handle-interrupts) + 8 (return-values) +L1: + 9 (arguments<=? 2) + 10 (jne 6) ;; -> L2 + 11 (alloc-frame 9) ;; 9 slots + 12 (mov 8 7) + 13 (reset-frame 1) ;; 1 slot + 14 (handle-interrupts) + 15 (return-values) +L2: + 16 (arguments<=? 3) + 17 (jne 10) ;; -> L3 + 18 (alloc-frame 9) ;; 9 slots + 19 (call-scm<-scm-scm 8 7 6 0) ;; add at (unknown file):2058:14 + 21 (make-immediate 7 10) ;; 2 at (unknown file):2058:22 + 22 (call-scm<-scm-scm 8 8 7 5) ;; div at (unknown file):2058:11 + 24 (reset-frame 1) ;; 1 slot + 25 (handle-interrupts) + 26 (return-values) +L3: + 27 (arguments<=? 4) + 28 (jne 12) ;; -> L4 + 29 (alloc-frame 9) ;; 9 slots + 30 (call-scm<-scm-scm 8 7 6 0) ;; add at (unknown file):2059:16 + 32 (call-scm<-scm-scm 8 8 5 0) ;; add + 34 (make-immediate 7 14) ;; 3 at (unknown file):2059:26 + 35 (call-scm<-scm-scm 8 8 7 5) ;; div at (unknown file):2059:13 + 37 (reset-frame 1) ;; 1 slot + 38 (handle-interrupts) + 39 (return-values) +L4: + 40 (assert-nargs-ge 1) + 41 (bind-rest 1) ;; 2 slots + 42 (alloc-frame 9) ;; 9 slots + 43 (static-ref 8 189) ;; #f at (unknown file):2061:9 + 45 (immediate-tag=? 8 7 0) ;; heap-object? + 47 (je 9) ;; -> L5 + 48 (static-ref 8 162) ;; # + 50 (static-ref 6 192) ;; fold + 52 (call-scm<-scm-scm 8 8 6 111) ;; lookup-bound + 54 (static-set! 8 178) ;; #f +L5: + 56 (scm-ref/immediate 8 8 1) + 57 (static-ref 6 187) ;; #f at (unknown file):2061:14 + 59 (immediate-tag=? 6 7 0) ;; heap-object? + 61 (je 7) ;; -> L6 + 62 (call-scm<-scmn-scmn 6 194 198 113);; lookup-bound-private + 66 (static-set! 6 178) ;; #f +L6: + 68 (scm-ref/immediate 2 6 1) + 69 (make-immediate 1 2) ;; 0 at (unknown file):2061:16 + 70 (mov 3 8) at (unknown file):2061:8 + 71 (mov 0 7) + 72 (handle-interrupts) + 73 (call 5 4) + 75 (receive 0 5 9) + 77 (static-ref 6 191) ;; #f at (unknown file):2061:24 + 79 (immediate-tag=? 6 7 0) ;; heap-object? + 81 (je 7) ;; -> L7 + 82 (call-scm<-scmn-scmn 6 174 188 113);; lookup-bound-private + 86 (static-set! 6 182) ;; #f +L7: + 88 (scm-ref/immediate 4 6 1) + 89 (mov 3 7) + 90 (handle-interrupts) + 91 (call 4 2) + 93 (receive 1 4 9) + 95 (call-scm<-scm-scm 8 8 7 5) ;; div at (unknown file):2061:5 + 97 (reset-frame 1) ;; 1 slot + 98 (handle-interrupts) + 99 (return-values) +``` + +There are more instructions now, but the branches for the known arity +cases do not contain a `bind-rest` instruction. Only branch `L4`, the +one that handles the final clause of the `case-lambda`, uses +`bind-rest`. + +### Example 2: Floating point math + +> “Nothing brings fear to my heart more than a floating point number.” +> +> — [Gerald Sussman](https://youtu.be/HB5TrK7A4pI?t=672) + +Programs that need to crunch numbers in realtime, such as games, rely +on floating point numbers. Dedicated hardware in the form of FPUs and +GPUs make them essential for gettin’ math done quick and so we put up +with their black magic. + +Consider the following code that calculates the magnitude of a 2D +vector: + +```scheme +(define (magnitude x y) + (sqrt (+ (* x x) (* y y)))) +``` + +Would you believe me if I told you the bytecode is less than perfect? + +```scheme +scheme@(guile-user)> ,disassemble magnitude +Disassembly of # at #x1a3fad8: + + 0 (instrument-entry 84) at (unknown file):2106:16 + 2 (assert-nargs-ee/locals 3 0) ;; 3 slots (2 args) + 3 (call-scm<-scm-scm 2 1 1 4) ;; mul at (unknown file):2107:11 + 5 (call-scm<-scm-scm 1 0 0 4) ;; mul at (unknown file):2107:19 + 7 (call-scm<-scm-scm 2 2 1 0) ;; add at (unknown file):2107:8 + 9 (call-scm<-scm 2 2 68) ;; sqrt at (unknown file):2107:2 + 11 (reset-frame 1) ;; 1 slot + 12 (handle-interrupts) + 13 (return-values) +``` + +Note the `call-scm<-scm-scm` instructions calling generic math +primitives `mul` and `add`. + +```scheme +scheme@(guile-user)> ,profile (let lp ((i 0)) + (when (< i 100000000) + (magnitude 3.0 4.0) + (lp (+ i 1)))) +% cumulative self +time seconds seconds procedure + 85.12 26.94 24.50 :13:16:magnitude + 8.48 2.44 2.44 %after-gc-thunk + 6.40 28.79 1.84 :21:9 + 0.00 2.44 0.00 anon #x1e9e5c0 +--- +Sample count: 672 +Total time: 28.786191396 seconds (26.349479685 seconds in GC) +``` + +Oof, nearly all of our time is spent in GC! + +To fix this, we need to constrain our inputs by using predicates to +guard the path to the numeric code. This will inform Guile that +certain types of numbers will never reach this branch and allow the +compiler to choose more specialized primitives. If we’re okay with +only working with floats (we are) then we should constrain our +procedure accordingly: + +```scheme +(define (magnitude x y) + (unless (and (real? x) (inexact? x) + (real? y) (inexact? y)) + (error "expected floats" x y)) + (sqrt (+ (* x x) (* y y)))) +``` + +And the stats: + +```scheme +% cumulative self +time seconds seconds procedure + 82.73 4.13 4.06 :177:16:magnitude + 15.83 4.91 0.78 :187:9 + 1.44 0.07 0.07 %after-gc-thunk + 0.00 0.07 0.00 anon #x1e9e5c0 +--- +Sample count: 139 +Total time: 4.909505945 seconds (3.970948419 seconds in GC) +``` + +Our code now runs about 6x faster, but GC is still taking up most of +that time. Let's examine the disassembly: + +``` +Disassembly of # at #x1f41378: + + 0 (instrument-entry 206) at (unknown file):177:16 + 2 (assert-nargs-ee/locals 3 4) ;; 7 slots (2 args) + 3 (immediate-tag=? 5 3 2) ;; fixnum? at (unknown file):178:15 + 5 (je 10) ;; -> L1 + 6 (immediate-tag=? 5 7 0) ;; heap-object? + 8 (jne 54) ;; -> L3 + 9 (heap-tag=? 5 127 23) ;; heap-number? + 11 (jne 51) ;; -> L3 + 12 (heap-tag=? 5 4095 791) ;; compnum? + 14 (je 48) ;; -> L3 +L1: + 15 (immediate-tag=? 5 3 2) ;; fixnum? at (unknown file):178:25 + 17 (je 45) ;; -> L3 + 18 (heap-tag=? 5 4095 535) ;; flonum? + 20 (jne 42) ;; -> L3 + 21 (immediate-tag=? 4 3 2) ;; fixnum? at (unknown file):179:15 + 23 (je 10) ;; -> L2 + 24 (immediate-tag=? 4 7 0) ;; heap-object? + 26 (jne 36) ;; -> L3 + 27 (heap-tag=? 4 127 23) ;; heap-number? + 29 (jne 33) ;; -> L3 + 30 (heap-tag=? 4 4095 791) ;; compnum? + 32 (je 30) ;; -> L3 +L2: + 33 (immediate-tag=? 4 3 2) ;; fixnum? at (unknown file):179:25 + 35 (je 27) ;; -> L3 + 36 (heap-tag=? 4 4095 535) ;; flonum? + 38 (jne 24) ;; -> L3 + 39 (call-f64<-scm 6 5 17) ;; scm->f64 at (unknown file):181:11 + 41 (fmul 6 6 6) + 42 (call-f64<-scm 5 4 17) ;; scm->f64 at (unknown file):181:19 + 44 (fmul 5 5 5) + 45 (fadd 6 6 5) at (unknown file):181:8 + 46 (call-f64<-f64 6 6 70) at (unknown file):181:2 + 48 (allocate-pointerless-words/immediate 5 2) + 49 (load-u64 4 0 535) + 52 (word-set!/immediate 5 0 4) + 53 (tail-pointer-ref/immediate 4 5 1) + 54 (load-u64 3 0 0) + 57 (f64-set! 4 3 6) + 58 (mov 6 5) + 59 (reset-frame 1) ;; 1 slot + 60 (handle-interrupts) + 61 (return-values) +L3: + 62 (static-ref 6 134) ;; misc-error at (unknown file):180:4 + 64 (make-immediate 3 4) ;; #f + 65 (make-non-immediate 2 133) ;; "expected floats ~S ~S" at (unknown file):180:11 + 67 (make-immediate 1 772) ;; () at (unknown file):180:4 + 68 (allocate-words/immediate 0 2) + 69 (scm-set!/immediate 0 0 4) + 70 (scm-set!/immediate 0 1 1) + 71 (allocate-words/immediate 4 2) + 72 (scm-set!/immediate 4 0 5) + 73 (scm-set!/immediate 4 1 0) + 74 (allocate-words/immediate 5 2) + 75 (scm-set!/immediate 5 0 3) + 76 (scm-set!/immediate 5 1 1) + 77 (allocate-words/immediate 1 2) + 78 (scm-set!/immediate 1 0 4) + 79 (scm-set!/immediate 1 1 5) + 80 (allocate-words/immediate 5 2) + 81 (scm-set!/immediate 5 0 2) + 82 (scm-set!/immediate 5 1 1) + 83 (allocate-words/immediate 4 2) + 84 (scm-set!/immediate 4 0 3) + 85 (scm-set!/immediate 4 1 5) + 86 (throw 6 4) +``` + +Important note: It seems that Guile 3.0.9, the latest stable release +as of writing, does not perform the desired optimization here. All +the output you are seeing here is from a Guile built from commit +`fb1f5e28b1a575247fd16184b1c83b8838b09716` of the main branch. If you +are reading this months/years into the future, then as long as you +have Guile > 3.0.9 you should be all set. + +There's a lot more instructions, but starting with instruction 41 we +can see that unboxed float instrutions like `fadd` and `fmul` are +being used. It's not made very clear, but instruction 46, +`call-f64<-f64`, is a call to a `sqrt` primitive specialized for +floats. Since our inputs have to be floats, Guile unboxes them as +f64s via the `call-f64<-scm` instruction. The cost of the runtime +checks is cheap compared to the cost of all the GC churn in the first +version. + +The source of our time spent in GC is the +`allocate-pointerless-words/immediate` instruction at index 48. This +allocates a new heap object and the subsequent instructions like +`f64-set!` set the contents of the heap object to the result of the +`sqrt` call. Our optimizations are local and once we cross the +procedure call boundary we need boxed values again. + +### Example 3: Please inline + +Guile will automatically inline procedures it considers small enough +for the potential performance improvements to be worth the additional +code size. It’s a nice feature, but there are times when you wish +something would be inlined but it doesn’t happen. + +Let’s define a procedure that normalizes 2D vectors. To do so, we’ll +build atop the `magnitude` procedure from example 2. + +```scheme +(define (normalize x y) + (let ((mag (magnitude x y))) + (when (= mag 0.0) + (error "cannot normalize vector with 0 magnitude" x y)) + (values (/ x mag) (/ y mag)))) +``` + +It would be *great* if all the unboxed float goodness from `magnitude` +spilled over to `normalize`. Let’s see it that happened (it didn’t): + +``` +scheme@(guile-user)> ,disassemble normalize +Disassembly of # at #x16609b0: + + 0 (instrument-entry 254) at (unknown file):17:19 + 2 (assert-nargs-ee/locals 3 6) ;; 9 slots (2 args) + 3 (static-ref 8 211) ;; #> at (unknown file):18:14 + 5 (immediate-tag=? 8 7 0) ;; heap-object? + 7 (je 9) ;; -> L1 + 8 (static-ref 8 184) ;; # + 10 (static-ref 5 214) ;; magnitude + 12 (call-scm<-scm-scm 8 8 5 111) ;; lookup-bound + 14 (static-set! 8 200) ;; #> +L1: + 16 (scm-ref/immediate 2 8 1) + 17 (mov 1 7) at (unknown file):18:13 + 18 (mov 0 6) + 19 (handle-interrupts) + 20 (call 6 3) + 22 (receive 0 6 9) + 24 (static-ref 5 210) ;; 0.0 at (unknown file):19:17 + 26 (=? 8 5) at (unknown file):19:10 + 27 (je 11) ;; -> L2 + 28 (call-scm<-scm-scm 7 7 8 5) ;; div at (unknown file):21:12 + 30 (call-scm<-scm-scm 8 6 8 5) ;; div at (unknown file):21:22 + 32 (mov 6 7) at (unknown file):21:4 + 33 (mov 7 8) + 34 (mov 8 6) + 35 (reset-frame 2) ;; 2 slots + 36 (handle-interrupts) + 37 (return-values) +L2: + 38 (static-ref 8 206) ;; misc-error at (unknown file):20:6 + 40 (make-immediate 5 4) ;; #f + 41 (make-non-immediate 4 205) ;; "cannot normalize vector with 0 magnitude ~S ~S" at (unknown file):20:13 + 43 (make-immediate 3 772) ;; () at (unknown file):20:6 + 44 (allocate-words/immediate 2 2) + 45 (scm-set!/immediate 2 0 6) + 46 (scm-set!/immediate 2 1 3) + 47 (allocate-words/immediate 6 2) + 48 (scm-set!/immediate 6 0 7) + 49 (scm-set!/immediate 6 1 2) + 50 (allocate-words/immediate 7 2) + 51 (scm-set!/immediate 7 0 5) + 52 (scm-set!/immediate 7 1 3) + 53 (allocate-words/immediate 3 2) + 54 (scm-set!/immediate 3 0 6) + 55 (scm-set!/immediate 3 1 7) + 56 (allocate-words/immediate 7 2) + 57 (scm-set!/immediate 7 0 4) + 58 (scm-set!/immediate 7 1 3) + 59 (allocate-words/immediate 6 2) + 60 (scm-set!/immediate 6 0 5) + 61 (scm-set!/immediate 6 1 7) + 62 (throw 8 6) +``` + +Instruction 20 is `call`, so inlining didn’t happen. Furthermore, the +two `/` calls (instructions 28 and 30) use the generic division +primitive rather than `fdiv`. No unboxing for us. + +The profiler confirms that things aren’t so great: + +```scheme +scheme@(guile-user)> ,profile (let lp ((i 0)) + (when (< i 100000000) + (normalize 3.0 4.0) + (lp (+ i 1)))) +% cumulative self +time seconds seconds procedure + 52.80 21.16 11.51 :17:19:normalize + 41.01 9.36 8.94 :9:19:magnitude + 3.29 0.72 0.72 %after-gc-thunk + 2.90 21.80 0.63 :23:9 + 0.00 0.72 0.00 anon #x15fd5c0 +--- +Sample count: 517 +Total time: 21.795201408 seconds (19.704395422 seconds in GC) +``` + +To force the compiler to inline `magnitude`, we’ll change the +definition of to use `define-inlinable`: + +```scheme +(define-inlinable (magnitude x y) + (unless (and (real? x) (inexact? x) + (real? y) (inexact? y)) + (error "expected floats" x y)) + (sqrt (+ (* x x) (* y y)))) +``` + +`define-inlinable` is a handy little macro that will substitute the +procedure body into its call sites. + +Now let’s see the disassembly: + +``` +Disassembly of # at #x16993c8: + + 0 (instrument-entry 276) at (unknown file):58:19 + 2 (assert-nargs-ee/locals 3 4) ;; 7 slots (2 args) + 3 (immediate-tag=? 5 3 2) ;; fixnum? at (unknown file):59:13 + 5 (je 10) ;; -> L1 + 6 (immediate-tag=? 5 7 0) ;; heap-object? + 8 (jne 97) ;; -> L4 + 9 (heap-tag=? 5 127 23) ;; heap-number? + 11 (jne 94) ;; -> L4 + 12 (heap-tag=? 5 4095 791) ;; compnum? + 14 (je 91) ;; -> L4 +L1: + 15 (immediate-tag=? 5 3 2) ;; fixnum? + 17 (je 88) ;; -> L4 + 18 (heap-tag=? 5 4095 535) ;; flonum? + 20 (jne 85) ;; -> L4 + 21 (immediate-tag=? 4 3 2) ;; fixnum? + 23 (je 10) ;; -> L2 + 24 (immediate-tag=? 4 7 0) ;; heap-object? + 26 (jne 79) ;; -> L4 + 27 (heap-tag=? 4 127 23) ;; heap-number? + 29 (jne 76) ;; -> L4 + 30 (heap-tag=? 4 4095 791) ;; compnum? + 32 (je 73) ;; -> L4 +L2: + 33 (immediate-tag=? 4 3 2) ;; fixnum? + 35 (je 70) ;; -> L4 + 36 (heap-tag=? 4 4095 535) ;; flonum? + 38 (jne 67) ;; -> L4 + 39 (call-f64<-scm 6 5 17) ;; scm->f64 + 41 (fmul 3 6 6) + 42 (call-f64<-scm 2 4 17) ;; scm->f64 + 44 (fmul 1 2 2) + 45 (fadd 3 3 1) + 46 (call-f64<-f64 3 3 70) + 48 (load-f64 1 0 0) at (unknown file):60:10 + 51 (f64=? 3 1) + 52 (je 28) ;; -> L3 + 53 (fdiv 6 6 3) at (unknown file):62:12 + 54 (allocate-pointerless-words/immediate 5 2) + 55 (load-u64 4 0 535) + 58 (word-set!/immediate 5 0 4) + 59 (tail-pointer-ref/immediate 4 5 1) + 60 (load-u64 1 0 0) + 63 (f64-set! 4 1 6) + 64 (fdiv 6 2 3) at (unknown file):62:22 + 65 (allocate-pointerless-words/immediate 4 2) + 66 (load-u64 3 0 535) + 69 (word-set!/immediate 4 0 3) + 70 (tail-pointer-ref/immediate 3 4 1) + 71 (load-u64 2 0 0) + 74 (f64-set! 3 2 6) + 75 (mov 6 5) at (unknown file):62:4 + 76 (mov 5 4) + 77 (reset-frame 2) ;; 2 slots + 78 (handle-interrupts) + 79 (return-values) +L3: + 80 (static-ref 6 178) ;; misc-error at (unknown file):61:6 + 82 (make-immediate 3 4) ;; #f + 83 (make-non-immediate 2 177) ;; "cannot normalize vector with 0 magnitude ~S ~S" at (unknown file):61:13 + 85 (make-immediate 1 772) ;; () at (unknown file):61:6 + 86 (allocate-words/immediate 0 2) + 87 (scm-set!/immediate 0 0 4) + 88 (scm-set!/immediate 0 1 1) + 89 (allocate-words/immediate 4 2) + 90 (scm-set!/immediate 4 0 5) + 91 (scm-set!/immediate 4 1 0) + 92 (allocate-words/immediate 5 2) + 93 (scm-set!/immediate 5 0 3) + 94 (scm-set!/immediate 5 1 1) + 95 (allocate-words/immediate 1 2) + 96 (scm-set!/immediate 1 0 4) + 97 (scm-set!/immediate 1 1 5) + 98 (allocate-words/immediate 5 2) + 99 (scm-set!/immediate 5 0 2) + 100 (scm-set!/immediate 5 1 1) + 101 (allocate-words/immediate 4 2) + 102 (scm-set!/immediate 4 0 3) + 103 (scm-set!/immediate 4 1 5) + 104 (throw 6 4) +L4: + 105 (static-ref 6 153) ;; misc-error at (unknown file):59:13 + 107 (make-immediate 3 4) ;; #f + 108 (make-non-immediate 2 160) ;; "expected floats ~S ~S" at (unknown file):54:11 + 110 (make-immediate 1 772) ;; () at (unknown file):59:13 + 111 (allocate-words/immediate 0 2) + 112 (scm-set!/immediate 0 0 4) + 113 (scm-set!/immediate 0 1 1) + 114 (allocate-words/immediate 4 2) + 115 (scm-set!/immediate 4 0 5) + 116 (scm-set!/immediate 4 1 0) + 117 (allocate-words/immediate 5 2) + 118 (scm-set!/immediate 5 0 3) + 119 (scm-set!/immediate 5 1 1) + 120 (allocate-words/immediate 1 2) + 121 (scm-set!/immediate 1 0 4) + 122 (scm-set!/immediate 1 1 5) + 123 (allocate-words/immediate 5 2) + 124 (scm-set!/immediate 5 0 2) + 125 (scm-set!/immediate 5 1 1) + 126 (allocate-words/immediate 4 2) + 127 (scm-set!/immediate 4 0 3) + 128 (scm-set!/immediate 4 1 5) + 129 (throw 6 4) +``` + +Much better! All of the instructions for `magnitude` are now part of +`normalize`. `/` is compiled to `fdiv` just like we had hoped. + +```scheme +% cumulative self +time seconds seconds procedure + 93.04 9.24 9.19 :58:19:normalize + 6.52 9.88 0.64 :71:9 + 0.43 0.04 0.04 %after-gc-thunk + 0.00 0.04 0.00 anon #x15fd5c0 +--- +Sample count: 230 +Total time: 9.879456057 seconds (8.858042989 seconds in GC) +``` + +We’re 2x faster now, though still a lot of GC. For our final example, +we will fully embrace *mutable state*. As much us Schemers like +functional programming, mutable state is sometimes necessary. + +### Example 4: Bytevectors + +For *really* performance sensitive math code, we can go one step +further to avoid allocation and use bytevectors to store the results +of numeric operations. Chickadee uses bytevectors extensively to +minimize the number of heap allocated floats. Bytevectors have the +advantage of unboxed getters and setters, so they’re my preferred data +structure for math intensive code. + +Let's revisit the vector math of the previous two examples, but this +time using bytevectors to represent 2D vectors. + +```scheme +(define-inlinable (vec2 x y) + (let ((bv (make-f32vector 2))) + (f32vector-set! bv 0 x) + (f32vector-set! bv 1 y) + bv)) + +(define-inlinable (vec2-x v) + (f32vector-ref v 0)) + +(define-inlinable (vec2-y v) + (f32vector-ref v 1)) + +(define-inlinable (magnitude v) + (let ((x (vec2-x v)) + (y (vec2-y v))) + (sqrt (+ (* x x) (* y y))))) + +(define (normalize v) + (let ((mag (magnitude v))) + (when (= mag 0.0) + (error "cannot normalize vector with 0 magnitude" v)) + (vec2 (/ (vec2-x v) mag) (/ (vec2-y v) mag)))) +``` + +Here’s the disassembly for `normalize` now: + +``` +Disassembly of # at #x1b05d50: + + 0 (instrument-entry 492) at (unknown file):454:19 + 2 (assert-nargs-ee/locals 2 11) ;; 13 slots (1 arg) + 3 (make-immediate 12 2) ;; 0 at (unknown file):455:13 + 4 (immediate-tag=? 11 7 0) ;; heap-object? + 6 (jne 83) ;; -> L8 + 7 (heap-tag=? 11 127 77) ;; bytevector? + 9 (jne 80) ;; -> L8 + 10 (word-ref/immediate 10 11 1) + 11 (load-s64 9 0 0) + 14 (imm-u64 L7 + 16 (usub/immediate 10 10 3) + 17 (pointer-ref/immediate 8 11 2) + 18 (f32-ref 7 8 9) + 19 (make-immediate 6 18) ;; 4 + 20 (load-s64 5 0 4) + 23 (u64 L6 + 25 (f32-ref 10 8 5) + 26 (fmul 8 7 7) + 27 (fmul 4 10 10) + 28 (fadd 8 8 4) + 29 (call-f64<-f64 8 8 70) + 31 (load-f64 4 0 0) at (unknown file):456:10 + 34 (f64=? 8 4) + 35 (je 48) ;; -> L5 + 36 (fdiv 11 7 8) at (unknown file):458:10 + 37 (fdiv 10 10 8) at (unknown file):458:29 + 38 (static-ref 8 332) ;; #f at (unknown file):388:13 + 40 (immediate-tag=? 8 7 0) ;; heap-object? + 42 (je 9) ;; -> L1 + 43 (static-ref 8 305) ;; # + 45 (static-ref 7 335) ;; make-f32vector + 47 (call-scm<-scm-scm 8 8 7 111) ;; lookup-bound + 49 (static-set! 8 321) ;; #f +L1: + 51 (scm-ref/immediate 1 8 1) + 52 (make-immediate 0 10) ;; 2 at (unknown file):388:28 + 53 (handle-interrupts) at (unknown file):458:4 + 54 (call 11 2) + 56 (receive 4 11 13) + 58 (immediate-tag=? 8 7 0) ;; heap-object? + 60 (jne 21) ;; -> L4 + 61 (heap-tag=? 8 127 77) ;; bytevector? + 63 (jne 18) ;; -> L4 + 64 (word-ref/immediate 7 8 1) + 65 (imm-u64 L3 + 67 (usub/immediate 12 7 3) + 68 (pointer-ref/immediate 7 8 2) + 69 (f32-set! 7 9 11) + 70 (u64 L2 + 72 (f32-set! 7 5 10) + 73 (mov 12 8) + 74 (reset-frame 1) ;; 1 slot + 75 (handle-interrupts) + 76 (return-values) +L2: + 77 (throw/value+data 6 331) ;; #(out-of-range "bytevector-ieee-single-native-set!" "Argument 2 out of rang…") +L3: + 79 (throw/value+data 12 329) ;; #(out-of-range "bytevector-ieee-single-native-set!" "Argument 2 out of rang…") +L4: + 81 (throw/value+data 8 353) ;; #(wrong-type-arg "bytevector-ieee-single-native-set!" "Wrong type argument …") +L5: + 83 (throw/value 11 377) ;; #(misc-error #f "cannot normalize vector with 0 magnitude ~S") at (unknown file):457:6 +L6: + 85 (throw/value+data 6 391) ;; #(out-of-range "bytevector-ieee-single-native-ref" "Argument 2 out of range…") at (unknown file):455:13 +L7: + 87 (throw/value+data 12 389) ;; #(out-of-range "bytevector-ieee-single-native-ref" "Argument 2 out of range…") +L8: + 89 (throw/value+data 11 395) ;; #(wrong-type-arg "bytevector-ieee-single-native-ref" "Wrong type argument i…") +``` + +This looks pretty good! All the math is done with unboxed floats and +no heap floats are allocated at all. Unboxed floats are pulled out of +the bytevector with `f32-ref` and stuffed back in with `f32-set!`. +But we’re still allocating a new bytevector at the end. This is +generally fine, but for *reeeeaaally* performance sensitive code we +want to avoid this allocation, too. For this case, we can write a +variant of `normalize` that mutates another 2D vector to store the +result. + +```scheme +(define-inlinable (set-vec2-x! v x) + (f32vector-set! v 0 x)) + +(define-inlinable (set-vec2-y! v y) + (f32vector-set! v 1 y)) + +(define (normalize! v dst) + (let ((mag (magnitude v))) + (when (= mag 0.0) + (error "cannot normalize vector with 0 magnitude" v)) + (set-vec2-x! dst (/ (vec2-x v) mag)) + (set-vec2-y! dst (/ (vec2-y v) mag)))) +``` + +We can then define the functional version in terms of the imperative +version: + +```scheme +(define (normalize v) + (let ((v* (vec2 0.0 0.0))) + (normalize! v v*) + v*)) +``` + +Now we have options. We can use the less elegant, imperative variant +when we can’t afford to allocate and use the functional variant +otherwise. This is a simplified version of how vecs, matrices, and +rects work in Chickadee. + +Let’s compare the two. First, the functional API: + +```scheme +scheme@(guile-user)> ,profile (let ((v (vec2 3.0 4.0))) + (let lp ((i 0)) + (when (< i 100000000) + (normalize v) + (lp (+ i 1))))) +% cumulative self +time seconds seconds procedure + 46.46 7.84 7.73 make-srfi-4-vector + 31.61 5.26 5.26 :425:19:normalize! + 12.95 16.23 2.15 :432:19:normalize + 5.87 0.98 0.98 ice-9/boot-9.scm:408:31:make-f32vector + 2.42 16.63 0.40 :439:32 + 0.69 0.11 0.11 %after-gc-thunk + 0.00 0.11 0.00 anon #x15fd5c0 +--- +Sample count: 579 +Total time: 16.633395281 seconds (12.628994384 seconds in GC) +``` + +And now the imperative API: + +```scheme +scheme@(guile-user)> ,profile (let ((v (vec2 3.0 4.0)) + (dst (vec2 0.0 0.0))) + (let lp ((i 0)) + (when (< i 100000000) + (normalize! v dst) + (lp (+ i 1))))) +% cumulative self +time seconds seconds procedure + 91.03 1.13 1.13 :272:19:normalize! + 8.97 1.24 0.11 :343:32 +--- +Sample count: 78 +Total time: 1.244961515 seconds (0.0 seconds in GC) +``` + +13x faster and no GC! To use this technique in your own program, you +may want to use something like a pool to reuse objects over and over; +or just stash an object somewhere to use as scratch space. + +Note: Unlike example 2, these optimizations *do* happen on Guile 3.0.9 +and IIRC any stable Guile 3.0.x release. + +### Happy hacking + +Well, that’s all I’ve got! There are other sources of allocation to +be aware of, like closures, but I couldn’t come up with clean +examples. If I think of something good maybe I’ll update this post +later. + +To reiterate, most of the code you write doesn’t need to be examined +this closely. Don’t rush off and use `define-inlinable` everywhere +and inflate the size of your compiled modules! Let the profiler focus +your attention on what matters. May your Scheme be speedy and your +GCs infrequent. 🙏 -- cgit v1.2.3