Small Code Size on RP2040 Forth
Every since watching this nifty video on boot sector games, I've been obsessed with the idea of optimizing for code size, i.e., making the compiled code as few bytes as possible.
2022-06-28 - Boot Sector Games
The short summary of the video is that there are some games that have been written which fit entirely within the 512 KB boot sector of a floppy disk, and run without an operating system first being loaded. Not great games, but working games.
Something which challenged me recently was trying to reduce the compiled code size of a small program I wrote in Mecrisp Stellaris Forth. The program was supposed to read the function bits of the RP2040 GPIO CTRL registers, and output the current function that is set for each GPIO pin, in a human readable format. There are nine function options for each pin, and there are 29 pins. The exact function in a slot varies, however, for each pin, e.g., function 1 for GPIO 0 is SPIO RX, but for GPIO 2 it is SPIO SCK, and for GPIO 8 it is SPI1 RX.
Screenshot of GPIO functions
Aiming for small code size, it was out of the question to simply have a table of strings, with a full description for each slot, which would require over 1KB of space just for the string table. There is quite a bit of pattern and partial repetition, however, so that it is possible to build the string for a given slot using logic and comparisions, based on the bits in the pin number. This saves a lot of code space in terms of string data, but that benefit is offset somewhat by extra comparisons and bitwise operations that must be performed.
In my initial solution, I tried to generate long, easy to read strings, such as "UART0_RX", but in the end the compiled code size came out to somewhere around 900-1000 bytes, which seemed much too large for a program with such a humble purpose. The compromise I settled on, to avoid abondoning the "human-readable" goal completely, was to have abbreviated strings limited to three characters each. So, "UART0_RX" becomes "U0R". That is not self-explanatory, but it is much easier to associate this with "UART0_RX" than if you were just looking at the raw number code from the register.
Here is the program I came up with:
The two main words are GPIO_FNS' and GPIO_FNS, which compile to 582 bytes and 80 bytes, with a few more bytes for the other words, as well as the overhead for each entry in the dictionary. That was not a satisfying result, as I feel like this sort of program shouldn't be needing more than 200 or 300 bytes. But I couldn't think of any practical ways to reduce the code size further. I asked on #mecrisp, but nobody there had any ideas either, although somebody asked if Mecrisp Forth had some kind of BETWEEN word which could replace this code:
But sadly, no such word was available. I could of course define such a word, but since (at present) I need to use it in only one place, it would actually increase the code size to factor it out, due to the dictionary entry overhead.
Something I was really wondering about was if there was some other way to print out the characters which would be more code-size efficient. However, it seemed that using the ." and " words took up less space than all the other methods I knew of. To explain how that works, see this disassembly:
Here is a word which prints out just the letter A, using the ." and " words:
Not counting the push and pop, that is four bytes for a branch instruction (no doubt the code which handles the following string). Then one byte for the length of the string (01h) and then the character itself (41h). Here is the same word defined using a simple emit:
Not counting the push or pop, we have four bytes for the branch instruction to the EMIT word, and there are six more bytes worth of instructions for dropping $41 on the top of the stack. So, the first approach will be better in terms of code size, even for a single-character string.
I should think that this would be a sensible place for the Mecrisp compiler to do an optimization: say, just put the number in some other scratch register, and call a special version of EMIT which uses that value. But I imagine that optimization is easier to talk about than to code into the compiler.
Another optimization I was wondering about, was if it would be possible to replace the case statement with something that has a smaller footprint. The case statement works by comparing Top Of Stack (TOS) with a case value. If it doesn't match, branch to the next comparison. If it does, keep running code, and eventually there is a branch instruction which takes you to the end of the whole construction. For example:
However, in my situation, the case numbers are all in a whole number sequence. So, I should be able to just store an array of vectors to the different code blocks, then use the appropriate vector. I believe this would lower code size overall, since I would basically be adding only a single address for each two CMP and BNE instructions which I removed.
That sounds good, but I don't know how to accomplish that without either rewriting the whole GPIO_FNS' word in assembly, or coming up with a fancy Forth compile-time word which somehow builds the array and all the other branching code. Both ideas sound daunting, but perhaps this could be my next mini-project...
I just looked up bootchess, it is only 487kb. Amazing how small a program can be!