Storing Morse code in C -- A comparison of techniques
A few days ago I posted this column on Embedded.com and a corresponding column on EETimes.com describing how I wish to use a light-emitting diode LED to flash messages in Morse code. In these columns I inquired as to the best way to represent, store, and access the Morse code dots and dashes used to represent letters and numbers.
As you can see from the image above, the alphanumeric characters 'A' to 'Z' and '0' to '9' each require between one and five dot-dash symbols. What I neglected to mention was that I also wish to have two punctuation symbols at my disposal, each of which contains six symbols: ',' (comma = "--..--") and '.' (period = ".-.-.-").
Of course, you might say: "Does anyone still know Morse code these days?" Well, my chum Ivan from the next bay just wandered into my office, looked at a little proto-shield I threw together last weekend flashing away merrily on my desk and said "Oh, it's flashing 'Hello World'" (take a look at this video of the proto-shield doing its funky stuff).
Alternatively, you might ask: "Does anyone actually need to know Morse code these days." To these sad souls I would respond: "Hello! Didn’t you see the film Independence Day?" Then, as they sheepishly sidled away, I would call out: "Go away, or I shall taunt you a second time!"
But we digress... My original columns unleashed a maelstrom of comments. Community member Cdhmanning suggested using a binary decision tree, in which each node has a character and a link to the left for dot and one to the right for dash. I'd love to try this, but I fear it's beyond my limited programming skills. In fact, in one of his comments, Betajet presented the code for this; however, I'm still trying to wrap my bumbling brain around this little scamp (the code, not Betajet).
Someone else emailed me proposing a form of run-length encoding. Sad to say, however, I couldn’t wrap my brain around that one either ("I'm a bear of little brain," as Winnie-the-Pooh would say).
Several people advocated using two bits per symbol. My chum David Ashton, for example, suggested 01=dot, 10=dash, 11=space, and 00=end. David also noted that, using this scheme, a 5-symbol code would require 20 bits (5 dots/dashes x 2 bits, plus 4 inter-symbol spaces x 2 bits, plus 1 end character x 2 bits).
Actually, just a few minutes ago whilst I was in the process of writing this column, I received an email from Djones A. Boni in Brazil, who suggested the following scheme, which is very similar to David's:
We actually have four different symbols:
The dot (+1 implicit space between parts)
The dash (+1 implicit space between parts)
The space between letters (-1 implicit space between parts)
The space between words (-1 implicit space between parts)
We can encode them as 0b00, 0b01, 0b10 and 0b11, thereby getting four symbols into a byte.
Another community member, Jacob Larsen, suggested using 0 and 1 bits to represent units of light and darkness, so the character 'A' (".-") would be represented as 10111; i.e., dot (1) = LED On = one unit delay; inter-symbol space (0) = LED Off = one unit delay; and dash (111) = LED On = three unit delays. On the one hand, this has the advantage of being relatively simple; on the other hand, the numeric character '0' with five dashes ("-----") would occupy 5 x (3 + 1) = 20 bits, which seems a tad enthusiastic, as it were.
Another proposal from Cdhmanning was to store each Morse code character in an 8-bit byte. The three MSBs (most-significant bits) would be used to represent the number of bits in the character, while the five LSBs (least-significant bits) would be used to represent the dots and dashes.
Consider the character 'F' ("..-."), for example. This could be represented as 100x0010, where the 100 indicates that this character contains four dot-dash symbols and the 0010 indicates dot-dot-dash-dot. Meanwhile, the 'x' indicates that we don’t care about this character (we would set it to 0 in the code).
In fact, Cdhmanning, actually recommended that the dot-dash symbol bits be "swapped over" such that the first to be transmitted is stored in the LSB, because this will make any downstream processing easier. In this case, our 'F' character would actually be stored as 100x0100.
As I mentioned above, I also wish to have two punctuation symbols at my disposal, each of which contains six symbols: ',' (comma = "--..--") and '.' (period = ".-.-.-"). Happily, Cdhmanning's scheme can be adapted to accommodate this as shown below, where a 'x' character indicates "don't care" (we'll set these to 0 in the code) and a '?' character indicates a symbol (0 = dot, 1 = dash):
// 000 xxxxx 0 symbols (not used)
// 001 xxxx? 1 symbol
// 010 xxx?? 2 symbols
// 011 xx??? 3 symbols
// 100 x???? 4 symbols
// 101 ????? 5 symbols
// 11 ?????? 6 symbols
Jacob Larsen also commented: "You don't mention what you are optimizing for -- code size, code maintainability, data size, or instruction speed? I am going to assume maintainability, since Morse code generally doesn't go fast enough to need speed optimization."
Jacob makes a very good point. Transmitting Morse code at human-readable speeds means that -- most of the time -- the processor is sitting around twiddling its metaphorical thumbs, so speed optimization isn’t really an issue. On the other hand, it's always a good idea to minimize your code and data size where possible, and it's also gratifying to make your code run as efficiently and as quickly as possible, even if no one else really gives a hoot.
For this particular project, I feel that understandability and maintainability are the two key elements. The reason for this will become clear when I rent the veils asunder and explain the true purpose behind this capriciously cunning contraption, but we'll leave that for a future column.
For the purposes of this exercise, I've decided to compare two different techniques: (a) Storing the Morse code symbols as strings, and (b) packing the symbols for each character into a single byte. Now read on...
- Storing symbols as strings
- Storing symbols as bytes
- Comparison (code size and execution speed)