Storing Morse code in C — A comparison of techniques
A few days ago I posted this column on Embedded.com and a corresponding column on EETimes.com describing how I wish to use a light-emitting diode LED to flash messages in Morse code. In these columns I inquired as to the best way to represent, store, and access the Morse code dots and dashes used to represent letters and numbers.

As you can see from the image above, the alphanumeric characters 'A' to 'Z' and '0' to '9' each require between one and five dot-dash symbols. What I neglected to mention was that I also wish to have two punctuation symbols at my disposal, each of which contains six symbols: ',' (comma = “–..–“) and '.' (period = “.-.-.-“).
ESC Minneapolis.]
Of course, you might say: “Does anyone still know Morse code these days?” Well, my chum Ivan from the next bay just wandered into my office, looked at a little proto-shield I threw together last weekend flashing away merrily on my desk and said “Oh, it's flashing 'Hello World'” (take a look at this video of the proto-shield doing its funky stuff).
Alternatively, you might ask: “Does anyone actually need to know Morse code these days.” To these sad souls I would respond: “Hello! Didn’t you see the film Independence Day ?” Then, as they sheepishly sidled away, I would call out: “Go away, or I shall taunt you a second time!”
But we digress… My original columns unleashed a maelstrom of comments. Community member Cdhmanning suggested using a binary decision tree, in which each node has a character and a link to the left for dot and one to the right for dash. I'd love to try this, but I fear it's beyond my limited programming skills. In fact, in one of his comments, Betajet presented the code for this; however, I'm still trying to wrap my bumbling brain around this little scamp (the code, not Betajet).
Someone else emailed me proposing a form of run-length encoding. Sad to say, however, I couldn’t wrap my brain around that one either (“I'm a bear of little brain,” as Winnie-the-Pooh would say).
Several people advocated using two bits per symbol. My chum David Ashton, for example, suggested 01=dot, 10=dash, 11=space, and 00=end. David also noted that, using this scheme, a 5-symbol code would require 20 bits (5 dots/dashes x 2 bits, plus 4 inter-symbol spaces x 2 bits, plus 1 end character x 2 bits).
Actually, just a few minutes ago whilst I was in the process of writing this column, I received an email from Djones A. Boni in Brazil, who suggested the following scheme, which is very similar to David's:
We actually have four different symbols:
The dot (+1 implicit space between parts)
The dash (+1 implicit space between parts)
The space between letters (-1 implicit space between parts)
The space between words (-1 implicit space between parts)We can encode them as 0b00, 0b01, 0b10 and 0b11, thereby getting four symbols into a byte.
Another community member, Jacob Larsen, suggested using 0 and 1 bits to represent units of light and darkness, so the character 'A' (“.-“) would be represented as 10111; i.e., dot (1) = LED On = one unit delay; inter-symbol space (0) = LED Off = one unit delay; and dash (111) = LED On = three unit delays. On the one hand, this has the advantage of being relatively simple; on the other hand, the numeric character '0' with five dashes (“—–“) would occupy 5 x (3 + 1) = 20 bits, which seems a tad enthusiastic, as it were.
Another proposal from Cdhmanning was to store each Morse code character in an 8-bit byte. The three MSBs (most-significant bits) would be used to represent the number of bits in the character, while the five LSBs (least-significant bits) would be used to represent the dots and dashes.
Consider the character 'F' (“..-.”), for example. This could be represented as 100×0010, where the 100 indicates that this character contains four dot-dash symbols and the 0010 indicates dot-dot-dash-dot. Meanwhile, the 'x' indicates that we don’t care about this character (we would set it to 0 in the code).
In fact, Cdhmanning, actually recommended that the dot-dash symbol bits be “swapped over” such that the first to be transmitted is stored in the LSB, because this will make any downstream processing easier. In this case, our 'F' character would actually be stored as 100×0100.
As I mentioned above, I also wish to have two punctuation symbols at my disposal, each of which contains six symbols: ',' (comma = “–..–“) and '.' (period = “.-.-.-“). Happily, Cdhmanning's scheme can be adapted to accommodate this as shown below, where a 'x' character indicates “don't care” (we'll set these to 0 in the code) and a '?' character indicates a symbol (0 = dot, 1 = dash):
// 000 xxxxx 0 symbols (not used)
// 001 xxxx? 1 symbol
// 010 xxx?? 2 symbols
// 011 xx??? 3 symbols
// 100 x???? 4 symbols
// 101 ????? 5 symbols
// 11 ?????? 6 symbols
Jacob Larsen also commented: “You don't mention what you are optimizing for — code size, code maintainability, data size, or instruction speed? I am going to assume maintainability, since Morse code generally doesn't go fast enough to need speed optimization.”
Jacob makes a very good point. Transmitting Morse code at human-readable speeds means that — most of the time — the processor is sitting around twiddling its metaphorical thumbs, so speed optimization isn’t really an issue. On the other hand, it's always a good idea to minimize your code and data size where possible, and it's also gratifying to make your code run as efficiently and as quickly as possible, even if no one else really gives a hoot.
For this particular project, I feel that understandability and maintainability are the two key elements. The reason for this will become clear when I rent the veils asunder and explain the true purpose behind this capriciously cunning contraption, but we'll leave that for a future column.
For the purposes of this exercise, I've decided to compare two different techniques: (a) Storing the Morse code symbols as strings, and (b) packing the symbols for each character into a single byte. Now read on…
Index
- Overview
- Storing symbols as strings
- Storing symbols as bytes
- Comparison (code size and execution speed)
Storing symbols as strings
Before we plunge into the fray with gusto and abandon, let’s first discuss a few “housekeeping” issues. The thing about Morse code is that everything is relative, because some people can transmit it faster than others. As we know from our original Morse code chart, a dot equates to one element (unit) of delay (time) and a dash equates to three elements of delay.
Meanwhile, the inter-symbol (dot-dash) spacing is one element of delay; the inter-character spacing (i.e., the spacing between adjacent letters in a string like “Hello”) is three elements of delay; and the inter-word spacing (i.e., the spacing between two words in a string like “Hello World”) is seven elements of delay.
Baring this in mind, what does it actually mean when you hear people say that they can transmit 'n' words per minute in Morse code? This is obviously somewhat subjective, because different words have different numbers of characters; there's a heck of a difference between the number of characters in “A” and in “Antidisestablishmentarianism,” for example.
For this reason, one of the “standard words” Morse code aficionados use to compare themselves against each other is PARIS. Consider the following breakdown, where {} represents the delay elements associated with a dot or a dash; [] represents the delay elements associated with inter-symbol spacing; and () represents the delay elements associated with inter-character and inter-word spacing:
P ".--." {1} [1] {3} [1] {3} [1] {1} (3) = 14
A ".-" {1} [1] {3} (3) = 8
R ".-." {1} [1] {3} [1] {1} (3) = 10
I ".." {1} [1] {1} (3) = 3
S "..." {1} [1] {1} [1] {1} (7) = 12
Total = 50
One minute equals 60 seconds equals 60,000 milliseconds (ms). Thus, if we were to transmit at one word per minute — and assuming 50 elements per word based on PARIS as discussed above — then each element would take 1,200ms. This explains why you will see the following definition in my code:
#define elementDelay 1200
By trial and error, I've discovered that 15 words per minute is pleasing to my eye, which is why you will also see the following definition in my code:
#define WPM 15
Thus, the way in which I've implemented my code means that you only have to tweak the WPM definition in order to modify the transmission rate. And, speaking of my code, you can access the entire program by clicking here.
The interesting part of this program starts with our definition of the dots and dashes, which appears as follows:
char* dotsNdashes[] = {".-", // 0 = A
"-...", // 1 = B
"-.-.", // 2 = C
"-..", // 3 = D
:
etc.
This basically says that we're going to store an array of pointers to strings of characters, so dotsNdashes[0] contains a pointer to a string containing “.-” (plus a hidden terminating '