Storing Morse code in C - A comparison of techniques - Embedded.com

Storing Morse code in C — A comparison of techniques

A few days ago I posted this column on Embedded.com and a corresponding column on EETimes.com describing how I wish to use a light-emitting diode LED to flash messages in Morse code. In these columns I inquired as to the best way to represent, store, and access the Morse code dots and dashes used to represent letters and numbers.

As you can see from the image above, the alphanumeric characters 'A' to 'Z' and '0' to '9' each require between one and five dot-dash symbols. What I neglected to mention was that I also wish to have two punctuation symbols at my disposal, each of which contains six symbols: ',' (comma = “–..–“) and '.' (period = “.-.-.-“).

[You can discuss this, and other topics, with Max at
ESC Minneapolis.]

Of course, you might say: “Does anyone still know Morse code these days?” Well, my chum Ivan from the next bay just wandered into my office, looked at a little proto-shield I threw together last weekend flashing away merrily on my desk and said “Oh, it's flashing 'Hello World'” (take a look at this video of the proto-shield doing its funky stuff).

Alternatively, you might ask: “Does anyone actually need to know Morse code these days.” To these sad souls I would respond: “Hello! Didn’t you see the film Independence Day ?” Then, as they sheepishly sidled away, I would call out: “Go away, or I shall taunt you a second time!”

But we digress… My original columns unleashed a maelstrom of comments. Community member Cdhmanning suggested using a binary decision tree, in which each node has a character and a link to the left for dot and one to the right for dash. I'd love to try this, but I fear it's beyond my limited programming skills. In fact, in one of his comments, Betajet presented the code for this; however, I'm still trying to wrap my bumbling brain around this little scamp (the code, not Betajet).

Someone else emailed me proposing a form of run-length encoding. Sad to say, however, I couldn’t wrap my brain around that one either (“I'm a bear of little brain,” as Winnie-the-Pooh would say).

Several people advocated using two bits per symbol. My chum David Ashton, for example, suggested 01=dot, 10=dash, 11=space, and 00=end. David also noted that, using this scheme, a 5-symbol code would require 20 bits (5 dots/dashes x 2 bits, plus 4 inter-symbol spaces x 2 bits, plus 1 end character x 2 bits).

Actually, just a few minutes ago whilst I was in the process of writing this column, I received an email from Djones A. Boni in Brazil, who suggested the following scheme, which is very similar to David's:

We actually have four different symbols:
The dot (+1 implicit space between parts)
The dash (+1 implicit space between parts)
The space between letters (-1 implicit space between parts)
The space between words (-1 implicit space between parts)

We can encode them as 0b00, 0b01, 0b10 and 0b11, thereby getting four symbols into a byte.

Another community member, Jacob Larsen, suggested using 0 and 1 bits to represent units of light and darkness, so the character 'A' (“.-“) would be represented as 10111; i.e., dot (1) = LED On = one unit delay; inter-symbol space (0) = LED Off = one unit delay; and dash (111) = LED On = three unit delays. On the one hand, this has the advantage of being relatively simple; on the other hand, the numeric character '0' with five dashes (“—–“) would occupy 5 x (3 + 1) = 20 bits, which seems a tad enthusiastic, as it were.

Another proposal from Cdhmanning was to store each Morse code character in an 8-bit byte. The three MSBs (most-significant bits) would be used to represent the number of bits in the character, while the five LSBs (least-significant bits) would be used to represent the dots and dashes.

Consider the character 'F' (“..-.”), for example. This could be represented as 100×0010, where the 100 indicates that this character contains four dot-dash symbols and the 0010 indicates dot-dot-dash-dot. Meanwhile, the 'x' indicates that we don’t care about this character (we would set it to 0 in the code).

In fact, Cdhmanning, actually recommended that the dot-dash symbol bits be “swapped over” such that the first to be transmitted is stored in the LSB, because this will make any downstream processing easier. In this case, our 'F' character would actually be stored as 100×0100.

As I mentioned above, I also wish to have two punctuation symbols at my disposal, each of which contains six symbols: ',' (comma = “–..–“) and '.' (period = “.-.-.-“). Happily, Cdhmanning's scheme can be adapted to accommodate this as shown below, where a 'x' character indicates “don't care” (we'll set these to 0 in the code) and a '?' character indicates a symbol (0 = dot, 1 = dash):

// 000 xxxxx 0 symbols (not used)
// 001 xxxx? 1 symbol
// 010 xxx?? 2 symbols
// 011 xx??? 3 symbols
// 100 x???? 4 symbols
// 101 ????? 5 symbols
// 11 ?????? 6 symbols

Jacob Larsen also commented: “You don't mention what you are optimizing for — code size, code maintainability, data size, or instruction speed? I am going to assume maintainability, since Morse code generally doesn't go fast enough to need speed optimization.”

Jacob makes a very good point. Transmitting Morse code at human-readable speeds means that — most of the time — the processor is sitting around twiddling its metaphorical thumbs, so speed optimization isn’t really an issue. On the other hand, it's always a good idea to minimize your code and data size where possible, and it's also gratifying to make your code run as efficiently and as quickly as possible, even if no one else really gives a hoot.

For this particular project, I feel that understandability and maintainability are the two key elements. The reason for this will become clear when I rent the veils asunder and explain the true purpose behind this capriciously cunning contraption, but we'll leave that for a future column.

For the purposes of this exercise, I've decided to compare two different techniques: (a) Storing the Morse code symbols as strings, and (b) packing the symbols for each character into a single byte. Now read on…

Index

Storing symbols as strings
Before we plunge into the fray with gusto and abandon, let’s first discuss a few “housekeeping” issues. The thing about Morse code is that everything is relative, because some people can transmit it faster than others. As we know from our original Morse code chart, a dot equates to one element (unit) of delay (time) and a dash equates to three elements of delay.

Meanwhile, the inter-symbol (dot-dash) spacing is one element of delay; the inter-character spacing (i.e., the spacing between adjacent letters in a string like “Hello”) is three elements of delay; and the inter-word spacing (i.e., the spacing between two words in a string like “Hello World”) is seven elements of delay.

Baring this in mind, what does it actually mean when you hear people say that they can transmit 'n' words per minute in Morse code? This is obviously somewhat subjective, because different words have different numbers of characters; there's a heck of a difference between the number of characters in “A” and in “Antidisestablishmentarianism,” for example.

For this reason, one of the “standard words” Morse code aficionados use to compare themselves against each other is PARIS. Consider the following breakdown, where {} represents the delay elements associated with a dot or a dash; [] represents the delay elements associated with inter-symbol spacing; and () represents the delay elements associated with inter-character and inter-word spacing:

P ".--." {1} [1] {3} [1] {3} [1] {1} (3) = 14
A ".-"   {1} [1] {3}                 (3) =  8
R ".-."  {1} [1] {3} [1] {1}         (3) = 10
I ".."   {1} [1] {1}                 (3) =  3
S "..."  {1} [1] {1} [1] {1}         (7) = 12
                                   Total = 50

One minute equals 60 seconds equals 60,000 milliseconds (ms). Thus, if we were to transmit at one word per minute — and assuming 50 elements per word based on PARIS as discussed above — then each element would take 1,200ms. This explains why you will see the following definition in my code:

#define elementDelay 1200

By trial and error, I've discovered that 15 words per minute is pleasing to my eye, which is why you will also see the following definition in my code:

#define WPM 15

Thus, the way in which I've implemented my code means that you only have to tweak the WPM definition in order to modify the transmission rate. And, speaking of my code, you can access the entire program by clicking here.

The interesting part of this program starts with our definition of the dots and dashes, which appears as follows:

char* dotsNdashes[] = {".-",   // 0 = A
                      "-...", // 1 = B
                      "-.-.", // 2 = C
                      "-..",  // 3 = D
                         :
                        etc.

This basically says that we're going to store an array of pointers to strings of characters, so dotsNdashes[0] contains a pointer to a string containing “.-” (plus a hidden terminating '' null character); dotsNdashes[1] contains a pointer to a string containing “-…”; and so forth.

Now, I want my final program to be able to display any ASCII string that comes its way. For the purpose of this example, however, my main loop is pretty simple as shown below:

void loop() {
  displayThis("Hello World");
}

The idea is that I can pass any ASCII text string to my displayThis() function. As you'll see, the first thing we do in the displayThis() function is to convert any lowercase alpha characters 'a' to 'z' into their uppercase counterparts 'A' to 'Z'. We could just tell people that they can only use uppercase letters, but we all know that they aren’t going to listen (LOL).

You'll also see that we check for any unexpected characters and — if we find any — we call a simple eeekError() function that flashes the LED in a certain way (we vary the LED's on-off times to reflect different errors).

The body of the displayThis() function is used to determine which entry (string) we wish to use in our dotsNdashes[] array. Once we've determined this, we call a flashLED() function and pass this string in as a parameter (whilst pondering this function, it may help to remember that a string is an array of characters, and that any type of array is automatically passed into a function as a pointer; of course, this may not help, in which case I'm sorry I mentioned it).

Finally, we use the flashLED() function to actually output the dots and dashes by flashing the LED. This first thing we do for each dot-dash symbol is to turn the LED on because the only difference between a dot and a dash is the duration it remains active.

OK. That pretty much covers the technique of storing our dot-dash Morse code symbols as strings. The advantage is that this is relatively easy to understand and maintain; the disadvantage is that our dotsNdashes[] array consumes a byte (char) for each dot and dash, which is a lot of memory (relatively speaking). Next, we'll consider packing all of the symbols forming a character into a single byte.

Index

Storing symbols as bytes
The overall architecture and flow of this program is very similar to the string version (you can access the entire program by clicking here). The first difference occurs when we declare our dotsNdashes[] array. In this case, we're using an array of bytes that looks something like the following:

byte dotsNdashes[] = {B01000010, // 0 = A ".-"
                      B10000001, // 1 = B "-..."
                      B10000101, // 2 = C "-.-."
                      B01100001, // 3 = D "-.."
                      :
                      etc.

As we discussed in the overview, the three MSBs contain the number of dot-dash symbols (the two MSBs in the case of the punctuation characters), while the LSBs contain the symbols themselves; 0 = dot and 1 = dash, with the symbols organized in reverse order such that the first symbol appears in the LSB.

Our loop() function is identical to the previous example:

void loop() {
  displayThis("Hello World");
}

Our displayThis() function appears to be identical to the previous example — it's word-for-word identical — but there is a subtle difference regarding the way in which the compiler handles things. Consider the following statement from the displayThis() function:

  flashLED(dotsNdashes[36]);

In the case of our previous “store as strings” program, this statement calls our flashLED() function and passes it a pointer to a string of characters as an argument. By comparison, in the case of our “store as bytes” program, the argument is a single char (which we end up treating as a byte):

void flashLED(byte dNdByte) {
   :
  // Lots of clever stuff
   :
}

The next big difference between our two programs occurs in the body of our flashLED() function. First of all, we extract the number of symbols forming this character (the three MSBs), shift them five bits to the right, and store this value in a variable called “count”:

byte count;

count = (dNdByte >> 5) & B00000111;

if (count >= 6) count &= B00000110;

Since byte is an unsigned data type, shifting it to the right should theoretically cause 0s to be shifted into the MSBs, but I don’t like to take any chances, so — following the shift — the “& B00000111” (bitwise AND) operation forces the five MSBs to be 0s.

Now, remember that the coding scheme we're using looks like the following:

// 000 xxxxx 0 symbols (not used)
// 001 xxxx? 1 symbol
// 010 xxx?? 2 symbols
// 011 xx??? 3 symbols
// 100 x???? 4 symbols
// 101 ????? 5 symbols
// 11 ?????? 6 symbols

Generally speaking, we have three count bits and five data bits. If the count indicates six symbols, however, then we have only two count bits and six data bits. In this case, the original bit 5, a copy of which now finds itself in the LSB of count, could be a 0 or a 1. This explains the statement ” if (count >= 6) count &= B00000110;”, which ensures that a count of six or seven is forced to be six, because a value of seven means that the LS bit of the count contains an unwanted data bit of 1.

Now, I have to admit that I ran into a bit of a problem with the following, which is the part where we actually extract and display the dots and dashes:

for (int i = 0; i < count; i++) {
  digitalWrite(pinLED,HIGH); // LED On

  if (dNdByte & B00000001 == 0) // It's a dot
   delay(elementDelay / WPM); // Dot delay
  else // It's a dash
   delay((elementDelay * 3) / WPM); // Dash delay

  dNdByte = dNdByte >> 1;

  digitalWrite(pinLED,LOW); // LED Off

  delay(elementDelay / WPM); // Inter-symbol delay
}

When I executed this and observed my flashing LED, I realized that I only appeared to be seeing dashes. This struck me as being a bit strange, so I stuffed a lot of Serial.print() statements into my program to determine what was going on. Here's what I saw in the Serial I/O window:

As you see, all I'm generating is dashes. Eventually I focused my beady eye on the following statement:

 if (dNdByte & B00000001 == 0) // It's a dot
   delay(elementDelay / WPM); // Dot delay
 else // It's a dash
   delay((elementDelay * 3) / WPM); // Dash delay

In particular, I considered the “== 0” logical comparison. I couldn’t see how this could be wrong, but when I changed it to read “== 1”, I then saw the following results.

“Well, strike me with a kipper,” I thought, “that's strange and no mistake.” Now we are seeing dots and dashes, but they are inverted; i.e., a dot is displayed as a dash and vice versa. In fact, this is what we'd expect because we've inverted our test from “== 0 ” to “== 1”. But, in that case, why didn’t the “== 0” do what we expected in the first place?

I tell you, I stared and stared at this, but nothing came to mind, so eventually I called my chum Ivan Cowie over from the next bay to take a look. Ivan suggested taking my statement:

 if (dNdByte & B00000001 == 0) // It's a dot

And adding parentheses as follows:

 if ((dNdByte & B00000001) == 0) // It's a dot

So we did so, and everything worked as expected. “Well, blow me down with a feather,” I thought, “who would have thunk?” I personally would have assumed that a bitwise operator like &, |, or ^ would have a higher precedence than a logical comparison (relational) operator like == or !=, but it appears that this is not the case (see also this list C++ operator preferences).

It took me a while to wrap my brain around what was happening here. This thing is that when we use an “if ( {condition is true} ) {do this}” type statement, then “true” equates to 1 and “false” equates to 0.

If the “==” comparison is performed first, then “1 == 0” returns 0, and “dNdByte & 0” also returns zero, which is seen by the “if” statement as “false”, so it appears that we only ever have dashes.

But what happens if we change our comparison to “1 == 1”? In this case “1 == 1” returns 1, so “dNdByte & 1” will return 0 or 1 depending on the contents of the LSB, but a 1 will be seen as meaning “true”, which equates to a dot, all of which explains why “== 1” does end up giving us dots and dashes, but swapped over. Phew!

I tell you, I learn something new every day. So, we now have two programs offering two different ways of storing and manipulating our Morse code symbols — how do they compare?

Index

Comparison (code size and execution speed)
The easiest comparison to make is that of code size and the amount of memory required to store any global variables, because this data is automatically reported at compile time.

Not surprisingly, storing our dot-and-dash symbol data as bytes consumed less space for the global variables as illustrated in the table below.

The interesting point to me is that the “storing as bytes” approach also resulted in a significant reduction in code size, even though it involves more processing when it comes to extracting the count (number of symbols) value and then extracting the dots and dashes one at a time. I'm not sure, but I'm assuming this is due to the fact that the “storing as strings” technique involves a lot of pointer de-referencing. Having said this, I really don’t have a clue, and I would welcome your thoughts on this matter.

Now, as we previously noted, transmitting Morse code a human-readable speeds results in the processor sitting around twiddling its metaphorical thumbs for a large portion of the time (observe the delay() function calls scattered throughout the two programs like confetti).

On this basis, we really don’t care how efficient our two approaches are. On the other hand, it's always interesting to know the nitty-gritty facts, so I decided to investigate. First of all, I commented out all of the delay() function calls, because these would totally overwhelm the amount of time spent doing the real work.

Next, I declared two unsigned long variables called “time1” and “time2”, and then I modified the loop() function in both programs to read as follows:

void loop() {
 time1 = micros();
 displayThis("Hello World");
 time2 = micros();

  Serial.print("Time = ");
  Serial.println(time2 - time1);
}

FYI, I'm using an Arduino Uno for this project and micros() is a built-in function provided by the Arduino IDE. This function returns the number of microseconds since the Arduino began running the current program. This number will overflow and return to zero after approximately 70 minutes, which is more than enough time for my proposes. Also, in 16 MHz Arduinos like the Uno, this function has a resolution of four microseconds (i.e., the value returned is always a multiple of four).

I started by comparing my original “Hello World” strings, but then I thought that this really wasn't fair because some of the time was being spent converting lowercase characters to their uppercase equivalents, which isn’t really the point of the exercise. Thus, I also ran the comparison using strings comprising all of the alphanumeric characters:

  ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789

And, just to make sure, I performed a third comparison using a longer string comprising two copies of all the alphanumeric characters. The results of all these comparisons are shown in the table below.

As we see, the “storing as bytes” approach ends up being faster than its “storing as strings” equivalent, even though it involves more processing when it comes to unpacking the count and symbol data. Once again, I'm assuming that this is due to the fact that the “storing as strings” technique involves a lot of pointer de-referencing. And, once again, I really don’t have a clue, and I would welcome your thoughts on this matter.

Now, with regard to the other techniques folks suggested — like the binary tree and the run-length encoding — it would be great if you wanted to take my original “store as strings” program as the basis and modified it to use these other techniques. If you then emailed me your code at , I could run it on my Arduino and compare it against the two schemes discussed here.

In the meantime, should I go with “storing as strings” approach because this is easier to understand and maintain, or should I opt for the “storing as bytes” technique on the basis that the end user doesn’t need to know what's going on under the hood, and this method is slightly faster and — more importantly — uses less memory?

Index

25 thoughts on “Storing Morse code in C — A comparison of techniques

  1. “MaxnnLike you, I am a self taught C programmer. No doubt there are many flaws in my approach, but what I would have done would be to create two arrays. The first contains the total number of dots and dashes for each character. The second would be the bi

    Log in to Reply
  2. “I agree — if we did it as two byte arrays (two bytes per character), we'd end up using twice the storage of one byte-per character — and the only thing we'd save would be extracting the count value.nnI'd be interested to see the results of using a str

    Log in to Reply
  3. “I followed a scheme described in a late 70's or early 80's issue of Ham Radio Magazine.nnEach Morse character is a byte. In the byte, a '1' is a dash and a '0' is a dot. An extra '1' is placed in the next higher-order bit, and when the register is equ

    Log in to Reply
  4. “”Each Morse character is a byte. In the byte, a '1' is a dash and a '0' is a dot. An extra '1' is placed in the next higher-order bit, and when the register is equal to one, you know the character is complete.”nnI'm not sure I'm following you — can y

    Log in to Reply
  5. “I was in a hurry… After reading the LSB of the byte to determine dot or dash, do a right shift so that the next part of the character can be read. Keep shifting until the register is equal to 1, then read the next character and repeat. nnnn”

    Log in to Reply
  6. “Hi Max, Aubrey and all,nWell I have had doubts about Max for some time but FLASHER was not one of them?nLike Aubrey I have been self taught in code, the difficulty I am having at the moment, is why for such a small look up table Max is so worried about

    Log in to Reply
  7. “”I am having at the moment, is why for such a small look up table Max is so worried about packing the table in a modern micro.”nn'Worried' is too strong a word — I'm an anal retentive — I just want to make my code as small and fast as possible, even

    Log in to Reply
  8. “MaxnnLike yourself, Aubrey & Crusty, I'm also a self taught C Programmer. I first learned C over 20 years ago doing a correspondence course and obtained a diploma for my effort. However it was probably only 3-4 years ago I actually started to use C with

    Log in to Reply
  9. “You can easily have both the maintainability of strings and the compact byte form.nnWrite a program using strings to generate the byte table.nnAs for the (dNdByte & B00000001 == 0) issue… always always, always compile with warnings enabled. gcc -Wal

    Log in to Reply
  10. “The mention of repeater controllers reminded me that I have access to repeater controller code which includes a Morse sender so I took a look last night. The encoding scheme is very similar to the byte encoding that Max used except that the count is in th

    Log in to Reply
  11. “Thanks Elizabeth. Re the fact that I coded my count in the MSBs as:nncccddddd (with the d bits flipped so that the first dot/dash is in the LSB)nnThis does mean that I have to shift right by 5 bits to extract the count.nnI did think of coding the c

    Log in to Reply
  12. “Also, I've been trying to contact you to send you a GPAK4 dev kit — but you haven't responded to my emails — maybe they are going into your SPAM folder — can you call me on 256-319-0257”

    Log in to Reply
  13. “Hi Clive.nThe increase in program data size is most likely (not having studied your linker and startup assembler code) that the initial values for global arrays are stored in the code segment. The startup code loads the .data segment with these values fr

    Log in to Reply
  14. “Very interesting — thanks for the info — I must admit I was quite proud that the “store as strings” one worked at all because I'm a bit weak on pointers and stuff.”

    Log in to Reply
  15. “Hi again.nnSome comments on your code:n* In function displayThis where you change to upper case, you do it on the input string. This input could (should) be const and it is not good practice to alter a string literal. On your platform it works as the s

    Log in to Reply
  16. “The real fun is going to start when you want to get real work done with the processor while the Morse code is sending. (See the BlinkWithoutDelay Arduino example.)”

    Log in to Reply
  17. “The other way to set things up so it is easy to read is to use macrosnn#define mlength(a) ((a) shl 5)n#define mchar1(a) (mlength(1) | ((a) shl 0))n#define mchar2(a, b) (mlength(2) | ((a) shl 0) | ((b) shl 1))n…nnThen you havennconst unsigned

    Log in to Reply
  18. “My point was that your encoding scheme was very similar to one devised by an experienced software guy. I'd send you the code but the implementation depends on a RTOS that he also wrote so it would require a bit of fiddling to compile for an Arduino. Besid

    Log in to Reply
  19. “Thanks for the point about converting the string to uppercase — I guess I did this thinking of future developments in which someone might provide the input string “on-the-fly””

    Log in to Reply
  20. “Re your implementation of the encoding Iconoc mentioned — I don't think I'd really wrapped my brain around what he/she was saying until I saw your implementation — very tasty — plus I learned a lot just looking at how you write your code.”

    Log in to Reply
  21. “Its funny you should mention this — I was just looking at the Blink Without Delay example the other day — I may well end up moving to this if I start to do multiple things while flashing (if you see what I mean)”

    Log in to Reply
  22. “Hi David — I hadn't really wrapped my brain around this until I saw a somewhat related implementation coded by Nlklas_e (see below). I'm up to my armpits in alligators at the moment, but when I get some spare time I plan on coding both of these to see ho

    Log in to Reply
  23. “Max…I don't know how this would go in the code size/speed stakes but I would thing that the method proposed by Wnderer on the EETimes version of your original blog is worth a try. here's a link:nnhttp://www.eetimes.com/messages.asp?piddl_msgthread

    Log in to Reply

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.