Lex and Yacc for Embedded Programmers
Not every firmware development tool was originally designed for that purpose. Unix's lex and yacc, for example, can be used to generate some of your code automatically.
Lex and yacc are free development tools that help you write software in C or C++ that interprets or transforms structured input. Applications that interpret structured input can range from a simple text search to a complete compiler, but also include embedded applications such as serial protocol and message interpreters. If you're developing an application that manipulates ASCII data, lex and yacc could save you numerous development headaches. This article will show you when and how to use these tools.
In a program that interprets structured ASCII input, two tasks are typically performed repeatedly. First, the input must be converted from a stream of individual characters into a sequence of meaningful tokens. This operation is called scanning . Lex is a useful tool for generating scanners. Once you've acquired the basic knowledge needed to use lex, you'll find that the rule input required for a lexical scanner is much easier to write than the equivalent C code. Lex also automatically checks your scanner specification for a variety of common errors.
The second step in interpreting structured input is to identify patterns in the sequence of tokens found by the scanner. This process is known as parsing . Yacc can generate parsers to perform this pattern matching. (In the '70s, there was no shortage of groups writing parser generation utilities, which helps explain where the name yacc, which stands for “yet another compiler compiler,” came from.) As with scanners, parsers are manually produced all the time, but yacc-generated parsers can result in more robust code, less debugging, and input files (yacc grammars) that are easy to read and maintain.
Figure 1 shows the flow of control and data in a typical scanner/parser application. The data stream is reduced to tokens by the scanner and the tokens accumulate in the parser, which identifies valid combinations of tokens.
Figure1: Execution path and data flow in a typical lex/yacc derived parser
The development procedure for a lex-/yacc-based application is illustrated in Figure 2. Many simple applications, such as a text search, only require finding tokens in an input stream. Such applications can be created using lex alone. Similarly, where the arrangement of tokens in an input stream is simple, a simple hand-coded parser may be adequate. However, as the possible permutations and combinations of tokens increase, we typically need to direct the output of the scanner (stream of tokens) to the input of a separate yacc-generated parser.
Before learning how to actually use lex and yacc, let's ask what we want to use them for and why. As I mentioned earlier, the variety of embedded applications for lexical scanners and parsers is extensive. We can break down the possibilities into two broad areas: development tools and target code.
By development tools I mean custom applications that reside on the development host and help you produce run-time code, just like any other compiler or code generation tool you may use. Target code means C or C++ code that we ultimately compile and link with the rest of the application for execution on our target hardware. The focus of this article is using lex and yacc to generate target code.
Embedded Uses
Serial protocols are a common component of embedded software systems. Perhaps a system must provide communications between two separate microcontrollers via a serial port. The use of a bi-directional serial protocol requires the presence of a protocol interpreter at both ends of the interface, which can be created with lex and yacc.
Figure 2: Development sequence when using lex and yacc
Another place where we can use lex and yacc is in the interpretation of structured messages between two software components. Structured messages are common in multi-tasking environments, where a block of data must be passed from one task to another via some sort of OS service, such as message queues.
Consider a case where two independent tasks wish to send and receive blocks of data. If both tasks reside on the same hardware and are compiled using the same compiler (with the same settings), it is convenient to simply stuff a struct or object into a message (in other words, a block of memory) and pass this message to the receiving task. Assuming the recipient knows what to expect, it can simply access the data using a pointer of the appropriate type.
In many cases, though, tasks don't want to assume that the recipient stores a data structure in the same way the sender does. This may be due to portability requirements or to the fact that the tasks are part of a distributed application and are running on different hardware. In such cases, it's common for a message to be provided that describes the data in an agreed format rather than passing the binary. Where this is done, you need to parse the description at the receiving end.
Many embedded systems require configuration data, which is commonly stored in a flash filesystem. This configuration data is typically stored within a text file that is parsed at run-time and used to populate a database with values affecting the subsequent behavior of the application. I use the term database in the loosest sense. The data storage may comprise anything from a couple of variables to a large and complex set of data structures.Let's look at a couple of examples to see how it can be done. There are many versions of lex and yacc available for a variety of platforms. For the examples below I have used GNU's flex 2.5 and bison 1.25.
String Processing
Figure 3 shows a motor control system that consists of three key components:
- An electric servomotor for accurate speed and position control of a driven axis
- A digital servomotor control system (drive system) to perform the analog control of the motor
- A microcontroller for supervisory control of the drive electronics and data acquisition from the drive
Figure 3: A motor controller and supervisory micro-controller configuration the drive
Firmware running on the microcontroller sends commands to and receives responses from the drive system by means of an RS-232 cable. We're only concerned here with one component of this data flow: the data coming in from the drive. This data takes the form of a stream of characters; a lex-generated scanner interprets it and generates messages that can be used internally within the application. We'll assume we're looking for just one message:
which represents the present shaft rotational velocity in arbitrary units. It may be generated in response to some microcontroller output or asynchronously by the drive system. This doesn't affect how the scanner interprets the data. Due to its simplicity, this application won't require the use of yacc. Instead of returning tokens to a parser, we'll do everything within our scanner function and generate messages for the application.The procedure for producing our scanner is as follows:
- Write a lex specification file describing what we want our scanner to do.
- Compile our specification using lex.
- Compile and link our generated C or C++ scanner module with the rest of our firmware.
What do we want our scanner to do? Our scanner function will be called as required by the application (maybe by polling it or by generating an event when data is available for it to process). It will read from its input repeatedly until it:
- Encounters an internal error (bad data, for example)
- Returns a valid message
Although the lex specification format is fundamentally simple, like many tools it has enough options to occupy you for a long time, should you be so inclined. The specification in Listing 1 is divided into three sections as follows:
definitions%%rules %%user code
Listing 1 Barebones lex specification for simple serial protocol interpreter
/* definitions */
digit[0-9]
DoubleValue [+-]?{digit}{1,10}”.”{digit}{1,10}([eE][+-]?{digit}{1,2})?
%%
/* rules */
sv,{MessageStruct.msg=SHAFT_VELOCITY;}
{DoubleValue}{MessageStruct.Value=strtod(yytext,NULL);
returnVALID_INPUT;}
.{returnINVALID_INPUT;}
%%
/* user code */
/*Implementationofinputfunction*/
intpipe_yyinput(char*buf,intmax_size)
{
/*Readfromserialporttogetinput*/
/*Returnnumberofcharsreceived*/
}
/* ————————————— */
Each section is separated by a single line containing only the string %% .In Listing 1, the only item in the definitions section is an alias for a complex regular expression. You can use aliases to avoid duplicating a complex expression throughout the rules section. The DoubleValue alias describes the acceptable format of a double-precision decimal value. Lex supports a large regular expression syntax that is documented in the companion manuals.The second section is the rules section. This is the business end of the lex specification and describes most of the functionality of the scanner. The rules section is—you guessed it—a list of rules. Each rule consists of an input pattern to be matched and an action to be taken when and if it is. The input patterns can be literal characters, aliases from our definitions section (enclosed in braces), or a combination of both.
The user code contains any additional code we want to insert at the bottom of the generated source file. In the example, I have included an input function to read from the serial port. Some additional definitions are required to enable this function but I've omitted them for clarity.
By default, a lex-generated scanner gets its input using a global file pointer that is unsuitable for applications where file I/O is unavailable or unnecessary. This is why I have redefined the input function here. Execution proceeds as follows:
- The scanner begins by calling the input function.
- If no input is received, it returns 0 (end-of-file), otherwise it attempts to match the input to its patterns.
- If “sv,” is matched, the corresponding action is executed. This action stores an enumerated type in a message struct to represent the message received.
- The scanner returns VALID_INPUT (a predefined constant) to the calling function. The calling function can now interpret the message structure contents.
- When the double precision value is received, it is copied into the message structure. The scanner stores the currently matched pattern temporarily in an array, yytext, which is used to obtain the value from the input stream and copy it into the message structure.
If at any stage an unexpected input is received (something that cannot be matched), it is matched by the default rule, which is represented by the period (.). In this case, we return the fact that the input is invalid to the calling function. The scanner will repeatedly call the input function until no data is available. If the input function returns 0 (because no further data is available via the serial connection, for example), the scanner will return 0 to the calling function but will maintain its internal state. When it is subsequently called, it will simply try once more to get additional data from the input function.
As already stated, if at any stage the data received is inconsistent with the expected patterns, the default rule is matched and an error is returned. Many aspects of a lex scanner's behavior can be altered by using different options (in both the lex specification and on the command line) and the resulting scanner can be a very different beast. To see all the specific options required in this example, download the code and notes from ftp://ftp.embedded.com/pub/2003/03power.
Now that we have a lex specification, we can use lex to compile it and produce the C/C++ scanner source.
The command line for the example is:
flex lex_spec.txt
for a description file named lex_spec.txt . To use the scanner in your program, call the scanner function int yylex(void) as required.
Configuration File Parsing
Figure 4 illustrates the parsing of a simple configuration file. The configuration file is designed to hold a list of a variable number of image descriptions for a device's GUI. These image descriptions define:
- The image filename
- The x and y coordinates at which to display the image within an application window on the screen of our device
Figure 4: Data flow in a configuration file parser application
Click here for a larger version of this image
When the firmware initializes, we want it to read the configuration file from flash, parse it, generate a simple table or database of the data contained within, and display the images.For this application, we'll use the lex-generated scanner in its classical role as token generator. These tokens will then pass to the yacc-generated parser for interpretation.The procedure for producing both the scanner and the parser is:
- Write a lex specification file and a yacc grammar file (this is what we call a yacc specification) describing what we want our scanner and parser to do.
- Compile the input specifications using lex and yacc.
- Compile and link the generated C or C++ modules with the rest of the application.
We've already seen how to write a lex description so we'll only concern ourselves here with the yacc grammar introduced with this particular application. Listing 2 shows the lex specification and should be understandable now. Listing 3 shows the yacc grammar file.
Listing 2: Lex specification for a simple configuration file scanner
/* definitions */
%{
#include “y_tab.h”
/* Define loop counter */
int iCount;
%}
/* Definition of some aliases for use below */
whitespace[ tr]
digit[0-9]
longVal[+-]?{digit}{1,10}
stringVal[_./-a-zA-Z0-9]{1,50}
%%
/* grammar rules */
[image]{returnDEFINE_IMAGE;}
[start_cfg_file]{returnSTART_CONFIGURATION;}
[end_cfg_file]{returnEND_CONFIGURATION;}
“{stringVal}”{for(iCount=0;iCount<>
{
yylval.stringValue[iCount]=yytext[iCount];
}
yylval.stringValue[yyleng]='