Lex and Yacc for Embedded Programmers - Embedded.com

Lex and Yacc for Embedded Programmers

Not every firmware development tool was originally designed for that purpose. Unix's lex and yacc, for example, can be used to generate some of your code automatically.

Lex and yacc are free development tools that help you write software in C or C++ that interprets or transforms structured input. Applications that interpret structured input can range from a simple text search to a complete compiler, but also include embedded applications such as serial protocol and message interpreters. If you're developing an application that manipulates ASCII data, lex and yacc could save you numerous development headaches. This article will show you when and how to use these tools.

In a program that interprets structured ASCII input, two tasks are typically performed repeatedly. First, the input must be converted from a stream of individual characters into a sequence of meaningful tokens. This operation is called scanning . Lex is a useful tool for generating scanners. Once you've acquired the basic knowledge needed to use lex, you'll find that the rule input required for a lexical scanner is much easier to write than the equivalent C code. Lex also automatically checks your scanner specification for a variety of common errors.

The second step in interpreting structured input is to identify patterns in the sequence of tokens found by the scanner. This process is known as parsing . Yacc can generate parsers to perform this pattern matching. (In the '70s, there was no shortage of groups writing parser generation utilities, which helps explain where the name yacc, which stands for “yet another compiler compiler,” came from.) As with scanners, parsers are manually produced all the time, but yacc-generated parsers can result in more robust code, less debugging, and input files (yacc grammars) that are easy to read and maintain.

Figure 1 shows the flow of control and data in a typical scanner/parser application. The data stream is reduced to tokens by the scanner and the tokens accumulate in the parser, which identifies valid combinations of tokens.


Figure1: Execution path and data flow in a typical lex/yacc derived parser

The development procedure for a lex-/yacc-based application is illustrated in Figure 2. Many simple applications, such as a text search, only require finding tokens in an input stream. Such applications can be created using lex alone. Similarly, where the arrangement of tokens in an input stream is simple, a simple hand-coded parser may be adequate. However, as the possible permutations and combinations of tokens increase, we typically need to direct the output of the scanner (stream of tokens) to the input of a separate yacc-generated parser.

Before learning how to actually use lex and yacc, let's ask what we want to use them for and why. As I mentioned earlier, the variety of embedded applications for lexical scanners and parsers is extensive. We can break down the possibilities into two broad areas: development tools and target code.

By development tools I mean custom applications that reside on the development host and help you produce run-time code, just like any other compiler or code generation tool you may use. Target code means C or C++ code that we ultimately compile and link with the rest of the application for execution on our target hardware. The focus of this article is using lex and yacc to generate target code.

Embedded Uses
Serial protocols are a common component of embedded software systems. Perhaps a system must provide communications between two separate microcontrollers via a serial port. The use of a bi-directional serial protocol requires the presence of a protocol interpreter at both ends of the interface, which can be created with lex and yacc.


Figure 2: Development sequence when using lex and yacc

Another place where we can use lex and yacc is in the interpretation of structured messages between two software components. Structured messages are common in multi-tasking environments, where a block of data must be passed from one task to another via some sort of OS service, such as message queues.

Consider a case where two independent tasks wish to send and receive blocks of data. If both tasks reside on the same hardware and are compiled using the same compiler (with the same settings), it is convenient to simply stuff a struct or object into a message (in other words, a block of memory) and pass this message to the receiving task. Assuming the recipient knows what to expect, it can simply access the data using a pointer of the appropriate type.

In many cases, though, tasks don't want to assume that the recipient stores a data structure in the same way the sender does. This may be due to portability requirements or to the fact that the tasks are part of a distributed application and are running on different hardware. In such cases, it's common for a message to be provided that describes the data in an agreed format rather than passing the binary. Where this is done, you need to parse the description at the receiving end.

Many embedded systems require configuration data, which is commonly stored in a flash filesystem. This configuration data is typically stored within a text file that is parsed at run-time and used to populate a database with values affecting the subsequent behavior of the application. I use the term database in the loosest sense. The data storage may comprise anything from a couple of variables to a large and complex set of data structures.Let's look at a couple of examples to see how it can be done. There are many versions of lex and yacc available for a variety of platforms. For the examples below I have used GNU's flex 2.5 and bison 1.25.

String Processing
Figure 3 shows a motor control system that consists of three key components:

  • An electric servomotor for accurate speed and position control of a driven axis
  • A digital servomotor control system (drive system) to perform the analog control of the motor
  • A microcontroller for supervisory control of the drive electronics and data acquisition from the drive


Figure 3: A motor controller and supervisory micro-controller configuration the drive

Firmware running on the microcontroller sends commands to and receives responses from the drive system by means of an RS-232 cable. We're only concerned here with one component of this data flow: the data coming in from the drive. This data takes the form of a stream of characters; a lex-generated scanner interprets it and generates messages that can be used internally within the application. We'll assume we're looking for just one message:

“sv,

which represents the present shaft rotational velocity in arbitrary units. It may be generated in response to some microcontroller output or asynchronously by the drive system. This doesn't affect how the scanner interprets the data. Due to its simplicity, this application won't require the use of yacc. Instead of returning tokens to a parser, we'll do everything within our scanner function and generate messages for the application.The procedure for producing our scanner is as follows:

  • Write a lex specification file describing what we want our scanner to do.
  • Compile our specification using lex.
  • Compile and link our generated C or C++ scanner module with the rest of our firmware.

What do we want our scanner to do? Our scanner function will be called as required by the application (maybe by polling it or by generating an event when data is available for it to process). It will read from its input repeatedly until it:

  • Encounters an internal error (bad data, for example)
  • Returns a valid message

Although the lex specification format is fundamentally simple, like many tools it has enough options to occupy you for a long time, should you be so inclined. The specification in Listing 1 is divided into three sections as follows:

definitions%%rules %%user code

Listing 1 Barebones lex specification for simple serial protocol interpreter

/* definitions */

digit[0-9]
DoubleValue [+-]?{digit}{1,10}”.”{digit}{1,10}([eE][+-]?{digit}{1,2})?

%%

/* rules */

sv,{MessageStruct.msg=SHAFT_VELOCITY;}
{DoubleValue}{MessageStruct.Value=strtod(yytext,NULL);
returnVALID_INPUT;}
.{returnINVALID_INPUT;}

%%

/* user code */

/*Implementationofinputfunction*/
intpipe_yyinput(char*buf,intmax_size)
{
/*Readfromserialporttogetinput*/
/*Returnnumberofcharsreceived*/
}
/* ————————————— */

Each section is separated by a single line containing only the string %% .In Listing 1, the only item in the definitions section is an alias for a complex regular expression. You can use aliases to avoid duplicating a complex expression throughout the rules section. The DoubleValue alias describes the acceptable format of a double-precision decimal value. Lex supports a large regular expression syntax that is documented in the companion manuals.The second section is the rules section. This is the business end of the lex specification and describes most of the functionality of the scanner. The rules section is—you guessed it—a list of rules. Each rule consists of an input pattern to be matched and an action to be taken when and if it is. The input patterns can be literal characters, aliases from our definitions section (enclosed in braces), or a combination of both.

The user code contains any additional code we want to insert at the bottom of the generated source file. In the example, I have included an input function to read from the serial port. Some additional definitions are required to enable this function but I've omitted them for clarity.

By default, a lex-generated scanner gets its input using a global file pointer that is unsuitable for applications where file I/O is unavailable or unnecessary. This is why I have redefined the input function here. Execution proceeds as follows:

  • The scanner begins by calling the input function.
  • If no input is received, it returns 0 (end-of-file), otherwise it attempts to match the input to its patterns.
  • If “sv,” is matched, the corresponding action is executed. This action stores an enumerated type in a message struct to represent the message received.
  • The scanner returns VALID_INPUT (a predefined constant) to the calling function. The calling function can now interpret the message structure contents.
  • When the double precision value is received, it is copied into the message structure. The scanner stores the currently matched pattern temporarily in an array, yytext, which is used to obtain the value from the input stream and copy it into the message structure.

If at any stage an unexpected input is received (something that cannot be matched), it is matched by the default rule, which is represented by the period (.). In this case, we return the fact that the input is invalid to the calling function. The scanner will repeatedly call the input function until no data is available. If the input function returns 0 (because no further data is available via the serial connection, for example), the scanner will return 0 to the calling function but will maintain its internal state. When it is subsequently called, it will simply try once more to get additional data from the input function.

As already stated, if at any stage the data received is inconsistent with the expected patterns, the default rule is matched and an error is returned. Many aspects of a lex scanner's behavior can be altered by using different options (in both the lex specification and on the command line) and the resulting scanner can be a very different beast. To see all the specific options required in this example, download the code and notes from ftp://ftp.embedded.com/pub/2003/03power.

Now that we have a lex specification, we can use lex to compile it and produce the C/C++ scanner source.

The command line for the example is:

flex lex_spec.txt

for a description file named lex_spec.txt . To use the scanner in your program, call the scanner function int yylex(void) as required.

Configuration File Parsing
Figure 4 illustrates the parsing of a simple configuration file. The configuration file is designed to hold a list of a variable number of image descriptions for a device's GUI. These image descriptions define:

  • The image filename
  • The x and y coordinates at which to display the image within an application window on the screen of our device


Figure 4: Data flow in a configuration file parser application
Click here for a larger version of this image

When the firmware initializes, we want it to read the configuration file from flash, parse it, generate a simple table or database of the data contained within, and display the images.For this application, we'll use the lex-generated scanner in its classical role as token generator. These tokens will then pass to the yacc-generated parser for interpretation.The procedure for producing both the scanner and the parser is:

  • Write a lex specification file and a yacc grammar file (this is what we call a yacc specification) describing what we want our scanner and parser to do.
  • Compile the input specifications using lex and yacc.
  • Compile and link the generated C or C++ modules with the rest of the application.

We've already seen how to write a lex description so we'll only concern ourselves here with the yacc grammar introduced with this particular application. Listing 2 shows the lex specification and should be understandable now. Listing 3 shows the yacc grammar file.

Listing 2: Lex specification for a simple configuration file scanner

/* definitions */

%{
#include “y_tab.h”
/* Define loop counter */
int iCount;
%}

/* Definition of some aliases for use below */
whitespace[ tr]
digit[0-9]
longVal[+-]?{digit}{1,10}
stringVal[_./-a-zA-Z0-9]{1,50}

%%

/* grammar rules */

[image]{returnDEFINE_IMAGE;}
[start_cfg_file]{returnSTART_CONFIGURATION;}
[end_cfg_file]{returnEND_CONFIGURATION;}
“{stringVal}”{for(iCount=0;iCount<>
{
yylval.stringValue[iCount]=yytext[iCount];
}
yylval.stringValue[yyleng]='';
returnSTRING;
}
{longVal}{
yylval.longValue=strtol(yyext,NULL,10);
returnLONG;
}
{whitespace}{;/*DoNothing*/}
.{returnBAD_INPUT;}//Error,nomeaningfulinput

>{yyterminate();}

%%

/* user code */

Listing 3: A yacc grammar file for a simple configuration file parser

/* definitions */

%{
/* prototype for lexer function */
int yylex(void);
/* prototype for error handling function */
int yyerror(char * pErrorMsg);
%}

/* Type definition for terminal tokens (tokens received from scanner) */
%union
{
charstringValue[51];
longlongValue;
}

/* Definition of terminal TOKENS to be provided by lexer */
%token DEFINE_IMAGE
%token START_CONFIGURATION
%token END_CONFIGURATION
%token BAD_INPUT
%token STRING
%token LONG

%%

/* grammar rules */

Configuration:START_CONFIGURATION
ListOfImages
END_CONFIGURATION
{
return0;
/*ConfigurationFileParsingComplete.*/
}

ListOfImages:ImageDefinition|ListOfImagesImageDefinition
{
/*DoNothingHere.Thisruleisprovidedtoallowavariablelength*/
/*listofimagestobedefined*/
}

ImageDefinition:DEFINE_IMAGE
STRING/*Filename*/
LONG/*xco-ordinate*/
LONG/*yco-ordinate*/
{
/*Storedata*/
}

BadInput:BAD_INPUT{return1;}

%%

/* user code */

intyyerror(char*pErrorMsg)
{
return1;/*Recoverynotattempted*/
}

The yacc grammar format is similar to the lex specification. (Actually, yacc came first and lex borrowed the format.) The grammar is divided into three sections:

definitions
%%
grammar rules
%%
user code

The first item in the declarations section of a yacc grammar is a body of C code that is contained inside the %{ and %} patterns. This code is copied verbatim to the top of the generated parser module.

The yyerror() function is called by the parser whenever an internal error is encountered; this function must be provided by the developer. A typical error would be where a valid token is provided by the scanner, but the token is not in a valid position in the token stream to allow a rule to be matched.

Also in the declarations section are the terminal token declarations. Terminal tokens (in uppercase) are the tokens returned by the scanner to the parser. Some of these tokens have only a token type such as the token DEFINE_IMAGE in the example. But consider the token LONG . This token has a type and an associated semantic value; it would be useless if it did not have a value. The semantic values of the terminal tokens are passed to the parser by the scanner using the yylval union. The possible data types for the token semantic values are defined with the %union statement.

The second section is the grammar rules section. This section contains a variety of legal token combinations. Not only are terminal tokens used, but additional non-terminal tokens are also employed (in lowercase) to produce the completed grammar. Non-terminal tokens are internal to the parser and are not known to the scanner. As with the scanner, when grammar rules are matched, the corresponding actions and the C code within are executed.

The execution of the parser/scanner combination is as follows:

  • The application opens the configuration file for reading using the appropriate system calls, and the scanner global file pointer yyin is assigned to the file pointer of the configuration file. The application then calls the parser function yyparse().
  • The parser calls the scanner's yylex() function repeatedly until an error is encountered, the end-of-file is reached, or a parser grammar rule is matched. The scanner returns terminal tokens and their semantic values to the parser. The semantic values are obtained from the scanner's yytext array and placed in the appropriate yylval union member.

For example, the scanner, upon receipt of a longVal pattern will execute the relevant scanner action. This action converts the pattern string (yytext) to a long integer variable and stores this value in the yylval union for the parser. The scanner then returns the LONG terminal token. Upon receipt of the LONG token, the parser knows that this token has a semantic value associated with it, as well as which union member to access (thanks to the token declaration). The token is pushed onto the parser's internal stack, and the scanner is called again for the next token.

When the parser matches the image definition rule, the relevant action is called and the image data is added to the internal application database by the action code (not shown). The parser also matches the recursive ListOfImages rule because this rule can match one or more image definitions. This allows the parser to match a variable number of image definitions prior to receipt of the END_CONFIGURATION token.

When the END_CONFIGURATION token is received, the parser matches the Configuration rule and returns to the caller. The configuration file processing is complete.

If at any stage the scanner returns the BAD_INPUT token (in response to some input that the scanner cannot tokenize), the BadInput rule is matched by the parser, and the parser aborts (returns 1). As mentioned earlier, the receipt of valid tokens that do not match any grammar rules results in the parser calling yyerror() and passing a string describing the error encountered.

To compile the yacc grammar and produce the C parser source, the command line is:

bison -y -d yacc_gmr.txt

for a grammar file named yacc_gmr.txt . Like lex, yacc has many command line options.

Although lex generally produces scanners that are faster than hand-coded equivalents, the parsers generated by yacc are typically slower than a well-designed, hand-coded equivalent. Therefore, you may want to avoid yacc if the application requires extremely fast performance. Considering that the embedded realm encompasses vastly varying hardware types and capabilities, there are cases for and against the use of automatically generated scanners and parsers. While lex and yacc are proven tools, they are not validated to any particular quality standard. For this reason, they are not suitable, in my opinion, for safety critical systems. At the end of the day, the question of suitability is one that only the individual developer can answer.

Flavors Of Lex And Yacc
Although lex and yacc are generally considered to be UNIX utilities, versions exist for a variety of platforms. In the unlikely case that a version does not exist for your particular development OS, you're still not in bad shape. Cross-compiling with lex and yacc is straightforward. The source code output from both tools can be compiled with any ANSI C or C++ compiler, so it's not necessary to use lex and yacc on the same platform as your compiler. The generated output has never presented me with any compilation problems under the variety of compilers I have used.

In fact, the output is typically more portable than the input files. For instance, if you attempt to port your flex specification to AT&T lex (one of many UNIX lex versions), some problems may arise due to differences between the two. Yacc grammars however, are usually easy to port between different yacc versions.

As mentioned earlier, I have used flex 2.5 and bison 1.25 as my tools of choice. Like so many areas of software development, the subject of which lex and yacc variants to use can be an emotional one. The primary reasons for my choice are listed below, but bear in mind that many other versions may be equally suitable for your needs:

  • Flex and bison provide the convenience (for me at least) of running from a DOS box under Windows (although they are not the only versions to do so).
  • Flex is considered a particularly stable version of lex and also generates faster code than some other versions.
  • If required, flex can be used to generate C++ scanner classes.
  • Bison v. 1.24 and above do not restrict the inclusion of bison generated parsers in commercial software distributions. (Previous versions did so.) Similarly, there are no restrictions on the distribution of flex-generated scanners.

One limitation of bison is its inability to generate C++ output. This has not been a major problem for me as it is not difficult to hide one or two C modules in a C++ application. Incidentally, there are more recent versions of flex and bison available that generate C++ code. They are called flex++ and bison++. You can download flex and bison (and lots of other useful software) from www.gnu.org. Documentation is provided with each distribution.

I hope that this article has given you an interest in lex and yacc. I have been using these tools regularly for the last three years and they have undoubtedly eliminated many weeks of coding, debugging, and maintenance. If you would like complete specification files for the examples above, these can be downloaded from ftp://ftp.embedded.com/pub/2003/03power.

Liam Power is a senior design engineer with Embedded Labs Ltd., an embedded systems development company based in Waterford, Ireland. He specialises in automotive applications and can be contacted at .

Further reading
Levine, John R., Tony Mason and Doug Brown. “lex & yacc.” Sebastopol, CA: O'Reilly, February 1995.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.