Designing data-centric software


A fresh perspective can be a good thing. As Nobel Prize-winning physicist Richard Feynman often noted, thinking about a problem for a long time can make it harder to see the problem from a new perspective, thereby making it harder to arrive at fresh insights into the nature of the problem and optimal solutions. But if we step back and change our perspective slightly, we can see things in a different way and new possibilities may emerge that we didn't see before.

The same potential exists for how we think and carry out embedded software development. A small shift in perspective may yield new ways of seeing and framing our software design challenges. Better yet, this new perspective may provide insights that enable us to be more efficient and more innovative than ever before.

Some studies have shown that the average embedded software engineer has been in this profession for nearly 12 years.1 Regardless of our actual time in the embedded systems trenches, I'm sure we've been thinking about and applying ourselves to hundreds, if not thousands, of programming challenges over the years. The more experienced engineers among us may recall developing embedded systems entirely in assembly language or perhaps a mix of assembly and some early high-level languages such as PL/M. When the ANSI X3J11 committee officially standardized C in the late 1980s and C compilers and cross-compilers emerged for embedded systems development, it was like a breath of fresh air. Finally, here was a high-level procedural language with enough flexibility, control, and efficiency for embedded software development.

What's changed, what hasn't
Only a few years back, embedded systems were largely standalone, disconnected devices with limited user-interface features. By contrast, today's connected, intelligent devices are powered by CPUs with incredible MIPS per watt. Multiple connectivity and storage options are available to share data with a growing variety of other systems, integrated peripheral technology is abundant, and many products must meet requirements for a sophisticated user interface. Perhaps you didn't notice, but the devices we're working on have become application-specific, data-driven computing systems. And despite their large and growing complexity, the applications we create must reliably and efficiently acquire, process, integrate, share, store, retrieve, and initiate actions based on growing volumes of data from an increasing variety of sources both inside and outside the device.

So while embedded systems have changed in some fundamental ways, our approach to software design and implementation, our tools and techniques, mostly have not. We continue framing software design problems and solutions using traditional methods and tools. We treat data management simply as part of the application and hand code data-management functions with the data structures built into the C programming language. We keep trying to solve what is increasingly a data-integration and data-management problem with C-level procedural application logic and tools and without the benefit of a common data-abstraction layer. In effect, what we have today is a domain mismatch between the nature of the problems we're trying to solve and the ways in which we're trying to solve them.

Despite all this, few engineers have had the time or inclination to step back, look at their system architecture and development challenges from a fresh perspective and ask, what are my applications really doing and what is the most appropriate way to solve the problems we have today?

The test
Here's a little test. Think about one of your recent projects and consider how much of the application's code is involved in managing and manipulating data. By this I mean code that is:

  • Defining, initializing, and managing data structures such as look-up tables, arrays, linked lists, indexes, pointers, buffers, and so forth
  • Acquiring, aggregating, and transforming data with such mechanisms as message queues, filters, data logging, fill and stop buffers, circular buffers, device drivers, and so on
  • Using data to drive control and event processes such as PID loops, control loops, events, alarms, and signal analysis

Like many of your colleagues, you may be surprised to find that a large portion (in many cases greater than 50%) of your application's code manages or is driven by data structures. I call this code embedded data-management code.
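
To make the category concrete, here is a minimal sketch of one item from the list above, a hand-coded circular buffer for sensor samples. The names (`ring_t`, `ring_put`, `ring_get`, `RING_CAPACITY`) and the 16-bit sample type are invented for illustration and don't come from any particular project. Notice that every line exists to manage data rather than to implement product features.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 8u  /* hypothetical size, chosen only for the example */

typedef struct {
    uint16_t samples[RING_CAPACITY];
    size_t head;   /* index of the next slot to write */
    size_t tail;   /* index of the oldest unread sample */
    size_t count;  /* number of samples currently stored */
} ring_t;

static void ring_init(ring_t *r)
{
    r->head = 0;
    r->tail = 0;
    r->count = 0;
}

/* Store one sample; returns false (and drops the sample) when the buffer is full. */
static bool ring_put(ring_t *r, uint16_t sample)
{
    if (r->count == RING_CAPACITY) {
        return false;
    }
    r->samples[r->head] = sample;
    r->head = (r->head + 1u) % RING_CAPACITY;
    r->count++;
    return true;
}

/* Remove the oldest sample into *out; returns false when the buffer is empty. */
static bool ring_get(ring_t *r, uint16_t *out)
{
    if (r->count == 0) {
        return false;
    }
    *out = r->samples[r->tail];
    r->tail = (r->tail + 1u) % RING_CAPACITY;
    r->count--;
    return true;
}
```

Multiply this by every queue, table, and index in a typical application and the 50% figure starts to look plausible.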

The data-management audit
I tried a similar analysis. It began when I had my development teams perform a detailed data-management audit of our application code across a diverse range of embedded systems product lines. To do this, we manually analyzed, counted, and categorized the lines of code (by line of code I mean a complete language construct, not just a bracket) in the application layer of our systems. We then added it all up to produce a report like the one in Figure 1. During this audit we found that approximately 50% of our application code was data-management code. What's more, we discovered that nearly 90% of the serious bugs that resulted in schedule delays or product recalls could be traced back to that data-management code.

Figure 1

Surprised? We certainly were. It made us look at data management and what our systems were actually doing in a whole new light. It also made us wonder, if we're going to spend so much of our time, effort, and budget on developing, integrating, debugging, testing, and optimizing data-management code, is there a better way to do it?

Here's another test. Think again about one of your recent projects. What kind of data model did you employ? “Data model,” you ask? Aren't data models and activities like data modeling for the IT folks? They are. But modern embedded devices such as mobile phones, set-top boxes, telematics systems, and networking equipment must integrate and manage large amounts of complex data from disparate sources both inside and outside the device. And they must do so with high performance and reliability, yet with limited system resources and at low cost. Talk about a data-management problem.

The accepted way to solve embedded data-management problems today is to break the data manipulation down into procedural steps and then code up each step manually, perhaps collecting groups of steps into libraries. A sample flow chart for an embedded system may look like that in Figure 2.

Figure 2

For various reasons, the procedural logic for the various state tables, data manipulation routines, and event actuations would be developed and tuned to a specific set of inputs, storage devices, displays, operating system capabilities, and CPU. As a result, the problems get solved, but the application is complex and somewhat “brittle.” Just think of the impact this kind of implementation has on schedules and development costs. And what about application maintainability, reliability, flexibility, and portability? When requirements or hardware specifications change, how quickly can you adapt your applications?

Shifting to a data-centric perspective
OK, so we've tried the old way for years. Now step back and consider what your applications are doing today. You'll likely notice that data is flowing in from one or more sources, is being filtered and aggregated, and is then sent or stored somewhere else or used to drive critical functions. If we consider the results of our code audit and focus on how data is flowing and transforming inside our system, we immediately start to see some interesting patterns in our applications. At a high level it might look like what we see in Figure 3. Some common characteristics regarding the data in our systems include:

  • Data flows in from one or more sources
  • Some data is static and stored, but other data is dynamic and accessed in different ways
  • Data is ordered and structured
  • Data is filtered from one system to another
  • Data must be integrated and shared across multiple tables or applications (or both)
  • Data may have to be stored in different storage media internal and external to the device
  • The amount of data to be managed is growing in both size and complexity
  • Data can be large while the devices operating on that data have limited system resources (usually for cost reasons)

Figure 3

Once we start to think about our systems from this more data-centric perspective, it's much clearer why so much of our application code and so many of our integration issues revolve around data integration and data management. It's also clearer why so many programming experts have long advocated a more data-centric approach. In his timeless book on software development, The Mythical Man-Month, Fred Brooks Jr. argued that data representation is the essence of programming. “Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious.”2 In their popular book, The Elements of Programming Style, Brian Kernighan and P.J. Plauger advised developers to let the data structure their programs.3 What these people and many others realize is the central role that data should play in application design and development.

Data-centric development defined
OK, so you're now thinking about the data flowing through your system. What does it mean to do data-centric development? While there is no single definition, this may help. Instead of treating functions as central and the data structures as support for the functions, think of your data as central. Put data at the center with code in support of it rather than the other way around. In essence, data-centric development emphasizes data (data definitions, structures, manipulations, and relations) over procedural functions and interfaces. Once the “data model” for your system takes shape, the path to software design becomes easier. It also raises the question: what are the most appropriate methods and tools for solving the data-management challenges for my design?
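
To make the shift concrete, consider a small alarm classifier. A function-centric version would bury the thresholds in a chain of if/else statements. The sketch below puts the data first instead: the rules live in a table and the code merely walks it. The rule values, the temperature unit, and all names (`alarm_rule_t`, `alarm_rules`, `classify`) are assumptions made up for this example.

```c
#include <stddef.h>

typedef enum { ALARM_NONE, ALARM_WARNING, ALARM_CRITICAL } alarm_level_t;

/* One row of the alarm table: readings at or above `threshold` raise `level`. */
typedef struct {
    int threshold;        /* assumed unit: degrees Celsius */
    alarm_level_t level;
} alarm_rule_t;

/* The data model. Rows are ordered from highest to lowest threshold.
 * Changing behavior means editing rows, not control flow. */
static const alarm_rule_t alarm_rules[] = {
    { 90, ALARM_CRITICAL },
    { 70, ALARM_WARNING  },
};

/* The supporting code: return the level of the first rule the reading satisfies. */
static alarm_level_t classify(int reading)
{
    for (size_t i = 0; i < sizeof alarm_rules / sizeof alarm_rules[0]; i++) {
        if (reading >= alarm_rules[i].threshold) {
            return alarm_rules[i].level;
        }
    }
    return ALARM_NONE;
}
```

Adding a third alarm level here is a one-line change to the table, which is the kind of payoff Brooks was pointing at: once the table is visible, the flowchart is nearly obvious.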

What now?
Now that we've got a clearer understanding of (and perhaps a fresh perspective and insights into) the underlying challenges of our applications, what now? What are our pragmatic alternatives for turning these data-centric perspectives and insights into a source for innovative engineering breakthroughs? Well, let's look at our alternatives for data-centric development.

Option 1: work a little smarter
Once we realize that a large part of our software's function is to manage data, we can use software-design approaches and various data-abstraction methods to ease application programming and data management. Of course, many developers have realized this and have been working on these kinds of activities for some time. However, home-grown solutions take time to implement and maintain, design cycles are short, application complexity continues to rise, changes come quickly, and the sought-after benefits of data abstraction prove elusive. This elusiveness stems from the fact that developers are still trying to tackle what is largely a data-management problem with procedural thinking, languages and tools. A domain mismatch occurs between the problem (data management) and the solution (traditional methods, programming languages and tools). It's like using a hammer on a large screw. Sure, we can hammer away at it and bash it in, but there are easier and more efficient ways to get the job done.

For those of us trying the “work a little smarter” approach, data isn't effectively abstracted and application programming remains dominated by data-management infrastructure implementation details in C or C++: the daily grind of data structures and the many ways we interact with them. What's more, without extensive proprietary coding, it is hard to realize the benefits of a uniform data model or the data-integrity features provided by a data-management system. Not much bang for your engineering buck, really.

Option 2: an embedded database
Another approach to simplifying data management is to use an embedded database. From a developer's standpoint, databases provide a proven tool for protecting and organizing data so that it can be accessed, managed, and updated more easily.

Database technologies have been around for decades. They are at the heart of many enterprise applications and have proven to scale. They play a key role in simplifying large data-management challenges. There's certainly no shortage of research papers, articles, and books on database theory, design and implementation. There's also a growing body of research and information on the use of databases in embedded designs. Although a complete discussion of database technologies is beyond the scope of this article, given the important role they've played in data management, it's useful to take a quick look at what they do, how programmers interact with them, and examine the major pros and cons relative to their use in embedded systems.

Databases are usually classified according to their organizational approach. The most prevalent approach is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways. A distributed database is one that can be dispersed or replicated among different points in a network. An object-oriented database is one that is congruent with the data defined in object classes and subclasses.

Most well-known databases from Oracle, Microsoft, and IBM, as well as popular open-source databases such as PostgreSQL and MySQL, are relational databases and use a client/server architecture shown in Figure 4. In this model, the database resides on one or more servers and manages (or serves) data for multiple clients. The database itself is defined and accessed with the structured query language, or SQL.

Figure 4

SQL is one of the fundamental building blocks of modern database architecture. It is used to create databases, to define the data model (popularly known as the “schema”), and to access and manipulate data in the database. SQL provides a high-level way to interact with data that's abstracted from the underlying data-management specifics. However, using SQL alone, the database must interpret and process SQL statements one at a time. In a networked environment, this interpretive approach will affect data flow and slow down response time. To improve performance and give developers better programmatic control of their data, Oracle developed a procedural language extension to SQL called PL/SQL. Now a virtual industry standard, PL/SQL allows programmers to mix SQL statements with common procedural constructs such as IF … THEN … ELSE, WHILE, and LOOP for more efficient run-time execution.

As a higher-level, block-structured language designed specifically for data-management, PL/SQL is a domain-specific language for data-management programming. When combined with a data-management system and its associated run-time components, PL/SQL abstracts underlying data-management details and enables developers to express their data model and data-management functions more straightforwardly than with a lower-level procedural language like C. This enables programmers to build sophisticated data-management applications that are largely isolated from platform specifics, and to do so with a fraction of the effort and complexity of C.

Sounds good, eh? But there's a catch. As with so many technologies that weren't originally designed for embedded systems development, you can run into significant problems trying to use them for embedded systems. Consequently, they often require a “force fit” into both the system and the design process.

In the case of database technologies, they typically require far more computing power, memory, and storage than is available in most embedded systems. Even pared-down embedded databases typically require a few hundred kilobytes or a few megabytes to run. In addition, they're designed and tuned to manage static data stored in files. Traditional databases require “administration,” and few device manufacturers can afford to ship an accompanying database administrator with every device. Recognizing these as fundamental limitations, a number of vendors set out to develop embedded database technologies. These offerings generally remove some enterprise-oriented features in order to improve performance, reduce footprint, and improve database “self-sufficiency.”

While not in widespread use, embedded databases are seeing increased use in telecommunications switches and high-end networking equipment that must manage truly large volumes of data for routing tables, configuration information, encryption keys, and more. Embedded databases are also gaining popularity in a number of applications where disk access or sufficient memory is available to accommodate the operating requirements of the database machinery.

For these applications, embedded databases provide a convenient and well-structured means to manage static data in one or more database tables. Data definition and access is typically provided by some form of SQL support plus a vendor-specific application programming interface (API) as shown in Figure 5. These proprietary functions enable developers to configure the database and make queries against it as well as to perform common SQL-type data-management operations such as INSERTs, UPDATEs, and DELETEs.

Figure 5

Although embedded databases have improved performance and functionality, some key architectural issues remain that limit their usefulness in most embedded systems:

  • In the typical model, client-server SQL-database interactions are interpreted, which slows data-management operations to sometimes unacceptable levels; furthermore, having the parsing and interpretation machinery in memory consumes precious memory resources
  • Embedded databases generally provide no model for data sources other than tables, which means you can't use embedded databases directly on arbitrary stream data sources such as those generated by sensors, network connections, and other I/O sources
  • Embedded databases are typically one of two kinds: in-memory only or disk-based only. These discrete approaches limit their usefulness in devices that store data in flash, SIM cards, and other diverse persistent storage media that don't natively behave like a disk
  • Data-integrity features for crash recovery including roll-back and restore are not possible with database technologies that support in-memory operation exclusively
  • Embedded database technologies typically require a multitasking operating system and a file system

From an application development standpoint, there are further considerations and limitations:

  • Most embedded databases have no procedural language and depend on a fixed, server-style API to access the database. For some applications this entails a significant redesign of the application or “force-fit.” At the very least, it means that data-management statements must be included throughout relevant portions of the main application's programming logic and call a proprietary API.
  • Most database interface languages provide limited, if any, means of importing and exporting functions created in other languages. This limits the ability to use legacy functions such as existing filters, transformation algorithms, and more within the new data-management code. Conversely, it limits the ability to export data-management functions written in the new way into existing applications.

As a result of these limitations, embedded databases are not likely to see widespread use in many classes of embedded devices. Still, for certain systems with enough horsepower and memory, they provide useful data management features and other development benefits that would be hard to gain otherwise.


Option 3: data-centric development frameworks
Our third option is a development framework designed for the data-management domain. Recall that we're now looking at our system in terms of dynamic data and the transformations on that data as shown back in Figure 3. To shift to this more data-centric perspective and use it for implementation, we'll need a framework in which to design and develop our solutions. So what tools would be in such a data-centric development framework that we would not find in traditional C/C++/Java development environments?

Let's look at the different elements that would be part of a data-centric application development framework for embedded systems.

Use a data-centric approach

First, we start using a more data-centric design approach: we conceive and define the optimum data model for our design by identifying and rigorously documenting the different data streams into and out of our system and sorting these streams by how they're aggregated, transformed, or filtered. These methods of aggregation and filtering become our operators on these streams of data. At this point we classify these operators as functions or objects and the data structures as tables or classes. The functions and table structures become our “language” for talking about and designing the operation of our application.

Once we have our language, it must be capable of rigorously representing the following four main requirements:

  1. The data structures and data types our application must manipulate
  2. The function operators we'll use to filter, transform, and aggregate this data
  3. The sources and destinations of this data, and
  4. The programmatic interfaces to the data operators or to existing operators we want to use from existing applications (likely built with C or C++)

From this perspective, we've just invented a data-management language or, in the parlance of enterprise application development, a data manipulation language (DML). Incidentally, only two well-known commercial languages (and their derivatives) meet all or most of the four language requirements: PL/SQL and XML. Interestingly, when most people think about data-specific languages they think about SQL. Unfortunately, SQL doesn't address requirements 2, 3, and 4 of our language. Thus, SQL alone will not suffice for our needs, just as traditional databases don't meet the requirements and limitations of most embedded systems. But even without these domain-specific languages and their associated support machinery, we can use a more data-centric approach as we design our application components.
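
Even in plain C, the four requirements can be given a concrete shape. The following is a minimal sketch, not a real data-management language: a data type (requirement 1), operators expressed as function pointers (requirement 2), a source and a sink (requirement 3), and an ordinary C function plugged in as an operator (requirement 4). Every name here (`sample_t`, `pipeline_t`, `pipeline_run`, and the rest) is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Requirement 1: the data type flowing through the system. */
typedef struct { int channel; int value; } sample_t;

/* Requirement 2: an operator may modify the sample in place and
 * returns false to filter it out of the stream. */
typedef bool (*operator_fn)(sample_t *s);

/* Requirement 3: a source yields samples until it returns false;
 * a sink consumes whatever survives the operators. */
typedef bool (*source_fn)(sample_t *out, void *ctx);
typedef void (*sink_fn)(const sample_t *s, void *ctx);

typedef struct {
    source_fn source;        void *source_ctx;
    const operator_fn *ops;  size_t op_count;
    sink_fn sink;            void *sink_ctx;
} pipeline_t;

/* Pull every sample from the source, apply the operators in order,
 * and deliver survivors to the sink. Returns the number delivered. */
static size_t pipeline_run(const pipeline_t *p)
{
    size_t delivered = 0;
    sample_t s;
    while (p->source(&s, p->source_ctx)) {
        bool keep = true;
        for (size_t i = 0; keep && i < p->op_count; i++) {
            keep = p->ops[i](&s);
        }
        if (keep) {
            p->sink(&s, p->sink_ctx);
            delivered++;
        }
    }
    return delivered;
}

/* Requirement 4: existing C functions with a matching signature become operators. */
static bool drop_negative(sample_t *s) { return s->value >= 0; }
static bool scale_by_two(sample_t *s)  { s->value *= 2; return true; }

/* A trivial array-backed source and a summing sink, for demonstration only. */
typedef struct { const sample_t *items; size_t count; size_t next; } array_source_t;

static bool array_source_next(sample_t *out, void *ctx)
{
    array_source_t *src = ctx;
    if (src->next >= src->count) {
        return false;
    }
    *out = src->items[src->next++];
    return true;
}

typedef struct { int sum; } sum_sink_t;

static void sum_sink_put(const sample_t *s, void *ctx)
{
    ((sum_sink_t *)ctx)->sum += s->value;
}
```

Feeding the values 3, -1, and 5 through `drop_negative` and then `scale_by_two` delivers two samples whose sum is 16. The application describes what flows where; the loop that does the work is written once.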

Convert data-centric language into code and libraries

Next, we build or buy the tools to convert our data-centric language into source code and data service libraries.

If we've defined our own language in terms of functions, function libraries, and data structures, a standard development environment or simple compiler tool set for our target system is enough to get started. Integrated development environments (IDEs), however, can significantly shorten data-centric application development cycles because modern IDEs often come with simulators, prototyping tools, and debugging aids that can streamline the overall development process, especially when stable hardware is not readily available.

If we choose the buy versus build approach and use XML as our data-management language, the language we've defined for data operators and functions can use XSL to transform itself into C, C++, or Java code. This approach is, in effect, an application-specific compiler for our data-centric language. So once defined, assuming that our next design will be similar to our current one, the tools we build will be useful on future projects.

Note that whether you build or buy, both choices require a fair amount of effort. Plus, it's unclear whether the resulting code will be compact enough or have the performance required for resource-constrained devices. Despite that uncertainty, these methods can yield significant advantages over procedural tools and techniques by providing a data-centric method and a higher-level abstraction. Fortunately, commercially available products and developer community efforts exist that are more suitable for embedded systems development (see the section on commercial products).

Obtain a data-management service architecture

Third, we define and then build or buy the necessary system and data-management service libraries required by our application: a so-called data-management service architecture.

We've all had lots of experience organizing frequently used functions into service libraries that can be used for multiple purposes. Normally, however, these service libraries aren't designed and grouped according to a data-centric approach, such as by types of operations on different data streams. Instead these functions tend to be process or procedural in nature. For example, in the traditional procedural approach, we might see a library of functions that deal with opening and closing files, deleting files, converting Unicode to ASCII, and so forth. These are all step-wise operations as part of a conventional procedural method.

A data-centric service library would contain functions that operate on data, data structures, and data streams. For example, a data-centric service library would contain functions to filter an incoming data stream based on value, concatenate two incoming streams, append two streams, create an index on a table of a certain type, join two tables, search, and so forth.
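
As a rough illustration, three such services might look like this over simple in-memory tables: filter by value, append, and search. This is a sketch under assumed names and types (`row_t`, `rows_filter_min`, `rows_append`, `rows_find`), with caller-supplied output buffers so that nothing is allocated dynamically.

```c
#include <stddef.h>

typedef struct { int key; int value; } row_t;

/* Filter: copy rows whose value is at least `min` into `out`.
 * Returns the number of rows written, never more than out_cap. */
static size_t rows_filter_min(const row_t *in, size_t n, int min,
                              row_t *out, size_t out_cap)
{
    size_t written = 0;
    for (size_t i = 0; i < n && written < out_cap; i++) {
        if (in[i].value >= min) {
            out[written++] = in[i];
        }
    }
    return written;
}

/* Append: copy all of `a` followed by all of `b` into `out`.
 * Returns the number of rows written; output is truncated at out_cap. */
static size_t rows_append(const row_t *a, size_t na,
                          const row_t *b, size_t nb,
                          row_t *out, size_t out_cap)
{
    size_t written = 0;
    for (size_t i = 0; i < na && written < out_cap; i++) {
        out[written++] = a[i];
    }
    for (size_t i = 0; i < nb && written < out_cap; i++) {
        out[written++] = b[i];
    }
    return written;
}

/* Search: linear scan by key. Returns the first match or NULL. */
static const row_t *rows_find(const row_t *rows, size_t n, int key)
{
    for (size_t i = 0; i < n; i++) {
        if (rows[i].key == key) {
            return &rows[i];
        }
    }
    return NULL;
}
```

The point is the grouping rather than the algorithms: each function is named for what it does to the data, not for a step in some procedure.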

By reorienting how we think about common data-management service libraries, we can immensely simplify the task of designing and building the services from which our applications are built. When we think of our task in a data-centric way, the organization of the service libraries would likely involve the following services:

  • Storage-management services for I/O to and from various persistent storage devices
  • Data-stream services for I/O from dynamic data sources
  • Table-management services for tabular data used during filtering and aggregating
  • Memory-management services for creating and destroying tables and streams
  • Transaction services for check-point, recovery, and rollback after failures, power cycling, or other interruptions during data operations

By pursuing this level of organization, we can effectively abstract what our applications are doing several levels above the normal procedural method while still mapping each data operation into the most efficient means possible for a given CPU, operating system, and hardware design. In effect, we can program data-management code as if there were a database present in the system because the underlying data-management services handle the details.
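
The transaction services in the list are perhaps the least familiar to embedded developers, so here is the idea in miniature: keep a committed copy alongside the working copy, and restore from it when an update must be abandoned. This sketch lives entirely in RAM and uses invented names (`config_t`, `config_store_t`, `store_commit`, `store_rollback`). A real service would also have to make the committed copy survive power loss, which this example does not attempt.

```c
/* Hypothetical configuration record protected by the store. */
typedef struct { int setpoint; int gain; } config_t;

typedef struct {
    config_t live;       /* working copy, freely modified by the application */
    config_t committed;  /* last known-good copy (the check-point) */
} config_store_t;

/* Start with both copies equal to the initial configuration. */
static void store_init(config_store_t *s, config_t initial)
{
    s->live = initial;
    s->committed = initial;
}

/* Check-point: accept the working copy as the new known-good state. */
static void store_commit(config_store_t *s)
{
    s->committed = s->live;
}

/* Rollback: discard uncommitted changes to the working copy. */
static void store_rollback(config_store_t *s)
{
    s->live = s->committed;
}
```

An application would modify `live`, validate the result, and then call either `store_commit` or `store_rollback`, without knowing how or where the committed copy is kept.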

In practice, the development flow with a data-centric development framework might look like what we see in Figure 6. Our data model and data operations are expressed in a higher-level data-management language. A code generator then analyzes these statements and generates the implementation in C code. This code is then compiled with the rest of the application and linked with the appropriate data-management services libraries to form a single executable that can be loaded into the target system.
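
We can get a small taste of this flow without any external tooling. In the sketch below the C preprocessor stands in for the code generator, which is a substitution on my part and not how a commercial framework works: the data model is declared once as a field list (an "X-macro"), and two expansions generate both the struct and a metadata table from it. The record and its fields (`sensor_t`, `id`, `temperature`, `humidity`) are made up for the example.

```c
#include <stddef.h>
#include <string.h>

/* The "data model": one entry per field, given as (type, name). */
#define SENSOR_FIELDS(X) \
    X(int, id)           \
    X(int, temperature)  \
    X(int, humidity)

/* Expansion 1: generate the struct definition from the model. */
#define AS_MEMBER(type, name) type name;
typedef struct { SENSOR_FIELDS(AS_MEMBER) } sensor_t;

/* Expansion 2: generate a metadata table that generic code
 * (logging, serialization, lookup) can walk at run time. */
typedef struct {
    const char *name;
    size_t offset;
    size_t size;
} field_info_t;

#define AS_INFO(type, name) { #name, offsetof(sensor_t, name), sizeof(type) },
static const field_info_t sensor_fields[] = { SENSOR_FIELDS(AS_INFO) };

enum { SENSOR_FIELD_COUNT = sizeof sensor_fields / sizeof sensor_fields[0] };

/* Look up a field description by name; returns NULL if there is no such field. */
static const field_info_t *sensor_field(const char *name)
{
    for (size_t i = 0; i < SENSOR_FIELD_COUNT; i++) {
        if (strcmp(sensor_fields[i].name, name) == 0) {
            return &sensor_fields[i];
        }
    }
    return NULL;
}
```

Adding a field to `SENSOR_FIELDS` updates the struct and the metadata together, so the two can't drift apart. A dedicated generator offers the same guarantee with a far richer input language.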

Figure 6

Pay special attention to code portability. To reap the data- and platform-abstraction benefits of data-centric design, the run-time data-management services must be atomic and configurable. After all, the reality of our work is that processors, operating systems, and other system requirements change (ideally not too much during a single product development cycle). Regardless of the underlying platform, these services must handle the data operations required by any given application in the most efficient, consistent, and robust manner possible. These characteristics are essential for effective data-abstraction, optimum performance, and for maximizing data-management code portability. By providing a uniform data-management services layer to isolate applications from platform-specifics, this approach minimizes the dreaded porting problem associated with traditional procedural methods and tools. The bottom line is that as we develop our own data-management code and services, we need to look for the most chip- and operating system-independent implementation methods possible.
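
One common way to get that isolation in C is a table of function pointers that the application calls instead of touching the storage hardware directly. The sketch below shows a hypothetical `storage_ops_t` interface with a RAM-backed implementation; a flash or file-system backend would supply the same two functions. None of these names come from an actual product.

```c
#include <stddef.h>
#include <string.h>

/* Platform-neutral storage interface. Both functions return 0 on success, -1 on error. */
typedef struct {
    int (*write)(void *ctx, size_t offset, const void *data, size_t len);
    int (*read)(void *ctx, size_t offset, void *data, size_t len);
    void *ctx;  /* backend-specific state */
} storage_ops_t;

/* One possible backend: a plain RAM region. */
typedef struct { unsigned char *base; size_t size; } ram_region_t;

/* True when [offset, offset + len) lies inside the region, without overflowing. */
static int ram_in_bounds(const ram_region_t *r, size_t offset, size_t len)
{
    return offset <= r->size && len <= r->size - offset;
}

static int ram_write(void *ctx, size_t offset, const void *data, size_t len)
{
    ram_region_t *r = ctx;
    if (!ram_in_bounds(r, offset, len)) {
        return -1;
    }
    memcpy(r->base + offset, data, len);
    return 0;
}

static int ram_read(void *ctx, size_t offset, void *data, size_t len)
{
    ram_region_t *r = ctx;
    if (!ram_in_bounds(r, offset, len)) {
        return -1;
    }
    memcpy(data, r->base + offset, len);
    return 0;
}

/* Portable application code: it sees only storage_ops_t, never the backend. */
static int save_counter(const storage_ops_t *st, unsigned value)
{
    return st->write(st->ctx, 0, &value, sizeof value);
}

static int load_counter(const storage_ops_t *st, unsigned *value)
{
    return st->read(st->ctx, 0, value, sizeof *value);
}
```

Porting `save_counter` and `load_counter` to new hardware then means writing a new backend, not rewriting the application.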

So what we've just done is define the four key criteria for an effective data-centric development framework. This method not only gives us a new perspective on traditional development methods, but also enables us to organize our work into a high-level framework or abstraction layer.

With these elements in place, we can solve data-management problems more efficiently with a domain-specific language while letting the remaining elements of the framework generate the code and provide the low-level implementation services necessary for optimum execution. We can then focus on application functionality at a higher level while enjoying the efficiency benefits that a services architecture provides for optimizing this functionality on any given device platform.

Commercial products
Let's take a quick look at commercially available or soon-to-be-available products that can help us implement data-centric design and development practices. This help is welcome, since we've seen that working a little smarter yields only modest improvements in our design methods and development time. We've also seen that embedded database technologies have, at best, limited applications. So here's a short summary of what some companies and developer communities are doing in the area of data-centric application development frameworks.

The latest release of Microsoft Visual Studio 2005 with SQL Server 2005 is both an IDE and a data-enabled development framework for Microsoft products. It fulfills most of the needs for a data-centric application framework, assuming of course that your device is a Microsoft-supported machine and capable of using SQL Server as the runtime environment. For most embedded system developers, this isn't practical.

The Eclipse Foundation recently approved the Eclipse Data Tools Platform (DTP) project. The DTP project goes beyond the Microsoft-supported systems and provides many of the attributes mention
