Search is the new hot feature for embedded applications. Driven by anincrease in device computing power and the plummeting price of bothmicro-disk and NAND Flash memory, devices are now starting to managemany megabytes or even gigabytes of data. Finding the right data at theright moment, possibly by cross-referencing different data stores, hasbecome quite a challenge.
Just as device developers have accepted the help of embeddedoperating systems vendors, so now they are starting to accept help fromembedded data management vendors. An embedded data manager offersmulti-user shared data access and provides a high level query languagein which to express search conditions.
This article discusses the features available from modern embeddeddata managers, and how to use them to create a user-friendly search forboth textual and spatial data.
Search is hot
The hot new feature to add to device applications is search. The needfor search is driven by both technical and social factors. On thetechnical side, the plummeting price of both Flash and micro-drivedisks means that even modest devices are being fitted with amounts ofstorage that were unthinkable only a couple of years ago.
Suddenly developers have to deal with 60-gigabyte devices, biggerthan some low-end laptops. On the social side, Google has conclusivelydemonstrated the value of web-based search, and users are now expectingGoogle-like features on their PDAs, cell phone and other devices.
Search seems simple in concept. How hard can it be to write somesearch algorithms for a new device? The answer is: very hard. Search isone of those technologies where seemingly simple requirements turn intoextremely complex code.
A clue comes from the enterprise world, where no one would everthink of writing their own data manager, even though corporate IT paysaround $12 billion per year for licenses for database managers such asOracle and IBM DB2. Why is corporate ITwilling to put up with thesehigh charges? Because it's hard to do data management right.
Until recently, device developers did not have the luxury ofselecting a COTS data manager for their application because theirdevices were small and lacking in power. Two factors have changed thissituation.
First is the inexorable drive of Moore's law thatexponentiallyincreases device power every year. Second is the emergence oflight-weight database managers designed for the needs of devicedevelopers.
A handful of such data managers have emerged in the last few years,with a run-time memory footprint in the order of a few hundredkilobytes. These data managers offer many of the features of theirenterprise brethren ” and some offer extensions designed specificallyfor the device market.
While there are some other data management solutions available, therelational model has pretty much triumphed over the rest, achieving amarket share of over 90 percent. For that reason, this article focuseson the relational database management system(RDBMS) .
A modern embedded RDBMS offers some compelling technology to thedeveloper. By solving some of the hard data management problems, anembedded RDBMS can help an engineering team get to market much fasterwith a data-rich application. Key among these high-value features isthe transaction.
A transaction is a set of data changes that must happen completelyor not at all. The iconic transaction is a debit from a checkingaccount and a credit to a savings account. There should be no case inwhich either the credit or debit happens alone, even if the power failsat an inconvenient time (and is thereany other time that the power fails? ).
Database transaction properties are described as having ACIDproperties: atomic, consistent,isolated and durable. Atomic means allor nothing. Consistent refers to the fact that a transaction may notviolate database integrity rules: if the checking account debit wouldresult in a negative balance, and negative balances are illegal, thenthe transaction will not take place.
Transactions are isolated so that other applications cannot get aninconsistent view of the data by seeing partial results, and they aredurable because they survive power fail and reboot.
In an RDBMS, data is stored in tables, made up of rows and columns (seeFigure 1, below ). Each row defines a record. The columns specifythedata elements in the record, with each column having a specificdatatype, such as integer or variable length character string.
|Figure1: Table structure in a relational DBMS|
Typically one column of each table is designated as the primary key.Primary key values are unique within a given table, so they provide away to uniquely identify every row. Rows can be searched using othercolumn data as well. Such searches may return more than one row.
Tables are interlinked using foreign keys. Each foreign key in agiven table takes the value of a primary key from another table. Forexample, a band may issue several CDs, each with several tracks.Foreign keys allow the band's data to be stored only once but linked toeach CD. The data for each CD is also stored only once and linked by aforeign key to each of the constituent tracks.
In a typical file system data structure, data elements are retrieved bylocation and offset. This works when the application is looking up arecord that it knows it wants. It is much less suitable for findingunknown data. For that, you need an index.
In an RDBMS, an index is usually built out of some variant ofself-balancingB-Trees (see Figure 2,below ). A B-Tree is a data structure that supports binarycontent search over a large data set.
|Figure2: B-Tree Indexing|
In an RDBMS, the developer defines suitable indexes to support thecontent searches expected in the application. From then on, the RDBMSmaintains the index data structure without attention from theapplication or the developer.Content is retrieved from an RDBMS using Structured Query Language(SQL).
The SQL language has three major benefits for developers. Firstly,it abstracts the underlying data structures, so that applications needhave no knowledge of how and where the data is stored. Secondly, itrefers to data by content only, not by pointers. Thirdly, it is aset-oriented language, so any query can return multiple matchingrecords.
An example query used to look up phone numbers from an address bookmight look like this:
SELECTad_name, ad_type, ad_number
WHERE ad_name LIKE “John%”
Performance and Recovery
Indexes provide fast access to data, but they are rather expensive toupdate. The base data structures that hold the data content are alsointernally quite complex and expensive to update.
For it to be able to offer fast content-based search and also areasonable update speed, an RDBMS can't afford to update the base dataand indexes in permanent storage every time a change takes place. Onthe other hand, it must never lose a completed transaction. An RDBMSuses a clever trick to resolve this dilemma.
An RDBMS maintains a memory cache of recently used data and indexentries. When an application makes data changes, they are applied tothe data structures in memory. However, these structures will be loston power fail, so it must make a durable copy of the changes.
It does this by writing a log record of the transaction to storagebefore it declares the transaction complete. The transaction log is anextremely dense encoding of the transaction, which is cheap to write tostorage.
When the RDBMS finds itself starting up, it refers to thistransaction log to find out what changes were committed by theapplication but never written to storage. It then redoes thosetransactions, applying the changes to data and indexes as required.
At the same time, it looks for changes that were applied to storageas a part of transactions that never completed, and it undoes thosechanges. The result is that, on startup, the RDBMS guarantees thepresence of every transaction that was completed and the absence of allfragments of incomplete transactions, returning the database to aconsistent state.
Kids, Don't Try This at Home
Database management is an arcane and sophisticated branch of computingscience. Modern data managers represent the fruit of tens or evenhundreds of person-years and the best thinking of the world's bestdatabase developers. Shared, recoverable, transactional data access isa field full of subtle traps for the less knowledgeable and is bestleft to the specialists.
Besides, when a team is trying to get a new embedded application tomarket, spending time on infrastructure functionality that can bebought and layered in does not make commercial sense. It is much betterfor scarce engineering talent to be directed to creating visiblecustomer value. This is the same argument that has led most embeddeddevelopers to take advantage of embedded operating systems rather thanwriting their own: it's worth trading off a little RAM and a littleperformance in order to get to market faster.
Embedding an RDBMS
Before any application programming begins, you will need to design thedatabase. This is a task outside the scope of this article, but thereare many books that deal with relational database design.
Any embedded application team that does not have hands-on experienceof designing a relational database would do well to take advantage ofdesign services supplied by their database vendor or an experiencedthird party. Poor database design can result in performance andintegrity problems that are hard to correct, so it is good to get itright early.
To embed an RDBMS into an application, you must understand the RDBMSprocessing cycle. The cycle is bookmarked by connecting to anddisconnecting from the database. In between is where all the action is.As is often the case with the interface to a complex software product,the technique is a mix of populating data structures and making APIcalls.
The beginning step is to allocate a statement handle by making theappropriate API call. The statement handlepoints to a data structurethat is used to set up and control the SQL statement. A statementhandle survives until released, so it can be used for many queries inturn.
In a typical application, SQL queries are parameterized so that aunique statement does not have to be created for each query. Our phonenumber lookup query above could be prepared as a parameterizedstatement that would look something like this:
SELECTad_name, ad_type, ad_number
WHERE ad_name LIKE ?
Once the parameterized statement is prepared (loaded into the statement handle datastructure ) it can be re-used quickly, by just binding newparameters. In this case, it can look up the phone number for any nameregular expression that the user submits.
The SQL execution step sends the SQL query to the RDBMS where it isoptimized and executed by the database manager.
Because SQL is a set language, the application must assume that zeroto many rows are returned by most queries. The usual way to deal withthis is via a cursor. A cursor is a pointer that increments through theresult set, row by row.
The end of set is indicated by an error return or the setting of astatus value. For each row returned, the application can either copythe column values into program variables or, more efficiently, setpointers to directly access the column values in the database interfacebuffer.
An application can consume all the returned rows or abandon unreadrows by closing the cursor. At that point, the RDBMS releases any locksit has held on the data and the SQL statement is available again toreceive new parameters.
Standard SQL shows its origins in the world of business data processingin that SQL is specialized to deal with alphanumeric data. Each columnin a row contains a single scalar value. This is a good solution for ahuge variety of problems but a very poor solution for some importantproblems.
An example of vector data poorly served by B-Tree indexing isspatial data. A spatial point is a two- or three-dimensional objectwhich cannot be efficiently searched in this way.
Asking a simple question like “Is this point in that circle?” turnsout to be very difficult to code in SQL, and the resulting databaseexecution is unacceptably slow. Special indexing schemes built fromR-Treesor Quad-Trees provideefficient spatial search are provided asextensions for some enterprise RDBMS. This technology is starting toappear in the most advanced embedded RDBMS as well.
Spatial search is important for two quite different reasons. Theobvious one is that devices are increasingly becoming aware of theirlocation. Applications that take advantage of this fact can integratespatial data with other data held on the device. This can empower fieldworkers to find the next nearby job without relying on centraldispatching, help mobile phone users to find members of theircommunity, and deliver location-based services to users of mobiledevices.
The other value of spatial search is in the realm of media. Anexplosion of media delivered over the Internet, by cable, or fromsatellite is deluging consumers. Content providers who used to competeover the breadth of their offerings are starting to realize that ifconsumers can't find content, then its value plummets.
If the interface to 1,000 channels of TV is a standard programguide, much of the content will never be seen. Vendors of devices suchas set-top boxes and MP3 music players are starting to explore visualways to navigate through the universe of content, where the basicquestion is “Find me something like this,” and is internallyimplemented as a proximity search in multidimensional space, allowingusers to wander through media space, discovering items of interest.
When to Use an Embedded RDBMS
Managing data on many devices has stopped being a simple problem.Increasing data volume and increasing program interaction make thedevice a dangerous place for data integrity. Sure, you can code tosolve these problems, but does that make commercial sense?
An embedded RDBMS offers sophisticated and flexible data managementright out of the box. The more advanced RDBMS provide search techniquesoptimized to device applications. Using an embedded RDBMS canaccelerate time to market and improve data integrity.
Consider using an embedded RDBMS if:(1) The application manages alot of data; (2) Theapplication needs to integrate data from several sources; (3) Several applications share thesame data; (4) The user issearching data by content; and, (5) The application must search multi-dimensional data
Malcolm Colton is vice president ofsales and marketing at HitachiEmbedded Business Group and can be reached at
Embedded Database resources onEmbedded.com
1) EnsuringDatabase Quality
2) Designingdata-centric software
3) Providingreal-time embedded to enterprise connectivity with DDS and DBMS
4) XML,SQL, and C
5) Buildinga effective real-time distributed publish-subscribe framework
6) Tacklingmemory allocation in multicore andmultithreaded applications
7) Designingdata-centric software
8) Reducecomplexity of network systems development with data-centric software
9) Telematicsoftware needs data-centric development approaches