In his book Moneyball, Michael Lewis chronicled the use of Sabermetrics to guide strategy in managing a baseball team. Though focused on analytical aspects of assembling a baseball team, the basic idea was the use of historical information (advanced statistics) to influence the current strategy (an at-bat, a defensive situation, or lineup decisions).
The book’s focus was on the unexpected decisions that came out of the statistical analysis, often going against the ‘gut feel’ of the long time baseball experts. But it also showed that the successful use of Sabermetrics required (1) knowing what happened in the past, (2) recognizing the current situation, and (3) applying the historical knowledge to drive the current strategy.
This statistics trend can be found in many other domains. For example, financial trading applications are based on models that are calibrated using historical data. Much like in the baseball world, the success of the model depends on how quickly the opportunity can be recognized based on current market activity. If the real-time market data is delayed for any reason, the opportunity is diminished, or worse, lost.
At a recent Big Data conference at the (Stanford GSB), the panelists discussed the growing trend of Big Data in all industries, including industrial, automotive, consumer, medical and not just baseball or finance. Or, more accurately, that Big Data is already the next big thing and businesses would be wise to analyze the tremendous amount of information generated within their systems.
However, what’s overlooked is how to actually collect this tremendous data; or, more precisely, how to move the data from where it is created to all of the different places it needs to be.
This ability to correlate real-time and historical data to achieve situational awareness and transform raw data into information is fast becoming a necessary capability of many large-scale distributed systems.
The rise of Big Data has shown that traditional RDBMS databases cannot keep up with the data rates present in these large distributed systems, and are inefficient at running the kinds of queries needed to retrieve important information. The recent explosion of virtualization and cloud based data management systems allows us to approach these problems in an innovative way.
The rest of this article outlines how to achieve situational awareness by connecting a distributed network of connected devices and systems to cloud-based data management tools, using so-called NoSQL database methods. Examples from several domains will be used to show how real-time and historical data can be combined to give the analyst a complete picture of emerging situations, as well as post-event analysis.
Persisting Real-Time Data
Whenever persistent data management is added to a real-time distributed system the primary concern is maintaining the critical performance characteristics of operational technology (OT). the physical equipment-oriented technology implemented and supported by engineering organizations, typical of many device and embedded systems designs.
In any large enterprise, OT is usually done independent from the Information Technology (IT) groups who are involved in data management, including that generated by the embedded devices on the production line or within deployed systems, with little real time interaction between the two.
In such distributed systems, the performance of persistent storage lags behind that of volatile storage, although there are signs that the two may be converging (i.e. Solid State Disks).
Real-time data management consists of several simultaneous activities:
1 . Storage (write)
2. Querying, Correlation & Retrieval
What distinguishes OT (real-time) data management from traditional IT domains is that all these activities happen simultaneously: data is produced, stored, correlated, retrieved, and redistributed with real-time requirements.
In real-time systems, data is produced at various rates and distributed with different priorities. Therefore, it is desirable that the data management system can also prioritize data and be able to scale to handle arbitrary storage loads.
A good example of data produced in a typical real-time distributed system is information from sensors. Sensor data is generally produced at consistent and well-known rates. While usually published with low priority, the same data can quickly become the highest priority — the urgency of the data is dynamic.
Consider for example a temperature sensor in a car engine. Most of the time the temperature is within the normal operating range and the information can be considered low priority. But when the temperature reaches a specific threshold, it is important to alert the system immediately.
Storage performance fits in two distinct categories — complete and partial. Complete storage is achieved if the data management system can store data at the peak throughput rate of your distributed system. For partial storage the system designer is left with two choices:
1. Slow down the data producers
2. Selectively discard data
Note that simply buffering is not sufficient as any buffer is finite. Buffering simply postpones the inevitable, and is undesirable in real-time systems.
Due to the distributed implementation, NoSQL database write performance is affected by the replication strategy as well as the underlying hardware. It is important for the system designer to understand the database implementation and pick one that is suitable for the application. As an example, one of the main strengths of Apache's Cassandra is the good write performance that stems from a very efficient replication strategy [Perham, 2010].
An archiving service provides the best data storage by subscribing to real-time data with the appropriate Quality of Service (QoS). In a basic implementation, the archiving service uses the NoSQL database API to issue a write to any node in the cloud. From there, the NoSQL database implementation persists and replicates the data. Based on the consistency configuration, the database will notify the archiving service when the desired consistency has been achieved.
A more advanced archiving service implementation can load balance writes to different segments of the cloud to achieve optimal write throughput. The archiving service can detect when it cannot provide complete data storage, and scale cloud resources accordingly.
The ability to subscribe without disrupting the real-time system is a fundamental characteristic of the archiving service. OT systems are extremely time sensitive; any delays in the delivery of data can result in system failure. Though subscribing to data may seem trivial and non-intrusive, traditional corporate Information Technology (IT) systems will often sacrifice latency to ensure receipt of all data.
This balancing act is a common challenge when integrating operational technologies with storage and other common IT systems. To ensure non-intrusive subscriptions, data distribution must enable passive observations without slowing producers or any other data transmission.
A fundamental property of NoSQL databases is that they are schema-free, making them particularly suitable for OT systems. Large real-time distributed systems have data schemas that are naturally complex and dynamic.
The industry is largely moving away from the concept of a fixed, single data model (e.g. CORBA) for several reasons — integration and forward compatibility being two of the most important. In modern OT systems, data schemas must be dynamically discoverable at run-time and must be extensible and/or mutable.
Not only must these schemas be captured by the data management system, they may also be examined for analysis. In other words, meta-data and data may be equally important for situational awareness.
Data Correlation, Querying & Retrieval
Achieving Situational Awareness (SA) requires correlating real-time and historical data. In technical terms, this means running pre-compiled and dynamic queries continuously on the data streams as they are being written to the NoSQL database.
All SA queries are triggered on the occurrence of some real-time event, such as the price change of a security. Once the real-time event occurs the query needs to correlate with historical data to decide what, if any, action to take.
NoSQL databases have already conquered the world of on-demand content distribution. This is perhaps the most telling by the fact that Netflix uses Apache Cassandra for its streaming service [Izrailevsky, 2011].
The result of queries fall into two categories: alerts and content retrieval. Alerts distribution is trivial in most cases: a single high priority message. In the case of data retrieval, distribution becomes an important factor. As shown in Figure 1 below, the data resulting from a query will need to be retrieved, ordered, and distributed to the consumer in a timely fashion.
Use Case: Power Generation
There are hundreds of windmill farms being constructed around the world. Imagine a windmill farm constructed as a hierarchical distributed system.
At the lowest level of the hierarchy each windmill is in itself a distributed system: it has a vast array of sensors that produce information about current power generation as well as structural and environmental data that is consumed by the windmill to operate safely and efficiently.
This data is also automatically shared with the farm control center. The control center is responsible for maintaining contact with other farms and is linked to two important external systems: the power grid and meteorological systems.
As an example consider an instance where a windmill detects sudden and unexpected strong gusts of wind. The windmill determines that the wind patterns are irregular based on precise measurements over the past week and more general measurements over the past few years.
Potentially harmful, the windmill moves to a fail-safe mode. An alert is sent to a command center, which can then use continuous real-time controls to carefully tune the performance of the other turbines based on each windmill’s current state. The command center can also alert farms that are downwind so that they can adjust performance based.
Another scenario involves real-time pricing updates from energy trading. The power grid and power exchanges provide information on load, demand and pricing of electricity.
If demand and prices are low relative to current output, the farm can automatically redirect power generated to storage or cease production. If demand and prices are high relative to current production, the farm can move to peak productivity and sell stored power. Just like securities trading, there is the potential for designing algorithms to detect beneficial patterns in power generation and distribution.
Meteorological systems can also be based on algorithmic statistical analysis, and can benefit from two-way communication with windmill farms. On one hand farms become important real-time sensor stations for the meteorological data; on the other, the farms are dependent on weather forecasts from the meteorological systems to carefully tune for optimal output.
The algorithms used in both power generation and meteorological systems need timely access to real-time events as well as historical trends. If data is delayed for any reason, the consequences can range from inefficient output that results in lost revenue, to catastrophic failure and loss of infrastructure (and possibly human lives).
Predictive maintenance and asset management are examples where advanced OT/IT integration can directly affect revenue. Compared to preventive maintenance, this approach reduces costs because work is performed only when needed. Predictive maintenance is predicated on continuous, real-time monitoring of in-service equipment to predict when maintenance is required.
But it is not enough to just monitor the state of equipment. Just like with the windmill example, the real-time data on its own does not provide enough information to determine whether the equipment is operating within normal bounds. A sensor indicating high temperature may be caused by equipment malfunction, or it may be caused by an increase in factory output.
To be certain, the monitoring data must be analyzed in the context of historical information to determine whether any action needs to be taken.
The same techniques and algorithms used to solve the maintenance problem can also be used to provide long-term business intelligence. Besides preventing unexpected equipment failure, spotting long-term trends in equipment usage increases availability and improves equipment lifetime.
The Needle In The Haystack
Many companies have already demonstrated the value of using Big Data technologies to sift through the vast amounts of information generated by IT systems. The technology has proven adept at finding the proverbial needle in a haystack. At the same time, organizations are just beginning to realize the business value of data produced by OT systems.
Merging OT and IT data is the next logical step, but the integration brings forth many technical challenges. OT systems generate even more data to analyze (more “hay” to hide the needle), while IT systems must be integrated in such a way as to not impact the time sensitivity of OT real-time data flows.
By selecting the right mix of technologies to solve the problems of data management, the power of both OT and IT can be unleashed to capture opportunity as it happens; that is, to find the needle before it even goes into the haystack.
Sumeet Shendrikar is principal applications engineer, services at RTI. He serves as a technology consultant building distributed real-time systems. His technical expertise spans large-scale real-time distributed systems, high performance computing, networking, and embedded systems. Sumeet holds an MS in Computer Science from Stanford University, and a BS EECS from the University of California at Berkeley. Prior to RTI, Sumeet worked at Trilogy Software as a consultant for Global 1000 companies.
1 – Apache Cassandra. (n.d.). Retrieved October 23, 2011, from Apache Cassandra :
2 – Izrailevsky, Y. (2011, January 28). The Netflix “Tech” Blog. Retrieved October 23, 2011, from NoSQL at Netflix:
3 – Lemire, D. (2010, June 28). Daniel Lemire's Blog. Retrieved October 23, 2011, from NoSQL or NoJoin:
4 – NoSQL Database. (2011, October 23). Retrieved October 23, 2011, from NoSQL Database:
5 – Perham, M. (2010, March 13). Cassandra Internals – Writing. Retrieved October 23, 2011, from On Ruby, software and the Internet.