The challenges and opportunities for machine learning in the IoT
Figure. Shape identification in autonomous vehicles. (Source: Figure Eight)
According to Gartner, there will be more than 20 billion Internet-connected devices in use by 2020. These devices will generate more than 500 zettabytes of data per year, and with more technological advances to come, this number is expected to keep increasing dramatically. For the more than 70% of organizations that are already investing in IoT, all this data represents a competitive advantage and a tremendous opportunity to obtain valuable information and insights for the development of innovative AI applications.
As it turns out, IoT data is just as exciting to data scientists and machine learning engineers as it is to business leaders. From healthcare and agriculture to education and transportation, the domains where IoT is booming are about as diverse as its applications, which range from the discovery of new information to decision control. IoT data science opens the door to the creation of exciting new data products. However, IoT data science comes with a number of specificities, which we examine in this article.
As we have seen, IoT constitutes one of the greatest sources of new data. IoT data might actually be seen as the epitome of Big Data. If we look at the data generated by a single device, we are usually dealing with fairly small amounts of data (even though that is also changing). However, with countless distributed devices generating continuous streams, IoT as a whole produces a prodigious volume of data. Its variety is just as impressive: IoT devices gather all types of information, ranging from audio to sensor data, and are largely responsible for the explosion of diversity in data formats. (It is worth noting here that Big Data coincides with the emergence of ‘rich’ data formats, which are much ‘heavier’ than textual formats.) And because the devices are close to the user and continuously collect information, the generated data is typically high-velocity; this makes IoT data particularly well suited for time-series modeling.
But there are also a few unique aspects of IoT data that make its exploitation singularly challenging. It is often noisy because of errors occurring during both acquisition and transmission. This makes the process of structuring, cleaning and validating the data a critical step in the development of machine learning algorithms. By nature, IoT data is also highly variable, both because of the huge inconsistency in the data flow across the various data collection components and because of the existence of temporal patterns. Not only that, but the value of the data itself depends heavily on the underlying mechanisms, the frequency at which the data is captured and the way it is processed. And even when the data coming from a specific device is considered trustworthy, we still need to account for the fact that different devices may behave differently even under similar conditions. Hence, capturing all possible scenarios when gathering training data is infeasible in practice.
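To make the cleaning step concrete, here is a minimal sketch of what it might look like for a single stream of sensor readings. It rests on assumptions that are ours, not the article's: lost packets are represented as `None`, acquisition errors show up as isolated spikes, and a small rolling median is a good enough reference for "normal". The function name `clean_readings` and the parameters `window` and `max_dev` are hypothetical; real pipelines would tune them per sensor.

```python
from statistics import median
from typing import List, Optional


def clean_readings(readings: List[Optional[float]],
                   window: int = 3,
                   max_dev: float = 5.0) -> List[float]:
    """Drop missing samples, then replace isolated spikes with the local median."""
    # Assumption: a transmission error appears as a missing sample (None).
    present = [r for r in readings if r is not None]
    half = window // 2
    cleaned = []
    for i, value in enumerate(present):
        # Neighbourhood of up to `window` samples centred on position i.
        lo, hi = max(0, i - half), min(len(present), i + half + 1)
        local = median(present[lo:hi])
        # Assumption: an acquisition error is a value far from the local median.
        cleaned.append(local if abs(value - local) > max_dev else value)
    return cleaned


# Hypothetical temperature readings with two lost packets and one spike.
raw = [20.1, 20.3, None, 95.0, 20.2, 20.4, None, 20.5]
print(clean_readings(raw))
```

With these illustrative readings, the two missing samples are dropped and the 95.0 spike is replaced by a neighbouring value, while the plausible readings pass through untouched. A median is used rather than a mean because a single extreme value barely moves it.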
One of the most remarkable attributes of IoT data, though, is its rawness: because IoT devices collect data through various complex sensors, the data they generate typically arrives unprocessed. This means major data processing is necessary before business value can be extracted and powerful AI applications can be built. In fact, separating the meaningful signal from the noise and transforming these unstructured data flows into useful, structured data is the most important, and the most delicate, step when building a smart IoT application.
A large number of IoT applications call for the use of supervised machine learning, a class of machine learning algorithms that require data to be labeled before a model can be trained. Because manually labeling large datasets is a time-consuming, error-prone and potentially expensive task, machine learning professionals often start by turning to labeled open-source datasets whenever those exist, or start with smaller amounts of data to label. However, the difficulty with IoT data comes from its specificity: because this data is often one of a kind, there is no guarantee that an existing open-source dataset is readily available, and it then becomes necessary for the engineers to label their own data. This is where high-quality, adaptable crowdsourced labeling platforms such as Figure Eight can help.
However, because of the variability of IoT data, labeling a small random sample might be insufficient. These are the perfect circumstances for a semi-supervised learning strategy, which leverages both labeled and unlabeled data in the training of the algorithm. In particular, active learning, where the algorithm is allowed to query the crowdworker for the labels of a subset of intelligently selected training instances as it is being trained, is a very well-suited approach: it permits machine learning scientists to obtain similar algorithm accuracy for a fraction of the labeling cost.
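The active learning loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a production recipe: the data is one-dimensional, the model is a nearest-centroid classifier, the selection rule is uncertainty sampling (query the point the model is least sure about), and the `oracle` callable is a hypothetical stand-in for a crowdworker. It does not represent any labeling platform's API.

```python
from typing import Callable, Dict, List, Tuple


def centroid(values: List[float]) -> float:
    return sum(values) / len(values)


def margin(x: float, c0: float, c1: float) -> float:
    """Gap between the distances to the two class centroids; small = uncertain."""
    return abs(abs(x - c0) - abs(x - c1))


def active_learning(pool: List[float],
                    seed: Dict[float, int],
                    oracle: Callable[[float], int],
                    budget: int) -> Tuple[Dict[float, int], List[float]]:
    """Pool-based active learning with uncertainty sampling.

    Assumption: `seed` holds at least one example of class 0 and one of class 1.
    """
    labeled = dict(seed)
    unlabeled = [x for x in pool if x not in labeled]
    queries = []
    for _ in range(budget):
        if not unlabeled:
            break
        # "Train" the model: one centroid per class from the labels so far.
        c0 = centroid([x for x, y in labeled.items() if y == 0])
        c1 = centroid([x for x, y in labeled.items() if y == 1])
        # Ask for the label of the point with the smallest margin.
        query = min(unlabeled, key=lambda x: margin(x, c0, c1))
        labeled[query] = oracle(query)  # the (simulated) crowdworker answers
        unlabeled.remove(query)
        queries.append(query)
    return labeled, queries


# Hypothetical setup: the true boundary sits just above 6.0.
pool = [float(v) for v in range(11)]
labeled, queries = active_learning(
    pool, seed={0.0: 0, 10.0: 1}, oracle=lambda x: int(x > 6), budget=3)
print(queries)
```

In this toy run, the requested labels concentrate around the true decision boundary instead of being spread uniformly over the pool, which is the intuition behind the labeling savings. A real system would swap in a proper model and a batch of queries per round.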
One very interesting aspect of the development of IoT when it comes to machine learning is the emergence of crowd-sensing. Crowd-sensing takes two forms: voluntary, when users deliberately contribute information, and opportunistic, when data is collected automatically without explicit user intervention. In this way, IoT data can contribute not only to the development or improvement of IoT applications, but also serve as input for other, non-IoT, applications.
IoT actually allows the collection of unique datasets in a way that has never been achieved before. Because the data generated by each device is usually at human scale, it becomes feasible for a user to label or validate it. It also becomes possible to gather data closest to where the users are: this is what Google does when it asks users to take a picture of the restaurant they are currently dining at, or to answer a few questions regarding the amenities. This is the first time that organizations can collect human-generated data at Big Data scale.