Figure. Shape identification in autonomous vehicles. (Source: Figure Eight)
According to Gartner, more than 20 billion Internet-connected devices will be in use by 2020. These devices will generate more than 500 zettabytes of data per year, and with more technological advances to come, this number is expected to keep increasing dramatically. For the more than 70% of organizations that are already investing in IoT, all this data naturally represents a unique competitive advantage and a tremendous opportunity to obtain valuable information and insights for developing innovative AI applications.
And as it turns out, IoT data is just as exciting to data scientists and machine learning engineers as it is to business leaders. From healthcare and agriculture to education and transportation, the domains where IoT is booming are about as diverse as its applications, which range from the discovery of new information to decision control. IoT data science opens the door to the creation of exciting new data products. However, IoT data science comes with a number of specificities, which we will examine in this article.
As we have seen, IoT constitutes one of the greatest sources of new data. IoT data might actually be seen as the epitome of Big Data. If we look at the data generated by a single device, we are more often than not dealing with fairly small amounts of data (even though that is also currently changing). However, with countless distributed devices generating continuous streams of data, IoT as a whole generates a prodigious volume of data. Its variety is just as impressive: IoT devices gather all types of information, ranging from audio to sensor data, and are largely responsible for the explosion of diversity in data formats. (It is worth noting here that Big Data coincides with the emergence of ‘rich’ data formats, which are much ‘heavier’ than textual formats.) And because the devices are close to the user and continuously collect information, the generated data is typically high-velocity; this makes IoT data particularly well suited for time-series modeling.
But there are also a few unique aspects to IoT data that make its exploitation singularly challenging. It is often noisy because of errors occurring during both acquisition and transmission. This makes the process of structuring, cleaning and validating the data a critical step in the development of machine learning algorithms. By nature, IoT data is also highly variable, both because of the huge inconsistency in the data flow across the various data collection components and because of the existence of temporal patterns. Not only that, but the value of the data itself depends heavily on the underlying mechanisms, the frequency at which the data is captured and the way it is processed. And even when the data coming from a specific device is considered trustworthy, we still need to account for the fact that different devices may behave differently even under similar conditions. Hence, capturing all possible scenarios when gathering training data is infeasible in practice.
One of the most remarkable attributes of IoT data, though, resides in its crudeness: because IoT devices collect data through various complex sensors, the data they generate is typically very raw. This means major data processing is necessary before business value can be extracted and powerful AI applications can be built. In fact, separating the meaningful signal from the noise and transforming these unstructured data flows into useful, structured data is the most critical, yet most perilous, step when building a smart IoT application.
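To make the validation and noise-reduction step concrete, here is a minimal sketch in Python. It assumes a hypothetical stream of temperature readings; the function name `clean_readings`, the valid range and the window size are illustrative choices, not values taken from any particular device. It drops missing or physically implausible samples, then applies a rolling median to suppress isolated spikes.

```python
from statistics import median

def clean_readings(readings, valid_range=(-40.0, 85.0), window=3):
    """Drop missing or out-of-range samples, then smooth with a rolling median.

    valid_range and window are assumed values for a hypothetical temperature sensor.
    """
    low, high = valid_range
    # Step 1: validation. Discard missing values and implausible measurements.
    valid = [r for r in readings if r is not None and low <= r <= high]
    # Step 2: noise reduction. A centered rolling median removes isolated spikes.
    half = window // 2
    smoothed = []
    for i in range(len(valid)):
        neighborhood = valid[max(0, i - half): i + half + 1]
        smoothed.append(median(neighborhood))
    return smoothed

# Hypothetical raw stream: one dropped sample, one transmission error, one spike.
raw = [21.0, 21.2, None, 21.1, 999.0, 21.3, 35.0, 21.4, 21.5]
cleaned = clean_readings(raw)
print(cleaned)
```

In this toy stream, the range check rejects the 999.0 reading, while the 35.0 spike, which is plausible in isolation, is removed by the median filter. A real pipeline would likely need sensor-specific rules and should keep the raw data for auditing.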
A large number of IoT applications call for the use of supervised machine learning, a class of machine learning algorithms that require data to be labeled before a model can be trained. Because manually labeling large datasets is a time-consuming, error-prone and potentially expensive task, machine learning professionals often start by turning to labeled open-source datasets whenever those exist, or start with smaller amounts of data to label. However, the difficulty with IoT data comes from its specificity: because this data is often one of a kind, there is no guarantee that a suitable open-source dataset is readily available, and it then becomes necessary for the engineers to label their own data. This is where high-quality, adaptable crowdsourced labeling platforms such as Figure Eight can help.
However, because of the variability of IoT data, labeling a small random sample might be insufficient. These are the perfect circumstances for a semi-supervised learning strategy that leverages both labeled and unlabeled data when training the algorithm. In particular, active learning, where the algorithm is allowed to query the crowdworker for the labels of a subset of intelligently selected training instances as it is being trained, is a very well-suited approach that lets machine learning scientists obtain similar accuracy for a fraction of the labeling cost.
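The active learning loop described above can be sketched with a deliberately simple toy problem. Everything in this example is an assumption made for illustration: the data is one-dimensional, the model is a single threshold, and the `oracle` function is a noise-free stand-in for a crowdworker. The selection rule shown is uncertainty sampling, one common active learning strategy among several.

```python
def oracle(x):
    """Hypothetical stand-in for a human labeler; the true boundary is at 0.62."""
    return 1 if x >= 0.62 else 0

def fit_threshold(labeled):
    """Fit a 1-D threshold classifier: midpoint between the two classes."""
    max_negative = max(x for x, y in labeled if y == 0)
    min_positive = min(x for x, y in labeled if y == 1)
    return (max_negative + min_positive) / 2

def active_learning(pool, budget):
    pool = sorted(pool)
    # Seed with the two extremes, assumed here to belong to opposite classes.
    labeled = [(pool[0], oracle(pool[0])), (pool[-1], oracle(pool[-1]))]
    unlabeled = pool[1:-1]
    for _ in range(budget):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)
        # Uncertainty sampling: query the instance closest to the current
        # decision boundary, where the model is least certain.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
    return fit_threshold(labeled), len(labeled)

pool = [i / 1000 for i in range(1001)]   # 1,001 unlabeled instances
threshold, n_labels = active_learning(pool, budget=12)
print(f"estimated boundary {threshold:.4f} using {n_labels} labels")
```

Here the boundary is located using 14 labels out of 1,001 candidates. Real IoT data is noisy and high-dimensional, and real labelers make mistakes, so the savings in practice will be less clear-cut than in this sketch.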
One very interesting aspect of the development of IoT when it comes to machine learning is the emergence of crowd-sensing. Crowd-sensing exists in two forms: voluntary, when users voluntarily contribute information, and opportunistic, when data is collected automatically without explicit user intervention. This is one way that IoT data can not only contribute to the development or improvement of IoT applications, but also serve as input for other, non-IoT, applications.
IoT actually allows the collection of unique datasets in a way that has never been achieved before. Because the data generated by each device is usually at human scale, it becomes feasible for a user to label or validate it. It also becomes possible to gather data closest to where the users are: this is what Google does when it asks users to take a picture of the restaurant they are currently dining at, or to answer a few questions regarding its amenities. This is the first time that organizations can collect human-generated data at Big Data scale.
One of the main factors behind the impressive advances in artificial intelligence nowadays is the appearance of better technologies, such as GPUs, that enable faster data processing. Machine learning for IoT has given rise to an interesting conundrum: while the best models need to be trained with a lot of data, most IoT devices are still limited in storage space and processing power. For that reason, the ability to safely and efficiently transfer large amounts of data from devices to a server or to the cloud, and vice versa, is key to the development of AI applications. In the age of cloud computing, a natural solution is to export the data to the cloud where models are developed, and to export the models back onto the device once they are ready for use. This is particularly appealing since 94% of all generated data is expected to be processed in the cloud by 2021, which means it also becomes possible to capitalize on other sources of data, whether historical or originating from other IoT devices. However, storing complex models on a memory-constrained device can in itself be a challenge, as sophisticated models with large numbers of parameters, such as deep learning models, are often very large themselves. On the other hand, sending data from the device to a model in the cloud for the inference step can also be suboptimal, especially in cases where latency needs to be very low.
Another challenge comes from the fact that IoT devices might not be continuously connected to the cloud and therefore might require some local reference data for offline processing, as well as the capability to function in standalone mode. This is where an edge-computing architecture becomes interesting, as it enables data to be initially processed at the level of the edge devices. This approach is particularly attractive when enhanced security is desired; it is also advantageous because such edge devices are capable of filtering data, reducing noise and improving data quality on the spot.
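As a rough illustration of on-device filtering combined with offline operation, the sketch below smooths incoming readings, keeps only those that changed meaningfully, and buffers them locally until a connection is available. The class name `EdgeNode`, its parameters (`alpha`, `deadband`, `max_buffer`) and the `send` callback are all hypothetical; this is one possible design under those assumptions, not a reference architecture.

```python
from collections import deque

class EdgeNode:
    """Hypothetical edge-side preprocessor: smooth, filter and buffer readings."""

    def __init__(self, alpha=0.5, deadband=0.5, max_buffer=100):
        self.alpha = alpha        # smoothing factor of the moving average
        self.deadband = deadband  # minimum change considered worth keeping
        # Bounded local storage: the oldest entries are dropped when full.
        self.buffer = deque(maxlen=max_buffer)
        self.smoothed = None
        self.last_kept = None

    def ingest(self, value):
        # An exponential moving average reduces sensor noise on the device.
        if self.smoothed is None:
            self.smoothed = value
        else:
            self.smoothed = self.alpha * value + (1 - self.alpha) * self.smoothed
        # Keep a value only if it moved enough since the last kept value.
        if self.last_kept is None or abs(self.smoothed - self.last_kept) >= self.deadband:
            self.buffer.append(self.smoothed)
            self.last_kept = self.smoothed

    def flush(self, connected, send):
        """Upload buffered values if the link is up; return how many were sent."""
        if not connected:
            return 0
        count = 0
        while self.buffer:
            send(self.buffer.popleft())
            count += 1
        return count

node = EdgeNode()
for reading in [20.0, 20.1, 19.9, 20.2, 25.0, 25.1, 24.9]:
    node.ingest(reading)

uploaded = []
sent_offline = node.flush(connected=False, send=uploaded.append)  # nothing sent
sent_online = node.flush(connected=True, send=uploaded.append)
print(sent_offline, sent_online, uploaded)
```

With these illustrative settings, the small fluctuations around 20 are filtered out on the device, so fewer values are uploaded than were measured. The trade-off is that the cloud no longer sees the raw signal, which may or may not be acceptable depending on the application.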
Unsurprisingly, AI engineers have been trying to get the best of both worlds and have eventually developed fog computing, which is a decentralized computing infrastructure. In this approach, data, compute power, storage and applications are distributed in the most logical way between the device and the cloud, ultimately leveraging their respective advantages by bringing them closer together.
We have seen that IoT devices are capable of generating Big Data, but in practice, it is not uncommon to use external, historical datasets to develop intelligent applications for IoT. This implies that it is possible to rely either on the data generated by an ensemble of multiple IoT devices (typically, the same type of device across multiple users), or on an entirely different source of data. The more specific and unique the application, the less likely it is that an existing dataset will be available for use; this would be the case, for example, when the device captures a very specific type of image that bears no similarity to open-source image datasets such as ImageNet. That being said, it is very common for IoT applications to be a clever blend of several existing off-the-shelf models. This makes transfer learning well adapted to the development of intelligent applications in the context of IoT.
The transfer learning paradigm consists of training a model on one dataset (generally a gold-standard one) and using it to make inferences on another dataset. Alternatively, it is possible to use the parameters computed while training this model as a starting point when training a model on the actual dataset, instead of initializing the model to random values. In this case, we refer to the original model as a “pre-trained” model, which we fine-tune on the data specific to the application. This approach can speed up the training phase, sometimes by several orders of magnitude. With the same paradigm, it is possible to train a general model that is then refined and optimized on a case-by-case basis, using the data directly generated by the end user.
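The fine-tuning idea can be illustrated with a toy example. The sketch below is only an assumption-laden stand-in for a real workflow: the model is a one-variable linear regression trained by gradient descent, the "source" and "target" datasets are synthetic, and the from-scratch baseline starts from zeros rather than random values so that the run is reproducible. It compares a few fine-tuning steps from pre-trained parameters against the same number of steps from scratch.

```python
def mse(params, data):
    """Mean squared error of the linear model y = w * x + b."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(params, data, steps, lr=0.1):
    """Plain gradient descent on the mean squared error."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic "gold standard" source data: plentiful, follows y = 2.0x + 1.0.
source = [(i / 20, 2.0 * (i / 20) + 1.0) for i in range(21)]
# Synthetic device-specific target data: scarce, follows a related relationship.
target = [(x, 2.2 * x + 0.8) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]

# Pre-train on the source data, then fine-tune briefly on the target data.
pretrained = train((0.0, 0.0), source, steps=2000)
fine_tuned = train(pretrained, target, steps=20)

# Baseline: the same short training budget, without the pre-trained start.
from_scratch = train((0.0, 0.0), target, steps=20)

fine_tuned_loss = mse(fine_tuned, target)
scratch_loss = mse(from_scratch, target)
print(f"fine-tuned: {fine_tuned_loss:.5f}  from scratch: {scratch_loss:.5f}")
```

With the same small training budget, the model that starts from the pre-trained parameters reaches a lower error on the target data in this setup. The benefit depends on how closely the source task resembles the target task; with unrelated data, a pre-trained starting point may not help at all.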
Security and Privacy Concerns
Because IoT technology extends the current Internet by connecting the physical and cyber worlds, the data it generates is highly versatile but is also a cause of major privacy concerns. In fact, about 50% of organizations involved in IoT consider security the biggest hindrance to IoT deployments. And considering that about two-thirds of IoT devices are in the consumer space, and how personal some of the shared data can be, it is easy to understand why. These concerns, coupled with the expected risks linked to frequent data transfers to the cloud, explain why users are demanding guarantees regarding the protection of their data.
Yet things get even more insidious when those IoT applications are powered by “federated” data (i.e., data generated by multiple users): not only can user data be leaked directly, it can also be exposed indirectly through side-channel attacks, when malicious agents reverse-engineer the output of a machine learning algorithm to infer private information. For these reasons, there is a clear necessity for data protection laws to evolve alongside the technology and the applications themselves.
IoT Machine Learning Is Human-Centered Machine Learning
Because IoT devices bring the Internet closer to its users and touch all aspects of human life, they often allow the collection of highly contextual and personal data. IoT data narrates the story of its users' lives and makes it more achievable than ever to understand a user’s needs, desires, history and preferences. This makes IoT data the perfect material for building personalized applications tailored to a user’s personality.
And because IoT touches our lives so intimately, both by collecting highly personal data and by offering highly personalized applications and services, IoT machine learning can truly be described as human-centered machine learning par excellence.
Jennifer Prendki is currently the VP of Machine Learning at Figure Eight, the Human-in-the-Loop AI category leader. She has spent most of her career creating a data-driven culture wherever she went, succeeding in sometimes highly skeptical environments. She is particularly skilled at building and scaling high-performance Machine Learning teams, and is known for enjoying a good challenge. Trained as a particle physicist, she likes to use her analytical mind not only when building complex models, but also as part of her leadership philosophy. She is pragmatic yet detail-oriented. Jennifer also takes great pleasure in addressing both technical and non-technical audiences at conferences and seminars, and is passionate about attracting more women to careers in STEM.