Developing training sets for the IoT

The future has never been so exciting. Thanks to tiny yet powerful hardware that can be embedded in connected devices of all types, the IoT represents an unrivaled source of the massive amounts of data needed to develop countless new machine learning applications. Because they sit at the center of our daily lives, IoT devices capture data at a personal level. Here, a data row isn’t just a store purchase or a click on a website anymore; it is your real-time location or speed, your heart rate, or your grocery list, all of which can leave data scientists with too many choices.

But that doesn’t necessarily mean that building training sets is now easier than ever. In fact, with more options come more parameters to take into account to ensure those datasets are just right. So how do you navigate this tremendous amount of actionable data? Below are a few items to keep in mind as you develop a training set for your next smart IoT application.

Start with the problem, not the data

Just a few short years ago, collecting data was by no means the easy task that it is today. To have a sufficiently large historical dataset to train a new model, a data scientist had to forecast their needs well ahead of time. This is certainly still true, in particular when the data shows slow seasonal patterns that the model will have to account for. Yet the fact that data is much cheaper to collect gives data scientists a lot more flexibility for experimentation.

This has had a major impact on the way data scientists work today. They can now afford to think about data collection after the business problem has been identified and formulated. The need to capture data “just in case” someone might want the geolocation information of the customer in the future hasn’t been fully eliminated, but it is certainly becoming less critical to the success of an organization. Nowadays, a data scientist can typically afford to wait until they have more visibility not only on the application they will build, but even, to some extent, on the type of model that they are likely to use, giving them an opportunity to tailor the data to their needs.

Keep iterating

Similarly, the fact that data collection isn’t the bottleneck anymore gives data scientists a lot more flexibility to test how their datasets perform with their models and to identify weaknesses in the data. Ultimately, they can modify their requirements, identify opportunities to capture a new feature, or even recommend improvements to hardware. This might feel a little counterintuitive to data scientists who have been trained to anticipate all future needs up front, because changing data collection requirements used to mean additional work for the engineering team or a direct impact on the customer.

With this new order, data scientists can now collect datasets in several steps, training a first model and using a feedback loop to learn about the corner cases that hurt the quality of the application the most. There is even an opportunity to use an active learning approach by allowing the algorithm to dynamically identify its pain points and weaknesses as it is being trained. For instance, if a first pass on the model shows that the driver-assistance application fails to recognize lanes in sunny weather, the data scientist can decide to extend the dataset with additional examples recorded somewhere sunny, such as New Mexico.
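One common way to let a model point at its own weaknesses is uncertainty sampling: after a first training pass, the unlabeled examples on which the model is least confident are sent for labeling first. The snippet below is a minimal sketch of that idea, not a full active learning pipeline. The function name `least_confident` and the probability values are hypothetical and only serve as an illustration.

```python
def least_confident(probabilities, k):
    """Return the indices of the k examples whose top class probability is lowest."""
    # Pair each example's highest class probability with its index,
    # then sort so the least confident predictions come first.
    scored = sorted((max(p), i) for i, p in enumerate(probabilities))
    return [i for _, i in scored[:k]]


# Hypothetical outputs of a first-pass lane detector on five unlabeled frames.
# Each row holds the predicted probabilities for ("lane", "no lane").
pool_probs = [
    [0.98, 0.02],
    [0.55, 0.45],
    [0.10, 0.90],
    [0.51, 0.49],
    [0.80, 0.20],
]

to_label = least_confident(pool_probs, k=2)
print(to_label)  # indices of the frames to annotate next
```

In practice, the selected frames would be annotated, added to the training set, and the model retrained before the next round of selection.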

Collect enough data

Data scientists love to brag about how exciting it is to work in the machine learning field these days. The old days of “not having enough data” might soon be a thing of the past, if that isn’t already the case. This is excellent news for people working with highly complex models such as deep neural networks.

Even though the technology at the core of deep learning was invented decades ago, its practical use in machine learning applications only recently took off due to two factors: the development and adoption of GPUs, and the spectacular volumes of data that we are now capable of collecting. Algorithms using a deep network architecture are known to be data-greedy because of the large number of parameters they involve: the more layers and neurons a network has, the more parameters need to be adjusted on the data. If not enough data is available, the model is likely to overfit: it will identify pseudo-patterns in noise and interpret them as a real effect when they are really statistical fluctuations. Finally, more data generally translates into higher accuracy, which is why data scientists are notoriously fond of using large datasets whenever possible.
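To get a feel for how quickly the number of parameters grows with depth and width, the sketch below counts the weights and biases of a plain fully connected network. It assumes dense layers only, and the layer sizes are made up for illustration; convolutional or recurrent architectures are counted differently.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between the two layers
        total += n_out         # one bias per unit in the receiving layer
    return total


# Hypothetical architectures with the same input (100) and output (10) sizes.
shallow = dense_param_count([100, 50, 10])
deep = dense_param_count([100, 200, 200, 200, 10])

print(shallow)  # 5560
print(deep)     # 102610
```

The deeper, wider network has roughly eighteen times more parameters to fit, which gives an intuition for why it typically needs far more training data.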

Collect a representative sample of the target data

Collecting a lot of data doesn’t mean anything if you are not collecting the right data. This is particularly true in the context of IoT.

IoT devices pose a very interesting challenge when it comes to data acquisition: even though all IoT users may have the exact same device at their disposal, these devices are still unique, come with their own subtle differences, and are used in different environments and situations. For example, you and your distant cousin might have the same car, but his lane detection device could be mounted just a tiny bit higher than yours, or could have a small defect that makes colors duller. 

When gathering data for the development of machine learning applications for IoT devices, it is important to make sure that the data does not all come from a single device and that it is a good representation of the scenarios the application will face. In the case of self-driving vehicles, for example, many different factors determine whether the data is diverse enough to account for the corner cases: in some locations, the roads will be covered in snow most of the time or washed by constant rain; some drivers will never drive at night, while others will only drive after dark. It is therefore very important to collect a wide diversity of examples and bring them together in your training set, so that no biases are introduced.
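A simple way to check for this kind of gap is to list the combinations of conditions that never appear in the collected data, and to measure how much of the data comes from a single device. The sketch below assumes each record carries hypothetical `device_id`, `weather`, and `time_of_day` fields; real metadata will differ.

```python
from collections import Counter
from itertools import product


def missing_combinations(records, keys, expected):
    """List the expected combinations of conditions that have no example."""
    seen = Counter(tuple(r[k] for k in keys) for r in records)
    all_combos = product(*(expected[k] for k in keys))
    return [combo for combo in all_combos if seen[combo] == 0]


def largest_device_share(records):
    """Return the fraction of records contributed by the most frequent device."""
    counts = Counter(r["device_id"] for r in records)
    return max(counts.values()) / len(records)


# Hypothetical metadata for four recorded driving sessions.
records = [
    {"device_id": "A", "weather": "dry", "time_of_day": "day"},
    {"device_id": "A", "weather": "dry", "time_of_day": "night"},
    {"device_id": "A", "weather": "rain", "time_of_day": "day"},
    {"device_id": "B", "weather": "snow", "time_of_day": "day"},
]
expected = {"weather": ["dry", "rain", "snow"], "time_of_day": ["day", "night"]}

gaps = missing_combinations(records, ["weather", "time_of_day"], expected)
share = largest_device_share(records)
print(gaps)   # [('rain', 'night'), ('snow', 'night')]
print(share)  # 0.75
```

Here the check reveals that no night-time examples exist for rain or snow, and that one device contributes three quarters of the data.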

Leverage what you know

One major difficulty in developing a machine learning application is that it is very hard to build a model-agnostic training dataset. For instance, some algorithms, like random forests, can be very sensitive to class imbalance, that is, to large differences in the number of examples available for each class. The imbalance itself can sometimes be desirable, typically when you need to leverage the relative rarity of one class compared to the others. However, depending on the use case or even the metric that you are using to measure success (specifically, accuracy vs. precision/recall), rebalancing the data with an over- or under-sampling method might be necessary.
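As an illustration of the simplest over-sampling method, the sketch below duplicates randomly chosen examples from the smaller classes until every class matches the size of the largest one. This is plain random over-sampling, one option among several, and the function name, labels, and data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(examples, labels, seed=0):
    """Duplicate minority-class examples until all classes have the same size."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw extra examples with replacement to reach the target size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Hypothetical imbalanced dataset: 8 "lane" frames and 2 "no_lane" frames.
frames = [f"frame_{i}" for i in range(10)]
labels = ["lane"] * 8 + ["no_lane"] * 2

balanced_x, balanced_y = random_oversample(frames, labels)
print(Counter(balanced_y))  # both classes now have 8 examples
```

Over-sampling should be applied to the training split only; duplicating examples before splitting would leak copies of the same example into the evaluation set.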

Because collecting data is now cheaper and easier than ever, data scientists now have an option to build customized datasets depending on their specific use case.

Think of annotation quality

We have now talked a lot about the collection of data and have seen that in the era of Big Data, gathering data isn’t as much of a challenge as it used to be. However, a new challenge is now facing the machine learning community: as collecting data becomes easier and easier, processing, and in particular, annotating this data is becoming the new bottleneck.

Because most machine learning solutions use a supervised approach, all this data is essentially useless without labels. While in some cases data comes “naturally labeled”, as in eCommerce, where a click or a purchase indicates the customer’s interest in a specific product, most other cases require data to be labeled separately through a process that was until now done manually. The volume of data to be annotated has reached a point where all the human labor on the planet wouldn’t be sufficient to meet the demand for annotation, which is why companies like Figure Eight are taking a Human-in-the-Loop approach to providing high-quality labels.

Beyond the sheer volume of annotations required comes the constant need for quality control. Some human-in-the-loop platforms fail to guarantee that data has been annotated according to the customer’s requirements. And while it is the responsibility of the platform provider to ensure that the contributors who generate the annotations are doing their job right, it is up to the data scientist who requests those annotations to design clear labeling instructions and think of all the corner cases that the workers will face: do people on bikes count as pedestrians in the context of self-driving cars? How should occluded objects be handled?
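One basic quality control technique is to have several contributors label the same item, keep the majority label, and flag items where agreement is low, since these often reveal an ambiguous instruction. The sketch below is a generic illustration and does not describe how any particular annotation platform works; the function name, the 0.7 threshold, and the votes are assumptions.

```python
from collections import Counter


def aggregate_labels(votes, min_agreement=0.7):
    """Return (majority label, agreement ratio, needs_review flag) for one item."""
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return label, agreement, agreement < min_agreement


# Hypothetical votes from three contributors on two image crops.
votes_by_item = {
    "crop_001": ["pedestrian", "pedestrian", "pedestrian"],
    "crop_002": ["pedestrian", "cyclist", "cyclist"],
}

results = {item: aggregate_labels(v) for item, v in votes_by_item.items()}
for item, (label, agreement, needs_review) in results.items():
    print(item, label, round(agreement, 2), needs_review)
```

The second crop is flagged for review because only two of three contributors agree, which is exactly the kind of disagreement the "person on a bike" question can produce when the instructions do not address it.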

Once again, as the creator of the application, the data scientist is best placed to use their knowledge of the use case to come up with the best recommendation and to iterate if necessary.

Closing comment: biases

Each application is different, so it is no surprise that each dataset you develop will have its own set of challenges. Regardless of the goal you are trying to reach, one constant across all IoT applications will be the problem of biases.

Because of the large diversity of situations and conditions under which each unit of a given IoT device functions, building an extensive dataset covering all corner cases turns out to be an incredibly difficult task, which causes smart IoT applications to be prone to biases. Forget to add examples of hot weather when building self-driving car software, and the sun glare on the road might just be interpreted by the model as moisture, causing the windshield wipers to activate.

The good news is that this type of artifact is usually easy to fix once it has been diagnosed, which is why continuously refining your dataset in parallel with the development of your model is usually good practice.
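A practical way to diagnose such artifacts is to break evaluation results down by operating condition instead of relying on a single overall score. The sketch below computes accuracy per condition from hypothetical evaluation rows; the conditions, labels, and predictions are invented for illustration.

```python
from collections import defaultdict


def accuracy_by_slice(rows):
    """Compute accuracy separately for each condition.

    Each row is a (condition, true_label, predicted_label) tuple.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, truth, predicted in rows:
        totals[condition] += 1
        hits[condition] += int(truth == predicted)
    return {condition: hits[condition] / totals[condition] for condition in totals}


# Hypothetical road-surface predictions grouped by weather condition.
rows = [
    ("overcast", "dry", "dry"),
    ("overcast", "wet", "wet"),
    ("sunny", "dry", "wet"),   # sun glare mistaken for moisture
    ("sunny", "dry", "dry"),
    ("sunny", "dry", "wet"),   # sun glare mistaken for moisture
    ("rain", "wet", "wet"),
]

per_slice = accuracy_by_slice(rows)
for condition, accuracy in per_slice.items():
    print(condition, round(accuracy, 2))
```

A condition whose accuracy is clearly below the others, like "sunny" in this made-up example, is a signal that more examples of that condition should be collected.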


Jennifer Prendki is currently the VP of Machine Learning at Figure Eight, the Human-in-the-Loop AI category leader. She has spent most of her career creating a data-driven culture wherever she went, succeeding in sometimes highly skeptical environments. She is particularly skilled at building and scaling high-performance Machine Learning teams, and is known for enjoying a good challenge. Trained as a particle physicist, she likes to use her analytical mind not only when building complex models, but also as part of her leadership philosophy. She is pragmatic yet detail-oriented. Jennifer also takes great pleasure in addressing both technical and non-technical audiences at conferences and seminars, and is passionate about attracting more women to careers in STEM.
