Developing training sets for the IoT
The future has never been so exciting. Thanks to tiny yet powerful hardware that can be embedded in connected devices of all types, the IoT represents an unrivaled source of the massive amounts of data needed to develop countless new machine learning applications. Because they are at the center of our daily lives, IoT devices allow the capture of data at a personal level. Here, a data row isn’t just a store purchase or a click on a website anymore; it is your real-time location or speed, your heart rate, or your grocery list, all of which can leave data scientists with too many choices.
But that doesn’t necessarily mean that building training sets is now easier than ever. In fact, with more options come more parameters to take into account to ensure those datasets are just right. So how do you navigate this tremendous amount of actionable data? Below are a few items to keep in mind as you develop a training set for your next smart IoT application.
Start with the problem, not the data
Just a few short years ago, collecting data was by no means the easy task that it is today. To have a historical dataset large enough to train a new model, a data scientist had to forecast their needs well ahead of time. This is certainly still true, in particular when the data shows slow seasonal patterns that the model will have to account for. Yet the fact that data is much cheaper to collect gives data scientists a lot more flexibility for experimentation.
This has had a major impact on the way data scientists work today. They can now afford to think about data collection after the business problem has been identified and formulated. The need to capture data “just in case” someone might want the geolocation information of the customer in the future hasn’t been fully eliminated, but it is certainly becoming less critical to the success of an organization. Nowadays, a data scientist can typically afford to wait until they have more visibility not only on the application they will build, but even, to some extent, on the type of model that they are likely to use, giving them an opportunity to tailor the data to their needs.
Similarly, the fact that data collection isn’t the bottleneck anymore gives data scientists a lot more flexibility to test how models trained on their datasets perform and to identify weaknesses in the data. Ultimately, they can modify their requirements, identify new opportunities to capture a new feature, or even recommend improvements to hardware. This might feel a little counterintuitive to data scientists who were trained to anticipate all future needs, because changing data collection requirements creates additional work for the engineering team and can directly affect the customer.
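One simple way to look for weaknesses in a dataset is to break evaluation results down by the condition under which each example was recorded. The sketch below is only an illustration: the wearable-style condition labels, the `accuracy_by_slice` helper, and the evaluation results are all made up for the example.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """Group (condition, is_correct) pairs by condition and return accuracy per condition."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        correct[condition] += int(is_correct)
    return {condition: correct[condition] / totals[condition] for condition in totals}

# Hypothetical evaluation results: (recording condition, was the prediction correct?)
results = [
    ("resting", True), ("resting", True),
    ("walking", True), ("walking", False),
    ("running", False), ("running", True), ("running", False), ("running", False),
]

per_slice = accuracy_by_slice(results)
# The condition with the lowest accuracy is a candidate for extra data collection.
weakest = min(per_slice, key=per_slice.get)
print(per_slice, weakest)
```

A slice that scores well below the others suggests where additional examples, a new feature, or a hardware change might help the most.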
With this new order, data scientists can now collect datasets in several steps, training a first model and using a feedback loop to learn about the corner cases that hurt the quality of the application the most. There is even an opportunity to use an active learning approach by allowing the algorithm to dynamically identify its pain points and weaknesses as it is being trained. For instance, if a first pass on the model shows that the driver-assistance application fails to recognize lanes in sunny weather, the data scientist can now decide to extend their dataset with additional examples recorded in New Mexico.
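Active learning comes in many flavors. A minimal sketch of one common variant, uncertainty sampling, is shown below: the examples the current model is least sure about are the ones sent for labeling first. The frame names, the probabilities, and the `select_for_labeling` helper are hypothetical and stand in for the output of a real model.

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a binary prediction with positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_for_labeling(probabilities, k):
    """Return the ids of the k unlabeled examples with the most uncertain predictions."""
    ranked = sorted(
        probabilities,
        key=lambda example_id: binary_entropy(probabilities[example_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical model outputs: P(lane detected) for each unlabeled frame.
scores = {
    "frame_cloudy_01": 0.97,
    "frame_sunny_07": 0.52,
    "frame_night_03": 0.08,
    "frame_sunny_12": 0.41,
}

to_label = select_for_labeling(scores, k=2)
print(to_label)
```

With these made-up scores, the two sunny frames sit closest to a 50/50 prediction, so they are selected first. In practice the selected examples would be labeled, added to the training set, and the model retrained before the next round.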
Collect enough data
Data scientists love to brag about how exciting it is to work in the machine learning field these days. The old days of “not having enough data” might soon be a thing of the past, if that isn’t already the case. This is excellent news for people working with highly complex models such as deep neural networks.
Even though the technology at the core of deep learning was invented decades ago, its practical usage for the creation of machine learning applications only recently took off due to two factors: the development and adoption of GPUs, and the spectacular volumes of data that we are now capable of collecting. Algorithms using a deep network architecture are known to be data-greedy because of the large number of parameters they involve: the more layers and neurons, the more parameters need to be fitted to the data. If not enough data is available, a phenomenon referred to as overfitting can occur: the model identifies pseudo-patterns in noise and interprets them as a real effect when they are really statistical fluctuations. Finally, more data generally translates into higher accuracy, which is why data scientists are notoriously fond of using large datasets whenever possible.
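To get a feel for how quickly the number of parameters grows with depth and width, the sketch below counts the weights and biases of a plain fully connected network. The layer sizes are arbitrary illustrations, and real architectures (convolutional, recurrent, and so on) are counted differently.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input first, output last.
    Each pair of consecutive layers contributes n_in * n_out weights plus n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# Two hypothetical networks taking the same 64 input features.
shallow = count_parameters([64, 32, 1])
deeper = count_parameters([64, 128, 128, 32, 1])
print(shallow, deeper)  # 2113 28993
```

Adding two wider hidden layers multiplies the number of parameters by more than ten here, which gives a rough sense of why deeper models need more data before they stop fitting noise.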