Smart data: The next frontier in the IoT
I can hear through my computer screen the voices of hundreds of data scientists protesting that larger training sets are the answer to their pains. After all, haven’t generations of experts told us, over and over again, the more data, the better?
The reality is not that simple. Clearly, access to high-quality, large-enough datasets is key to making progress in machine learning. Yet if your doctor told you that you were sick and urgently needed large intakes of vitamin C, wouldn’t you want to carefully identify the foods that contain that vitamin instead of rushing to your kitchen and eating everything you could find there? No doubt consuming everything in your fridge or pantry would eventually get you some much-needed vitamin C, but in the process you would also consume many empty calories. Unfortunately, it seems that the way we approach machine learning today requires intervention from the equivalent of nutritionists.
This might seem like a silly analogy, but there’s some heft here. For instance, the belief that more powerful GPU machines will eventually get us out of trouble is deeply flawed, in the same way as the belief that overeating will help us with vitamin intake. In reality, not only is a lot of the data we collect redundant or irrelevant to the models we try to train with it, it is oftentimes detrimental to those models. For example, overgrown training sets are often unbalanced and can lead to overfitting. Some extreme outliers might actually cause models to “unlearn.” Data can be mislabeled, miscollected or faulty.
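To make those failure modes a little more tangible, here is a minimal Python sketch that checks a toy dataset for three of them: exact duplicates, class imbalance and extreme outliers. It is an illustration under stated assumptions, not a description of any particular tool. The function name `audit_dataset`, the `outlier_threshold` default of 3.5 and the sample data are all hypothetical, and the outlier check is a simple modified z-score based on the median absolute deviation, one common heuristic among many.

```python
from collections import Counter
from statistics import median


def audit_dataset(rows, labels, outlier_threshold=3.5):
    """Return simple data-quality signals for a small numeric dataset.

    rows   -- list of equal-length tuples of numeric features
    labels -- list of class labels, one per row
    The name, signature and threshold are illustrative assumptions.
    """
    # Redundancy: indices of rows that exactly repeat an earlier (row, label) pair.
    seen = set()
    duplicate_indices = []
    for i, (row, label) in enumerate(zip(rows, labels)):
        key = (tuple(row), label)
        if key in seen:
            duplicate_indices.append(i)
        else:
            seen.add(key)

    # Imbalance: size of the largest class divided by the size of the smallest.
    counts = Counter(labels)
    imbalance_ratio = max(counts.values()) / min(counts.values())

    # Outliers: modified z-score per feature, using the median absolute
    # deviation (MAD) so that the outliers themselves do not mask the check.
    outliers = set()
    for column in zip(*rows):
        center = median(column)
        mad = median(abs(x - center) for x in column)
        if mad == 0:
            continue  # feature is (nearly) constant; skip it
        for i, x in enumerate(column):
            if 0.6745 * abs(x - center) / mad > outlier_threshold:
                outliers.add(i)

    return {
        "duplicate_indices": duplicate_indices,
        "imbalance_ratio": imbalance_ratio,
        "outlier_indices": sorted(outliers),
    }


if __name__ == "__main__":
    # Hypothetical toy data: one repeated row, one rare class, one extreme value.
    sample_rows = [(1.0, 2.0), (1.1, 2.1), (1.0, 2.0), (0.9, 1.9), (1.2, 2.2), (50.0, 2.0)]
    sample_labels = ["a", "a", "a", "a", "b", "a"]
    print(audit_dataset(sample_rows, sample_labels))
```

A check like this only surfaces candidates. Whether a flagged row is actually harmful still depends on the model and the use case, and mislabeled points in particular cannot be caught by statistics this simple.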
This raises an important question: If data scientists are best suited to say which data would be most useful for a model to learn from, why do they still take virtually no part in designing the hardware devices meant to collect data, and why do they rarely have an opportunity to give feedback on the data collection process itself?
The answer is more straightforward than it might first seem: Just as nutritionists give different advice to different clients because of their unique nutritional needs, a data scientist can only advise which data to collect for a specific use case. In short, the data that is most informative for training one model might be completely irrelevant to another, which makes it challenging to triage data agnostically at the source.
Hardware obviously can’t solve all of those issues. It can’t figure out which data rows are decreasing the model’s accuracy. It can’t figure out which ones are redundant. It can’t relabel bad data points. In other words, storing too much data can’t fix the problems that arise from storing too much data. We need to focus on creating an additional layer of intelligence capable of sorting meaningful data from the dross. We need to start paying attention to the data scientists who are building great models from smaller, curated datasets. We need to understand that yes, you can overfeed your model.
The future of data isn’t in gigantic server farms that house every single data point, regardless of which are actually useful. It’s in small, smart data. It’s about thoughtful approaches based on the quality of data and its relevance to the use case, not sloppy approaches based largely on quantity. And it’s more accessible to those of us without endless budgets for labeling and servers. In other words, it’s both more intelligent and more democratic.
That’s something we can all get behind.
Jennifer Prendki is the founder and CEO of Alectio. The company is the direct product of her beliefs that good models can only be built with good data, and that the brute-force approach of blindly using ever-larger training sets is the reason the barrier to entry into AI is so high. Prior to starting Alectio, Jennifer was the VP of Machine Learning at Figure Eight, the pioneer in data labeling; Chief Data Scientist at Atlassian; and Senior Manager of Data Science in the Search team at Walmart Labs. She holds a PhD in particle physics from Sorbonne University. Her favorite slogans are: "not all data is created equal", "data is the new plastic" and "Smart Data > Big Data".