It’s never been easier to collect data than it is today. A few clicks and you’re up and running, armed with all the best data technologies the cloud has to offer, ready to hoard all the data you possibly can. It can be hard to believe that just a decade ago, things were dramatically different. Collecting data at scale was, in fact, only an option for the largest corporations, organizations that could afford both the expensive servers that were the only viable option to store all of the data and the select few engineers who were capable of making the best out of it, back in the days when data science was just a budding field.
Nowadays, luckily, generating data is not just a corporate sport anymore. In fact, thanks to the Internet of Things (IoT), we have all become, for better or for worse, little Big-Data factories. By 2020, a single human will be responsible for generating 1.7 MB of data per second. Even now, a single autonomous vehicle generates 11 TB of data per day. And this trend shows no signs of abating. On the contrary: it’s only going to grow.
This is obviously excellent news for all the data aficionados out there. It wasn’t long ago that collecting high-quality datasets was an onerous and painstaking task. Still, we always want more. If it ever seems like your brand-new Deep Learning model is “only” reaching 92% accuracy, the easiest excuse at hand is to blame the data. “My dataset isn’t large enough”, we tell our bosses nonchalantly. “But if we wait a few more weeks, this model will be the best you have ever seen!”
This seems to pose an important question: How much data is actually enough? But it actually poses an even more important one: How much data is too much?
Interestingly, we don’t hear this question frequently in machine learning circles, even though we really should. While Big Data is a huge opportunity, it is also a gigantic, 40-zettabyte liability. If data is indeed the new oil, we need to push the analogy to its limits: Data is an extremely lucrative resource, but just like oil, it needs to be refined. Our failure to restrain ourselves from uncontrolled usage is putting us at risk. In short, the way we use and think about data today is deeply unsustainable, a fact that is still barely reaching collective consciousness.
Maybe, just maybe, this is the wrong conversation to have. Maybe Big Data isn’t really the answer to AI after all.
Let’s step back for a moment and think about what it is that we are really collecting. Back in the early days of digitalization, data collection was indeed more costly, so we picked our spots. We were more responsible and a bit more conscientious. As generating and gathering data became easier and easier, less attention was paid to quality, while quantity became a natural by-product of new technologies such as cloud storage, cloud compute, GPU machines, large-scale data management and transfer systems. Quickly, data became a commodity, but with the continued escalation of data and data storage, no one asked the simple question: Why are we collecting this? Does it even make sense?
With the commoditization of model building, data moats certainly might seem the obvious answer to differentiation in AI, but did we all miss the big picture? Data ages. It gets stale. And ultimately, even if we have been lured into believing that data and information are the same thing, all data is not created equal. A teenager’s 20 selfies, taken before posting to Instagram, are certainly different from a searchable catalog of medical literature, after all.
None of this seems like a problem as long as we cling to the belief that progress in hardware will keep us safe from the data apocalypse. Data storage is getting cheaper by the day, and compute power is increasingly accessible. But that stays true only if the generation of data is offset by engineers’ ability to keep up with Moore’s Law. Even if they can do that indefinitely, consider this: If not all data is equally informative, then what’s the point in processing subpar or redundant data?
I can hear through my computer screen the voices of hundreds of data scientists protesting that larger training sets are the answer to their pains. After all, haven’t generations of experts told us, over and over again, the more data, the better?
The reality is just not that simple. Clearly, access to quality, large-enough datasets is key to making progress in machine learning. Yet, if your doctor told you that you were sick and urgently needed large intakes of vitamin C, wouldn't you want to conscientiously identify the foods that actually contain that vitamin, instead of rushing to your kitchen and eating every single thing you could find in there? No doubt consuming everything in your fridge or pantry would eventually get you some much-needed vitamin C, but in the process you would also consume many empty calories. Unfortunately, it seems that the way we approach machine learning today requires intervention from the equivalent of nutritionists.
This might seem like a silly analogy, but there’s some heft here. For instance, the belief that more powerful GPU machines will eventually get us out of trouble is deeply flawed, in the same way that believing overeating will fix our vitamin intake is flawed. In reality, not only is a lot of the data we collect redundant or irrelevant to the models we try to train with it, it is oftentimes detrimental to those models. For example, overgrown training sets are often unbalanced and can lead to overfitting. Some extreme outliers might actually cause models to “unlearn.” Data can be mislabeled, miscollected or faulty.
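To make the redundancy point concrete, here is a minimal sketch of how a curation step might strip near-duplicate rows from a training set before any model ever sees them. The greedy function, its name, and the distance `threshold` are all illustrative choices, not a reference to any particular tool:

```python
import math

def drop_near_duplicates(samples, threshold=0.1):
    """Greedily keep only samples that sit at least `threshold`
    (Euclidean distance) away from every sample already kept.
    A toy stand-in for a real redundancy filter."""
    kept = []
    for x in samples:
        if all(math.dist(x, k) >= threshold for k in kept):
            kept.append(x)
    return kept

# A "large" training set that is mostly copies of the same two points:
raw = [(0.0, 0.0), (0.001, 0.0), (0.0, 0.002)] * 100 + [(1.0, 1.0)] * 50
curated = drop_near_duplicates(raw)
print(len(raw), len(curated))  # 350 rows collapse to 2 informative ones
```

Storing, transferring and training on the other 348 rows would cost real money and compute while adding essentially no information, which is the heart of the argument above.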
This raises an important question: If data scientists are best suited to provide feedback on what data might be most useful for a model to learn, then why do they still take virtually no part in the design of the hardware devices meant to collect data, and rarely have an opportunity to provide feedback on the data collection process itself?
The answer is actually more straightforward than it might first seem: Just like nutritionists give different advice to different customers due to their unique nutritional needs, a data scientist also can only advise which data to collect for a specific use case. In short, the most informative data in the context of the training of a given model might actually be completely irrelevant to another, which makes it challenging for data to be triaged agnostically at the source.
Hardware obviously can’t solve all of those issues. It can’t figure out which data rows are decreasing the model’s accuracy. It can’t figure out which ones are redundant. It can’t relabel bad data points. In other words, more storage can’t fix the problems that arise from storing too much data. We need to focus on creating an additional layer of intelligence capable of sorting meaningful data from the dross. We need to start paying attention to the data scientists who are building great models from smaller, curated datasets. We need to understand that yes, you can overfeed your model.
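One well-known form such a layer of intelligence can take is uncertainty sampling from active learning: instead of keeping everything, rank unlabeled samples by how unsure the current model is about them and keep only the most informative ones. The sketch below assumes a hypothetical `predict_proba` callback standing in for whatever model sits in the loop; the toy model and numbers are invented for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_informative(samples, predict_proba, k):
    """Rank samples by the model's uncertainty (prediction entropy)
    and keep the top k -- the core move of uncertainty sampling.
    `predict_proba` is a placeholder for the model in the loop."""
    ranked = sorted(samples,
                    key=lambda s: entropy(predict_proba(s)),
                    reverse=True)
    return ranked[:k]

# Toy binary "model": confident near 0 or 1, uncertain near 0.5.
def toy_proba(x):
    return (x, 1.0 - x)

pool = [0.01, 0.5, 0.95, 0.49, 0.02]
picked = select_most_informative(pool, toy_proba, k=2)
print(picked)  # the two most ambiguous points: [0.5, 0.49]
```

The design choice is the point: the selection criterion lives with the model, not with the storage hardware, which is exactly why triaging data agnostically at the source is so hard.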
The future of data isn’t in gigantic server farms that house every single data point, regardless of which are actually useful. It’s in small, smart data. It’s about thoughtful approaches based on the quality of data as well as its relevance to the use case — not sloppy approaches based largely on quantity. And it’s more accessible to those of us without endless budgets for labeling and servers. In other words, it’s both more intelligent and more democratic.
That’s something we can all get behind.
Jennifer Prendki is the founder and CEO of Alectio. The company is the direct product of her beliefs that good models can only be built with good data, and that the brute-force approach of blindly using ever larger training sets is the reason the barrier to entry into AI is so high. Prior to starting Alectio, Jennifer was the VP of Machine Learning at Figure Eight, the pioneer in data labeling, Chief Data Scientist at Atlassian and Senior Manager of Data Science in the Search team at Walmart Labs. She holds a PhD in particle physics from Sorbonne University. Her favorite slogans are: “not all data is created equal”, “data is the new plastic” and “Smart Data > Big Data”.