Smart data: The next frontier in the IoT
It’s never been easier to collect data than it is today. A few clicks and you’re up and running, armed with all the best data technologies the cloud has to offer, ready to hoard all the data you possibly can. It can be hard to believe that just a decade ago, things were dramatically different. Collecting data at scale was only an option for the largest corporations: organizations that could afford both the expensive servers then required to store all of that data and the select few engineers capable of making the most of it, back when data science was just a budding field.
Nowadays, luckily, generating data is not just a corporate sport anymore. In fact, thanks to the Internet of Things (IoT), we have now all become, for better or for worse, little Big-Data factories. By 2020, a single human will be responsible for the generation of 1.7 MB of data per second. Even now, just a single autonomous vehicle generates 11 TB of data per day. And this trend shows no signs of abating. On the contrary: it’s just going to grow.
This is obviously excellent news for all the data aficionados out there. It wasn’t long ago that collecting high-quality datasets was an onerous and painstaking task. Still, we always want more. If it ever seems like your brand-new Deep Learning model is “only” reaching 92% accuracy, the readiest excuse is to blame the data. “My dataset isn’t large enough,” we tell our bosses nonchalantly. “But if we wait a few more weeks, this model will be the best you have ever seen!”
This seems to pose an important question: How much data is actually enough? But it actually poses an even more important one: How much data is too much?
Interestingly, we don’t hear this question frequently in machine learning circles, even though we really should. While Big Data is a huge opportunity, it is also a gigantic, 40-zettabyte liability. If data is indeed the new oil, we need to push the analogy to its limits: data is an extremely lucrative resource but, just like oil, it needs to be refined. Our failure to restrain ourselves from uncontrolled usage is putting us at risk. In short, the way we use and consider data today is deeply unsustainable, and this fact is still barely reaching collective consciousness.
Maybe, just maybe, this is the wrong conversation to have. Maybe Big Data isn’t really the answer to AI after all.
Let’s step back for a moment and think about what it is that we are really collecting. Back in the early days of digitalization, data collection was indeed more costly, so we picked our spots. We were more responsible and a bit more conscientious. As generating and gathering data became easier and easier, less attention was paid to quality, while quantity became a natural by-product of new technologies such as cloud storage, cloud compute, GPU machines, large-scale data management and transfer systems. Quickly, data became a commodity, but with the continued escalation of data and data storage, no one asked the simple question: Why are we collecting this? Does it even make sense?
With the commoditization of model building, data moats might certainly seem the obvious answer to differentiation in AI, but did we all miss the big picture? Data ages. It gets stale. And ultimately, even if we have been lured into believing that data and information are two deeply different things, not all data is created equal. The 20 selfies a teenager takes before posting to Instagram are certainly different from a searchable catalog of medical literature, after all.
None of this seems like a problem as long as we cling to the belief that progress in hardware will keep us safe from the data apocalypse. Data storage is getting cheaper by the day, and compute power is increasingly accessible. That holds only if the generation of data is offset by engineers’ ability to keep up with Moore’s Law. Even if they can do that indefinitely, consider this: if not all data is equally informative, then what’s the point in processing subpar or redundant data?