Synthetic data is critical for AI development - Embedded.com

An AI model is only as good as the data it’s trained on, but synthetic data can bridge the gap between model needs and data availability.

Advanced AI development today is still deeply rooted in 1950s computer science philosophies, including the phrase “garbage in, garbage out.” The adage reminds us that an AI model is only as good as the data it’s trained on.

From advanced cancer screening to movie recommendations, data scientists need large and diverse datasets to train AI models. That is a significant challenge with real-world data: authentic data is often protected for privacy reasons, which makes it hard to come by, expensive to source and potentially less diverse than desired.


Rev Lebaredian (Source: Nvidia)

Luckily, AI can come to its own rescue with synthetic datasets: computer-generated data that ensures an ample supply of diverse, anonymous training examples. The data can be created using various methods, such as generative adversarial networks or simulators built on conventional, non-AI procedures, that ensure a close resemblance to authentic data. By using synthetic datasets, AI developers benefit from higher-performing and more robust models.

A Dupe for Data

As developers reach the limits of readily available data, they will soon need to look elsewhere to improve their models. Synthetic data is information that computer simulations or algorithms generate as an alternative to real-world data to fill the gap between model needs and data availability.

Data scientists have many ways to generate synthetic data. Simulations and 3D renderings are excellent starting points. For example, a self-driving car is often trained by having it drive thousands of miles of virtual roads before it ever rolls on a real one. Generative adversarial networks (GANs), a class of generative models that create new data, can also be used for data production. Thanks to these methods, generating synthetic data has become more accessible and efficient than ever.

Analyst firm Gartner recently reported* that synthetic data is on a trajectory to go from a sideshow to the main force behind the future of AI. In the study, Gartner notes, “Synthetic data democratizes the playing field by allowing smaller organizations to create AI models without a lot of data, effectively solving their cold-start problem.”

Artificial Data Addresses AI’s Critical Need

AI is already ubiquitous. It has been integrated into our lives across healthcare, retail, entertainment, autonomous vehicles, smart spaces and more, through smart devices and technology that are accelerating us into the future.

Using AI as a digital mirror is the next step in its evolution. Yet variations in a particular environment can be innumerable. A shirt’s color may have many shades and hues. A room’s lighting changes with the movement of the sun or the turning on of lamps and lights.


This scene of vehicles in a tunnel uses indirect lighting. This is an example of a scene which is challenging to render accurately in real-time, but is enabled in Nvidia Drive Sim by the Nvidia Omniverse RTX renderer (Source: Nvidia)

The need to capture this complexity of conditions makes diverse synthetic datasets essential for AI model-making. Synthetic data can be generated to power digital twins with far less time and expense than is required to gather data from primary sources. This maximizes access to large amounts of diverse data and adds the benefit of being free from privacy concerns.

Underscoring the importance of this AI asset, Gartner adds, “Synthetic data is often seen as a lower-quality substitute, useful only when real data is inconvenient to get, expensive or constrained by regulation. This misses the true potential of synthetic data. The fact is you won’t be able to build high-quality, high-value AI models without synthetic data.”

Reality Is Really Random

Diverse training datasets are key for building AI models, but real-world data can fall short. Nvidia Isaac Sim, a robotics simulation application and synthetic data generation tool, includes a built-in domain randomization feature that randomly varies the textures, colors, lighting and object placement in simulations.

The same is true for Nvidia Drive Sim, a simulation platform for testing autonomous vehicles. It can change the size or language of a street sign, or the position of the sun.

These capabilities are highlighted in the O’Reilly Media report “Accelerating AI with Synthetic Data,” which stresses that safety and efficiency are priorities in simulations. According to the report, “Some problems that can be tackled by using synthetic data would be too costly or dangerous (e.g., in the case of training models controlling autonomous vehicles) to solve using more traditional methods, or simply cannot be done otherwise.”


The Nvidia Isaac simulation engine creates better photorealistic environments and streamlines synthetic data generation and domain randomization to build datasets for engineers and developers training and deploying robots in a broad range of applications (Source: Nvidia)

Randomizing conditions, like lighting, colors, and object placement, is essential for creating diverse synthetic training data for more accurate AI models. The variations in these digital worlds mirror the variations that appear in real life where the unexpected and unpredictable occur regularly.
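The randomization described above can be illustrated with a minimal Python sketch. It does not use the Isaac Sim or Drive Sim APIs; the names `SceneParams`, `sample_scene` and `generate_dataset`, as well as the parameter ranges, are hypothetical and chosen only for illustration. The idea is simply that each synthetic sample draws its lighting, color and object placement at random, so a model trained on the resulting scenes sees many variations of the same setting.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SceneParams:
    """Hypothetical description of one randomized synthetic scene."""
    light_intensity: float             # relative brightness, 0.2 (dim) to 1.0 (bright)
    sun_elevation_deg: float           # 0 = sun on the horizon, 90 = directly overhead
    object_rgb: Tuple[int, int, int]   # object color as 8-bit RGB
    object_xy: Tuple[float, float]     # object position in normalized coordinates


def sample_scene(rng: random.Random) -> SceneParams:
    """Draw one scene configuration; the ranges are assumptions for illustration."""
    return SceneParams(
        light_intensity=rng.uniform(0.2, 1.0),
        sun_elevation_deg=rng.uniform(0.0, 90.0),
        object_rgb=(rng.randint(0, 255), rng.randint(0, 255), rng.randint(0, 255)),
        object_xy=(rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)),
    )


def generate_dataset(n: int, seed: int = 0) -> List[SceneParams]:
    """Generate n randomized scene configurations, reproducible for a given seed."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]


if __name__ == "__main__":
    # Print a few sampled configurations; a real pipeline would hand each
    # one to a renderer or simulator to produce a labeled training image.
    for scene in generate_dataset(3, seed=42):
        print(scene)
```

In a real pipeline, each sampled configuration would be passed to a renderer or simulator to produce a labeled image. Seeding the generator keeps a dataset reproducible while still covering a wide range of conditions.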

In factories, for example, an object handled by one worker may end up in a different position when a different worker handles it. Variations in environmental conditions, like positioning, are significant when training robots to work in a real factory using synthetic data and simulations. These abilities have enabled the production of robust smart factories and cities.

The Critical Link Between Graphics and AI

Beyond virtual cities and factories, synthetic data has paved the way for a renaissance within computer graphics, as simulating worlds in 3D is now a key component for training AI models. In a 3D world, objects should fall, body parts should bend, and skin should be textured to closely resemble all the moving parts of humans.

The different ways an individual can appear in a virtual world, with natural bodily variations, facial features and behaviors, illustrate the true power of synthetic data. Diverse synthetic data can bridge the gap between virtual and real worlds with precision in features ranging from gravitational laws to bodily actions to skin texture.

Humans differ from each other with varying skin colors, reactions and expressions that can be displayed in media productions and digital replicas. Digital humans are only one part of the puzzle, as environmental conditions like lighting and object positioning are just as important in computer graphics and simulations.

For example, a self-driving car needs to be able to respond when the sun is low in the sky, potentially hindering visibility. Synthetic data can help to improve simulated worlds by creating more realistic virtual environments that are true digital twins of reality. Generating physically accurate environments and humans is extremely challenging and requires advanced simulation, high-performance computing resources and large amounts of data.


Nvidia Drive Sim uses high-fidelity and physically accurate simulation to create a safe, scalable and cost-effective way to bring self-driving vehicles to our roads (Source: Nvidia)

AI Advancing Its Own Future

The ability of AI to improve itself using synthetic data makes it a uniquely powerful technology. Synthesizing data is the key to improving both the quality and the quantity of robust training data for advanced models and simulations.

Each wave of AI innovation builds upon the last. The opportunity for synthetic data will extend beyond its use in current AI applications to industries including agriculture, autonomous vehicles, healthcare, robotics and more.

When developing data sources for AI, don’t let the words “artificial” and “synthetic” deter you. The data may be artificially created, but the results are essential to real success. Soon, an incredibly accurate digital mirror of reality will exist, built efficiently and accurately using synthetic data.

–Rev Lebaredian is vice president of simulation technology at Nvidia

*Gartner, “Maverick Research: Forget About Your Real Data — Synthetic Data Is the Future of AI,” Leinar Ramos, Jitendra Subramanyam, June 24, 2021.

>> This article was originally published on our sister site, EE Times.

