Data access remains a significant challenge for most companies, with synthetic data emerging as a potential solution. A recent survey conducted by MOSTLY AI in collaboration with KDnuggets reveals that 71% of respondents believe synthetic data can help address the data access issue.
The State of Synthetic Data in 2023
The survey, which involved over 300 participants from the data science and AI/ML community, aimed to understand the current state of synthetic data in 2023. The results indicate that only 15% of AI/ML models are in production, with 35% of respondents attributing the failure of AI/ML projects to a lack of AI/ML talent and 28% blaming a lack of data access. A staggering 61% of respondents noted that it takes months to access quality data, and 71% agreed that synthetic data could be the missing piece of the puzzle required for AI/ML projects to succeed.
Synthetic Data: A Privacy-Safe Way to Democratize Data Access
AI-powered synthetic data generators can produce representative synthetic data that can readily stand in for the original data. This approach offers a privacy-safe way to democratize data access and to augment datasets to fit specific purposes. The result is shorter time-to-data, easier data access, and more automation of data science tasks.
The Understanding and Adoption of Synthetic Data
Despite the potential benefits of synthetic data, the survey revealed considerable confusion around the term. About 59% of respondents didn’t know the difference between rule-based and AI-generated synthetic data. This suggests that synthetic data companies have a significant responsibility to educate data consumers about the benefits, limitations, and use cases of synthetic data.
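To make the distinction concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular product works: the toy "original" dataset, the column meanings (age and income), and all function names are hypothetical, and a fitted straight line with Gaussian noise stands in for the far richer generative models that AI-powered tools use. The rule-based generator follows hand-written ranges and never looks at the original data. The model-based generator first learns a relationship from the original data and then samples from what it learned.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation, written out so the sketch runs on older Python 3 versions."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def make_toy_original(n, seed=42):
    """Hypothetical 'original' data in which income rises with age."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 65)
        income = 1000 * age + rng.gauss(0, 5000)
        rows.append((age, income))
    return rows


def rule_based(n, seed=0):
    """Rule-based: values drawn from hand-written ranges, no original data consulted."""
    rng = random.Random(seed)
    return [(rng.uniform(20, 65), rng.uniform(15_000, 70_000)) for _ in range(n)]


def fit(rows):
    """Learn a simple model from the original rows: a least-squares line plus noise."""
    ages = [a for a, _ in rows]
    incomes = [i for _, i in rows]
    mean_age, mean_income = statistics.mean(ages), statistics.mean(incomes)
    slope = (sum((a - mean_age) * (i - mean_income) for a, i in rows)
             / sum((a - mean_age) ** 2 for a in ages))
    intercept = mean_income - slope * mean_age
    residuals = [i - (intercept + slope * a) for a, i in rows]
    return {
        "mean_age": mean_age,
        "sd_age": statistics.stdev(ages),
        "slope": slope,
        "intercept": intercept,
        "sd_noise": statistics.stdev(residuals),
    }


def model_based(model, n, seed=0):
    """Model-based: sample new rows from the fitted model, not from the original rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.gauss(model["mean_age"], model["sd_age"])
        income = (model["intercept"] + model["slope"] * age
                  + rng.gauss(0, model["sd_noise"]))
        rows.append((age, income))
    return rows


original = make_toy_original(500)
datasets = [
    ("original", original),
    ("rule-based", rule_based(500)),
    ("model-based", model_based(fit(original), 500)),
]
for name, rows in datasets:
    ages, incomes = zip(*rows)
    print(name, "age/income correlation:", round(pearson(ages, incomes), 2))
```

In this toy setup the rule-based rows look plausible column by column but carry no relationship between age and income, because none was written into the rules. The model-based rows reproduce the relationship because it was learned from the original data. That difference in statistical fidelity is the practical gap between the two approaches.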
Synthetic Data: The Future of AI
Among the survey respondents, 72% plan to use an AI-powered synthetic data generator within the next few years, and almost 40% plan to use one in the next three months. Most respondents cited data augmentation as their main use case.
However, data access remains an issue for most organizations, and privacy concerns are more pressing than ever. Although the urgency to adopt and scale AI is tangible across industries, data privacy issues and a lack of awareness of privacy-enhancing technologies, such as synthetic data, prevent most companies from capitalizing on the shift toward AI-supported work and services.
Addressing the Data Access Bottleneck
Only 18% of respondents said that access to quality data is not a problem for them. For 20%, getting access takes weeks, while for 61% it takes months. This lack of data access is a significant bottleneck for data-centric projects.
Synthetic Data: A Solution to Data Anonymization Challenges
When asked about the most frequently used data anonymization tools and techniques, 49% of respondents said that they use data masking to anonymize data. Another 20% said they simply remove PII from datasets, an approach that is not only unsafe from a privacy perspective but can also destroy the data utility needed for high-quality training data. Privacy-enhancing technologies, such as homomorphic encryption and AI-generated synthetic data, account for the remaining 31%.
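A small Python sketch can show why simply removing PII is risky. Everything here is hypothetical and for illustration only: the four records, the field names, and the helpers `drop_pii`, `mask_email`, and `count_singled_out`. The idea is that after direct identifiers such as name and email are dropped or masked, the remaining fields (often called quasi-identifiers) can still single out individuals when combined.

```python
from collections import Counter

# Hypothetical records; names, emails, and values are made up.
records = [
    {"name": "Ana", "email": "ana@example.com", "zip": "1010", "birth_year": 1985, "diagnosis": "A"},
    {"name": "Ben", "email": "ben@example.com", "zip": "1010", "birth_year": 1985, "diagnosis": "B"},
    {"name": "Cara", "email": "cara@example.com", "zip": "1020", "birth_year": 1990, "diagnosis": "A"},
    {"name": "Dev", "email": "dev@example.com", "zip": "1030", "birth_year": 1972, "diagnosis": "C"},
]

DIRECT_IDENTIFIERS = {"name", "email"}
QUASI_IDENTIFIERS = ("zip", "birth_year")


def drop_pii(row):
    """Remove direct identifiers and leave every other field untouched."""
    return {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}


def mask_email(email):
    """Simple masking: keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain


def count_singled_out(rows, keys):
    """Count rows whose combination of quasi-identifiers appears exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    return sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)


stripped = [drop_pii(row) for row in records]
print("masked email:", mask_email(records[0]["email"]))
print("rows still singled out after dropping PII:",
      count_singled_out(stripped, QUASI_IDENTIFIERS))
```

In this toy example two of the four rows remain unique on ZIP code and birth year even with names and emails gone, so anyone who knows those two facts about a person could find that person's row. Coarsening or removing the quasi-identifiers as well would lower that risk, but at the cost of the data utility the paragraph above describes.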
As the need for data access continues to grow, synthetic data is emerging as a promising solution to bridge the gap. With its potential to democratize data access and augment datasets, synthetic data is likely to play a crucial role in the future of AI and ML.