Gretel Unveils the Largest Open Source Text-to-SQL Dataset, Enabling Businesses to Harness AI’s Potential

Gretel, an innovator in synthetic data, has taken a groundbreaking step in democratizing access to high-quality AI training data. On Thursday, the company unveiled the world’s largest open source Text-to-SQL dataset, an announcement that is likely to accelerate AI model training and open up opportunities for businesses worldwide.

Gretel has made the dataset available on Hugging Face under an Apache 2.0 license. It contains over 100,000 synthetic Text-to-SQL samples across 100 verticals, giving developers the material they need to build AI models that understand natural language questions and produce SQL queries, connecting business users to complex data sources.

Yev Meyer, Chief Scientist at Gretel, pointed out in an interview with VentureBeat that access to quality training data is one of the main obstacles to building with generative AI. High-quality synthetic data may fill this void, and a renewed focus on data quality is one of the notable shifts underway in Large Language Models and AI.

Gretel Navigator, a sophisticated compound AI system currently in public preview, was used to generate the dataset. “Gretel Navigator’s Text-to-SQL dataset was produced through agent-based execution combined with multiple proprietary models (such as their custom tabular Large Language Model), privacy enhancement technologies and privacy-preserving strategies – producing high quality synthetic data on demand from scratch,” explained Meyer.

Gretel’s dataset offers businesses in all industries a way to access and leverage data stored in complex databases, data warehouses and data lakes. The dataset is also designed to be easy for end users to follow: it comes with plain English descriptions of the SQL code, which makes the output simpler to understand.
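To make the idea concrete, here is a minimal sketch of what a single Text-to-SQL sample might look like and how its query could be checked against a database. The record below is invented for illustration; the field names (`question`, `schema`, `seed_rows`, `sql`, `explanation`) and the `run_sample` helper are assumptions, not the actual layout of Gretel's dataset.

```python
import sqlite3

# Hypothetical record: the field names and values are illustrative assumptions,
# not the real schema or contents of Gretel's dataset.
sample = {
    "question": "What is the total revenue per region?",
    "schema": "CREATE TABLE sales (region TEXT, revenue REAL);",
    "seed_rows": (
        "INSERT INTO sales VALUES "
        "('North', 100.0), ('South', 250.0), ('North', 50.0);"
    ),
    "sql": (
        "SELECT region, SUM(revenue) AS total "
        "FROM sales GROUP BY region ORDER BY region;"
    ),
    "explanation": "Groups sales rows by region and sums the revenue in each group.",
}


def run_sample(record):
    """Build the sample's schema in an in-memory SQLite database and run its SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(record["schema"])     # create the tables
        conn.executescript(record["seed_rows"])  # load a few example rows
        return conn.execute(record["sql"]).fetchall()
    finally:
        conn.close()


rows = run_sample(sample)
print("Question:   ", sample["question"])
print("SQL:        ", sample["sql"])
print("Explanation:", sample["explanation"])
print("Result:     ", rows)
```

Pairing the natural language question with the schema, the SQL and a plain English explanation is what lets a non-specialist read the result and see why the query returns it.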

Gretel is known for its dedication to data quality, evident in its rigorous validation processes. “Every dataset we generate is assessed for quality benchmarking as part of what we do,” according to Meyer. By Gretel’s own evaluation, the Text-to-SQL dataset consistently outperformed others, whether it was scored by independent services or with the LLM-as-judge technique.