How to get Gemini to deeply understand your database

November 14, 2025
in Google

Step 1: Start with a clean foundation (data filtering)

One important tenet of fine-tuning is “garbage in, garbage out.” A model trained on a dataset with incorrect, inefficient, or ambiguous queries may learn incorrect patterns. The training data provided by the BIRD benchmark is powerful, but like most large-scale datasets, it’s not perfect.

Before we could teach the model to be a SQL expert, we had to curate a gold-standard dataset. We used a rigorous two-stage pipeline: first, execution-based validation, in which we executed every query and discarded any that failed, returned an error, or gave an empty result. Second, we used LLM-based validation, where multiple LLMs act as judges of the semantic alignment between the question and the SQL, catching queries that run but don’t actually answer the user’s question. This aggressive filtering resulted in a smaller, cleaner, and more trustworthy dataset that helped our model learn from a signal of pure quality rather than noise.
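The execution-based half of this filter can be sketched in a few lines. The schema, data, and helper name below are illustrative stand-ins, not the actual BIRD pipeline (which also adds the LLM-judge stage afterwards):

```python
import sqlite3

def filter_by_execution(examples, conn):
    """Keep only (question, sql) pairs whose SQL executes successfully
    and returns a non-empty result set; discard everything else."""
    kept = []
    for question, sql in examples:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # query raised an error: discard
        if rows:      # empty result: discard as well
            kept.append((question, sql))
    return kept

# Tiny in-memory schema standing in for a BIRD database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ada'), (2, 'grace')")

candidates = [
    ("How many users are there?", "SELECT COUNT(*) FROM users"),
    ("List user emails", "SELECT email FROM users"),               # no such column
    ("Users named bob", "SELECT * FROM users WHERE name = 'bob'"), # empty result
]
clean = filter_by_execution(candidates, conn)
```

Only the first candidate survives: the second errors out and the third returns no rows, so both are dropped before the LLM-judge stage would even see them.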

Step 2: Make the model a SQL specialist (multitask learning)

With a clean dataset, we could move on to the supervised fine-tuning itself. This is the process of taking a large, general-purpose model (in our case, Gemini 2.5 Pro) and training it further on our narrow, specialized dataset to make it an expert in a specific task.

To build these skills directly into the model, we leveraged the publicly available Supervised Tuning API for Gemini on Vertex AI. This service provided the foundation for our multitask supervised fine-tuning (SFT) approach, where we trained Gemini 2.5 Pro on several distinct-but-related tasks simultaneously.

We also extended our training data to cover tasks outside of the main Text-to-SQL realm, helping enhance the model’s reasoning, planning, and self-correction capabilities.

By training on this combination of tasks in parallel, the model learns a much richer, more robust set of skills. It goes beyond simple question-to-query mapping — it learns to deeply analyze the problem, plan its approach, and refine its own logic, leading to drastically improved accuracy and fewer errors.
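As a sketch, a multitask training set can be assembled as chat-style JSONL records, one task instruction per record, mixing the core text-to-SQL task with auxiliary tasks like self-correction. The field names below follow the general shape of the Vertex AI supervised tuning dataset format, but the prompts, helper, and exact schema here are illustrative; check the current Vertex AI documentation before uploading:

```python
import json

def make_example(task_instruction, user_input, target_output):
    """One training record in a chat-style JSONL shape (hedged
    approximation of the Vertex AI supervised tuning format)."""
    return {
        "systemInstruction": {"parts": [{"text": task_instruction}]},
        "contents": [
            {"role": "user", "parts": [{"text": user_input}]},
            {"role": "model", "parts": [{"text": target_output}]},
        ],
    }

# Distinct-but-related tasks trained in parallel (illustrative prompts).
records = [
    make_example(
        "Translate the question into SQL for the given schema.",
        "Schema: users(id, name)\nQuestion: How many users are there?",
        "SELECT COUNT(*) FROM users",
    ),
    make_example(
        "The SQL below failed. Explain the error and propose a fix.",
        "Query: SELECT email FROM users\nError: no such column: email",
        "The users table has no email column; use: SELECT name FROM users",
    ),
]

# This JSONL would be uploaded (e.g. to Cloud Storage) and referenced
# when launching a Vertex AI supervised tuning job.
jsonl = "\n".join(json.dumps(r) for r in records)
```

Because every record carries its own task instruction, the tuning job sees all tasks interleaved in one dataset, which is what lets the model learn them simultaneously rather than sequentially.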

Step 3: Inference accuracy + test-time scaling with self-consistency

The final step was to ensure we could reliably pick the model’s single best answer at test time. For this, we used a technique called self-consistency.

With self-consistency, instead of asking the model for just one answer, we ask it to generate several query candidates for the same question. We then execute these queries, cluster them by their execution results, and select a representative query from the largest cluster. This approach is powerful because if the model arrives at the same answer through different reasoning paths, that answer has a much higher probability of being correct.
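A minimal version of this selection loop can use any execution engine; here SQLite is a stand-in, and all names and data are illustrative:

```python
import sqlite3

def pick_self_consistent(candidates, conn):
    """Execute each candidate SQL, cluster candidates by their execution
    result, and return one query from the largest cluster."""
    by_result = {}
    for sql in candidates:
        try:
            # frozenset makes the result hashable and row-order-insensitive
            key = frozenset(conn.execute(sql).fetchall())
        except sqlite3.Error:
            continue  # failing candidates never get a vote
        by_result.setdefault(key, []).append(sql)
    if not by_result:
        return None
    largest_cluster = max(by_result.values(), key=len)
    return largest_cluster[0]

# Demo: two candidates agree on the result, one disagrees.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.0)])

candidates = [
    "SELECT COUNT(*) FROM orders",
    "SELECT COUNT(id) FROM orders",   # same result: joins the first cluster
    "SELECT SUM(total) FROM orders",  # different result: minority cluster
]
best = pick_self_consistent(candidates, conn)
```

The two counting queries produce the same result and form the majority cluster, so a representative of that cluster is returned; the lone disagreeing query is outvoted.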

It’s important to note that self-consistency is a standard, efficient method, but it is not the only way to select a query. More complex, agentic frameworks can achieve even higher accuracy. For example, our team’s own research on CHASE-SQL (our state-of-the-art ensembling methodology) demonstrates that using diverse candidate generators and a trained selection agent can significantly outperform consistency-based methods.

For this benchmark, we wanted to focus on the model’s core performance. Therefore, we used the more direct self-consistency method: we generated several queries, executed them, and selected a query from the group that produced the most common result. This approach allowed us to measure the model’s raw text-to-SQL ability, minimizing the influence of a more complex filtering or reranking system.

The BIRD Single-Model Track explicitly allows for self-consistency, which reflects the model’s own internal capabilities. The benchmark categorizes submissions based on the number of candidates used (‘Few’, ‘Many’, or ‘Scale’). We found our “sweet spot” in the “Few” (1-7 candidates) category.

This approach gave us the final, critical boost in execution accuracy that pushed our model to the top of the leaderboard. More importantly, it proves our core thesis: by investing in high-quality data and instruction tuning, you can build a single model that is powerful enough to be production-ready without requiring a heavy, high-latency inference framework.

A recipe for customizing Gemini for text-to-SQL

A combination of clean data, multi-task learning, and efficient self-consistency allowed us to take the powerful Gemini 2.5 Pro model and build a specialist that achieved the top-ranking score on the BIRD single-model benchmark.

Our fine-tuned model represents a much stronger baseline for text-to-SQL. However, it’s important to note that this score is not the upper bound of accuracy. Rather, it is the new, higher baseline we have established for the core model’s capability in a constrained setting. These results can be further amplified by either

  1. creating an ensemble, that is, integrating this specialist model into a broader system that employs preprocessing (like example retrieval) or agentic scaffolding (like our CHASE-SQL research), or

  2. optimizing model quality for your unique database by enhancing metadata and/or query examples (which is how our customers typically deploy production workloads).

Nevertheless, the insights from this research are actively informing how we build our next-generation AI-powered products for Google Data Cloud, and we’ll continue to deliver these enhancements in our data services. 

Explore advanced text-to-SQL capabilities today

We’re constantly working to infuse our products with these state-of-the-art capabilities, starting with bringing natural language queries to applications built on AlloyDB and BigQuery. For AI-enhanced retrieval, customers especially value AlloyDB and its AI functions. AlloyDB integrates AI capabilities directly into the database, allowing developers to run powerful AI models using standard SQL queries without moving data. It offers specialized operators such as AI.IF() for intelligent filtering, AI.RANK() for semantic reranking of search results, and AI.GENERATE() for in-database text generation and data transformation.

And if you want to write some SQL yourself, Gemini Code Assist can help. With a simple prompt, you can describe the query you want to create. Gemini will generate your code, and you can immediately test it by executing it against your database. We look forward to hearing about what you build with it!
