More organizations are using natural language to query data instead of writing manual SQL. But moving an AI agent from a prototype to a production-ready tool requires rigorous, repeatable testing.
Prism is an open-source evaluation tool for Conversational Analytics in the BigQuery UI and API, as well as the Looker API. It replaces unpredictable testing methods by letting you create custom sets of questions and answers to reliably measure your agent’s performance. You can inspect execution traces to see exactly how your agent behaves and get targeted suggestions to improve its accuracy.
To deploy confidently, teams must verify outputs and refine context against measurable benchmarks. Prism provides a standardized way to measure accuracy directly, so the same experts building the agents can validate their work and catch performance regressions as they iterate.
Understanding the Prism framework
To implement Prism effectively, it is important to understand the core architecture governing the evaluation process.
- The agent: The conversational analytics agent under test, together with its system instructions, data sources, and configurations.
- The test suite: A set of questions the agent should be able to answer accurately.
- Assertions: Automated checks that verify specific criteria, such as whether the generated SQL contains a GROUP BY clause or whether the returned data matches a known correct answer.
- Evaluation runs: During a run, the agent attempts to answer every question and Prism grades the quality of the answers, providing a clear pass/fail assessment of the agent's performance.
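To make these four pieces concrete, here is a minimal sketch of the evaluation loop they describe. This is not Prism's actual API; the names (`TestCase`, `evaluate`, `stub_agent`) and the shape of an assertion are hypothetical, chosen only to illustrate how a test suite of questions, SQL-fragment assertions, and expected answers combine into a pass/fail run.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    """One question in the test suite, with its assertions (hypothetical shape)."""
    question: str
    expected_answer: object
    required_sql_fragments: list = field(default_factory=list)

def evaluate(agent, suite):
    """Run every test case against the agent and return pass/fail per question."""
    results = {}
    for case in suite:
        sql, answer = agent(case.question)
        # Assertion 1: the generated SQL contains each required fragment.
        sql_ok = all(frag in sql.upper() for frag in case.required_sql_fragments)
        # Assertion 2: the returned data matches the known correct answer.
        answer_ok = answer == case.expected_answer
        results[case.question] = sql_ok and answer_ok
    return results

# A stub standing in for a real conversational analytics agent: it returns
# the SQL it "generated" and the data that SQL "produced".
def stub_agent(question):
    return ("SELECT region, SUM(sales) FROM orders GROUP BY region",
            {"EMEA": 120})

suite = [
    TestCase(
        question="Total sales by region?",
        expected_answer={"EMEA": 120},
        required_sql_fragments=["GROUP BY"],
    ),
]

print(evaluate(stub_agent, suite))
```

The key design point mirrored here is that grading is automated and criterion-based: each question either passes all of its assertions or fails, so a run over the whole suite yields a repeatable benchmark rather than an ad-hoc spot check.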