Takeaway 5: Scale everything else too
Ensure that non-LLM aspects of your system can scale too.
Unlike chatbots and code autocompletion, our use case doesn’t require immediate feedback from Gemini. We had the luxury of waiting a minute or so for a single sample, its validation feedback, and associated artifacts such as unit tests to finish generating. On top of that, Gemini kept costs low for our use case, even across the hundreds of samples we generated. So the LLM-related aspects of our system scaled well.
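Because single-sample latency didn’t matter, the LLM side scaled largely by running generations concurrently. The TypeScript sketch below shows one way to do that with a bounded worker pool; generateSample and the topic list are hypothetical stand-ins for illustration, not our actual pipeline.

```ts
// Minimal sketch of a bounded worker pool for sample generation.
// `generateSample` is a hypothetical stand-in for an LLM call that
// takes about a minute per sample; since latency is tolerable, we
// scale throughput by running many generations at once.

type Topic = string;

async function generateSample(topic: Topic): Promise<string> {
  // Placeholder: in practice this would call the model (e.g. via
  // Genkit) and return the sample plus artifacts like unit tests.
  return `// sample for ${topic}`;
}

async function generateAll(topics: Topic[], concurrency = 8): Promise<string[]> {
  const results: string[] = new Array(topics.length);
  let next = 0;

  // Each worker claims the next topic index until the queue is drained.
  const worker = async () => {
    while (true) {
      const i = next++;
      if (i >= topics.length) return;
      results[i] = await generateSample(topics[i]);
    }
  };

  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```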
However, the tail end of our process didn’t grant us the same luck. Getting a sample ready to publish took about 5-15 minutes, and that stage became our biggest bottleneck because it involves human review of each sample as well as end-to-end testing.
Takeaway 6: End-to-end test the final output
If the code doesn’t work, it’s not publishable.
So it was important for us to end-to-end test every sample we published. This is perhaps less of a takeaway and more of a philosophical consensus we arrived at as a team.
The code we publish is designed to be run by our users. If we can’t run it ourselves, in an environment similar to our users’, then we can’t deliver samples that are known to work. Since day 1, our handwritten samples have been end-to-end tested before being published, either manually or programmatically. Altering the way we produce our samples should not lower their quality.
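As a concrete illustration, an end-to-end check can be as simple as running a sample the way a user would and failing on a non-zero exit. This TypeScript sketch assumes each sample is a runnable Node script; the command, timeout, and layout are assumptions for illustration.

```ts
// Sketch: run a generated sample end-to-end and fail if it errors.
// Assumes each sample is a runnable Node script; the command and
// directory layout here are hypothetical.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

export async function endToEndTest(samplePath: string): Promise<void> {
  // Run the sample exactly as a user would; a non-zero exit code
  // (or a timeout) rejects the promise, marking the sample as
  // unpublishable.
  const { stdout } = await run("node", [samplePath], { timeout: 120_000 });
  console.log(`PASS ${samplePath}\n${stdout}`);
}
```

Because a failing sample rejects the promise, a harness like this slots naturally into CI alongside manual review.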
DORA describes this “trust but verify” approach as a sign of “mature AI adoption”. Our rigorous end-to-end testing isn’t just for us; it’s essential for maintaining the trust of developers who rely on our samples.
Takeaway 7: Do excellent engineering
Apply established engineering best practices, and everything else will follow.
Many of the takeaways we’ve presented above aren’t unique to using LLMs at scale to generate content and code. In fact, many of the most impactful engineering decisions we made while building these systems have nothing to do with LLMs at all and aren’t mentioned in the takeaways above.
Conclusion
In the journey of generating high-quality, educational code samples at scale, we learned that success hinges on more than just powerful LLMs. While Gemini and Genkit provided the necessary generative power, the true breakthrough came from building a specialized, end-to-end system. Our seven takeaways—from decomposing the problem and embracing determinism to end-to-end testing and scaling the entire pipeline—show how we successfully built a reliable, scalable generation system that combines LLMs with established engineering practices.