In this talk, we’ll explore how thoughtful evaluation is transforming the way we design, test, and refine prompts for large language models at Preply. From manual review to automated scoring systems, we’ll walk through the tools and frameworks we use to ensure prompt quality and reliability at scale. Learn how evaluators—both human and model-based—help us iterate faster, uncover failure modes, and drive continuous improvement in our development process.
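As a taste of the "model-based evaluator" idea mentioned above, here is a minimal sketch of an LLM-as-judge scorer that grades a prompt/response pair against a rubric. The judge model, rubric, and grading prompt are illustrative assumptions for this sketch, not the actual setup covered in the talk.

```python
# Minimal LLM-as-judge sketch (illustrative only; requires `pip install openai`
# and an OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()

# Hypothetical rubric and grading prompt; a real evaluator would use
# task-specific criteria and validate the judge's output.
JUDGE_PROMPT = """You are grading an LLM response against a rubric.
Rubric: the answer must be factually correct, concise, and on-topic.
Question: {question}
Answer: {answer}
Reply with a single integer score from 1 (poor) to 5 (excellent)."""


def score_response(question: str, answer: str) -> int:
    """Ask a judge model to rate one question/answer pair on a 1-5 scale."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        temperature=0,        # keep grading as deterministic as possible
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(question=question, answer=answer),
            }
        ],
    )
    # Sketch-level parsing; production code would handle non-integer replies.
    return int(completion.choices[0].message.content.strip())


if __name__ == "__main__":
    print(score_response("What is the capital of France?", "Paris."))
```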
Lightning talk ⚡️ Intermediate ⭐⭐ Track: AI, ML, Big Data, Python
LLM
GenAI
Python