Ensuring Reliable AI-Assistant Performance with Automated Evaluation
Abi, Beroe's AI-powered procurement assistant, empowers sourcing and procurement professionals to make informed business decisions using real-time market intelligence and AI-driven insights. As conversational AI became an integral part of procurement operations, Beroe recognized the importance of ensuring that the Abi chatbot consistently performed to a high standard, even as new updates and features were introduced.
To support this goal, we designed and implemented a comprehensive AI evaluation framework tailored to Abi’s unique requirements. The framework leverages a customized set of metrics and large language models (LLMs) as evaluators to assess Abi’s performance precisely across accuracy, contextual relevance, and response quality. Deployed on AWS, the solution automates the evaluation process end-to-end, enabling scalable, repeatable, and data-driven performance monitoring. This approach allows Beroe’s product team to validate Abi’s updates rapidly, maintain reliability across deployments, and continuously optimize the quality of user interactions.
Technologies: AWS, Bedrock, AI Agents, Docker, Python, Langfuse
Challenge
As Abi’s team at Beroe continuously enhanced the chatbot with new features, content sources, and integrations, they faced a critical challenge: how to ensure that each change genuinely improved performance without introducing new issues. Every update carried some level of risk. A change designed to refine response quality or expand coverage could inadvertently affect accuracy, tone, or contextual relevance elsewhere.
Previously, there was no standardized or automated method to measure whether an update had a positive or negative impact. The evaluation relied heavily on manual review, where team members would read through sample responses and judge their quality based on experience. This approach was time-consuming, inconsistent, and difficult to scale, making it nearly impossible to identify systemic issues or track improvements objectively over time.
Without a structured evaluation framework, the team struggled to validate enhancements, detect regressions early, and provide stakeholders with reliable, data-backed insights into Abi’s conversational performance and overall quality trends.
Solution
To solve these challenges, we developed a fully automated evaluation framework that integrates seamlessly into Abi's existing development and deployment workflow.
At the heart of the solution is an automated evaluation pipeline powered by a large language model (LLM) acting as an independent judge. The system evaluates Abi’s responses against a golden dataset: a curated set of procurement-specific prompts with validated reference answers. This ensures consistent, objective, and reliable performance assessment. Each response is scored across metrics such as accuracy, hallucination rate, and tone to capture a complete picture of response quality and factual soundness.
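To make the judging step concrete, the sketch below shows how an LLM-as-judge call against a single golden-dataset entry could look, assuming Amazon Bedrock's `converse` API via `boto3`. The model ID, judge prompt, metric names, and dataset fields (`prompt`, `reference_answer`) are illustrative assumptions, not Beroe's actual configuration.

```python
import json
import boto3

# Hypothetical golden-dataset entry: a procurement prompt plus a validated reference answer.
GOLDEN_EXAMPLE = {
    "prompt": "What are the key cost drivers for cold-chain logistics in Europe?",
    "reference_answer": "Fuel and energy prices, refrigerated fleet capacity, ...",
}

JUDGE_PROMPT = """You are an impartial evaluator of a procurement assistant.
Compare the assistant's answer with the reference answer and return JSON with
scores from 0 to 1 for: accuracy, hallucination (1 = no hallucination), tone.

Question: {question}
Reference answer: {reference}
Assistant answer: {candidate}

Return only JSON, e.g. {{"accuracy": 0.9, "hallucination": 1.0, "tone": 0.8}}"""


def judge_response(question: str, reference: str, candidate: str,
                   model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> dict:
    """Ask a Bedrock-hosted LLM to score one chatbot response against the golden answer."""
    bedrock = boto3.client("bedrock-runtime")
    response = bedrock.converse(
        modelId=model_id,  # assumed model; any Bedrock-hosted judge model could be used
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate)}],
        }],
        inferenceConfig={"temperature": 0.0},  # deterministic judging
    )
    # The judge is instructed to return only JSON, so the reply can be parsed directly.
    return json.loads(response["output"]["message"]["content"][0]["text"])
```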
The evaluation process was fully integrated into the CI/CD pipeline, automatically running evaluations before every deployment or model update. This ensured that each release was validated before it reached users, delivering immediate insight into whether performance had improved or regressed.
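A pre-deployment gate of this kind could be a small Python step in the pipeline that runs the judge over the whole golden dataset and fails the job on regression. In the sketch below, the thresholds and the `get_abi_answer` / `judge_response` helpers are assumptions carried over from the previous sketch, not the team's actual quality bar.

```python
from statistics import mean

# Illustrative thresholds; the real quality bar per metric is an assumption here.
THRESHOLDS = {"accuracy": 0.85, "hallucination": 0.95, "tone": 0.80}


def evaluation_gate(golden_dataset, get_abi_answer, judge_response) -> int:
    """Score every golden prompt; return 0 (pass) or 1 (fail) for use as a CI exit code."""
    per_metric = {metric: [] for metric in THRESHOLDS}
    for item in golden_dataset:
        candidate = get_abi_answer(item["prompt"])  # the chatbot build under test
        result = judge_response(item["prompt"], item["reference_answer"], candidate)
        for metric in THRESHOLDS:
            per_metric[metric].append(result[metric])

    averages = {metric: mean(values) for metric, values in per_metric.items()}
    print("Evaluation averages:", averages)

    failed = [m for m in THRESHOLDS if averages[m] < THRESHOLDS[m]]
    if failed:
        print("Regression detected in:", ", ".join(failed))
        return 1  # non-zero exit code fails the CI job and blocks the deployment
    return 0
```

In a pipeline step, the script would end with something like `raise SystemExit(evaluation_gate(...))` so that any regression stops the release before it ships.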
Finally, we integrated Langfuse to capture and track all evaluation results. This enables comparing performance objectively across releases, monitoring long-term trends, and making data-driven decisions that continuously improve the AI agent.
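As an illustration, logging each evaluated response and its judge scores to Langfuse could look like the sketch below, assuming a Langfuse v2-style Python SDK (`langfuse.trace` and `langfuse.score`); newer SDK versions expose a different interface, and the trace name and metadata fields are hypothetical.

```python
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment.
langfuse = Langfuse()


def log_evaluation(release: str, prompt: str, candidate: str, scores: dict) -> None:
    """Record one evaluated response and its judge scores against a release tag."""
    trace = langfuse.trace(
        name="abi-evaluation",          # hypothetical trace name
        input=prompt,
        output=candidate,
        metadata={"release": release},  # lets releases be compared side by side
    )
    for metric, value in scores.items():
        # Numeric scores attached to the trace appear in the Langfuse UI for trend analysis.
        langfuse.score(trace_id=trace.id, name=metric, value=value)
    langfuse.flush()  # ensure events are sent before the CI job exits
```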
Result
The AI evaluation framework provided Beroe’s team with a clear and reliable way to measure how the Abi chatbot performed after each update. Through automated assessments and benchmarking, the team could quickly determine whether new changes improved response quality or introduced regressions, making the development process more incremental, transparent, and efficient. By integrating AI evaluations directly into the workflow, Abi’s performance tracking became continuous and data-driven.