As LLMs become more prevalent, there is an increasing need to systematically evaluate their responses to ensure they generate reliable, accurate, and ethical content. LLM-as-a-Judge is a framework that uses a separately configured LLM as the evaluator, scoring responses against specific metrics for quality and relevance. The approach is gaining momentum because human evaluation is often slow and expensive.
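To make the pattern concrete, here is a minimal sketch of LLM-as-a-Judge: a judge model is given a rubric and asked to score a candidate response. The `call_llm` helper, the rubric wording, and the 1-5 scale are illustrative assumptions, not the workshop's exact prompts or tooling.

```python
# Minimal LLM-as-a-Judge sketch. `call_llm` is a hypothetical wrapper
# standing in for any chat-completion client (e.g., an Amazon Bedrock
# model invocation).

JUDGE_PROMPT = """You are an impartial evaluator.
Rate the RESPONSE to the QUESTION for factual accuracy on a 1-5 scale.
Return only the integer score.

QUESTION: {question}
RESPONSE: {response}
"""

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a chat-completion API; plug in your client."""
    raise NotImplementedError

def judge_accuracy(question: str, response: str) -> int:
    """Ask the judge LLM to score a candidate response for accuracy."""
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    return int(raw.strip())
```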
In this workshop, Aishwarya Reganti, an Applied Science Tech Lead for Generative AI at AWS, teaches you how to configure LLMs as evaluation judges, providing hands-on experience with commonly used metrics through the Amazon Bedrock platform, the fmeval library, and the Ragas library. We will also explore designs that make the evaluation process more robust and effective across a range of use cases, as in the sketch below.
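As a taste of the library-based workflow, the following sketch scores a tiny RAG example with Ragas. It assumes the ragas 0.1-style API (`evaluate` plus metric objects) and a configured judge model such as an OpenAI or Bedrock backend; field names and defaults may differ in other versions, and the sample data is invented for illustration.

```python
# Hedged sketch: evaluating one question/answer/context triple with Ragas.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["What is Amazon Bedrock?"],
    "answer": ["Amazon Bedrock is a managed service for foundation models."],
    "contexts": [["Amazon Bedrock is a fully managed service that offers "
                  "foundation models through a single API."]],
})

# Each metric internally prompts a judge LLM to grade the answer,
# e.g. faithfulness checks whether claims are supported by the contexts.
result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result)
```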
Presenter Bio
Aishwarya Naresh Reganti, Applied Science Tech Lead for Generative AI at AWS
Aishwarya works in the AWS Generative AI Innovation Center, helping customers pinpoint and evaluate valuable uses for generative AI. She has 8 years of machine learning experience and over 30 publications on AI. Aishwarya is also a guest lecturer at MIT and the University of Oxford, and a co-founder of the LevelUp Org tech community. Previously, she was an Applied Scientist on Amazon's Search team and the Alexa Teachable AI team.