🤔 As a generative AI practitioner, I spend a good chunk of time developing task-specific evaluation metrics for different domains and use cases. Microsoft's AgentEval seems like a promising tool to assist with this!

โ— Traditional evaluation methods focus on generic and end-to-end success metrics, which donโ€™t always capture the nuanced performance needed for complex or domain specific tasks. This creates a gap in understanding how well these applications meet user needs and developer requirements.

💡 AgentEval provides a structured approach to evaluating the utility of LLM-powered applications through three key agents (a rough code sketch follows the list below):

  • 🤖 CriticAgent: Proposes a list of evaluation criteria based on the task description and pairs of successful and failed solutions. Example: For math problems, criteria might include efficiency and clarity of the solution.

  • 🤖 QuantifierAgent: Quantifies how well a solution meets each criterion and returns a utility score. Example: For clarity in math problems, the quantification might range from "not clear" to "very clear."

  • 🤖 VerifierAgent: Ensures the quality and robustness of the assessment criteria, verifying that they are essential, informative, and have high discriminative power.
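
To make the workflow more concrete, here is a minimal sketch of how a CriticAgent/QuantifierAgent pair could be wired up with AutoGen's generic `AssistantAgent`. This is my own illustration rather than AutoGen's built-in AgentEval helpers: the prompts, model name, and JSON handling are assumptions.

```python
# Minimal sketch of an AgentEval-style flow using AutoGen's generic AssistantAgent.
# Assumptions (not the library's own AgentEval module): the prompts, the model
# name, and the expectation that the model replies with valid JSON strings.
import json

from autogen import AssistantAgent

llm_config = {"config_list": [{"model": "gpt-4"}]}  # assumed config; supply your own api_key

# CriticAgent: proposes criteria from a task description and example solutions.
critic = AssistantAgent(
    name="critic",
    system_message=(
        "Given a task description plus one successful and one failed solution, "
        "propose evaluation criteria as a JSON list of objects with keys "
        "'name', 'description', and 'accepted_values'."
    ),
    llm_config=llm_config,
)

# QuantifierAgent: scores a candidate solution against each criterion.
quantifier = AssistantAgent(
    name="quantifier",
    system_message=(
        "Given a task, a JSON list of criteria, and a candidate solution, return a "
        "JSON object mapping each criterion name to one of its accepted values."
    ),
    llm_config=llm_config,
)

task = "Solve grade-school math word problems and explain the reasoning."
criteria_raw = critic.generate_reply(
    messages=[{
        "role": "user",
        "content": f"Task: {task}\nSuccessful solution: <example>\nFailed solution: <example>",
    }]
)
criteria = json.loads(criteria_raw)  # assumes a plain-string JSON reply, e.g. efficiency, clarity

assessment = quantifier.generate_reply(
    messages=[{
        "role": "user",
        "content": json.dumps({"task": task, "criteria": criteria, "solution": "<candidate>"}),
    }]
)
print(assessment)  # per-criterion utility, e.g. {"clarity": "very clear", ...}
```

A VerifierAgent would slot in between the two steps: a third agent prompted to drop criteria that are redundant or not discriminative before quantification runs.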

It turns out AgentEval demonstrates robustness and effectiveness in two applications, math problem-solving and household tasks, and outperforms traditional methods by providing a comprehensive, multi-dimensional assessment.

I want to try this out soon; let me know if you've already used it and have some insights!

Reference: Microsoft's AgentEval


This post is licensed under CC BY 4.0 by the author.