'AI Judge' Can Learn New Tricks to Better Check Facts and Code

AI researchers and developers increasingly use large language models (LLMs) to evaluate the answers of other LLMs, a process called 'LLM-as-a-judge'. However, the quality of these evaluations often falls short on complex tasks such as long-form fact-checking, advanced programming, and mathematical problem solving.

Now, a new study by researchers from the University of Cambridge and Apple proposes a system that augments AI 'judges' with external validation tools to improve the quality of their judgments.

The system aims to overcome the inherent limitations of both human and AI annotation. Human annotators are constrained by time and fatigue, and they tend to be swayed by writing style rather than actual accuracy. AI annotators, meanwhile, struggle with complex tasks such as those mentioned above.

The researchers' new 'Evaluation Agent' framework is autonomous: it assesses each response to determine whether external tools are needed and then selects the appropriate one. Each evaluation goes through three main steps: an initial assessment of the domain, tool use, and a final decision. Three tools are available:

  1. The fact-checking tool uses web search to verify atomic facts in answers.
  2. The code execution tool leverages OpenAI's code interpreter to run the code and check its correctness.
  3. The math checker is a specialized version of the code execution tool, used to validate mathematical and arithmetic operations.

If no useful tool is available to make a judgment, the system falls back to the baseline LLM annotator, which avoids unnecessary processing and the risk of performance degradation on simple tasks.
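The three steps and the fallback described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the researchers' implementation: the names `assess_domain`, `math_check`, `evaluate`, and `Judgment` are hypothetical, the domain assessment is a toy keyword heuristic standing in for an LLM call, and the final decision is reduced to a simple pass/fail on the tool's result.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Judgment:
    verdict: str              # "pass" or "fail" in this toy example
    tool_used: Optional[str]  # None means the baseline judge decided alone


def assess_domain(response: str) -> Optional[str]:
    """Step 1 (assumption): a keyword heuristic standing in for an LLM's domain assessment."""
    if "def " in response or "print(" in response:
        return "code_execution"
    if any(op in response for op in (" + ", " * ", " = ")):
        return "math_check"
    if len(response.split()) > 30:
        return "fact_check"
    return None


def math_check(response: str) -> bool:
    """Hypothetical tool: verify every 'a + b = c' or 'a * b = c' claim in the text."""
    claims = re.findall(r"(\d+) ([+*]) (\d+) = (\d+)", response)
    return all(
        (int(a) + int(b) if op == "+" else int(a) * int(b)) == int(c)
        for a, op, b, c in claims
    )


def evaluate(
    response: str,
    tools: Dict[str, Callable[[str], bool]],
    baseline_judge: Callable[[str], str],
) -> Judgment:
    tool_name = assess_domain(response)                    # step 1: domain assessment
    tool = tools.get(tool_name) if tool_name else None
    if tool is None:
        # No useful tool: fall back to the baseline LLM annotator.
        return Judgment(baseline_judge(response), None)
    check_passed = tool(response)                          # step 2: tool use
    verdict = "pass" if check_passed else "fail"           # step 3: final decision
    return Judgment(verdict, tool_name)


if __name__ == "__main__":
    tools = {"math_check": math_check}
    baseline = lambda response: "pass"  # stand-in for a plain LLM judge
    print(evaluate("The sum is 2 + 2 = 5", tools, baseline))
    print(evaluate("Hello there", tools, baseline))
```

In this sketch a response with no matching tool, or with a matching domain whose tool is not registered, goes straight to the baseline judge, which mirrors the fallback behavior the article describes.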

The system delivered significant improvements in long-form fact-checking, with marked increases in agreement with ground-truth annotations across multiple baselines. On programming tasks, the agent-based approach significantly improved performance across all baselines. On difficult math tasks, agents improved performance over some but not all baselines, and overall agreement remained relatively low at around 56%. Notably, the researchers found that for long-form fact-checking, the agent's agreement with ground truth was higher than that of human annotators.
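Agreement with ground truth, the measure behind these results, can be read as the share of items on which a judge's label matches the reference label. A minimal sketch, assuming simple categorical labels; the function name `agreement_rate` and the example labels are illustrative and not taken from the study:

```python
from typing import Sequence


def agreement_rate(predicted: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Fraction of items where the judge's label equals the ground-truth label."""
    if len(predicted) != len(ground_truth):
        raise ValueError("label sequences must have the same length")
    if not predicted:
        raise ValueError("at least one label is required")
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return matches / len(predicted)


if __name__ == "__main__":
    # Made-up labels for illustration only.
    judge_labels = ["A", "B", "A", "A"]
    reference_labels = ["A", "B", "B", "A"]
    print(agreement_rate(judge_labels, reference_labels))  # 0.75
```

On this reading, an agreement of around 56% means the judge matched the reference label on slightly more than half of the items.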

The framework is extensible, so other tools can be integrated in the future to further improve LLM assessment systems. The framework's source code will be open-sourced on Apple's GitHub, but it has not yet been officially released.
