'AI Judges' Can Learn New Tricks to Better Check Facts and Code
AI researchers and developers are increasingly using large language models (LLMs) to evaluate answers produced by other LLMs, a process known as 'LLM-as-a-judge'. However, the quality of these evaluations often falls short on complex tasks such as long-form fact-checking, advanced programming, and mathematical problem solving.
Now, a study by researchers from the University of Cambridge and Apple proposes a system that augments AI 'judges' with external validation tools to improve the quality of their judgments.
The system aims to overcome the inherent limitations of both human and AI annotation. Human annotators face time constraints and fatigue, and are often swayed by writing style rather than factual accuracy, while AI judges struggle with complex tasks like those mentioned above.
The researchers' 'Evaluation Agent' framework is agentic: it examines each response to decide whether external tools are needed and, if so, which ones to use. Every evaluation passes through three main steps: an initial assessment of the response's domain, invocation of the appropriate tool, and a final judgment.
- Fact-checking tools use web searches to verify atomic facts in answers.
- The code execution tool leverages OpenAI's code interpreter to run and check the correctness of the code.
- The math checker is a specialized version of the code-execution tool, used to validate mathematical operations and calculations.
If no useful tool is available for a given judgment, the system falls back to the baseline LLM annotator, avoiding unnecessary processing and the risk of degrading performance on simple tasks.
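To make the flow above concrete, here is a minimal, hypothetical sketch of the three-step loop in Python: assess the domain, call a matching tool, and return a final judgment, falling back to plain LLM annotation when no tool applies. All function names, the `Judgment` type, and the stubbed tool bodies are illustrative assumptions, not the study's actual implementation or tool integrations.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Judgment:
    verdict: str    # e.g. "correct" / "incorrect"
    rationale: str  # short explanation of the decision


def assess_domain(question: str, answer: str) -> str:
    """Step 1: classify the response (placeholder heuristic)."""
    if "```" in answer or "def " in answer:
        return "code"
    if any(ch in answer for ch in "=+-*/") and any(c.isdigit() for c in answer):
        return "math"
    return "factual"


def fact_check(answer: str) -> Judgment:
    """Verify atomic facts via web search (stubbed here)."""
    return Judgment("correct", "All extracted facts matched search results.")


def run_code(answer: str) -> Judgment:
    """Execute the candidate code and inspect its output (stubbed here)."""
    return Judgment("incorrect", "Code raised an exception on the test input.")


def check_math(answer: str) -> Judgment:
    """Validate the calculation by re-computing it (stubbed here)."""
    return Judgment("correct", "Recomputed result matches the stated answer.")


def baseline_llm_annotation(question: str, answer: str) -> Judgment:
    """Fallback: plain LLM-as-a-judge without tools (stubbed here)."""
    return Judgment("correct", "Judged directly by the base LLM.")


TOOLS: dict[str, Callable[[str], Judgment]] = {
    "factual": fact_check,
    "code": run_code,
    "math": check_math,
}


def evaluate(question: str, answer: str) -> Judgment:
    domain = assess_domain(question, answer)           # step 1: initial assessment
    tool: Optional[Callable[[str], Judgment]] = TOOLS.get(domain)
    if tool is None:                                   # no useful tool: fall back
        return baseline_llm_annotation(question, answer)
    return tool(answer)                                # steps 2-3: tool use + final decision


if __name__ == "__main__":
    print(evaluate("What is 2 + 2?", "2 + 2 = 4"))
```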
The system delivered significant improvements in long-form fact-checking, with marked increases in agreement with ground-truth annotations across several baselines. On programming tasks, the agent-based approach significantly improved performance across all baselines. On math tasks, agents improved over some baselines but not all, and overall agreement remained relatively low, at around 56%. Notably, the researchers found that for long-form factual responses, the agent's agreement with ground truth was higher than that of human annotators.
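The agreement figures quoted here are, in essence, the fraction of evaluated items on which a judge's verdict matches the ground-truth label. A minimal sketch of that calculation (illustrative only, not the study's evaluation code):

```python
def agreement_rate(judge_labels, ground_truth):
    """Fraction of items where the judge's verdict matches the ground-truth label."""
    matches = sum(j == g for j, g in zip(judge_labels, ground_truth))
    return matches / len(ground_truth)


print(agreement_rate(["correct", "incorrect", "correct", "correct"],
                     ["correct", "incorrect", "incorrect", "correct"]))  # 0.75
```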
The framework is extensible, so additional tools can be integrated in the future to further improve LLM-based evaluation. The researchers plan to open source the framework on Apple's GitHub, but the code has not yet been officially released.