Comparing Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro
To put these claims to the test, the author ran this detailed comparison. Like the previous comparison between Claude 3 Opus, GPT-4 and Gemini 1.5 Pro, it evaluates reasoning ability, multimodal reasoning, code generation, and more. Let's go through each test in detail below.
1. Find the drying time
Although this may seem like a basic question, it is a reasoning trap that LLMs frequently fall into: towels dry in parallel under the sun, so the time does not scale with their number. Claude 3.5 Sonnet made exactly this mistake and approached the question purely mathematically, answering that 20 towels would take 1 hour and 20 minutes, which is incorrect. ChatGPT 4o and Gemini 1.5 Pro both got it right: 20 towels still take 1 hour to dry.
If it takes 1 hour to dry 15 towels under the Sun, how long will it take to dry 20 towels?
Winners: ChatGPT 4o and Gemini 1.5 Pro
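The trap can be made concrete in a few lines of Python. This is an illustrative sketch (the function names are my own), contrasting the proportional reasoning Claude used with the correct parallel-drying answer:

```python
def naive_drying_time(towels, base_towels=15, base_minutes=60):
    """The proportional (wrong) reasoning: assume time scales with towel count."""
    return base_minutes * towels / base_towels

def actual_drying_time(towels, base_minutes=60):
    """Towels dry in parallel under the sun, so (space permitting)
    the time is independent of how many towels there are."""
    return base_minutes

print(naive_drying_time(20))   # 80.0 minutes -> "1 hour 20 minutes", Claude's answer
print(actual_drying_time(20))  # 60 minutes   -> "1 hour", the correct answer
```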
2. Assess weight
Next, on this classic reasoning question, all three models, Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro, answered correctly: a kilogram of feathers, or of any material, will always weigh more than a pound of steel, or of anything else, because a kilogram is about 2.2 pounds.
What's heaviest, a kilo of feathers or a pound of steel?
Winners: Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro
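The arithmetic behind the riddle is a single unit conversion. A minimal sketch (function name my own) that compares a mass in kilograms against one in pounds:

```python
KG_TO_LB = 2.20462  # one kilogram is about 2.2 pounds

def heavier(kg_amount, lb_amount):
    """Compare a mass given in kilograms against one given in pounds."""
    return "kilogram side" if kg_amount * KG_TO_LB > lb_amount else "pound side"

# 1 kg of feathers vs 1 lb of steel: the kilogram wins, whatever the material
print(heavier(1, 1))  # kilogram side
```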
3. Word puzzles
In the next reasoning test, Claude 3.5 Sonnet correctly answered that David has no brothers: each sister's one brother is David himself, so he is the only male among the siblings. ChatGPT 4o and Gemini 1.5 Pro also answered correctly.
David has three sisters. Each of them has one brother. How many brothers does David have?
Winners: Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro
4. Stack items
Next, the author asked all three models to stack these objects so that the arrangement is stable. Unfortunately, all three got it wrong, and they took an identical approach: first place the laptop, then the book, then the bottle, and finally balance the 9 eggs on top of the bottle, which is impossible. Notably, the older GPT-4 model had answered this correctly.
Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.
Winner: None
5. Follow the instructions
In its blog post, Anthropic mentioned that Claude 3.5 Sonnet is excellent at following instructions, and that claim holds up: it generated all 10 sentences ending with the word "AI". ChatGPT 4o also scored a perfect 10/10. Gemini 1.5 Pro, however, produced only 5 correct sentences out of 10; Google still needs to train its model to follow instructions more reliably.
Generate 10 sentences that end with the word "AI"
Winners: Claude 3.5 Sonnet and ChatGPT 4o
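Grading this test is mechanical, which is part of why it is a good instruction-following probe. A small checker sketch (function name and sample sentences are my own, not the models' actual outputs):

```python
def ends_with_ai(sentence: str) -> bool:
    """True if the sentence's final word (trailing punctuation stripped) is 'AI'."""
    words = sentence.strip().rstrip(".!?").split()
    return bool(words) and words[-1] == "AI"

sample = [
    "The future of software is being reshaped by AI",
    "Many companies now build products around AI.",
    "AI will change how we write code",  # starts with AI, does not end with it
]
score = sum(ends_with_ai(s) for s in sample)
print(f"{score}/{len(sample)} sentences end with 'AI'")  # 2/3 sentences end with 'AI'
```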
6. Find details
Anthropic was one of the first companies to offer long context windows, starting at 100K tokens and now up to 200K. For this test, the author provided a long text of about 25K characters (roughly 6K tokens) and planted a specific detail somewhere in the middle of it.
When asked about that detail, only Claude 3.5 Sonnet found the answer; ChatGPT 4o and Gemini 1.5 Pro did not. For handling large documents, then, Claude 3.5 Sonnet is the better model.
Winner: Claude 3.5 Sonnet
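This kind of "needle in a haystack" test is easy to reproduce. A hypothetical sketch of how such a document could be built (the function name, filler sentence, and planted detail are all invented for illustration; the article does not say what text the author used):

```python
def build_haystack(needle: str, filler: str, target_chars: int = 25_000) -> str:
    """Repeat filler text up to ~target_chars and splice the needle into the
    middle, mimicking the ~25K-character test described above."""
    chunk = filler + " "
    text = chunk * (target_chars // len(chunk) + 1)
    mid = len(text) // 2
    return text[:mid] + " " + needle + " " + text[mid:]

filler = "The quick brown fox jumps over the lazy dog."
doc = build_haystack("The vault code is 4417.", filler)
print(len(doc))                          # roughly 25K characters
print("The vault code is 4417." in doc)  # True: the detail is buried mid-text
```

The model is then given `doc` and asked for the vault code; a correct answer shows it can actually attend to the middle of a long context.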
7. Test vision
To test visual abilities, the author uploaded images of hard-to-read writing to see how well the models could detect and extract the characters. To our surprise, all three models did a great job and identified the text accurately. When it comes to OCR, all three models are quite capable.
Winners: Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro
8. Create games
In this test, the author uploaded an image of the classic game Tetris without naming it and simply asked each model to create a game like it in Python. All three models identified the game correctly, but only the code generated by Sonnet ran successfully; both ChatGPT 4o and Gemini 1.5 Pro failed to produce error-free code.
Sonnet's code ran successfully on the first attempt. Many programmers use ChatGPT 4o as a coding assistant, but Anthropic's model may well become their new favorite.
Claude 3.5 Sonnet scored 92% on HumanEval, a benchmark for evaluating programming ability, while GPT-4o scored 90.2% and Gemini 1.5 Pro 84.1%. For programming, there is clearly a new state-of-the-art model, and it is Claude 3.5 Sonnet.
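For context, HumanEval problems give the model a Python function signature plus a docstring and grade the completed body by running unit tests. The task below is an illustrative example in that style, written for this article; it is not an actual HumanEval problem:

```python
# A HumanEval-style task: the model receives the signature and docstring
# and must write the body; grading executes hidden unit tests against it.
def running_max(numbers):
    """Return a list where element i is the maximum of numbers[:i+1]."""
    result, best = [], float("-inf")
    for n in numbers:
        best = max(best, n)
        result.append(best)
    return result

# The grader would run assertions like these on the generated body:
assert running_max([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
assert running_max([]) == []
```

A solution counts as passing only if every test succeeds, so the benchmark rewards code that runs correctly, not code that merely looks plausible.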
Winner: Claude 3.5 Sonnet
After running all these tests on the three models, Claude 3.5 Sonnet proves to be at least as good as ChatGPT 4o, if not better; in programming especially, Anthropic's new model is genuinely impressive. It's worth noting that this Sonnet model isn't even Anthropic's largest yet.
The company says Claude 3.5 Opus will launch later this year and perform even better. Google's Gemini 1.5 Pro also did better than in previous tests, a sign of significant improvement. Overall, OpenAI is clearly not the only company doing well in the LLM field, and Anthropic's Claude 3.5 Sonnet is proof of that.