Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro

Anthropic released its latest Sonnet Claude 3.5 model recently and claims that it beats ChatGPT 4o and Gemini 1.5 Pro on many benchmarks.

 So, to test this claim, this detailed comparison was done. Like the previous comparison between Claude 3 Opus, GPT-4 and Gemini 1.5 Pro, the comparison evaluated reasoning ability, multimodal reasoning, code generation, etc. Let's find out in detail below. Please!

1. Find the drying time

Although this may seem like a basic question, start the test with this difficult reasoning question. LLMs tend to make frequent mistakes. Claude 3.5 Sonnet makes the same mistake and approaches the question mathematically. The model said it would take 1 hour and 20 minutes to dry 20 towels, which is incorrect. ChatGPT 4o and Gemini 1.5 Pro got the answer right when they said it would still take 1 hour to dry 20 towels.

If it takes 1 hour to dry 15 towels under the Sun, how long will it take to dry 20 towels?

Roughly translated: If you dry 15 towels in the sun for 1 hour, how long will it take to dry 20 towels?

Winning options : ChatGPT 4o and Gemini 1.5 Pro

Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 1Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 1

2. Assess weight

Next, in this classic reasoning question, it's nice that all 3 models including Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro all have the correct answer. A kilogram of feathers or whatever will always weigh more than a pound of steel or other materials.

What's heaviest, a kilo of feathers or a pound of steel?

Which is heavier, a pound of feathers or a pound of steel?

Winning options : Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro

Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 2Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 2

3. Word puzzles

In the next reasoning test, Claude 3.5 Sonnet correctly answers that David has no brothers and that he is the only male among the siblings. ChatGPT 4o and Gemini 1.5 Pro have the right answer.

David has three sisters. Each of them has one brother. How many brothers does David have?

=> David has three sisters. Each of them has a younger brother. How many brothers does David have?

Winning options : Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro

Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 3Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 3

4. Sort items

Then, the author of the article asked all three models to arrange these objects so that they were stable. Unfortunately all three are wrong. The models take an identical approach: First place the laptop, then the book, the bottle, finally the 9 eggs at the bottom of the bottle, which is impossible. The older GPT-4 model had the correct answer.

Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.

Here we have a book, 9 eggs, a laptop, a bottle and a nail. Tell me how to stack them so they don't fall over.

Winning options : None

Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 4Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 4

5. Follow the instructions

In its blog post, Anthropic mentioned that Claude 3.5 Sonnet is excellent at following instructions, and that seems about right. It generates all 10 sentences ending with the word 'AI'. ChatGPT 4o also gets it right 10/10. However, Gemini 1.5 Pro could only produce 5 correct sentences out of 10. Google must drive the model for better guidance.

Generate 10 sentences that end with the word "AI"

Roughly translated: Create 10 sentences ending with the word "AI"

Winning options : Claude 3.5 Sonnet and ChatGPT 4o

Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 5Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 5

6. Find details

Anthropic was one of the first companies to offer large context lengths, starting from 100K tokens up to today's context window of 200K. So for this test, the author provided a large text with 25K characters and about 6K tokens. The author added a detail somewhere in the middle of the text.

The author asked all three models for details, but only Claude 3.5 Sonnet found the answer, while ChatGPT 4o and Gemini 1.5 Pro did not. So for handling large documents, Claude 3.5 Sonnet is the better model.

Winning option : Claude 3.5 Sonnet

Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 6Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 6

7. Check your eyesight

To test visual abilities, the author uploaded images of difficult-to-read writing to see how well the models could detect the characters and extract them. To our surprise, all three models did a great job and accurately identified the texts. Regarding OCR, all three models are quite capable.

Winning options : Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro

Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 7Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 7

8. Create games

In this test, the author uploaded an image of the classic Tetris game without revealing the name and only asked the model to create a game like this in Python. All three models guessed the game correctly, but only the code created by Sonnet ran successfully. Both ChatGPT 4o and Gemini 1.5 Pro fail to produce error-free code.

Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 8Compare Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro Picture 8

In just one go, the game ran successfully using Sonnet's code. Many programmers use ChatGPT 4o to assist with coding, but it looks like Anthropic's model may become the new favorite among programmers.

Claude 3.5 Sonnet has achieved 92% of the HumanEval benchmark to evaluate programming ability. In this benchmark, GPT-4o reached 90.2% and Gemini 1.5 Pro at 84.1%. Obviously, for programming, there is a new SOTA model and that is the Claude 3.5 Sonnet.

Winning option : Claude 3.5 Sonnet

After running various tests on all three models, the Claude 3.5 Sonnet is as good as the ChatGPT 4o model, if not better. Especially in the field of programming, Anthropic's new model is really impressive. It's worth noting that the latest Sonnet model isn't even Anthropic's largest yet.

The company says the Claude 3.5 Opus will launch later this year and perform even better. Google's Gemini 1.5 Pro also performed better than previous tests, which means it has improved significantly. Overall, it can be said that OpenAI is not the only AI doing well in the LLM field. Anthropic's Claude 3.5 Sonnet is a testament to that.

5 ★ | 1 Vote