Google Introduces New LLM Development Method: Faster, Stronger, and Cheaper

Google Research introduces speculative cascades to help large language models (LLMs) run faster, more cost-effectively, and maintain superior quality.

Since OpenAI launched GPT-3 in 2022 – the platform behind ChatGPT – large language models (LLMs) have taken the world by storm. They are widely used in many fields, from coding to search. However, the process of generating feedback (called inference ) is quite slow and computationally expensive. As more and more people use LLMs, speeding up, reducing costs while ensuring quality becomes a matter of survival for developers.

There are currently two methods that were hoped to solve this problem: cascades and speculative decoding .

Cascades : use a small, faster model to process first, then switch to a larger model if needed. This saves on computation costs but has the disadvantage of having to 'wait' for the small model to decide. If it is uncertain, the response time is still prolonged and the quality of the answer is subject to fluctuations.
Speculative decoding : a small model acts as a 'draft', predicting tokens in parallel. The larger model then quickly verifies the results. This method prioritizes speed but is quite strict: if just 1 token is wrong, the entire draft is discarded, even if most of the answers are correct. This sometimes causes the speed advantage to be lost and the computational cost does not decrease as expected.

Clearly, both approaches have their limitations. So Google Research has developed a new hybrid approach called speculative cascades. The core of this is a flexible delay rule that can decide to accept the results of the small model or switch to the larger model depending on the situation. This avoids the 'waiting' bottleneck of cascades and escapes the 'reject all drafts' stricture of speculative decoding.

In other words, even if the small model gives an answer that does not match the large model, the system can still accept it if it is a reasonable answer.

Images 1 of Google Introduces New LLM Development Method: Faster, Stronger, and Cheaper

Google Research tested this method on models like Gemma and T5 for a variety of language tasks: text summarization, inference, and coding. The results showed that speculative cascades outperformed traditional methods in terms of cost-to-performance and speed. In many cases, they produced correct answers faster than speculative decoding.

Right now, this is just lab research. But if successful and implemented in the real world, users will have the opportunity to experience LLM that is faster, more powerful, and significantly cheaper.

Google Introduces New LLM Development Method: Faster, Stronger, and Cheaper

GPT-4 exploits vulnerabilities faster and cheaper than humans

SSD is getting cheaper and cheaper, HDD will be the technology of the old period?

5 smart living habits that help you do what you want

Detailed instructions for using the ChatGPT Images 2.0 image creation tool.

Google claims the new occlusion feature makes Chrome on Windows 25.8% faster

Google: From tiny little boy to tech village giant at age 20

How to work with Google Sheets faster

7 harsh facts of life help you be stronger

Is Google DNS or Cloudflare DNS faster?

Method in HTTP

Opera Dragonfly - Comprehensive web development tool

How to remove payment method from Google Play

DNS Goolge - How to switch Google DNS 8.8.8.8 8.8.4.4 to get into the network faster

Google invites a programmer to buy Glass

Google Chrome is now 23% faster, have you tried it?

Google Photos introduces a new messaging feature that allows sharing photos extremely quickly