Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Kavli Affiliate: Felix Fischer

| First 5 Authors: Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai

| Summary:

In this report, we introduce the Gemini 1.5 family of models, representing
the next generation of highly compute-efficient multimodal models capable of
recalling and reasoning over fine-grained information from millions of tokens
of context, including multiple long documents and hours of video and audio. The
family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds
the February version on the great majority of capabilities and benchmarks; (2)
Gemini 1.5 Flash, a more lightweight variant designed for efficiency with
minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on
long-context retrieval tasks across modalities, improve the state-of-the-art in
long-document QA, long-video QA and long-context ASR, and match or surpass
Gemini 1.0 Ultra’s state-of-the-art performance across a broad set of
benchmarks. Studying the limits of Gemini 1.5’s long-context ability, we find
continued improvement in next-token prediction and near-perfect retrieval
(>99%) up to at least 10M tokens, a generational leap over existing models such
as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world
use cases, such as Gemini 1.5 collaborating with professionals on completing
their tasks, achieving 26 to 75% time savings across 10 different job
categories, as well as surprising new capabilities of large language models at
the frontier; when given a grammar manual for Kalamang, a language with fewer
than 200 speakers worldwide, the model learns to translate English to Kalamang
at a similar level to a person who learned from the same content.

| Search Query: ArXiv Query: search_query=au:"Felix Fischer"&id_list=&start=0&max_results=3
