Gemini 2.0 Flash
Google · December 2024
● activeCloseddecoder onlymultimodalAPI Available
Why It Matters
First production model to combine real-time multimodal I/O with autonomous agent capabilities, pointing toward AI systems that can see, hear, speak, and act.
Description
A natively multimodal model that can process and generate text, images, audio, and video in real-time. The first Gemini model with built-in tool use (the ability to call external APIs and services) and agentic capabilities (the ability to autonomously plan and execute multi-step tasks). Also includes steerable text-to-speech — voice generation where you can control the tone and style.
Key Innovations
Multimodal
MultimodalProcessing multiple types of input (text, images, audio, video) in a single model.
Tool Use
Tool UseAbility to call external tools, APIs, and functions — enabling web browsing, code execution, and real-world actions.
Agentic
AgenticModels that can autonomously plan, execute multi-step tasks, use tools, and self-correct without human intervention.