How Gemini Embedding 2 AI Model Combines Multiple Media in a Single Space
In Focus
- Google’s new embedding model converts images, text, audio, and video into a unified data space
- Embedding 2 represents a shift in AI model architecture
- The new AI model can support up to 8,192 input tokens for text
Google has released its first multimodal embedding AI model. According to Gadgets360, Google’s Gemini Embedding 2 AI model can convert images, text, audio, and video into a single, shared data space.
By doing so, the model allows developers to search, retrieve, and classify information across different types of media. Currently, the new model is available in public preview.
A Shift in Embedding Model Architecture
The release of the Gemini Embedding 2 multimodal AI model represents a shift in the architecture of embedding models: instead of the traditional modality-specific approach, where each media type gets its own encoder and vector space, the model maps every input into a single unified multimodal latent space.
According to Google, this approach simplifies the way large language models (LLMs) understand information and improves tasks such as Retrieval-Augmented Generation (RAG), semantic search, sentiment analysis, and data clustering.
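In practice, a unified space means a single similarity ranking can span every media type at once. The short Python sketch below illustrates the idea with made-up vectors; the file names and the 768-dimension size are purely hypothetical stand-ins for what the model would actually return.

```python
import numpy as np

# Illustrative sketch only: the vectors below are random stand-ins for real
# model output. In a unified space, text, image, and audio embeddings are
# directly comparable, so one ranking can cover all media types.
rng = np.random.default_rng(0)
corpus = {
    "report.pdf":  rng.normal(size=768),   # hypothetical document embedding
    "chart.png":   rng.normal(size=768),   # hypothetical image embedding
    "podcast.mp3": rng.normal(size=768),   # hypothetical audio embedding
}
query_vec = rng.normal(size=768)           # e.g., embedding of a text query

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank every item, regardless of modality, against the single text query.
ranked = sorted(corpus, key=lambda k: cosine(query_vec, corpus[k]), reverse=True)
print(ranked)
```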
Google released the Embedding 2 AI model days after it introduced the Gemini 3.1 Flash-Lite model, which is designed to handle high-volume developer workloads. The tech giant said Flash-Lite is the fastest and most cost-efficient AI model in the Gemini 3 series.
How the Gemini Embedding 2 AI Model Works
Traditionally, AI systems store text, photos, videos, and audio files in separate digital “cabinets.” When a user requests specific information, the system searches the relevant cabinet to find it. As a result, LLMs often do not treat different formats of the same information the same way, and the retrieval method varies from one format to another.
The Gemini Embedding 2 AI model fixes this problem through a new architecture that stores all types of information in a single system. This allows it to process documents containing both text and images at the same time, much the way humans interpret information.
“Beyond processing one modality at a time, this model natively understands interleaved input so you can pass multiple modalities of input (e.g., image + text) in a single request. This allows the model to capture the complex, nuanced relationships between different media types, unlocking a more accurate understanding of complex, real-world data,” Google noted in a blog post.
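In code, an interleaved request could look like the hypothetical REST sketch below. The endpoint, the model ID, and the use of an `inline_data` image part inside an `embedContent` call are assumptions modeled on the existing Gemini `generateContent` Part schema, not a confirmed preview API.

```python
import base64
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
# Hypothetical endpoint and model ID for the preview, shown for illustration.
URL = ("https://generativelanguage.googleapis.com/v1beta/"
       "models/gemini-embedding-2:embedContent")

with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# One request carrying two modalities: an image Part followed by a text Part.
# Whether embedContent accepts inline_data parts in the preview is an
# assumption borrowed from the generateContent Part format.
payload = {
    "content": {
        "parts": [
            {"inline_data": {"mime_type": "image/png", "data": image_b64}},
            {"text": "Quarterly revenue chart with analyst commentary"},
        ]
    }
}

resp = requests.post(URL, params={"key": API_KEY}, json=payload)
vector = resp.json()["embedding"]["values"]  # one vector for the mixed input
```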
What Are the Capabilities of the Embedding 2 AI Model?
According to Google, the new model can capture logical intent in more than 100 languages. It supports a maximum of 8,192 input tokens for text and can process up to six images per request in PNG and JPEG formats.
Additionally, the model can support up to 120 seconds of video in MP4 and MOV formats, can embed audio without requiring transcription, and can process PDFs of up to six pages directly.
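Developers could enforce these documented limits with a small pre-flight check before sending a request. The sketch below hard-codes the figures from this section; the token count is a rough stand-in, since real token counting depends on the model’s tokenizer.

```python
# Limits as reported for the public preview: 8,192 text tokens, six PNG/JPEG
# images per request, 120 seconds of MP4/MOV video, and PDFs of up to six pages.
MAX_TEXT_TOKENS = 8192
MAX_IMAGES = 6
ALLOWED_IMAGE_TYPES = {"image/png", "image/jpeg"}
MAX_VIDEO_SECONDS = 120
ALLOWED_VIDEO_TYPES = {"video/mp4", "video/quicktime"}
MAX_PDF_PAGES = 6

def validate_request(text_tokens=0, image_types=(), video_seconds=0,
                     video_type=None, pdf_pages=0):
    """Raise ValueError if any input exceeds the documented preview limits."""
    if text_tokens > MAX_TEXT_TOKENS:
        raise ValueError(f"text exceeds {MAX_TEXT_TOKENS} tokens")
    if len(image_types) > MAX_IMAGES:
        raise ValueError(f"more than {MAX_IMAGES} images in one request")
    if any(t not in ALLOWED_IMAGE_TYPES for t in image_types):
        raise ValueError("images must be PNG or JPEG")
    if video_seconds > MAX_VIDEO_SECONDS:
        raise ValueError(f"video longer than {MAX_VIDEO_SECONDS} seconds")
    if video_type and video_type not in ALLOWED_VIDEO_TYPES:
        raise ValueError("video must be MP4 or MOV")
    if pdf_pages > MAX_PDF_PAGES:
        raise ValueError(f"PDF longer than {MAX_PDF_PAGES} pages")

# Example: a request with 4,000 text tokens and two PNG images passes.
validate_request(text_tokens=4000, image_types=["image/png"] * 2)
```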
“Gemini Embedding 2 doesn’t just improve on legacy models. It establishes a new performance standard for multimodal depth, introducing strong speech capabilities and outperforming leading models in text, image, and video tasks,” Google added.
Gemini Embedding 2 also understands mixed inputs, allowing users to send different data formats, such as text and images, in a single request. The model is currently accessible via the Gemini API and Vertex AI.
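Getting a first embedding out of the preview could be as simple as the sketch below, which uses Google’s `google-genai` Python SDK; the model ID `gemini-embedding-2` is a guess at the preview name, not a confirmed identifier.

```python
from google import genai

client = genai.Client()  # reads the API key from the environment

# "gemini-embedding-2" is a hypothetical model ID for the public preview.
result = client.models.embed_content(
    model="gemini-embedding-2",
    contents="Find the slide deck that explains Q3 revenue trends",
)
print(len(result.embeddings[0].values))  # dimensionality of the returned vector
```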
