The Gemini Embedding model achieves state-of-the-art performance across many key dimensions, including code, multilingual, and retrieval tasks.
Gemini Embedding is a state-of-the-art model that leverages the Gemini architecture to produce highly generalizable embeddings for text and code across numerous languages, designed for tasks like retrieval, classification, and clustering.
An Introduction to the Gemini Embedding Model
Gemini Embedding is a state-of-the-art embedding model designed to leverage the capabilities of Google’s Gemini large language model. It produces highly generalizable, dense vector representations for text spanning over 100 languages and various textual modalities, including code. These embeddings can be precomputed and applied to a wide range of downstream tasks such as classification, semantic similarity, clustering, ranking, and information retrieval.
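To make the "precompute and reuse" workflow concrete, the sketch below ranks a set of precomputed document embeddings against a query embedding by cosine similarity, which is how such embeddings are typically applied to retrieval and semantic-similarity tasks. The function name and the use of NumPy are illustrative assumptions, not part of the model itself.

```python
import numpy as np

def cosine_retrieve(query_emb: np.ndarray, doc_embs: np.ndarray, top_k: int = 5):
    """Rank precomputed document embeddings against a single query embedding.

    query_emb: shape (d,); doc_embs: shape (num_docs, d).
    Returns the indices and scores of the top_k most similar documents.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q                        # (num_docs,)
    top = np.argsort(-scores)[:top_k]     # indices of the best matches
    return top, scores[top]
```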
Model Architecture
The model’s architecture is designed to create holistic representations of inputs. The process begins by initializing the embedding model from a pre-existing Gemini model, which allows it to build upon the vast knowledge already contained within Gemini’s parameters.
The technical process involves three main steps:
- An input text sequence is processed by a transformer with bidirectional attention, which produces a sequence of token-level embeddings.
- A mean pooling strategy is then applied. This involves averaging the token embeddings along the sequence axis to generate a single, fixed-size embedding that represents the entire input.
- Finally, a randomly initialized linear projection layer maps this pooled embedding to the desired final output dimension.
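The following is a minimal sketch of the pooling and projection steps described above, assuming the token-level embeddings have already been produced by the bidirectional transformer (which is not shown). The class name, use of PyTorch, and masking details are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Mean pooling over token embeddings followed by a linear projection."""

    def __init__(self, hidden_dim: int, output_dim: int):
        super().__init__()
        # Randomly initialized linear projection to the final embedding size.
        self.proj = nn.Linear(hidden_dim, output_dim)

    def forward(self, token_embeddings: torch.Tensor, attention_mask: torch.Tensor):
        # token_embeddings: (batch, seq_len, hidden_dim)
        # attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
        mask = attention_mask.unsqueeze(-1).float()
        # Mean pooling along the sequence axis, ignoring padding tokens.
        pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.proj(pooled)  # (batch, output_dim)
```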
Training
The Gemini Embedding model was refined using a training objective based on a noise-contrastive estimation (NCE) loss function with in-batch negatives.
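The source does not spell out the full training recipe, but an NCE-style objective with in-batch negatives typically treats each query's paired document as its positive and the other documents in the batch as negatives. The sketch below is one common formulation of such a loss; the function name, normalization, and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def in_batch_nce_loss(query_embs: torch.Tensor,
                      doc_embs: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss with in-batch negatives (illustrative sketch).

    Row i of doc_embs is the positive for row i of query_embs;
    all other rows in the batch serve as negatives.
    """
    q = F.normalize(query_embs, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    # Similarity matrix: entry (i, j) scores query i against document j.
    logits = q @ d.t() / temperature                      # (batch, batch)
    targets = torch.arange(q.size(0), device=q.device)
    # Cross-entropy pulls each query toward its own document and pushes it
    # away from the other documents in the batch.
    return F.cross_entropy(logits, targets)
```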
Performance and Capabilities
When evaluated on the Massive Multilingual Text Embedding Benchmark (MMTEB), which includes over one hundred tasks across more than 250 languages, Gemini Embedding substantially outperforms previous state-of-the-art models, topping the public leaderboard with a mean score of 68.32, a significant margin over the next-best model.
The model demonstrates exceptional performance not only in high-resource languages like English but also in numerous low-resource languages, such as Macedonian. It has also set new records on specific benchmarks like XOR-Retrieve for cross-lingual retrieval. This unified model shows strong capabilities across a broad selection of tasks, surpassing even specialized, domain-specific models in English, multilingual, and code benchmarks.