Gemma is Google’s family of free and open small language models (SLMs). They’re built from the same technology as the Gemini family of large language models (LLMs) and are considered “lightweight” versions of Gemini.
Because they’re leaner than Gemini models, Gemma models can be deployed on laptops and mobile devices, and they’re also optimized for NVIDIA graphics processing units (GPUs) and Google Cloud tensor processing units (TPUs). Unlike Gemini, however, Gemma is neither multilingual nor multimodal.
These text-to-text artificial intelligence (AI) models derive their name from the Latin word gemma, which means “precious stone.” Gemma is a group of open models: Google provides free access to the model weights, and the models are freely available for individual and commercial use and redistribution.1
Gemma’s first-generation models were introduced in February 2024,1 while the second-generation models were announced in June 2024.2
Gemma’s collection of AI models includes Gemma and Gemma 2 at its core, plus several more specialized models that are optimized for specific tasks and built on modified architectures. Models in the Gemma line come in base (pretrained) variants and instruction-tuned variants.
Gemma is the first generation of the Gemma models. Gemma 2B is the smallest at 2 billion parameters, while Gemma 7B has 7 billion parameters. These models were trained mostly on English-language web documents, along with code and mathematics datasets.3
Gemma 2 is the second generation of the Gemma family. According to Google, Gemma 2 has better performance and is more efficient at AI inferencing (when a model generates a response to a user’s query) compared to its predecessor.2
Gemma 2 is available in 2-billion, 9-billion and 27-billion-parameter sizes. Its training data encompasses English-language web documents, code and science articles.4
CodeGemma is a text-to-code model fine-tuned for coding tasks. It supports multiple programming languages, including C++, C#, Go, Java, JavaScript, Kotlin, Python and Rust.5
CodeGemma has a 7B pretrained variant for code completion and code generation, a 7B instruction-tuned variant for natural language code chat and instruction following, and a 2B pretrained variant for fast code completion.5
DataGemma is composed of fine-tuned Gemma and Gemma 2 models that supplement their responses with data from Google’s Data Commons, a repository of public statistical data. DataGemma RIG models apply retrieval-interleaved generation to create natural language queries for getting data from Data Commons. Meanwhile, DataGemma RAG models employ retrieval-augmented generation for fetching data from Data Commons that can augment the models’ prompts.6
PaliGemma is a vision-language model that accepts both images and text as input and produces text as output. As such, it’s ideal for answering questions about images, detecting objects within images, generating image captions and reading text embedded in images. Its underlying architecture consists of a vision transformer image encoder and a transformer text decoder initialized from Gemma 2B.7
PaliGemma has a general-purpose set of pretrained models and a research-oriented set of models fine-tuned on certain research datasets. Google notes that most PaliGemma models require fine-tuning, and outputs must be tested before deployment to users.8
RecurrentGemma uses a recurrent neural network architecture developed by Google researchers. This design makes it faster at inferencing, particularly when generating long sequences, and it requires less memory than Gemma. It comes in 2B and 9B pretrained and instruction-tuned variants.9
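To see why recurrence saves memory, consider the toy recurrent update below, written in PyTorch. It is only a sketch of the general idea, not RecurrentGemma’s actual, more sophisticated architecture: the model carries a fixed-size state from token to token instead of caching keys and values for every previous token the way a transformer does.

```python
# A toy recurrent update illustrating why recurrent models need less
# inference memory than transformers. Generic sketch only; not
# RecurrentGemma's actual architecture.
import torch

d = 16  # illustrative state and embedding width
w_h = torch.randn(d, d) * 0.1  # state-to-state weights
w_x = torch.randn(d, d) * 0.1  # input-to-state weights
state = torch.zeros(d)  # fixed-size memory, regardless of sequence length

for token_embedding in torch.randn(100, d):  # a stream of 100 tokens
    state = torch.tanh(state @ w_h + token_embedding @ w_x)

print(state.shape)  # torch.Size([16]): memory does not grow with the sequence
```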
CodeGemma and PaliGemma have their own specific use cases, but in general, people can use Gemma for natural language processing (NLP) and natural language understanding tasks such as text generation, question answering, summarization and conversational AI.
Gemma is based on a transformer model, a neural network architecture that originated from Google in 2017.10
Here’s a brief overview of how transformer models work:
Encoders transform input sequences into numerical representations called embeddings that capture the semantics and position of tokens in the input sequence.
A self-attention mechanism allows transformers to “focus their attention” on the most important tokens in the input sequence, regardless of their position.
Decoders use this self-attention mechanism and the encoders’ embeddings to generate the most statistically probable output sequence.
However, Gemma uses a variation of the transformer architecture known as the decoder-only transformer.11 In this model, input sequences are fed directly into the decoder, which still uses embeddings and attention mechanisms to generate the output sequence.
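As a rough illustration, here is a minimal PyTorch sketch of the causal self-attention step at the heart of a decoder-only transformer. The dimensions and random weights are illustrative, not Gemma’s actual configuration:

```python
# A minimal sketch of causal self-attention in a decoder-only
# transformer. Sizes and weights are illustrative only.
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) embeddings of the input tokens
    seq_len = x.size(0)
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Scaled dot-product scores between every pair of tokens
    scores = (q @ k.T) / (k.size(-1) ** 0.5)

    # Causal mask: each token attends only to itself and earlier
    # tokens, so the model can generate text left to right
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))

    return F.softmax(scores, dim=-1) @ v

d_model = 16
x = torch.randn(8, d_model)  # 8 tokens, 16-dimensional embeddings
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(causal_self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([8, 16])
```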
Gemma’s first-generation models improve upon transformers through a few architectural elements:
Each layer of the neural network applies rotary positional embeddings instead of absolute positional embeddings. The token embeddings are also shared between the model’s input and output layers to compress the model.3
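The simplified sketch below shows the core idea of rotary positional embeddings: pairs of vector dimensions are rotated by an angle that grows with the token’s position, so position is encoded directly into the query and key vectors. The sizes are illustrative, and details of Gemma’s implementation are omitted:

```python
# A simplified rotary positional embedding (RoPE). Each pair of
# dimensions is rotated by an angle proportional to the token's
# position. Illustrative only; Gemma's implementation differs in detail.
import torch

def rotary_embed(x, base=10000.0):
    # x: (seq_len, d) with d even
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * freqs  # (seq_len, d/2): one angle per dimension pair

    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = angles.cos(), angles.sin()
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin  # 2D rotation of each pair
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

q = torch.randn(8, 16)        # 8 query vectors of width 16
print(rotary_embed(q).shape)  # torch.Size([8, 16]), position now encoded
```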
Gemma 7B employs multihead attention, with multiple “attention heads” having their own keys and values to capture different types of relationships between tokens. In contrast, Gemma 2B employs multiquery attention, where all attention heads share a single set of keys and values, thereby enhancing speed and lessening the memory load.11
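The shape-level sketch below contrasts the two schemes: multihead attention projects a separate set of keys (and values) for every head, while multiquery attention projects one shared set, which shrinks the projections and the key-value cache at inference time. Head counts and sizes here are made up for clarity:

```python
# Comparing key projections in multihead attention (MHA) versus
# multiquery attention (MQA). Sizes are illustrative, not Gemma's.
import torch

d_model, n_heads, head_dim, seq_len = 64, 8, 8, 10
x = torch.randn(seq_len, d_model)

# MHA: every attention head gets its own keys (and values)
k_mha = torch.randn(d_model, n_heads * head_dim)
print((x @ k_mha).view(seq_len, n_heads, head_dim).shape)  # (10, 8, 8)

# MQA: one shared set of keys (and values) for all heads, cutting
# the projection size and the inference-time KV cache by 8x here
k_mqa = torch.randn(d_model, head_dim)
print((x @ k_mqa).view(seq_len, 1, head_dim).shape)  # (10, 1, 8)
```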
Gemma 2 uses deeper neural networks than Gemma. Here are some other notable architectural differences:4
Gemma 2 alternates between local sliding window attention and global attention in every other layer of its neural network. Local sliding window attention restricts each token to a fixed-size “window” of nearby tokens, allowing the model to concentrate on only a few words at a time. Global attention, meanwhile, attends to every token in the sequence.
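A small sketch makes the local mechanism concrete: combining a causal mask with a fixed lookback window produces an attention mask in which each token sees only itself and a few recent tokens. The window size here is illustrative, not Gemma 2’s actual setting:

```python
# Building a local sliding-window attention mask. The window size is
# illustrative; Gemma 2's actual configuration differs.
import torch

def sliding_window_mask(seq_len, window):
    pos = torch.arange(seq_len)
    causal = pos.unsqueeze(0) <= pos.unsqueeze(1)           # no future tokens
    local = (pos.unsqueeze(1) - pos.unsqueeze(0)) < window  # only nearby past
    return causal & local

# Each row shows which tokens that position may attend to; global
# attention would instead allow the entire lower triangle.
print(sliding_window_mask(6, window=3).int())
```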
Gemma 2 also employs grouped-query attention, a middle ground between multihead and multiquery attention: the query heads are divided into groups, and the heads within each group share a single set of keys and values.
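At the tensor level, grouped-query attention can be sketched as repeating each key/value head across its group of query heads, as below. Head counts are illustrative:

```python
# Grouped-query attention (GQA): query heads are split into groups,
# and each group shares one key/value head. Counts are illustrative.
import torch

n_q_heads, n_kv_heads, head_dim, seq_len = 8, 2, 8, 10
group_size = n_q_heads // n_kv_heads  # 4 query heads per KV head

q = torch.randn(n_q_heads, seq_len, head_dim)
k = torch.randn(n_kv_heads, seq_len, head_dim)  # only 2 KV heads stored

# Repeat each KV head across its group so every query head has keys
k_expanded = k.repeat_interleave(group_size, dim=0)  # (8, 10, 8)
scores = q @ k_expanded.transpose(-2, -1) / head_dim ** 0.5
print(scores.shape)  # (8, 10, 10): full query heads, smaller KV cache
```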
The Gemma 2 2B and 9B models also apply knowledge distillation, which entails “distilling” a larger model’s knowledge into a smaller one by training the smaller model to emulate the larger teacher model’s behavior and match its predictions.
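In its simplest form, the distillation objective trains the student to match the teacher’s predicted probability distribution over the vocabulary, as in the sketch below. The logits are random stand-ins, and Gemma 2’s actual training recipe involves more than this single loss:

```python
# A minimal knowledge-distillation objective: minimize the KL
# divergence between the student's and the teacher's predicted token
# distributions. Logits here are random stand-ins for real models.
import torch
import torch.nn.functional as F

vocab_size = 32
teacher_logits = torch.randn(4, vocab_size)  # larger model's predictions
student_logits = torch.randn(4, vocab_size, requires_grad=True)

loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),  # student's log-probabilities
    F.softmax(teacher_logits, dim=-1),      # teacher's probabilities
    reduction="batchmean",
)
loss.backward()  # gradients pull the student toward the teacher
```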
In terms of instruction tuning, which primes the model to better follow instructions, both Gemma and Gemma 2 apply supervised fine-tuning and reinforcement learning from human feedback (RLHF).4 Supervised fine-tuning uses labeled examples of instruction-oriented tasks to teach the model how to structure its responses. Meanwhile, RLHF uses a reward model to translate quality ratings from human evaluators into numerical reward signals, helping models learn which responses will garner positive feedback.
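One common way to turn those human preference ratings into a numerical training signal is a pairwise reward-model loss, sketched below. This is a generic RLHF formulation, not necessarily Gemma’s exact recipe:

```python
# A pairwise (Bradley-Terry style) reward-model loss: the model
# learns to score human-preferred responses higher than rejected
# ones. The scores below are stand-ins for a reward model's outputs.
import torch
import torch.nn.functional as F

r_chosen = torch.tensor([1.3, 0.2], requires_grad=True)    # preferred responses
r_rejected = torch.tensor([0.4, 0.9], requires_grad=True)  # rejected responses

loss = -F.logsigmoid(r_chosen - r_rejected).mean()  # maximize the score gap
loss.backward()
```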
Evaluations of Gemma 7B’s performance in LLM benchmarks spanning code generation, commonsense reasoning, language understanding, mathematical reasoning and question answering indicate that it is comparable to SLMs of a similar scale such as Llama 3 8B and Mistral 7B. Gemma 2 9B and 27B performed even better, surpassing both Llama 3 8B and Mistral 7B in most benchmarks.12
However, Llama 3.2 3B and Ministral 3B, the latest SLMs from Meta and Mistral, respectively, have surpassed Gemma 2 2B in various benchmarks.13 Microsoft’s Phi-3-mini, a 3.8-billion-parameter language model, also achieved higher performance than Gemma 7B.14
Gemma models can be accessed through these platforms:
Google AI Studio
Hugging Face (also integrated into Hugging Face Transformers)
Kaggle
Vertex AI Model Garden
Developers can also implement the models in open source machine learning frameworks such as JAX, LangChain, PyTorch and TensorFlow, and through application programming interfaces (APIs) such as Keras 3.0. In addition, because Gemma is optimized for NVIDIA GPUs, developers can use NVIDIA tools, including the NeMo framework to fine-tune models and TensorRT-LLM to optimize them for efficient inferencing on NVIDIA GPUs.
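For example, a minimal way to try an instruction-tuned Gemma model through the Hugging Face Transformers library looks like the sketch below. The model identifier shown reflects Hugging Face’s listing for the instruction-tuned Gemma 2 2B; access to Gemma weights requires accepting Google’s usage terms on Hugging Face first:

```python
# A minimal text-generation example with Hugging Face Transformers.
# Requires the transformers library and accepted access to the Gemma
# weights on Hugging Face; the model ID reflects the instruction-tuned
# Gemma 2 2B listing there.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-2-2b-it")
prompt = "Explain what a small language model is in one sentence."
print(generator(prompt, max_new_tokens=60)[0]["generated_text"])
```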
For enterprise AI development, Gemma models can be deployed on Google Cloud Vertex AI and Google Kubernetes Engine (GKE). For those with limited computational power, Google Colab provides free cloud-based access to computing resources like GPUs and TPUs.
Like other AI models, Google Gemma continues to grapple with the risks of AI, including:
Bias: Smaller models can inherit the bias present in the larger models and datasets they’re derived from, and this domino effect can show up in their results.
Hallucinations: SLMs like Gemma can generate plausible-sounding but false output, so verifying and monitoring their responses is essential to make sure what they produce is accurate and factually correct.
Privacy violations: Google notes that the training datasets for Gemma and Gemma 2 have been filtered to remove certain personal information and other sensitive data.4 However, individual users and enterprises must still be cautious with the data they use to fine-tune Gemma to avoid leaking any personal or proprietary data.
When it comes to safety and security, Google evaluated Gemma on several metrics, including offensive cybersecurity, CBRN (chemical, biological, radiological and nuclear) knowledge, self-proliferation (the ability to autonomously replicate) and persuasion. Gemma’s knowledge in CBRN domains is low. Similarly, the model has low capabilities in offensive cybersecurity, self-proliferation and persuasion.4
Google also released a Responsible Generative AI Toolkit to help AI researchers and developers build responsible and safe AI applications.1
All links reside outside ibm.com
1 Gemma: Introducing new state-of-the-art open models, Google, 21 February 2024
2 Gemma 2 is now available to researchers and developers, Google, 27 June 2024
3 Gemma: Open Models Based on Gemini Research and Technology, Google DeepMind, 21 February 2024
4 Gemma 2: Improving Open Language Models at a Practical Size, Google DeepMind, 27 June 2024
5 CodeGemma model card, Google AI for developers, 5 August 2024
6 Knowing When to Ask — Bridging Large Language Models and Data, arXiv, 10 September 2024
7 PaliGemma model card, Google AI for developers, 5 August 2024
8 PaliGemma, Google AI for developers, 5 August 2024
9 RecurrentGemma model card, Google AI for developers, 5 August 2024
10 Transformer: A Novel Neural Network Architecture for Language Understanding, Google Research, 31 August 2017
11 Gemma explained: An overview of Gemma model family architectures, Google for Developers, 15 August 2024
12 Gemma Open Models, Google AI for Developers, Accessed 5 November 2024
13 Un Ministral, des Ministraux, Mistral AI, 16 October 2024
14 Introducing Phi-3: Redefining what’s possible with SLMs, Microsoft, 23 April 2024