Gemma is Google’s family of free and open small language models (SLMs). They’re built from the same technology as the Gemini family of large language models (LLMs) and are considered “lightweight” versions of Gemini.
Because they’re leaner than Gemini models, Gemma models can be deployed on laptops and mobile devices, and they’re also optimized for NVIDIA graphics processing units (GPUs) and Google Cloud tensor processing units (TPUs). Unlike Gemini, however, Gemma is neither multilingual nor multimodal.
These text-to-text artificial intelligence (AI) models derive their name from the Latin word gemma, which means “precious stone.” Gemma is a group of open models: Google provides free access to the model weights, and the models are freely available for individual and commercial use and redistribution.1
Gemma’s first-generation models were introduced in February 2024,1 while the second-generation models were announced in June 2024.2
Gemma’s collection of AI models includes Gemma and Gemma 2 at its core, plus several more specialized models that are optimized for specific tasks and built on modified architectures. Models in the Gemma line come in base (pretrained) variants and instruction-tuned variants.
Gemma is the first generation of the Gemma models. Gemma 2B is the smallest at 2 billion parameters, while Gemma 7B has 7 billion parameters. These models were trained mostly on English-language web documents, along with code and mathematics datasets.3
Gemma 2 is the second generation of the Gemma family. According to Google, Gemma 2 has better performance and is more efficient at AI inferencing (when a model generates a response to a user’s query) compared to its predecessor.2
Gemma 2 is available in 2-billion, 9-billion and 27-billion-parameter sizes. Its training data encompasses English-language web documents, code and science articles.4
CodeGemma is a text-to-code model fine-tuned for coding tasks. It supports multiple programming languages, including C++, C#, Go, Java, JavaScript, Kotlin, Python and Rust.5
CodeGemma has a 7B pretrained variant for code completion and code generation, a 7B instruction-tuned variant for natural language code chat and instruction following, and a 2B pretrained variant for fast code completion.5
DataGemma is composed of fine-tuned Gemma and Gemma 2 models that supplement their responses with data from Google’s Data Commons, a repository of public statistical data. DataGemma RIG models apply retrieval-interleaved generation to create natural language queries for getting data from Data Commons. Meanwhile, DataGemma RAG models employ retrieval-augmented generation for fetching data from Data Commons that can augment the models’ prompts.6
PaliGemma is a vision-language model that accepts both images and text as input and produces text as output. As such, it’s ideal for answering questions about images, detecting objects within images, generating image captions and reading text embedded in images. Its underlying architecture consists of a vision transformer image encoder and a transformer text decoder initialized from Gemma 2B.7
PaliGemma has a general-purpose set of pretrained models and a research-oriented set of models fine-tuned on certain research datasets. Google notes that most PaliGemma models require fine-tuning, and outputs must be tested before deployment to users.8
RecurrentGemma uses a recurrent neural network architecture developed by Google researchers. This design makes it faster at inferencing, particularly when generating long sequences, and it requires less memory than Gemma. It comes in 2B and 9B pretrained and instruction-tuned variants.9
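To see why recurrence saves memory, consider the toy recurrent update below, written in PyTorch. It is only a sketch of the general idea, not RecurrentGemma’s actual, more sophisticated architecture: the model carries a fixed-size state from token to token instead of caching keys and values for every previous token the way a transformer does.

```python
# A toy recurrent update illustrating why recurrent models need less
# inference memory than transformers. Generic sketch only; not
# RecurrentGemma's actual architecture.
import torch

d = 16  # illustrative state and embedding width
w_h = torch.randn(d, d) * 0.1  # state-to-state weights
w_x = torch.randn(d, d) * 0.1  # input-to-state weights
state = torch.zeros(d)  # fixed-size memory, regardless of sequence length

for token_embedding in torch.randn(100, d):  # a stream of 100 tokens
    state = torch.tanh(state @ w_h + token_embedding @ w_x)

print(state.shape)  # torch.Size([16]): memory does not grow with the sequence
```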
CodeGemma and PaliGemma have their own specific use cases, but in general, people can use Gemma for natural language processing (NLP) and natural language understanding tasks such as text generation, question answering, summarization and conversational AI.
Gemma is based on a transformer model, a neural network architecture that originated from Google in 2017.10
Here’s a brief overview of how transformer models work:
Encoders transform input sequences into numerical representations called embeddings that capture the semantics and position of tokens in the input sequence.
A self-attention mechanism allows transformers to “focus their attention” on the most important tokens in the input sequence, regardless of their position.
Decoders use this self-attention mechanism and the encoders’ embeddings to generate the most statistically probable output sequence.
However, Gemma uses a variation of the transformer architecture known as the decoder-only transformer.11 In this model, input sequences are fed directly into the decoder, which still uses embeddings and attention mechanisms to generate the output sequence.
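As a rough illustration, here is a minimal PyTorch sketch of the causal self-attention step at the heart of a decoder-only transformer. The dimensions and random weights are illustrative, not Gemma’s actual configuration:

```python
# A minimal sketch of causal self-attention in a decoder-only
# transformer. Sizes and weights are illustrative only.
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model) embeddings of the input tokens
    seq_len = x.size(0)
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Scaled dot-product scores between every pair of tokens
    scores = (q @ k.T) / (k.size(-1) ** 0.5)

    # Causal mask: each token attends only to itself and earlier
    # tokens, so the model can generate text left to right
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))

    return F.softmax(scores, dim=-1) @ v

d_model = 16
x = torch.randn(8, d_model)  # 8 tokens, 16-dimensional embeddings
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(causal_self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([8, 16])
```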
Gemma’s first-generation models improve upon transformers through a few architectural elements:
Each layer of the neural network applies rotary positional embeddings instead of absolute positional embeddings. The token embeddings are also shared between the model’s input and output layers to compress the model.3
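The simplified sketch below shows the core idea of rotary positional embeddings: pairs of vector dimensions are rotated by an angle that grows with the token’s position, so position is encoded directly into the query and key vectors. The sizes are illustrative, and details of Gemma’s implementation are omitted:

```python
# A simplified rotary positional embedding (RoPE). Each pair of
# dimensions is rotated by an angle proportional to the token's
# position. Illustrative only; Gemma's implementation differs in detail.
import torch

def rotary_embed(x, base=10000.0):
    # x: (seq_len, d) with d even
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * freqs  # (seq_len, d/2): one angle per dimension pair

    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = angles.cos(), angles.sin()
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin  # 2D rotation of each pair
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

q = torch.randn(8, 16)        # 8 query vectors of width 16
print(rotary_embed(q).shape)  # torch.Size([8, 16]), position now encoded
```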
Gemma 7B employs multihead attention, with multiple “attention heads” having their own keys and values to capture different types of relationships between tokens. In contrast, Gemma 2B employs multiquery attention, where all attention heads share a single set of keys and values, thereby enhancing speed and lessening the memory load.11
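The shape-level sketch below contrasts the two schemes: multihead attention projects a separate set of keys (and values) for every head, while multiquery attention projects one shared set, which shrinks the projections and the key-value cache at inference time. Head counts and sizes here are made up for clarity:

```python
# Comparing key projections in multihead attention (MHA) versus
# multiquery attention (MQA). Sizes are illustrative, not Gemma's.
import torch

d_model, n_heads, head_dim, seq_len = 64, 8, 8, 10
x = torch.randn(seq_len, d_model)

# MHA: every attention head gets its own keys (and values)
k_mha = torch.randn(d_model, n_heads * head_dim)
print((x @ k_mha).view(seq_len, n_heads, head_dim).shape)  # (10, 8, 8)

# MQA: one shared set of keys (and values) for all heads, cutting
# the projection size and the inference-time KV cache by 8x here
k_mqa = torch.randn(d_model, head_dim)
print((x @ k_mqa).view(seq_len, 1, head_dim).shape)  # (10, 1, 8)
```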
Gemma 2 uses deeper neural networks than Gemma. Here are some other notable architectural differences:4
Gemma 2 alternates between local sliding window attention and global attention in every other layer of its neural network. Local sliding window attention restricts each token to a fixed-size “window” of nearby tokens, allowing the model to concentrate on only a few words at a time. Global attention, meanwhile, attends to every token in the sequence.
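A small sketch makes the local mechanism concrete: combining a causal mask with a fixed lookback window produces an attention mask in which each token sees only itself and a few recent tokens. The window size here is illustrative, not Gemma 2’s actual setting:

```python
# Building a local sliding-window attention mask. The window size is
# illustrative; Gemma 2's actual configuration differs.
import torch

def sliding_window_mask(seq_len, window):
    pos = torch.arange(seq_len)
    causal = pos.unsqueeze(0) <= pos.unsqueeze(1)           # no future tokens
    local = (pos.unsqueeze(1) - pos.unsqueeze(0)) < window  # only nearby past
    return causal & local

# Each row shows which tokens that position may attend to; global
# attention would instead allow the entire lower triangle.
print(sliding_window_mask(6, window=3).int())
```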
Gemma 2 also employs grouped-query attention, a middle ground between multihead and multiquery attention: the query heads are divided into groups, and the heads within each group share a single set of keys and values.
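At the tensor level, grouped-query attention can be sketched as repeating each key/value head across its group of query heads, as below. Head counts are illustrative:

```python
# Grouped-query attention (GQA): query heads are split into groups,
# and each group shares one key/value head. Counts are illustrative.
import torch

n_q_heads, n_kv_heads, head_dim, seq_len = 8, 2, 8, 10
group_size = n_q_heads // n_kv_heads  # 4 query heads per KV head

q = torch.randn(n_q_heads, seq_len, head_dim)
k = torch.randn(n_kv_heads, seq_len, head_dim)  # only 2 KV heads stored

# Repeat each KV head across its group so every query head has keys
k_expanded = k.repeat_interleave(group_size, dim=0)  # (8, 10, 8)
scores = q @ k_expanded.transpose(-2, -1) / head_dim ** 0.5
print(scores.shape)  # (8, 10, 10): full query heads, smaller KV cache
```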
The Gemma 2 2B and 9B models also apply knowledge distillation, which entails “distilling” a larger model’s knowledge into a smaller one by training the smaller model to emulate the larger teacher model’s behavior and match its predictions.
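In its simplest form, the distillation objective trains the student to match the teacher’s predicted probability distribution over the vocabulary, as in the sketch below. The logits are random stand-ins, and Gemma 2’s actual training recipe involves more than this single loss:

```python
# A minimal knowledge-distillation objective: minimize the KL
# divergence between the student's and the teacher's predicted token
# distributions. Logits here are random stand-ins for real models.
import torch
import torch.nn.functional as F

vocab_size = 32
teacher_logits = torch.randn(4, vocab_size)  # larger model's predictions
student_logits = torch.randn(4, vocab_size, requires_grad=True)

loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),  # student's log-probabilities
    F.softmax(teacher_logits, dim=-1),      # teacher's probabilities
    reduction="batchmean",
)
loss.backward()  # gradients pull the student toward the teacher
```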
In terms of instruction tuning, which primes the model to better follow instructions, both Gemma and Gemma 2 apply supervised fine-tuning and reinforcement learning from human feedback (RLHF).4 Supervised fine-tuning uses labeled examples of instruction-oriented tasks to teach the model how to structure its responses. Meanwhile, RLHF uses a reward model to translate quality ratings from human evaluators into numerical reward signals, helping models learn which responses will garner positive feedback.
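One common way to turn those human preference ratings into a numerical training signal is a pairwise reward-model loss, sketched below. This is a generic RLHF formulation, not necessarily Gemma’s exact recipe:

```python
# A pairwise (Bradley-Terry style) reward-model loss: the model
# learns to score human-preferred responses higher than rejected
# ones. The scores below are stand-ins for a reward model's outputs.
import torch
import torch.nn.functional as F

r_chosen = torch.tensor([1.3, 0.2], requires_grad=True)    # preferred responses
r_rejected = torch.tensor([0.4, 0.9], requires_grad=True)  # rejected responses

loss = -F.logsigmoid(r_chosen - r_rejected).mean()  # maximize the score gap
loss.backward()
```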
Evaluations of Gemma 7B’s performance in LLM benchmarks spanning code generation, commonsense reasoning, language understanding, mathematical reasoning and question answering indicate that it is comparable to SLMs of a similar scale such as Llama 3 8B and Mistral 7B. Gemma 2 9B and 27B performed even better, surpassing both Llama 3 8B and Mistral 7B in most benchmarks.12
However, Llama 3.2 3B and Ministral 3B, the latest SLMs from Meta and Mistral, respectively, have surpassed Gemma 2 2B in various benchmarks.13 Microsoft’s Phi-3-mini, a 3.8-billion-parameter language model, also achieved higher performance than Gemma 7B.14
Gemma models can be accessed through these platforms:
Google AI Studio
Hugging Face (also integrated into Hugging Face Transformers)
Kaggle
Vertex AI Model Garden
Developers can also implement the models in open source machine learning frameworks such as JAX, LangChain, PyTorch and TensorFlow, and through application programming interfaces (APIs) such as Keras 3.0. In addition, because Gemma is optimized for NVIDIA GPUs, developers can use NVIDIA tools, including the NeMo framework to fine-tune models and TensorRT-LLM to optimize them for efficient inferencing on NVIDIA GPUs.
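For example, a minimal way to try an instruction-tuned Gemma model through the Hugging Face Transformers library looks like the sketch below. The model identifier shown reflects Hugging Face’s listing for the instruction-tuned Gemma 2 2B; access to Gemma weights requires accepting Google’s usage terms on Hugging Face first:

```python
# A minimal text-generation example with Hugging Face Transformers.
# Requires the transformers library and accepted access to the Gemma
# weights on Hugging Face; the model ID reflects the instruction-tuned
# Gemma 2 2B listing there.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-2-2b-it")
prompt = "Explain what a small language model is in one sentence."
print(generator(prompt, max_new_tokens=60)[0]["generated_text"])
```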
For enterprise AI development, Gemma models can be deployed on Google Cloud Vertex AI and Google Kubernetes Engine (GKE). For those with limited computational power, Google Colab provides free cloud-based access to computing resources like GPUs and TPUs.
Like other AI models, Google Gemma continues to grapple with the risks of AI, including:
Bias: Smaller models can inherit the bias present in the larger models and datasets they’re derived from, and this domino effect can show up in their results.
Hallucinations: SLMs like Gemma can generate plausible-sounding but false output, so verifying and monitoring their responses is essential to make sure what they produce is accurate and factually correct.
Privacy violations: Google notes that the training datasets for Gemma and Gemma 2 have been filtered to remove certain personal information and other sensitive data.4 However, individual users and enterprises must still be cautious with the data they use to fine-tune Gemma to avoid leaking any personal or proprietary data.
When it comes to safety and security, Google evaluated Gemma on several metrics, including offensive cybersecurity, CBRN (chemical, biological, radiological and nuclear) knowledge, self-proliferation (the ability to autonomously replicate) and persuasion. Gemma’s knowledge in CBRN domains is low. Similarly, the model has low capabilities in offensive cybersecurity, self-proliferation and persuasion.4
Google also released a Responsible Generative AI Toolkit to help AI researchers and developers build responsible and safe AI applications.1
All links reside outside ibm.com
1 Gemma: Introducing new state-of-the-art open models, Google, 21 February 2024
2 Gemma 2 is now available to researchers and developers, Google, 27 June 2024
3 Gemma: Open Models Based on Gemini Research and Technology, Google DeepMind, 21 February 2024
4 Gemma 2: Improving Open Language Models at a Practical Size, Google DeepMind, 27 June 2024
5 CodeGemma model card, Google AI for developers, 5 August 2024
6 Knowing When to Ask — Bridging Large Language Models and Data, arXiv, 10 September 2024
7 PaliGemma model card, Google AI for developers, 5 August 2024
8 PaliGemma, Google AI for developers, 5 August 2024
9 RecurrentGemma model card, Google AI for developers, 5 August 2024
10 Transformer: A Novel Neural Network Architecture for Language Understanding, Google Research, 31 August 2017
11 Gemma explained: An overview of Gemma model family architectures, Google for Developers, 15 August 2024
12 Gemma Open Models, Google AI for Developers, Accessed 5 November 2024
13 Un Ministral, des Ministraux, Mistral AI, 16 October 2024
14 Introducing Phi-3: Redefining what’s possible with SLMs, Microsoft, 23 April 2024