Gemini is Google’s large language model (LLM). More broadly, it’s a family of multimodal AI models designed to process multiple modalities or types of data, including audio, images, software code, text and video.
Gemini is also the model that powers Google’s generative AI (gen AI) chatbot (formerly Bard) of the same name, much like Anthropic’s Claude is named for both the chatbot and the family of LLMs behind it. The Gemini apps on both the web and mobile act as a chatbot interface for the underlying models.
Google is gradually integrating the Gemini chatbot into its suite of technologies. For instance, Gemini is the default artificial intelligence (AI) assistant on the latest Google Pixel 9 and Pixel 9 Pro phones, replacing Google Assistant. In Google Workspace, Gemini is available on the Docs side panel to help write and edit content, and on the Gmail side panel to assist with drafting emails, suggesting responses and searching a user’s inbox for information.
Other Google apps are also incorporating Gemini. Google Maps, for example, is drawing on Gemini model capabilities to supply summaries of places and areas.
Gemini has been trained on a massive corpus of multilingual and multimodal data sets. It employs a transformer model, a neural network architecture that Google itself introduced in 2017.1
Here’s a brief overview of how transformer models work:
Encoders transform input sequences into numerical representations called embeddings that capture the semantics and position of tokens in the input sequence.
A self-attention mechanism enables transformers to “focus their attention” on the most important tokens in the input sequence, regardless of their position.
Decoders use this self-attention mechanism and the encoders’ embeddings to generate the most statistically probable output sequence.
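The self-attention step described above can be sketched in a few lines of plain Python. This is an illustrative toy only (a single attention head with no learned projection matrices), not Gemini's actual implementation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention: each token's output is a
    softmax-weighted mix of all value vectors, so every token can
    'attend' to any position in the sequence."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]  # similarity to every token
        weights = softmax(scores)                           # attention distribution
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs

# Toy 3-token sequence with 2-dimensional embeddings
q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(q, k, v)
print(out)  # each row is a weighted blend of the value vectors
```

In a real transformer, the queries, keys and values are produced by learned linear projections of the token embeddings, and many attention heads run in parallel.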
Unlike generative pre-trained transformer (GPT) models, which traditionally take only text-based prompts, or diffusion models used for image generation, which take both text and image prompts, Google Gemini supports interleaved sequences of audio, image, text and video as inputs and can produce interleaved text and image outputs.2
The Gemini family of multimodal AI models comes in multiple variants. Each variant is optimized for different devices and tasks.
Gemini’s first-generation model, 1.0, comes in Nano and Ultra. The next-generation model, 1.5, comes in Pro and Flash.
Developers can build on and experiment with Gemini’s AI features and functions through the Gemini API in the Google AI Studio and Google Cloud Vertex AI development platforms. For now, only Gemini 1.5 Pro and Gemini 1.5 Flash are available through the API.
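A typical Gemini API call is a JSON request against a `generateContent` endpoint. The sketch below assembles such a request body without sending it; the endpoint URL, model name and field names follow the pattern of Google's public REST API but should be verified against the current documentation before use:

```python
import json

# Model name and endpoint are assumptions; check Google's current API docs.
MODEL = "gemini-1.5-flash"
ENDPOINT = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL}:generateContent"

def build_request(prompt: str, temperature: float = 0.7) -> dict:
    """Assemble a generateContent-style request body (not sent here)."""
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generationConfig": {"temperature": temperature, "maxOutputTokens": 256},
    }

body = build_request("Summarize the transformer architecture in one sentence.")
print(ENDPOINT)
print(json.dumps(body, indent=2))
```

In practice, the request is sent with an API key obtained from Google AI Studio, or through the Vertex AI SDK when running on Google Cloud.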
Gemini 1.0 Nano is the smallest version of the 1.0 family, designed to operate on mobile devices even without a data network. It can perform on-device tasks such as describing images, suggesting replies to chat messages, summarizing text and transcribing speech.
Gemini Nano is available on Android devices starting with the Pixel 8 Pro. Moving beyond mobile, Google is also incorporating Gemini Nano into its Chrome desktop client.
Gemini 1.0 Ultra is the largest version of the 1.0 family with advanced analytical capabilities. It’s built for highly complex tasks such as coding, mathematical reasoning and multimodal reasoning. The context window—the number of tokens that a model can process at once—of both Gemini Nano and Gemini Ultra is 32,000 tokens.2
Gemini 1.5 Pro is a midsized multimodal model with a context window of up to 2 million tokens. This long context window enables Gemini Pro to process information on a larger scale: from hours of audio and video to thousands of lines of code or hundreds of pages of documents.3
In addition to a transformer architecture, Gemini 1.5 Pro applies a Mixture of Experts (MoE) architecture. MoE models are split into smaller “expert” neural networks, each specializing in a certain domain or data type. The model learns to selectively activate only the most relevant experts depending on the input type. This results in swifter performance while reducing computational costs.4
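The routing behavior described above can be illustrated with a toy top-k MoE layer in plain Python. The "experts" here are just random linear maps, and the gating weights are random too; the point is only to show that the gate scores every expert but runs just k of them per input:

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class MoELayer:
    """Toy Mixture of Experts: a gating function scores every expert
    for a given input, but only the top-k experts actually run."""
    def __init__(self, n_experts, dim, k=2):
        self.k = k
        # Each 'expert' is just a random linear map in this sketch.
        self.experts = [[[random.uniform(-1, 1) for _ in range(dim)]
                         for _ in range(dim)] for _ in range(n_experts)]
        self.gate = [[random.uniform(-1, 1) for _ in range(dim)]
                     for _ in range(n_experts)]

    def __call__(self, x):
        scores = [sum(w * xi for w, xi in zip(row, x)) for row in self.gate]
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[: self.k]
        weights = softmax([scores[i] for i in top])  # renormalize over chosen experts
        out = [0.0] * len(x)
        for w, i in zip(weights, top):  # only k experts do any computation
            expert_out = [sum(m * xi for m, xi in zip(row, x)) for row in self.experts[i]]
            out = [o + w * e for o, e in zip(out, expert_out)]
        return out, top

layer = MoELayer(n_experts=8, dim=4, k=2)
y, chosen = layer([0.5, -0.2, 0.1, 0.9])
print(chosen)  # indices of the 2 experts that were activated
```

Because only k of the n experts run per input, the model can hold far more parameters than it uses on any single forward pass, which is the source of the cost savings the text describes.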
Gemini 1.5 Flash is a lightweight version of Gemini Pro. It was trained using a machine learning (ML) technique called knowledge distillation, in which insights from the larger Gemini 1.5 Pro were transferred to the more compact Gemini 1.5 Flash. It features a long context window of up to 1 million tokens and lower latency, making it faster and more efficient.3
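The core idea of knowledge distillation can be shown with the standard distillation loss: the student model is trained to match the teacher's temperature-softened output distribution. This is a generic textbook sketch, not Google's actual training setup:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL divergence between the teacher's softened output distribution
    and the student's. Minimizing this trains the smaller student model
    to mimic the larger teacher's behavior."""
    p = softmax(teacher_logits, T)  # teacher's 'soft targets'
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]
student_bad = [0.1, 2.0, 1.0]
student_good = [1.9, 1.1, 0.2]
print(distillation_loss(teacher, student_bad))   # larger loss
print(distillation_loss(teacher, student_good))  # smaller loss: closer to teacher
```

The soft targets carry more information than hard labels alone (for example, which wrong answers the teacher considers nearly right), which is what lets a compact student approach the teacher's quality.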
Google has been a pioneer in LLM architecture and draws upon its robust research to develop its own AI models.
2017: Google researchers present the transformer architecture, which underpins many of today’s LLMs.
2020: The company introduces the Meena chatbot, a neural network-based conversational agent with 2.6 billion parameters.5
2021: Google unveils LaMDA (Language Model for Dialogue Applications), its conversational LLM.6
2022: PaLM (Pathways Language Model) is released, with more advanced capabilities compared to LaMDA.7
2023: Bard launches in the first quarter of the year, backed by a lightweight and optimized version of LaMDA.8 The second quarter sees PaLM 2 released—with enhanced coding, multilingual and reasoning skills—and adopted by Bard.9 Google announces Gemini 1.0 in the last quarter of the year.
2024: Google renames Bard as Gemini and upgrades its multimodal AI models to version 1.5.
The word “Gemini” means “twins” in Latin and is both a zodiac sign and a constellation. It was an apt name given that the Gemini model is the brainchild of Google DeepMind, a merging of forces between the teams at DeepMind and Google Brain. The company also took inspiration from NASA’s Project Gemini, the two-astronaut spaceflight program that paved the way for the success of the Apollo missions.10
Gemini Ultra surpasses similar models in various LLM benchmarks. It outperforms Claude 2, GPT-4 and Llama 2 in benchmarks such as GSM8K for mathematical reasoning, HumanEval for code generation and MMLU for natural language understanding.2
Notably, Gemini Ultra exceeded even human expert performance in MMLU. However, GPT-4 still performs better than Gemini Ultra in the HellaSwag benchmark for common sense reasoning and natural language inference.2
Google also evaluated Gemini Ultra’s multimodal capabilities. It scored higher than other models in document understanding, image understanding and automatic speech recognition benchmarks. And although it beat other LLMs in benchmarks for automatic speech translation, English video captioning, multimodal understanding and reasoning, and video question answering, Gemini Ultra’s performance in these areas leaves room for improvement.2
Meanwhile, the performance of both Gemini 1.5 Flash and Gemini 1.5 Pro is comparable to, or even surpasses, that of Gemini 1.0 Ultra.11 As its context window increases, Gemini 1.5 Pro maintains a high level of performance.4
Google Gemini is still in its early stages, but this highly capable AI model has the potential to be implemented in a wide array of applications:
Advanced coding
Image and text understanding
Language translation
Malware analysis
Personalized AI experts
Universal AI agents
Voice assistants
The Gemini AI model can work across programming languages such as C++, Java and Python to understand, explain and generate code. Google used fine-tuned versions of Gemini Pro as foundation models to develop AlphaCode 2, a code generation system that can solve competitive programming problems involving elements of theoretical computer science and complex math.
Gemini can be used to extract text from images and caption images. It can analyze visuals such as charts, diagrams and figures without the aid of optical character recognition (OCR) tools that convert images of text into a machine-readable format.
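An image-understanding request like the one described above is typically expressed as a multimodal prompt: a text part plus an inline image part in the same message. The sketch below builds such a request body without sending it; the field names follow the pattern of Google's generateContent REST API (`inline_data`, `mime_type`) but should be checked against the current documentation, and the image bytes here are a stand-in:

```python
import base64
import json

# Stand-in for real image data; a real call would read a PNG or JPEG file.
fake_image_bytes = b"\x89PNG...not-a-real-image"

body = {
    "contents": [{
        "role": "user",
        "parts": [
            {"text": "What does the chart in this image show?"},
            {"inline_data": {
                "mime_type": "image/png",
                # Binary image data is base64-encoded for transport in JSON
                "data": base64.b64encode(fake_image_bytes).decode("ascii"),
            }},
        ],
    }]
}
print(json.dumps(body)[:120])
```

Interleaving text and image parts in one request is what lets the model answer questions about a chart or diagram directly, without a separate OCR step.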
Because of their multilingual capabilities, Google’s AI models can be used to translate different languages. In the Meet video conferencing app, for instance, users can turn on translated captions to translate to and from specific languages.
Both Gemini 1.5 Pro and Gemini 1.5 Flash can be employed for malware analysis. Gemini Pro can accurately determine whether a file or code snippet is malicious and can generate a detailed report of its findings.12 Meanwhile, Gemini Flash can conduct rapid, large-scale malware dissection.13
Google recently released a new feature called Gems that allows users to customize the Gemini chatbot to create tailored AI “experts” on any task or topic. Some examples of premade Gems include a learning coach to help break down complex topics and make them easier to understand, a brainstorming partner to offer fresh ideas for a user’s next video, and a writing editor to provide feedback on grammar and structure.
Gems come with a Gemini Advanced subscription, which uses the Gemini 1.5 Pro model.
Through Project Astra, Google is building on its Gemini models to create a universal AI agent that can process, remember and understand multimodal information in real time. To improve recall and efficiency, Project Astra harnesses caching, continuous encoding of video frames and coupling speech and video input into a timeline of events.14
In one of Google’s demos, the Gemini AI assistant was able to explain the parts of a speaker, recognize the neighborhood a person was in and remember where they put their glasses.14
With Gemini Live, users can have a dialogue with the Gemini chatbot that feels more natural and conversational. It offers more intuitive responses and can adapt to a person’s conversational style.
Like other LLMs, Google Gemini continues to grapple with the risks of AI. Caution is recommended, especially for individuals intending to use Gemini and organizations considering the model for commercial use or integration into their workflows.
Bias: In February 2024, Google paused the Gemini chatbot’s ability to create images of people after it produced historically inaccurate depictions, an overcorrection stemming from efforts to counter racial bias in image generation.15
Hallucinations: As of this writing, Gemini-backed AI Overviews in Google Search still occasionally produce factually incorrect results.
Intellectual property violations: Regulators in France fined Google after finding that the company’s AI chatbot was trained on news stories and content without the knowledge or consent of the country’s publishers.16
All links reside outside ibm.com
1 Transformer: A Novel Neural Network Architecture for Language Understanding, Google Research, 31 August 2017.
2 Gemini: A Family of Highly Capable Multimodal Models, Google DeepMind, Accessed 16 September 2024.
3 Gemini Models, Google DeepMind, Accessed 16 September 2024.
4 Our next-generation model: Gemini 1.5, Google, 15 February 2024.
5 Towards a Conversational Agent that Can Chat About…Anything, Google Research, 28 January 2020.
6 LaMDA: our breakthrough conversation technology, Google, 18 May 2021.
7 Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance, Google Research, 4 April 2022.
8 Try Bard and share your feedback, Google, 21 March 2023.
9 Introducing PaLM 2, Google, 10 May 2023.
10 How Google’s AI model Gemini got its name, Google, 15 May 2024.
11 Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, Google DeepMind, Accessed 16 September 2024.
12 From Assistant to Analyst: The Power of Gemini 1.5 Pro for Malware Analysis, Google Cloud, 30 April 2024.
13 Scaling Up Malware Analysis with Gemini 1.5 Flash, Google Cloud, 16 July 2024.
14 Project Astra, Google DeepMind, Accessed 16 September 2024.
15 Google chief admits ‘biased’ AI tool’s photo diversity offended users, The Guardian, 28 February 2024.
16 Google fined €250m in France for breaching intellectual property deal, The Guardian, 20 March 2024.
Explore the IBM library of foundation models on the watsonx platform to scale generative AI for your business with confidence.
IBM Granite is a family of artificial intelligence (AI) models built for business to help drive trust and scalability in AI-driven applications. Open source and proprietary Granite models are available today.
IBM Consulting™ is working with global clients and partners to co-create what’s next in AI. Our diverse, global team of more than 20,000 AI experts can help you quickly and confidently design and scale cutting edge AI solutions and automation across your business.