Gemini is Google’s large language model (LLM). More broadly, it’s a family of multimodal AI models designed to process multiple modalities or types of data, including audio, images, software code, text and video.
Gemini is also the model that powers Google’s generative AI (gen AI) chatbot (formerly Bard) of the same name, much like Anthropic’s Claude is named for both the chatbot and the family of LLMs behind it. The Gemini apps on both the web and mobile act as a chatbot interface for the underlying models.
Google is gradually integrating the Gemini chatbot into its suite of technologies. For instance, Gemini is the default artificial intelligence (AI) assistant on the latest Google Pixel 9 and Pixel 9 Pro phones, replacing Google Assistant. In Google Workspace, Gemini is available on the Docs side panel to help write and edit content, and on the Gmail side panel to assist with drafting emails, suggesting responses and searching a user’s inbox for information.
Other Google apps are also incorporating Gemini. Google Maps, for example, is drawing on Gemini model capabilities to supply summaries of places and areas.
Gemini has been trained on a massive corpus of multilingual and multimodal data sets. It employs a transformer model, a neural network architecture that Google itself introduced in 2017.1
Here’s a brief overview of how transformer models work:
Encoders transform input sequences into numerical representations called embeddings that capture the semantics and position of tokens in the input sequence.
A self-attention mechanism enables transformers to “focus their attention” on the most important tokens in the input sequence, regardless of their position.
Decoders use this self-attention mechanism and the encoders’ embeddings to generate the most statistically probable output sequence.
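The self-attention step described above can be sketched in a few lines of plain Python. This is an illustrative toy only (a single attention head with no learned projection matrices), not Gemini's actual implementation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention: each token's output is a
    softmax-weighted mix of all value vectors, so every token can
    'attend' to any position in the sequence."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]  # similarity to every token
        weights = softmax(scores)                           # attention distribution
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs

# Toy 3-token sequence with 2-dimensional embeddings
q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(q, k, v)
print(out)  # each row is a weighted blend of the value vectors
```

In a real transformer, the queries, keys and values are produced by learned linear projections of the token embeddings, and many attention heads run in parallel.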
Unlike generative pre-trained transformer (GPT) models, which traditionally take only text-based prompts, or diffusion models used for image generation, which take both text and image prompts, Google Gemini supports interleaved sequences of audio, image, text and video as inputs and can produce interleaved text and image outputs.2
The Gemini family of multimodal AI models comes in multiple variants. Each variant is optimized for different devices and tasks.
Gemini’s first-generation model, 1.0, comes in Nano and Ultra. The next-generation model, 1.5, comes in Pro and Flash.
Developers can build on and experiment with Gemini’s AI features and functions through the Gemini API in the Google AI Studio and Google Cloud Vertex AI development platforms. For now, only Gemini 1.5 Pro and Gemini 1.5 Flash are available through the API.
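A typical Gemini API call is a JSON request against a `generateContent` endpoint. The sketch below assembles such a request body without sending it; the endpoint URL, model name and field names follow the pattern of Google's public REST API but should be verified against the current documentation before use:

```python
import json

# Model name and endpoint are assumptions; check Google's current API docs.
MODEL = "gemini-1.5-flash"
ENDPOINT = f"https://generativelanguage.googleapis.com/v1beta/models/{MODEL}:generateContent"

def build_request(prompt: str, temperature: float = 0.7) -> dict:
    """Assemble a generateContent-style request body (not sent here)."""
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generationConfig": {"temperature": temperature, "maxOutputTokens": 256},
    }

body = build_request("Summarize the transformer architecture in one sentence.")
print(ENDPOINT)
print(json.dumps(body, indent=2))
```

In practice, the request is sent with an API key obtained from Google AI Studio, or through the Vertex AI SDK when running on Google Cloud.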
Gemini 1.0 Nano is the smallest version of the 1.0 family, designed to operate on mobile devices even without a data network. It can perform on-device tasks such as describing images, suggesting replies to chat messages, summarizing text and transcribing speech.
Gemini Nano is available on Android devices starting with the Pixel 8 Pro. Moving beyond mobile, Google is also incorporating Gemini Nano into its Chrome desktop client.
Gemini 1.0 Ultra is the largest version of the 1.0 family with advanced analytical capabilities. It’s built for highly complex tasks such as coding, mathematical reasoning and multimodal reasoning. The context window—the number of tokens that a model can process at once—of both Gemini Nano and Gemini Ultra is 32,000 tokens.2
Gemini 1.5 Pro is a midsized multimodal model with a context window of up to 2 million tokens. This long context window enables Gemini Pro to process information on a larger scale: from hours of audio and video to thousands of lines of code or hundreds of pages of documents.3
In addition to a transformer architecture, Gemini 1.5 Pro applies a Mixture of Experts (MoE) architecture. MoE models are split into smaller “expert” neural networks, each specializing in a certain domain or data type. The model learns to selectively activate only the most relevant experts depending on the input type. This results in swifter performance while reducing computational costs.4
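The routing behavior described above can be illustrated with a toy top-k MoE layer in plain Python. The "experts" here are just random linear maps, and the gating weights are random too; the point is only to show that the gate scores every expert but runs just k of them per input:

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class MoELayer:
    """Toy Mixture of Experts: a gating function scores every expert
    for a given input, but only the top-k experts actually run."""
    def __init__(self, n_experts, dim, k=2):
        self.k = k
        # Each 'expert' is just a random linear map in this sketch.
        self.experts = [[[random.uniform(-1, 1) for _ in range(dim)]
                         for _ in range(dim)] for _ in range(n_experts)]
        self.gate = [[random.uniform(-1, 1) for _ in range(dim)]
                     for _ in range(n_experts)]

    def __call__(self, x):
        scores = [sum(w * xi for w, xi in zip(row, x)) for row in self.gate]
        top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[: self.k]
        weights = softmax([scores[i] for i in top])  # renormalize over chosen experts
        out = [0.0] * len(x)
        for w, i in zip(weights, top):  # only k experts do any computation
            expert_out = [sum(m * xi for m, xi in zip(row, x)) for row in self.experts[i]]
            out = [o + w * e for o, e in zip(out, expert_out)]
        return out, top

layer = MoELayer(n_experts=8, dim=4, k=2)
y, chosen = layer([0.5, -0.2, 0.1, 0.9])
print(chosen)  # indices of the 2 experts that were activated
```

Because only k of the n experts run per input, the model can hold far more parameters than it uses on any single forward pass, which is the source of the cost savings the text describes.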
Gemini 1.5 Flash is a lightweight version of Gemini Pro. It was trained using a machine learning (ML) technique called knowledge distillation, in which insights from the larger Gemini 1.5 Pro were transferred to the more compact Gemini 1.5 Flash. It features a long context window of up to 1 million tokens and lower latency, making it faster and more efficient.3
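The core idea of knowledge distillation can be shown with the standard distillation loss: the student model is trained to match the teacher's temperature-softened output distribution. This is a generic textbook sketch, not Google's actual training setup:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL divergence between the teacher's softened output distribution
    and the student's. Minimizing this trains the smaller student model
    to mimic the larger teacher's behavior."""
    p = softmax(teacher_logits, T)  # teacher's 'soft targets'
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]
student_bad = [0.1, 2.0, 1.0]
student_good = [1.9, 1.1, 0.2]
print(distillation_loss(teacher, student_bad))   # larger loss
print(distillation_loss(teacher, student_good))  # smaller loss: closer to teacher
```

The soft targets carry more information than hard labels alone (for example, which wrong answers the teacher considers nearly right), which is what lets a compact student approach the teacher's quality.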
Google has been a pioneer in LLM architecture and draws upon its robust research to develop its own AI models.
2017: Google researchers present the transformer architecture, which underpins many of today’s LLMs.
2020: The company introduces the Meena chatbot, a neural network-based conversational agent with 2.6 billion parameters.5
2021: Google unveils LaMDA (Language Model for Dialogue Applications), its conversational LLM.6
2022: PaLM (Pathways Language Model) is released, with more advanced capabilities compared to LaMDA.7
2023: Bard launches in the first quarter of the year, backed by a lightweight and optimized version of LaMDA.8 The second quarter sees PaLM 2 released—with enhanced coding, multilingual and reasoning skills—and adopted by Bard.9 Google announces Gemini 1.0 in the last quarter of the year.
2024: Google renames Bard as Gemini and upgrades its multimodal AI models to version 1.5.
The word “Gemini” means “twins” in Latin and is both a zodiac sign and a constellation. It was an apt name given that the Gemini model is the brainchild of Google DeepMind, a merging of forces between the teams at DeepMind and Google Brain. The company also took inspiration from NASA’s Project Gemini, the two-astronaut spaceflight program that paved the way for the success of the Apollo missions.10
Gemini Ultra surpasses similar models in various LLM benchmarks. It outperforms Claude 2, GPT-4 and Llama 2 in benchmarks such as GSM8K for mathematical reasoning, HumanEval for code generation and MMLU for natural language understanding.2
Notably, Gemini Ultra exceeded even human expert performance in MMLU. However, GPT-4 still performs better than Gemini Ultra in the HellaSwag benchmark for common sense reasoning and natural language inference.2
Google also evaluated Gemini Ultra’s multimodal capabilities. It scored higher than other models in document understanding, image understanding and automatic speech recognition benchmarks. And although it beat other LLMs in benchmarks for automatic speech translation, English video captioning, multimodal understanding and reasoning, and video question answering, Gemini Ultra’s performance in these areas leaves room for improvement.2
Meanwhile, the performance of both Gemini 1.5 Flash and Gemini 1.5 Pro is comparable to, or even surpasses, that of Gemini 1.0 Ultra.11 As its context window increases, Gemini 1.5 Pro maintains a high level of performance.4
Google Gemini is still in its early stages, but this highly capable AI model has the potential to be implemented in a wide array of applications:
Advanced coding
Image and text understanding
Language translation
Malware analysis
Personalized AI experts
Universal AI agents
Voice assistants
The Gemini AI model can work across programming languages such as C++, Java and Python to understand, explain and generate code. Google used fine-tuned versions of Gemini Pro as foundation models to develop AlphaCode 2, a code generation system that can solve competitive programming problems involving elements of theoretical computer science and complex math.
Gemini can be used to extract text from images and caption images. It can analyze visuals such as charts, diagrams and figures without the aid of optical character recognition (OCR) tools that convert images of text into a machine-readable format.
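An image-understanding request like the one described above is typically expressed as a multimodal prompt: a text part plus an inline image part in the same message. The sketch below builds such a request body without sending it; the field names follow the pattern of Google's generateContent REST API (`inline_data`, `mime_type`) but should be checked against the current documentation, and the image bytes here are a stand-in:

```python
import base64
import json

# Stand-in for real image data; a real call would read a PNG or JPEG file.
fake_image_bytes = b"\x89PNG...not-a-real-image"

body = {
    "contents": [{
        "role": "user",
        "parts": [
            {"text": "What does the chart in this image show?"},
            {"inline_data": {
                "mime_type": "image/png",
                # Binary image data is base64-encoded for transport in JSON
                "data": base64.b64encode(fake_image_bytes).decode("ascii"),
            }},
        ],
    }]
}
print(json.dumps(body)[:120])
```

Interleaving text and image parts in one request is what lets the model answer questions about a chart or diagram directly, without a separate OCR step.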
Because of their multilingual capabilities, Google’s AI models can be used to translate different languages. In the Meet video conferencing app, for instance, users can turn on translated captions to translate to and from specific languages.
Both Gemini 1.5 Pro and Gemini 1.5 Flash can be employed for malware analysis. Gemini Pro can accurately determine whether a file or code snippet is malicious and can generate a detailed report of its findings.12 Meanwhile, Gemini Flash can conduct rapid, large-scale malware dissection.13
Google recently released a new feature called Gems that allows users to customize the Gemini chatbot to create tailored AI “experts” on any task or topic. Some examples of premade Gems include a learning coach to help break down complex topics and make them easier to understand, a brainstorming partner to offer fresh ideas for a user’s next video, and a writing editor to provide feedback on grammar and structure.
Gems come with a Gemini Advanced subscription, which uses the Gemini 1.5 Pro model.
Through Project Astra, Google is building on its Gemini models to create a universal AI agent that can process, remember and understand multimodal information in real time. To improve recall and efficiency, Project Astra harnesses caching, continuous encoding of video frames and coupling speech and video input into a timeline of events.14
In one of Google’s demos, the Gemini AI assistant was able to explain the parts of a speaker, recognize the neighborhood a person was in and remember where they put their glasses.14
With Gemini Live, users can have a dialogue with the Gemini chatbot that feels more natural and conversational. It offers more intuitive responses and can adapt to a person’s conversational style.
Like other LLMs, Google Gemini continues to grapple with the risks of AI. Caution is recommended, especially for individuals intending to use Gemini and organizations considering the model for commercial use or integration into their workflows.
Bias: In February 2024, Google paused the Gemini chatbot’s ability to create images of people after it produced historically inaccurate depictions, an overcorrection stemming from efforts to counter racial bias in image generation.15
Hallucinations: As of this writing, Gemini-backed AI Overviews in Google Search still occasionally produce factually incorrect results.
Intellectual property violations: Regulators in France fined Google after finding that the company’s AI chatbot was trained on news stories and content without the knowledge or consent of the country’s publishers.16
All links reside outside ibm.com
1 Transformer: A Novel Neural Network Architecture for Language Understanding, Google Research, 31 August 2017.
2 Gemini: A Family of Highly Capable Multimodal Models, Google DeepMind, Accessed 16 September 2024.
3 Gemini Models, Google DeepMind, Accessed 16 September 2024.
4 Our next-generation model: Gemini 1.5, Google, 15 February 2024.
5 Towards a Conversational Agent that Can Chat About…Anything, Google Research, 28 January 2020.
6 LaMDA: our breakthrough conversation technology, Google, 18 May 2021.
7 Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance, Google Research, 4 April 2022.
8 Try Bard and share your feedback, Google, 21 March 2023.
9 Introducing PaLM 2, Google, 10 May 2023.
10 How Google’s AI model Gemini got its name, Google, 15 May 2024.
11 Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, Google DeepMind, Accessed 16 September 2024.
12 From Assistant to Analyst: The Power of Gemini 1.5 Pro for Malware Analysis, Google Cloud, 30 April 2024.
13 Scaling Up Malware Analysis with Gemini 1.5 Flash, Google Cloud, 16 July 2024.
14 Project Astra, Google DeepMind, Accessed 16 September 2024.
15 Google chief admits ‘biased’ AI tool’s photo diversity offended users, The Guardian, 28 February 2024.
16 Google fined €250m in France for breaching intellectual property deal, The Guardian, 20 March 2024.
Explore the IBM library of foundation models on the watsonx platform to scale generative AI for your business with confidence.
IBM Granite is a family of artificial intelligence (AI) models built for business to help drive trust and scalability in AI-driven applications. Open source and proprietary Granite models are available today.
IBM Consulting™ is working with global clients and partners to co-create what’s next in AI. Our diverse, global team of more than 20,000 AI experts can help you quickly and confidently design and scale cutting edge AI solutions and automation across your business.