What is LLM Temperature?

In artificial intelligence (AI) and machine learning, temperature is a parameter for adjusting the output of large language models (LLMs). Temperature controls the randomness of text that is generated by LLMs during inference.

LLMs generate text by predicting the next word (or rather, the next token) according to a probability distribution. The model assigns each candidate token a logit (a raw numerical score), and the softmax function normalizes the full set of logits into a probability distribution. Each token's softmax probability lies between zero and one, and the probabilities of all tokens sum to one.
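As a rough illustration of the normalization step described above, the following sketch converts a handful of logits into softmax probabilities. The logit values and the helper name `softmax` are invented for this example and do not come from any particular model or library.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to one."""
    # Subtracting the largest logit keeps exp() numerically stable
    # without changing the resulting probabilities.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens (illustrative values only).
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)

print([round(p, 3) for p in probs])  # each value lies between zero and one
print(round(sum(probs), 6))          # the probabilities sum to one
```

The token with the largest logit receives the largest probability, and the ordering of the tokens is preserved.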

The LLM temperature parameter modifies this distribution. In the usual formulation, each logit is divided by the temperature before the softmax is applied. A lower temperature sharpens the distribution, making the highest-probability tokens even more likely to be selected; a higher temperature flattens it, increasing the model's likelihood of selecting less probable tokens. Different temperature settings therefore introduce different levels of randomness when a generative AI model outputs text.
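A minimal sketch of this effect, assuming the common formulation in which every logit is divided by the temperature before the softmax is taken. The logits and the function name `softmax_with_temperature` are hypothetical, and specific inference libraries may handle edge cases (such as a temperature of zero) differently.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits that have been divided by the temperature."""
    if temperature <= 0:
        # Assumption for this sketch: a zero or negative temperature is rejected.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens (illustrative values only).
logits = [2.0, 1.0, 0.5, -1.0]

for temperature in (0.1, 0.75, 1.25):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the lowest temperature nearly all of the probability mass sits on the top token, while at higher temperatures the mass spreads toward the less likely tokens.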

Temperature is a key setting for controlling randomness in model output. It allows users to adjust LLM output to suit different real-world applications of text generation. More specifically, it lets users balance coherence and creativity for a specific use case. For instance, a low temperature might be preferable for tasks requiring precision and factual accuracy, such as technical documentation or chatbot replies. A lower temperature value helps the LLM produce more coherent and consistent text and avoid irrelevant responses. By contrast, a high temperature is preferable for creative tasks such as fiction writing or concept brainstorming. The temperature setting lets users tune a model's output toward their desired outcome without retraining or fine-tuning the model.

Temperature is often equated with ‘creativity,’ but the two are not the same. It’s more helpful to think of temperature as how broadly the model uses text from its training data. Max Peeperkorn et al.¹ conducted an empirical analysis of LLM output for different temperature values and wrote:

“We find that temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality. However, the influence of temperature on creativity is far more nuanced and weak than suggested by the "creativity parameter" claim; overall results suggest that the LLM generates slightly more novel outputs as temperatures get higher.”

A high temperature value can make model outputs seem more creative but it's more accurate to think of them as being less determined by the training data.

Configuring temperature

Many APIs for accessing models, including IBM® Granite™ Instruct and OpenAI's ChatGPT, expose parameters for configuring temperature along with a variety of other LLM parameters. Three common sampling-related parameters are:

do_sample: This parameter controls whether the model samples during text generation. Sampling is a method to vary text output. When set to "True," the model randomly samples from the predicted token probabilities rather than always selecting the most probable next token. This parameter must be set to true for temperature adjustments to have any effect on the pretrained LLM.

top_k: This parameter restricts the model's choices during random sampling to the k most likely tokens. While the previous parameter enables random sampling of predicted tokens beyond the most likely one, this parameter limits the number of candidate tokens from which the model selects. Random sampling helps produce more varied and diverse outputs; top_k helps maintain the quality of generated text by excluding the least likely tokens from being sampled.

top_p: This parameter controls what is also called nucleus sampling. It is another method of limiting the choices of random sampling to avoid inconsistent and nonsensical output. The model considers only the smallest set of most likely tokens whose cumulative probability reaches a specified threshold. With a threshold of 95%, for instance, the model samples only from the most probable tokens that together account for at least 95% of the probability mass. While random sampling enables the model to have a more dynamic output, the top_p parameter ensures that the output maintains some coherence and consistency.
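The interaction of these three parameters with temperature can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the implementation used by any specific library: the function `sample_next_token`, its argument names, and the logits are hypothetical, and real libraries differ in details such as the order in which filters are applied and whether probabilities are renormalized between steps.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None,
                      do_sample=True, rng=random):
    """Pick one token from a dict that maps token -> logit."""
    if not do_sample:
        # Greedy decoding: always take the most likely token, so
        # temperature, top_k and top_p have no effect.
        return max(logits, key=logits.get)

    # Temperature-scaled softmax over all candidate tokens.
    scaled = {tok: x / temperature for tok, x in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(x - peak) for tok, x in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)

    if top_k is not None:
        # Keep only the k most likely tokens.
        ranked = ranked[:top_k]

    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept

    tokens = [tok for tok, _ in ranked]
    weights = [prob for _, prob in ranked]
    # random.choices accepts unnormalized weights.
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical logits for four candidate tokens (illustrative values only).
toy_logits = {"cat": 3.0, "dog": 2.0, "fish": 0.5, "zebra": -2.0}

print(sample_next_token(toy_logits, do_sample=False))
print(sample_next_token(toy_logits, temperature=1.25, top_k=3, top_p=0.95))
```

With `do_sample=False` the call is deterministic. With sampling enabled, repeated calls can return different tokens, but never one that the top_k or top_p filter has excluded.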

Controlling output

Models often let users control the output more explicitly as well. Some commonly used parameters for output control are:

Maximum length: The maximum length is the total number of tokens the model is allowed to generate. This setting is useful because it allows users to manage the length of the model's response and can prevent overly long or irrelevant responses.

Stop sequences: These sequences tell the model when to stop generating output and help to control content length and structure. Prompting an LLM to write an email with "Best regards," or "Sincerely," as a stop sequence tells the model to stop before the closing salutation. This setting can help keep the email short and to the point. Stop sequences are useful for output that you expect to come out in a structured format such as an email, a numbered list or dialog.

Frequency penalty: A frequency penalty is a setting that discourages repetition in the generated text by penalizing tokens proportionally to how frequently they appear. The more often a token is used in the text, the less likely the LLM is to use it again.

Presence penalty: The presence penalty is similar to the frequency penalty, but penalizes tokens based on whether they have occurred or not rather than penalizing them proportionally.
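The two penalties and the stop-sequence behavior can be sketched as follows. This assumes one common formulation in which the penalties are subtracted from a token's logit before sampling; the function names, arguments and values are hypothetical, and actual APIs may define or scale these settings differently. Real systems also typically stop during generation rather than trimming finished text, so `truncate_at_stop` only approximates the end result.

```python
from collections import Counter

def apply_repetition_penalties(logits, generated_tokens,
                               frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appear in the generated text."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        # Frequency penalty grows with every prior occurrence of the token.
        penalty = count * frequency_penalty
        # Presence penalty is a flat amount applied once the token has appeared.
        if count > 0:
            penalty += presence_penalty
        adjusted[token] = logit - penalty
    return adjusted

def truncate_at_stop(text, stop_sequences):
    """Cut the text just before the earliest stop sequence, if any occurs."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # ignore empty stop strings
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

# Hypothetical logits and generation history (illustrative values only).
toy_logits = {"data": 2.0, "python": 1.5, "cat": 1.0}
history = ["data", "data", "python"]
print(apply_repetition_penalties(toy_logits, history,
                                 frequency_penalty=0.5, presence_penalty=0.2))

email = "Hi team,\nThe report is ready.\nBest regards,\nAlex"
print(repr(truncate_at_stop(email, ["Best regards,", "Sincerely,"])))
```

In this toy example the token that has not yet been used ("cat") keeps its original logit, so it becomes relatively more likely than the tokens that were already generated.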

Comparing outputs with IBM Granite

To see how temperature affects the output of an LLM, look at the outputs created by the IBM Granite 3.1 2-billion-parameter model. The prompt provided here was: “Write a story about a data scientist who loves Python.”

With a temperature of 0.1, Granite generates:

Once upon a time, in the bustling city of San Francisco, there lived a brilliant data scientist named Alex. Alex was known for their exceptional skills in Python programming and data analysis. They spent their days and nights immersed in the world of data, using Python to extract insights and make predictions.

One day, Alex was tasked with analyzing customer churn for a popular streaming service. They used Python's scikit-learn library to build a machine learning model that could predict which customers were likely to cancel their subscriptions.

With a temperature of 0.75, Granite generates:

Once upon a time, in the bustling city of San Francisco, there lived a brilliant data scientist named Alex. Alex was known for their deep love for Python, a programming language that they believed was the perfect tool for data analysis and machine learning.

Every day, Alex would wake up early, brew a pot of coffee, and sit down at their desk. They would spend hours writing code in Python, exploring datasets, and building predictive models. Their love for Python was not just about the language itself, but also about the community that had grown around it.

With a temperature of 1.25, Granite generates:

Once upon a time, in a bustling city filled with towering skyscrapers and neon lights, there was a data scientist who loved Python more than anything in the world. The data scientist has a feline helper named Suki. Suki is a curious and playful cat, always eager to explore every nook and cranny of the house.

One day, the data scientist is working on a complex coding problem. Suki keeps meowing loudly as if she is trying to communicate a message to the data scientist. The data scientist notices that the loud meowing usually happens when they are close to a solution to the complex coding problem.

Notice that although the stories begin similarly, increasing the temperature increases the variability of the text and of the generated details.
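One way to put a number on this growing variability is the entropy of the next-token distribution at each of the three temperatures used above. The sketch below uses invented logits rather than values from Granite, and assumes the common formulation in which logits are divided by the temperature before the softmax, so it illustrates the trend only and does not reproduce the model's behavior.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits that have been divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; higher values mean a less predictable choice."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits for four candidate tokens (illustrative values only).
logits = [2.0, 1.0, 0.5, -1.0]

for temperature in (0.1, 0.75, 1.25):
    probs = softmax_with_temperature(logits, temperature)
    print(f"temperature={temperature}: "
          f"top probability={max(probs):.3f}, "
          f"entropy={entropy_bits(probs):.3f} bits")
```

The entropy rises with temperature, which matches the pattern in the stories: the low-temperature output stays close to the most likely continuation, while the higher-temperature outputs wander further from it.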

Footnotes

1 Max Peeperkorn, Tom Kouwenhoven, Dan Brown, and Anna Jordanous, Is Temperature the Creativity Parameter of Large Language Models?, 2024
