
What is tokenization?

27 January 2025

Authors

James Holdsworth

Content Writer

Matthew Kosinski

Enterprise Technology Writer


In data security, tokenization is the process of converting sensitive data into a nonsensitive digital replacement, called a token, that maps back to the original.

Tokenization can help protect sensitive information. For example, sensitive data can be mapped to a token and placed in a digital vault for secure storage. The token can then act as a secure replacement for the data. The token itself is nonsensitive and has no use or value without connection to the data vault. 



What is a token?

A digital token is a collection of characters that serves as an identifier for some other asset or piece of information. For example, one could replace an annual expense figure of USD 45,500,000 in a confidential report with the token “ot&14%Uyb.”
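As a simple illustration, such a replacement string can be generated with Python's built-in secrets module. The length and character set below are illustrative assumptions, not a prescribed token format.

```python
import secrets
import string

# Build a short, opaque token from a mixed alphabet.
# The 9-character length simply mirrors the example above.
alphabet = string.ascii_letters + string.digits + "&%!"
token = "".join(secrets.choice(alphabet) for _ in range(9))
print(token)  # for example, "ot&14%Uyb" (output differs on every run)
```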

Tokens also appear in natural language processing (NLP), although the concept is slightly different in this field. In NLP, a token is an individual unit of language—usually a word or a part of a word—that a machine can understand.

Different types of tokenization produce different types of tokens. Common tokens include:

  • Irreversible tokens are just that: tokens that cannot be converted back into their original values. Irreversible tokens are often used to anonymize data, which enables the tokenized dataset to be used for third-party analytics or in less secure environments.

  • Reversible tokens can go through detokenization to be converted back into their original data values. Reversible tokens are useful when people and systems need access to the original data. For example, when issuing a refund, a payment processor might need to convert a payment token back into the actual payment card details.

  • Format-preserving tokens have the same format as the data they replace. For example, the token for a credit card number with the format 1234-1234-1234-1234 might be 8493-9756-1986-6455. Format-preserving tokens support business continuity because they help ensure that the data’s structure remains the same, even when tokenized. This stable structure makes the token more likely to be compatible with both traditional and updated software (see the sketch after this list).

  • High-value and low-value tokens are used in payment systems that rely on tokenization to protect sensitive information. A high-value token (HVT) can replace a primary account number (PAN) in a transaction and complete the transaction by itself. A low-value token (LVT) also substitutes for a PAN but cannot complete a transaction on its own; it must first be mapped back to the valid PAN.
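For illustration, here is a minimal sketch of how a format-preserving token might be produced. The helper function is hypothetical; real format-preserving schemes add guarantees, such as token uniqueness and preserving issuer digits, that this toy version omits.

```python
import secrets

def format_preserving_token(value: str) -> str:
    """Hypothetical helper: replace each digit with a random digit,
    keeping separators and overall structure intact."""
    return "".join(
        secrets.choice("0123456789") if ch.isdigit() else ch
        for ch in value
    )

print(format_preserving_token("1234-1234-1234-1234"))
# e.g. "8493-9756-1986-6455": same format, nonsensitive digits
```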

How tokenization works

Tokenization systems often include the following components:

1. A token generator that creates tokens through one of several techniques, including:

  • Mathematically reversible cryptographic functions, which use strong encryption algorithms that can be reversed with an associated encryption key.

  • One-way, nonreversible cryptographic functions, such as a hash function.

  • A random number generator to create random tokens, which is often considered one of the strongest techniques for generating token values.

2. A token mapping process that assigns the newly created token value to the original value. A secure cross-reference database is created to track the associations between tokens and the real data. This database is kept in a secure data store so that only authorized users can access it.  

3. A token data store or token vault that holds the original values and their related token values. Data that is stored in the vault is often encrypted for greater security. The vault is the only location where a token is connected back to its original value.

4. An encryption key manager to track and secure any cryptographic keys used to encrypt the data in the vault, tokens in transit or other data and assets in the tokenization system.
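To make these components concrete, the following minimal, in-memory sketch combines a token generator, token mapping and token vault in one class. The class and method names are assumptions for illustration; a production system would add encryption of the vault, key management, access control and durable storage.

```python
import secrets

class TokenVault:
    """Toy vault-backed tokenization flow (illustrative only)."""

    def __init__(self):
        # Token vault: the only place a token maps back to its original value.
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so one value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # Token generator: a random token reveals nothing about the value.
        token = secrets.token_urlsafe(16)
        # Token mapping: record the association in the vault.
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # In practice, only authorized callers may reach this lookup.
        return self._token_to_value[token]

vault = TokenVault()
token = vault.tokenize("123-45-6789")
print(token)                    # opaque, nonsensitive token
print(vault.detokenize(token))  # "123-45-6789"
```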

Tokenization without a vault is also possible. Rather than storing sensitive information in a secure database, vaultless tokenization uses an encryption algorithm to generate a token from the sensitive data. The same algorithm can be used to reverse the process, turning the token back into the original data. Because the token can be reversed algorithmically, the original sensitive information does not need to be stored in a vault.
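One way to picture vaultless, reversible tokenization is with symmetric encryption, as in the sketch below using the cryptography package's Fernet API. This is an assumption about tooling made for illustration; many real vaultless schemes use format-preserving encryption instead, so that the token keeps the original data's format.

```python
from cryptography.fernet import Fernet  # assumes: pip install cryptography

# The key stands in for the vault: whoever holds it can reverse tokens.
key = Fernet.generate_key()
fernet = Fernet(key)

# Tokenize: derive the token from the data itself; no lookup table is stored.
token = fernet.encrypt(b"4111-1111-1111-1111")

# Detokenize: the same algorithm and key recover the original value.
original = fernet.decrypt(token).decode()
print(original)  # "4111-1111-1111-1111"
```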

When a third-party tokenization provider is used, the original sensitive data might be removed from an enterprise’s internal systems, moved to the third party’s storage and replaced with tokens. This substitution helps to mitigate the risk of data breaches within the enterprise. The tokens themselves are typically stored within the enterprise to streamline normal operations.

An example of tokenization in action

  1. To make an account on an official government website, a user must enter their Social Security number (SSN).

  2. The website sends the Social Security number to a tokenization service. The tokenization service generates a token that represents the SSN and stores the actual SSN in a secure vault.

  3. The tokenization service sends the token back to the website. The website stores only the nonsensitive token.

  4. When the website needs access to the original SSN—for example, to confirm the user’s identity during later visits—it sends the token back to the tokenization service. The service matches the token with the correct SSN in its vault to confirm the user’s identity. 
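From the website's point of view, the exchange above might look roughly like the following. The service URL, endpoints and field names are hypothetical placeholders; the point is that the site stores and forwards only the token, never the SSN.

```python
import requests

TOKENIZATION_SERVICE = "https://tokenization.example.com"  # hypothetical service

# Step 2: send the SSN to the tokenization service and receive a token back.
resp = requests.post(f"{TOKENIZATION_SERVICE}/tokenize", json={"value": "123-45-6789"})
token = resp.json()["token"]

# Step 3: the website persists only the nonsensitive token.
user_record = {"username": "jdoe", "ssn_token": token}

# Step 4: later, exchange the token for the real SSN only when it is needed.
resp = requests.post(f"{TOKENIZATION_SERVICE}/detokenize", json={"token": user_record["ssn_token"]})
ssn = resp.json()["value"]
```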

Use cases and benefits of tokenization

Tokenization methods can bring extra data protection to many types of data across many industries and business functions.

Data security

Data tokenization makes it possible for an organization to remove or disguise any or all sensitive data elements from their in-house data systems. As a result, there is less—or no—valuable data for hackers to steal, which helps reduce the organization’s vulnerability to data breaches.

Tokenization is often used to safeguard sensitive business data and personally identifiable information (PII) such as passport numbers or Social Security numbers. In financial services, marketing and retail, tokenization is often used to secure cardholder data and account information.

Each piece of sensitive information receives its own unique identifier. These tokens can be used in place of the real data for most intermediate data uses—using sensitive data after it is gathered but before final disposition—without needing to detokenize it.

Tokenization can also help organizations meet compliance requirements. For example, many healthcare organizations use tokenization to help meet the data privacy rules that are imposed by the Health Insurance Portability and Accountability Act (HIPAA).

Some access control systems also use digital tokens. For example, in a token-based authentication protocol, users verify their identities and in return receive an access token that they can use to gain access to protected services and assets. Many application programming interfaces (APIs) use tokens in this way. 
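For example, a client might present such an access token as a bearer token in the HTTP Authorization header. The endpoint and token value below are placeholders used only for illustration.

```python
import requests

# Placeholder token, obtained earlier by authenticating with the identity provider.
access_token = "eyJhbGciOi..."

# Present the bearer token instead of credentials on each request.
resp = requests.get(
    "https://api.example.com/protected/resource",  # hypothetical protected endpoint
    headers={"Authorization": f"Bearer {access_token}"},
)
resp.raise_for_status()
print(resp.status_code)
```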

Digital payments

Banks, e-commerce websites and other apps often use tokenization to protect bank account numbers, credit card numbers and other sensitive data.

During payment processing, a tokenization system can substitute a payment token for credit card information, a primary account number (PAN) or other financial data.

This tokenization process removes the linkage between the purchase and the financial information, shielding customers’ sensitive data from malicious actors.

Natural language processing (NLP)

Tokenization is a preprocessing technique used in natural language processing (NLP). NLP tools generally process text in linguistic units, such as words, clauses, sentences and paragraphs. Thus, NLP algorithms must first segment large texts into smaller tokens that NLP tools can process. The tokens represent text in a way that algorithms can understand.  
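A minimal example of this kind of segmentation, using a simple regular expression rather than a production tokenizer (which would also handle contractions, casing and subword units):

```python
import re

text = "Tokenization protects sensitive data."

# Split the text into word and punctuation tokens.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['Tokenization', 'protects', 'sensitive', 'data', '.']
```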

Compliance requirements

Data tokenization can help organizations comply with governmental regulatory requirements and industry standards. Many organizations use tokenization as a form of nondestructive obfuscation to protect PII.

For example, the Payment Card Industry Data Security Standard (PCI DSS) mandates that businesses meet cybersecurity requirements to protect cardholder data. Tokenizing primary account numbers is one step organizations might take to comply with these requirements. Tokenization can also help organizations adhere to the data privacy rules laid out by the EU’s General Data Protection Regulation (GDPR).

Asset tokenization

Tokens can be used to represent assets, whether tangible or intangible. Tokenized assets are often safer and easier to move or trade than the actual asset, enabling organizations to automate transactions, streamline operations and increase asset liquidity.

Tangible assets that are represented by a token might include artworks, equipment or real estate. Intangible assets include data, intellectual property or security tokens that promise an ROI, similar to bonds and equities. Nonfungible tokens (NFTs) enable the purchase of digital assets such as art, music and digital collectibles.

Blockchain

Token-based blockchain technology enables the transfer of ownership and value in a single transaction, unlike traditional methods that have a possible delay between the transaction time and the settlement. Smart contracts can help automate token transfers and other transactions on the blockchain.

Cryptocurrencies can use crypto tokens to tokenize assets or interests on their blockchains. Asset-backed tokens, called stablecoins, can optimize business processes by eliminating intermediaries and escrow accounts. 

Tokenization versus encryption

Tokenization replaces sensitive data with strings of nonsensitive (and otherwise useless) characters. Encryption scrambles the data so that it can be unscrambled only with a secret key, known as a decryption key.

Both tokenization and encryption can help protect data, but they often serve different use cases. Tokenization is common in situations where the original data can easily be replaced, such as storing payment data for recurring payments. Encryption is common in situations where access to the original data is important, such as protecting data at rest and in transit.

Tokenization can be a less resource-intensive process than encryption. Whereas tokenization requires only swapping data with a nonsensitive token, an encryption system requires regular encryption and decryption when data is used, which might become costly. 
