In data security, tokenization is the process of converting sensitive data into a nonsensitive digital replacement, called a token, that maps back to the original.
Tokenization can help protect sensitive information. For example, sensitive data can be mapped to a token and placed in a digital vault for secure storage. The token can then act as a secure replacement for the data. The token itself is nonsensitive and has no use or value without connection to the data vault.
A digital token is a collection of characters that serves as an identifier for some other asset or piece of information. For example, one could replace an annual expense figure of USD 45,500,000 in a confidential report with the token “ot&14%Uyb.”
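As a rough illustration, the snippet below uses Python’s `secrets` module to produce a random character string like the one above. The alphabet and length are arbitrary choices for this sketch, not part of any standard.

```python
# A minimal sketch: generate a random, nonsensitive token string.
# The alphabet and 9-character length are illustrative assumptions.
import secrets
import string

ALPHABET = string.ascii_letters + string.digits + "&%$#@!"

def generate_token(length: int = 9) -> str:
    """Return a random string to stand in for a sensitive value."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(generate_token())  # e.g. "ot&14%Uyb" (different on every run)
```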
Tokens also appear in natural language processing (NLP), although the concept is slightly different in this field. In NLP, a token is an individual unit of language—usually a word or a part of a word—that a machine can understand.
Different types of tokenization produce different types of tokens. For example, reversible tokens can be mapped back to the original data, while irreversible tokens cannot.
Tokenization systems often include the following components; a sketch of how they fit together follows the list:
1. A token generator that creates tokens through one of several techniques, such as mathematically reversible cryptographic functions, one-way hash functions or randomly generated values.
2. A token mapping process that assigns the newly created token value to the original value. A secure cross-reference database is created to track the associations between tokens and the real data. This database is kept in a secure data store so that only authorized users can access it.
3. A token data store or token vault that holds the original values and their related token values. Data that is stored in the vault is often encrypted for greater security. The vault is the only location where a token is connected back to its original value.
4. An encryption key manager to track and secure any cryptographic keys used to encrypt the data in the vault, tokens in transit or other data and assets in the tokenization system.
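The sketch below shows how the first three components might fit together, assuming a simple in-memory store. Encrypting the vault and managing keys (component 4) are omitted; a production system would handle both.

```python
# A minimal, in-memory sketch of a vaulted tokenization system:
# a token generator (random values), a token mapping, and a "vault"
# keyed by token. Vault encryption and key management are elided.
import secrets

class TokenVault:
    def __init__(self) -> None:
        self._vault: dict[str, str] = {}   # token -> original value
        self._index: dict[str, str] = {}   # original value -> token

    def tokenize(self, value: str) -> str:
        """Return the existing token for a value, or create and store a new one."""
        if value in self._index:
            return self._index[value]
        token = secrets.token_urlsafe(12)  # random, nonsensitive identifier
        self._vault[token] = value
        self._index[value] = token
        return token

    def detokenize(self, token: str) -> str:
        """Look up the original value; only the vault can make this connection."""
        return self._vault[token]

vault = TokenVault()
token = vault.tokenize("4111 1111 1111 1111")
print(token)                    # safe to store and pass around
print(vault.detokenize(token))  # only authorized code should reach here
```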
Tokenization without a vault is also possible. Rather than storing sensitive information in a secure database, vaultless tokenization uses an encryption algorithm to generate a token from the sensitive data itself. The same algorithm can reverse the process, turning the token back into the original data, so the original sensitive information does not need to be stored in a vault.
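A sketch of the vaultless approach, using symmetric encryption from the third-party `cryptography` package; production systems often use format-preserving encryption instead so the token keeps the shape of the original data.

```python
# A sketch of vaultless tokenization: the token is derived from the
# data itself via reversible encryption, so no vault is required.
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # secured by the encryption key manager
f = Fernet(key)

token = f.encrypt(b"123-45-6789")  # token derived from the data
original = f.decrypt(token)        # reversed with the same key
assert original == b"123-45-6789"
```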
When a third-party tokenization provider is used, the original sensitive data might be removed from an enterprise’s internal systems, moved to the third party’s storage and replaced with tokens. This substitution helps to mitigate the risk of data breaches within the enterprise. The tokens themselves are typically stored within the enterprise to streamline normal operations.
Tokenization methods can bring extra data protection to many types of data across many industries and business functions.
Data tokenization makes it possible for an organization to remove or disguise any or all sensitive data elements from its in-house data systems. As a result, there is less valuable data (or none at all) for hackers to steal, which helps reduce the organization’s vulnerability to data breaches.
Tokenization is often used to safeguard sensitive business data and personally identifiable information (PII) such as passport numbers or Social Security numbers. In financial services, marketing and retail, tokenization is often used to secure cardholder data and account information.
Each piece of sensitive information receives its own unique identifier. These tokens can be used in place of the real data for most intermediate data uses (working with sensitive data after it is gathered but before its final disposition) without needing to detokenize it.
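For example, an analysis can run entirely on tokens. The sketch below counts orders per customer over tokenized records; the token values and data are hypothetical, and the real identifiers never need to be detokenized.

```python
# A sketch of an intermediate data use: aggregate over tokens
# instead of real customer identifiers. All values are hypothetical.
from collections import Counter

orders = [
    {"customer_token": "tok_7fa2", "amount": 30},
    {"customer_token": "tok_9bc1", "amount": 15},
    {"customer_token": "tok_7fa2", "amount": 25},
]

orders_per_customer = Counter(o["customer_token"] for o in orders)
print(orders_per_customer)  # Counter({'tok_7fa2': 2, 'tok_9bc1': 1})
```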
Tokenization can also help organizations meet compliance requirements. For example, many healthcare organizations use tokenization to help meet the data privacy rules that are imposed by the Health Insurance Portability and Accountability Act (HIPAA).
Some access control systems also use digital tokens. For example, in a token-based authentication protocol, users verify their identities and in return receive an access token that they can use to gain access to protected services and assets. Many application programming interfaces (APIs) use tokens in this way.
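A sketch of how a client might present such a token, using the third-party `requests` library and the common Bearer scheme. The endpoint and token value are hypothetical.

```python
# A sketch of token-based access: the client presents an access token
# in the Authorization header. Endpoint and token are hypothetical.
import requests

access_token = "eyJhbGciOi..."  # received after authenticating

response = requests.get(
    "https://api.example.com/v1/protected-resource",
    headers={"Authorization": f"Bearer {access_token}"},
)
response.raise_for_status()
```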
Banks, e-commerce websites and other apps often use tokenization to protect bank account numbers, credit card numbers and other sensitive data.
During payment processing, a tokenization system can substitute a payment token for credit card information, a primary account number (PAN) or other financial data.
This tokenization process removes the linkage between the purchase and the financial information, shielding customers’ sensitive data from malicious actors.
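A sketch of generating such a payment token is below. Keeping the last four digits visible for receipts is a common convention, but the token format here is an illustrative assumption, not a payment-network standard; the token-to-PAN mapping would live in a vault as described earlier.

```python
# A sketch of substituting a token for a primary account number (PAN).
# The "tok_" format and 12 random digits are illustrative choices.
import secrets

def tokenize_pan(pan: str) -> str:
    digits = pan.replace(" ", "")
    random_part = "".join(secrets.choice("0123456789") for _ in range(12))
    return f"tok_{random_part}{digits[-4:]}"  # last four kept for receipts

print(tokenize_pan("4111 1111 1111 1111"))  # e.g. tok_4839201746281111
```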
Tokenization is a preprocessing technique used in natural language processing (NLP). NLP tools generally process text in linguistic units, such as words, clauses, sentences and paragraphs. Thus, NLP algorithms must first segment large texts into smaller tokens that NLP tools can process. The tokens represent text in a way that algorithms can understand.
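A simple word-level example is sketched below with a regular expression; production NLP pipelines typically rely on trained tokenizers (for example, subword models) rather than a one-line pattern.

```python
# A sketch of word-level NLP tokenization: split text into word and
# punctuation tokens with a simple regular expression.
import re

text = "Tokenization helps machines understand language."
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['Tokenization', 'helps', 'machines', 'understand', 'language', '.']
```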
Data tokenization can help organizations comply with governmental regulatory requirements and industry standards. Many organizations use tokenization as a form of nondestructive obfuscation to protect PII.
For example, the Payment Card Industry Data Security Standard (PCI DSS) mandates that businesses meet cybersecurity requirements to protect cardholder data. Tokenizing primary account numbers is one step organizations might take to comply with these requirements. Tokenization can also help organizations adhere to the data privacy rules laid out by the EU’s General Data Protection Regulation (GDPR).
Tokens can be used to represent assets, whether tangible or intangible. Tokenized assets are often safer and easier to move or trade than the actual asset, enabling organizations to automate transactions, streamline operations and increase asset liquidity.
Tangible assets that are represented by a token might include artworks, equipment or real estate. Intangible assets include data, intellectual property or security tokens that promise an ROI, similar to bonds and equities. Nonfungible tokens (NFTs) enable the purchase of digital assets such as art, music and digital collectibles.
Token-based blockchain technology enables the transfer of ownership and value in a single transaction, unlike traditional methods that have a possible delay between the transaction time and the settlement. Smart contracts can help automate token transfers and other transactions on the blockchain.
Cryptocurrencies can use crypto tokens to represent an asset or interest on their blockchains. Asset-backed tokens, called stablecoins, can optimize business processes by eliminating intermediaries and escrow accounts.
Tokenization replaces sensitive data with strings of nonsensitive (and otherwise useless) characters. Encryption scrambles the data so that it can be unscrambled only with a secret key, known as a decryption key.
Both tokenization and encryption can help protect data, but they often serve different use cases. Tokenization is common in situations where the original data can easily be replaced, such as storing payment data for recurring payments. Encryption is common in situations where access to the original data is important, such as protecting data at rest and in transit.
Tokenization can be a less resource-intensive process than encryption. Whereas tokenization requires only swapping data with a nonsensitive token, an encryption system requires regular encryption and decryption when data is used, which might become costly.
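The operational difference can be sketched in a few lines: detokenization is a single lookup, while encrypted data must be decrypted (and re-encrypted) every time it is used. The vault contents here are hypothetical, and the encryption side reuses the `cryptography` package from the earlier sketch.

```python
# A sketch of the cost difference: token lookup vs. crypto per use.
from cryptography.fernet import Fernet

vault = {"tok_abc123": "4111 1111 1111 1111"}  # tokenization: one lookup
pan = vault["tok_abc123"]

f = Fernet(Fernet.generate_key())              # encryption: crypto each use
ciphertext = f.encrypt(pan.encode())
plaintext = f.decrypt(ciphertext).decode()
assert plaintext == pan
```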