What is positional encoding?

6 May 2025

Authors

Fangfang Lee

Developer Advocate

IBM

What is positional encoding?

Positional encoding is a technique that injects information about the position of the words in a sequence into transformer architectures. The order of words plays a fundamental part in understanding the semantic meaning of a sentence. For example, “Allen walks dog” and “dog walks Allen” have entirely different meanings despite containing the same words, or tokens. When implementing natural language processing (NLP) applications by using deep learning and neural networks, we need a mechanism by which machines can retain the order of words in a sentence to produce logical output.

Traditionally, models such as recurrent neural networks (RNNs) or long short-term memory (LSTM) networks have a built-in mechanism that handles the order of words. RNNs and LSTMs process inputs sequentially, one token at a time, so the position of each word in a sequence is reflected in when it is processed. In other words, each n-dimensional input vector is processed one after the other, and the model inherently learns order. In contrast, architectures built on convolutional neural networks (CNNs) or transformers (Vaswani et al. 2017) process tokens in parallel and do not retain word order on their own. Therefore, we need a mechanism that can explicitly represent the order of words in a sequence, a technique known as positional encoding. Positional encoding allows the transformer to retain information about word order while still enabling parallelization and efficient model training. You can often find implementations of positional encoding on GitHub.

Why does positional encoding matter?

The ordering of words in a sentence or sequence dictates its inherent meaning in natural languages. In addition, for machine learning, encoding the order of the words gives the model a “dictionary” of where each word should be. This information is retained and generalizes throughout the training of transformer models, enabling parallelization and giving transformers a training-efficiency advantage over RNNs and LSTMs.

Let's revisit the example:

  • "Allen walks dog"
  • "dog walks Allen"

These two sentences with the same three tokens have entirely different meanings based on word order. Transformers, which rely on self-attention and multi-head attention mechanisms, have no inherent representation of word order and would treat each word in a sequence identically if we did not provide explicit positional information. We want the model to understand who is doing the walking and who is being walked, which depends entirely on position.

We achieve this objective by first processing each word as a vector that represents its meaning: for example, “dog” is encoded as a high-dimensional array that captures its concept. In technical terms, each word or subword is mapped to an input embedding of a fixed dimension. However, on its own, this meaning vector does not tell us where in the sentence “dog” appears. Positional encoding adds a second vector, one that encodes the position index, such as “first word”, “second word” and so on. The two vectors are then added to represent both what the word is and where the word is, and this combined vector is what the transformer takes as input.
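To make this concrete, here is a minimal sketch in NumPy with made-up numbers. The token embeddings and positional vectors below are placeholder values chosen only for illustration; in a real transformer, the embeddings come from a learned embedding layer and the positional vectors from a scheme such as the sinusoidal formula covered later in this article.

import numpy as np

# Toy 4-dimensional "meaning" vectors for a 3-token sentence (placeholder values)
token_embeddings = np.array([
    [0.20, -0.10,  0.50,  0.70],   # "Allen"
    [0.90,  0.30, -0.20,  0.10],   # "walks"
    [0.40,  0.80,  0.60, -0.50],   # "dog"
])

# One positional vector per position (placeholder values)
positional_vectors = np.array([
    [0.00,  1.00,  0.00,  1.00],   # position 0
    [0.84,  0.54,  0.01,  1.00],   # position 1
    [0.91, -0.42,  0.02,  1.00],   # position 2
])

# The transformer input is the element-wise sum: what the word is plus where it is
model_input = token_embeddings + positional_vectors
print(model_input.shape)   # (3, 4)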

There are several ways of creating positional encodings. In this article, we explore the most well-known approach: the sinusoidal functions introduced by the authors of “Attention Is All You Need”1.

Positional encoding in transformers


In the original paper by Vaswani et al. (2017), the key idea is to generate a fixed and deterministic encoding for each position in a sequence by using sinusoidal functions, specifically the sine function sin(x) and the cosine function cos(x).

What are sinusoidal functions?

The sine function is a fundamental mathematical concept that produces a smooth, wavelike pattern. In particular, the sine and cosine functions are used by the authors of the original transformer paper to build positional encodings.

If we plot sin(x) and cos(x), we see curves that rise and fall between -1 and 1 in a repeating, periodic pattern.

A few properties of sine that make it powerful for positional encoding: 

  • It is periodic: It repeats regularly over intervals, which is useful for representing repeated patterns.
  • It is smooth and continuous: Small changes in input result in small changes in output, which gives us a way to represent positions in a differentiable space.
  • It is tunable in frequency: By varying the frequency of the waves across embedding dimensions, we can create a rich, multiscale representation of position.

Let us plot the sine and cosine waves to visualize what they look like:

import numpy as np 
import matplotlib.pyplot as plt 

# Create an array of 100 x values evenly spaced from 0 to 2π (approx 6.28)
x = np.linspace(0, 2 * np.pi, 100) 

# Compute the sine of each x value 
sin_values = np.sin(x) 

# Create the plot 
plt.figure(figsize=(5, 2)) 
plt.plot(x, sin_values, label='sin(x)', color='blue') 

# Customize the plot 
plt.title('Sine Function') 
plt.xlabel('x') 
plt.ylabel('Function value') 
plt.axhline(0, color='black', linewidth=0.5) # horizontal line at y=0 
plt.axvline(0, color='black', linewidth=0.5) # vertical line at x=0 
#plt.grid(True, linestyle='--', alpha=0.5) 
plt.legend() 
plt.tight_layout() 

# Show the plot 
plt.show() 

The sine function

And now let's look at how we can plot the cosine function:

# Apply the cosine function to the same array, x
cosine = np.cos(x)

plt.figure(figsize=(5, 2))
plt.plot(x, cosine, label='cos(x)', color='blue')
plt.title('The Cosine Function')
plt.xlabel('x')
plt.ylabel('Function value')
plt.axhline(0, color='black', linewidth=0.5)  # horizontal line at y=0
plt.axvline(0, color='black', linewidth=0.5)  # vertical line at x=0
plt.legend()
plt.tight_layout()

# Show the plot
plt.show()
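Before moving to the formula, it is also worth visualizing the third property from the list above: by changing the frequency, the same sine function can describe position at different scales. The sketch below reuses the numpy and matplotlib imports from the earlier snippet, and the two frequencies are arbitrary choices for illustration only.

# Compare a fast and a slow sine wave over the same range of positions
positions = np.linspace(0, 20, 400)

plt.figure(figsize=(5, 2))
plt.plot(positions, np.sin(positions), label='sin(x): fast, fine-grained', color='blue')
plt.plot(positions, np.sin(positions / 10), label='sin(x/10): slow, coarse-grained', color='orange')
plt.title('Two frequencies of the sine function')
plt.xlabel('x')
plt.ylabel('Function value')
plt.legend()
plt.tight_layout()
plt.show()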

The sinusoidal positional encoding formulas, as defined by the authors of the original transformer paper (Vaswani et al. 2017), are as follows:

For even embedding dimensions:

$PE_{(pos,\,2i)} = \sin\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$

For odd embedding dimensions:

$PE_{(pos,\,2i+1)} = \cos\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$

  •  pos : The position of the word in the sentence (for example, 0 for the first word, 1 for the second and so on); later in this article it is also written as k

  •  i : The index of a dimension pair in the embedding vector; 2i indicates an even embedding dimension and 2i+1 an odd one

  •  d_model : The predefined dimensionality of the token embeddings (for example, 512)

  •  n : A user-defined scalar that serves as the base of the denominator; the original paper uses n = 10000, the value shown in the formulas above

  •  PE : The positional encoding function that maps a position pos and a dimension index to a value


Using this formula, each word at position k receives encoding values determined by its position. Taking the example that we used, “Allen walks dog”, we can calculate the positional encoding for each word, starting from position 0:

  • k = 0: "Allen"
  • k = 1: "walks"
  • k = 2: "dog"
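Before writing any code, it helps to work through one position by hand. For "walks" at position k = 1, with a simplified embedding dimension of d_model = 4 and n = 10000, the formula produces two sine/cosine pairs, one per frequency:

$PE_{(1,\,0)} = \sin(1 / 10000^{0/4}) = \sin(1) \approx 0.8415$

$PE_{(1,\,1)} = \cos(1 / 10000^{0/4}) = \cos(1) \approx 0.5403$

$PE_{(1,\,2)} = \sin(1 / 10000^{2/4}) = \sin(0.01) \approx 0.0100$

$PE_{(1,\,3)} = \cos(1 / 10000^{2/4}) = \cos(0.01) \approx 0.99995$

These values match the k = 1 row of the matrix and table that we compute next.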

Let’s write a simple Python function to calculate the value of PE(k):

import numpy as np
import matplotlib.pyplot as plt

# Create the positional encoding matrix using the formula above
def getPositionEncoding(seq_len, d, n=10000):
    # Instantiate an array of 0s as a starting point
    P = np.zeros((seq_len, d))
    # Iterate through the positions of each word
    for k in range(seq_len):
        # Fill the even (sine) and odd (cosine) dimensions for each position
        for i in np.arange(int(d/2)):
            denominator = np.power(n, 2*i/d)
            P[k, 2*i] = np.sin(k/denominator)
            P[k, 2*i+1] = np.cos(k/denominator)
    return P

Now we call the function with the values from our example: a sequence length of 3, a simplified dimension of d = 4 and n = 10000.

P = getPositionEncoding(seq_len=3, d=4, n=10000) 

print(P) 

We get the following encoding matrix (also referred to as a tensor):

[[ 0.          1.          0.          1.        ]
 [ 0.84147098  0.54030231  0.00999983  0.99995   ]
 [ 0.90929743 -0.41614684  0.01999867  0.99980001]]

To represent this result more concretely, we get:

Word      Position   Dim 0                     Dim 1                     Dim 2                     Dim 3
                     sin(pos / 10000^(0/4))    cos(pos / 10000^(0/4))    sin(pos / 10000^(2/4))    cos(pos / 10000^(2/4))
“Allen”   k = 0      0.000000                  1.000000                  0.000000                  1.000000
“walks”   k = 1      0.841471                  0.540302                  0.010000                  0.999950
“dog”     k = 2      0.909297                  -0.416147                 0.020000                  0.999800

Here we can see the concrete positional encoding values for each word. However, these values are not read off directly to interpret the order of words. Instead, they are used to inject positional information into the input vectors of the transformer. Because each dimension pair uses a sinusoid with a different frequency, each position k produces a distinct pattern of sine and cosine values. These patterns carry information about both the absolute position and the relative position of each word in “Allen walks dog”. In other words, the model can learn to associate these patterns with order, spacing and structure.
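One way to see the relative-position claim concretely: for the sinusoidal scheme, the encoding at position pos + offset equals a fixed linear map (a block-diagonal matrix of 2x2 rotations) applied to the encoding at position pos, no matter what pos is. The short numerical check below is a sketch that reuses the getPositionEncoding function defined earlier; the offset and dimension values are arbitrary choices for illustration.

import numpy as np

d, n, offset = 8, 10000, 5
P = getPositionEncoding(seq_len=50, d=d, n=n)  # function from the earlier code block

# Build a block-diagonal matrix of 2x2 rotations, one block per (sin, cos) pair
M = np.zeros((d, d))
for i in range(d // 2):
    theta = offset / np.power(n, 2 * i / d)  # rotation angle for this pair's frequency
    M[2*i:2*i+2, 2*i:2*i+2] = [[ np.cos(theta), np.sin(theta)],
                               [-np.sin(theta), np.cos(theta)]]

# The encoding at position k + offset equals M applied to the encoding at position k
print(np.allclose(P[offset:], P[:-offset] @ M.T))  # True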

Now let's implement a Python function to visualize the positional encoding matrix as a heatmap:

import numpy as np
import matplotlib.pyplot as plt

def get_position_encoding(seq_len, d_model, n=10000):
    P = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(d_model):
            # Even dimensions use sine, odd dimensions use cosine, sharing a frequency per pair
            angle = pos / np.power(n, (2 * (i // 2)) / d_model)
            P[pos, i] = np.sin(angle) if i % 2 == 0 else np.cos(angle)
    return P

# Parameters
seq_len = 100   # Number of tokens
d_model = 512   # Embedding dimensions

# Generate the positional encoding matrix
P = get_position_encoding(seq_len, d_model)

# Plot the matrix as a heatmap
plt.figure(figsize=(10, 6))
cax = plt.imshow(P, cmap='viridis', aspect='auto')
plt.title("Sinusoidal Positional Encoding Heatmap")
plt.xlabel("Embedding Dimension")
plt.ylabel("Token Position")
plt.colorbar(cax)
plt.tight_layout()
plt.show()

Final thoughts

As we can see, because each dimension uses a sinusoid of a different frequency, every input position k is mapped to a distinct vector of values in the range [-1, 1], the range of the sine and cosine functions. From there, our encoder- and decoder-based transformer models learn from and preserve the positional encoding of each word, allowing the model to retain order information during training. Because the encoded position vectors are fixed rather than learned sequentially, they stay static throughout training and allow for parallel computation.

Footnotes

1. Ashish Vaswani et al., “Attention Is All You Need”, Proceedings of the 31st International Conference on Neural Information Processing Systems, arXiv:1706.03762v7, revised 2 August 2023.

2. Sepp Hochreiter and Jürgen Schmidhuber, “Long Short-Term Memory”, Neural Computation 9, no. 8 (15 November 1997): 1735–1780.

3. Alex Sherstinsky, “Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network”, Physica D: Nonlinear Phenomena, Volume 404, March 2020: Special Issue on Machine Learning and Dynamical Systems.

 
