Weaving patterns with artificial intelligence, Part 3

# Using Markov Chains to generate language from letter correlation matrices and N-grams

Modeling natural language characteristics at the level of the word and generating frequency plots

## This content is part of the series: Weaving patterns with artificial intelligence, Part 3

Stay tuned for additional content in this series.


In the first two tutorials of this series, we looked at how to compile statistics from text, and in particular, the frequencies of particular sequences of letters or words, called N-grams. In Part 2, I also showed you how to easily get a rich and comprehensive text corpus for such applications.

Going back to the metaphor from Part 1, imagine you had monkeys that use typewriters to produce text, and that the typewriters were rigged in such a way that the probability of striking a particular key is tied to the expected frequency of the N-gram that keystroke would produce. Just to give one example, if the typewriter were derived from English trigram frequencies and the monkey had typed EQ, the probability that the monkey next types U should be much higher than any other option.

## Capturing N-gram frequencies

We can simulate the monkey at a probabilistic typewriter in code through the power of random number generation utilities, which can select according to distributions. First, we need to capture the N-grams generated so far into a reusable form.
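For instance, Python's standard library provides `random.choices`, which selects items according to relative weights. Here is a minimal sketch, with made-up weights, of the kind of biased selection the rigged typewriter performs:

```python
import random
from collections import Counter

#Made-up weights for illustration: after "eq", "u" should dominate
candidates = ['u', 'a', 'i']
weights = [95, 3, 2]

#Draw many samples and tally how often each candidate is selected
tally = Counter(random.choices(candidates, weights=weights, k=10000))
print(tally.most_common())
```

Over many draws, "u" should be chosen roughly 95 percent of the time, which is exactly the behavior we want from a typewriter rigged with N-gram statistics.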

The following code is a variation on code from Part 1. It writes computed N-gram frequencies to a file.

##### Listing 1. store_letter_ngram_counts.py
```
import sys
import json
from nltk.probability import FreqDist
from nltk.util import ngrams
from nltk.tokenize import RegexpTokenizer

#Set up a tokenizer that captures only lowercase letters and spaces
#This requires that input has been preprocessed to lowercase all letters
TOKENIZER = RegexpTokenizer("[a-z ]")

def count_ngrams(input_fp, frequencies, order, buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each N-gram (sequence of N characters) appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each N-gram to its counted frequency
        order -- The N in each N-gram (i.e. number of items)
        buffer_size -- incremental quantity of text to be read at a time,
            in bytes (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    #Read the first chunk of text, and set all letters to lowercase
    text = input_fp.read(buffer_size).lower()
    #Loop over the file while there is text to read
    while text:
        #This step is needed to collapse runs of space characters into one
        text = ' '.join(text.split())
        spans = TOKENIZER.span_tokenize(text)
        tokens = (text[begin : end] for (begin, end) in spans)
        for ngram in ngrams(tokens, order):
            #Increment the count for the N-gram. FreqDist automatically handles
            #any N-gram not seen before. The join expression turns N separate
            #single-character strings into one N-character string
            if '  ' not in ''.join(ngram):
                frequencies[''.join(ngram)] += 1
        #Read the next chunk of text, and set all letters to lowercase
        text = input_fp.read(buffer_size).lower()
    return

if __name__ == '__main__':
    #Initialize the mapping
    frequencies = FreqDist()
    #The order of the ngrams is the first command-line argument
    ngram_order = int(sys.argv[1])
    #Pull the input data from the console
    count_ngrams(sys.stdin, frequencies, ngram_order)
    outputfp = open(sys.argv[2], 'w')
    json.dump(dict(frequencies), outputfp)
    print('Stored frequencies of {} encountered N-grams.'.format(len(frequencies)))
```

Run the program as follows:

`python store_letter_ngram_counts.py 3 oanc-trigrams.json < oanc.txt`

This results in a file, oanc-trigrams.json, with the trigram frequencies in JSON form. It prints to the console the number of distinct N-grams encountered, in this case:

`Stored frequencies of 15833 encountered N-grams.`
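Because the frequencies are stored as plain JSON, it's easy to sanity-check a stored file before moving on. This sketch uses a tiny made-up mapping in place of the real oanc-trigrams.json, but the loading and sorting logic is the same:

```python
import json

#Tiny stand-in for the real file; in practice you would use
#frequencies = json.load(open('oanc-trigrams.json'))
sample = {'the': 350, 'tha': 100, 'thr': 50, ' th': 400}
with open('sample-trigrams.json', 'w') as fp:
    json.dump(sample, fp)

frequencies = json.load(open('sample-trigrams.json'))
#List the N-grams in descending order of frequency
top = sorted(frequencies, key=frequencies.get, reverse=True)
print(top[:10])
```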

I also generated 5-gram and 7-gram data sets to use later:

```
python store_letter_ngram_counts.py 5 oanc-5grams.json < oanc.txt
python store_letter_ngram_counts.py 7 oanc-7grams.json < oanc.txt
```

## Rigging the typewriters

Now comes the fun part. We need a separate program to read the N-gram frequencies from the file and use them to generate text. It needs to do some additional computation to determine how likely one letter is to occur after a known sequence of other letters. For example, when working with trigrams, it needs to compute how often E follows TH compared to how often A, Z, or any other letters do. The logic involved can be a little confusing, so I thought I should start with a diagram that shows a few simplified steps in the process.

There are several simplifications made. First, we assume that the process has already generated the leading text "TH." It just has to figure out which letter comes next. The diagram only shows three options per step, rather than the full 27 (letters plus space). It illustrates with some mocked-up frequencies. If in the training model it found 100 instances of the trigram "THA," 350 of "THE," and 50 of "THR," then to follow that model, it would have a 70-percent chance of selecting "E" as the next letter, completing the "THE" trigram.

In the second step it chooses a space character, the 73-percent option. In the diagram, space characters are displayed as underscores for clarity.

Of course, it is a random selection in each step, and it won't always pick the highest-probability item. In the third step, it picks the very low-probability "Q" to complete the trigram "T Q." Pretend that it goes on like this to produce "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG."
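The arithmetic behind each step can be made concrete. Using the mocked-up counts from above (100 of "THA", 350 of "THE", and 50 of "THR"), the transition probabilities out of the state "TH" fall out of a simple normalization:

```python
import random

#Mocked-up counts of trigram completions for the lead "th"
completions = {'a': 100, 'e': 350, 'r': 50}
total = sum(completions.values())

#Normalize raw counts into transition probabilities
probabilities = {c: n / total for c, n in completions.items()}
print(probabilities)

#random.choices works directly from the raw counts as weights,
#so explicit normalization isn't even necessary in the generator
chosen = random.choices(list(completions), weights=list(completions.values()))[0]
```

As expected, "e" comes out with probability 0.7, the 70-percent chance of completing "THE".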

I hope you get a picture of what's going on in this process of generating text.

### Markov process

This generation of text is an example of what is called a Markov process or Markov chain. At each point, the context or lead (here, the most recent two items generated) makes up the current state. There are multiple possible transitions to the next state, one of which is selected with a probability based on the frequency of the N-gram it represents. At that point, the new state is the final item of the old lead plus the new item generated.

The combination of state with random selection of transitions based on probability is a Markov process. It's a very important concept in computer science, but I won't be delving too deeply into the theoretical details in this tutorial. If you are interested, I recommend Setosa's visual and interactive explainer.

## Getting the ball rolling

The basic process of text generation is easily explained, but there are many tricky areas that are very sensitive to the strategies you select in implementation. One such area is how you start the process. What is the initial state? In the trigram example illustrated previously, the initial state is arbitrarily chosen as "TH." You could decide to always use the same initial state, ideally one with many possible trigram completions, but observers of the output will notice the fixed pattern at the start.

Another possible strategy is to select N-1 initial characters at random, based on the overall frequency of each letter. The following code uses this strategy to generate text. I've added liberal comments to help you follow along.

##### Listing 2. generate_letters_strategy1.py
```
import sys
import json
import string
import random

#Population is all the possible items that can be generated
population = ' ' + string.ascii_lowercase

def preprocess_frequencies(frequencies, order):
    '''Compile simple mapping from N-grams to frequencies into data structures to
    help compute the probability of state transitions to complete an N-gram

    Arguments:
        frequencies -- mapping from N-gram to frequency recorded in the training text
        order -- The N in each N-gram (i.e. number of items)

    Returns:
        sequencer -- mapping from each N-1 lead sequence to the frequency of each
            possible item completing it
        item_freqs -- individual frequency of each item in the training text
    '''
    sequencer = {}
    item_freqs = {}
    for ngram in frequencies:
        #Separate the N-1 lead of each N-gram from its final item
        freq = frequencies[ngram]
        lead = ngram[:-1]
        final = ngram[-1]
        sequencer.setdefault(lead, {})
        sequencer[lead].setdefault(final, 0)
        sequencer[lead][final] += freq
        #Accumulate the overall frequency of each item
        for c in ngram:
            item_freqs.setdefault(c, 0)
            item_freqs[c] += freq
    return sequencer, item_freqs

def generate_letters(sequencer, item_freqs, length, order):
    '''Generate text based on probabilities derived from statistics for initializing
    and continuing sequences of letters

    Arguments:
        sequencer -- mapping from each leading sequence to frequencies of the next letter
        item_freqs -- mapping from each item to its overall frequency regardless of sequence
        length -- approximate number of characters to generate before ending the program
        order -- The N in each N-gram (i.e. number of items)

    Returns:
        nothing
    '''
    #The lead is the initial part of the N-gram to be completed, of length N-1,
    #containing the last N-1 items produced
    lead = ''
    #Keep track of how many items have been generated
    generated_count = 0
    #Turn item frequencies into weights for selection probability
    item_weights = [ item_freqs.get(c, 0) for c in population ]
    while generated_count < length:
        #This condition will be true until the initial lead N-gram is constructed
        #It will also be true if we get to a dead end where there are no stats
        #for the next item from the current lead
        if len(lead) < order - 1 or lead not in sequencer:
            #random.choices returns a list, length 1. Extract that one item.
            chosen = random.choices(population, weights=item_weights)[0]
            #end='' prevents a newline from being printed after each letter
            #flush=True forces output to be displayed right away, not buffered
            print(chosen, end='', flush=True)
            #If needed to accommodate the new item, clip the first item from
            #the lead, then tack on the new item
            lead = (lead + chosen)[-(order - 1):]
            generated_count += 1
        else:
            freq = sequencer[lead]
            weights = [ freq.get(c, 0) for c in population ]
            chosen = random.choices(population, weights=weights)[0]
            print(chosen, end='', flush=True)
            #Clip the first item from the lead and tack on the new item
            lead = lead[1:] + chosen
            generated_count += 1
    return

if __name__ == '__main__':
    #File with N-gram frequencies is the first argument
    raw_freq_fp = open(sys.argv[1])
    raw_freqs = json.load(raw_freq_fp)
    #Length of text to generate is the second argument
    length = int(sys.argv[2])
    #Figure out the N-gram order. Just pull the first N-gram and check its length
    order = len(next(iter(raw_freqs)))
    sequencer, item_freqs = preprocess_frequencies(raw_freqs, order)
    generate_letters(sequencer, item_freqs, length, order)
```

Run the program as follows:

`python generate_letters_strategy1.py oanc-trigrams.json 500`

The first command-line argument is the JSON file of the N-grams, and the second is the length of the text sequence to generate. You'll get different output each time you do this, of course. The first time I ran it, I got the following output.

As you can see, it looks like a weird patch-up of English, with a few runs of actual words, "by the of" and "right thigh."

### Limitations of the strategy

Let's see how the results improve by using higher-order N-grams. First, fifth order:

`python generate_letters_strategy1.py oanc-5grams.json 500`

The first time I got the following.

This is exciting. It's now unmistakably English, with the occasional flourish in style, starting almost right away with "echoes so help meet the locations that a."

Incidentally, you might notice this time that it's slower to get going. The higher the order of the N-grams, the more work needs to be done to initialize the transition probabilities, and this effect grows exponentially with the order, so working with 7-grams will be much slower again. Luckily, the hardest bit only needs to be done once, so if you have code that generates text in long-running sessions, the delay will only be in launching the program in the first place.
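One practical consequence: because the expensive preprocessing only needs to happen once per data set, you could cache the computed structures between runs. This is a hedged sketch using pickle, with a toy stand-in for the real `preprocess_frequencies` and a hypothetical cache file name, not part of the listings in this tutorial:

```python
import os
import pickle

def cached_build(cache_path, build_func, *args):
    '''Build an expensive structure once and reload it on later runs'''
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    result = build_func(*args)
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result

#Toy stand-in for preprocess_frequencies, for illustration only
def build_sequencer(frequencies):
    sequencer = {}
    for ngram, freq in frequencies.items():
        lead, final = ngram[:-1], ngram[-1]
        sequencer.setdefault(lead, {}).setdefault(final, 0)
        sequencer[lead][final] += freq
    return sequencer

demo_freqs = {'the': 350, 'tha': 100}
seq1 = cached_build('demo-sequencer.pickle', build_sequencer, demo_freqs)
#The second call loads from the cache instead of rebuilding
seq2 = cached_build('demo-sequencer.pickle', build_sequencer, demo_freqs)
```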

Unfortunately, the excitement of English-like output was short-lived. On my third run, I got the following.

This is even less coherent than the trigram-based example.

It became pretty clear that a large part of the problem is the initialization strategy. The first two letters "TN" are each common enough to have a high chance in the initial selection, but as a lead to find further 5-grams, this rare combination fails, and the program never really finds an equilibrium of 5-grams to launch from.

## A new initialization strategy

What if, instead of selecting random individual letters to initialize, we kept a list of, say, the 10,000 most common N-grams, and always started with one of these selected at random? This would give the routine a strong push from the start every time.

The following code uses this new strategy to generate text. It differs from the previous listing mainly in how sequences are initialized.

##### Listing 3. generate_letters_strategy2.py
```
import sys
import json
import string
import random

POPULAR_NGRAM_COUNT = 10000

#Population is all the possible items that can be generated
population = ' ' + string.ascii_lowercase

def preprocess_frequencies(frequencies, order):
    '''Compile simple mapping from N-grams to frequencies into data structures to
    help compute the probability of state transitions to complete an N-gram

    Arguments:
        frequencies -- mapping from N-gram to frequency recorded in the training text
        order -- The N in each N-gram (i.e. number of items)

    Returns:
        sequencer -- mapping from each N-1 lead sequence to the frequency of each
            possible item completing it
        popular_ngrams -- list of most common N-grams
    '''
    sequencer = {}
    ngrams_sorted_by_freq = [
        k for k in sorted(frequencies, key=frequencies.get, reverse=True)
    ]
    popular_ngrams = ngrams_sorted_by_freq[:POPULAR_NGRAM_COUNT]
    for ngram in frequencies:
        #Separate the N-1 lead of each N-gram from its final item
        freq = frequencies[ngram]
        lead = ngram[:-1]
        final = ngram[-1]
        sequencer.setdefault(lead, {})
        sequencer[lead].setdefault(final, 0)
        sequencer[lead][final] += freq
    return sequencer, popular_ngrams

def generate_letters(sequencer, popular_ngrams, length, order):
    '''Generate text based on probabilities derived from statistics for initializing
    and continuing sequences of letters

    Arguments:
        sequencer -- mapping from each leading sequence to frequencies of the next letter
        popular_ngrams -- list of the highest frequency N-grams
        length -- approximate number of characters to generate before ending the program
        order -- The N in each N-gram (i.e. number of items)

    Returns:
        nothing
    '''
    #The lead is the initial part of the N-gram to be completed, of length N-1,
    #containing the last N-1 items produced
    lead = ''
    #Keep track of how many items have been generated
    generated_count = 0
    while generated_count < length:
        #This condition will be true until the initial lead N-gram is constructed
        #It will also be true if we get to a dead end where there are no stats
        #for the next item from the current lead
        if len(lead) < order - 1 or lead not in sequencer:
            #Pick an N-gram at random from the most popular
            reset = random.choice(popular_ngrams)
            #Drop the final item so that the lead is N-1 items long
            lead = reset[:-1]
            for item in lead:
                print(item, end='', flush=True)
                generated_count += 1
        else:
            freq = sequencer[lead]
            weights = [ freq.get(c, 0) for c in population ]
            chosen = random.choices(population, weights=weights)[0]
            print(chosen, end='', flush=True)
            #Clip the first item from the lead and tack on the new item
            lead = lead[1:] + chosen
            generated_count += 1
    return

if __name__ == '__main__':
    #File with N-gram frequencies is the first argument
    raw_freq_fp = open(sys.argv[1])
    raw_freqs = json.load(raw_freq_fp)
    #Length of text to generate is the second argument
    length = int(sys.argv[2])
    #Figure out the N-gram order. Just pull the first N-gram and check its length
    order = len(next(iter(raw_freqs)))
    sequencer, popular_ngrams = preprocess_frequencies(raw_freqs, order)
    generate_letters(sequencer, popular_ngrams, length, order)
```

Run the program as follows:

`python generate_letters_strategy2.py oanc-5grams.json 500`

This time I ran the program dozens of times and never saw it deviate too far from English-looking text.

While we're here, let's have a look at what the 7-gram-based text generator produces. The following shows my first run.

It's now starting to feel like something that should really be intelligible, but just not quite coherent. The ending, "left the documentation…" feels like some sort of technical text in an unfamiliar field.

Just to give a sense of the many other things you can tweak, you might consider introducing a small random chance that the program will output a letter that is not based on one of the N-grams in the training corpus. There are many ways to subtly tweak this basic text-generation program to change its behaviors and characteristics.
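One way to realize that particular tweak is to add a small "mutation rate" under which the generator ignores the model and picks a letter uniformly at random. The function name and rate below are my own invention for this sketch, not part of the listings:

```python
import random
import string

#Population is all the possible items that can be generated
population = ' ' + string.ascii_lowercase

def choose_next(freq, mutation_rate=0.01):
    '''Pick the next letter from the model weights for the current lead,
    but with a small chance (mutation_rate) ignore the model entirely'''
    if random.random() < mutation_rate:
        return random.choice(population)
    weights = [freq.get(c, 0) for c in population]
    return random.choices(population, weights=weights)[0]

#With mutation_rate=0, the choice always respects the model
freq = {'e': 350, 'a': 100, 'r': 50}
letters = {choose_next(freq, mutation_rate=0) for _ in range(1000)}
```

A rate of around 1 percent keeps the output mostly model-driven while occasionally jolting the chain out of repetitive patterns.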

## Preparing to work with words

As you saw from the 7-gram output example, this text generation based on letter correlation approaches intelligibility, but there's a subtle linguistic barrier that's really hard to probe until you start working at the granularity of words. In Part 2, I showed how to gather word-correlation statistics. The following code stores them in preparation for word generation.

##### Listing 4. store_word_ngram_counts.py
```
import sys
import json
from nltk.probability import FreqDist
from nltk.util import ngrams
from nltk.tokenize import RegexpTokenizer

#Set up a tokenizer that only captures words
#Requires that input has been preprocessed to lowercase all letters
TOKENIZER = RegexpTokenizer("[a-z]+")

def count_ngrams(input_fp, frequencies, order, buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each N-gram (sequence of N words) appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each N-gram to its counted frequency
        order -- The N in each N-gram (i.e. number of items)
        buffer_size -- incremental quantity of text to be read at a time,
            in bytes (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    #Read the first chunk of text, and set all letters to lowercase
    text = input_fp.read(buffer_size).lower()
    #Loop over the file while there is text to read
    while text:
        #This step is needed to collapse runs of space characters into one
        text = ' '.join(text.split())
        spans = TOKENIZER.span_tokenize(text)
        tokens = (text[begin : end] for (begin, end) in spans)
        for ngram in ngrams(tokens, order):
            #Join ngrams into a single space-separated string
            ngram_text = ' '.join(ngram)
            #Extra layer to make sure no multiple runs of spaces sneak through
            ngram_text = ' '.join(ngram_text.split())
            frequencies[ngram_text] += 1
        #Read the next chunk of text, and set all letters to lowercase
        text = input_fp.read(buffer_size).lower()
    return

if __name__ == '__main__':
    #Initialize the mapping
    frequencies = FreqDist()
    #The order of the ngrams is the first command-line argument
    ngram_order = int(sys.argv[1])
    #Pull the input data from the console
    count_ngrams(sys.stdin, frequencies, ngram_order)
    outputfp = open(sys.argv[2], 'w')
    json.dump(dict(frequencies), outputfp)
    print('Stored frequencies of {} encountered N-grams.'.format(len(frequencies)))
```

Run the program as follows:

`python store_word_ngram_counts.py 3 oanc-3word-grams.json < oanc.txt`

I also generated 5-gram and 7-gram data sets to use later:

```
python store_word_ngram_counts.py 5 oanc-5word-grams.json < oanc.txt
python store_word_ngram_counts.py 7 oanc-7word-grams.json < oanc.txt
```

There are many more distinct N-grams when you're working at the word level: 8,724,842 trigrams, 13,467,874 5-grams, and 13,897,093 7-grams.

## Generating text as N-grams of words

The following code is a program to take the stored statistics and use them to generate text.

##### Listing 5. generate_words.py
```
import sys
import json
import string
import random

POPULAR_NGRAM_COUNT = 10000

def preprocess_frequencies(frequencies, order):
    '''Compile simple mapping from N-grams to frequencies into data structures to
    help compute the probability of state transitions to complete an N-gram

    Arguments:
        frequencies -- mapping from N-gram to frequency recorded in the training text
        order -- The N in each N-gram (i.e. number of items)

    Returns:
        sequencer -- mapping from each N-1 lead sequence to the frequency of each
            possible item completing it
        popular_ngrams -- list of most common N-grams
        all_words -- list of all unique words that occur in the training text
    '''
    sequencer = {}
    ngrams_sorted_by_freq = [
        k for k in sorted(frequencies, key=frequencies.get, reverse=True)
    ]
    popular_ngrams = ngrams_sorted_by_freq[:POPULAR_NGRAM_COUNT]
    all_words = set()
    for ngram in frequencies:
        #Separate the N-1 lead of each N-gram from its final item
        freq = frequencies[ngram]
        words = ngram.split()
        final = words[-1]
        #A rule of Python means we must convert a list to a tuple before using it
        #as a mapping key
        lead = tuple(words[:-1])
        sequencer.setdefault(lead, {})
        sequencer[lead].setdefault(final, 0)
        sequencer[lead][final] += freq
        for word in words:
            #We use the set data structure to keep the words unique
            all_words.add(word)
    #But the way we need to use it, we must convert to list
    all_words = list(all_words)
    return sequencer, popular_ngrams, all_words

def generate_words(sequencer, popular_ngrams, all_words, length, order):
    '''Generate text based on probabilities derived from statistics for initializing
    and continuing sequences of words

    Arguments:
        sequencer -- mapping from each leading sequence to frequencies of the next word
        popular_ngrams -- list of the highest frequency N-grams
        all_words -- list of all unique words that occur in the training text
        length -- approximate number of words to generate before ending the program
        order -- The N in each N-gram (i.e. number of items)

    Returns:
        nothing
    '''
    #The lead is the initial part of the N-gram to be completed, of length N-1,
    #containing the last N-1 items produced
    lead = ()
    #Keep track of how many items have been generated
    generated_count = 0
    while generated_count < length:
        #This condition will be true until the initial lead N-gram is constructed
        #It will also be true if we get to a dead end where there are no stats
        #for the next item from the current lead
        if len(lead) < order - 1 or lead not in sequencer:
            #Pick an N-gram at random from the most popular
            reset = random.choice(popular_ngrams)
            #Split it up into a tuple to serve as a lead,
            #dropping the final item so that the lead is N-1 items long
            lead = tuple(reset.split()[:-1])
            for item in lead:
                #Note we now print a space between items, which are words
                print(item, end=' ', flush=True)
                generated_count += 1
        else:
            freq = sequencer[lead]
            weights = [ freq.get(w, 0) for w in all_words ]
            chosen = random.choices(all_words, weights=weights)[0]
            print(chosen, end=' ', flush=True)
            #Clip the first item from the lead and tack on the new item
            lead = lead[1:] + (chosen,)
            generated_count += 1
    return

if __name__ == '__main__':
    #File with N-gram frequencies is the first argument
    raw_freq_fp = open(sys.argv[1])
    raw_freqs = json.load(raw_freq_fp)
    #Length of text to generate is the second argument
    length = int(sys.argv[2])
    #Figure out the N-gram order. Pull the first N-gram and count its words
    order = len(next(iter(raw_freqs)).split())
    sequencer, popular_ngrams, all_words = preprocess_frequencies(raw_freqs, order)
    generate_words(sequencer, popular_ngrams, all_words, length, order)
```

Run the program as follows:

`python generate_words.py oanc-3word-grams.json 100`

The second command-line argument is still the length of the text sequence to generate, but this time counted in words, not letters. The results are very interesting, of course. From the trigrams:

From 5-grams:

From 7-grams:

I'll just leave you to marvel, as I did, at the final couple of lines of the 5-gram generated text and all of the 7-gram text.

## Conclusion

I must stress again just how easy it is to make small changes in the workings of these programs and get very interesting variations. It's an area where experimentation and exploration in code pay off greatly.

You have learned how to get AI to do more than just classify things, including generating patterns, and in particular, language based on model text. You can use such N-gram generation techniques to produce all sorts of patterns, including images and sound. After you start to play around with generative AI techniques, you will find myriad possibilities opening up for taking your work in AI to the next level.