Weaving patterns with artificial intelligence, Part 1

Letter correlation and simple language statistics for AI

Techniques for modeling natural language characteristics in support of generative artificial intelligence

As artificial intelligence and cognitive computing have gained interest in recent years, a lot of the attention has been on varieties of classifier systems. Classifier systems look for patterns to establish a given probability that a thing is of a certain type. Simple rules and actions can then be run based on this classification. For example, if a certain pattern of credit card transactions is classified as suspicious with greater than 80% probability, issue a fraud alert. Or if a certain pattern of shapes and movements from a self-driving car's camera looks like a road obstacle, press the brakes.

These techniques are very important, but there is much more to AI. In fact, the iconic milestone for AI, the Turing Test, requires not only that the machine recognizes patterns, but also generates its own patterns of activity, as humans do.

This isn't just a matter of theoretical interest. Some recent AI applications, such as chatbots, have run into difficulty in practice because their interactions with people are too mechanical. There is increasing interest in generative AI techniques, which start with a learned model of patterns and use that model to try to generate imitations of the patterns. This leads not only to more flexible and customer-friendly AI applications, but also to more efficient ones. Generative AI techniques often converge very quickly on the subset of pattern features that characterize the real world, allowing data sets to be optimized.

To understand how generative techniques can take AI to the next level, it is important to understand their basics. The classic area of interest is still one of the most interesting: natural language.

Monkeys at the typewriter

A very old and interesting idea was expressed by Sir Arthur Eddington in lectures at Cambridge University in the 1920s. The idea is that if you got an army of monkeys to pound away at typewriters, and waited long enough, there is some probability that they would produce, say, all the works of Shakespeare. In other words, in any long enough sequence of randomly produced letters, you will eventually find familiar language. Of course, waiting long enough in the case of the monkeys at the typewriters might mean waiting trillions upon trillions of years. The probabilities of truly random strings of letters producing familiar language are just too small.

But what if the monkeys were not just using regular typewriters, but ones that were rigged so that the probability of striking a key depends on how often that letter appears in familiar language? E is the most common letter in English, and other common letters include T, A, N, and I. If the monkeys were more likely to strike these keys than others, you might improve the probability of producing familiar language.

In such a case, each monkey's typewriter serves as a model of the familiar language you are hoping to imitate with the army of monkeys. However, to construct such models you would want to observe the patterns in samples of familiar language. This is just like training models for neural networks or other learning systems. Before you figure out how to generate language from such models, you must learn how to construct them in the first place.
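
To make the rigged-typewriter idea concrete, here is a minimal sketch using Python's random.choices to strike keys with weighted probabilities. The weights here are hypothetical, loosely inspired by English letter frequencies; in a real application they would come from a trained model like the ones built below.

import random

#Hypothetical weights for a handful of common characters; a trained model
#would supply weights for the full character set
WEIGHTS = {' ': 18, 'e': 12, 't': 9, 'a': 8, 'o': 7, 'i': 7, 'n': 7, 's': 6, 'h': 6, 'r': 6}

#One monkey at one rigged typewriter: strike 40 keys, favoring common letters
keys = list(WEIGHTS.keys())
print(''.join(random.choices(keys, weights=list(WEIGHTS.values()), k=40)))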

Basic letter frequency

The following code listing is a Python program (written for Python 3) that takes a text file and computes the frequency of each character that appears in the file.

Listing 1. charcounter1.py
import sys
import pprint

def count_chars(input_fp, frequencies, buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each character appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each character to its counted frequency
        buffer_size -- incremental quantity of text to be read at a time,
            in bytes (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    #Read the first chunk of text
    text = input_fp.read(buffer_size)
    #Loop over the file while there is text to read
    while text:
        for c in text:
            #Accommodate the character if seen for the first time
            frequencies.setdefault(c, 0)
            #Increment the count for the present character
            frequencies[c] += 1
        #Read the next chunk of text
        text = input_fp.read(buffer_size)

    return


if __name__ == '__main__':
    #Initialize the mapping
    frequencies = {}
    #Pull the input data from the console
    count_chars(sys.stdin, frequencies)
    #Display the resulting frequencies in readable format
    pprint.pprint(frequencies)

I fed it the text of "Big-brained data, Part 1," a developerWorks tutorial of mine published in 2017. On Linux, I ran it as follows.

python charcounter1.py < bigbraineddata1.txt

Here are some snippets from the output.

{'\n': 109,
' ': 3740,
'"': 22,
"'": 32,
'(': 11,
')': 11,
',': 189,
'-': 26,
'.': 146,
'0': 9,
'1': 8,
'2': 11,
…
'9': 2,
':': 2,
'?': 4,
'A': 63,
'B': 6,
'C': 11,
'D': 7,
'E': 6,
'F': 10,
'G': 7,
'H': 14,
'I': 69,
…
'W': 3,
'Y': 13,
'a': 1618,
'b': 255,
'c': 646,
'd': 613,
'e': 2152,
'f': 421,
'g': 372,
'h': 782,
'i': 1388,
…
'y': 291,
'z': 14,
'é': 1,
'—': 2}

Several things leap out right away. The problem statement sounded simple enough, but natural language always brings layers of complications. The highest frequency of all is not for a letter but for the space character. This is just as important a part of the language model as the letter frequencies. A machine imitating language must also imitate the use of word spacing. It should probably mimic the use of other sorts of punctuation too, and you can see these in the output as well.

The model for a chatbot might not need to incorporate line breaks if all of its responses are one-liners. Depending on the operating system, you might also need to normalize the different carriage return/line feed conventions for new lines.

In real-world applications, you might also have to think about accented characters and others from Unicode.
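
For example, here is a rough preprocessing sketch, assuming a policy of folding Windows-style line endings into plain newlines and stripping accents before counting. The right normalization policy depends on your application; this is just one reasonable choice.

import unicodedata

def normalize_text(text):
    #Fold carriage return/line feed conventions into a single newline
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    #Decompose accented characters and drop the combining marks,
    #so that 'é' becomes 'e'
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))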

Capitalization

You also need to decide whether your model differentiates uppercase from lowercase letters. Assuming that you want to generate language with normal uppercase/lowercase conventions from the model, you could do that by keeping uppercase and lowercase statistics separately. You could also code the language generation to capitalize after a period, in special cases such as the isolated pronoun "i," and perhaps for a list of known proper nouns. Combining uppercase and lowercase can save a lot of storage for the models as the statistics get more sophisticated. From now on in this tutorial series, I'll be converting capital letters to lowercase in all language models.
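
As an illustration of the kind of post-processing I mean, here is a rough sketch of recapitalizing all-lowercase generated text. It handles sentence starts and the isolated pronoun "i"; a fuller version might also consult a list of known proper nouns. The function name and example text are mine, for illustration only.

import re

def recapitalize(text):
    #Capitalize the first letter of the text and any letter that follows
    #sentence-ending punctuation
    text = re.sub(r'(^|[.!?]\s+)([a-z])',
                  lambda m: m.group(1) + m.group(2).upper(), text)
    #Capitalize the isolated first-person pronoun
    return re.sub(r'\bi\b', 'I', text)

print(recapitalize('i think so. it works, and i hope it helps.'))
#Prints: I think so. It works, and I hope it helps.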

You can see some of the significance of capitalization in the language in the frequencies above. Among the lowercase letters, "e" is the most common by far. For uppercase, it's "I," followed by "A," then "T." The "I" is, of course, from the first-person pronoun. "E" might be the most common letter in English, but it's not so common at the beginning of a word, meaning it's less likely to enjoy initial sentence capitalization. Instead, you get a lot of "A" and "T" from the indefinite and definite articles.

Listing 2 is a version of Listing 1 with some tweaks according to the above observations.

Listing 2. charcounter2.py
import sys
import pprint
import string

WHITELIST_CHARACTERS = string.ascii_lowercase + ' '

def count_chars(input_fp, frequencies, whitelist=WHITELIST_CHARACTERS,
                    buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each character appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each character to its counted frequency
        whitelist -- string containing all characters to be included in the stats
            defaults to all lowercase letters and space
        buffer_size -- incremental quantity of text to be read at a time,
            in bytes (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    #Read the first chunk of text, and set all letters to lowercase
    text = input_fp.read(buffer_size).lower()
    #Loop over the file while there is text to read
    while text:
        for c in text:
            if c in whitelist:
                #Accommodate the character if seen for the first time
                frequencies.setdefault(c, 0)
                #Increment the count for the present character
                frequencies[c] += 1
        #Read the next chunk of text, and set all letters to lowercase
        text = input_fp.read(buffer_size).lower()

    return


if __name__ == '__main__':
    #Initialize the mapping
    frequencies = {}
    #Pull the input data from the console
    count_chars(sys.stdin, frequencies)
    #Display the resulting frequencies in readable format
    pprint.pprint(frequencies)

Here is the entire output.

{' ': 3740,
'a': 1681,
'b': 261,
'c': 657,
'd': 620,
'e': 2158,
'f': 431,
'g': 379,
'h': 796,
'i': 1457,
'j': 14,
'k': 99,
'l': 739,
'm': 511,
'n': 1267,
'o': 1419,
'p': 408,
'q': 37,
'r': 1110,
's': 1294,
't': 1762,
'u': 556,
'v': 183,
'w': 256,
'x': 57,
'y': 304,
'z': 14}

Letter sequences

In the third tutorial in the series, I'll demonstrate the sorts of results you get from generating language from a model based on such a simple frequency mapping. But as you can probably imagine, a bit more work is needed to get to a model that can generate anything like familiar language.

The next step is to consider not just how frequently a given letter or character occurs, but how frequently one given letter is followed by another. For example, the letter "e" is frequently followed by the letter "d" because that combination is common in English past tense. You'll often find "h" after "t" because those two letters begin the definite article. These frequencies give the correlation between a letter and the letter likely to follow it.

You can record the letter correlation frequencies mathematically in a 2-dimensional matrix. Figure 1 is a table that illustrates a subset of this matrix with the frequencies of sequences involving space, a, b, and c in the "Big-brained data" tutorial.

Figure 1. Subset of the letter correlation matrix, showing the frequencies of sequences involving space, "a," "b," and "c"

Figure 1 shows that "a" starts a word 510 times (space followed by "a") and ends a word 160 times ("a" followed by space). The sequence "ca" is pretty common, appearing 104 times, while the sequences "aa," "bc," "cb," and "bb" do not appear at all. Clearly, I didn't mention the bubble sort algorithm in that tutorial.

Of course, there are several ways to represent such a matrix as a data structure. You could use a dimensioned array, with a memory slot allocated for the frequency in each cell of the table. This requires 27*27, or 729, memory slots for the full array: one row and one column for each letter, plus the space character. You could also create a mapping from each letter sequence to its frequency, with zero-frequency correlations omitted. This can be more efficient when there are a lot of blank cells; such arrangements are called sparse arrays.

Note that even the mapping that is used to record the individual letter frequencies is a representation of a matrix, of 1 dimension rather than 2 dimensions.
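
Here is a rough comparison of the two representations, assuming the 26 lowercase letters plus space. The bigram code later in this tutorial takes the sparse, mapping-based approach; the counts shown are taken from Figure 1.

import string

ALPHABET = string.ascii_lowercase + ' '   #27 characters

#Dense representation: a full 27 x 27 table, mostly zeros
dense = [[0] * len(ALPHABET) for _ in ALPHABET]
dense[ALPHABET.index('c')][ALPHABET.index('a')] = 104   #frequency of "ca"
dense[ALPHABET.index('a')][ALPHABET.index('b')] = 46    #frequency of "ab"

#Sparse representation: only the bigrams actually observed are stored
sparse = {'ca': 104, 'ab': 46}

#Both answer the same question; the sparse form simply omits the zero cells
print(dense[ALPHABET.index('c')][ALPHABET.index('a')], sparse.get('ca', 0))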

Computing bigram frequencies

There is a technical name for such sequences of two items that are processed together. They're called bigrams. The matrix in Figure 1 shows frequencies for bigrams such as "ab" (46) and "ba" (17).

In writing the code to compute these frequencies, I switched from the lower-level Python code in Listing 1 and Listing 2 to a popular Python library, the Natural Language Toolkit (NLTK). It has some useful routines for working with bigrams.
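
To get a feel for what NLTK's bigrams helper does, here is a quick example on a tiny, made-up sequence of single-character tokens, matching how the listing tokenizes text.

from nltk.util import bigrams

tokens = ['t', 'h', 'e', ' ', 'c', 'a', 't']
print(list(bigrams(tokens)))
#Prints: [('t', 'h'), ('h', 'e'), ('e', ' '), (' ', 'c'), ('c', 'a'), ('a', 't')]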

Listing 3. count_bigrams.py
import sys
from collections import Counter
import pprint

from nltk.util import bigrams
from nltk.tokenize import RegexpTokenizer

#Set up a tokenizer which only captures lowercase letters and spaces
#This requires that input has been preprocessed to lowercase all letters
TOKENIZER = RegexpTokenizer("[a-z ]")


def count_bigrams(input_fp, frequencies, buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each bigram (sequence of two characters) appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each bigram to its counted frequency
        buffer_size -- incremental quantity of text to be read at a time,
            in bytes (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    #Read the first chunk of text, and set all letters to lowercase
    text = input_fp.read(buffer_size).lower()
    #Loop over the file while there is text to read
    while text:
        #This step is needed to collapse runs of space characters into one
        text = ' '.join(text.split())
        spans = TOKENIZER.span_tokenize(text)
        tokens = (text[begin : end] for (begin, end) in spans)
        for bigram in bigrams(tokens):
            #Increment the count for the bigram. Automatically handles any
            #bigram not seen before. The join expression turns 2 separate 
            #single-character strings into one 2-character string
            frequencies[''.join(bigram)] += 1
        #Read the next chunk of text, and set all letters to lowercase
        text = input_fp.read(buffer_size).lower()

    return


if __name__ == '__main__':
    #Initialize the mapping
    frequencies = Counter()
    #Pull the input data from the console
    count_bigrams(sys.stdin, frequencies)
    #Uncomment the following line to display all the resulting frequencies
    #in readable format
    #pprint.pprint(frequencies)
    #Print just the 20 most common bigrams and their frequencies
    #in readable format
    pprint.pprint(frequencies.most_common(20))

To run this code on Python 3, you must install NLTK. You can install all the prerequisites for the code in this tutorial series by using the requirements.txt file in the GitHub repository with a command such as:

pip install -r requirements.txt

I also use the Python class collections.Counter from the standard library. The output is just the 20 most common bigrams, as follows.

[('e ', 656),
('s ', 591),
(' t', 581),
(' a', 510),
('th', 458),
('in', 376),
('t ', 351),
('d ', 330),
(' i', 326),
('he', 308),
('at', 302),
('n ', 298),
('er', 287),
(' o', 283),
('an', 273),
('re', 256),
(' s', 245),
('on', 235),
('or', 233),
('ti', 210)]

For example, "e" is the most common letter at the end of a word and "t" the most common at the beginning. The statistics are starting to reflect a bit more of the patterns of English than individual letter counts.

Adding dimensions

If recording the frequencies of 2-letter sequences provides greater insight than single-letter frequencies, then perhaps analyzing 3-letter sequences would yield even more. It's useful just to know that "the," "and," "ing," and "ed " are very common. Such 3-letter sequences are called trigrams.

Individual letter frequencies require a matrix of 1 dimension, also known as a first-order matrix. If you store it fully, without sparse array techniques, it requires 27 storage cells for the frequencies of the 26 letters plus space. The bigram matrix is 2 dimensional, or of second order. It requires 729 cells fully stored. A matrix of 3-letter sequences is of third order, with 3 dimensions. Fully stored it requires 27*27*27 cells, or 19,683. You can begin to see the tradeoff. Higher-order matrices provide more statistical insight at the expense of dramatically higher storage requirements.

A general sequence of N items is called an N-gram, and requires an N-dimensional matrix, a matrix of Nth order. If fully stored, 27^N cells are required, an exponential growth. A fourth-order matrix would require 531,441 cells to store fully. If each is 32 bits, allowing counts of up to around 4 billion, the total storage that is required would be around 2 MB. Of course, you can dramatically reduce the storage requirements by using sparse array techniques, which all the code in this tutorial does.
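
Here is a quick back-of-the-envelope calculation of those fully stored sizes, assuming 27 symbols (26 lowercase letters plus space) and 32-bit, that is 4-byte, counts per cell.

SYMBOLS = 27
BYTES_PER_CELL = 4

for order in range(1, 6):
    cells = SYMBOLS ** order
    print(f"order {order}: {cells:>10,} cells, {cells * BYTES_PER_CELL / 2**20:7.2f} MB")
#Order 2 gives 729 cells, order 3 gives 19,683, and order 4 comes to roughly 2 MB,
#matching the figures above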

Just as a note of interest and completeness, there is even the concept of a zero-order matrix. That would simply be the total number of all letters in the input text.

Counting N-grams

Here is a program for computing any arbitrary order of N-grams.

Listing 4. count_ngrams.py
import sys
from collections import Counter
import pprint

from nltk.util import ngrams
from nltk.tokenize import RegexpTokenizer

#Set up a tokenizer which only captures lowercase letters and spaces
#This requires that input has been preprocessed to lowercase all letters
TOKENIZER = RegexpTokenizer("[a-z ]")


def count_ngrams(input_fp, frequencies, order, buffer_size=1024):
    '''Read the text content of a file and keep a running count of how often
    each N-gram (sequence of N characters) appears.

    Arguments:
        input_fp -- file pointer with input text
        frequencies -- mapping from each N-gram to its counted frequency
        order -- the N in N-gram, i.e. the length of each character sequence
        buffer_size -- incremental quantity of text to be read at a time,
            in bytes (1024 if not otherwise specified)

    Returns:
        nothing
    '''
    #Read the first chunk of text, and set all letters to lowercase
    text = input_fp.read(buffer_size).lower()
    #Loop over the file while there is text to read
    while text:
        #This step is needed to collapse runs of space characters into one
        text = ' '.join(text.split())
        spans = TOKENIZER.span_tokenize(text)
        tokens = (text[begin : end] for (begin, end) in spans)
        for ngram in ngrams(tokens, order):
            #Increment the count for the N-gram. Automatically handles any
            #N-gram not seen before. The join expression turns N separate
            #single-character strings into one N-character string
            frequencies[''.join(ngram)] += 1
        #Read the next chunk of text, and set all letters to lowercase
        text = input_fp.read(buffer_size).lower()

    return


if __name__ == '__main__':
    #Initialize the mapping
    frequencies = Counter()
    #The order of the ngrams is the first command line argument
    ngram_order = int(sys.argv[1])
    #Pull the input data from the console
    count_ngrams(sys.stdin, frequencies, ngram_order)
    #Uncomment the following line to display all the resulting frequencies
    #in readable format
    #pprint.pprint(frequencies)
    #Print just the 20 most common N-grams and their frequencies
    #in readable format
    pprint.pprint(frequencies.most_common(20))

You pass in the order on the command line. For example, to get 5th-order N-gram frequencies, you would run:

python count_ngrams.py 5 < bigbraineddata1.txt

The output, just the 20 most common 5-grams, follows.

[(' the ', 184),
(' and ', 102),
(' that', 64),
('that ', 63),
(' you ', 54),
(' data', 50),
('s of ', 49),
('data ', 49),
('tion ', 47),
(' for ', 45),
(' of t', 43),
('ation', 42),
('ning ', 39),
(' are ', 38),
('of th', 36),
('f the', 35),
('this ', 34),
('e of ', 33),
(' this', 32),
('n the', 30)]

Interesting patterns are emerging, such as the prominence of words like "data," the grammatical articles, and common word prefixes and suffixes. The statistics are starting to capture the characteristic vocabulary and style of the input text.

Limits to the usefulness of order

Now observe what happens when I push it to 7th order.

$ python count_ngrams.py 7 < bigbraineddata1.txt
[(' of the', 32),
 ('of the ', 27),
 (' algori', 21),
 ('algorit', 21),
 ('lgorith', 21),
 ('gorithm', 21),
 ('aining ', 20),
 ('s that ', 20),
 (' learni', 19),
 ('learnin', 19),
 ('earning', 19),
 (' traini', 19),
 ('trainin', 19),
 ('raining', 19),
 (' machin', 17),
 ('machine', 17),
 ('arning ', 17),
 ('in the ', 17),
 ('achine ', 16),
 ('e learn', 16)]

At first glance, this looks awesome. The patterns of the tutorial's vocabulary are really starting to show through in the statistics. There is clearly a lot of talk about machines and algorithms going through training and thus, learning things. But there is a subtle problem here. Even the top N-grams are getting rarer and rarer: the most common bigram occurs 656 times, the most common 5-gram 184 times, and the most common 7-gram only 32 times.

If you consider only the highest orders of N-grams, they tend to become welded to properties of the input text in especially brittle ways. As it turns out, the value of the statistics gets diluted for purposes of trying to piece language back together to generate something familiar. This is actually a general problem in AI, where our modern ability to store and analyze more and more complex models can be as much a curse as a blessing. It can get harder and harder to rummage through so many weeds for the seedling patterns that should grow into useful responses to situations.

One thing that helps in such situations relies on another benefit of the modern age. Creating the models from a much larger and more varied corpus of natural language tends to smooth out statistical quirks while providing enough richness to support generative AI techniques. Again, this parallels other areas of AI, where access to more high-quality training data is paramount. You can get more information about data in my tutorial, "Big-brained data, Part 1: Pay attention to the data to get the most out of artificial intelligence, machine learning, and cognitive computing."

Conclusion

You've learned how to gather statistics from a source text as a first step toward having a machine generate text that would be familiar to a reader of that source text. Because it would probably take longer than the age of the universe for monkeys at typewriters to produce anything like familiar text, we've laid the groundwork for rigging the typewriters so that the chance a monkey hits a key is not entirely random, but is based instead on the statistics of letter sequences in training text. We are ready to sit the monkeys down at their rigged typewriters.

I'll be coming to that step eventually, but before that, in the next tutorial, I'll dig even deeper into the fascinating world of N-grams and N-gram statistics. Now that we have this tool, we can examine a range of interesting applications, including in cognitive computing.

The code in this tutorial uses Python. If you'd like to become more familiar with that easy-to-learn programming language, see "Beginner's guide to Python."

