

Cleansing, processing, and visualizing a data set, Part 2

Gaining invaluable insight from clean data sets


This content is part 2 of the three-part series Cleansing, processing, and visualizing a data set.

Stay tuned for additional content in this series.

Data has value, but insight from data is invaluable. When you're presented with a data set, how do you make sense of the data and utilize it to gain knowledge? I'll show you two approaches to processing a cleansed data set so that you can do just that.

This tutorial, the second in a three-part series, continues the discussion of messy data I began in Part 1 and shows you how to apply machine learning methods to a cleansed data set. Part 3 will continue with methods for data visualization.

Figure 1. Data processing pipeline
Three boxes showing data flow

Recall from Part 1 that I cleansed a data set of animal features so that I could use it for clustering. Now, I'll use two unsupervised learning algorithms to cluster this data. The first is vector quantization (VQ), originally developed as a data-compression algorithm based on the principle of block coding, although you can also apply it to clustering problems. The second is Adaptive Resonance Theory (ART), built around the concept of memory templates that are matched to feature vectors based on their similarity. ART also has the interesting ability to adjust the number of clusters dynamically, which gives it distinct advantages in clustering.

Vector quantization

VQ began in signal processing as a way to compress data by segmenting the data to be communicated into groups. Because the cluster identifier can be transmitted instead of the feature data itself, the amount of data transferred is minimized. Because features that aren't identical can end up in the same group, the approach is lossy, with some information lost in translation. But by grouping similar feature vectors, the result is classical clustering.

Figure 2 shows the structure of VQ. You apply codewords, or data to be encoded (that is, compressed), to the network as the input. The algorithm computes the output (or codebook) values by measuring the distance between the input feature vector and the weights associated with each output node, then chooses the output node with the smallest value as the closest cluster. Finally, the algorithm adjusts that node's weights by a fraction so that the VQ centroid moves closer to the input. VQ applies the entire set of codewords to the network in this fashion for a given number of iterations (or until no codeword changes the membership of an output codebook).

Figure 2. A vector quantization network
Box and line diagram of a VQ network

Once trained, you can apply new codewords to the network to identify their codebook (that is, their cluster of membership). In the context of data compression, this approach maps a higher dimensional feature space into a lower dimensional subspace.

This approach has also been applied to data correction: a corrupt codeword is applied to the network, and the selected output codebook provides the group's centroid, which is used to reconstruct the input.

Implementing VQ

Clustering with VQ is defined simply as training a network with an unlabeled data set. In production, you can then apply new observations to the network and identify the resulting cluster by using a winner-takes-all approach across the output nodes. I provide a sample implementation of VQ on GitHub. This implementation is split across two source files, one of which is shared by both machine learning algorithms. As Figure 3 shows, the file learn.c implements the common elements that both VQ and ART use, including parsing user options from the command line, opening and closing files, and parsing observations from the input files. The file vq.c contains the VQ implementation, split into initialize, train, and validate functions.

Figure 3. High-level flow of the VQ sample implementation
Flow of data in learn.c to implementation in vq.c
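
The option-parsing code in learn.c isn't reproduced in this tutorial, but based on the command lines used later (-t, -v, -o, -i, and -a), a minimal sketch using getopt might look like the following. The flag names come from the examples in this tutorial; the structure, defaults, and error handling are my assumptions, and the real learn.c may differ.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Hypothetical option holder; the real learn.c may organize this differently.
typedef struct options
{
   char *test_file;        // -t : training/test data file
   char *validate_file;    // -v : optional validation data file
   char *output_file;      // -o : output file
   long  iterations;       // -i : training iterations (used by VQ)
   char *algorithm;        // -a : "vq" or "art"
} options;

int parse_options( int argc, char *argv[], options *opts )
{
   int opt;

   memset( opts, 0, sizeof( options ) );
   opts->iterations = 100000;   // Assumed default

   while ( ( opt = getopt( argc, argv, "t:v:o:i:a:" ) ) != -1 )
   {
      switch ( opt )
      {
         case 't': opts->test_file     = optarg; break;
         case 'v': opts->validate_file = optarg; break;
         case 'o': opts->output_file   = optarg; break;
         case 'i': opts->iterations    = atol( optarg ); break;
         case 'a': opts->algorithm     = optarg; break;
         default:
            fprintf( stderr,
               "usage: learn -t <data> [-v <data>] -o <out> [-i <n>] -a <vq|art>\n" );
            return -1;
      }
   }

   return 0;
}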

The discussion that follows focuses on specifics of the VQ implementation.

First, note the set of required structures and symbols, shown below. I define the number of output nodes (codebooks); the learning rate; the array of output values; the weights (a 2-D array that connects every feature to every output, as shown in Figure 2 above); and the input vector for the network.

Listing 1. Key structures and symbols for VQ
#define OUTPUTS            7
#define MAX_FEATURES      21

#define RATE            0.01

static double  outputs[ OUTPUTS ];
static double  weights[ OUTPUTS ][ MAX_FEATURES ];
static int     input_feature_vector[ MAX_FEATURES ];

The vq_initialize function initializes the output array and the weights array.
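
That function isn't listed here, but a minimal sketch might look like the following. I'm assuming small random initial weights; the real vq_initialize could just as easily zero the weights, because the first OUTPUTS observations overwrite them as class prototypes anyway.

#include <stdlib.h>   // rand, RAND_MAX

void vq_initialize( void )
{
   for ( int output = 0 ; output < OUTPUTS ; output++ )
   {
      outputs[ output ] = 0.0;

      for ( int feature = 0 ; feature < MAX_FEATURES ; feature++ )
      {
         // Assumption: start with small random weights in [0, 1).
         weights[ output ][ feature ] = ( double )rand( ) / ( double )RAND_MAX;
      }
   }

   return;
}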

The vq_train function implements the training loop for my network (see Listing 2). This function is passed the test data file and the number of iterations to perform in training. The algorithm begins by assigning the first n observations as the class prototypes (although you could also do this randomly). The get_observation function reads a new observation from the input file, and vq_set_input_vector translates this observation into the input vector. The vq_updateweights function then moves the weights for the output node (class) closer to the input features, scaled by a small learning rate to limit the change.

I then perform the training loop for the number of iterations. This loop gets a new observation (and, if it reaches the end of the file, rewinds to start again), then performs a similar training cycle on the observation. The observation is fed into the input vector of the network (with vq_set_input_vector), and the inputs are fed forward through the network to identify the nearest class (through vq_feedforward). This step then returns the nearest class, which I use to update the weights for this output node.

Listing 2. The vq_train function to train the VQ network
void vq_train( FILE *fptr, long iterations )
{
   int result;
   observation obs;
   long iteration = 0;

   // Initialize the first N observations to the N classes.
   for ( int class = 0 ; class < OUTPUTS ; class++ )
   {
      result = get_observation( fptr, &obs );
      vq_set_input_vector( &obs );
      vq_updateweights( class );
   }

   while ( iteration < iterations )
   {
      result = get_observation( fptr, &obs );

      if ( !result )
      {
         // Reset the file position to the beginning.
         fseek( fptr, 0L, SEEK_SET );
         iteration++;
      }
      else
      {
         vq_set_input_vector( &obs );
         int class = vq_feedforward( );
         vq_updateweights( class );
      }
   }

   return;
}
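
The vq_set_input_vector helper called in Listing 2 isn't shown. Assuming the observation structure carries an integer feature array (the way the ART code's vector type does), it could be as simple as this sketch:

// Copy an observation's features into the network's input vector.
// Assumes the observation type exposes an int features[ MAX_FEATURES ]
// array; the real structure may differ.
void vq_set_input_vector( observation *obs )
{
   for ( int feature = 0 ; feature < MAX_FEATURES ; feature++ )
   {
      input_feature_vector[ feature ] = obs->features[ feature ];
   }

   return;
}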

Listing 3 provides the vq_feedforward function, which executes the network and identifies the winning output node, which represents the nearest class for the observation. The function does this by iterating over each output node and computing the sum of the Euclidean distances between that node's weights and the input vector. The result provides a similarity metric between the input feature vector and each output node's weights, where the lowest sum identifies the closest cluster. The latter half of the loop keeps track of the best output node, which is then returned to the caller.

Listing 3. The vq_feedforward function to execute the VQ network
int vq_feedforward( void )
{
   int best;
   double best_value;

   // Given the current input feature vector, compute each output node.
   for ( int output = 0 ; output < OUTPUTS ; output++ )
   {
      outputs[ output ] = 0.0;
      for ( int feature = 0 ; feature < MAX_FEATURES ; feature++ )
      {
         outputs[ output ] +=
            distance( NORMALIZE( weights[ output ][ feature ] ),
                      input_feature_vector[ feature ] );
      }

      // Keep track of the best activation
      if ( output == 0 )
      {
         best = 0;
         best_value = outputs[ output ];
      }
      else
      {
         if ( outputs[ output ] < best_value )
         {
            best = output;
            best_value = outputs[ output ];
         }
      }
   }
   return best;
}
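
Listing 3 (and Listing 4 below) rely on two helpers that aren't shown: distance and NORMALIZE. One plausible pair of definitions, consistent with how the listings use them, is a per-feature squared difference (summing these gives a squared-Euclidean metric, where smaller is closer) and a macro that maps a value back onto the binary range of the cleansed zoo features. The actual definitions in vq.c may differ.

// Assumed helpers (not part of the published listings).

// Per-feature squared difference; smaller sums mean closer vectors.
#define distance( x, y )    ( ( ( x ) - ( y ) ) * ( ( x ) - ( y ) ) )

// The cleansed zoo features are binary, so one plausible NORMALIZE
// simply rounds a weight (or feature) back to 0 or 1 for comparison.
#define NORMALIZE( x )      ( ( ( double )( x ) >= 0.5 ) ? 1.0 : 0.0 )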

The vq_updateweights function shown in Listing 4 implements the learning portion of VQ. Once I find the output node that is nearest to the input vector, I update the weights for that output node to bring them closer to the new feature vector. I do this by adjusting each weight (for the winning output node) by the difference between the feature value and the current weight, scaled by a learning rate to reduce the overall change for any one observation.

Listing 4. The vq_updateweights function to learn the latest feature vector
void vq_updateweights( int class )
{
   for ( int feature = 0 ; feature < MAX_FEATURES ; feature++ )
   {
      weights[ class ][ feature ] += RATE *
         ( ( double ) NORMALIZE( input_feature_vector[ feature ] ) -
                      weights[ class ][ feature ] );
   }

   return;
}

The validation function (vq_validate, not shown) tests new observations without performing any further learning, so I can check how well the network generalizes to unseen observations. For each observation in my test data file, this function identifies the winning output node (or class).
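
Here is a minimal sketch of that function, assuming each observation carries a name field the way the ART code's vector type does; the real vq_validate may also compare the result against the observation's known class.

#include <stdio.h>

// Run each observation through the trained network without updating
// any weights, and record the winning class for it. The printed class
// index here is 0-based; the real code may emit 1-based classes, as
// Listing 5 suggests.
void vq_validate( FILE *fptr, FILE *out )
{
   observation obs;

   while ( get_observation( fptr, &obs ) )
   {
      vq_set_input_vector( &obs );
      int class = vq_feedforward( );
      fprintf( out, "%s,%d\n", obs.name, class );
   }

   return;
}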

Clustering with VQ

Now I need to test my VQ implementation against the data set. To start, I cleanse the data set to remove any erroneous observations by using the cleanse command. I specify the input dataset file (zoo.dat), the output file prefix (output), and the schema. I don't request a split of the data set, so one file will contain the entire set.

./cleanse -i zoo.dat -o output -c "sddddddddddddddddd"

The result is an output.dat file that contains my cleansed data set. I pass this new cleansed file to my learn executable, which implements the machine learning algorithm. I specify the test data file (output.dat), the output file (out), the number of iterations to run, and the algorithm to use (vq, meaning vector quantization).

./learn -t output.dat -o out -i 100000 -a vq

The result is a file that contains the name of the animal and the cluster to which it belongs.

Listing 5. Output format of the VQ cluster algorithm
aardvark,4
antelope,4
bass,2
bear,4
boar,4
...

I've post-processed the data below to show the full clusters of animals and where errors were found. Note that clustering the zoo data set with VQ resulted in 84-percent accuracy, which is reasonable given the lossy nature of this approach. VQ overloaded Class 1 with two classes from the original data, while three classes were clustered accurately based on their features. Note that with a cooling schedule (reducing the learning rate over time so that weight changes shrink as training progresses), you could realize higher accuracy; a sketch of one such schedule follows Listing 6.

Listing 6. Final clusters of animals from the zoo data set using VQ
****Class 1
aardvark, antelope, bear, boar, buffalo, calf, cavy, 
cheetah, deer, dolphin, duck, elephant, flamingo, fruitbat, 
giraffe, girl, goat, gorilla, hamster, hare, leopard, lion, 
lynx, mink, mole, mongoose, opossum, oryx, platypus, 
polecat, pony, porpoise, puma, pussycat, raccoon, reindeer, 
seal, sealion, squirrel, vampire, vole, wallaby, wolf.

****Class 2
chicken, crow, dove, gull, hawk, kiwi, lark, ostrich, 
parakeet, penguin, pheasant, rhea, skimmer, skua, sparrow, 
swan, wren, vulture. 

****Class 3
pitviper, seasnake, slowworm, tortoise, tuatara,

****Class 4
bass, carp, catfish, chub, crayfish, dogfish, haddock, 
herring, pike, piranha, seahorse, sole, stingray, tuna.

****Class 5
frog, frog, newt, toad.

****Class 6
flea, gnat, honeybee, housefly, ladybird, moth, termite, 
wasp.

****Class 7
clam, crab, lobster, octopus, scorpion, seawasp, slug, 
starfish, worm.
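
The published vq.c uses a fixed RATE, but the cooling schedule mentioned above is easy to retrofit. Here's a minimal sketch of my own (not part of the sample code) that decays a mutable learning rate once per pass through the data set; vq_updateweights would then use rate in place of RATE, and vq_train would call vq_cool each time it rewinds the input file.

// Replace the fixed RATE with a variable that decays per pass.
static double rate = 0.1;            // Assumed starting learning rate

#define RATE_DECAY   0.99            // Assumed per-pass decay factor
#define RATE_MIN     0.001           // Floor so learning never stops entirely

void vq_cool( void )
{
   rate *= RATE_DECAY;
   if ( rate < RATE_MIN ) rate = RATE_MIN;

   return;
}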

Although VQ was not 100-percent accurate in its clustering, it did a reasonable job at 84 percent. Now I'll move on to another clustering method, one with a unique advantage.

Adaptive Resonance Theory

ART is a family of algorithms inspired by the way the brain processes information. ART implements the concept of incremental learning, which allows the brain to retain old memories while storing new ones (addressing the so-called "stability/plasticity" dilemma). The idea is that learning models require plasticity to integrate new knowledge but stability to retain old knowledge. Networks trained with back-propagation, by contrast, can lose previously learned information as new samples appear.

Figure 4 shows the basic architecture of the ART model for unsupervised learning. The input is applied to a comparison layer through a set of weights to a recognition layer. If the input is sufficiently similar to a neuron in the recognition layer, the weights are updated toward the new observation. The similarity test is based on a vigilance parameter, which can force the observation to a new, inactive neuron. This parameter allows ART to adapt the number of output classes based on the input, pushing the observation into a new neuron (allowing for stability of the old neuron and the information it possesses as well as plasticity for the new neuron to incorporate new knowledge).

Figure 4. ART1 unsupervised learning architecture
Box and line diagram of an ART network

The vigilance parameter is an important part of the ART model: A higher vigilance value produces detailed memories, while a lower value results in more general memories. A vigilance parameter of 1.0 forces each output neuron to match its input exactly.

Implementing ART

From the previous description, it's easy to see that ART implements vector classification (ART1 specifically focuses on binary-valued data). When input feature vectors are similar to existing stored knowledge, the classifying output node is strengthened for that observation. If the observation is not similar (given vigilance) to any active output neuron, a new neuron can be formed to classify this pattern.

Take a look at a sample implementation of ART1. The basic high-level flow of this implementation is shown below. As with VQ, the file learn.c implements the common elements, and art.c implements the ART algorithm. After initialization, art_train implements ART1 learning, with the ability to cluster an observation to a similar output or create a new cluster for the observation. The art_validation function provides the ability to determine the output node (cluster) for a new observation but performs no new training of the model.

Figure 5. High-level flow of the ART implementation
Flow of data in learn.c to implementation in art.c

In the code below, I support up to nine clusters (that is, up to nine prototype vectors that the algorithm can create). The vector type is a multipurpose structure that serves as both an observation (feature vector) and a cluster prototype; it holds the observation's information (name and features), the cluster it belongs to, and its previously defined class. For clusters, the count field indicates how many feature vectors are members of the cluster.

Listing 7. ART1 types and symbols
#define CLUSTERS           9

#define BETA             ( double )7.0
#define RHO              ( double )0.55

typedef struct vector
{
   char name[ MAX_NAME ];
   int  features[ MAX_FEATURES ];
   int  count;    // Prototype cluster count
   int  cluster;  // Feature vector cluster
   int  actual_class;
} vector;

// In-memory observations (feature vectors)
vector feature_vectors[ MAX_FEAT_VECS ];
int    max_feature_vectors;

// Prototype feature vectors
vector clusters[ CLUSTERS ];

Listing 8 provides the main loop for the ART algorithm. As long as any feature vector changes class, I continue to run the entire data set through the algorithm. I iterate through each observation and call cluster_observation to see where it belongs. If no similar cluster is found, the code attempts to create a new one; otherwise, the observation is added (or moved) to the identified cluster.

Listing 8. The ART main loop
void art_train( FILE *fptr )
{
   int changes = 1;
   int cluster;

   while ( changes )
   {
      changes = 0;

      for (int feature = 0 ; feature < max_feature_vectors ; feature++)
      {
         cluster = cluster_observation( feature );

         if ( cluster == CLUSTERS )
         {
            cluster_create( feature );
            changes++;
         }
         else
         {
            // If this feature vector has moved, move it.
            if ( feature_vectors[ feature ].cluster != cluster )
            {
               cluster_add( cluster, feature );
               changes++;
            }
         }
      }
   }

   emit_clusters( fptr );

   return;
}

The meat of the ART algorithm is found in cluster_observation in Listing 9. This function determines how similar an observation is to the set of prototype vectors and whether a new prototype vector (cluster) must be created to hold the observation. It relies on vector magnitudes, which are defined as the number of 1's within a vector. The function is split into two parts: The first part finds the prototype cluster most similar to the feature vector, and the second part tests whether the feature vector is sufficiently similar to be placed into that cluster. Otherwise, a new cluster can be created.

Listing 9. Clustering an observation in ART1
int cluster_observation( int feature )
{
   vector result;
   double best_max = 0.0;
   int best_cluster = CLUSTERS;

   double featureMag = ( double )vMagnitude( &feature_vectors[ feature ] );

   for ( int cluster = 0 ; cluster < CLUSTERS ; cluster++ )
   {
      // If a cluster has no members, skip it.
      if ( clusters[ cluster ].count == 0 ) continue;

      vAnd( &result, &feature_vectors[ feature ], &clusters[ cluster ] );
      double resultMag = ( double )vMagnitude( &result );
      double clusterMag = ( double )vMagnitude( &clusters[ cluster ] );
      double maximum = resultMag / ( BETA + clusterMag );

      if ( maximum > best_max )
      {
         best_max = maximum;
         best_cluster = cluster;
      }

   }

   if ( best_cluster != CLUSTERS )
   {
      vAnd( &result, &feature_vectors[ feature ], &clusters[best_cluster] );
      double resultMag = ( double )vMagnitude( &result );
      double clusterMag = ( double )vMagnitude( &clusters[ best_cluster ] );
      double maximum    = resultMag / ( BETA + clusterMag );
      double similarity = featureMag / ( BETA + ( double ) MAX_FEATURES );

      // See if the feature vector is similar to the cluster
      if ( maximum > similarity )
      {
         if ( ( resultMag / clusterMag ) >= RHO )
         {
            return best_cluster;
         }
      }
   }

   return CLUSTERS;
}
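
Listing 9 depends on two small vector helpers, vMagnitude and vAnd, that aren't listed. Their behavior is described above (a magnitude is the count of 1's, and the AND is element-wise), so a sketch consistent with that description might look like this; the actual functions in art.c may differ in detail.

// Magnitude of a binary feature vector: the number of features set to 1.
int vMagnitude( vector *v )
{
   int count = 0;

   for ( int feature = 0 ; feature < MAX_FEATURES ; feature++ )
   {
      if ( v->features[ feature ] ) count++;
   }

   return count;
}

// Element-wise AND of two binary feature vectors, stored in result.
void vAnd( vector *result, vector *a, vector *b )
{
   for ( int feature = 0 ; feature < MAX_FEATURES ; feature++ )
   {
      result->features[ feature ] =
         a->features[ feature ] & b->features[ feature ];
   }

   return;
}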

Look at the equations from Listing 9 in the context of an example. Figure 6 illustrates this process for a feature vector and two prototype vectors. Recall that the magnitude of a cluster is just the number of 1's present in the vector, and the vector AND operation is just like a bitwise AND. The value n is the dimension of my feature vector and cluster.

In parts A and B, note that prototype vector 0 maximizes the equation shown. Parts C and D then test whether the feature vector is sufficiently similar to the chosen prototype vector and whether my vigilance threshold RHO is met. In this case, both tests pass, so the feature vector is added to the cluster represented by prototype vector 0.

Figure 6. Illustrating the ART similarity equations
Illustrated example of the Listing 9 equations
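
To make the arithmetic concrete, here's a small hypothetical example (the numbers are illustrative only, not taken from Figure 6). Suppose the feature vector F has 8 bits set, the winning prototype P0 has 6 bits set, and F AND P0 has 5 bits set, with BETA = 7, n = 21, and RHO = 0.55:

maximum    = |F AND P0| / ( BETA + |P0| ) = 5 / ( 7 + 6 )  ≈ 0.38
similarity = |F| / ( BETA + n )           = 8 / ( 7 + 21 ) ≈ 0.29
vigilance  = |F AND P0| / |P0|            = 5 / 6          ≈ 0.83

Because maximum exceeds similarity and the vigilance ratio is at least RHO, the feature vector joins the cluster represented by prototype vector 0.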

Listing 10 provides the cluster management functions. The first, cluster_create, finds an empty cluster and loads the feature vector to it as its prototype. It sets the membership count to 1, and then indicates that the particular feature vector is part of the cluster. In cluster_add, I first check to see if the feature vector is currently part of a cluster. If it is, I remove it from the cluster and recalculate the cluster's prototype vector. Then the feature is added to the new cluster and its prototype vector recalculated.

Listing 10. Functions to create or add a feature to an existing cluster
void cluster_create( int feature )
{
   int cluster = find_empty_cluster( );

   if ( cluster != CLUSTERS )
   {
      for ( int i = 0 ; i < MAX_FEATURES ; i++ )
      {
         clusters[ cluster ].features[ i ] = 
            feature_vectors[ feature ].features[ i ];
      }
      clusters[ cluster ].count = 1;
      feature_vectors[ feature ].cluster = cluster;
   }

   return;
}

void cluster_add( int cluster, int vector )
{
   // If the current feature vector has been classified, pull it out.
   if ( feature_vectors[ vector ].cluster != CLUSTERS )
   {
      int prior_cluster = feature_vectors[ vector ].cluster;
      feature_vectors[ vector ].cluster = CLUSTERS;
      clusters[ prior_cluster ].count--;
      recalculate_cluster( prior_cluster );
   }

   // Add the feature vector to the new cluster.
   feature_vectors[ vector ].cluster = cluster;
   clusters[ cluster ].count++;
   recalculate_cluster( cluster );

   return;
}

The final function in this ART demonstration is recalculate_cluster (see Listing 11). This function iterates through the feature vectors and finds those that are members of the cluster. The first feature vector found is loaded directly into the cluster's prototype vector, and subsequent feature vectors are masked in (so that the prototype vector is the bitwise AND of all feature vectors that belong to that cluster).

Listing 11. Recomputing the cluster's prototype vector
void recalculate_cluster( int cluster )
{
   int first = 0;

   for ( int vec = 0 ; vec < max_feature_vectors ; vec++ )
   {
      if ( feature_vectors[ vec ].cluster == cluster )
      {
         if ( !first )
         {
            first = 1;

            // Set the cluster to the first feature vector.
            for (int feature = 0 ; feature < MAX_FEATURES ; feature++ )
            {
               clusters[ cluster ].features[ feature ] =
                  feature_vectors[ vec ].features[ feature ];
            }
         }
         else
         {
            // Boolean AND the next feature vectors into the cluster.
            for (int feature = 0 ; feature < MAX_FEATURES ; feature++ )
            {
               clusters[ cluster ].features[ feature ] &=
                  feature_vectors[ vec ].features[ feature ];
            }
         }
      }
   }

   return;
}

ART requires a bit more code than vector quantization, but it can also be more robust. Next, I show you the algorithm in action.

Clustering with ART

As an example, I'll test my ART implementation against the zoo data set. I begin with the cleansed data set I created earlier, specifying the cleansed data set file, an output file, and the ART algorithm. In this case, I also provide validation data to check how well the algorithm generalizes from the model it learned on the training data set.

Listing 12. Output from the ART training example
$ ./learn -t output.dat -v output.tst -o out -a art
Cluster 0: Count  17 : [ 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 ]
Cluster 1: Count   5 : [ 0 0 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 ]
Cluster 2: Count   1 : [ 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 1 1 0 0 ]
Cluster 3: Count   7 : [ 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
Cluster 4: Count   3 : [ 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 ]
Cluster 5: Count  36 : [ 1 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 ]
Cluster 6: Count   2 : [ 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0 1 ]
Cluster 7: Count  19 : [ 0 1 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 ]
Cluster 8: Count   7 : [ 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 ]
$

Listing 13 shows the output file. Inspecting this file, which shows the actual clustering, I can see the animal name; the original class, as defined by the zoo data set; and the cluster identified by ART, followed by the results for the validation data set.

Listing 13. Output generated in the ART output file
bass,0,4
carp,0,4
catfish,0,4
chub,0,4
dogfish,0,4
dolphin,0,1
haddock,0,4
...
wasp,8,6

Validation:
aardvark (1) -> Cluster 5
ladybird (6) -> Cluster 8
parakeet (2) -> Cluster 2
piranha (4) -> Cluster 0
antelope (1) -> Cluster 5

In this example, you can see that ART used all nine available clusters even though only seven were required. The overall accuracy across the training and validation sets was 89 percent. ART has the advantage that it can scale the number of clusters as determined by RHO, but it can suffer misclassification depending on the order of the data set. Randomizing the data set can improve results.
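
Shuffling isn't part of the sample code, but because the observations are already loaded into the feature_vectors array, a Fisher-Yates shuffle run once before art_train is straightforward. A minimal sketch:

#include <stdlib.h>   // rand, srand
#include <time.h>     // time, for seeding

// Shuffle the in-memory observations before training (Fisher-Yates).
// Not part of the published art.c; call once after loading the data.
void shuffle_observations( void )
{
   srand( ( unsigned )time( NULL ) );

   for ( int i = max_feature_vectors - 1 ; i > 0 ; i-- )
   {
      int j = rand( ) % ( i + 1 );

      vector temp = feature_vectors[ i ];
      feature_vectors[ i ] = feature_vectors[ j ];
      feature_vectors[ j ] = temp;
   }

   return;
}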

Going further

This tutorial explored the construction of two unsupervised learning models for a cleansed data set. VQ is a simple algorithm that can quickly and efficiently cluster a data set; ART is slightly more complex, but can adapt the number of clusters based on the data set and its configured parameters. In the final tutorial in this series, I'll explore methods for visualizing data as a way to improve its comprehension.

