In today's cognitive computing products and techniques, the perception of greater intelligent responsiveness comes not so much from having true explanatory power, but rather just having strong predictive power over increasingly chaotic and larger data sets.
To give us a third point of reference beside linear regression and neural nets, I'll use some other terms to bring the focus to natural language processing. In 2011, the IBM Watson system demonstrated greater intelligence than the best human opponents in the domain of linguistically challenging factual Q&A. This was based on the ability to quickly produce high confidence answers from a large corpus of unstructured information in response to challenging questions.
The linguistic product that is now based on that system is called the IBM Watson Engagement Advisor. As with other cognitive computing techniques, the product must first be trained to be an effective system in the target domain. The corpus of unstructured information often takes the form of documents, such as instruction manuals, technical reports, journal articles, and wiki pages. During training, the most important entities and relationships expressed in the documents are identified and stored in order to expedite later search and retrieval during Q&A interactions with users of the system. The identification process within a document is often called annotation, and the annotation and storage processes together are called ingestion.
The most important concept in understanding the training is to understand what really drives the identification, or annotation, of the documents. It's simple, really. It's a Q&A linguistic product, and the annotation and ingestion expedite the production of the A's in response to the Q's, so it is imperative to have a strong and large representative sampling of the potential questions in order to train and test the efficacy of the system. The questions encode the key concepts (e.g. entities, relationships and so forth) to which users of the system are interested in getting answers. Annotators for these key concepts are developed and, during ingestion, they are executed upon the documents.
This is the very basic level of explanation about how the linguistic product would learn, or be trained, to be an effective cognitive computing system for a domain, and future entries will dive further into this topic. For now, let it suffice to say that ingestion and training results in a system capable of producing answers from the corpus in response to questions like those used in training.
During a run-time Q&A session with the system, the user begins by posing a natural language question. The question is first analyzed to find the key concepts, and then a multiphased approach is used to dig up the best results from the ingested corpus content. As with training, there's a lot more to be said over time about how the run-time Q&A works, so more interesting future entries to come, and in fact, it's intrinsically related to the training anyway. To conclude and tee up these future entries, I'll say the high order bit here is that a trained Q&A linguistic product seems somehow more intelligent than a linear regression or even a typical neural net application. Why is that? To get a bit more background for that explanation, I'd encourage you to visit or revisit a few of my earlier blog entries about cognitive computing. Compare your perceptions of the intelligence of the minimax algorithm in  with the linear regression method in , and compare  with the neural net in . What's changing?
Modified by John M. Boyer
Explanatory power is a bit of a loaded term. I believe that we can come to a good understanding of what it is and how it related to machine learning by comparing and contrasting linear regression with neural nets.
The IBM Analytics Education Series has a good introductory analytics presentation that includes brief descriptions of neural nets and linear regression. It's a good video worth your viewing, though it does make one point with which I don't agree.
The speaker says that a challenge with neural nets in business applications is that they are black box, meaning that you can understand the inputs and the outputs but not really how it is deriving the outputs. Later, the speaker says that linear regression is a preferred technique because it has a very strong predictive and explanatory power.
It's not really true that linear regression has more explanatory power than neural nets. Rather, it is easier to understand the problems and the answers that can be solved by linear regression. By comparison, neural nets tend to be used to provide cognitive computing power to harder problems than linear regression can solve.
To put this another way, when you use linear regression, you actually begin by assuming linearity of the relation you want to predict. As the speaker points out, you can also make a non-linear assumption, and you can accommodate this using a data transformation, for example. But the high order bit is that you are asked to assume the data relationship, and that assumption is what is giving you the illusion of explanatory power. You can explain that the data follows a line, but this is due to your own assumption. Note that an important aspect of completing a linear regression model is determining the R2 or goodness of fit of the model. This is the part where you make sure that your assumption of linearity is valid. And if the assumption is invalid, then the model has no predictive value, so it does not matter that you can explain how it operates.
Under the interpretation that explanatory power is akin to predictive power, it turns out that neural nets have greater predictive power because they can produce results for a wider array of applications than linear regression can. There a neat table that relates the cognitive power of a neural net to the number of hidden layers. From the table, you can see that when a relationship actually is linear, a neural net can solve it without even using any hidden layers of neurons. When one or two hidden layers of neurons is present, neural nets transcend the capabilities of linear regression, in part because they do not require you to make any assumption about what the data relationship actually is.
And that's where the confusions comes in. The linear regression model requires you to assume linearity and so you know at least what geometric shape the relationship looks like. The neural net requires no such assumption, but nor does the trained neural network give you any hint at what the relationship is. The lack of knowing the relationship is confused for having less explanatory power.
But if you look at this a bit more abstractly, the trained linear regression model has the same exact problem of not providing any additional insight. A neural net is really just a pile of numbers giving constant weights to the neural connections that can convert inputs to outputs. Similarly, a linear regression model is just a pile of numbers that give constant weights to inputs to be linearly combined into an output. Sure you know the data relationship, but that's because you assumed it. The actual linear regression model gives you no insight into why one dimension has a large slope constant where another has a small slope.
An analogy I like to use is that the value of the neural net is not diminished by our inability to explain how it is that the little gray cells which implement our personal neural nets can produce the cognitive results that they do, and who among us would prefer to have cognitive powers defined by linear regression instead?
In terms of explanatory power, our biological neural nets perform an additional key function that we have not hitherto been able to achieve with artificial neural nets. We are able to construct additional information in the output that reveals causal relationships, or insights into the reasons for the phenomena it predicts. Put simply: we say why something is true. We provide a rationale. This is an aspect of explanatory power that, when achieved, dramatically increases the value and utility of any cognitive analytic. Theorem provers and Prolog programs have been able to do this for the applications to which they apply. In the area of unstructured information processing and data mining, you can see a demo of this concept in Watson Paths.
Modified by John M. Boyer
As an interesting possible counterexample to my last blog about MLR models not understanding the knowledge they learn, consider the neural network. Our brains are neural networks, and we are capable of learning at all levels of Bloom's Taxonomy, not just the knowledge level. Shouldn't artificial neural networks be able to achieve the same things?
The answer is no, not really. Our brains biologically, chemically and physically perform in ways that we scarcely understand, so our name for the thing we call "artificial neural network" is no less anthropomorphizing than when we say that a computer program of today "understands" anything.
Still (again), this is not to say that they aren't incredibly useful and effective. It's just that they are based on straightfoward and well-understood mechanical methods such as feed forward activation of neural outputs via sigmoidal threshold functions applied to inputs and back propagation of synaptic weight adjustments based on easily quantified classification errors. Before going any further, let's have a quick look at a diagram of an artificial neural network (ANN):
The ANN has an output layer on the right that is a classifier for input patterns received on the left. For example, an ANN for optical character recognition could have an input layer of an 8x8 matrix of bits, and the output layer could be an 8-bit code that indicates an ASCII character. The hidden layer(s) of neurons help the ANN to represent more sophisticated phenomena, though there is seldom need for more than one hidden layer. The "synaptic" connections between the neurons in the layers are weighted numbers, and the neurons apply the weights to the inputs and then feed the results into a Sigmoid function that essentially decides, like a transistor or switch, whether or not to fire the output.
An ANN is "trained" by giving it a sequence of input patterns for which the correct output pattern is known. The input pattern feeds forward through the ANN to produce an output. If there is a difference between the ANN output and the correct output, then the differential error is back propagated through the ANN to adjust the weights so that future occurrences of that input pattern are more likely to produce the correct output.
The synaptic weights, then, essentially represent the knowledge that the ANN "learns" from the input patterns. This is analogous to the constants that are "learned" by an MLR model. In fact, all elements of the ANN and MLR model architectures are analogous. The ANN input layer maps to the the independent X variables, the ANN output layer maps to the dependent Y variable, and the transition from input X values and the Y value that is achieved in MLR by multiplication and addition is achieved by a feed forward through synaptic connections, hidden layer neurons and Sigmoid functions in an ANN.
With such a one-to-one architecture mapping between ANNs and MLR models, it is easier to see them as having similar intellectual power. That's not to say they're equivalent, as ANNs are far more powerful. It's just that they're roughly the same (low) order of magnitude with respect to human intellect, and in terms of Bloom's Taxonomy, we call that order of magnitude "knowledge storage/retrieval".
Despite being in the lowest order of magnitude of intellect, the realm of today's artificial intelligence includes many interesting knowledge storage/retrieval techniques that are worth comparing and contrasting to see the range and limits of their power and the use cases they address. Stay tuned!
Modified by John M. Boyer
Ever since my first blog entry in this recent series on artificial intelligence, I've been highlighting the lesser, calculational nature of machine intelligence and learning-- as well as the valuable role it nonetheless can play in driving more effective human understanding and decisions. I've been doing this by articulating mainly what machines do, as that is the primary interest of mine and most who would read a developerWorks blog. Still, our interests will be served by taking an entry to discuss human learning as a counterpoint or contrast.
The multiple linear regression example in my last post is a good example to start with because it highlights the difference between accuracy versus understanding. If there is a linear relationship among the data, then an MLR can have a very high predictive accuracy, but it has no explanatory power whatsoever. The MLR model does not have, nor does it convey, any understanding as to why the relationship exists.
Let's see how this predictive accuracy rates in terms of human intelligence and learning. In this case, we can benefit from an instance of that delightful human propensity to apply ideas to themselves. Specifically, we humans have applied our learning abilities to the phenomenon of our learning abilities, with many useful results including Bloom's Taxonomy.
According to Bloom's taxonomy, the very lowest level of cognitive learning is the knowledge level, or the ability to remember and recall what is learned. When you think about it, you realize that an MLR model, like many predictive analytics, is really a storage mechanism for something that has been machine learned from data. In MLR, we store the constants of a linear formula as the representation of what has been learned from linearly related data.
The next higher level of Bloom's taxonomy is comprehension, which is where understanding and true explanatory power begin to surface. But human learning is so much more sophisticated than the knowledge level of machine learning that there are a number of levels above comprehension. There's the application level, in which we can use our knowledge to solve new problems, including being able to explain why the new solution works. The analysis level drills deeper into our ability to make inferences and generalizations. The synthesis level begins to get at our ability to be creative with what we've learned and come up with new ideas and solutions. Finally, the evaluation level gets at our ability to be subjective and judge quality and creativeness of ideas and solutions. We are beginning to see some faint glimmers of some elements of some of these levels in cognitive computing efforts like IBM Watson, but it is early days indeed.
While we're on the subject of human learning and Bloom's Taxonomy, it makes sense to digress for a bit and mention the IBM Social Learning product. This is a SaaS educational platform intended to help enterprises achieve a Smarter Workforce. A few reasons for the digression are
this is a product I currently work on,
both it and the whole Smarter Workforce initiative are being featured at the IBM Connect 2014 conference being held right now, and
learning is a key ingredient of how a human workforce becomes smarter.
The IBM Social Learning product has a very nice feature that enables educational administrators to implement Bloom's Taxonomy in their learning materials. A component of the product is the Kenexa LCMS, or learning content management system, which includes various subcomponents like a course designer and a metadata dictionary. The educational administrator can add any metadata tag, such as "Learning Goal", and any tag values, such as "Basic Knowledge", "Comprehension", "Application", etc. Once this is done, the educational administrator can use the metadata tag values to classify any learning item in the LCMS accord to Learning Goal. Once these classified learning materials are published, learners can use the "Learning Goal" as a new faceted search criterion in the platform's learning library. A learner would be able to isolate and focus on "knowledge" level learning in a subject area before proceeding to comprehension and then application, for example. This will enable learners to effectively use the natural way in which their learning blooms, i.e. Bloom's Taxonomy.
Finally, there is an aspect of human learning that goes beyond Bloom's taxonomy, and it's an area that is highlighted by the IBM Social Learning product. There is a very important word in the product title: Social. This is crucial because it underscores the central role of communication and collaboration in the human learning process. We are an order of magnitude more effective at learning based on our interconnectedness to others who think and learn, rather having access to just data. This is pertinent to the advancement of artificial intelligence because "social" goes quite beyond the computing architecture underlying a lot of today's machine learning efforts.
Modified by John M. Boyer
Machine learning today is every bit as calculated, as simulated, as is machine intelligence. It is easier to use machine intelligence to highlight how much greater human cognition is, which is why I've been using a machine intelligence algorithm over the last several entries. However, the conclusion drawn so far is that, while machine intelligence is only simulated, it is still quite effective and valuable as an aid to human insight and decision making. Machine learning offers another leap forward in the effectiveness and hence value of machine intelligence, so let's see what that is.
Machine learning occurs when the machine intelligence is developed or adapted in response to data from the domain in which the machine intelligence operates. The James Blog entry only does this degenerately, at a very coarse grain level, so it doesn't really count except as a way to begin giving you the idea. The James Blog entry plays a game with you, and if he loses, he adapts by increasing his lookahead level so that his minimax method will play more effectively against you next time. In some sense, he learned that you were a better player. However, this is only a single integer of configurability with only a few settings of adjustment that controls only one aspect of the machine intelligence algorithm's operation. To be considered machine learning, a method must typically have a more profound impact on the operation of the algorithm, with much more adaptation and configurability based on many instances of input data. An example will clarify the more fine grain nature of machine learning.
The easiest example of which I can think is a predictive analytic algorithm called linear regression. Let's say you'd like to be able to predict or approximate the purchase price of a person's new car based on their age. Perhaps you want to do this so that you can figure out what automobile advertisements are most appropriate to show the person. Now, as soon as you hear this example, your human cognition kicks in and you rattle off several other likely variables that would impact the most likely amount of money a person is willing to spend on a car, such as their income level, debt level, nuclear familial factors, etc. This analytic technique is typically called multiple linear regression (MLR) exactly because we humans most often dream up many more than two variables that we want to simultaneously consider. Like most machine learning techniques, MLR does not learn of new factors to consider by itself. It only considers those factors that a human has programmed it to consider. When they are well chosen, additional variables typically do make an MLR model more effective, but for the purpose of discussing the concept of machine learning, the simple two-variable example suffices since your mind will have no problem generalizing the concept.
Suppose you have records of many prior car purchases, including a wide and nicely distributed selection of prices of the cars and ages of their buyers. This is referred to as "training data". If you plotted the training data, it might look something like the blue points in the image below. Let purchase price be on the vertical Y axis since it is the "dependent" variable that we want to predict, and let age be on the X-axis since it is a predictor, or "independent" variable. MLR uses a standard formula to compute a "line of best fit" through the given data points, again like the one shown in red in the picture.
A line has a formula that looks like this: Y=C1X1+C0, where C1 is a constant that governs the slant (slope) of the line, and C0 is a constant that governs how high or low the line is (C0 happens to be the point where the line meets the Y-axis, and the line slopes up or down from there). If we had more variables, then MLR would just compute more constants to go with each of them. For example, if we wanted to use two variable predictors of a dependent variable, then we'd be using MLR to create a line of the form Y=C2X2+C1X1+C0.
Technically, MLR computes the constants like C1 and C0 of the line Y=C1X1+C0 in such a way that the line minimizes the sum of the squares of the vertical (Y) distances between each data point and the line. For each point, we take its distance from the line as an amount of "error" in the prediction. We square it because that gets rid of the negative sign (and, less importantly, magnifies the error resulting from being further from the line). We sum the squares of the errors to get a total measure of the error produced by the line, and the line is computed so as to minimize that total error.
Once the constants have been computed, it is a trivial matter to use the MLR model as a predictor. You simply plug the known values of the predictor variables into the formula to compute the predicted Y-value. In the car buying example, X1 is the age of a potential buyer, and so you multiply that by the C1 constant, then add C0 to obtain the Y-value, which is the predicted value of the car.
In this way, hopefully you can see that the MLR "learns" the values of the constants like C1 and C0 from the given data points. Furthermore, the actual algorithm that produces the machine intelligence only computes the result of a simple linear equation, so hopefully you can also see that the predictive power comes mainly from the constants, which were "learned" from the data. In the case of the minimax method, most of the machine intelligence came from the algorithm, but with MLR-- as with most machine learning-- the machine intelligence is for the most part an emergent property of the training data.
Lastly, it's worth noting that there are a lot of "best practices" around using MLR. However, these are orthogonal to topic of this post. Suffice it to say that just like the minimax method has a very limited domain in which it is effective as a machine intelligence, MLR also has a limited domain. For example, the predictor variables (the X's) do need to be linearly related to the dependent variable in reality. However, within the limited domain of its linearly related data, MLR is quite effective and an excellent example of a simple machine learning technique that produces machine intelligence within that domain.