In this article, we’ll cover how to measure the quality of the TensorFlow regression model covered in a prior post. As usual, the code for the quality measurements can be obtained from my TensorFlow Samples repository, and you can use this code in IBM Data Science Experience / Watson Studio. The code is also written generically so that you can apply it to models built with other libraries, too.
A regression model solves a kind of problem that can’t be solved with a classification algorithm. A data scientist trains and uses a regression model when the variable being predicted is a continuous quantity or an ordinal quantity with a large value space. For example, if the input is an image of one of ten numeric digits, a classification model would predict which digit it is. Even though numbers are comparable, there’s nothing about an image of a two that makes the image less than an image of a three, any more than an image of a cat would be less than an image of a dog (as if!). On the other hand, a regression model would be used to predict a property value (essentially continuous) or the number of hours after a medical procedure that a patient will need to stay in an intensive care unit (ordinal with a high value space).
The model in the prior post was a linear regression model that used matrix operations to determine a line of ‘best fit’ for the housing data. The data had 9 variables, including the median house value that the model learned to predict. So, the ‘best fit’ line (strictly speaking, a hyperplane, since there are 8 predictor variables) is calculated to flow through 9-dimensional space in a way that is closest, overall, to all the 9-dimensional data points in the housing data.
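As a rough illustration of what fitting with matrix operations can look like, here is a minimal sketch that solves a least-squares problem with NumPy on synthetic data. The data, the names `true_weights` and `true_bias`, and the choice of `np.linalg.lstsq` are assumptions made for this example; the TensorFlow code in the prior post may well be formulated differently.

```python
import numpy as np

# Synthetic stand-in for the housing data: 8 predictors plus 1 target = 9 variables.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))
true_weights = np.arange(1.0, 9.0)   # hypothetical coefficients
true_bias = 0.5                      # hypothetical intercept
y = X @ true_weights + true_bias     # noise-free, so the fit can be exact

# Append a column of ones so the intercept is learned as one more weight.
X_with_bias = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution of X_with_bias @ theta ~= y.
theta, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)

fitted_weights, fitted_bias = theta[:-1], theta[-1]
predicted = X_with_bias @ theta
```

Because the synthetic target has no noise, the fitted weights should match the hypothetical ones almost exactly. Real housing data is noisy, which is why the fit quality needs to be measured at all.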
But how good is the fit of that ‘best fit’ line? Sometimes the ‘best fit’ line is not a good fit because the predictor variables are not linearly related to the dependent variable. At other times, there might be a linear relationship at a statistically significant level, but the model is still not a great fit because the relationship, and hence the data, is noisy. So, how do we measure whether we have a good regression model, an excellent one, or a poor one?
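To make those two failure modes concrete, here is a small sketch on synthetic data. It scores a straight-line fit with the R squared metric that the rest of this article explains. The helper name `fit_line_r_squared`, the parabola, and the noise level are all assumptions chosen for illustration.

```python
import numpy as np

def fit_line_r_squared(x, y):
    """Fit y ~= slope * x + intercept by least squares and return R squared."""
    slope, intercept = np.polyfit(x, y, deg=1)
    predictions = slope * x + intercept
    ss_res = np.sum((y - predictions) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 1001)

# Case 1: a strong but nonlinear relationship (a symmetric parabola).
y_curved = x ** 2
# Case 2: a genuinely linear relationship buried in heavy noise.
y_noisy = 2.0 * x + rng.normal(scale=6.0, size=x.shape)

r2_curved = fit_line_r_squared(x, y_curved)
r2_noisy = fit_line_r_squared(x, y_noisy)
```

In the first case the line explains essentially none of the variance even though y is fully determined by x. In the second case the linear relationship is real, but the noise keeps the score low.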
The R squared metric is a ratio that indicates the proportion of the data’s variance from the mean that is accounted for by the regression model’s predictions. Before we unpack the meaning of that statement, let’s first have a look at the library function you’d normally use to get the measurement. The variable ‘predicted_values’ contains a one-dimensional array of predicted median house values generated using the trained linear regression model. To prepare for the R squared calculation, we flatten the actual median house values into the same one-dimensional format, and then we use the scikit-learn function that calculates R squared for us:
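    import numpy as np
    from sklearn.metrics import r2_score

    # Flatten the actual values so they match the shape of the predictions
    y_actual = np.ndarray.flatten(housing_target)
    # r2_score expects the actual values first, then the predictions
    R2 = r2_score(y_actual, predicted_values)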
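The flattening step matters more than it might seem. Here is a small sketch of what can go wrong, assuming the targets are stored as a column of shape (n, 1) while the predictions are a flat array of shape (n,). That layout is an assumption about the housing data, and the values are made up.

```python
import numpy as np

# Hypothetical stand-ins: targets stored as a column, predictions as a flat array.
housing_target = np.array([[2.0], [3.0], [5.0]])   # shape (3, 1)
predicted_values = np.array([2.5, 2.5, 4.0])       # shape (3,)

# Without flattening, broadcasting pairs every target with every prediction.
unflattened_diff = housing_target - predicted_values   # shape (3, 3)

# After flattening, the subtraction is element by element, as intended.
y_actual = housing_target.flatten()
flattened_diff = y_actual - predicted_values            # shape (3,)
```

The call `housing_target.flatten()` is equivalent to `np.ndarray.flatten(housing_target)`. Either way, the point is that any hand-written error calculation silently produces an n-by-n grid of differences if the two shapes disagree.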
The result in this case is a touch more than 0.637. One may have a rough sense that this is good because, well, more than half of the variance from the mean is explained or accounted for by the regression model. To see what that means, suppose you were given each house’s predictor variable values and you always answered the average price. Your answers would then reflect a balance between sometimes being high and sometimes being low. The total variance of the actual house prices from your constantly mean answers (yes, I meant that) is called the total sum of squares, and you can calculate it yourself very easily like this:
    y_bar = np.mean(y_actual)
    SStot = 0.0
    for y_i in y_actual:
        diff = float(y_i - y_bar)
        SStot += (diff * diff)
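For larger data sets, the same quantity can be computed without an explicit loop. This sketch compares the loop with a vectorized NumPy version, using made-up values for `y_actual`. It also shows that the total sum of squares is just the number of data points times NumPy's population variance.

```python
import numpy as np

y_actual = np.array([2.0, 4.0, 5.0, 4.0, 5.0])  # illustrative values only

# Loop version, as in the text.
y_bar = np.mean(y_actual)
ss_tot_loop = 0.0
for y_i in y_actual:
    diff = float(y_i - y_bar)
    ss_tot_loop += diff * diff

# Vectorized version: subtract the mean from every element, square, and sum.
ss_tot_vectorized = float(np.sum((y_actual - y_bar) ** 2))

# The same quantity expressed through NumPy's (population) variance.
ss_tot_from_var = float(y_actual.size * np.var(y_actual))
```

All three values should agree up to floating-point rounding. That agreement is why the total sum of squares can loosely be called the total variance.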
On the other hand, residual variance is the variance that is unexplained or not accounted for by the regression model. In other words, it is the variance that’s left over if you use the regression model to predict values instead of using the mean. It is easily computed like this:
    SSres = 0.0
    for i, f_i in enumerate(predicted_values):
        diff = float(f_i - y_actual[i])
        SSres += (diff * diff)
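The residual sum of squares is closely related to two other common error measures. Here is a minimal sketch with made-up values that computes it in vectorized form and then derives the mean squared error and its square root from it.

```python
import numpy as np

y_actual = np.array([2.0, 4.0, 5.0, 4.0, 5.0])          # illustrative values only
predicted_values = np.array([2.8, 3.4, 4.0, 4.6, 5.2])  # illustrative values only

# Residuals are the gaps between what happened and what the model said.
residuals = y_actual - predicted_values
ss_res = float(np.sum(residuals ** 2))

# Dividing by the number of points gives the mean squared error (MSE);
# its square root (RMSE) is in the same units as the predicted quantity.
mse = ss_res / y_actual.size
rmse = float(np.sqrt(mse))
```

Unlike R squared, these measures depend on the units of the target, so they are harder to compare across different problems.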
The ratio of the residual to total variance is the portion of unexplained error, and subtracting that from 1 gives the portion of variance explained by the regression model, which is R squared and is calculated easily as follows:
    R_squared = 1.0 - SSres / SStot
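Putting the pieces together, here is a minimal sketch of a reusable function, followed by three checks on made-up data. The function name `r_squared` and the decision to raise an error when all actual values are equal are choices made for this example, not the behavior of any particular library.

```python
import numpy as np

def r_squared(y_actual, predicted_values):
    """Return 1 - SSres / SStot for two equal-length one-dimensional sequences."""
    y_actual = np.asarray(y_actual, dtype=float).ravel()
    predicted_values = np.asarray(predicted_values, dtype=float).ravel()
    if y_actual.shape != predicted_values.shape:
        raise ValueError("actual and predicted values must have the same length")
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
    if ss_tot == 0.0:
        raise ValueError("R squared is undefined when all actual values are equal")
    ss_res = np.sum((y_actual - predicted_values) ** 2)
    return float(1.0 - ss_res / ss_tot)

y = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y, y)                              # every prediction is exact
mean_only = r_squared(y, np.full_like(y, y.mean()))    # always answer the mean
worse_than_mean = r_squared(y, y[::-1])                # predictions in reverse order
```

Perfect predictions score 1, always answering the mean scores 0, and predictions that are worse than the mean score below 0. So R squared is not confined to the range from 0 to 1 when a model is evaluated on arbitrary predictions.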
The following illustration graphically depicts the difference that a linear regression model makes in accounting for variance. On the left, you see the results of constantly using the mean as the predicted value. Each data point is some distance from the mean line, and the square of that distance is that data point’s contribution to the variance. The sum of the large reddish squares’ areas gives the total variance of the actual data values from the mean. On the right, you can see smaller blue squares of residual variance of the actual data points from the predicted values of the linear regression model.
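The two panels can also be reproduced numerically. This sketch uses five made-up data points and a line fitted with `np.polyfit`, both of which are assumptions for illustration. It computes the area of each square in the left and right panels.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Left panel: always predict the mean.
mean_line = np.full_like(y, y.mean())
squares_from_mean = (y - mean_line) ** 2

# Right panel: predict with a least-squares line.
slope, intercept = np.polyfit(x, y, deg=1)
regression_line = slope * x + intercept
squares_from_regression = (y - regression_line) ** 2

total_area = squares_from_mean.sum()                  # about 6.0
residual_area = squares_from_regression.sum()         # about 2.4
explained_share = 1.0 - residual_area / total_area    # about 0.6
```

Note that not every individual square shrinks. A point that happens to sit exactly on the mean line has a square of zero area on the left and a small one on the right. It is the sum of the areas that a least-squares line with an intercept never makes larger on the data it was fitted to.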
Now you have a rough intuitive sense that the housing price model was a good model based on an R squared of 0.637. But, the precision of the intuition is like a rare steak. It tastes good, but we all know that data scientists are people, and people shouldn’t eat undercooked meat.
So, what is good, fair, poor, or excellent for R squared? A number of sources say that an R squared of 0.25 is a large effect size. However, that is large for detecting the effect of a treatment (e.g. a psychological technique, educational module, or medication). A good R squared for a treatment's effect size is different from (and less than) the R squared that would correspond to a good predictive model.
In a 2015 study, a group of medical researchers created a new regression model for predicting the required length of stay in intensive care after heart surgery. The benchmark model in use at the time had an R squared of 0.356. This is consistent with answers I received while interviewing a few data scientists, who indicated that R squared values in the 0.3’s and 0.4’s would correspond to serviceable predictive models. Since they also said they’d want to keep experimenting to get better results, it would be fair to say that 0.3’s and 0.4’s are ‘fair’ values for R squared for a predictive model.
The purpose of the 2015 study, though, was to present the researchers’ new regression model, which had a much-improved R squared of 0.535. The “delighted tone” (Lewis, 2016, p. 79) the researchers had when describing the new model was due to the magnitude of improvement in R squared, so it’s reasonable to conclude that the new R squared should be described with a qualitatively higher qualifier. As such, it is a ‘good’ R squared value. More generally, 0.5’s and 0.6’s would be considered ‘good’ to ‘quite good’ according to the data scientists I interviewed.
When asked ‘what is a good R squared,’ the data scientists I interviewed did, of course, start with admittedly reasonable disclaimers like “It depends on what you’re doing” and “it depends on the current benchmark.” But the characterizations above and below assume you don’t have answers for those dependencies. R squared values in the 0.7’s were generally regarded as excellent, and the 0.8’s were outstanding. This left the 0.9’s in the realm of practically unachievable. Put another way, in real-world scenarios, an R squared in the 0.9’s will be practically as rare as the occasions on which one should eat undercooked meat.
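The rough bands described above can be collected into a small helper. This is only a sketch of the informal rubric in this article. The function name `describe_r_squared` is hypothetical, the ‘good’ band also covers ‘quite good’, and the ‘poor’ label for values below 0.3 is my assumption, since the interviews did not characterize that range.

```python
def describe_r_squared(r2):
    """Map an R squared value to the informal labels used in this article."""
    if r2 > 1.0:
        raise ValueError("R squared cannot exceed 1")
    # (lower bound, label) pairs, checked from the highest band down.
    bands = [
        (0.9, "practically unachievable"),
        (0.8, "outstanding"),
        (0.7, "excellent"),
        (0.5, "good"),
        (0.3, "fair"),
    ]
    for lower_bound, label in bands:
        if r2 >= lower_bound:
            return label
    return "poor"  # assumption: the article does not label values below 0.3

# The housing model from this article, with an R squared of about 0.637.
housing_label = describe_r_squared(0.637)
```

As the interviewees cautioned, labels like these are a fallback for when you don't know the problem domain or the current benchmark, not a substitute for either.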
Lewis, N.D., 2016. Deep Learning Step by Step with Python. www.AusCov.com