In a prior article, we trained a linear regression model to predict house values and obtained a reasonably good R squared quality metric. However, there may still be a lot of room for training a higher quality predictive model, even using the same housing data. One way to explore whether, or to what extent, this is true is to visually analyze the data set. Specifically, we will examine the variable being predicted (house value) against each predictor variable in isolation to see whether any patterns stand out. This is especially important because the machine learning algorithm was linear regression, so if clear non-linear relationships exist in the data, then we will know there is room to create an improved model with a higher R squared value.
As a prerequisite to doing data analysis on the housing data, we first import a couple of libraries and then read the data in; these will let us look at scatterplots of the dependent variable (house price) against each predictor variable:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df_data_1 = pd.read_csv('cal_housing_data with headers.csv')
Next we put the data into a numpy array and then isolate the data for the dependent variable into vector y:
data = np.array([x for x in df_data_1.values])
y = np.delete(data, slice(0, 8), axis=1)  # drop columns 0-7, keeping only column 8, the house value
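To sanity-check what that np.delete call keeps, here is a toy example on a small synthetic array with the same nine-column layout (house value in the last column); the numbers are invented for illustration:

```python
import numpy as np

# Toy stand-in for the housing array: 3 rows, 9 columns,
# with the dependent variable (house value) in column 8.
data = np.arange(27).reshape(3, 9)

# Deleting columns 0..7 leaves only column 8 -- the dependent variable y.
y = np.delete(data, slice(0, 8), axis=1)
print(y.shape)  # (3, 1)

# Equivalent, and arguably clearer: slice the last column directly.
y_alt = data[:, 8:9]
print(np.array_equal(y, y_alt))  # True
```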
Then, to get a scatterplot for any predictor variable x, we can use code like the following to plot that variable's data against the dependent variable y:
x = np.delete(data, slice(1, 9), axis=1)  # drop columns 1-8, keeping only column 0, the longitude
color = "#0000BF"
plt.scatter(x, y, c=color, s=1)
plt.title('Longitude vs. House Price')
plt.xlabel('Longitude')
plt.ylabel('House Price')
plt.show()
Since the code is just one click away, I won't keep repeating minor variations of it for the other predictor variables. Instead, we'll analyze the data patterns now.
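If you'd rather not copy the snippet for each predictor, a small helper can generate all the plots in a loop. This is just a sketch: the column names are assumptions based on the CSV headers, and figures are saved to files (via a headless backend) rather than shown:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; remove this line to display plots interactively
import matplotlib.pyplot as plt

# Assumed predictor names, in column order, from the CSV headers.
COLUMN_NAMES = ['Longitude', 'Latitude', 'House Median Age', 'Total Rooms',
                'Total Bedrooms', 'Population', 'Households', 'Median Income']

def scatter_predictor(data, y, col, color="#0000BF"):
    """Scatter predictor column `col` against house price y, saved as a PNG."""
    x = data[:, col:col + 1]
    plt.figure()
    plt.scatter(x, y, c=color, s=1)
    plt.title(f'{COLUMN_NAMES[col]} vs. House Price')
    plt.xlabel(COLUMN_NAMES[col])
    plt.ylabel('House Price')
    plt.savefig(f'scatter_{col}.png')
    plt.close()

# One call per predictor would cover every plot in this article:
# for col in range(len(COLUMN_NAMES)):
#     scatter_predictor(data, y, col)
```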
In the plot of house longitude versus price below, there is clearly a pattern in the data, and it is not linear. In other words, a linear fit may be somewhat helpful (after all, we did get a decent R squared overall), but the pattern for this predictor variable is more reminiscent of a quadratic (parabolic) curve or an inverted quartic curve, a degree-four polynomial with two concave-down humps. Although it is not often immediately evident why a particular pattern exists in data, in this case it's fairly obvious: the two price 'humps' correspond to high-value properties in the Los Angeles and Silicon Valley areas.
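To make the "a quartic could beat a line" intuition concrete, here is a small numpy sketch on synthetic two-hump data. The shape and noise level are invented for illustration, not taken from the housing set:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-124, -114, 2000)            # synthetic "longitudes"
true = -((x + 122) ** 2) * ((x + 118) ** 2)  # quartic with two humps (peaks at -122 and -118)
y = true - true.min() + rng.normal(0, 20, x.size)  # shift positive, add noise

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Fit a degree-1 and a degree-4 polynomial and compare R squared.
r2_linear = r_squared(y, np.polyval(np.polyfit(x, y, 1), x))
r2_quartic = r_squared(y, np.polyval(np.polyfit(x, y, 4), x))
print(r2_linear, r2_quartic)  # the quartic fits this shape markedly better
```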
When we examine the scatterplot below of latitude versus house prices, we see a similar-looking pattern of two humps. This is because scanning northward toward increasing latitude crosses the same regions where house prices are highest. If you look a little closer, you can even see a more intricate pattern: higher prices around San Diego, then Los Angeles, then a taper until the San Jose / San Francisco area, then another taper, slower this time because of places like Sacramento. Who knows how intricate a pattern we might discern if we look hard enough with our neural networks?
At the opposite end of the recognizable pattern spectrum, we have below the plot of house median age versus house price. The best we can say is that it looks like a hot mess. The linear regression model may be getting a tiny bit of R squared mileage out of a line, but adding virtually any variable can slightly boost R squared without really capturing any kind of useful relationship to the dependent variable. When a variable shows a plot that looks this much like randomness, it's worth testing whether it would be better to just leave it out and save compute resources for processing better predictors.
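The "leave it out" test can be sketched with ordinary least squares on synthetic data; all of the names and numbers here are illustrative, with one genuinely informative predictor and one pure-noise predictor standing in for a hot-mess variable:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x_good = rng.normal(size=n)
x_noise = rng.normal(size=n)              # pure noise: no relation to y
y = 3.0 * x_good + rng.normal(size=n)

def fit_r2(X, y):
    """Ordinary least squares with an intercept; returns training R squared."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    y_hat = X1 @ beta
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

r2_with = fit_r2(np.column_stack([x_good, x_noise]), y)
r2_without = fit_r2(x_good.reshape(-1, 1), y)
print(r2_with - r2_without)  # tiny: the noise column buys almost nothing
```

Training R squared never decreases when a column is added, which is exactly why a small boost from a variable is weak evidence that it captures anything useful.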
The plot below of total rooms against house prices shows an interesting and different pattern. It's quite reasonable to read the pattern as a steeply sloped line, and hence to conclude that linear regression is appropriate. And it's not that surprising a pattern: more living space, higher cost, simple as that. However, it may be possible to increase R squared with, say, a thin concave-down parabola. Experimentation is the only way to tell.
Also not surprisingly, there is a similar pattern in the scatterplot below comparing total bedrooms (column 4) and housing prices. It's remotely possible that, in this case, a cubic relationship could do slightly better than a parabola, but more time-consuming experimentation would be needed to test this possibility. If only there were a way to automate the testing for such patterns... :-)
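As a taste of what such automation could look like, here is a hypothetical sketch that fits polynomials of several degrees and scores each on held-out data, keeping the best. The data-generating function is an assumption made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 1000)
y = 0.5 * x ** 2 + rng.normal(0, 2, x.size)  # assume a quadratic ground truth

# Hold out the last 200 points for scoring.
split = 800
x_tr, y_tr = x[:split], y[:split]
x_te, y_te = x[split:], y[split:]

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Score each candidate degree on the held-out data.
scores = {d: r2(y_te, np.polyval(np.polyfit(x_tr, y_tr, d), x_te))
          for d in (1, 2, 3, 4)}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Scoring on held-out data rather than the training set is what keeps the search from always preferring the highest degree.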
For different reasons, there is a similar-looking pattern between population and house prices in the plot below. Based on supply and demand alone, it's easy to see a line with a sharp upward slope as reasonably reflective of this data, but again there are nuances suggesting that a quadratic or cubic curve might be a somewhat better fit. Still, when you see a pattern like this, doing manual work to find a better fit is not a high priority.
As a related measure of population, the number of households shows a similar pattern against house prices in the plot below, so again, linear regression is a good model.
The scatterplot of median income against house price shows the nicest example of a linear relationship. One might easily assume a linearly increasing trend between earning more money and buying a more expensive house, but it's still best to look at the data to make sure it matches your assumptions. As with any variable, it may be possible to do better with a polynomial, such as one that produces slight bends in the line, but if you have to do it manually, that would be the lowest priority given such a clearly linear pattern.
In summary, we've now seen that a number of the variables fit reasonably well with the assumption of linearity, which helps explain why the linear regression model achieved a good R squared metric. And yet we've also seen that some variables, especially longitude and latitude, have clearly non-linear patterns, which suggests there is a better predictive model out there. In the upcoming work, we'll explore how to build it... stay tuned!