This is the second in a series of posts about developing predictive models. In response to a challenge on kaggle.com, we are developing a model to predict home sale prices from data points describing various features of each home. Model accuracy is measured by Root Mean Squared Error (RMSE), which summarizes how far off our predictions are, so a smaller number is better.
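As a quick illustration of the metric (a minimal sketch, not the competition's exact scoring code), RMSE is the square root of the mean of the squared prediction errors:

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Squared Error: sqrt of the mean squared difference
    between actual and predicted values."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Toy example: errors of 1 and 3 -> sqrt((1 + 9) / 2) = sqrt(5)
print(rmse([10.0, 12.0], [11.0, 9.0]))  # ~2.236
```

Because the errors are squared before averaging, a single very large error (such as an outlier) can dominate the score.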
In the first post in this series we:
- Prepared data for modeling
- Used a machine learning technique to help us select which variables to include in our model
- Trained our model using one-half of the data with sale prices included
- Predicted sale prices for the second half of the data (that does not include the sale prices)
- Submitted our predictions to the Kaggle website
- Beat the Kaggle benchmark and placed in the top 50% of all submissions
In this post, we'll look at model residuals. Residuals are the errors: in this case, the differences between the actual sale prices and the model's predictions on the training data. Looking at residuals is one of the first things we do to evaluate models and identify potential improvements.
A key assumption of linear regression is that the residuals are normally distributed with a mean of zero. The Residuals vs. Predictions plot below shows points scattered evenly around the red line with no obvious pattern, consistent with residuals centered at zero. A quick check confirms that the average of all the residuals is effectively zero.
The average of the model residuals = -4.614934 × 10⁻¹⁸
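This near-zero mean is expected: for any ordinary least-squares fit that includes an intercept, the residuals sum to zero by construction, so their mean differs from zero only by floating-point noise. A small sketch on synthetic data (not the Kaggle training set) shows the same behavior:

```python
import numpy as np

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, 200)

# Ordinary least-squares fit of a line (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# With an intercept in the model, the residual mean is zero up to
# floating-point rounding error.
print(residuals.mean())
```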
Outliers, Leverage, and Influence
As you can see on the graph below, there are a few points that are much farther from the red line than the others. These might be outliers.
Outliers have two properties of interest: leverage and influence. Leverage is an observation's potential to affect the fitted model, while influence is its actual effect on the model's predictions. Observations with large leverage or influence can hurt prediction accuracy, so we may want to exclude some outliers from our training data.
Cook's Distance is a metric that combines both leverage and influence. It quantifies how much the model changes if you remove a particular data point.
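The formula for Cook's Distance combines the squared residual with the observation's hat value (its leverage). The sketch below computes it from scratch with NumPy on synthetic data with one planted outlier; this is an illustration of the metric, not the code used in the analysis (which plotted Cook's Distance via Tableau):

```python
import numpy as np

# Synthetic data with one planted outlier at index 10.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 5.0 + rng.normal(0, 1.0, 50)
y[10] += 25.0  # plant an outlier

# OLS fit: design matrix with intercept column.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Cook's Distance: D_i = r_i^2 / (p * s^2) * h_ii / (1 - h_ii)^2,
# where h_ii is the leverage (diagonal of the hat matrix), p the
# number of parameters, and s^2 the residual variance.
p = X.shape[1]
hat = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)
s2 = resid @ resid / (len(y) - p)
cooks_d = resid**2 / (p * s2) * hat / (1 - hat) ** 2

# The planted outlier dominates the Cook's Distance values.
print(int(np.argmax(cooks_d)))
```

In practice a library routine (for example, statsmodels' influence diagnostics) would compute these values, but the from-scratch version makes the leverage/influence combination explicit.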
We used Tableau to make a more detailed plot. The Cook's Distance value is the size of the circle:
We can see that there were a few observations with very large values. The largest, observation 1299, had a predicted sale price of $442,158 but an actual price of $160,000. There could be any number of reasons for this, including a data entry error in the original data. Whatever the reason, this outlier impairs model accuracy, so we removed it.
What to remove?
We wanted to remove only the outliers that hurt our model's predictive power. If we removed too many or too few, the model would not predict as well on the test data. By evaluating model RMSE, we identified a Cook's Distance threshold of 0.35 for our training data. This meant we would exclude six outliers from the training data. The plot below shows, in red, the six we removed.
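Applying the threshold is a simple mask over the training rows. The sketch below uses toy stand-in arrays (`cooks_d`, `X_train`, `y_train` are hypothetical, not the actual competition data):

```python
import numpy as np

# Toy stand-ins: Cook's Distance per training row, plus features
# and target aligned with those rows.
cooks_d = np.array([0.02, 0.41, 0.03, 0.55, 0.01, 0.36])
X_train = np.arange(12, dtype=float).reshape(6, 2)
y_train = np.array([1.0, 9.0, 2.0, 11.0, 1.5, 8.0])

# Keep only rows at or below the threshold found via RMSE tuning.
threshold = 0.35
mask = cooks_d <= threshold
X_clean, y_clean = X_train[mask], y_train[mask]

print(len(y_clean))  # 3 of the 6 toy rows survive the cut
```

The model is then refit on `X_clean`/`y_clean` and used to predict the untouched test data.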
We re-ran our model using the adjusted training data, and used the revised algorithm to predict sale prices for the test data. Removing the outliers from our training data made a big difference! We moved up 173 positions on the competition leaderboard, and the RMSE improved from 0.13 to 0.122. We are now in the top 25% of submissions for the competition overall.
The data, code, and images for the analysis described in this post are available here.