This is the third post in a series about developing predictive models. In response to a challenge on kaggle.com, we are developing a model to predict home sale prices from data points describing various features of each home. Model accuracy is measured by Root Mean Squared Error (RMSE). RMSE measures how far off our predictions are, so a smaller number is better.
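As a quick illustration of the metric, here is a minimal sketch of how RMSE can be computed. The function name `rmse` and the sale prices are made up for illustration; they are not taken from the competition data or from our analysis code.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Square each error so that large misses count more than small ones
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical sale prices, for illustration only
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]
print(round(rmse(actual, predicted), 2))
```

Because the errors are squared before averaging, one prediction that misses by 20,000 hurts the score more than two that each miss by 10,000.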
In the first two posts we:
- Prepared the data for modeling
- Used a machine learning technique to help us select which variables to include in our model
- Trained our model using one-half of the data with sale prices included
- Evaluated the model residuals (errors) to identify outliers in our training data
- Used our model to predict the sale prices for the second half of the data (which does not include sale prices)
- Submitted our predictions to the Kaggle website
- Placed in the top 25% of all submissions
In this post, we will use some basic feature engineering to improve our model. The general idea is to use the features in our data set to create new features. The new features should squeeze some extra information out of the data, and therefore improve our model. Feature engineering is a very broad topic and a central idea in machine learning. Exactly how to do it combines both art and science. Although it can be automated, we feel that applying a little common sense to data we understand well serves us best in most cases.
Location, Location, Location
Most real estate agents will tell you that the most important factor in the selling price of a home is its location. We have the neighborhood as a feature; can we combine that piece of information with other features in our data? We broke several quantitative features into equal partitions and evaluated their relationships to neighborhood and sale price. The chart below breaks down home square footage into small, medium, and large and then plots the sizes by neighborhood. We sorted the neighborhoods from left to right in ascending order of average sale price:
We see that there are many more medium-sized homes (yellow) than either small or large homes. In addition, there are more small homes in neighborhoods where the average sale price is lower, and more large homes in the neighborhoods where it is higher.
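The size partitioning described above can be sketched roughly as follows. This is an illustration, not our actual analysis code: it assumes "equal partitions" means three equal-width bins between the smallest and largest living area, and the helper `size_category`, the neighborhood labels, and the square footages are all hypothetical.

```python
from collections import Counter


def size_category(area, low, high, labels=("small", "medium", "large")):
    """Assign an area to one of len(labels) equal-width bins spanning [low, high]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    width = (high - low) / len(labels)
    index = int((area - low) // width)
    # Clamp so the maximum value (and anything out of range) lands in an end bin
    index = min(max(index, 0), len(labels) - 1)
    return labels[index]


# Hypothetical living areas (square feet) and neighborhoods, for illustration only
living_area = [600, 1100, 1400, 1600, 1900, 2400, 3600]
neighborhoods = ["A", "A", "B", "B", "B", "C", "C"]

low, high = min(living_area), max(living_area)
sizes = [size_category(area, low, high) for area in living_area]

# Count homes of each size within each neighborhood, as in the chart
counts = Counter(zip(neighborhoods, sizes))
for (hood, size), n in sorted(counts.items()):
    print(hood, size, n)
```

Equal-width bins do not force the same number of homes into each group, which is why one size can dominate the counts.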
This next plot shows the distribution of sale prices by size, by neighborhood. We can see that there is a strong relationship between sale price and the combination of neighborhood and home size:
- The sale prices are increasing as we move from left to right
- The average sale price increases by size within neighborhoods
We revised our model to include the interaction of neighborhood and living area, and used the revised model to predict sale prices for the test data. We posted our updated predictions on Kaggle.com and moved up over 200 positions into the top 11% of all submissions. With three iterations of our model, we have moved from the top 50% to the top 25% to the top 11% of all submissions for the competition. Because of this latest iteration, we also have detailed information on the value of square footage by neighborhood. The chart below indexes the neighborhoods by the relative value of square footage in each.
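To make the idea of an interaction feature concrete, here is a minimal sketch of one way to build neighborhood-by-living-area columns. It assumes the interaction multiplies a 0/1 neighborhood indicator by the continuous living area; our actual model may have been specified differently. The function `add_interaction_features`, the field names, and the data are hypothetical.

```python
def add_interaction_features(rows, neighborhoods):
    """Build a design matrix with two columns per neighborhood:
    a 0/1 indicator and that indicator multiplied by living area."""
    matrix = []
    for row in rows:
        features = []
        for name in neighborhoods:
            in_hood = 1.0 if row["neighborhood"] == name else 0.0
            features.append(in_hood)                       # neighborhood-specific baseline
            features.append(in_hood * row["living_area"])  # neighborhood-specific value per square foot
        matrix.append(features)
    return matrix


# Hypothetical homes, for illustration only
homes = [
    {"neighborhood": "A", "living_area": 1200},
    {"neighborhood": "B", "living_area": 1800},
    {"neighborhood": "A", "living_area": 2400},
]

design = add_interaction_features(homes, neighborhoods=["A", "B"])
for features in design:
    print(features)
```

A linear model fit on columns like these learns a separate coefficient on living area for each neighborhood, which is the kind of per-neighborhood value of square footage discussed above.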
The data, code, and images for the analysis described in this post are here.