In a previous blog post we explored machine learning approaches to processing visual images, successfully sorting safe from unsafe driving behavior using dashboard camera images.
This time, we will predict home sale prices from data points describing various features of each home. Similar to our safe driving exercise, this data challenge comes from a kaggle.com competition. More specific information is available here.
To participate in the competition we will:
- Prepare the data for modeling (typically the majority of the work)
- Use a machine learning technique to help us select which variables to include
- Train our model using the half of the data that includes sale prices
- Predict sale prices for the second half of the data, which does not include them
- Evaluate model accuracy by posting our predictions to the Kaggle website
- Fine-tune our model, resubmit, and see how accurate we can get
Dean De Cock, professor at Truman State University, compiled the Ames Housing dataset for use in data science education. The data includes 2,919 individual home sales that occurred in Ames, Iowa between 2006 and 2010. A complete description of the original data is here.
The competition measures model accuracy with Root Mean Squared Error (RMSE), which sounds more complicated than it actually is. To calculate RMSE:
- Square the differences between each of our predictions and the actual values
- Add all of those up to get a total
- Divide by the total number of sales we predicted to get the average
- Take the square root of that average
RMSE is a measure of how far off our predictions are, so a smaller number is better. More info on RMSE from Kaggle is here.
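Written out, those four steps are only a few lines of code. The post's modeling is done in R, but an equivalent sketch in Python looks like this:

```python
import math

def rmse(predictions, actuals):
    """Root Mean Squared Error: square the errors, average them, take the root."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Example: predicted vs. actual (log) sale prices for three homes
preds = [12.0, 12.5, 11.8]
actuals = [12.1, 12.3, 11.9]
print(round(rmse(preds, actuals), 4))  # 0.1414
```

A perfect set of predictions would score 0, which is why a smaller RMSE is better.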
To get started, we did some light processing of the original data set in Alteryx, our go-to tool for ingesting and cleansing data. We wound up with 252 different characteristics of each property to estimate its sale price.
With the dataset cleaned up, we switched to R (open-source statistical software) to start building our model.
Check the data for normality
To create predictions, we fit the data with a multiple regression model. A key assumption of this approach is that the model's errors are normally distributed, which in practice usually means the dependent variable (what we are trying to predict) should not be heavily skewed. A graph of the distribution of home prices should look something like this:
We can see from the actual graph below that the Sale Prices are not normally distributed. We have a much longer tail on the right side than on the left.
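Beyond eyeballing the histogram, there is a quick numeric check for this: sample skewness, which is positive when the right tail is longer. The post's analysis is in R; below is a small pure-Python sketch, using simulated lognormal prices as a stand-in for the real column:

```python
import random
import statistics

# Simulated right-skewed "sale prices" (lognormal), standing in for the real data
random.seed(42)
prices = [random.lognormvariate(12, 0.4) for _ in range(1000)]

def skewness(xs):
    """Sample skewness: positive values indicate a longer right tail."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)

print(skewness(prices))  # clearly positive, as with the Ames sale prices
```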
To correct this, we replaced the Sale Price with the natural log of the Sale Price, a common fix for right-skewed data.
We applied the same transformation to three of the predictor variables as well. Taking the time to get the data in top shape often leads to better model accuracy.
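To sketch why the log transform helps, again using simulated lognormal prices as a stand-in for the real column: before the transform the mean sits well above the median, a hallmark of a long right tail; after taking logs the two nearly coincide.

```python
import math
import random
import statistics

# Simulated right-skewed sale prices (an assumption standing in for the real column)
random.seed(0)
sale_price = [random.lognormvariate(12, 0.4) for _ in range(1000)]

# Replace the target with its natural log, as described above
log_sale_price = [math.log(p) for p in sale_price]

# Raw prices: the long right tail drags the mean well above the median
print(statistics.fmean(sale_price) - statistics.median(sale_price))

# Log prices: mean and median nearly coincide, a sign the tail has been pulled in
print(statistics.fmean(log_sale_price) - statistics.median(log_sale_price))
```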
To build our model, we needed to evaluate which of the 252 variables to use. We used a stepwise regression process to sort through the possible combinations of predictor variables and return an optimized model. This reduced the number of predictors from 252 to 132. We then used R’s linear model function to fit the model.
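The actual work here uses R's stepwise and linear model functions. As an illustration of the underlying idea only, here is a simplified forward stepwise selection in Python: R's stepwise search considers both adding and dropping terms and scores candidates by AIC, while this sketch greedily adds the column that most reduces the residual sum of squares.

```python
import random

def ols_fit(X, y):
    """Least squares via the normal equations (X'X)b = X'y, solved by Gaussian elimination."""
    k, n = len(X[0]), len(y)
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    for col in range(k):                      # forward elimination with partial pivoting
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    beta = [0.0] * k                          # back substitution
    for r in range(k - 1, -1, -1):
        beta[r] = (xty[r] - sum(xtx[r][c] * beta[c] for c in range(r + 1, k))) / xtx[r][r]
    return beta

def rss(X, y, beta):
    """Residual sum of squares for a fitted coefficient vector."""
    return sum((yi - sum(b * xi for b, xi in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def forward_stepwise(features, y, min_improvement=0.01):
    """Greedily add the feature column that most reduces RSS, until gains are small."""
    n = len(y)
    selected = []
    current = [[1.0] for _ in range(n)]       # start with an intercept-only model
    best_rss = rss(current, y, ols_fit(current, y))
    while len(selected) < len(features[0]):
        best = None
        for j in range(len(features[0])):
            if j in selected:
                continue
            trial = [row + [features[i][j]] for i, row in enumerate(current)]
            r = rss(trial, y, ols_fit(trial, y))
            if best is None or r < best[0]:
                best = (r, j)
        if best_rss - best[0] < min_improvement * best_rss:
            break                             # stop when relative improvement is small
        best_rss, j = best
        selected.append(j)
        current = [row + [features[i][j]] for i, row in enumerate(current)]
    return selected

# Demo: y depends only on columns 0 and 2 of four candidate features
random.seed(1)
features = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [3 * f[0] - 2 * f[2] + random.gauss(0, 0.1) for f in features]
print(forward_stepwise(features, y))
```

The selection picks out the two informative columns (0, then 2, since 0 has the larger coefficient) and ignores the noise columns, which mirrors how stepwise selection trimmed 252 candidate predictors down to 132.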
Our first model explains about 92% of the variance in home sale prices in the training data.
Some of the top factors influencing sale price include:
- Overall quality assessment
- Above ground living area
- Condition of the basement
- The year built
Further analysis of the model output shows how influential each variable is on the eventual sale price. This is a key advantage of the multiple regression approach: its results are well understood and relatively easy to interpret.
Because of this advantage, we often start with a regression-based approach for client modeling projects. The insights gained from these types of models often lead to a better understanding of the factors influencing business results.
Where do we rank?
To put our model to the test, we used it to predict sale prices for the test data and submitted them to the kaggle.com website.
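Mechanically, the submission step amounts to scoring each test row with the fitted coefficients, undoing the log transform with exp(), and writing the two-column CSV that Kaggle expects. The sketch below illustrates this in Python; the coefficients are hypothetical placeholders, not our fitted model, and only two of the Ames predictors (GrLivArea, OverallQual) are shown.

```python
import csv
import io
import math

# Hypothetical coefficients on log(SalePrice) -- placeholders, not the fitted model
beta = {"intercept": 10.5, "GrLivArea": 0.0004, "OverallQual": 0.09}

# Two sample test rows (the real test set has 1,459 homes and 252 characteristics)
test_rows = [
    {"Id": "1461", "GrLivArea": 1710.0, "OverallQual": 7.0},
    {"Id": "1462", "GrLivArea": 1262.0, "OverallQual": 6.0},
]

def predict_price(row):
    """Predict the log price with the linear model, then undo the log transform."""
    log_price = (beta["intercept"]
                 + beta["GrLivArea"] * row["GrLivArea"]
                 + beta["OverallQual"] * row["OverallQual"])
    return math.exp(log_price)

# Kaggle expects a CSV with exactly two columns: Id and SalePrice
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Id", "SalePrice"])
writer.writeheader()
for row in test_rows:
    writer.writerow({"Id": row["Id"], "SalePrice": round(predict_price(row), 2)})
print(buf.getvalue())
```

Forgetting the exp() step is a classic mistake here: the model predicts log prices, but the submission (and the sanity check that predictions look like plausible dollar amounts) needs prices.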
The RMSE for our first submission was just over 0.13, significantly better than the Kaggle benchmark submission of 0.4 and good enough to place us in the top half of all submissions. As with most Kaggle competitions, the differences in model accuracy get smaller and smaller toward the top of the leaderboard, so we did well with our first try.
There is a lot left to do to improve the performance of our model. We will continue to track our progress up the competition leaderboard in subsequent blog posts.
The data, descriptions of the variables, our Alteryx workflow to cleanse the data, and all of the R code to create our first submission are available here for reference.