As part of the Flatiron School’s Data Science program, we were tasked with building a linear regression model using a dataset containing King County Housing Data from 2014–15. I decided to analyze this data from the perspective of a company that would like to build an app to help home buyers in their search for a house in the very competitive King County housing market.
The dataset contains just over 20,000 rows representing house sales in King County. The features include things like number of bedrooms, bathrooms, square feet, zip code, etc. Only 3 features contained null values: ‘waterfront’, ‘sqft_basement’ and ‘renovated’. I converted ‘waterfront’ and ‘renovated’ to categorical features with ‘yes’, ‘no’ and ‘unknown’ options. For ‘sqft_basement’, I first checked whether ‘sqft_living’ equaled ‘sqft_above’ plus ‘sqft_basement’. It did for almost all rows, so I replaced the ‘sqft_basement’ null values with ‘sqft_living’ minus ‘sqft_above’ and rechecked. After the fill, the relationship held for every row.
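The null handling described above can be sketched roughly as follows. This is a minimal illustration on a made-up four-row DataFrame, not the project code: the 0/1 encoding of ‘waterfront’ and the helper name `label_flag` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; all values are invented for illustration.
df = pd.DataFrame({
    "sqft_living":   [2000, 1500, 1800, 2400],
    "sqft_above":    [1500, 1500, 1200, 2000],
    "sqft_basement": [500, np.nan, 600, np.nan],
    "waterfront":    [0, np.nan, 1, 0],  # assumed 0/1 encoding with nulls
})

def label_flag(series):
    """Hypothetical helper: map a 0/1 column with nulls to yes / no / unknown."""
    return series.map({1.0: "yes", 0.0: "no"}).fillna("unknown")

df["waterfront"] = label_flag(df["waterfront"])

# Among rows with a known basement size, how often does
# sqft_living == sqft_above + sqft_basement hold?
known = df["sqft_basement"].notna()
consistent_share = (
    df.loc[known, "sqft_living"]
    == df.loc[known, "sqft_above"] + df.loc[known, "sqft_basement"]
).mean()

# Fill the missing basement sizes from the other two columns.
df["sqft_basement"] = df["sqft_basement"].fillna(
    df["sqft_living"] - df["sqft_above"]
)
```

Checking `consistent_share` before filling is what justifies the imputation: if the identity failed for many rows, the difference would not be a trustworthy stand-in.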
Many features had outliers. I opted to deal with them later in the process, first with RobustScaler from sklearn.preprocessing and then, in another iteration, with LocalOutlierFactor from sklearn.neighbors.
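To show how the two approaches differ, here is a small sketch on synthetic data. The data, the planted outliers and `n_neighbors=20` are assumptions for illustration; the settings used in the project may differ. RobustScaler keeps every row but scales with the median and interquartile range, while LocalOutlierFactor flags rows that can then be dropped.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import RobustScaler

# Synthetic two-feature data with three planted outliers in the first rows.
rng = np.random.default_rng(0)
X = rng.normal(loc=2000, scale=300, size=(200, 2))
X[:3] = [[15000, 14000], [20000, 500], [100, 18000]]

# Option 1: keep all rows, but center on the median and scale by the IQR
# so that extreme values do not dominate the scaling.
X_scaled = RobustScaler().fit_transform(X)

# Option 2: flag rows whose local density is much lower than that of
# their neighbors (-1 = outlier, 1 = inlier) and drop them.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)
X_inliers = X[labels == 1]
```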
Finally, I checked the numerical features for multicollinearity and removed ‘sqft_living’ and ‘grade’ based on the results.
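One common way to run this kind of check is to look for highly correlated feature pairs. The sketch below uses synthetic data and a 0.75 cutoff, both of which are assumptions; the project may have used a different method or threshold.

```python
import numpy as np
import pandas as pd

# Synthetic features: sqft_living is built as above + basement,
# so it is strongly correlated with sqft_above by construction.
rng = np.random.default_rng(0)
n = 500
sqft_above = rng.normal(1800, 400, n)
sqft_basement = rng.normal(300, 100, n)
df = pd.DataFrame({
    "sqft_above": sqft_above,
    "sqft_basement": sqft_basement,
    "sqft_living": sqft_above + sqft_basement,
    "bathrooms": rng.normal(2, 0.5, n),
})

# Absolute pairwise correlations, keeping only the upper triangle
# so each pair appears once and the diagonal is excluded.
corr = df.corr().abs()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs above the (assumed) 0.75 cutoff are candidates for removal.
high = pairs[pairs > 0.75]
```

In this toy data the only flagged pair is ‘sqft_above’ and ‘sqft_living’, which mirrors why a total-area feature tends to be redundant once its components are in the model.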
Using an iterative process to create and test linear regression models, I found that including the ‘zip code’ feature significantly increased the model’s R² value. However, there are 70+ zip codes in King County and, although they are represented by numbers, they cannot be treated as a continuous feature, so they required one-hot encoding. In an effort to avoid creating so many extra features, I found a dataset containing the median incomes for King County zip codes and tried using median income in place of the zip codes. This model had a higher R² than a model without zip codes, but the gain was only about 25% of the gain from the zip codes themselves, and median income had no effect when added to a model that already included the zip codes.
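The effect of one-hot encoding the zip codes can be illustrated with a short sketch. Everything here is synthetic: the three zip codes, their price premiums and the column names are assumptions, and the R² values are computed on the training data only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic sales: price depends on square footage plus a per-zip premium.
rng = np.random.default_rng(0)
n = 600
zips = rng.choice(["98001", "98004", "98103"], size=n)
zip_premium = {"98001": 0.0, "98004": 400_000.0, "98103": 150_000.0}
sqft = rng.normal(1800, 400, n)
price = (
    200 * sqft
    + np.array([zip_premium[z] for z in zips])
    + rng.normal(0, 30_000, n)
)
df = pd.DataFrame({"sqft_above": sqft, "zipcode": zips, "price": price})

# Baseline: square footage only.
X_base = df[["sqft_above"]]

# One-hot encode the zip code; drop_first avoids a redundant column
# alongside the model's intercept.
X_zip = pd.get_dummies(
    df[["sqft_above", "zipcode"]],
    columns=["zipcode"],
    drop_first=True,
    dtype=float,
)

r2_base = LinearRegression().fit(X_base, df["price"]).score(X_base, df["price"])
r2_zip = LinearRegression().fit(X_zip, df["price"]).score(X_zip, df["price"])
```

With 70+ real zip codes the same call would add 70 or so columns, which is the cost the median income substitution was trying to avoid.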
The regression coefficients indicated that zip code, above ground square footage, condition and number of bathrooms had the largest impact on house prices. I used the zip code coefficients to create the above visualization showing how each zip code impacts home prices.
My final model was able to explain about 85% of the variance in King County house prices in 2014–15. The most important features affecting home prices that a buyer should consider are zip code/neighborhood, square footage above ground (basement size has a much smaller effect on home price), and number of bathrooms. The number of bedrooms doesn’t seem to have an effect when comparing houses of similar size. This model can thus be quite useful for home buyers in determining their priorities in house selection.
You can find the full project in this GitHub repository.