UC Berkeley | Machine Learning
Contributors: Allison Godfrey, Ian Anderson, Jacky Ma, and Surya Gutta
The House Price Prediction Kaggle competition is based on the Ames Housing Dataset. The goal of this project is to predict the sale price of homes using the given training and test data sets, which contain 90 features for each home. In this final notebook, my team and I applied machine learning approaches to predict home prices as accurately as possible from the relevant features.
The main components of the notebook are:
Machine Learning, Data Cleansing, Correlation Analysis, Feature Engineering
Python, scikit-learn, Jupyter Notebooks, Matplotlib
We measured each model’s accuracy by its Root Mean Squared Log Error (RMSLE). As the following chart shows, the RMSLE was lowest for our blended model with a particular set of model weights. That blend’s score of 0.09252 was our best competition result, yet it was not the lowest RMSLE within our notebook, which tells us our models were prone to overfitting: a lower notebook score did not translate into a better leaderboard score. The 0.09252 blend therefore represents our best balance between fitting the training data and generalizing to unseen homes.
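To make the metric and the blending step concrete, here is a minimal sketch of both. The RMSLE definition matches the competition’s evaluation metric; the three model names, their predictions, and the blend weights are illustrative placeholders, not our actual tuned values.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Log Error: compares log-prices, so errors are
    penalized relative to the home's price rather than in absolute dollars."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Toy sale prices and three models' predictions (synthetic stand-ins,
# not outputs from our actual notebook).
y_true = np.array([208500.0, 181500.0, 223500.0, 140000.0])
preds = {
    "ridge": np.array([205000.0, 185000.0, 220000.0, 138000.0]),
    "lasso": np.array([210000.0, 179000.0, 230000.0, 142000.0]),
    "gbm":   np.array([207000.0, 183000.0, 218000.0, 145000.0]),
}

# Blend the predictions with fixed weights; these weights are placeholder
# values chosen for illustration, not the set we actually tuned.
weights = {"ridge": 0.4, "lasso": 0.3, "gbm": 0.3}
blended = sum(w * preds[name] for name, w in weights.items())

for name, p in preds.items():
    print(f"{name:6s} RMSLE: {rmsle(y_true, p):.5f}")
print(f"blend  RMSLE: {rmsle(y_true, blended):.5f}")
```

Because RMSLE works on log-prices, a $10,000 miss on a $100,000 home is penalized far more than the same miss on a $500,000 home, which is why the competition uses it for a target spanning a wide price range.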
From the beginning, our focus was on knowing our data. We were therefore very intentional about how we encoded each categorical and ordinal feature, how we assigned missing values, how we aggregated some features to avoid multicollinearity, and how we iteratively performed our feature selection process, as sketched below. See the following Slide Deck to discover more about our iterative process and some further extensions of the model.
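As a minimal sketch of what this kind of intentional encoding can look like: the column names below come from the Ames data dictionary, but the quality mapping, the one-hot versus ordinal choices, and the TotalSF aggregate are illustrative assumptions, not the exact pipeline from our notebook.

```python
import pandas as pd

# Ordinal quality scale (an assumed mapping for illustration). In Ames,
# NA in columns like BsmtQual means "no basement", not missing data, so
# we encode it as 0 rather than imputing a typical value.
QUALITY_SCALE = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "NA": 0}

def encode_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Ordinal features keep their natural order as integers.
    for col in ["ExterQual", "BsmtQual", "KitchenQual"]:
        df[col] = df[col].fillna("NA").map(QUALITY_SCALE)
    # Nominal features with no natural order get one-hot encoding instead.
    df = pd.get_dummies(df, columns=["Neighborhood"], drop_first=True)
    # Aggregate related square-footage columns into one total-area feature
    # to reduce multicollinearity among them (an illustrative choice).
    df["TotalSF"] = df["TotalBsmtSF"].fillna(0) + df["1stFlrSF"] + df["2ndFlrSF"]
    return df

# Tiny usage example with two synthetic rows.
sample = pd.DataFrame({
    "ExterQual": ["Gd", "TA"],
    "BsmtQual": ["Ex", None],      # None here means "no basement"
    "KitchenQual": ["TA", "Gd"],
    "Neighborhood": ["NAmes", "CollgCr"],
    "TotalBsmtSF": [856.0, None],
    "1stFlrSF": [856, 1262],
    "2ndFlrSF": [854, 0],
})
print(encode_features(sample))
```

Mapping NA to 0 encodes “feature absent” as the bottom of the quality scale, preserving the ordinal meaning instead of dropping rows or imputing a value the home never had.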
We started at a Kaggle placement of 3,500 and worked our way up to 525 (top 12% of submissions).