Predicting Home Prices Using Linear Regression and Scikit-Learn

Supervised Learning Using Scikit-learn

Sebastian Ordas
9 min read · Oct 31, 2022

Linear Regression

Linear regression is one of the most widely used techniques in predictive analytics. In regression tasks, the target value is a continuous variable, like the price of a house. The overarching idea of regression is to examine whether a set of predictor variables does a good job of predicting the target value. Another major outcome of linear regression is the ability to explain the significance of each predictor variable in relation to the target value.

Using scikit-learn we can implement linear regression to predict housing prices and also explain the relationship between housing prices and variables such as sqft, bedrooms, and bathrooms.

The purpose of this tutorial is to demonstrate how to apply predictive analytics techniques using popular packages like scikit-learn. The reader is expected to have a basic familiarity with the following subjects:

  • Pandas and Data Manipulation
  • Statistics (optional)
  • Numpy
  • Data Visualization

But overall the goal is to show the steps needed to perform basic predictive analytic tasks using scikit-learn.

Data Set

Using a dataset sourced from Kaggle, our data will consist of home sales in King County, USA.

import pandas as pd

kc_house_data = pd.read_csv('kc_house_data.csv')

Pre-Processing

Before we continue we must do some preprocessing. Looking at the dataset's columns, we can start to get an idea of the features that are going to drive our model. Using kc_house_data.describe() you can see all of the dataset's columns along with some useful summary statistics.
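For example, a quick first look might be (a minimal sketch; the column names come from the Kaggle King County dataset):

kc_house_data.info()        # column names, dtypes, and non-null counts
kc_house_data.describe()    # count, mean, std, min, and max for every numeric column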

Some basic steps you might want to consider when preprocessing your data are:

  • Checking Null Values — kc_house_data.isnull().sum()
  • Imputing Missing Values — from sklearn.impute import SimpleImputer (see the sketch after this list)
  • Binning — pd.qcut(range(5), 3, labels=["1-10", "10-20", "20+"])
  • Normalization — normalized_df = (df - df.mean()) / df.std()
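For example, the imputing step might look like this (a minimal sketch, shown for illustration; the standard Kaggle copy of this dataset has no missing values to impute):

from sklearn.impute import SimpleImputer

# Fill any missing values in a couple of numeric columns with the column median
imputer = SimpleImputer(strategy='median')
kc_house_data[['bedrooms', 'bathrooms']] = imputer.fit_transform(kc_house_data[['bedrooms', 'bathrooms']])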

We also need to limit our data to homes that have actually sold, so we drop every row where price is 0, and likewise every row where sqft_living is 0.

kc_house_data = kc_house_data[kc_house_data['sqft_living'] > 0]
kc_house_data = kc_house_data[kc_house_data['price'] > 0]

Additionally, you would want to limit your data to include a relevant time period. For example, if you were trying to predict housing prices in 2019, you would want to limit your data to only include homes that have sold in 2019. Thankfully, we don’t have to do that in this tutorial since our data set only includes homes that have sold between May 2014 and May 2015.
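If you did need to restrict the data to a time window, a sketch might look like this (this assumes the date column is still present and stored in the dataset's raw string format, e.g. '20141013T000000'):

# Convert the raw date string to a datetime and keep only 2014 sales
kc_house_data['date'] = pd.to_datetime(kc_house_data['date'], format='%Y%m%dT%H%M%S')
kc_house_data = kc_house_data[kc_house_data['date'].dt.year == 2014]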

Feature Selection

Feature selection is the process of selecting a subset of relevant features for use in model construction. Feature selection techniques are used for the following reasons:

  • Reduces Overfitting: Removing irrelevant features reduces the complexity of the model and makes it easier to generalize.
  • Improves Accuracy: The accuracy of the model improves as irrelevant features are removed and features that are more useful to the model are retained.
  • Reduces Training Time: By removing irrelevant features, the training time is reduced.

There are many different feature selection techniques. In this tutorial, we will be using the following techniques:

  • Domain Knowledge (assumptions)
  • Correlation Matrix with Heatmap
  • Pair plots
  • Feature Engineering

Domain Knowledge

Since we are going to generalize our model to predict housing prices, we will drop the id column. We will also drop the date column, since we are not going to use it in our model (ideally you would keep it to predict future housing prices, but we won't do that in this tutorial), along with the lat and long columns, which you could use in the future to predict housing prices in a specific area.

kc_house_data = kc_house_data.drop(['id', 'date', 'lat', 'long'], axis=1)

Correlation Matrix

First, we will use a correlation matrix to see which features are most correlated with price. Removing features that are not correlated with price helps reduce the complexity of our model and improve its accuracy. Removing features that are highly correlated with each other also reduces multicollinearity, which can cause our model to overfit.

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(kc_house_data.corr(), annot=True, fmt=".2f")
plt.show()
Correlation Matrix for Housing Price Dataset

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, find patterns, and determine relationships between variables.

In this case, we are looking for features that are highly correlated with price. We can see that sqft_living, grade, sqft_above, and sqft_living15 are highly correlated with price. Typically, you would want to eliminate features that are highly correlated with each other. Usually, the threshold for correlation is 0.7 or higher but you can adjust this threshold based on your needs and the size of your dataset. In this case, we will be eliminating sqft_above and sqft_living15 since they are highly correlated with sqft_living. We will also be eliminating yr_built since it is highly correlated with yr_renovated and sqft_lot15 since it is highly correlated with sqft_lot.
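If you prefer to read these numbers directly instead of off the heatmap, a small sketch like the following ranks the correlations with price and flags feature pairs above the 0.7 rule of thumb mentioned above:

corr = kc_house_data.corr()

# Features ranked by how strongly they correlate with price
print(corr['price'].sort_values(ascending=False))

# Pairs of features (price excluded) whose mutual correlation exceeds 0.7
features = corr.drop('price', axis=0).drop('price', axis=1)
for i, a in enumerate(features.columns):
    for b in features.columns[i + 1:]:
        if abs(features.loc[a, b]) > 0.7:
            print(a, b, round(features.loc[a, b], 2))

With that confirmed, we drop the redundant columns: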

kc_house_data = kc_house_data.drop(['sqft_above', 'sqft_living15', 'yr_built', 'sqft_lot15'], axis=1)

Pair Plot

A pair plot is a great way to visualize the relationship between multiple variables. Here we can see the relationship between price and sqft_living. We can see that there is a positive correlation between price and sqft_living.

sns.pairplot(kc_house_data)
plt.show()
KC_Housing_Data Pair Plot

While it might seem a bit redundant to use a pair plot and a correlation matrix, they are both useful in different ways. A pair plot is useful for visualizing the relationship between multiple variables. A correlation matrix is useful for determining which variables are highly correlated with each other. The pair plot is also useful for analyzing the distribution of each variable in relation to the target variable.
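On a dataset with this many columns the full pair plot gets crowded, so you may prefer to plot just a handful of features against price (a sketch; the feature list here is simply one reasonable choice):

sns.pairplot(kc_house_data, x_vars=['sqft_living', 'bedrooms', 'bathrooms', 'grade'], y_vars=['price'])
plt.show()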

Train/Test Split

Before we can begin training our model, we must split our data into a training and testing set.

Using our kc_house_data, we can split our data into a training and testing set using the train_test_split function from sklearn.model_selection.

First, we separate our data into X and y variables. X will hold our features and y our target variable. Our features are all the columns that survived feature selection (sqft_living, bedrooms, bathrooms, and the rest), and price is our target. We will set test_size to 0.2, which means 20% of our data will be used for testing and 80% for training, and random_state to 42 to ensure that our results are reproducible.

X = kc_house_data.drop('price', axis=1)
y = kc_house_data['price']

Then we split our data into our training and testing sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Training

Now that we have our data split, we can begin training our model.

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

Predictions

Now that we have trained our model, we can use it to make predictions.

We will pass our X_test data to our model and store the predictions in the y_pred variable. X_test contains only the feature values for the held-out homes; the actual prices live in y_test. Joining the two for the first test home (and keeping just a few columns for readability) looks like this:

pd.concat([y_test, X_test], axis=1)[['price', 'sqft_living', 'bedrooms', 'bathrooms']].head(1)

+----------+-------------+----------+-----------+
| price    | sqft_living | bedrooms | bathrooms |
+----------+-------------+----------+-----------+
| 297000.0 |        1430 |        3 |      2.50 |
+----------+-------------+----------+-----------+

We will use our model to predict the price of the homes and compare the predicted price to the actual price.

y_pred = regressor.predict(X_test)
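One simple way to compare them side by side (a sketch) is to put the actual and predicted prices into a single DataFrame:

# Actual prices from y_test next to the model's predictions
comparison = pd.DataFrame({'actual': y_test, 'predicted': y_pred})
print(comparison.head())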

Model Accuracy and Scoring

We can use mean absolute error (MAE) to evaluate our model's performance. MAE is the mean of the absolute value of the errors (i.e. the differences between the actual and predicted values). We use this metric because it is easy to understand and interpret: the lower the MAE, the better our model is at making predictions, and it is 0 only when every predicted value equals the actual value. The value of MAE is relative to your expected output: an MAE of 15,000 means that our model's predictions are off by about 15,000 dollars on average.
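Under the hood, MAE is just the average absolute difference between actual and predicted values, which you can verify by hand with NumPy (a sketch):

import numpy as np

mae_by_hand = np.mean(np.abs(y_test - y_pred))
print('Mean Absolute Error (by hand):', mae_by_hand)

scikit-learn provides the same calculation through sklearn.metrics: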

from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))

We can also use the r2_score function from sklearn.metrics to calculate the r2 score. The r2 score is a statistical measure of how close the data is to the fitted regression line. It typically ranges from 0 to 1 (and can even be negative when the model does worse than simply predicting the mean price). The higher the r2 score, the better our model is at predicting the price of the homes.
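The r2 score can also be computed by hand from its definition, one minus the ratio of the residual sum of squares to the total sum of squares (a sketch, reusing the NumPy import from the MAE sketch above):

ss_res = np.sum((y_test - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares
print('R2 Score (by hand):', 1 - ss_res / ss_tot)

The r2_score function returns the same number: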

print('R2 Score:', metrics.r2_score(y_test, y_pred))

Mean Absolute Error: 151840.42751141978
R2 Score: 0.6223846393612906

As you can see, our model does not perform very well. While there are many ways to improve it, we won't be covering them in this tutorial. The purpose of this tutorial is to give you a basic understanding of how to use linear regression to predict a continuous variable.

Model Explanation

Beyond making predictions, a trained linear regression model can also explain the relationship between our features and our target variable through its coefficients.

print(pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient']))

A coefficient measures how much the target variable is expected to change when a feature increases by one unit, with the other features held constant. A positive coefficient means that as the feature increases, the predicted target also increases. A negative coefficient means that as the feature increases, the predicted target decreases. A coefficient of 0 means the feature has no linear relationship with the target.

We use the coefficients to explain the relationship between our features and our target variable. For example, increasing the number of bedrooms by 1 may increase the predicted price by roughly $136,000; one more bathroom may add roughly $150,000; and each additional square foot of living space may add about $432.
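To see how the coefficients turn into a prediction, you can rebuild the model's output by hand: the prediction is simply the intercept plus each coefficient multiplied by its feature value (a sketch; your coefficient values will differ from the rounded figures quoted above):

import numpy as np

# Rebuild the model's prediction for the first test home from its coefficients
first_home = X_test.iloc[0]
manual_prediction = regressor.intercept_ + np.dot(regressor.coef_, first_home)
print(manual_prediction)          # matches regressor.predict(X_test)[0]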

At first glance the bedroom and bathroom coefficients appear to matter far more than square footage, but this is largely a matter of units: bedrooms and bathrooms are discrete counts, so a one-unit increase is a big change, while square footage is continuous, so a one-unit increase is tiny.

However, the coefficients can behave in counterintuitive ways. For example, we know that adding a bedroom usually increases a home's value, yet because bedrooms is strongly correlated with square footage, the model can attribute that effect to sqft_living and assign bedrooms a misleading coefficient. This is why it's important to monitor your model and make sure it is behaving as expected. You can reduce these problems by removing outliers from your data, or by using a different model such as a decision tree or random forest (see the sketch below). There are many other ways to improve the model, and we will cover some of them in future tutorials.
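For example, swapping in a random forest is only a few lines with scikit-learn (a sketch using default-style hyperparameters, not a tuned model; it reuses the metrics import from earlier):

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print('Random Forest MAE:', metrics.mean_absolute_error(y_test, rf_pred))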

Aside from coefficients, we can also explain our model using charts. A scatter plot of actual versus predicted prices shows how closely the predictions track reality, and a residual plot shows the errors the model makes on individual homes.

plt.scatter(y_test, y_pred)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Actual Price vs Predicted Price')
plt.show()
Actual Price vs Predicted Price
plt.scatter(y_pred, y_test - y_pred)
plt.xlabel('Predicted Price')
plt.ylabel('Residuals')
plt.title('Predicted Price vs Residuals')
plt.show()
Predicted Price vs Residuals

Residuals are calculated by subtracting the predicted value from the actual value; they are also known as the errors. Ideally the residuals are roughly normally distributed and centered on zero; if they are not, our model is probably not a good fit for the data.
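A quick way to eyeball that distribution is a histogram of the residuals (a sketch):

residuals = y_test - y_pred
plt.hist(residuals, bins=50)
plt.xlabel('Residual (actual price - predicted price)')
plt.title('Distribution of Residuals')
plt.show()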

In conclusion, linear regression can be used to model a wide range of continuous outcomes: the price of a home, the price of a stock, the number of sales, the number of customers, and much more. It is a powerful tool both for making predictions and for explaining the relationship between features and a target variable.

