Introduction
Deep learning is currently one of the most exciting and promising areas of artificial intelligence (AI) and machine learning. Thanks to great advances in hardware and algorithms in recent years, deep learning has opened the door to a new era of AI applications.
In many of these applications, deep learning algorithms have performed on par with human experts and have sometimes surpassed them.
Python has become the go-to language for machine learning, and many of the most popular and powerful deep learning libraries and frameworks, such as TensorFlow, Keras, and PyTorch, are built around Python.
In this series, we'll perform Exploratory Data Analysis (EDA) on a dataset, preprocess the data, and finally build a deep learning model in Keras and evaluate it.
Why Keras?
Keras is a deep learning API built on top of TensorFlow. TensorFlow is an end-to-end machine learning platform that allows developers to create and deploy machine learning models. TensorFlow was developed at Google for internal use and was released under an open-source license in 2015.
Keras provides a high-level API for TensorFlow. It makes it really easy to build different types of machine learning models while taking advantage of TensorFlow's infrastructure and scalability.
It allows you to define, compile, train, and evaluate deep learning models using simple and concise syntax as we will see later in this series.
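As a quick taste of that syntax, here is a minimal sketch of what defining and compiling a small Keras model can look like (the layer sizes, input shape, and optimizer here are just placeholders, not the model we will build later in the series):
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# A tiny fully-connected regression model, purely for illustration
model = keras.Sequential([
    keras.Input(shape=(10,)),             # 10 input features (placeholder)
    layers.Dense(64, activation='relu'),  # one hidden layer
    layers.Dense(1)                       # a single numeric output
])

# Compiling attaches an optimizer and a loss function to the model
model.compile(optimizer='adam', loss='mse')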
Keras is very powerful; it is the most used machine learning tool among the top-ranking teams in Kaggle competitions.
House Price Prediction with Deep Learning
We will build a deep learning regression model to predict a house's price based on its characteristics, such as the age of the house, the number of floors, the size of the house, and many other features.
In the first article of the series, we'll be importing the packages and data and doing some Exploratory Data Analysis (EDA) to get familiar with the dataset we're working with.
Importing the Required Packages
In this preliminary step, we import the packages needed in the next steps:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
We're importing tensorflow, which includes Keras and some other useful tools. For code brevity, we're importing keras and layers separately, so instead of tf.keras.layers.Dense we can write layers.Dense.
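That is, after these imports the following two expressions construct the same kind of layer:
# Both names point to the same Dense layer class
layer_a = tf.keras.layers.Dense(32)   # fully qualified name
layer_b = layers.Dense(32)            # shorter alias thanks to the import above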
We're also importing pandas and numpy, which are extremely useful and widely used libraries for storing, handling, and manipulating data.
And for visualizing and exploring the data, we import matplotlib's pyplot module as plt, along with seaborn. Matplotlib is a fundamental library for visualization, while Seaborn makes it much simpler to work with.
Loading the Data
For this tutorial, we will work with a dataset that records sales of residential units between 2006 and 2010 in the city of Ames, Iowa, United States.
For each sale, the dataset describes many characteristics of the residential unit and lists the sale price of that unit. This sale price will be the target variable that we want to predict using the different characteristics of the unit.
The dataset actually contains many characteristics for each unit, including the unit area, the year in which the unit was built, the size of the garage, the number of kitchens, the number of bathrooms, the number of bedrooms, the roof style, the type of electrical system, the building class, and many others.
You can read more about the dataset on this page on Kaggle.
To download the exact dataset file that we will be using in this tutorial, visit its Kaggle page and click on the download button. This will download a CSV file containing the data.
We'll rename this file to AmesHousing.csv and load it inside our program using Pandas' read_csv() function:
df = pd.read_csv('AmesHousing.csv')
The loaded dataset contains 2,930 rows (entries) and 82 columns (characteristics). Here's a truncated view of only a few rows and columns:
  | Order | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | ...
0 | 1 | 526301100 | 20 | RL | 141 | 31770 | Pave | ...
1 | 2 | 526350040 | 20 | RH | 80 | 11622 | Pave | ...
2 | 3 | 526351010 | 20 | RL | 81 | 14267 | Pave | ...
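If you'd like to reproduce a quick overview like this yourself, something along these lines should do (the exact columns you print are up to you):
# Dimensions of the DataFrame: (rows, columns)
print(df.shape)   # (2930, 82)

# First three rows, limited to a handful of columns
print(df.head(3)[['Order', 'PID', 'MS SubClass', 'MS Zoning',
                  'Lot Frontage', 'Lot Area', 'Street']])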
As we said earlier, each row describes a residential unit sale by specifying many characteristics of the unit and its sale price. And, again, to get more information about the meaning of each variable in this dataset, please visit this page on Kaggle.
Before we proceed, we will remove some features (columns) from the dataset because they don't provide any useful information to the model. These features are Order and PID:
df.drop(['Order', 'PID'], axis=1, inplace=True)
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) helps us understand the data better and spot patterns in it. The most important variable to explore in the data is the target variable: SalePrice.
A machine learning model is only as good as its training data - if you want to understand your model, you need to understand the data it was trained on. The first step in building any model should be thorough data exploration.
Since the end goal is predicting house values, we'll focus on the SalePrice variable and the variables that have a high correlation with it.
Sale Price Distribution
First, let's take a look at the distribution of SalePrice. Histograms are a great and simple way to look at the distribution of a variable. Let's use Seaborn (on a Matplotlib figure) to plot a histogram that displays the distribution of SalePrice:
fig, ax = plt.subplots(figsize=(14,8))
# histplot replaces the deprecated (and now removed) distplot in recent Seaborn versions
sns.histplot(df['SalePrice'], ax=ax)
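If you want to polish the figure a little, adding a title and axis labels is straightforward; the labels below are only an example, not necessarily the formatting used for the image that follows:
fig, ax = plt.subplots(figsize=(14,8))
sns.histplot(df['SalePrice'], ax=ax)
ax.set_title('Distribution of SalePrice')   # example title
ax.set_xlabel('Sale Price ($)')             # example axis labels
ax.set_ylabel('Number of units')
plt.show()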
The image below shows the resulting histogram after applying some formatting to enhance the appearance:
We can also look at the SalePrice distribution using different types of plots. For example, let's make a swarm plot of SalePrice:
fig, ax = plt.subplots(figsize=(14,8))
sns.swarmplot(x=df['SalePrice'], color='#2f4b7c', alpha=0.8, ax=ax)
This would result in:
By looking at the histogram and swarm plot above, we can see that for most units, the sale price ranges from $100,000 to $200,000. If we generate a description of the SalePrice variable using Pandas' describe() function:
print(df['SalePrice'].describe().apply(lambda x: '{:,.1f}'.format(x)))
We'll receive:
count 2,930.0
mean 180,796.1
std 79,886.7
min 12,789.0
25% 129,500.0
50% 160,000.0
75% 213,500.0
max 755,000.0
Name: SalePrice, dtype: object
From here, we know that:
- The average sale price is $180,796
- The minimum sale price is $12,789
- The maximum sale price is $755,000
Correlation with Sale Price
Now, let's see how the predictor variables in our data correlate with the target SalePrice. We will calculate these correlation values using Pearson's method and then visualize the correlations using a heatmap:
fig, ax = plt.subplots(figsize=(10,14))
# numeric_only=True restricts the correlation matrix to numeric columns
# (required by recent versions of pandas when text columns are present)
saleprice_corr = df.corr(numeric_only=True)[['SalePrice']].sort_values(
    by='SalePrice', ascending=False)
sns.heatmap(saleprice_corr, annot=True, ax=ax)
And here is the heatmap that shows how the predictor variables are correlated with SalePrice.
Lighter colors in the map indicate stronger positive correlations, while darker colors indicate weaker positive correlations and, in some cases, negative correlations:
Obviously, the SalePrice variable has a perfect 1.0 correlation with itself. However, there are some other variables that are highly correlated with SalePrice, and we can draw some conclusions from them.
For example, we can see that SalePrice is highly correlated with the Overall Qual variable, which describes the overall quality of material and finish of the house. We can also see a high correlation with Gr Liv Area, which specifies the above-ground living area of the unit.
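If you prefer exact numbers over colors, you can also print the top of the sorted correlation values directly; a quick sketch:
# The ten variables most positively correlated with SalePrice
# (the first entry is SalePrice itself, with a correlation of 1.0)
print(saleprice_corr.head(10))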
Examining the Different Correlation Degrees
Now that we have some variables that are highly correlated with SalePrice in mind, let's examine the correlations more deeply. Some variables are highly correlated with SalePrice, and some aren't. By checking these out, we can draw conclusions on what's prioritized when people are buying properties.
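One way to organize this check is to bucket the predictors by how strongly they correlate with SalePrice; a rough sketch, with admittedly arbitrary thresholds, could look like this:
# Correlation of every numeric predictor with SalePrice, excluding SalePrice itself
corr = saleprice_corr['SalePrice'].drop('SalePrice')

high     = corr[corr >= 0.6]                   # strong positive correlation
moderate = corr[(corr >= 0.3) & (corr < 0.6)]  # moderate positive correlation
low      = corr[corr.abs() < 0.3]              # little to no correlation

print('High:', list(high.index))
print('Moderate:', list(moderate.index))
print('Low:', list(low.index))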
High Correlation
First, let's look at two variables that have a high positive correlation with SalePrice - namely Overall Qual, which has a correlation value of 0.8, and Gr Liv Area, which has a correlation value of 0.71.
Overall Qual represents the overall quality of material and finish of the house. Let's explore its relationship with SalePrice further by plotting a scatter plot using Matplotlib:
fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Overall Qual'], y=df['SalePrice'], color="#388e3c",
edgecolors="#000000", linewidths=0.1, alpha=0.7);
plt.show()
Here is the resulting scatter plot:
We can clearly see that as the overall quality increases, the house sale price tends to increase as well. The increase isn't quite linear, but if we drew a trendline, it would be relatively close to linear.
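To sanity-check that impression, you could overlay a simple linear trendline on the same scatter plot; here is one possible sketch using NumPy's polyfit:
fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Overall Qual'], y=df['SalePrice'], color="#388e3c",
           edgecolors="#000000", linewidths=0.1, alpha=0.7)

# Fit a first-degree (linear) polynomial and draw it over the points
sub = df[['Overall Qual', 'SalePrice']].dropna()
slope, intercept = np.polyfit(sub['Overall Qual'], sub['SalePrice'], deg=1)
xs = np.linspace(sub['Overall Qual'].min(), sub['Overall Qual'].max(), 100)
ax.plot(xs, slope * xs + intercept, color='black')
plt.show()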
Now, let's see how Gr Liv Area and SalePrice relate to each other with another scatter plot:
fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Gr Liv Area'], y=df['SalePrice'], color="#388e3c",
edgecolors="#000000", linewidths=0.1, alpha=0.7);
plt.show()
Here is the resulting scatter plot:
Again, we can clearly see the high positive correlation between Gr Liv Area and SalePrice in this scatter plot. They tend to increase with each other, with a few outliers.
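Those outliers are worth a closer look. Here is a quick way to list the most extreme ones (the 4,000 square feet cut-off below is just an illustrative threshold, not something prescribed by the dataset):
# Units with an unusually large above-ground living area
outliers = df[df['Gr Liv Area'] > 4000][['Gr Liv Area', 'SalePrice']]
print(outliers.sort_values(by='Gr Liv Area', ascending=False))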
Moderate Correlation
Next, let's look at variables that have a moderate positive correlation with SalePrice. We will look at Lot Frontage, which has a correlation value of 0.36, and Full Bath, which has a correlation value of 0.55.
Lot Frontage represents the linear feet of street connected to the property - roughly, the length of the lot along the street in front of the house. And Full Bath represents the number of full bathrooms above ground.
Similar to what we have done with Overall Qual and Gr Liv Area, we will plot two scatter plots to visualize the relationships between these variables and SalePrice. Let's start with Lot Frontage:
fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Lot Frontage'], y=df['SalePrice'], color="orange",
edgecolors="#000000", linewidths=0.5, alpha=0.5);
plt.show()
Here, you can see a much weaker correlation. Even with larger lots in front of the properties, the price doesn't go up by much. There is a positive correlation between the two, but it doesn't seem to be as important to buyers as some other variables.
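One thing to keep in mind with this plot: Lot Frontage has missing values, and rows where it is missing simply don't show up in the scatter plot. You can check how many values are missing like this:
# Number of missing Lot Frontage values
print(df['Lot Frontage'].isna().sum())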
Then, let's show the scatter plot for Full Bath:
fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Full Bath'], y=df['SalePrice'], color="orange",
edgecolors="#000000", linewidths=0.5, alpha=0.5);
plt.show()
Here, you can also see a positive correlation, which isn't that weak, but also isn't too strong. A good portion of houses with two full bathrooms sell in the same price range as houses with only one full bathroom. The number of bathrooms does influence the price, but not by much.
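To put some rough numbers behind that, you could compare sale prices per number of full bathrooms; a quick sketch:
# Number of sales and median sale price for each count of full bathrooms
print(df.groupby('Full Bath')['SalePrice'].agg(['count', 'median']))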
Low Correlation
Finally, let's look at variables that have little to no correlation with SalePrice and compare them with what we saw so far. We will look at Yr Sold, which has a correlation value of -0.031, and Bsmt Unf SF, which has a correlation value of 0.18.
Yr Sold represents the year in which the house was sold. And Bsmt Unf SF represents the unfinished basement area in square feet.
Let's start with Yr Sold:
from matplotlib import ticker  # needed for FuncFormatter below

fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Yr Sold'], y=df['SalePrice'], color="#b71c1c",
           edgecolors="#000000", linewidths=0.1, alpha=0.5)
# Show the years on the x-axis as whole numbers instead of floats
ax.xaxis.set_major_formatter(
    ticker.FuncFormatter(func=lambda x, y: int(x)))
plt.show()
The correlation here is so weak that it's fairly safe to say there's basically no relationship between these two variables - property prices don't seem to have changed much between 2006 and 2010.
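You can verify that with a quick aggregation per year of sale; a minimal sketch:
# Median sale price for each year of sale (2006-2010)
print(df.groupby('Yr Sold')['SalePrice'].median())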
Let's also make a plot for Bsmt Unf SF:
fig, ax = plt.subplots(figsize=(14,8))
ax.scatter(x=df['Bsmt Unf SF'], y=df['SalePrice'], color="#b71c1c",
edgecolors="#000000", linewidths=0.1, alpha=0.5);
plt.show()
Here, we can see some properties with a lower Bsmt Unf SF being sold for more than ones with a high value. Then again, this could be due to pure chance, and there isn't an apparent correlation between the two.
It's safe to assume that Bsmt Unf SF doesn't have much to do with SalePrice.
Conclusion
In this article, we've taken the first steps of most machine learning projects. We started off by downloading and loading a dataset that we're interested in.
Then, we performed Exploratory Data Analysis on the data to get a good understanding of what we're dealing with. A machine learning model is only as good as its training data - to understand your model, you need to understand the data it was trained on.
Finally, we've chosen a few variables and checked for their correlation with the main variable we're eyeing - the SalePrice variable.