Decision Tree Regression With Hyper Parameter Tuning

In this post, we will go through Decision Tree model building. We will use air quality data. Here is the link to data.

import pandas as pd
import numpy as np

# Reading our csv data
combine_data = pd.read_csv('data/Real_combine.csv')
combine_data.head(5)

T == Average Temperature (°C)

TM == Maximum temperature (°C)

Tm == Minimum temperature (°C)

SLP == Atmospheric pressure at sea level (hPa)

H == Average relative humidity (%)

VV == Average visibility (Km)

V == Average wind speed (Km/h)

VM == Maximum sustained wind speed (Km/h)

PM 2.5 == Fine particulate matter (PM2.5), an air pollutant that is a concern for people's health when levels in the air are high

Data Cleaning

Let us first drop the unwanted column.

combine_data.drop(['Unnamed: 0'],axis=1,inplace=True)

Data Analysis

combine_data.head(2)

# combine data top 5 rows
combine_data.head()

# combine data bottom 5 rows
combine_data.tail()

Let us print the summary statistics using the describe() function.

# To get statistical data
combine_data.describe()

Let us check if there are any null values in our data.

combine_data.isnull().sum()

T         0
TM        0
Tm        0
SLP       0
H         0
VV        0
V         0
VM        0
PM 2.5    0
dtype: int64

We can also visualize null values with a seaborn heatmap. From the heatmap, it is clear that there are no null values.

import seaborn as sns
sns.heatmap(combine_data.isnull(), yticklabels=False)

<AxesSubplot:>

Let us check outliers in our data using seaborn boxplot.

# To check outliers
import matplotlib.pyplot as plt
a4_dims = (11.7, 8.27)
fig, ax = plt.subplots(figsize=a4_dims)
g = sns.boxplot(data=combine_data, linewidth=2.5, ax=ax)
g.set_yscale("log")

From the plot, we can see that there are a few outliers present in columns Tm, VV, V, VM and PM 2.5.
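The whiskers of a seaborn boxplot extend 1.5 times the interquartile range (IQR) beyond the quartiles by default, so the same rule can be used to count outliers per column. The following is only a minimal sketch: the helper iqr_outlier_counts and the toy data frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column limits with the data frame columns
    mask = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return mask.sum()

# Hypothetical toy data; 77.8 lies far away from the other VM values
toy = pd.DataFrame({
    "V": [4.4, 5.2, 5.7, 6.9, 8.3, 6.3, 4.6, 6.7],
    "VM": [13.0, 14.8, 11.1, 11.1, 11.1, 14.8, 13.0, 77.8],
})
outlier_counts = iqr_outlier_counts(toy)
print(outlier_counts)
```

On the real data, the same helper could be called as iqr_outlier_counts(combine_data).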

We can also do a seaborn pairplot for multivariate analysis. A pairplot shows the relationship between every pair of variables. Since the plot is very large, I am skipping its output, but the command to draw it is shown below.

sns.pairplot(combine_data)

We can also check the correlation between the dependent and independent features using the dataframe.corr() function. The correlation can be computed using 'pearson', 'kendall', or 'spearman'. By default, corr() uses 'pearson'.
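To see how the choice of method can matter, here is a small sketch on made-up data (the demo frame is an assumption, not the air quality data). The relationship is monotonic but not linear, so the rank-based 'spearman' value is higher than the 'pearson' value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y grows monotonically, but not linearly, with x
x = np.arange(1, 21)
demo = pd.DataFrame({"x": x, "y": np.exp(x / 3.0)})

pearson_r = demo.corr(method="pearson").loc["x", "y"]
spearman_r = demo.corr(method="spearman").loc["x", "y"]
print(f"pearson={pearson_r:.3f}  spearman={spearman_r:.3f}")
```

For our data, the equivalent call would be combine_data.corr(method='spearman').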

combine_data.corr()

If we look at the correlation table, SLP is the only feature that has a positive correlation with 'PM 2.5'. Correlation tells us how the other features behave as 'PM 2.5' increases: a positive correlation means the two variables tend to increase together, and a negative correlation means that as one variable increases, the other tends to decrease.
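Reading a single row out of a large correlation table is error prone, so it can help to pull out the 'PM 2.5' row and sort it. The sketch below uses the values copied from the corr() output of this post; on the real data frame the same row could be obtained with combine_data.corr()['PM 2.5'].drop('PM 2.5').

```python
import pandas as pd

# 'PM 2.5' row of the correlation table (values copied from the corr() output)
pm25_corr = pd.Series({
    "T": -0.441826, "TM": -0.316378, "Tm": -0.591487, "SLP": 0.585046,
    "H": -0.153904, "VV": -0.147582, "V": -0.378281, "VM": -0.319558,
})

# Features with a positive correlation with the target
positive = pm25_corr[pm25_corr > 0].index.tolist()

# Features ranked by the strength of the correlation, ignoring the sign
ranked = pm25_corr.reindex(pm25_corr.abs().sort_values(ascending=False).index)

print("positively correlated:", positive)
print(ranked)
```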

We can also visualize the correlation using a seaborn heatmap.

relation = combine_data.corr()
relation_index = relation.index

relation_index

Index(['T', 'TM', 'Tm', 'SLP', 'H', 'VV', 'V', 'VM', 'PM 2.5'], dtype='object')

sns.heatmap(combine_data[relation_index].corr(),annot=True)

<AxesSubplot:>

Up to now, we have only done exploratory analysis of the data. In the next section, we will do feature selection.

Feature Selection

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse

Splitting the data into train and test data sets.

X_train,X_test,y_train,y_test=train_test_split(combine_data.iloc[:,:-1],combine_data.iloc[:,-1],test_size=0.3,random_state=0)

# size of train data set
X_train.shape

(450, 8)

# size of test data set
X_test.shape

(193, 8)
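The two shapes follow directly from test_size=0.3 applied to 643 rows: the test set gets 193 rows (192.9 rounded up) and the remaining 450 rows go to the training set. A quick sketch with a random stand-in frame of the same shape (the demo frame is hypothetical) shows the same split sizes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same shape as combine_data (643 rows, 9 columns)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(643, 9)))

# Last column is the target, all other columns are features
X_tr, X_te, y_tr, y_te = train_test_split(
    demo.iloc[:, :-1], demo.iloc[:, -1], test_size=0.3, random_state=0
)
print(X_tr.shape, X_te.shape)
```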

Feature selection using ExtraTreesRegressor (model based). ExtraTreesRegressor helps us find the features which are most important.

# Feature selection by ExtraTreesRegressor (model based)
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score as acc

X_train,X_test,y_train,y_test=train_test_split(combine_data.iloc[:,:-1],combine_data.iloc[:,-1],test_size=0.3,random_state=0)

reg=ExtraTreesRegressor()

reg.fit(X_train,y_train)

ExtraTreesRegressor()

Let us print the feature importances.

reg.feature_importances_

array([0.17525632, 0.09237557, 0.21175783, 0.22835392, 0.0863817 ,
       0.05711284, 0.07977977, 0.06898204])

feat_importances = pd.Series(reg.feature_importances_, index=X_train.columns)
feat_importances.nlargest(5).plot(kind='barh')
plt.show()

Based on the plot above, we can select the features that are most important for our prediction model.
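As a sketch of how that selection could be done in code, the snippet below rebuilds the importance series from the array printed above and keeps the names of the five largest entries. Note that ExtraTreesRegressor is randomized and no random_state was set, so the exact values will differ from run to run.

```python
import pandas as pd

# Column order of X_train and the importances printed by reg.feature_importances_
columns = ["T", "TM", "Tm", "SLP", "H", "VV", "V", "VM"]
importances = [0.17525632, 0.09237557, 0.21175783, 0.22835392,
               0.0863817, 0.05711284, 0.07977977, 0.06898204]

feat_importances = pd.Series(importances, index=columns)

# Names of the five most important features, most important first
top_features = feat_importances.nlargest(5).index.tolist()
print(top_features)
```

X_train[top_features] and X_test[top_features] would then give reduced feature sets. The rest of this post keeps all the features.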

Decision trees split on thresholds of one feature at a time, so they are not sensitive to the scale of the features. We can therefore train the model without feature normalization.
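As a quick check of how feature scale affects a tree, the sketch below fits the same tree twice on synthetic data, once on the raw features and once after multiplying every feature by 1000. Since a tree only compares each feature against a threshold, the predictions are expected to stay the same. The data and all variable names here are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data, used only to illustrate the point
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 50, size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.1, size=200)
X_new = rng.uniform(0, 50, size=(50, 3))

scale = 1000.0  # e.g. changing the unit of every feature

# Same model settings, fitted on raw and on rescaled features
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo * scale, y_demo)

same_predictions = np.allclose(
    tree_raw.predict(X_new), tree_scaled.predict(X_new * scale)
)
print("identical predictions after rescaling:", same_predictions)
```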

Decision Tree Model Training

# Training model with all features
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(combine_data.iloc[:, :-1], combine_data.iloc[:, -1], test_size=0.3, random_state=0)

X_train

X_test

fromsklearn.treeimportDecisionTreeRegressor

Let us create a decision tree regression model.

reg_decision_model=DecisionTreeRegressor()

# fit independent variables to the dependent variable
reg_decision_model.fit(X_train, y_train)

DecisionTreeRegressor()

reg_decision_model.score(X_train,y_train)

1.0

reg_decision_model.score(X_test,y_test)

0.05768194549539718

We got a score (R²) of 1.0 on the training data.

On the test data we got a score of only about 0.058. Because we did not provide any tuning parameters while initializing the tree, the algorithm kept splitting the training data until every leaf was pure. As a result, the tree grew very deep and the model overfit the training data.

That is why we get a high score on the training data and a low score on the test data.
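The effect of unrestricted depth can be reproduced on synthetic data. The sketch below (hypothetical noisy data, not the air quality set) fits one tree without limits and one with max_depth=4, then prints depth and R² on both splits. The unrestricted tree memorizes the training rows, while the shallow tree typically shows a much smaller gap between the training and test scores.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy data standing in for the air quality features
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(400, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# One tree grown until the leaves are pure, one limited to four levels
full_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_tr, y_tr)

for name, model in [("unrestricted", full_tree), ("max_depth=4", shallow_tree)]:
    print(name,
          "depth:", model.get_depth(),
          "train R2:", round(model.score(X_tr, y_tr), 3),
          "test R2:", round(model.score(X_te, y_te), 3))
```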

To solve this problem, we will use hyperparameter tuning.

We can use GridSearchCV or RandomizedSearchCV for hyperparameter tuning.
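This post uses the grid search. For completeness, here is a minimal sketch of the randomized alternative, which samples a fixed number of parameter combinations instead of trying all of them. The data, the reduced parameter lists and n_iter=10 are assumptions chosen to keep the example fast. The search is fitted on the training split only, so the test rows stay unseen.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# A small, illustrative search space
param_distributions = {
    "max_depth": [1, 3, 5, 7, 9, 11],
    "min_samples_leaf": [1, 2, 4, 8],
    "max_leaf_nodes": [None, 10, 20, 40],
}

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,                        # number of sampled combinations
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=0,
)
search.fit(X_tr, y_tr)
print(search.best_params_, search.best_score_)
```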

Decision Tree Model Evaluation

prediction=reg_decision_model.predict(X_test)

Let us plot the distribution of the differences between the actual y values and the predicted y values.

# checking difference between labeled y and predicted y
sns.distplot(y_test - prediction)

/home/abhiphull/anaconda3/envs/condapy36/lib/python3.6/site-packages/seaborn/distributions.py:2557: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
  warnings.warn(msg, FutureWarning)

<AxesSubplot:xlabel='PM 2.5', ylabel='Density'>

We get a nearly bell-shaped curve. Does that mean our model is working well? No, we cannot draw that conclusion. A good bell curve only tells us that the predicted values fall within the same range as the original values.
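To illustrate why the shape of this curve is not enough, consider a baseline that always predicts the training mean. In the sketch below the target is drawn from a normal distribution and is unrelated to the features (all data is made up), so the residuals of this baseline are bell shaped and centered near zero, yet its R² is about zero.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# Hypothetical data: the target is unrelated to the features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = rng.normal(loc=100.0, scale=20.0, size=1000)

# Baseline that ignores the features and predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_demo[:700], y_demo[:700])
pred = baseline.predict(X_demo[700:])
residuals = y_demo[700:] - pred

baseline_r2 = r2_score(y_demo[700:], pred)
print("residual mean:", round(residuals.mean(), 2))
print("R2:", round(baseline_r2, 3))
```

This is why we also look at a scatter plot and at error metrics, not only at the residual distribution.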

Checking predicted y and labeled y using a scatter plot.

plt.scatter(y_test,prediction)

<matplotlib.collections.PathCollection at 0x7fa05aeb0320>

Hyper Parameter tuning

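# Hyper parameters range initialization for tuning
# Note: "auto" for max_features was accepted by the scikit-learn version used here; it has been removed in newer versions
parameters = {"splitter": ["best", "random"],
              "max_depth": [1, 3, 5, 7, 9, 11, 12],
              "min_samples_leaf": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              "min_weight_fraction_leaf": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
              "max_features": ["auto", "log2", "sqrt", None],
              "max_leaf_nodes": [None, 10, 20, 30, 40, 50, 60, 70, 80, 90]}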

Above, we defined the range of values to try for each hyperparameter. GridSearchCV will try every combination of these values to find the best parameters for our decision tree model.
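It is worth counting how many combinations such a grid contains before starting the search. The sketch below uses ParameterGrid from scikit-learn on the same dictionary; with cv=3 every combination is fitted three times.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined in this post
parameters = {
    "splitter": ["best", "random"],
    "max_depth": [1, 3, 5, 7, 9, 11, 12],
    "min_samples_leaf": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "min_weight_fraction_leaf": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "max_features": ["auto", "log2", "sqrt", None],
    "max_leaf_nodes": [None, 10, 20, 30, 40, 50, 60, 70, 80, 90],
}

n_candidates = len(ParameterGrid(parameters))  # 2 * 7 * 10 * 9 * 4 * 10
cv_folds = 3
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # prints: 50400 151200
```

So the grid search has to fit 151200 trees, which explains why the tuning step takes several minutes.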

# importing GridSearchCV for hyperparameter tuning
from sklearn.model_selection import GridSearchCV

tuning_model=GridSearchCV(reg_decision_model,param_grid=parameters,scoring='neg_mean_squared_error',cv=3,verbose=3)

# function for calculating how much time hyperparameter tuning takes
def timer(start_time=None):
    if not start_time:
        start_time = datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        # print(thour, ":", tmin, ':', round(tsec, 2))

X=combine_data.iloc[:,:-1]

y=combine_data.iloc[:,-1]

%%capture
from datetime import datetime

start_time=timer(None)

tuning_model.fit(X,y)

timer(start_time)

Hyperparameter tuning took around 17 minutes. It might vary depending upon your machine.

# best hyperparameters
tuning_model.best_params_

{'max_depth': 5,
 'max_features': 'auto',
 'max_leaf_nodes': 40,
 'min_samples_leaf': 2,
 'min_weight_fraction_leaf': 0.1,
 'splitter': 'random'}

# best model score
tuning_model.best_score_

-3786.5642998048047
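The score is negative because we asked for scoring='neg_mean_squared_error': scikit-learn always maximizes the score, so it reports the MSE with a flipped sign. A small sketch converts the value printed above back into MSE and RMSE.

```python
import math

# tuning_model.best_score_ as printed above
best_score = -3786.5642998048047

cv_mse = -best_score          # flip the sign to get the mean squared error
cv_rmse = math.sqrt(cv_mse)   # same unit as PM 2.5
print(f"CV MSE: {cv_mse:.2f}, CV RMSE: {cv_rmse:.2f}")
```

The cross-validated RMSE is therefore roughly 61.5.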

Training Decision Tree With Best Hyperparameters

tuned_hyper_model=DecisionTreeRegressor(max_depth=5,max_features='auto',max_leaf_nodes=50,min_samples_leaf=2,min_weight_fraction_leaf=0.1,splitter='random')

# fitting model
tuned_hyper_model.fit(X_train, y_train)

DecisionTreeRegressor(max_depth=5, max_features='auto', max_leaf_nodes=50,
                      min_samples_leaf=2, min_weight_fraction_leaf=0.1,
                      splitter='random')
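Typing the best parameters by hand is easy to get wrong: the search reported max_leaf_nodes of 40, while the model above was created with 50. One way to avoid this is to unpack the parameter dictionary directly, as in the sketch below. The training data here is synthetic, and 'auto' is replaced by None as an assumption so that the snippet also runs on newer scikit-learn versions (for a regression tree both mean "use all features").

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# best_params_ as printed above, except max_features: 'auto' -> None (assumption,
# both mean "use all features" for a regression tree)
best_params = {
    "max_depth": 5,
    "max_features": None,
    "max_leaf_nodes": 40,
    "min_samples_leaf": 2,
    "min_weight_fraction_leaf": 0.1,
    "splitter": "random",
}

# Hypothetical synthetic data with eight features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.5, size=200)

# Unpack the dictionary instead of retyping each value
tuned = DecisionTreeRegressor(**best_params, random_state=0).fit(X_demo, y_demo)
print(tuned.get_params()["max_leaf_nodes"], tuned.get_n_leaves())
```

In the notebook this would be DecisionTreeRegressor(**tuning_model.best_params_). Since GridSearchCV refits the best model by default, tuning_model.best_estimator_ is another option, keeping in mind that it was fitted on the data passed to tuning_model.fit.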

# prediction
tuned_pred = tuned_hyper_model.predict(X_test)

plt.scatter(y_test,tuned_pred)

<matplotlib.collections.PathCollection at 0x7fa05ac52c50>

The above scatter plot looks a lot better.

Let us now compare the error metrics of our tuned model with those of our original model, which was trained without any parameter tuning.

# With hyperparameter tuning
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, tuned_pred))
print('MSE:', metrics.mean_squared_error(y_test, tuned_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, tuned_pred)))

MAE: 48.814175526595086
MSE: 4155.120637935324
RMSE: 64.46022523956401

# Without hyperparameter tuning
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, prediction))
print('MSE:', metrics.mean_squared_error(y_test, prediction))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))

MAE: 59.15023747989637
MSE: 6426.809819039633
RMSE: 80.16738625550688

Conclusion

If you compare the above metrics for both models, the model with hyperparameter tuning has lower errors (MSE 4155) than the model without hyperparameter tuning (MSE 6427).
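To put the difference into perspective, the relative reduction of each metric can be computed from the values printed above. This is only a small arithmetic sketch on the reported numbers.

```python
# Metrics copied from the two outputs above
tuned = {"MAE": 48.814175526595086, "MSE": 4155.120637935324, "RMSE": 64.46022523956401}
untuned = {"MAE": 59.15023747989637, "MSE": 6426.809819039633, "RMSE": 80.16738625550688}

# Percentage by which each error metric dropped after tuning
reduction = {
    name: 100.0 * (untuned[name] - tuned[name]) / untuned[name]
    for name in tuned
}
for name, pct in reduction.items():
    print(f"{name}: {pct:.1f}% lower with tuning")
```

The MSE drops by roughly 35%, the RMSE by roughly 20% and the MAE by roughly 17%.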

	Unnamed: 0	T	TM	Tm	SLP	H	VV	V	VM	PM 2.5
0	1	26.7	33.0	20.0	1012.4	60.0	5.1	4.4	13.0	284.795833
1	3	29.1	35.0	20.5	1011.9	49.0	5.8	5.2	14.8	219.720833
2	5	28.4	36.0	21.0	1011.3	46.0	5.3	5.7	11.1	182.187500
3	7	25.9	32.0	20.0	1011.8	56.0	6.1	6.9	11.1	154.037500
4	9	24.8	31.1	20.6	1013.6	58.0	4.8	8.3	11.1	223.208333

	T	TM	Tm	SLP	H	VV	V	VM	PM 2.5
0	26.7	33.0	20.0	1012.4	60.0	5.1	4.4	13.0	284.795833
1	29.1	35.0	20.5	1011.9	49.0	5.8	5.2	14.8	219.720833

	T	TM	Tm	SLP	H	VV	V	VM	PM 2.5
0	26.7	33.0	20.0	1012.4	60.0	5.1	4.4	13.0	284.795833
1	29.1	35.0	20.5	1011.9	49.0	5.8	5.2	14.8	219.720833
2	28.4	36.0	21.0	1011.3	46.0	5.3	5.7	11.1	182.187500
3	25.9	32.0	20.0	1011.8	56.0	6.1	6.9	11.1	154.037500
4	24.8	31.1	20.6	1013.6	58.0	4.8	8.3	11.1	223.208333

	T	TM	Tm	SLP	H	VV	V	VM	PM 2.5
638	28.5	33.4	20.9	1012.6	59.0	5.3	6.3	14.8	185.500000
639	24.9	33.2	14.8	1011.5	48.0	4.2	4.6	13.0	166.875000
640	26.4	32.0	20.9	1011.2	70.0	3.9	6.7	9.4	200.333333
641	20.8	25.0	14.5	1016.8	78.0	4.7	5.9	11.1	349.291667
642	23.3	28.0	14.9	1014.0	71.0	4.5	3.0	9.4	310.250000

	T	TM	Tm	SLP	H	VV	V	VM	PM 2.5
count	643.000000	643.000000	643.000000	643.000000	643.000000	643.000000	643.000000	643.000000	643.000000
mean	27.609953	33.974028	20.669207	1009.030327	51.716952	5.057698	7.686936	16.139036	111.378895
std	3.816030	4.189773	4.314514	4.705001	16.665038	0.727143	3.973736	6.915630	82.144946
min	18.900000	22.000000	9.000000	998.000000	15.000000	2.300000	1.100000	5.400000	0.000000
25%	24.900000	31.000000	17.950000	1005.100000	38.000000	4.700000	5.000000	11.100000	46.916667
50%	27.000000	33.000000	21.400000	1009.400000	51.000000	5.000000	6.900000	14.800000	89.875000
75%	29.800000	37.000000	23.700000	1013.100000	64.000000	5.500000	9.400000	18.300000	159.854167
max	37.700000	45.000000	31.200000	1019.200000	95.000000	7.700000	25.600000	77.800000	404.500000

John Ludhi/nbshare.io: Decision Tree Regression With Hyper Parameter Tuning In Python


	T	TM	Tm	SLP	H	VV	V	VM	PM 2.5
T	1.000000	0.920752	0.786809	-0.516597	-0.477952	0.572818	0.160582	0.192456	-0.441826
TM	0.920752	1.000000	0.598095	-0.342692	-0.626362	0.560743	-0.002735	0.074952	-0.316378
Tm	0.786809	0.598095	1.000000	-0.735621	0.058105	0.296954	0.439133	0.377274	-0.591487
SLP	-0.516597	-0.342692	-0.735621	1.000000	-0.250364	-0.187913	-0.610149	-0.506489	0.585046
H	-0.477952	-0.626362	0.058105	-0.250364	1.000000	-0.565165	0.236208	0.145866	-0.153904
VV	0.572818	0.560743	0.296954	-0.187913	-0.565165	1.000000	0.034476	0.081239	-0.147582
V	0.160582	-0.002735	0.439133	-0.610149	0.236208	0.034476	1.000000	0.747435	-0.378281
VM	0.192456	0.074952	0.377274	-0.506489	0.145866	0.081239	0.747435	1.000000	-0.319558
PM 2.5	-0.441826	-0.316378	-0.591487	0.585046	-0.153904	-0.147582	-0.378281	-0.319558	1.000000

	T	TM	Tm	SLP	H	VV	V	VM
334	28.9	36.0	15.0	1009.2	21.0	5.3	4.8	11.1
46	32.8	39.0	26.0	1006.6	41.0	5.6	7.0	77.8
246	30.3	37.0	24.2	1003.7	38.0	4.7	21.9	29.4
395	28.4	36.6	23.0	1003.1	63.0	4.7	10.7	18.3
516	26.9	31.0	22.9	1003.0	76.0	4.0	7.8	16.5
...	...	...	...	...	...	...	...	...
9	23.7	30.4	17.0	1015.8	46.0	5.1	5.2	14.8
359	33.6	40.0	25.0	1006.9	36.0	5.8	6.1	11.1
192	24.9	30.4	19.0	1008.9	57.0	4.8	4.6	9.4
629	26.1	29.0	22.4	1001.2	87.0	5.0	14.1	22.2
559	23.8	30.2	17.9	1010.6	55.0	4.5	3.7	7.6

	T	TM	Tm	SLP	H	VV	V	VM
637	28.4	33.5	20.9	1013.1	63.0	5.3	6.1	66.5
165	20.7	30.1	9.0	1010.5	35.0	4.5	4.6	14.8
467	26.7	33.5	21.0	1010.9	37.0	5.1	5.7	11.1
311	26.0	31.0	20.4	1011.5	63.0	4.8	3.9	9.4
432	26.4	30.9	22.6	1010.0	75.0	4.2	7.6	16.5
...	...	...	...	...	...	...	...	...
249	27.2	32.3	22.0	1003.7	55.0	4.8	20.0	29.4
89	29.7	34.0	22.6	1003.8	56.0	5.5	13.5	27.8
293	22.3	30.3	11.4	1012.6	37.0	5.1	7.2	20.6
441	27.1	33.0	20.0	1010.7	49.0	4.2	6.1	18.3
478	25.6	32.0	19.0	1012.1	59.0	3.9	6.1	11.1