
Robin Wilson: Regression in Python using R-style formula – it’s easy!

I remember experimenting with regressions in Python using R-style formulae a long time ago, and finding it a bit complicated. Luckily it’s become really easy now – and I’ll show you just how easy.

Before running this you will need to install the pandas, statsmodels and patsy packages. If you’re using conda you should be able to do this by running the following from the terminal:

conda install pandas statsmodels patsy

(and then say yes when it asks you to confirm it)
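If you’re not using conda, installing the same packages with pip should work just as well (assuming pip is already set up for your Python environment):

pip install pandas statsmodels patsy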

import pandas as pd
from statsmodels.formula.api import ols

Before we can do any regression, we need some data – so let’s read some data on cars:

df = pd.read_csv("http://web.pdx.edu/~gerbing/data/cars.csv")

You may have noticed from the code above that you can just give a URL to the read_csv function and it will download and open it – handy!
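If you’d rather not download the file every time you run the code, you can save a local copy and read that on subsequent runs – a minimal sketch (the filename is just an example):

# Save a local copy of the data, then read from it next time
df.to_csv("cars.csv", index=False)
df = pd.read_csv("cars.csv")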

Anyway, here is the data:

df.head()
                     Model   MPG  Cylinders  Engine Disp  Horsepower  Weight  Accelerate  Year    Origin
0       amc ambassador dpl  15.0          8        390.0         190    3850         8.5    70  American
1              amc gremlin  21.0          6        199.0          90    2648        15.0    70  American
2               amc hornet  18.0          6        199.0          97    2774        15.5    70  American
3            amc rebel sst  16.0          8        304.0         150    3433        12.0    70  American
4  buick estate wagon (sw)  14.0          8        455.0         225    3086        10.0    70  American

Before we do our regression it might be a good idea to look at simple correlations between columns. We can get the correlations between each pair of columns using the corr() method:

df.corr()
                  MPG  Cylinders  Engine Disp  Horsepower    Weight  Accelerate      Year
MPG          1.000000  -0.777618    -0.805127   -0.778427 -0.832244    0.423329  0.580541
Cylinders   -0.777618   1.000000     0.950823    0.842983  0.897527   -0.504683 -0.345647
Engine Disp -0.805127   0.950823     1.000000    0.897257  0.932994   -0.543800 -0.369855
Horsepower  -0.778427   0.842983     0.897257    1.000000  0.864538   -0.689196 -0.416361
Weight      -0.832244   0.897527     0.932994    0.864538  1.000000   -0.416839 -0.309120
Accelerate   0.423329  -0.504683    -0.543800   -0.689196 -0.416839    1.000000  0.290316
Year         0.580541  -0.345647    -0.369855   -0.416361 -0.309120    0.290316  1.000000
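Since MPG is the variable we’re trying to predict, it can be handy to pull out just its column of the correlation matrix and sort it – a quick sketch:

# Correlation of each numeric column with MPG, from most negative to most positive
df.corr()['MPG'].sort_values()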

Now we can do some regression using R-style formulae. In this case we’re trying to predict MPG based on the year that the car was released:

model=ols("MPG ~ Year",data=df)results=model.fit()

The ‘formula’ that we used above is the same as R uses: the dependent variable goes on the left of the ~, and the independent variable(s) on the right. The ols function is nice and easy to use: we just give it the formula and the DataFrame to get the data from (in this case, df), and then call fit() to actually do the regression.
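Behind the scenes, patsy turns the formula into a design matrix and statsmodels fits an ordinary least squares model to it. If you ever want to do that step explicitly, a rough equivalent looks like this (a sketch, using the same df as above):

import statsmodels.api as sm
from patsy import dmatrices

# Build the response vector and design matrix described by the formula
y, X = dmatrices("MPG ~ Year", data=df, return_type="dataframe")
# Fit the same OLS model without the formula interface
results_manual = sm.OLS(y, X).fit()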

We can easily get a summary of the results here – including all sorts of crazy statistical measures!

results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                    MPG   R-squared:                       0.337
Model:                            OLS   Adj. R-squared:                  0.335
Method:                 Least Squares   F-statistic:                     198.3
Date:                Sat, 20 Aug 2016   Prob (F-statistic):           1.08e-36
Time:                        10:42:17   Log-Likelihood:                -1280.6
No. Observations:                 392   AIC:                             2565.
Df Residuals:                     390   BIC:                             2573.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    -70.0117      6.645    -10.536      0.000   -83.076   -56.947
Year           1.2300      0.087     14.080      0.000     1.058     1.402
==============================================================================
Omnibus:                       21.407   Durbin-Watson:                   1.121
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               15.843
Skew:                           0.387   Prob(JB):                     0.000363
Kurtosis:                       2.391   Cond. No.                     1.57e+03
==============================================================================

We can do a more complex model easily too. First let’s list the columns of the data to remind ourselves what variables we have:

df.columns
Index(['Model', 'MPG', 'Cylinders', 'Engine Disp', 'Horsepower', 'Weight',
       'Accelerate', 'Year', 'Origin'],
      dtype='object')

We can now add in more variables – doing multiple regression:

model=ols("MPG ~ Year + Weight + Horsepower",data=df)results=model.fit()results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                    MPG   R-squared:                       0.808
Model:                            OLS   Adj. R-squared:                  0.807
Method:                 Least Squares   F-statistic:                     545.4
Date:                Sat, 20 Aug 2016   Prob (F-statistic):          9.37e-139
Time:                        10:42:17   Log-Likelihood:                -1037.4
No. Observations:                 392   AIC:                             2083.
Df Residuals:                     388   BIC:                             2099.
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    -13.7194      4.182     -3.281      0.001   -21.941    -5.498
Year           0.7487      0.052     14.365      0.000     0.646     0.851
Weight        -0.0064      0.000    -15.768      0.000    -0.007    -0.006
Horsepower    -0.0050      0.009     -0.530      0.597    -0.024     0.014
==============================================================================
Omnibus:                       41.952   Durbin-Watson:                   1.423
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               69.490
Skew:                           0.671   Prob(JB):                     8.14e-16
Kurtosis:                       4.566   Cond. No.                     7.48e+04
==============================================================================
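A quick aside: everything in that table is also available programmatically on the results object, which is handy if you want to check things like p-values in code rather than by eye – a small sketch:

# p-value for each term in the model we just fitted
results.pvalues
# or just the one for Horsepower
results.pvalues['Horsepower']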

We can see that bringing in some extra variables has increased the $R^2$ value from ~0.3 to ~0.8 – although the p-value for Horsepower is very high (0.597), suggesting it isn’t contributing much to the model. If we remove Horsepower from the regression it barely changes the results:

model=ols("MPG ~ Year + Weight",data=df)results=model.fit()results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                    MPG   R-squared:                       0.808
Model:                            OLS   Adj. R-squared:                  0.807
Method:                 Least Squares   F-statistic:                     819.5
Date:                Sat, 20 Aug 2016   Prob (F-statistic):          3.33e-140
Time:                        10:42:17   Log-Likelihood:                -1037.6
No. Observations:                 392   AIC:                             2081.
Df Residuals:                     389   BIC:                             2093.
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    -14.3473      4.007     -3.581      0.000   -22.224    -6.470
Year           0.7573      0.049     15.308      0.000     0.660     0.855
Weight        -0.0066      0.000    -30.911      0.000    -0.007    -0.006
==============================================================================
Omnibus:                       42.504   Durbin-Watson:                   1.425
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               71.997
Skew:                           0.670   Prob(JB):                     2.32e-16
Kurtosis:                       4.616   Cond. No.                     7.17e+04
==============================================================================

We can also see if introducing categorical variables helps with the regression. In this case, we only have one categorical variable, called Origin. Patsy automatically treats strings as categorical variables, so we don’t have to do anything special – but if needed we could wrap the variable name in C() to force it to be a categorical variable.
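For example, wrapping Origin in C() gives exactly the same model as the one fitted below, just with the categorical treatment made explicit (a quick sketch):

# Explicitly mark Origin as categorical – equivalent to using the bare column name
model = ols("MPG ~ Year + C(Origin)", data=df)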

model=ols("MPG ~ Year + Origin",data=df)results=model.fit()results.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                    MPG   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.576
Method:                 Least Squares   F-statistic:                     178.0
Date:                Sat, 20 Aug 2016   Prob (F-statistic):           1.42e-72
Time:                        10:42:17   Log-Likelihood:                -1191.5
No. Observations:                 392   AIC:                             2391.
Df Residuals:                     388   BIC:                             2407.
Df Model:                           3
Covariance Type:            nonrobust
======================================================================================
                         coef    std err          t      P>|t|  [95.0% Conf. Int.]
--------------------------------------------------------------------------------------
Intercept            -61.2643      5.393    -11.360      0.000   -71.868   -50.661
Origin[T.European]     7.4784      0.697     10.734      0.000     6.109     8.848
Origin[T.Japanese]     8.4262      0.671     12.564      0.000     7.108     9.745
Year                   1.0755      0.071     15.102      0.000     0.935     1.216
======================================================================================
Omnibus:                       10.231   Durbin-Watson:                   1.656
Prob(Omnibus):                  0.006   Jarque-Bera (JB):               10.589
Skew:                           0.402   Prob(JB):                      0.00502
Kurtosis:                       2.980   Cond. No.                     1.60e+03
======================================================================================

You can see here that Patsy has automatically created dummy variables for Origin: in this case, European and Japanese, with American as the ‘default’ (reference) level. You can configure how this is done very easily – see the Patsy documentation on categorical coding, and the example below.
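For instance, Patsy’s Treatment coding lets you choose which level acts as the reference category – a sketch (the variable name model_jp is just illustrative), assuming we wanted Japanese rather than American as the baseline:

# Use Japanese (rather than the default American) as the reference level for Origin
model_jp = ols("MPG ~ Year + C(Origin, Treatment(reference='Japanese'))", data=df)
model_jp.fit().summary()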

Just for reference, you can easily get any of the statistical outputs as attributes on the results object:

results.rsquared
0.57919459237581172
results.params
Intercept            -61.264305
Origin[T.European]     7.478449
Origin[T.Japanese]     8.426227
Year                   1.075484
dtype: float64

You can also really easily use the model to predict based on values you’ve got:

results.predict({'Year': 90, 'Origin': 'European'})
array([ 43.00766095])
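predict will also happily take a DataFrame, so you can get predictions for several hypothetical cars at once – a small sketch with made-up values:

# Predict MPG for a few made-up year/origin combinations
new_cars = pd.DataFrame({'Year': [78, 82, 90],
                         'Origin': ['American', 'Japanese', 'European']})
results.predict(new_cars)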
