Channel: Planet Python

John Ludhi/nbshare.io: How to Analyze the CSV data in Pandas


How to Analyze the CSV data in Pandas

For this exercise, I am using the College.csv data. A brief explanation of the data is given below.

In [1]:
import pandas as pd
In [2]:
df=pd.read_csv('College.csv')
In [3]:
df.head()
Out[3]:
                     Unnamed: 0  Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  P.Undergrad  Outstate  Room.Board  Books  Personal  PhD  Terminal  S.F.Ratio  perc.alumni  Expend  Grad.Rate
0  Abilene Christian University      Yes  1660    1232     721         23         52         2885          537      7440        3300    450      2200   70        78       18.1           12    7041         60
1            Adelphi University      Yes  2186    1924     512         16         29         2683         1227     12280        6450    750      1500   29        30       12.2           16   10527         56
2                Adrian College      Yes  1428    1097     336         22         50         1036           99     11250        3750    400      1165   53        66       12.9           30    8735         54
3           Agnes Scott College      Yes   417     349     137         60         89          510           63     12960        5450    450       875   92        97        7.7           37   19016         59
4     Alaska Pacific University      Yes   193     146      55         16         44          249          869      7560        4120    800      1500   76        72       11.9            2   10922         15

Description of Data

Private : Public/private indicator

Apps : Number of applications received

Accept : Number of applicants accepted

Enroll : Number of new students enrolled

Top10perc : New students from top 10% of high school class

Top25perc : New students from top 25% of high school class

F.Undergrad : Number of full-time undergraduates

P.Undergrad : Number of part-time undergraduates

Outstate : Out-of-state tuition

Room.Board : Room and board costs

Books : Estimated book costs

Personal : Estimated personal spending

PhD : Percent of faculty with Ph.D.’s

Terminal : Percent of faculty with terminal degree

S.F.Ratio : Student/faculty ratio

perc.alumni : Percent of alumni who donate

Expend : Instructional expenditure per student

Grad.Rate : Graduation rate

Let's look at a summary of the data using the pandas describe() method.

In [5]:
df.describe()
Out[5]:
               Apps        Accept        Enroll     Top10perc     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board         Books      Personal           PhD      Terminal     S.F.Ratio   perc.alumni        Expend     Grad.Rate
count    777.000000    777.000000    777.000000    777.000000    777.000000    777.000000    777.000000    777.000000    777.000000    777.000000    777.000000    777.000000    777.000000    777.000000    777.000000    777.000000     777.00000
mean    3001.638353   2018.804376    779.972973     27.558559     55.796654   3699.907336    855.298584  10440.669241   4357.526384    549.380952   1340.642214     72.660232     79.702703     14.089704     22.743887   9660.171171      65.46332
std     3870.201484   2451.113971    929.176190     17.640364     19.804778   4850.420531   1522.431887   4023.016484   1096.696416    165.105360    677.071454     16.328155     14.722359      3.958349     12.391801   5221.768440      17.17771
min       81.000000     72.000000     35.000000      1.000000      9.000000    139.000000      1.000000   2340.000000   1780.000000     96.000000    250.000000      8.000000     24.000000      2.500000      0.000000   3186.000000      10.00000
25%      776.000000    604.000000    242.000000     15.000000     41.000000    992.000000     95.000000   7320.000000   3597.000000    470.000000    850.000000     62.000000     71.000000     11.500000     13.000000   6751.000000      53.00000
50%     1558.000000   1110.000000    434.000000     23.000000     54.000000   1707.000000    353.000000   9990.000000   4200.000000    500.000000   1200.000000     75.000000     82.000000     13.600000     21.000000   8377.000000      65.00000
75%     3624.000000   2424.000000    902.000000     35.000000     69.000000   4005.000000    967.000000  12925.000000   5050.000000    600.000000   1700.000000     85.000000     92.000000     16.500000     31.000000  10830.000000      78.00000
max    48094.000000  26330.000000   6392.000000     96.000000    100.000000  31643.000000  21836.000000  21700.000000   8124.000000   2340.000000   6800.000000    103.000000    100.000000     39.800000     64.000000  56233.000000     118.00000
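As a small illustration of what describe() does with mixed column types, here is a minimal sketch on a five-row stand-in frame. The frame name sample is made up for this example; the Apps and Accept values are copied from the head() output above. By default only numeric columns are summarized, while include="all" also reports text columns such as Private.

```python
import pandas as pd

# Five-row stand-in for the College data (values taken from the head() output).
sample = pd.DataFrame({
    "Private": ["Yes", "Yes", "Yes", "Yes", "Yes"],
    "Apps": [1660, 2186, 1428, 417, 193],
    "Accept": [1232, 1924, 1097, 349, 146],
})

# Default behaviour: only the numeric columns are summarized.
summary = sample.describe()
numeric_columns = list(summary.columns)

# Individual statistics can be read with .loc[row_label, column_label].
apps_median = summary.loc["50%", "Apps"]

# include="all" adds non-numeric columns (with unique/top/freq rows).
full_summary = sample.describe(include="all")
```

The rows "unique", "top" and "freq" in full_summary are filled only for the text column; the numeric statistics are left as NaN there.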

Let's fix the university name column, which is showing up as Unnamed: 0.

In [19]:
df.rename(columns={'Unnamed: 0':'University'},inplace=True)

Let's check whether the column has been fixed.

In [20]:
df.head(1)
Out[20]:
                     University  Private  Apps  Accept  Enroll  Top10perc  Top25perc  F_Undergrad  P_Undergrad  Outstate  Room_Board  Books  Personal  PhD  Terminal  S_F_Ratio  perc_alumni  Expend  Grad_Rate
0  Abilene Christian University      Yes  1660    1232     721         23         52         2885          537      7440        3300    450      2200   70        78       18.1           12    7041         60
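For reference, here is a minimal sketch of where the Unnamed: 0 label comes from and two ways of dealing with it. It assumes, as the output above suggests, that the first header cell of the CSV is empty. The two-row csv_text string and the names sample, renamed and indexed are made up for this example.

```python
import io
import pandas as pd

# Two-row stand-in for College.csv. The first header cell is empty,
# so pandas labels that column "Unnamed: 0".
csv_text = (
    ",Private,Apps\n"
    "Abilene Christian University,Yes,1660\n"
    "Adelphi University,Yes,2186\n"
)

sample = pd.read_csv(io.StringIO(csv_text))
first_col = sample.columns[0]

# Option 1: rename the column. Without inplace=True, rename() returns a new frame.
renamed = sample.rename(columns={first_col: "University"})

# Option 2: read the first column as the row index instead of a regular column.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
indexed.index.name = "University"
```

With option 2 the university names become row labels, so a single row can be looked up with indexed.loc["Adelphi University"].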

We can plot a few columns to understand more about the data.

Let's look at a plot of column PhD against column Grad.Rate.

First, let's fix the column names that contain a dot by replacing each dot with an underscore (_).

In [7]:
df.rename(columns=lambda x: x.replace(".", "_"), inplace=True)

Let's check the column names now.

In [8]:
df.columns
Out[8]:
Index(['Unnamed: 0', 'Private', 'Apps', 'Accept', 'Enroll', 'Top10perc',
       'Top25perc', 'F_Undergrad', 'P_Undergrad', 'Outstate', 'Room_Board',
       'Books', 'Personal', 'PhD', 'Terminal', 'S_F_Ratio', 'perc_alumni',
       'Expend', 'Grad_Rate'],
      dtype='object')
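The same renaming can also be written with the string methods on the column index. This is only an alternative sketch, not the approach used above; the frame sample is an empty stand-in with a few of the dotted names. Passing regex=False makes sure the dot is treated as a literal character rather than as the regular-expression wildcard.

```python
import pandas as pd

# Empty stand-in frame carrying a few of the dotted column names.
sample = pd.DataFrame(columns=["F.Undergrad", "Room.Board", "S.F.Ratio", "PhD"])

# Replace every literal dot with an underscore in all column names at once.
sample.columns = sample.columns.str.replace(".", "_", regex=False)

new_names = list(sample.columns)
```

Underscored names are also convenient because they can be used with attribute access, for example df.Grad_Rate, which a name like Grad.Rate does not allow.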

OK, the dots are now replaced with underscores, so we can do the plotting. We will use the seaborn library to plot.

In [9]:
import seaborn as sns
In [13]:
sns.scatterplot(x='PhD', y='Grad_Rate', data=df)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f067ce6cb90>

Above is a simple scatter plot that shows Grad_Rate on the y axis and PhD on the x axis. In the command sns.scatterplot(x='PhD', y='Grad_Rate', data=df), we pass the column names as x and y and the dataframe df as the data argument. The column names are given as keywords because recent seaborn releases no longer accept x and y as positional arguments.
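A scatter plot can be complemented with a single number, the Pearson correlation between the two columns. The sketch below is an addition for illustration only: it uses the five rows shown by head() in a stand-in frame called sample, so the resulting value says nothing about the full data set.

```python
import pandas as pd

# PhD and Grad_Rate values of the first five rows (from the head() output).
sample = pd.DataFrame({
    "PhD": [70, 29, 53, 92, 76],
    "Grad_Rate": [60, 56, 54, 59, 15],
})

# Series.corr() computes the Pearson correlation coefficient by default.
r = sample["PhD"].corr(sample["Grad_Rate"])
```

On the real dataframe the same call would be df['PhD'].corr(df['Grad_Rate']).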

Let's run another query to see how many of these colleges are private. This is equivalent to the SQL statement select count(*) from df where Private = 'Yes'. Let us see how easily we can do this in pandas.

In [16]:
len(df[df.Private=="Yes"])
Out[16]:
565
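The filter-and-count step can be sketched on a tiny frame. The four rows below are invented for illustration (the college names are placeholders and one row is marked "No" so that the filter has something to exclude). Summing the boolean mask and calling value_counts() are two alternatives to len().

```python
import pandas as pd

# Made-up four-row frame; names and values are placeholders.
sample = pd.DataFrame({
    "University": ["College A", "College B", "College C", "College D"],
    "Private": ["Yes", "Yes", "No", "Yes"],
})

# Boolean mask: True where the row satisfies the condition.
mask = sample["Private"] == "Yes"

# Same idea as len(df[df.Private == "Yes"]).
n_private_len = len(sample[mask])

# True counts as 1, so summing the mask gives the same number.
n_private_sum = int(mask.sum())

# value_counts() returns the count for every distinct value at once.
counts = sample["Private"].value_counts()
```

value_counts() is handy when both the "Yes" and the "No" counts are of interest.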

Let's run another query: how many universities drew more than 50% of their new students from the top 10% of their high school class?

To answer this, we have to look at the variable Top10perc. Let us create a new column and call it elite.

In [24]:
df['elite']=df.Top10perc>50

Let's print the first 5 rows to see what we got. We should see an elite column with True and False values.

In [26]:
df.head(5)
Out[26]:
                     University  Private  Apps  Accept  Enroll  Top10perc  Top25perc  F_Undergrad  P_Undergrad  Outstate  Room_Board  Books  Personal  PhD  Terminal  S_F_Ratio  perc_alumni  Expend  Grad_Rate  elite
0  Abilene Christian University      Yes  1660    1232     721         23         52         2885          537      7440        3300    450      2200   70        78       18.1           12    7041         60  False
1            Adelphi University      Yes  2186    1924     512         16         29         2683         1227     12280        6450    750      1500   29        30       12.2           16   10527         56  False
2                Adrian College      Yes  1428    1097     336         22         50         1036           99     11250        3750    400      1165   53        66       12.9           30    8735         54  False
3           Agnes Scott College      Yes   417     349     137         60         89          510           63     12960        5450    450       875   92        97        7.7           37   19016         59   True
4     Alaska Pacific University      Yes   193     146      55         16         44          249          869      7560        4120    800      1500   76        72       11.9            2   10922         15  False

Yes, that's what we got.

Let's check how many elite universities we got. We could try the describe() function again, but by default describe() summarizes only numeric columns, and elite is a boolean (categorical) column. Instead, we use the groupby() method first and then apply the count() method. Let's see how it works.

In [35]:
df.groupby('elite')['University'].count()
Out[35]:
elite
False    699
True      78
Name: University, dtype: int64
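The same counts can also be obtained without groupby(). The sketch below is an alternative, assuming only a boolean column like elite; it rebuilds the column on the five rows shown earlier in a stand-in frame called sample.

```python
import pandas as pd

# Top10perc values of the first five rows (from the head() output).
sample = pd.DataFrame({
    "University": [
        "Abilene Christian University",
        "Adelphi University",
        "Adrian College",
        "Agnes Scott College",
        "Alaska Pacific University",
    ],
    "Top10perc": [23, 16, 22, 60, 16],
})

# The comparison yields a boolean Series, stored as a new column.
sample["elite"] = sample["Top10perc"] > 50

# value_counts() tallies the True and False values directly.
counts = sample["elite"].value_counts()

# True counts as 1, so sum() gives the number of elite rows
# and mean() gives their share.
n_elite = int(sample["elite"].sum())
share_elite = sample["elite"].mean()
```

On the full dataframe, df['elite'].value_counts() should report the same two numbers as the groupby() call above.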

How to Use Seaborn Plots to Analyze the CSV data

Let's now see how we can use plots to analyze the data. As we saw above, seaborn is a great utility for plotting data.

Let's draw a count plot (a bar chart of counts) for the query df.groupby('elite')['University'].count()

In [49]:
import matplotlib.pyplot as plt
sns.countplot(x='elite', hue='elite', data=df)
plt.show()

As we see above, the count plot shows the number of True and False values in the elite column.

Let's draw a scatter plot matrix using seaborn.

In [52]:
sns.pairplot(df)

I got the following error:

TypeError: numpy boolean subtract, the - operator, is deprecated, use the bitwise_xor, the ^ operator, or the logical_xor function instead.

The above error occurs because the new variable "elite" that we created has a boolean data type, which pairplot could not handle here. Let's exclude that variable and plot again.

But how do we exclude just one column in pandas? Let's try the following.

In [54]:
df.loc[:,df.columns!='elite'].head(1)
Out[54]:
                     University  Private  Apps  Accept  Enroll  Top10perc  Top25perc  F_Undergrad  P_Undergrad  Outstate  Room_Board  Books  Personal  PhD  Terminal  S_F_Ratio  perc_alumni  Expend  Grad_Rate
0  Abilene Christian University      Yes  1660    1232     721         23         52         2885          537      7440        3300    450      2200   70        78       18.1           12    7041         60
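There are other ways to leave a column out, sketched below on a two-row stand-in frame (the name sample and its values are for illustration only). drop(columns=...) removes columns by name, and select_dtypes(include="number") keeps only numeric columns, which also leaves out boolean and text columns.

```python
import pandas as pd

# Two-row stand-in with a text, two numeric and one boolean column.
sample = pd.DataFrame({
    "University": ["Abilene Christian University", "Agnes Scott College"],
    "Apps": [1660, 417],
    "Accept": [1232, 349],
    "elite": [False, True],
})

# Same result as sample.loc[:, sample.columns != "elite"].
by_mask = sample.loc[:, sample.columns != "elite"]

# drop() returns a new frame without the named column; sample is unchanged.
by_drop = sample.drop(columns="elite")

# Keep numeric columns only; bool and object columns are excluded.
numeric_only = sample.select_dtypes(include="number")
```

Either by_drop or numeric_only could then be passed to a plotting function in place of the original frame.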

OK, let's check whether we can pass this dataframe to seaborn.

In [56]:
sns.pairplot(df.loc[:,df.columns!='elite'])

The above command worked. The plot is not shown here because of its size. Let's select just two columns and plot them.

In [73]:
sns.pairplot(df.loc[:,['Apps','Accept']])
Out[73]:
<seaborn.axisgrid.PairGrid at 0x7f065f53b390>
