How To Get Measures Of Spread With Python

Measures of spread describe how spread out the data points are. Examples of measures of spread are quantiles, variance, standard deviation and mean absolute deviation.

In this exercise we are going to compute these measures of spread using Python.

We will use a dataset from Kaggle. Follow https://www.kaggle.com/datasets/himanshunakrani/student-study-hours to access the data.

Quantiles
Quantiles are values that split sorted data or a probability distribution into equal parts. There are several types of quantiles. Here are some examples:
- Quartiles - Divide the data into 4 equal parts.
- Quintiles - Divide the data into 5 equal parts.
- Deciles - Divide the data into 10 equal parts.
- Percentiles - Divide the data into 100 equal parts.

Let us import the libraries we will use

import numpy as np
import pandas as pd

We will now load the data that we'll use.

df = pd.read_csv('score.csv')
print(df.head())

   Hours  Scores
0    2.5      21
1    5.1      47
2    3.2      27
3    8.5      75
4    3.5      30

Let's calculate the quartiles for the scores. These are the 5 values (the minimum, the 3 quartiles and the maximum) that divide the scores into 4 equal parts.

print(np.quantile(df['Scores'],[0,0.25,0.5,0.75,1]))

[17. 30. 47. 75. 95.]
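A quartile does not have to be one of the observed values. As a minimal sketch, assuming a small made-up list (not the score.csv data), the default behaviour of np.quantile( ) is to interpolate linearly between neighbouring values:

```python
import numpy as np

# Hypothetical data, not taken from the score.csv file
values = [10, 20, 30, 40]

# By default np.quantile interpolates linearly, so a quartile
# can fall between two observed values
quartiles = np.quantile(values, [0, 0.25, 0.5, 0.75, 1])
print(quartiles)
```

Here the first quartile is 17.5 and the third is 32.5, neither of which appears in the list.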

Quantiles using linspace( )

Listing every point by hand becomes tedious, especially for finer quantiles such as deciles and percentiles. In such cases we can use NumPy's linspace( ), which returns evenly spaced numbers over an interval. For example, np.linspace(0,1,5) gives the 5 points 0, 0.25, 0.5, 0.75 and 1.

Let's get the quartiles of the scores

print(np.quantile(df['Scores'],np.linspace(0,1,5)))

[17. 30. 47. 75. 95.]

Let's get the quintiles

print(np.quantile(df['Scores'],np.linspace(0,1,6)))

[17.  26.6 38.6 60.8 77.  95. ]

Let's get the deciles

print(np.quantile(df['Scores'],np.linspace(0,1,11)))

[17.  22.2 26.6 30.  38.6 47.  60.8 68.6 77.  85.6 95. ]
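The same pattern extends to percentiles. The sketch below uses a hypothetical list of scores (an assumption for illustration, not the score.csv data) and also shows np.percentile( ), which takes the position on a 0 to 100 scale instead of 0 to 1:

```python
import numpy as np

# Hypothetical scores, used only for illustration
scores = [12, 35, 41, 48, 53, 59, 64, 70, 77, 93]

# 101 evenly spaced points from 0 to 1 give the 0th to 100th percentiles
percentiles = np.quantile(scores, np.linspace(0, 1, 101))

print(len(percentiles))   # number of cut points
print(percentiles[50])    # the 50th percentile, which is the median

# np.percentile expects the position on a 0-100 scale
p90 = np.percentile(scores, 90)
print(p90)
```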

Interquartile Range (IQR)

This is the difference between the 3rd and the 1st quartile. The IQR tells the spread of the middle half of the data.

Let's get the IQR for the scores

IQR = np.quantile(df['Scores'], 0.75) - np.quantile(df['Scores'], 0.25)
print(IQR)

45.0

Another way we can get IQR is by using iqr( ) from the scipy library

from scipy.stats import iqr
IQR = iqr(df['Scores'])
print(IQR)

45.0

Outliers

These are data points that are unusually different or detached from the rest of the data points.

A data point is an outlier if:

data < 1st quartile − 1.5 * IQR
or
data > 3rd quartile + 1.5 * IQR

Let's get the outliers in the scores

# first get the IQR (named iqr_value so it does not shadow the iqr function)
iqr_value = iqr(df['Scores'])
# then get the lower & upper thresholds
lower_threshold = np.quantile(df['Scores'], 0.25) - 1.5 * iqr_value
upper_threshold = np.quantile(df['Scores'], 0.75) + 1.5 * iqr_value
# then find the outliers
outliers = df[(df['Scores'] < lower_threshold) | (df['Scores'] > upper_threshold)]
print(outliers)

Empty DataFrame
Columns: [Hours, Scores]
Index: []

The thresholds are 30 − 1.5 * 45 = −37.5 and 75 + 1.5 * 45 = 142.5. Every score lies between 17 and 95, so this dataset has no outliers.
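To see the rule pick out a value, the sketch below wraps the two thresholds in a small helper and runs it on a made-up list with one unusually high number. The function name find_outliers and the sample data are assumptions for illustration only.

```python
import numpy as np

def find_outliers(values):
    """Return the values that fall outside the 1.5 * IQR thresholds."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr_value = q3 - q1
    lower = q1 - 1.5 * iqr_value
    upper = q3 + 1.5 * iqr_value
    # Keep only the points below the lower or above the upper threshold
    return values[(values < lower) | (values > upper)]

# Hypothetical scores with one unusually high value
sample = [20, 22, 25, 27, 30, 31, 33, 150]
print(find_outliers(sample))
```

Only 150 is returned here, since the thresholds for this list work out to 13.375 and 42.375.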

Variance

Variance is the average of the squared distance between each data point and the mean of the data.

Let's calculate the variance of the scores. We will use np.var( )

print(np.var(df['Scores'],ddof=1))

639.4266666666666

With ddof=1, np.var( ) returns the sample variance, which divides by n − 1. If ddof is left out, it defaults to 0 and we get the population variance, which divides by n.

Let's see that here below.

print(np.var(df['Scores']))

613.8496
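The difference between the two results comes down to the divisor. As a rough sketch on a small hypothetical array (not the score.csv data), both versions can be computed by hand and compared with np.var( ):

```python
import numpy as np

# Hypothetical data, used only for illustration
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Squared distance of each point from the mean
squared_dists = (values - values.mean()) ** 2

# Population variance divides by n (ddof=0, the NumPy default)
population_var = squared_dists.sum() / len(values)

# Sample variance divides by n - 1 (ddof=1)
sample_var = squared_dists.sum() / (len(values) - 1)

print(population_var, np.var(values))
print(sample_var, np.var(values, ddof=1))
```

Each printed pair should agree, and the sample variance is always the larger of the two because it divides by a smaller number.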

Standard deviation

This is the square root of the variance.

Let's get the standard deviation of the scores

print(np.sqrt(np.var(df['Scores'],ddof=1)))

25.28688724747802

Another way we can get the standard deviation is by using np.std( )

Let's use that

print(np.std(df['Scores'],ddof=1))

25.28688724747802

Mean Absolute Deviation

This is the average of the absolute distance between each data point and the mean of the data.

Let's find the mean absolute deviation of the scores

# first find the distance between the data points and the mean
dists = df['Scores'] - np.mean(df['Scores'])
# then find the mean of the absolute distances
print(np.mean(np.abs(dists)))

22.4192
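The mean absolute deviation and the standard deviation both measure distance from the mean, but the standard deviation squares the distances first, so larger distances weigh more. The sketch below compares the two on a hypothetical list. The helper name mean_absolute_deviation is an assumption for illustration, not a NumPy function.

```python
import numpy as np

def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values - values.mean()))

# Hypothetical data, used only for illustration
values = [2, 4, 4, 4, 5, 5, 7, 9]

mad = mean_absolute_deviation(values)
std = np.std(values)  # population standard deviation (ddof=0)
print(mad, std)
```

For this list the mean absolute deviation is 1.5 and the population standard deviation is 2.0.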

describe( ) method

The pandas describe( ) method computes summary statistics for a dataframe or a single column. For numerical data it reports the count, mean, standard deviation, minimum, quartiles and maximum.

We can make use of it to get some of the measurements that have been mentioned above.

df['Scores'].describe()

count    25.000000
mean     51.480000
std      25.286887
min      17.000000
25%      30.000000
50%      47.000000
75%      75.000000
max      95.000000
Name: Scores, dtype: float64
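describe( ) also accepts a percentiles argument when quantiles other than the quartiles are needed. The sketch below uses a hypothetical Series (an assumption, not the score.csv data) and also checks that the std it reports is the sample standard deviation:

```python
import pandas as pd

# Hypothetical scores, used only for illustration
scores = pd.Series([12, 35, 41, 48, 53, 59, 64, 70, 77, 93], name="Scores")

# Ask for the 10th, 50th and 90th percentiles instead of the quartiles
summary = scores.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)

# The std reported by describe() is the sample standard deviation (ddof=1)
sample_std = scores.std(ddof=1)
print(sample_std)
```

This matches the output above, where std is 25.286887, the same value we got from np.std( ) with ddof=1.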

John Ludhi/nbshare.io: How To Get Measures Of Spread With Python

Posted by Purity on 09/02/2022
