Measures of spread describe how spread out the data points are. Some examples of measures of spread are quantiles, variance, standard deviation and mean absolute deviation.
In this exercise we are going to compute the measures of spread using Python.
We will use a dataset from Kaggle; follow https://www.kaggle.com/datasets/himanshunakrani/student-study-hours to access the data.
Quantiles
Quantiles are values that split sorted data or a probability distribution into equal parts. There are several different types of quantiles; here are some examples:
- Quartiles - Divide the data into 4 equal parts.
- Quintiles - Divide the data into 5 equal parts.
- Deciles - Divide the data into 10 equal parts.
- Percentiles - Divide the data into 100 equal parts.
Let us import the libraries we will use
import numpy as np
import pandas as pd
We will now load the data that we'll use.
df = pd.read_csv('score.csv')
print(df.head())
Let's calculate the quartiles for the scores. Splitting the scores into 4 equal parts takes 5 cut points: the minimum, the 1st quartile, the median, the 3rd quartile and the maximum.
print(np.quantile(df['Scores'],[0,0.25,0.5,0.75,1]))
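To see what those five cut points mean, here is a minimal sketch on data small enough to check by hand. The values are made up for illustration and are not taken from the Kaggle dataset.

```python
import numpy as np

# Made-up values for illustration only (not the Kaggle scores)
values = np.arange(1, 10)  # 1, 2, ..., 9

# 0 and 1 give the minimum and maximum; 0.25, 0.5 and 0.75 give the quartiles
quartiles = np.quantile(values, [0, 0.25, 0.5, 0.75, 1])
print(quartiles)  # the cut points 1, 3, 5, 7 and 9
```

Each of the four intervals between neighbouring cut points covers the same share of the sorted data.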
Quantiles using linspace( )
It can become quite tedious to list all the points when getting quantiles, especially for finer quantiles such as deciles and percentiles. In such cases we can use np.linspace( ), which generates the evenly spaced points for us.
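As a quick sketch of what np.linspace( ) produces: np.linspace(start, stop, num) returns num evenly spaced values including both endpoints, so splitting into n equal parts needs n + 1 points. The helper name cut_points below is hypothetical and only used for illustration.

```python
import numpy as np

def cut_points(n_parts):
    """Return the n_parts + 1 evenly spaced points between 0 and 1.

    Hypothetical helper for illustration; it only wraps np.linspace.
    """
    return np.linspace(0, 1, n_parts + 1)

quartile_points = cut_points(4)   # the values 0, 0.25, 0.5, 0.75, 1
decile_points = cut_points(10)    # 11 points: 0, 0.1, ..., 1

print(quartile_points)
print(len(decile_points))
```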
Let's get the quartiles of the scores
print(np.quantile(df['Scores'],np.linspace(0,1,5)))
Let's get the quintiles
print(np.quantile(df['Scores'],np.linspace(0,1,6)))
Let's get the deciles
print(np.quantile(df['Scores'],np.linspace(0,1,11)))
Interquartile Range (IQR)
This is the difference between the 3rd and the 1st quartile. The IQR tells the spread of the middle half of the data.
Let's get the IQR for the scores
IQR = np.quantile(df['Scores'], 0.75) - np.quantile(df['Scores'], 0.25)
print(IQR)
Another way we can get the IQR is by using iqr( ) from the scipy library
from scipy.stats import iqr
IQR = iqr(df['Scores'])
print(IQR)
Outliers
These are data points that are unusually different or detached from the rest of the data points.
A data point is an outlier if:
data < 1st quartile − 1.5 * IQR
or
data > 3rd quartile + 1.5 * IQR
Let's get the outliers in the scores
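# first get the IQR (stored under a new name so the iqr function is not overwritten)
iqr_scores = iqr(df['Scores'])
# then get the lower & upper thresholds
lower_threshold = np.quantile(df['Scores'], 0.25) - 1.5 * iqr_scores
upper_threshold = np.quantile(df['Scores'], 0.75) + 1.5 * iqr_scores
# then find the outliers
outliers = df[(df['Scores'] < lower_threshold) | (df['Scores'] > upper_threshold)]
print(outliers)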
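The 1.5 * IQR rule can be checked on a small example. This is a minimal sketch using made-up values (not the Kaggle scores) in which one point sits far from the rest.

```python
import numpy as np

# Made-up values for illustration only; 95 is deliberately far from the rest
values = np.array([10, 12, 13, 15, 16, 18, 20, 95])

q1, q3 = np.quantile(values, [0.25, 0.75])
iqr_value = q3 - q1

# Thresholds from the rule: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
lower = q1 - 1.5 * iqr_value
upper = q3 + 1.5 * iqr_value

outliers = values[(values < lower) | (values > upper)]
print(lower, upper)
print(outliers)  # only 95 falls outside the thresholds
```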
- Variance
Variance is the average of the squared distances between each data point and the mean of the data.
Let's calculate the variance of the scores. We will use np.var( )
print(np.var(df['Scores'],ddof=1))
Passing ddof=1 gives the sample variance, which divides by n - 1. If it is left out, np.var( ) uses ddof=0 and returns the population variance, which divides by n.
Let's see that below.
print(np.var(df['Scores']))
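To make the role of ddof concrete, the sketch below computes both variances by hand and compares them with np.var( ). The data is made up for illustration and is not from the Kaggle dataset.

```python
import numpy as np

# Made-up values for illustration only
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = data.mean()
squared_dists = (data - mean) ** 2

# Population variance: divide by n (what np.var does when ddof=0)
population_var = squared_dists.sum() / len(data)
# Sample variance: divide by n - 1 (what np.var does when ddof=1)
sample_var = squared_dists.sum() / (len(data) - 1)

print(population_var, np.var(data))
print(sample_var, np.var(data, ddof=1))
```

The sample variance is always the larger of the two, and the gap shrinks as the number of data points grows.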
- Standard deviation
This is the square root of the variance.
Let's get the standard deviation of the scores
print(np.sqrt(np.var(df['Scores'],ddof=1)))
Another way we can get the standard deviation is by using np.std( )
Let's use that
print(np.std(df['Scores'],ddof=1))
- Mean Absolute Deviation
This is the average of the absolute distances between each data point and the mean of the data.
Let's find the mean absolute deviation of the scores
# first find the distance between the data points and the mean
dists = df['Scores'] - np.mean(df['Scores'])
# find the mean of the absolute distances
print(np.mean(np.abs(dists)))
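As a small sketch of how the mean absolute deviation compares with the standard deviation, the example below computes both on made-up data. The function name mean_absolute_deviation is hypothetical and is not a NumPy function.

```python
import numpy as np

def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean.

    Hypothetical helper written for this illustration.
    """
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values - values.mean()))

# Made-up values for illustration only
data = [2, 4, 4, 4, 5, 5, 7, 9]

mad = mean_absolute_deviation(data)
population_std = np.std(data)

print(mad)             # 1.5
print(population_std)  # 2.0
```

Squaring gives large distances more weight, which is why the standard deviation here is larger than the mean absolute deviation.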
describe( ) method
The pandas describe( ) method computes summary statistics for a dataframe or a single column. By default it summarizes only the numeric columns of a dataframe.
We can use it to get several of the measures mentioned above at once: the quartiles and the sample standard deviation, along with the count, mean, minimum and maximum.
df['Scores'].describe()
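The sketch below checks how the describe( ) output lines up with the NumPy calls used earlier. It uses a made-up Series rather than the Kaggle data, so the numbers themselves are only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up scores for illustration only
scores = pd.Series([21, 47, 27, 75, 30, 20, 88, 60])

summary = scores.describe()
print(summary)

# The 'std' entry is the sample standard deviation (ddof=1)
std_matches = np.isclose(summary['std'], np.std(scores.to_numpy(), ddof=1))
# The '25%' and '75%' entries are the 1st and 3rd quartiles
q1_matches = np.isclose(summary['25%'], np.quantile(scores.to_numpy(), 0.25))
q3_matches = np.isclose(summary['75%'], np.quantile(scores.to_numpy(), 0.75))

print(std_matches, q1_matches, q3_matches)
```

The IQR can then be read off the summary as summary['75%'] - summary['25%'].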