In this notebook, we will go through PySpark's distinct functionality. For this exercise, I will be using the following data from Kaggle...
https://www.kaggle.com/code/kirichenko17roman/recommender-systems/data
If you don't have PySpark installed, install PySpark on Linux first.
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Purchase") \
    .config('spark.ui.showConsoleProgress', False) \
    .getOrCreate()
```

Let us look at the data first.
```python
df = spark.read.csv("/home/notebooks/kz.csv", header=True, sep=",")

# show 3 rows of our DataFrame
df.show(3)
df.columns
```

This is transaction data.
Let us check how many rows are in our data.
```python
df.count()
```

To count the distinct rows, we can use the distinct() method on the PySpark DataFrame.
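As a mental model, distinct() compares entire rows: two rows are duplicates only if they match in every column. Here is a plain-Python analogy (not PySpark; the column values are made up for illustration):

```python
# Each tuple stands for one row of the DataFrame.
rows = [
    ("electronics.smartphone", "samsung", 199.0),
    ("electronics.smartphone", "samsung", 199.0),  # exact duplicate row
    ("electronics.smartphone", "apple", 999.0),
]

# A set of tuples keeps only one copy of each identical row,
# which is what distinct() does across all columns.
distinct_rows = set(rows)
print(len(rows))           # 3 total rows
print(len(distinct_rows))  # 2 distinct rows
```

Rows that differ in any single column are kept as separate distinct rows.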
```python
df.distinct().count()
```

```python
from pyspark.sql.functions import countDistinct
```

countDistinct can be passed to a PySpark aggregate function. In the snippet below, we count the number of unique brands.
```python
df.agg(countDistinct('brand').alias('cnt')).collect()[0].cnt
```

We can apply the above command to multiple columns, as shown below.
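Unlike distinct() on the whole DataFrame, countDistinct looks at a single column and ignores how the other columns vary. A plain-Python analogy (not PySpark; sample values are invented):

```python
# Each dict stands for one row; we only care about the 'brand' column.
rows = [
    {"category_code": "electronics.smartphone", "brand": "samsung"},
    {"category_code": "electronics.smartphone", "brand": "apple"},
    {"category_code": "appliances.kitchen", "brand": "samsung"},
]

# countDistinct('brand') ~ the number of unique values in that one column.
unique_brands = {r["brand"] for r in rows}
print(len(unique_brands))  # 2 (samsung, apple)
```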
```python
from pyspark.sql.functions import col

items = df.agg(*(countDistinct(col(c)).alias(c) for c in ['category_code', 'brand'])).collect()[0]
print('category_code\tbrand\n')
print('%s\t\t%s\n' % (items.category_code, items.brand))
```

We can also use groupby, agg and countDistinct together. Let us say we want to calculate the average price for each brand and also find out how many categories there are per brand.
```python
from pyspark.sql import functions as F

df.groupby('brand').agg(F.avg('price'), F.countDistinct('category_code')).show(5)
```

It looks like there are a lot of rows in the data with no price. Let us re-run the command above without the null rows.
```python
df.dropna().groupby('brand').agg(F.avg('price'), F.countDistinct('category_code')).show(5)
```

We can also perform distinct using the select method.
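To see why missing values matter for an aggregate like an average, here is a plain-Python sketch (not PySpark) of filtering out None values before aggregating, analogous to dropping null rows first:

```python
# Some prices are missing (None), as in the transaction data above.
prices = [100.0, None, 300.0, None]

# Missing values cannot contribute to an average, so filter them out
# first -- roughly what dropping null rows achieves before aggregating.
valid = [p for p in prices if p is not None]
avg = sum(valid) / len(valid)
print(avg)  # 200.0
```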
```python
df.select('brand').distinct().count()
df.select('category_code').distinct().count()
```

We can repeat the above command on multiple columns.
```python
df.select('category_code', 'brand').distinct().count()
```
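Note that selecting two columns before distinct() counts unique pairs of values, which can be larger than the distinct count of either column alone. A plain-Python analogy (not PySpark; sample values are invented):

```python
# Each tuple is a (category_code, brand) pair from one row.
rows = [
    ("electronics.smartphone", "samsung"),
    ("electronics.smartphone", "apple"),
    ("appliances.kitchen", "samsung"),
]

print(len({c for c, b in rows}))  # 2 distinct category codes
print(len({b for c, b in rows}))  # 2 distinct brands
print(len(set(rows)))             # 3 distinct (category, brand) pairs
```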