In this notebook, we will go through PySpark's distinct functionality. For this exercise, I will be using the following data from Kaggle...
https://www.kaggle.com/code/kirichenko17roman/recommender-systems/data
If you don't have PySpark installed, install PySpark on Linux first.
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Purchase") \
    .config('spark.ui.showConsoleProgress', False) \
    .getOrCreate()
```

Let us look at the data first.
```python
df = spark.read.csv("/home/notebooks/kz.csv", header=True, sep=",")

# show 3 rows of our DataFrame
df.show(3)
df.columns
```

This is transaction data.
Let us check how many rows are in our data.
```python
df.count()
```

To count the distinct rows, we can use the distinct() method on the PySpark DataFrame.
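As a mental model, distinct() compares entire rows: two rows are duplicates only if they match in every column. Here is a plain-Python analogy (not PySpark; the column values are made up for illustration):

```python
# Each tuple stands for one row of the DataFrame.
rows = [
    ("electronics.smartphone", "samsung", 199.0),
    ("electronics.smartphone", "samsung", 199.0),  # exact duplicate row
    ("electronics.smartphone", "apple", 999.0),
]

# A set of tuples keeps only one copy of each identical row,
# which is what distinct() does across all columns.
distinct_rows = set(rows)
print(len(rows))           # 3 total rows
print(len(distinct_rows))  # 2 distinct rows
```

Rows that differ in any single column are kept as separate distinct rows.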
```python
df.distinct().count()
```

```python
from pyspark.sql.functions import countDistinct
```

countDistinct can be passed to a PySpark aggregate function. In the snippet below, we count the number of unique brands.
```python
df.agg(countDistinct('brand').alias('cnt')).collect()[0].cnt
```

We can apply the above command to multiple columns, as shown below.
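Unlike distinct() on the whole DataFrame, countDistinct looks at a single column and ignores how the other columns vary. A plain-Python analogy (not PySpark; sample values are invented):

```python
# Each dict stands for one row; we only care about the 'brand' column.
rows = [
    {"category_code": "electronics.smartphone", "brand": "samsung"},
    {"category_code": "electronics.smartphone", "brand": "apple"},
    {"category_code": "appliances.kitchen", "brand": "samsung"},
]

# countDistinct('brand') ~ the number of unique values in that one column.
unique_brands = {r["brand"] for r in rows}
print(len(unique_brands))  # 2 (samsung, apple)
```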
```python
from pyspark.sql.functions import col

items = df.agg(*(countDistinct(col(c)).alias(c) for c in ['category_code', 'brand'])).collect()[0]
print('category_code\tbrand\n')
print('%s\t\t%s\n' % (items.category_code, items.brand))
```

We can also use groupby, agg and countDistinct together. Let us say we want to calculate the average price for each brand and also find out how many categories there are per brand.
```python
from pyspark.sql import functions as F

df.groupby('brand').agg(F.avg('price'), F.countDistinct('category_code')).show(5)
```

It looks like there are a lot of rows in the data with no price. Let us re-run the command above without the null rows.
```python
df.dropna().groupby('brand').agg(F.avg('price'), F.countDistinct('category_code')).show(5)
```

We can also perform distinct using the select method.
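To see why missing values matter for an aggregate like an average, here is a plain-Python sketch (not PySpark) of filtering out None values before aggregating, analogous to dropping null rows first:

```python
# Some prices are missing (None), as in the transaction data above.
prices = [100.0, None, 300.0, None]

# Missing values cannot contribute to an average, so filter them out
# first -- roughly what dropping null rows achieves before aggregating.
valid = [p for p in prices if p is not None]
avg = sum(valid) / len(valid)
print(avg)  # 200.0
```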
```python
df.select('brand').distinct().count()
df.select('category_code').distinct().count()
```

We can repeat the above command on multiple columns.
```python
df.select('category_code', 'brand').distinct().count()
```
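Note that selecting two columns before distinct() counts unique pairs of values, which can be larger than the distinct count of either column alone. A plain-Python analogy (not PySpark; sample values are invented):

```python
# Each tuple is a (category_code, brand) pair from one row.
rows = [
    ("electronics.smartphone", "samsung"),
    ("electronics.smartphone", "apple"),
    ("appliances.kitchen", "samsung"),
]

print(len({c for c, b in rows}))  # 2 distinct category codes
print(len({b for c, b in rows}))  # 2 distinct brands
print(len(set(rows)))             # 3 distinct (category, brand) pairs
```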