The expr(str) function takes a SQL-like expression string and evaluates it, returning a PySpark Column. This is useful for executing statements that are not available through the Column type and functional APIs. With expr(), we can refer to PySpark column names directly inside the expression, as shown in the examples below.
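As a quick illustration, here is a minimal sketch showing that expr() produces an ordinary Column (the column name in the expression is only resolved when the Column is applied to a DataFrame; the 'expr-demo' app name is just illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# expr() needs an active SparkSession/SparkContext to parse the expression
spark = SparkSession.builder.appName('expr-demo').getOrCreate()

# the parsed expression is an ordinary Column; 'Salary' is resolved only
# when the Column is used against a DataFrame that has such a column
c = expr("Salary * 2")
print(type(c))  # <class 'pyspark.sql.column.Column'>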
In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr
In [3]:
# initializing spark session instance
spark = SparkSession.builder.appName('snippets').getOrCreate()
In [4]:
columns=["Name","Salary","Age","Classify"]data=[("Sam",1000,20,0),("Alex",120000,40,0),("Peter",5000,30,0)]
Let us convert our data to an RDD. To learn more about PySpark RDDs, check out the following link:
How To Analyze Data Using Pyspark RDD
In [5]:
# converting data to an rdd
rdd = spark.sparkContext.parallelize(data)
In [6]:
# Then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
In [7]:
# visualizing current data before manipulation
dfFromRDD2.show()
1) Here we change the "Classify" column conditionally using a CASE expression (rather than the built-in pyspark.sql.functions 'when' API, which can achieve the same result; a when-based sketch follows the output below):
If Salary is less than 5000, the column value is changed to 1
If Salary is less than 10000, it is changed to 2
Otherwise, it is changed to 3
In [8]:
# here we update the column "Classify" using the CASE expression.
# The conditions are based on the values in the Salary column
modified_dfFromRDD2 = dfFromRDD2.withColumn("Classify",
                                            expr("CASE WHEN Salary < 5000 THEN 1 " +
                                                 "WHEN Salary < 10000 THEN 2 " +
                                                 "ELSE 3 END"))
In [9]:
# visualizing the modified dataframe
modified_dfFromRDD2.show()
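For comparison, here is a minimal sketch of the same logic written with the functional 'when' API mentioned above (the variable name when_dfFromRDD2 is just illustrative):

from pyspark.sql.functions import when

# equivalent of the CASE expression using when/otherwise;
# the conditions are evaluated in order, like CASE's WHEN branches
when_dfFromRDD2 = dfFromRDD2.withColumn(
    "Classify",
    when(col("Salary") < 5000, 1)
    .when(col("Salary") < 10000, 2)
    .otherwise(3))
when_dfFromRDD2.show()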
In [45]:
# here we again update the column "Classify" with a CASE expression, this time
# inside select(); the conditions are based on the values in the Salary column
modified_dfFromRDD2 = dfFromRDD2.select("Name", "Salary", "Age",
                                        expr("CASE WHEN Salary < 5000 THEN 1 " +
                                             "WHEN Salary < 10000 THEN 2 " +
                                             "ELSE 3 END as Classify"))
In [46]:
# visualizing the modified dataframe; the resulting column is aliased via 'as Classify'.
# As you can see, the output is exactly the same as the previous one. You can see the
# auto-generated column name by removing the 'as Classify'
modified_dfFromRDD2.show()
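The SQL 'as' inside expr() plays the same role as the Column.alias() method. A minimal sketch of the alias-based spelling (aliased_df is just an illustrative name):

# same result, aliasing with .alias() instead of SQL 'as'
aliased_df = dfFromRDD2.select(
    "Name", "Salary", "Age",
    expr("CASE WHEN Salary < 5000 THEN 1 " +
         "WHEN Salary < 10000 THEN 2 " +
         "ELSE 3 END").alias("Classify"))
aliased_df.show()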
In [10]:
# arithmetic inside expr(): adding 500 to the Salary column to build a new column
modified_dfFromRDD3 = dfFromRDD2.withColumn("New_Salary", expr("Salary + 500"))
In [11]:
modified_dfFromRDD3.show()
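Since expr() simply returns a Column, the same arithmetic can also be written with the col() function imported earlier. A minimal sketch (modified_dfFromRDD3b is just an illustrative name):

# equivalent column arithmetic without expr()
modified_dfFromRDD3b = dfFromRDD2.withColumn("New_Salary", col("Salary") + 500)
modified_dfFromRDD3b.show()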
In [12]:
# Here we use the SQL function 'concat' to concatenate the values in two columns,
# i.e. Name and Salary, along with a constant string '_'
modified_dfFromRDD4 = dfFromRDD2.withColumn("Name_Salary", expr("concat(Name, '_', Salary)"))
In [13]:
# visualizing the resulting dataframe
modified_dfFromRDD4.show()
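The same concatenation can be done with the functional concat() and lit() helpers. A minimal sketch (func_df is just an illustrative name; Salary is cast to string explicitly, since the functional concat() does not always apply the implicit casts that the SQL expression does):

from pyspark.sql.functions import concat, lit

# functional-API equivalent of expr("concat(Name, '_', Salary)")
func_df = dfFromRDD2.withColumn(
    "Name_Salary",
    concat(col("Name"), lit("_"), col("Salary").cast("string")))
func_df.show()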
In [14]:
spark.stop()