
Fabio Zadrozny: PyDev debugger: Going from async to sync to async... oh, wait.


In Python asyncio land it's always a bit of a hassle when you have existing code which runs in sync mode and needs to be retrofitted to run async, but it's usually doable -- in many cases, slapping async on top of a bunch of definitions and adding await statements where needed does the trick -- even though it's not always that easy.

Now, unfortunately a debugger has no such option. You see, a debugger works at the boundary of callbacks which are called synchronously from Python (i.e.: it will usually do a busy wait on a line event from a callback registered with sys.settrace, which is always invoked as a sync call).
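A minimal sketch of that sync boundary (illustrative only, not PyDev's actual code -- a real debugger would block inside the callback waiting for user commands instead of recording events):

```python
import sys

events = []

def trace(frame, event, arg):
    # Invoked synchronously by the interpreter for each event;
    # a debugger would busy-wait here while the frame is paused.
    if event == "line":
        events.append((frame.f_code.co_name, frame.f_lineno))
    return trace  # keep tracing inside this frame

def demo():
    x = 1
    return x + 1

sys.settrace(trace)
result = demo()
sys.settrace(None)
print(events)  # line events recorded while demo() ran
```

There is no way to `await` from inside `trace` -- it is a plain function called by the interpreter, which is exactly the problem described above.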

Still, users want to evaluate expressions in the breakpoint context which would await... What now? The classic answer to the question of how to go from async to sync is that it's not possible.

This happens because running something asynchronously requires an asyncio loop, but alas, the current loop is paused at the breakpoint, and due to how asyncio is implemented in Python the loop is not reentrant, so we can't simply ask the loop to keep processing from a given point. Note that not all loops are equal -- this is mostly an implementation detail of CPython -- but short of monkey-patching many things to make the loop reentrant, this would be a no-no. Also, even if it were possible, asyncio doesn't allow forcing a given coroutine to execute: we can only schedule it, and asyncio decides when it'll run afterwards.

My initial naive attempt was just creating a new event loop, but again, CPython gets in the way: two event loops can't coexist in the same thread. Then I thought about recreating the asyncio loop and got a bit further (up to being able to evaluate an asyncio.sleep coroutine), but after checking asyncio's AbstractEventLoop it became clear that the API is just too big to reimplement safely (it's not just about implementing the loop itself, it's also about implementing network I/O such as getnameinfo, create_connection, etc.).

In the end, the solution implemented for the debugger is this: to support await constructs in evaluations, a new thread is created with a new event loop, and that event loop in the new thread executes the coroutine (with the context of the paused frame passed to that thread for the evaluation).
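The idea can be sketched roughly like this (a simplified illustration of the approach, not the actual debugger code):

```python
import asyncio
import threading

def evaluate_in_new_loop(coro):
    """Run a coroutine on a fresh event loop in a dedicated thread,
    blocking the caller (e.g. the paused trace callback) until done."""
    result = {}

    def runner():
        # A brand-new loop, independent of the (paused) application loop.
        loop = asyncio.new_event_loop()
        try:
            result["value"] = loop.run_until_complete(coro)
        finally:
            loop.close()

    t = threading.Thread(target=runner)
    t.start()
    t.join()  # synchronous wait, matching the sync nature of the trace callback
    return result["value"]

async def sample():
    await asyncio.sleep(0.01)
    return 42

print(evaluate_in_new_loop(sample()))  # 42
```

The `t.join()` is what bridges async back to sync: the paused thread simply blocks until the helper thread's loop finishes running the coroutine.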

This is not perfect and there are some cons. For instance, evaluating the code in another thread means some evaluations may not work: frameworks such as Qt treat the UI thread as special, checks for the current thread won't match the paused thread, and probably a bunch of other things. Still, I guess it's a reasonable tradeoff versus not having it at all, as it should work in the majority of cases.

Keep an eye open for the next release as it'll be possible to await coroutines in the debugger evaluation and watches ;)

p.s.: For VSCode users this will also be available in debugpy.


Codementor: #01 | Machine Learning with the Linear Regression

Dive into the essence of Machine Learning by developing several Regression models with a practical use case in Python to predict accidents in the USA.

PyCharm: The Second Release Candidate for PyCharm 2022.2.1 Is Available!


This is a new update for the upcoming minor bug-fix release for 2022.2. Last week, in the first release candidate for 2022.2.1, we delivered some critical fixes so that the new functionality of PyCharm 2022.2 can be used without issues with remote interpreters.

If you encounter an issue in PyCharm 2022.2, please reach out to our support team. This will help us quickly investigate the major issues that are affecting your daily work and solve them.

You can get the new build from our page, via the free Toolbox App, or by using snaps for Ubuntu. 

This week we’re delivering a second release candidate for PyCharm 2022.2.1 with the following bug fixes:

  • Docker: Docker container settings for the Docker-based interpreter are now applied to the run. [PY-53116], [PY-53638]
  • Docker Compose: running Django with a Docker compose interpreter doesn’t lead to an HTTP error. [PY-55394]
  • The new UI is enabled for setting up an interpreter via the Show all popup menu in the Python Interpreter popup window. [PY-53057]

We’re working on fixes for the following recent regressions with local and remote interpreters – stay tuned:

  • Custom interpreter paths aren’t supported in the remote interpreters. [PY-52925]
  • Django: Using the Docker-compose interpreter leads to an error when trying to open the manage.py console. [PY-52610]
  • Docker: An exposed port doesn’t work while debugging Docker. [PY-55294]
  • Docker Compose: PyCharm continues the interpreter setup process even if Docker introspection fails during the process. [PY-55392]
  • SSH: Setting up an SSH interpreter leads to infinite reload of the popup window for Jupyter server settings. [PY-55451]
  • Django: The “Run browser” feature that enables running the application in the default browser doesn’t work. [PY-55462]

If you encounter any bugs or have feedback to share, please submit it to our issue tracker, via Twitter, or in the comments section of this blog post.

ABlog for Sphinx: ABlog v0.10.30 released

Hynek Schlawack: pip-tools Supports pyproject.toml


pip-tools is ready for modern packaging.

Zato Blog: Understanding API rate-limiting techniques


Enabling rate-limiting in Zato means that access to Zato-based APIs can be throttled per endpoint, user or service - including options to make limits apply to specific IP addresses only - and if limits are exceeded within a selected period of time, the invocation will fail. Let’s check how to use it all.

Where and when limits apply

Rate-limiting aware objects in Zato

API rate limiting works on several levels and the configuration is always checked in the order below, which goes from the narrowest, most specific parts of the system (endpoints), through users, which may apply to multiple endpoints, up to services, which in turn may be used by multiple endpoints and users.

  • First, per-endpoint limits
  • Then, per-user limits
  • Finally, per-service limits

When a request arrives through an endpoint, that endpoint’s rate limiting configuration is checked. If the limit is already reached for the IP address or network of the calling application, the request is rejected.

Next, if there is any user associated with the endpoint, that account’s rate limits are checked in the same manner and, similarly, if they are reached, the request is rejected.

Finally, if the endpoint’s underlying service is configured to do so, it also checks if its invocation limits are not exceeded, rejecting the message accordingly if they are.
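The three-level check order described above can be illustrated with a short Python sketch (the class and method names here are illustrative only, not Zato's actual implementation):

```python
class Limited:
    """Toy rate-limited object: allows up to max_requests, or unlimited if None."""

    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.count = 0
        self.has_limits = max_requests is not None

    def limit_reached(self, request):
        if not self.has_limits:
            return False
        self.count += 1
        return self.count > self.max_requests

def is_allowed(endpoint, user, service, request):
    # 1) Narrowest scope first: the endpoint's own limits (incl. per-IP rules)
    if endpoint.limit_reached(request):
        return False
    # 2) Then the user associated with the endpoint, if any
    if user is not None and user.limit_reached(request):
        return False
    # 3) Finally, the underlying service, if it has limits configured
    if service.has_limits and service.limit_reached(request):
        return False
    return True

endpoint = Limited(2)   # at most 2 requests through this endpoint
user = Limited(None)    # this user has no limits of its own
service = Limited(10)
print([is_allowed(endpoint, user, service, {}) for _ in range(3)])
# [True, True, False] - the third request trips the endpoint limit
```

Note how the endpoint check short-circuits the rest: once the narrowest limit rejects a request, neither the user nor the service counters are touched.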

Note that the three levels are distinct yet they overlap in what they allow one to achieve.

For instance, it is possible to have the same user credentials be used in multiple endpoints and express ideas such as “Allow this and that user to invoke my APIs 1,000 requests/day but limit each endpoint to at most 5 requests/minute no matter which user”.

Moreover, because limits can be set on services, it is possible to make it even more flexible, e.g. “Let this service be invoked at most 10,000 requests/hour, no matter which user, with particular users being able to invoke it at most 500 requests/minute, no matter which service, topped off with separate limits for REST vs. SOAP vs. JSON-RPC endpoints, depending on which application invokes them”. That lets one conveniently express advanced scenarios that often occur in practice.

Also, observe that API rate limiting applies to REST, SOAP and JSON-RPC endpoints only; it is not used with other endpoint types, such as AMQP, IBM MQ, SAP, the task scheduler or other technologies. However, per-service limits work no matter which endpoint the service is invoked through, so they also apply with endpoints such as WebSockets, ZeroMQ or any other.

Lastly, limits pertain to incoming requests only - any outgoing ones, from Zato to external resources, are not covered.

Per-IP restrictions

The architecture is made even more versatile thanks to the fact that for each object - endpoint, user or service - different limits can be configured depending on the caller’s IP address.

This adds yet another dimension and allows one to express ideas commonly seen in API-based projects, such as:

  • External applications, depending on their IP addresses, can have their own limits
  • Internal users, e.g. employees of the company using a VPN, may have higher limits if their addresses are in the 172.x.x.x range
  • For performance testing purposes, access to Zato from a few selected hosts may have no limits at all

IP-based limits work hand in hand with, and are an integral part of, the mechanism - they do not rule out per-endpoint, per-user or per-service limits. In fact, for each such object, multiple IP-based limits can be set independently, allowing for the highest degree of flexibility.

Exact or approximate

Rate limits come in two types:

  • Exact
  • Approximate

Exact rate limits are just that, exact - they ensure that a limit is not exceeded at all, not even by a single request.

Approximate limits may let a very small number of requests exceed the limit, with the benefit that they are faster to check than exact ones.

When to use which type depends on a particular project:

  • In some projects, it does not really matter if callers have a limit of 1,000 requests/minute or 1,005 requests/minute because the difference is too tiny to make a business impact. Approximate limits work best in this case.

  • In other projects, there may be requirements that the limit never be exceeded no matter the circumstances. Use exact limits here.
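As a generic illustration of what an exact limiter guarantees (this is a toy sketch, not Zato's implementation), a fixed-window counter never admits more than the configured number of requests per window:

```python
import time

class ExactFixedWindowLimiter:
    """Exact limiter: never allows more than `limit` requests per window."""

    def __init__(self, limit, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window)
        if window != self.current_window:
            # A new window has started: reset the counter exactly at the boundary
            self.current_window = window
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True

limiter = ExactFixedWindowLimiter(limit=2, window_seconds=60)
print([limiter.allow(now=0), limiter.allow(now=1), limiter.allow(now=2)])
# [True, True, False]
```

An approximate limiter would relax the per-request bookkeeping (e.g. by batching counter updates) to answer faster, at the cost of occasionally letting a few extra requests through.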

Python code and web-admin

Alright, let’s check how to define the limits in Zato web-admin. We will use the sample service below:

# -*- coding: utf-8 -*-

# Zato
from zato.server.service import Service

class Sample(Service):
    name = 'api.sample'

    def handle(self):
        # Return a simple string in the response
        self.response.payload = 'Hello there!\n'

Now, in web-admin, we will configure the limits - separately for the service, a new user and a new REST API channel (endpoint).

Configuring rate limits for service
Configuring rate limits for user
Configuring rate limits for user

Points of interest:

  • Configuration for each type of object is independent - within the same invocation some limits may be exact, some may be approximate
  • There can be multiple configuration entries for each object
  • A unit of time is “m”, “h” or “d”, depending on whether the limit is per minute, hour or day, respectively
  • All limits within the same configuration are checked in the order of their definition which is why the most generic ones should be listed first

Testing it out

Now, all that is left is to invoke the service with curl.

As long as limits are not reached, a business response is returned:

$ curl http://my.user:password@localhost:11223/api/sample
Hello there!
$

But if a limit is reached, the caller receives an error message with the 429 HTTP status:

$ curl -v http://my.user:password@localhost:11223/api/sample
*   Trying 127.0.0.1...

...

 HTTP/1.1 429 Too Many Requests
 Server: Zato
 X-Zato-CID: b8053d68612d626d338b02

...

{"zato_env":{"result":"ZATO_ERROR","cid":"b8053d68612d626d338b02eb",
 "details":"Error 429 Too Many Requests"}}
$

Note that the caller never knows what the limit was - that information is saved in Zato server logs along with other details, so that API authors can correlate what callers see with the exact rate limiting definition that prevented them from accessing the service.

zato.common.rate_limiting.common.RateLimitReached: Max. rate limit of 100/m reached;
from:`10.74.199.53`, network:`*`; last_from:`127.0.0.1`;
last_request_time_utc:`2020-11-22T15:30:41.943794`;
last_cid:`5f4f1ef65490a23e5c37eda1`; (cid:b8053d68612d626d338b02)

And this is it - we have created a new API rate limiting definition in Zato and tested it out successfully!

Real Python: The Real Python Podcast – Episode #121: Moving NLP Forward With Transformer Models and Attention


What's the big breakthrough for Natural Language Processing (NLP) that has dramatically advanced machine learning into deep learning? What makes these transformer models unique, and what defines "attention?" This week on the show, Jodie Burchell, developer advocate for data science at JetBrains, continues our talk about how machine learning (ML) models understand and generate text.


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

PyCharm: Webinar: 10 Pro Git Tips in PyCharm


Join us Tuesday, August 23, 2022, 6:00 – 7:00 pm CEST (check other time zones) for our free live webinar, 10 Pro Git Tips in PyCharm.

Save your spot

Have you ever worked on a Git repo in PyCharm and wondered, “Am I doing it right?” JetBrains Developer Advocate Marco Behler has a few pointers for what Git workflows you can use and how to manage everything from PyCharm.

Join him as he guides Paul Everitt through development workflows without screwing up his repository. It will be a joy for all, and the last two tips will come from the PyCharm community – so send your suggestions to our Twitter.

Join us for this live interactive webinar on August 23, 2022, which will feature a Q&A session after the live demo.


Python for Beginners: Read File Line by Line in Python


File operations are crucial in various tasks. In this article, we will discuss how to read a file line by line in Python.

Read File Using the readline() Method

Python provides us with the readline() method to read a file. To read the file, we will first open it using the open() function in read mode. The open() function takes the file name as its first input argument and the literal “r” as its second input argument to denote that the file is opened in read mode. After execution, it returns a file object.

After getting the file object, we can use the readline() method to read the file. The readline() method, when invoked on a file object, returns the current unread line in the file and moves the iterator to the next line.

To read the file line by line, we will read each line using the readline() method and print it in a while loop. Once the readline() method reaches the end of the file, it returns an empty string. Hence, in the while loop, we will also check whether the content read from the file is an empty string; if so, we will break out of the while loop.

The Python program to read the file using the readline() method is as follows.

myFile = open('sample.txt', 'r')
print("The content of the file is:")
while True:
    text = myFile.readline()
    if text == "":
        break
    print(text, end="")
myFile.close()

Output:

The content of the file is:
I am a sample text file.
I was created by Aditya.
You are reading me at Pythonforbeginners.com.


Read File Line by Line in Python Using the readlines() Method

Instead of the readline() method, we can use the readlines() method to read a file in Python. The readlines() method, when invoked on a file object, returns a list of strings, where each element in the list is a line from the file.

After opening the file, we can use the readlines() method to get a list of all the lines in the file. After that, we can use a for loop to print all the lines in the file one by one as follows.

myFile = open('sample.txt', 'r')
print("The content of the file is:")
lines = myFile.readlines()
for text in lines:
    print(text, end="")
myFile.close()

Output:

The content of the file is:
I am a sample text file.
I was created by Aditya.
You are reading me at Pythonforbeginners.com.

Conclusion

In this article, we have discussed two ways to read a file line by line in Python. To learn more about programming in Python, you can read this article on list comprehension in Python. You might also like this article on dictionary comprehension in Python.

The post Read File Line by Line in Python appeared first on PythonForBeginners.com.

John Ludhi/nbshare.io: Pyspark Expr Example


PySpark expr()

The expr(str) function takes in and executes a SQL-like expression. It returns a PySpark Column data type. This is useful for executing statements that are not available through the Column type and functional APIs. Using expr(), we can refer to PySpark column names inside the expressions, as shown in the examples below.

First, we load the required libraries.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, expr)
In [3]:
# initializing spark session instance
spark = SparkSession.builder.appName('snippets').getOrCreate()

Then load our initial records

In [4]:
columns = ["Name", "Salary", "Age", "Classify"]
data = [("Sam", 1000, 20, 0), ("Alex", 120000, 40, 0), ("Peter", 5000, 30, 0)]

Let us convert our data to RDDs. To learn more about PySpark RDDs, check out the following link:
How To Analyze Data Using Pyspark RDD

In [5]:
# converting data to rdds
rdd = spark.sparkContext.parallelize(data)
In [6]:
# Then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
In [7]:
# visualizing current data before manipulation
dfFromRDD2.show()
+-----+------+---+--------+
| Name|Salary|Age|Classify|
+-----+------+---+--------+
|  Sam|  1000| 20|       0|
| Alex|120000| 40|       0|
|Peter|  5000| 30|       0|
+-----+------+---+--------+

1) Here we are changing the "Classify" column based on a condition, using the CASE expression (rather than the built-in pyspark.sql.functions 'when' API, which can also be used to achieve the same result):

If Salary is less than 5000, the column value is changed to 1.

If Salary is less than 10000, the column value is changed to 2.

Else, it is changed to 3.

In [8]:
# here we update the column "Classify" using the CASE expression.
# The conditions are based on the values in the Salary column
modified_dfFromRDD2 = dfFromRDD2.withColumn(
    "Classify",
    expr("CASE WHEN Salary < 5000 THEN 1 "
         "WHEN Salary < 10000 THEN 2 "
         "ELSE 3 END"))
In [9]:
# visualizing the modified dataframe
modified_dfFromRDD2.show()
+-----+------+---+--------+
| Name|Salary|Age|Classify|
+-----+------+---+--------+
|  Sam|  1000| 20|       1|
| Alex|120000| 40|       3|
|Peter|  5000| 30|       2|
+-----+------+---+--------+

2) We can also give a column alias to the SQL expression

In [45]:
# here we update the column "Classify"; the CASE expression conditions
# are based on the values in the Salary column
modified_dfFromRDD2 = dfFromRDD2.select(
    "Name", "Salary", "Age",
    expr("CASE WHEN Salary < 5000 THEN 1 "
         "WHEN Salary < 10000 THEN 2 "
         "ELSE 3 END as Classify"))
In [46]:
# visualizing the modified dataframe, using 'as' to alias the resulting column.
# As you can see, it is exactly the same as the previous output.
# You can also see the original column name by removing the 'as Classify'
modified_dfFromRDD2.show()
+-----+------+---+--------+
| Name|Salary|Age|Classify|
+-----+------+---+--------+
|  Sam|  1000| 20|       1|
| Alex|120000| 40|       3|
|Peter|  5000| 30|       2|
+-----+------+---+--------+

3) We can also use arithmetic operators to perform operations on columns. Below, we add 500 to the Salary column and store the result in a new column called New_Salary:

In [10]:
modified_dfFromRDD3 = dfFromRDD2.withColumn("New_Salary", expr("Salary + 500"))
In [11]:
modified_dfFromRDD3.show()
+-----+------+---+--------+----------+
| Name|Salary|Age|Classify|New_Salary|
+-----+------+---+--------+----------+
|  Sam|  1000| 20|       0|      1500|
| Alex|120000| 40|       0|    120500|
|Peter|  5000| 30|       0|      5500|
+-----+------+---+--------+----------+

We can also use SQL functions with existing column values in expr():

In [12]:
# Here we use the SQL function 'concat' to concatenate the values of
# two columns, i.e. Name and Salary, with a constant string '_'
modified_dfFromRDD4 = dfFromRDD2.withColumn("Name_Salary", expr("concat(Name, '_', Salary)"))
In [13]:
# visualizing the resulting dataframe
modified_dfFromRDD4.show()
+-----+------+---+--------+-----------+
| Name|Salary|Age|Classify|Name_Salary|
+-----+------+---+--------+-----------+
|  Sam|  1000| 20|       0|   Sam_1000|
| Alex|120000| 40|       0|Alex_120000|
|Peter|  5000| 30|       0| Peter_5000|
+-----+------+---+--------+-----------+

In [14]:
spark.stop()

Talk Python to Me: #377: Python Packaging and PyPI in 2022

PyPI has been in the news for a bunch of reasons lately. Many of them good. But also, some with a bit of drama or mixed reactions. On this episode, we have Dustin Ingram, one of the PyPI maintainers and one of the directors of the PSF, here to discuss the whole 2FA story, securing the supply chain, and plenty more related topics. This is another important episode that people deeply committed to the Python space will want to hear.<br/> <br/> <strong>Links from the show</strong><br/> <br/> <div><b>Dustin on Twitter</b>: <a href="https://twitter.com/di_codes" target="_blank" rel="noopener">@di_codes</a><br/> <br/> <b>Hardware key giveaway</b>: <a href="https://pypi.org/security-key-giveaway/" target="_blank" rel="noopener">pypi.org</a><br/> <b>OpenSSF funds PyPI</b>: <a href="https://openssf.org/blog/2022/06/20/openssf-funds-python-and-eclipse-foundations-and-acquires-sos-dev-through-alpha-omega-project/" target="_blank" rel="noopener">openssf.org</a><br/> <b>James Bennet's take</b>: <a href="https://www.b-list.org/weblog/2022/jul/11/pypi/" target="_blank" rel="noopener">b-list.org</a><br/> <b>Atomicwrites (left-pad on PyPI)</b>: <a href="https://old.reddit.com/r/Python/comments/vuh41q/pypi_moves_to_require_2fa_for_critical_projects/" target="_blank" rel="noopener">reddit.com</a><br/> <b>2FA PyPI Dashboard</b>: <a href="https://p.datadoghq.com/sb/7dc8b3250-389f47d638b967dbb8f7edfd4c46acb1" target="_blank" rel="noopener">datadoghq.com</a><br/> <b>github 2FA - all users that contribute code by end of 2023</b>: <a href="https://github.blog/2022-05-04-software-security-starts-with-the-developer-securing-developer-accounts-with-2fa/" target="_blank" rel="noopener">github.blog</a><br/> <b>GPG - not the holy grail</b>: <a href="https://caremad.io/posts/2013/07/packaging-signing-not-holy-grail/" target="_blank" rel="noopener">caremad.io</a><br/> <b>Sigstore for Python</b>: <a href="https://pypi.org/project/sigstore/" target="_blank" rel="noopener">pypi.org</a><br/> 
<b>pip-audit</b>: <a href="https://pypi.org/project/pip-audit/" target="_blank" rel="noopener">pypi.org</a><br/> <b>PEP 691</b>: <a href="https://peps.python.org/pep-0691/" target="_blank" rel="noopener">peps.python.org</a><br/> <b>PEP 694</b>: <a href="https://peps.python.org/pep-0694/ (in draft)" target="_blank" rel="noopener">peps.python.org</a><br/> <b>Watch this episode on YouTube</b>: <a href="https://www.youtube.com/watch?v=-7zOg1FjTg4" target="_blank" rel="noopener">youtube.com</a><br/> <br/> <b>--- Stay in touch with us ---</b><br/> <b>Subscribe to us on YouTube</b>: <a href="https://talkpython.fm/youtube" target="_blank" rel="noopener">youtube.com</a><br/> <b>Follow Talk Python on Twitter</b>: <a href="https://twitter.com/talkpython" target="_blank" rel="noopener">@talkpython</a><br/> <b>Follow Michael on Twitter</b>: <a href="https://twitter.com/mkennedy" target="_blank" rel="noopener">@mkennedy</a><br/></div><br/> <strong>Sponsors</strong><br/> <a href='https://talkpython.fm/compiler'>RedHat</a><br> <a href='https://talkpython.fm/irl'>IRL Podcast</a><br> <a href='https://talkpython.fm/assemblyai'>AssemblyAI</a><br> <a href='https://talkpython.fm/training'>Talk Python Training</a>

"Paolo Amoroso's Journal": Next Suite8080 features: trim uninitialized data, macro assembler


I decided what to work on next on Suite8080, the suite of Intel 8080 Assembly cross-development tools I'm writing in Python. I'll add two features, the ability for the assembler to trim trailing uninitialized data and a macro assembler script.

Trimming uninitialized data

Consider this 8080 Assembly code, which declares a 1024-byte uninitialized data area at the end of the program:

# . . .

data:        ds    1024
             end

For this ds directive, the Suite8080 assembler asm80 emits a sequence of 1024 null bytes at the end of the binary program. Similarly, dw emits 16-bit words. The executable file is thus longer and may be slower to load on the host system, typically CP/M.

The Digital Research CP/M assemblers, ASM.COM and MAC.COM, strip such trailing uninitialized data from binaries. After asking for feedback on r/asm, I decided to do the same with asm80. I should be able to implement this optimization by adding just one line of Python, so the feature is a low-hanging fruit.
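The one-line optimization could look roughly like this (a hedged sketch; asm80's actual code and variable names will differ):

```python
def trim_uninitialized(binary: bytes) -> bytes:
    """Strip trailing NUL bytes emitted for uninitialized ds/dw data,
    so the executable does not carry the empty data area on disk."""
    return binary.rstrip(b"\x00")

# Code followed by ds 1024: the 1024 trailing nulls are dropped,
# while null bytes embedded inside the code are preserved.
program = b"\x3e\x0a\xc9" + b"\x00" * 1024
print(len(trim_uninitialized(program)))  # 3
```

Since the trailing nulls carry no information (CP/M programs get whatever memory follows them anyway), only the file size changes, not the program's behavior.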

Macro assembler

asm80 can accept source files from standard input, which makes it possible to combine the assembler with an external macro preprocessor to get a macro assembler. Thanks to its ubiquity, M4 is the clear choice for a preprocessor.

Assuming prog.asm is an 8080 Assembly source file containing M4 macros, this shell pipe can assemble it with asm80:

$ cat prog.asm | m4 | asm80 - -o prog.com

The - option accepts input from standard input and -o sets the file name of the output binary program.

The other Suite8080 feature I'm going to implement is a mac80 helper script in Python to wrap such a shell pipe and make assembling macro files more convenient. In other words, syntactic sugar wrapping asm80 and M4.

The script will use the Python subprocess module to set up the pipe, feed the preprocessed source to the assembler, and not much else.
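The core of such a mac80 script could be sketched like this (an illustrative sketch, not the actual implementation; the command parameters are exposed only so the pipeline can be exercised with substitute tools):

```python
import subprocess

def assemble_with_macros(source_path, output_path, m4_cmd="m4", asm_cmd="asm80"):
    """Preprocess source_path with m4, then pipe the result into asm80.

    Equivalent to: cat source_path | m4 | asm80 - -o output_path
    """
    with open(source_path, "rb") as src:
        # Run the macro preprocessor, capturing its expanded output
        preprocessed = subprocess.run(
            [m4_cmd], stdin=src, capture_output=True, check=True
        ).stdout
    # asm80 reads source from stdin when given "-" and writes the binary to -o
    subprocess.run([asm_cmd, "-", "-o", output_path], input=preprocessed, check=True)
```

With `check=True`, a failure in either stage raises `CalledProcessError` instead of silently producing a broken binary.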

#Suite8080#Python

Discuss... | Reply by email...

scikit-learn: scikit-learn Sprint in Salta, Argentina


In September of 2022, the SciPy Latin America conference will take place in Salta, Argentina. As part of the event, we are organizing a scikit-learn sprint for the people attending. The main idea is to introduce the participants to the open source world and help them make their first contribution. The sprint event is in-person.

SciPy logo

Schedule

  • September 27, 2022 - Pre-sprint - 10:00 to 12:00 hs (UTC -3)
  • September 28, 2022 - Sprint - 10:00 to 17:00 hs (UTC -3)

Repository

For more information in Spanish about the Sprint and how to prepare for it, check this repository.

Podcast.__init__: Remove Roadblocks And Let Your Developers Ship Faster With Self-Serve Infrastructure

The goal of every software team is to get their code into production without breaking anything. This requires establishing a repeatable process that doesn't introduce unnecessary roadblocks and friction. In this episode Ronak Rahman discusses the challenges that development teams encounter when trying to build and maintain velocity in their work, the role that access to infrastructure plays in that process, and how to build automation and guardrails for everyone to take part in the delivery process.

Summary

The goal of every software team is to get their code into production without breaking anything. This requires establishing a repeatable process that doesn’t introduce unnecessary roadblocks and friction. In this episode Ronak Rahman discusses the challenges that development teams encounter when trying to build and maintain velocity in their work, the role that access to infrastructure plays in that process, and how to build automation and guardrails for everyone to take part in the delivery process.

Announcements

  • Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
  • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With their managed Kubernetes platform it’s easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. And now you can launch a managed MySQL, Postgres, or Mongo database cluster in minutes to keep your critical data safe with automated backups and failover. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
  • Your host as usual is Tobias Macey and today I’m interviewing Ronak Rahman about how automating the path to production helps to build and maintain development velocity

Interview

  • Introductions
  • How did you get introduced to Python?
  • Can you describe what Quali is and the story behind it?
  • What are the problems that you are trying to solve for software teams?
    • How does Quali help to address those challenges?
  • What are the bad habits that engineers fall into when they experience friction with getting their code into test and production environments?
    • How do those habits contribute to negative feedback loops?
  • What are signs that developers and managers need to watch for that signal the need for investment in developer experience improvements on the path to production?
  • Can you describe what you have built at Quali and how it is implemented?
    • How have the design and goals shifted/evolved from when you first started working on it?
  • What are the positive and negative impacts that you have seen from the evolving set of options for application deployments? (e.g. K8s, containers, VMs, PaaS, FaaS, etc.)
  • Can you describe how Quali fits into the workflow of software teams?
  • Once a team has established patterns for deploying their software, what are some of the disruptions to their flow that they should guard against?
  • What are the most interesting, innovative, or unexpected ways that you have seen Quali used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Quali?
  • When is Quali the wrong choice?
  • What do you have planned for the future of Quali?

Keep In Touch

Picks

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Moshe Zadka: On The Go


Now that travel is more realistic, I have started to optimize how well I can work on the go. I want to be able to carry as few things as possible, and have the best set-up possible.

Charging

Power Bank charging

The "center" of the mobile set-up is my Anker Power Bank. It serves two purposes:

  • It is my wall-plug charger.
  • It is my "mobile power": I can carry around 10,000 mAh of energy.

The charger has two USB-C slots and one USB-A slot.

Compute

M1 MacBook Air with stickers

For "compute", I have three devices:

  • M1 MacBook Air
  • Samsung Galaxy S9+ (I know it's a bit old)
  • FitBit Charge 4

The S9 is old enough that there is no case with a MagSafe compatible back. Instead, I got a MagSafe sticker that goes on the back of the case.

This allowed me to get a MagSafe Pop-Socket base. Sticking a Pop-Socket on top of it lets me hold the phone securely, and avoids it falling on my face at night.

Ear buds

For earbuds, I have the TOZO T10. They come in multiple colors!

The colors are not just an aesthetic choice. They also serve a purpose: I have a black one and a khaki one.

The black one is paired to my phone. The khaki one is paired to my laptop.

I can charge the TOZO cases with either the USB-C cable or the PowerWave charger, whichever is free.

Charging

Phone charging with a wireless MagSafe charger

In order to charge the M1 I have a USB-C "outtie"/USB-C "outtie" 3 foot wire. It's a bit short, but this also means it takes less space. The FitBit Charge comes with its own USB-A custom cable.

For wireless charging, I have the Anker PowerWave. It's MagSafe compatible, and can connect to any USB-C-compatible outlet.

The phone is only charged by the wireless charging. The USB-C input is wonky, and can be incompatible with humid climates.

I connected a Pop Socket to the back of the PowerWave charger. This means that while the phone is charging, I can still hold it securely.

Together, they give me a "wireless charging" battery. The PowerWave connects to the phone, and the Power Bank has plenty of energy to last a good while without being plugged into anything.

I cannot charge all devices at once. But I can charge all devices, and (almost) any three at once.

Hub

USB-C hub

The last device I have is an older version of the Anker 5-in-1 hub. This allows connecting USB Drives and HDMI connectors.

Case

Power Bank charging

All of these things are carried in a Targus TSS912 case. The laptop goes inside the sleeve, while the other things all go in the side pocket.

The side pocket is small, but can fit all of the things above. Because of its size, it does get crowded. In order to find things easily, I keep all of these things in separate sub-pockets.

I keep the Power Bank, the MagSafe charger, and the USB-C/USB-C cable in the little pouch that comes with the Power Bank.

The hub and FitBit charging cable go into a ziplock bag. Those things see less use.

The earbud cases go into the pocket as-is. They are easy enough to dig out by rooting around.

I wanted a messenger-style case so that I can carry it while I have a backpack on. Whether I am carrying my work laptop (in the work backpack) or a travel backpack, this is a distinct advantage.

The case is small enough to be slipped inside another backpack. If I am carrying a backpack, and there's enough room, I can consolidate.

Conclusion

I chose this set up for options.

For example, if my phone is low on battery, I can connect the PowerWave to the bank, leave the bank in the side-bag's pocket, and keep using the phone while it is charging, holding it with the PowerWave's pop-socket.

If I am listening to a podcast while walking around, and notice that the ear bud's case is low on battery, I can connect the case to the bank while they are both in the side-bag's pocket.

When sitting down at a coffee shop or an office, I can connect the bank to the wall socket and charge any of my devices while sitting there. As a perk, the bank itself charges while I'm sitting down.


Brett Cannon: MVPy: Minimum Viable Python


Over 29 posts spanning 2 years, this is the final post in my blog series on Python's syntactic sugar. I had set out to find all of the Python 3.8 syntax that could be rewritten if you were to run a tool over a single Python source file in isolation and still end up with reasonably similar semantics (i.e. no whole-program analysis; globals() having different keys was okay). Surprisingly, it turns out to be easier to list what syntax you can't rewrite than re-iterate all the syntax that you can rewrite!

  1. Integers (as the base for other literals like bytes)
  2. Floats (because I didn't want to mess with getting the accuracy wrong)
  3. Function calls
  4. =
  5. :=
  6. Function definitions
  7. global
  8. nonlocal
  9. return
  10. yield
  11. lambda
  12. del
  13. try/except
  14. if
  15. while

All other syntax can devolve to this core set of syntax. I call this subset of syntax the Minimum Viable Python (MVPy) you need to make Python function as a whole. If you can implement this subset of the language, then you can do a syntactic translation to support the rest of Python's syntax (although admittedly it might be a bit faster if you directly implemented all the syntax 😉).

If you look at what syntax is left, it pretty much aligns with what is required to implement a Turing machine:

  1. Read/write data (=, :=, integers and floats)
  2. Make decisions about data (if, while, and try)
  3. Do things to that data (everything involving defining and using functions)

You might not be as productive in this subset of the language as you would be with all the syntax available in Python 3.8 (and later), but you should still be able to accomplish the same things given enough time and patience.
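
To make this concrete, here is an illustrative sketch of my own (not taken from the series) showing how a for loop over a sequence can devolve into the MVPy subset - only while, try/except, =, and function calls remain, with break and the + operator rewritten away:

```python
# Illustrative desugaring of `for item in seq: total += item` into the
# MVPy subset. `for`, `break`, and `+` are all rewritten away; what is
# left uses only `while`, `try`/`except`, `=`, and function calls.

def sum_with_mvpy(seq):
    total = 0
    it = iter(seq)                 # a plain function call
    looping = True
    while looping:                 # `while` is in the MVPy subset
        try:
            item = next(it)        # raises StopIteration when exhausted
            # the `+` operator desugars to a method call
            total = type(total).__add__(total, item)
        except StopIteration:
            looping = False        # `break` rewritten as a loop flag
    return total
```

Calling sum_with_mvpy([1, 2, 3]) behaves like the original for loop and returns 6.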

John Ludhi/nbshare.io: PySpark concat_ws


PySpark concat_ws()

The split(str) function is used to convert a string column into an array of strings using a delimiter. concat_ws() is the opposite of split(): it creates a string column from an array column by concatenating the array elements with the provided delimiter.

The PySpark functions used in this notebook are createOrReplaceTempView(), drop(), and spark.sql().

First we load the important libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, concat_ws, split)
In [3]:
# initializing spark session instance
spark = SparkSession.builder.appName('pyspark concat snippets').getOrCreate()

Then load our initial records

In [4]:
columns = ["Full_Name", "Salary"]
data = [("Sam A Smith", 1000), ("Alex Wesley Jones", 120000), ("Steve Paul Jobs", 5000)]
In [5]:
# converting data to rdd
rdd = spark.sparkContext.parallelize(data)
In [6]:
# then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
In [7]:
# visualizing current data before manipulation
dfFromRDD2.show()
+-----------------+------+
|        Full_Name|Salary|
+-----------------+------+
|      Sam A Smith|  1000|
|Alex Wesley Jones|120000|
|  Steve Paul Jobs|  5000|
+-----------------+------+

1) Here we are splitting the Full_Name column, containing the first name, middle name and last name, and adding a new column called Name_Parts

In [8]:
# here we add a new column called 'Name_Parts' and use space ' ' as the delimiter string
modified_dfFromRDD2 = dfFromRDD2.withColumn("Name_Parts", split(col('Full_Name'), ' '))
In [9]:
# visualizing the modified dataframe
modified_dfFromRDD2.show()
+-----------------+------+--------------------+
|        Full_Name|Salary|          Name_Parts|
+-----------------+------+--------------------+
|      Sam A Smith|  1000|     [Sam, A, Smith]|
|Alex Wesley Jones|120000|[Alex, Wesley, Jo...|
|  Steve Paul Jobs|  5000| [Steve, Paul, Jobs]|
+-----------------+------+--------------------+

2) We can also use a SQL query to split the Full_Name column. For this, we need to use createOrReplaceTempView() to create a temporary view from the DataFrame. This view is accessible as long as the SparkContext is active.

In [10]:
# Below we use the SQL query to select the required columns. This includes the new column we create
# by splitting the Full_Name column.
dfFromRDD2.createOrReplaceTempView("SalaryData")
modified_dfFromRDD3 = spark.sql("select Full_Name, Salary, SPLIT(Full_Name, ' ') as Name_Parts from SalaryData")
In [11]:
# visualizing the modified dataframe after executing the SQL query.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD3.show(truncate=False)
+-----------------+------+---------------------+
|Full_Name        |Salary|Name_Parts           |
+-----------------+------+---------------------+
|Sam A Smith      |1000  |[Sam, A, Smith]      |
|Alex Wesley Jones|120000|[Alex, Wesley, Jones]|
|Steve Paul Jobs  |5000  |[Steve, Paul, Jobs]  |
+-----------------+------+---------------------+

Now we will use the above data frame for the concat_ws() function, but we will drop the Full_Name column first. We will then recreate it using the concatenation operation.

In [12]:
# removing the Full_Name column using the drop function
modified_dfFromRDD4 = modified_dfFromRDD3.drop('Full_Name')
In [13]:
# visualizing the modified data frame
modified_dfFromRDD4.show()
+------+--------------------+
|Salary|          Name_Parts|
+------+--------------------+
|  1000|     [Sam, A, Smith]|
|120000|[Alex, Wesley, Jo...|
|  5000| [Steve, Paul, Jobs]|
+------+--------------------+

1) Here we are concatenating the Name_Parts column, containing the first name, middle name and last name string elements, and adding a new column called Full_Name

In [13]:
# here we add a new column called 'Full_Name' and use space ' ' as the delimiter string to concatenate the Name_Parts
modified_dfFromRDD5 = modified_dfFromRDD4.withColumn("Full_Name", concat_ws(' ', col('Name_Parts')))
In [14]:
# visualizing the modified dataframe.
# The Full_Name column is the same as the one in the original data frame we started with above.
modified_dfFromRDD5.show()
+------+--------------------+-----------------+
|Salary|          Name_Parts|        Full_Name|
+------+--------------------+-----------------+
|  1000|     [Sam, A, Smith]|      Sam A Smith|
|120000|[Alex, Wesley, Jo...|Alex Wesley Jones|
|  5000| [Steve, Paul, Jobs]|  Steve Paul Jobs|
+------+--------------------+-----------------+

2) We can also use a SQL query to concatenate the Name_Parts column, like we did for split() above. For this, we need to use createOrReplaceTempView() to create a temporary view from the DataFrame, as we did before. We will then use that view to execute the concatenation query.

In [14]:
# Below we use the SQL query to select the required columns. This includes the new column we create
# by concatenating the Name_Parts column.
modified_dfFromRDD4.createOrReplaceTempView("SalaryData2")
modified_dfFromRDD6 = spark.sql("select Salary, Name_Parts, CONCAT_WS(' ', Name_Parts) as Full_Name from SalaryData2")
In [15]:
# visualizing the modified dataframe after executing the SQL query.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD6.show(truncate=False)
+------+---------------------+-----------------+
|Salary|Name_Parts           |Full_Name        |
+------+---------------------+-----------------+
|1000  |[Sam, A, Smith]      |Sam A Smith      |
|120000|[Alex, Wesley, Jones]|Alex Wesley Jones|
|5000  |[Steve, Paul, Jobs]  |Steve Paul Jobs  |
+------+---------------------+-----------------+

In [16]:
spark.stop()

Mirek Długosz: The problems with test levels


Test levels in common knowledge

A test pyramid usually distinguishes three levels: unit tests, integration tests and end to end tests; the last level is sometimes called “UI tests” instead. The main idea is that as you move down the pyramid, tests tend to run faster and be more stable, but at the expense of being isolated. Only tests on higher levels are able to detect problems in how building blocks work together.

The ISTQB syllabus presents a similar idea. It distinguishes four test levels: component, integration, system and acceptance. These test levels drive a lot of thought around testing - each level has its own distinct definition and properties, guides responsibility assignment within a team, is aligned with specific test techniques, and may be mapped to a phase in the software development lifecycle. That’s a lot of work!

Both of these categorizations share the idea that a higher level encompasses the level below it, and builds upon it. There’s also a certain synergy effect at play here - tests at a higher level cover something more than all the tests at the levels below. That’s why teams with “100% unit test coverage” still get bug reports from actual customers. As far as I can tell, these two properties - hierarchy and synergy - are shared by all test level categorizations.

The problems

I have some problems with this common understanding. In my experience, while test levels look easy and simple, it’s unclear how to apply them in practice. If you give the same set of tests to two testers, they are likely to group them into test levels in very different ways. Inconsistencies like that beg the question: are test levels actually a useful categorization tool?

I know, because I have faced these issues when we tried to standardize test metadata in Red Hat Satellite.

One of the things provided by Satellite is host management. You can create, start, stop, restart or destroy a host. If you have tests exercising these capabilities, you could file them under the component level, because host management is one of the components of the Satellite system.

Satellite also provides content management. You can synchronize packages from the Red Hat CDN to your Satellite server and tell your hosts to use it exclusively. This gives you the ability to specify what content is available, e.g. you can offer a specific version of PostgreSQL until all the apps are tested against a newer version. This also allows for faster updates, because all the data is already in your data center and you can use a fast local connection to fetch it. Tests exercising various content management features can be filed under the component level, because content management is one of the components of the Satellite system.

You can set up host to consume content from specific content view. Your test might create a host, create a content view, attach host to content view and verify that some packages are or are not available to this host. You could file such test under integration level, because you integrate two distinct components.

But you could also file that test under system level, because serving specific filtered view of all available content to specific hosts based on various criteria is one of primary use cases of Satellite, and possibly the main reason people are willing to pay money for it.

For the sake of argument, let’s assume that test above is integration level test, and system level is reserved for tests that exercise some larger, end to end flows. Something like: create a host, create a content view, sync content to host, install a specific package update that requires restart and wait for a host to be back online.

Satellite may be set up to periodically send data about hosts to cloud.redhat.com. When you test this feature, you might consider Satellite as a whole to be one component and cloud.redhat.com to be another. This leads to the conclusion that such a test should be filed under the integration level.

While this conclusion is logical (it follows directly from the premises), it doesn’t feel right. If test levels form a kind of hierarchy, then why is a test that exercises the system as a whole on the integration level?

You can try to eliminate the problem by lifting this test to the system level. But there are still two visibly distinct kinds of tests filed under a single label - some system level tests exercise Satellite as a whole, and some system level tests exercise integration between Satellite and some external system.

Either way, your levels become internally inconsistent.

Let’s leave integration and system level for now. How about acceptance level?

Satellite is a product that is developed and sold to anyone who wants to buy it. There is no “acceptance” phase in Satellite lifecycle. Each potential customer would run their own acceptance testing, and while the team obviously appreciated the feedback from these sessions, it was rarely considered to be a “release blocker”.

Given these circumstances, we decided to create a simple heuristic - if the test covers issue reported by customer, then this test should be on acceptance level.

Soon we realized that a large number of customer issues are caused by specific data they have used, or specific environment in which the product operates. Our heuristic elevated tests from component or integration level way up to acceptance level.

This shows the biggest problem with acceptance level - it belongs to completely different categorization scheme. Acceptance level is not defined by what is being tested, but by who performs the testing.

Perhaps there was a time when that distinction had only theoretical meaning. As a software vendor, you built units, integrated them, verified that system as a whole performs as expected and sent that to customer, who would verify that it fits the purpose. Acceptance level tests were truly something greater than system level tests.

But we don’t live in such world anymore. These days, most software is in perpetual development. There’s no separate “acceptance” phase, because what is subject to acceptance testing of one customer, is actual production version of another customer. If product is changed based on acceptance testing results, all customers receive that change.

Perhaps placing acceptance testing at the level above system testing was always something that only made sense in very specific context - when developing business software tailored to specific customer that does not subscribe to “all companies are software companies” world view.

While I do not have this kind of experience, I have heard about a military contractor that had to submit each function for independent verification by US Army staff, because the army needed to be really sure there was nothing dicey going on in the system. I find it believable. I can think of a bunch of reasons why a customer would want to run acceptance tests on units smaller than the whole system. One of them would be really high stakes - when a bug in a system could mean the difference between being alive and dead. Another would be when a system is expected to last decades and it’s really important for a customer to obtain certain knowledge and prepare for future maintenance. Military, government (especially intelligence), medicine and automotive all sound like places where a customer might want to verify parts of the system.

Finally, what about unit (component) level? Are they simple?

Most testers learn to understand unit tests as a thing that is a developer problem - they are created, maintained and run by developers. Of course you might question this understanding in the world of shifting left, DevTestOps and the “quality is everyone’s responsibility” mantra, but let’s ignore that discussion for now. If unit tests are a developer problem, we should see what developers think about them.

Apparently, they discuss at length what unit even is. There’s also an anecdote floating around of a person that covered 24 different definitions of unit test in the first morning of their training course.

Could we do better?

I think it’s clear that there are problems with the common understanding of test levels. But the question remains: are these problems specific to that implementation of the idea, or is the idea of test levels itself completely busted? Could there be another way of defining test levels? Would it be free of the problems discussed above?

My thinking about test levels is guided by two principles. First, levels are hierarchical - a higher level should build upon things from the level below. Obviously, the higher level should be, in some way, more than a simple sum of the things below. Second, it should be relatively obvious to which level a given test belongs. “Relatively”, because borderline cases are always going to exist in one form or another, and we are humans, so we are going to see things a little differently sometimes. But these should be exceptions, not the norm.

Function level. For the large majority of us, a function is the smallest building block of our programs. That’s why the lowest level is named after it. On the function level, your tests focus on individual functions in isolation. Most of the time, you would try various inputs and verify outputs or side effects. Of course it helps when your functions are pure and idempotent. This is the level mainly targeted by techniques like fuzzing and property-based testing.
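
To illustrate, a function level test might look like the snippet below - the slugify helper and its tests are invented for this example:

```python
# A hypothetical function-level test: a pure function exercised in
# isolation, with various inputs checked against expected outputs.
# The helper and the tests are invented for illustration.

def slugify(title):
    """Turn a title into a URL-friendly slug."""
    return "-".join(title.lower().split())

def test_slugify_basic():
    assert slugify("Hello World") == "hello-world"

def test_slugify_collapses_whitespace():
    assert slugify("  Hello   World  ") == "hello-world"
```

Because slugify is pure, the same tests could also serve as a starting point for property-based testing (e.g. asserting that the output never contains whitespace for any input).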

Class level. The name comes from the object-oriented paradigm, where we tend to group functions that work together into classes. The main goal of tests at this level is to verify integration between functions. These functions may, but don’t have to, be grouped in a single class. Since classes group behavior and state, setup code is much more common on this level - you will find yourself ensuring that a class is in a specific state before you can test what you actually care about. Test cleanup code will also appear more often than on the function level, for the same reason. Property-based testing is harder to apply at this level.

Package level. This name is inspired by the Python naming convention, where a package is a collection of modules (i.e. functions and classes) that work together to achieve a single goal. This is also what package level tests are all about - they test interactions between classes, and between classes and functions. These are the tests that pose the first challenge for the common understanding of test levels. Some people might consider them integration tests (because there are a few classes working together, and you want to test how well they integrate with each other), while others would consider them unit tests (because a package is designed to solve a single “unit” of the domain problem). For me, a package is something that is coherent enough to have a somewhat clear boundary with the rest of the system, but not abstract enough to be considered for extraction from the system into a 3rd-party library. This level might be easier to understand in relation to the next level.

Service level. The name comes from microservice architecture. We can discuss at length whether microservices are right for you, and whether they are anything more than a buzzword, but that is a discussion for another time. What’s important is that your project consists of multiple packages (unless you are in the business of creating libraries). Some of these packages, or some sets of packages, have a very clearly defined responsibility within the system, and boundaries that set them apart from the rest of the system. At least theoretically, these packages could be extracted into a separate library (or a separate API service) that your project would pull in as a dependency. Service level tests focus on these special packages, or collections of packages.

Service level is where things start to become really interesting. All levels below are focused on code organization. At the service level, you have to face the question of why you are developing the software at all. The service level is primarily driven by business needs, and the relationship between them and specific system components. Some services encapsulate “business logic” - external constraints that the system has to adhere to. Other services exist only to support these core services or to enable integration with other systems. Some services are relatively abstract and are likely to be implemented by some open source library (think about a database access service or a user authentication service).

Service level is also where testers traditionally got involved, because some services exist only to facilitate interaction of the system with the outside world. Think about generating HTML, sending e-mails, REST API endpoints, desktop UIs etc.

System level. For most intents and purposes, system is a synonym for “software”. These days, when everything is interconnected and integrated, it might sometimes be hard to clearly define “system” boundaries. I would use a handful of heuristics: your customers buy a copy of a system, or a license to use a system, or create an account within a system. A system is what users interact with. A system has a name, and this name is known to customers. A system is subject to your company’s marketing and sales efforts. Most of the things we know and use every day are systems: Spotify, Netflix, Microsoft Windows, Microsoft Word, …

A lot of systems truly are a collection of services (subsystems). Most of discussions around software architecture focus on how to arrange services in a way that responsibilities and boundaries are clear. For many architects, the end goal is to design a system in a way that makes it possible to swap one service implementation for another without impacting the whole thing.

While this separation is important from a development perspective, it’s also crucial that it is not visible to the customer. If the user feels, or worse - knows - that she moves from one subsystem to another, more often than not it means that UX attention is required.

System level tests focus on exercising integration between subsystems and exercising system as a whole. Often they will interact with a system through the interface that is known to users - desktop UI, web page or public API. For that reason, system level tests tend to be relatively slow and brittle. To offset that, usually you will focus only on happy paths and most important end-to-end journeys.

Offering level. Many companies are built around a single product and never reach this level. But when a company is big enough and offers multiple products, it is usually important that these products work well together.

Today, one of the best examples is Amazon and AWS. AWS provides access to many services, including EC2 virtual machines, S3 storage and RDS managed databases. Most of these services are maintained by dedicated teams, and customers may decide to pay for one and not another. But customers might also decide to embrace AWS completely. When they do, it’s really important that setting up EC2 machine to store data on S3 is easy, ideally easier than any other cloud storage. Amazon understands that and offers products that group and connect existing services into ready to use solutions for common business problems.

Testing on this level poses unique technical and organizational challenges. Company engineering structure tends to be organized around specific products. Each product will be built by a different team using a different technology stack and tools, and might have a different goal and target audience. To effectively test at this level, you need people working across the organization and you need to fill the gaps that nobody feels responsible for. Often you need endorsement from the very top of company leadership, because most of the teams already have more work than they can handle - and if they are to help with offering testing, that must be done at the expense of something else.

But this proposal is bad

I am not claiming that the above proposal is perfect. In fact, I can find a few problems with it myself, which I discuss briefly below. But I think it is a step in the right direction and provides a good foundation that you can adjust to your specific situation.

If we follow the pattern that a higher level is a collection of elements at the level below, we might notice that a function is not the smallest unit - most functions execute multiple system calls, and some system calls might encapsulate multiple processor instructions. I’ve decided to skip these levels, because I don’t have any experience working with systems so low in the stack. But I imagine people working on programming languages, compilers and processors might have a case for level(s) below the function level.

You might find “class level” to have a misleading name if you work in a language that does not have classes. In functional languages, like Lisp or Haskell, it might be more fitting to use “higher-order functions level”. I don’t think the label is the most important part here - the point is, tests at that level verify integration between functions.

Python naming conventions differentiate between modules and packages. Without going into much detail, a module is approximated by a single file, and a package is approximated by a single directory. In Python, a package is a collection of modules. Java also differentiates between modules and packages, but the relationship is inverted - a package is a collection of classes and functions, and a module is a collection of related packages. Depending on your goals and language, it might make sense to maintain both a “module level” and a “package level”.

Unless you are working on microservices, you might prefer to call the “service level” a “subsystem level”. My answer is the same as for the “class level” in purely functional languages - it doesn’t matter that much what you call it, as long as you are being consistent. Feel free to use a name that better suits your team and your technology stack naming conventions. The point of the service / subsystem level is that these tests cover a part of the system that has a clearly defined responsibility.

Users these days expect integrations between the various services that they use. Take Notion as an example - it can integrate with applications such as Trello, Google Drive, Slack, Jira and GitHub. These integrations need to be tested, but it’s unclear to which level these tests belong. They aren’t system level tests, because they cover the system as a whole and something else. They aren’t offering level tests either, because Trello, Slack and GitHub are not part of your company’s offer. I think that sometimes there might be a need for a new level, which we might call the “3rd party integrations level”. I would place it between the system level and the offering level, or between the service level and the system level.

Why bother discussing test levels, anyway?

You tell me!

This article focuses more on “what” of test levels than on “why”, but that’s a fair question. To wrap the topic, let’s quickly go over some of the reasons why you might want to categorize tests by their levels.

Perhaps you want to track trends over time. Is most of your test development time spent at function level or service level? Can you correlate that with specific problems reported by customers? Does it look like gaps in coverage are emerging from the data?

Perhaps you want to gate your tests on results of tests at the level below. So first you run function level tests, and once they all pass, you run class level, and once they all pass, you run package level… You get the idea.
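
That gating logic can be sketched in a few lines - run_level here is a hypothetical callback standing in for whatever actually executes one level (e.g. a test runner invocation filtered by a level tag):

```python
# A minimal sketch of level-gated test runs. `run_level` is a hypothetical
# callback that executes all tests at one level and returns True when they
# pass; in practice it might shell out to a test runner filtered by a tag.

LEVELS = ["function", "class", "package", "service", "system"]

def run_gated(levels, run_level):
    """Run levels in order; stop at, and return, the first failing level."""
    for level in levels:
        if not run_level(level):
            return level
    return None  # every level passed
```

For example, run_gated(LEVELS, lambda lvl: lvl != "service") would report "service" as the first failing level and never attempt "system" at all.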

Perhaps you have different targets for each level. Tests on lower levels tend to run faster, while tests on higher levels tend to be more brittle. So maybe you are OK with system level tests completing in 2 hours, but for function level tests, finishing in 15 minutes is unacceptable. And maybe you target 100% pass rate at the function level, but you understand it’s unreasonable to expect more than 95% pass rate at the system level.

Perhaps you need a tool to guide your thinking on where testing efforts should concentrate. As a rule of thumb, you want to test things on the lowest level possible. As you move up in test levels hierarchy, you want to focus on things that are specific and unique to this level. It’s also generally fine to assume that building blocks on each level are working as advertised, since they were thoroughly tested on the level below.

Whatever you do with test levels, I think it makes sense to use a classification that can be applied consistently by all team members. Hopefully the one proposed above will give you some ideas on how to construct such a classification.

Python for Beginners: Check For Subset in Python


A set in Python is a data structure that contains unique immutable objects. In this article, we will discuss what a subset of a set is and how we can check for a subset in Python.

What is a Subset?

A subset of a set is another set that contains some or all elements of the given set. In other words, If we have a set A and set B, and each element of set B belongs to set A, then set B is said to be a subset of set A.

Let us consider an example where we are given three sets A, B, and C as follows.

A = {1, 2, 3, 4, 5, 6, 7, 8}

B = {2, 4, 6, 8}

C = {0, 1, 2, 3, 4}

Here, you can observe that all the elements in set B are present in set A. Hence, set B is a subset of set A. On the other hand, all the elements of set C do not belong to set A. Hence, set C is not a subset of set A.

You can observe that a subset will always have fewer elements than, or as many elements as, the original set. An empty set is also considered a subset of any given set. Now, let us describe a step-by-step algorithm to check for a subset in Python.

How to Check For Subset in Python?

Consider that we are given two sets A and B. Now, we have to check if set B is a subset of set A or not. For this, we will traverse all the elements of set B and check whether they are present in set A or not. If there exists an element in set B that doesn’t belong to set A, we will say that set B is not a subset of set A. Otherwise, set B will be a subset of set A. 

To implement this approach in Python, we will use a for loop and a flag variable isSubset. We will initialize the isSubset variable to True, denoting that set B is a subset of set A. We have done this to make sure that an empty set B is also considered a subset of A. While traversing the elements of set B, we will check whether each element is present in set A.

If we find any element that isn’t present in set A, we will assign False to isSubset, showing that set B is not a subset of set A.

If we do not find any element in set B that does not belong to set A, the isSubset variable will still contain the value True, showing that set B is a subset of set A. The entire logic to check for a subset can be implemented in Python as follows.

def checkSubset(set1, set2):
    # Returns True if set1 is a subset of set2.
    isSubset = True
    for element in set1:
        if element not in set2:
            # Found an element of set1 that is missing from set2,
            # so set1 cannot be a subset of set2.
            isSubset = False
            break
    return isSubset


A = {1, 2, 3, 4, 5, 6, 7, 8}
B = {2, 4, 6, 8}
C = {0, 1, 2, 3, 4}
print("Set {} is: {}".format("A", A))
print("Set {} is: {}".format("B", B))
print("Set {} is: {}".format("C", C))
print("Set B is subset of A :", checkSubset(B, A))
print("Set C is subset of A :", checkSubset(C, A))
print("Set B is subset of C :", checkSubset(B, C))

Output:

Set A is: {1, 2, 3, 4, 5, 6, 7, 8}
Set B is: {8, 2, 4, 6}
Set C is: {0, 1, 2, 3, 4}
Set B is subset of A : True
Set C is subset of A : False
Set B is subset of C : False

Suggested Reading: Chat Application in Python

Check For Subset Using issubset() Method

We can also use the issubset() method to check for a subset in Python. The issubset() method, when invoked on a set A, accepts a set B as an input argument and returns True if set A is a subset of set B. Otherwise, it returns False.

You can use the issubset() method to check for a subset in Python as follows.

A = {1, 2, 3, 4, 5, 6, 7, 8}
B = {2, 4, 6, 8}
C = {0, 1, 2, 3, 4}
print("Set {} is: {}".format("A", A))
print("Set {} is: {}".format("B", B))
print("Set {} is: {}".format("C", C))
print("Set B is subset of A :", B.issubset(A))
print("Set C is subset of A :", C.issubset(A))
print("Set B is subset of C :", B.issubset(C))

Output:

Set A is: {1, 2, 3, 4, 5, 6, 7, 8}
Set B is: {8, 2, 4, 6}
Set C is: {0, 1, 2, 3, 4}
Set B is subset of A : True
Set C is subset of A : False
Set B is subset of C : False
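Python’s set type also overloads the comparison operators for subset tests, which is equivalent to the method calls above and is mostly a matter of style. A short sketch:

```python
A = {1, 2, 3, 4, 5, 6, 7, 8}
B = {2, 4, 6, 8}

# B <= A is equivalent to B.issubset(A).
print(B <= A)  # True

# B < A checks for a *proper* subset: B must be a subset
# of A and not equal to A.
print(B < A)   # True
print(A < A)   # False, a set is not a proper subset of itself
print(A <= A)  # True
```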

Conclusion

In this article, we have discussed ways to check for a subset in Python. To learn more about sets, you can read this article on set comprehension in Python. You might also like this article on list comprehension in Python.

The post Check For Subset in Python appeared first on PythonForBeginners.com.

Real Python: GitHub Copilot: Fly With Python at the Speed of Thought


GitHub Copilot is a thrilling new technology that promises to deliver to your code editor a virtual assistant powered by artificial intelligence, and it stirred up considerable controversy when it was released to the general public. Python is among the languages that are particularly well-supported by this tool. After reading this tutorial, you’ll know whether GitHub Copilot is a risk, a gimmick, or a true game changer in software engineering.

In this tutorial, you’ll learn how to:

  • Install the GitHub Copilot extension in your code editor
  • Transform your natural language description of a task into working code
  • Choose between multiple alternative intelligent code completion suggestions
  • Explore unfamiliar frameworks and programming languages
  • Teach GitHub Copilot how to use your custom API
  • Exercise test-driven development with a virtual pair programmer in real time

To continue with this tutorial, you need to have a personal GitHub account and a code editor such as Visual Studio Code or an integrated development environment like PyCharm.

Free Download: Click here to download a free cheat sheet of keyboard shortcuts to make coding with GitHub Copilot even faster.

Get Started With GitHub Copilot in Python

GitHub Copilot is the first commercial product based on the OpenAI Codex system, which can translate natural language to code in over a dozen programming languages in real time. OpenAI Codex itself is a descendant of the GPT-3 deep learning language model. The neural network in Codex was trained on both text and hundreds of millions of public code repositories hosted on GitHub.

Note: You can learn more about GPT-3 by listening to Episode 121 of the Real Python Podcast, featuring data scientist Jodie Burchell.

GitHub Copilot understands a few programming languages and many human languages, which means that you’re not confined to English only. For example, if you’re a native Spanish speaker, then you can talk to GitHub Copilot in your mother tongue.

Initially, the product was only available as a technical preview to a select group of people. This has changed recently, and today, anyone can experience the incredible power of artificial intelligence in their code editors. If you’d like to take it for a test drive, then you’ll need a subscription for GitHub Copilot.

Subscribe to GitHub Copilot

To enable GitHub Copilot, go to the billing settings in your GitHub profile and scroll down until you see the relevant section. Unfortunately, the service doesn’t come free of charge for most people out there. At the time of writing, the service costs ten dollars per month or a hundred dollars per year when paid upfront. You can enjoy a sixty-day trial period without paying anything, but only after providing your billing information.

Note: Be sure to cancel the unpaid subscription plan before it expires to avoid unwanted charges!

Students and open-source maintainers may get a free GitHub Copilot subscription. If you’re a lucky one, then you’ll see the following information after enabling the service:

GitHub Copilot Billing Status

GitHub will verify your status once a year based on proof of academic enrollment, such as a picture of your school ID or an email address in the .edu domain, or your activity in one of the popular open-source repositories.

For detailed instructions on setting up and managing your GitHub subscription, follow the steps in the official documentation. Next up, you’ll learn how to install the GitHub Copilot extension for Visual Studio Code. If you’d prefer to use GitHub Copilot with PyCharm instead, then skip ahead to learn how.

Install a Visual Studio Code Extension

Because Microsoft owns GitHub, it’s no surprise that their Visual Studio Code editor was the first tool to receive GitHub Copilot support. There are a few ways to install extensions in Visual Studio Code, but the quickest one is probably by bringing up the Quick Open panel using Ctrl+P or Cmd+P and then typing the following command:

ext install GitHub.copilot

When you confirm it by pressing Enter, it’ll install the extension and prompt you to reload the editor afterward.

Alternatively, you can find the Extensions icon in the Activity Bar located on the left-hand side of the window and try searching for the GitHub Copilot extension on the Visual Studio Marketplace:

GitHub Copilot Extension for Visual Studio Code

You might also show the Extensions view in Visual Studio Code directly by using a corresponding keyboard shortcut.

After the installation is complete, Visual Studio Code will ask you to sign in to GitHub to give it access to your GitHub profile, which your new extension requires:

Read the full article at https://realpython.com/github-copilot-python/ »


