regexp_replace(), translate(), and overlay() functions can be used to replace values in PySpark Dataframes.
First we load the important libraries
frompyspark.sqlimportSparkSessionfrompyspark.sql.functionsimport(col,regexp_replace,translate,overlay,when,expr)
# initializing spark session instancespark=SparkSession.builder.appName('snippets').getOrCreate()
columns=["Full_Name","Salary","Last_Name_Pattern","Last_Name_Replacement"]data=[('Sam A Smith','1,000.01','Sm','Griffi'),('Alex Wesley Jones','120,000.89','Jo','Ba'),('Steve Paul Jobs','5,000.90','Jo','Bo')]
# converting data to rddsrdd=spark.sparkContext.parallelize(data)
# Then creating a dataframe from our rdd variabledfFromRDD2=spark.createDataFrame(rdd).toDF(*columns)
# visualizing current data before manipulationdfFromRDD2.show()
regex_replace: we will use the regex_replace(col_name, pattern, new_value) to replace character(s) in a string column that match the pattern with the new_value
1) Here we are replacing the characters 'Jo' in the Full_Name with 'Ba'
# here we update the column called 'Full_Name' by replacing some characters in the name that fit the criteriamodified_dfFromRDD2=dfFromRDD2.withColumn("Full_Name",regexp_replace('Full_Name','Jo','Ba'))
# visualizing the modified dataframe. We see that only the last two names are updated as those meet our criteriamodified_dfFromRDD2.show()
2) In the above example, we see that only two values (Jones, Jobs) are replaced but not Smith. We can use when function to replace column values conditionally
# Here we update the column called 'Full_Name' by replacing some characters in the name that fit the criteria# based on the conditionsmodified_dfFromRDD3=dfFromRDD2.withColumn("Full_Name",when(col('Full_Name').endswith('th'),regexp_replace('Full_Name','Smith','Griffith'))\
.otherwise(regexp_replace('Full_Name','Jo','Ba')))
# visualizing the modified dataframe we see how all the column values are updated based on the conditions providedmodified_dfFromRDD3.show()
3) We can also use a regex to replace characters. As an example we are making the decimal digits in the salary column to '00'.
modified_dfFromRDD4=dfFromRDD2.withColumn("Salary",regexp_replace('Salary','\\.\d\d$','.00 \\$'))
# visualizing the modified dataframe, we see how the Salary column is updatedmodified_dfFromRDD4.show(truncate=False)
4) Now we will use another regex example to replace varialbe number of characters where the pattern matches regex. Here we replace all lower case characters in the Full_Name column with '--'
# Replace only the lowercase characters in the Full_Name with --modified_dfFromRDD5=dfFromRDD2.withColumn("Full_Name",regexp_replace('Full_Name','[a-z]+','--'))
# visualizing the modified data frame. We see that all the lowercase characters are replaced.# The uppercase characters are same as they were beforemodified_dfFromRDD5.show()
5) We can also use regex_replace with expr to replace a column's value with a match pattern from a second column with the values from third column i.e 'regexp_replace(col1, col2, col3)'. Here we are going to replace the characters in column 1, that match the pattern in column 2 with characters from column 3.
# Here we update the column called 'Full_Name' by replacing some characters in the 'Full_Name' that match the values# in 'Last_Name_Pattern' with characters in 'Last_Name_Replacement'modified_dfFromRDD6=modified_dfFromRDD2.withColumn("Full_Name",expr("regexp_replace(Full_Name, Last_Name_Pattern, Last_Name_Replacement)"))
# visualizing the modified dataframe. # The Full_Name column has been updated with some characters from Last_Name_Replacementmodified_dfFromRDD6.show()
# here we update the column called 'Full_Name' by replacing the lowercase characters in the following way:# each 'a' is replaced by 0, 'b' by 1, 'c' by 2, .....'i' by 8 and j by 9alphabets='abcdefjhij'digits='0123456789'modified_dfFromRDD7=dfFromRDD2.withColumn("Full_Name",translate('Full_Name',alphabets,digits))
# visualizing the modified dataframe we see the replacements has been done character by charactermodified_dfFromRDD7.show(truncate=False)
overlay(src_col, replace_col, src_start_pos, src_char_len <default -1>): This function is used to replace the values in a src_col column from src_start_pos with values from replace_col. This replacement starts from src_start_pos and replaces src_char_len characters (by default replaces replace_col length characters)
# Here the first two characters are replaced by the replacement string in Last_Name_Replacement columnmodified_dfFromRDD8=dfFromRDD2.select('Full_Name',overlay("Full_Name","Last_Name_Replacement",1,2).alias("FullName_Overlayed"))
# Visualizing the modified dataframemodified_dfFromRDD8.show()
# Here we replace characters starting from position 5 (1-indexed) and replace characters equal to the # length of the replacement stringmodified_dfFromRDD9=dfFromRDD2.select('Full_Name',overlay("Full_Name","Last_Name_Replacement",5).alias("FullName_Overlayed"))
# Visualizing the modified dataframemodified_dfFromRDD9.show()
spark.stop()