PySpark Replace Values In DataFrames Using regexp_replace(), translate() and overlay() Functions

The regexp_replace(), translate(), and overlay() functions can be used to replace values in the string columns of a PySpark DataFrame.

First, we import the required libraries and start a Spark session.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, translate, overlay, when, expr

# Initialize the Spark session instance
spark = SparkSession.builder.appName('snippets').getOrCreate()

Then we load our initial records.

columns = ["Full_Name", "Salary", "Last_Name_Pattern", "Last_Name_Replacement"]
data = [
    ('Sam A Smith', '1,000.01', 'Sm', 'Griffi'),
    ('Alex Wesley Jones', '120,000.89', 'Jo', 'Ba'),
    ('Steve Paul Jobs', '5,000.90', 'Jo', 'Bo'),
]

# Convert the data to an RDD
rdd = spark.sparkContext.parallelize(data)

# Create a DataFrame from the RDD and name its columns
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)

# Display the data before any manipulation
dfFromRDD2.show()

+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|      Sam A Smith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Jones|120,000.89|               Jo|                   Ba|
|  Steve Paul Jobs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

PySpark regexp_replace()

regexp_replace(): We use regexp_replace(col_name, pattern, new_value) to replace every substring of a string column that matches the pattern with new_value.

1) Here we replace the characters 'Jo' in Full_Name with 'Ba'.

# Update the 'Full_Name' column by replacing every occurrence of 'Jo' with 'Ba'
modified_dfFromRDD2 = dfFromRDD2.withColumn("Full_Name", regexp_replace('Full_Name', 'Jo', 'Ba'))

# Display the modified DataFrame. Only the last two names change, because only they contain 'Jo'
modified_dfFromRDD2.show()

+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|      Sam A Smith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+
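For intuition, the same substitution can be mimicked in plain Python with `re.sub`. This is only a sketch of the semantics, not PySpark code: Spark evaluates the pattern with Java's regex engine, and the `names` list is a hypothetical stand-in for the Full_Name column.

```python
import re

# Stand-in for the Full_Name column (values copied from the sample records).
names = ["Sam A Smith", "Alex Wesley Jones", "Steve Paul Jobs"]

# Like regexp_replace, re.sub replaces every match inside the string
# and leaves strings without a match unchanged.
replaced = [re.sub("Jo", "Ba", name) for name in names]

print(replaced)
```

Because the pattern is not anchored, 'Jo' would be replaced anywhere it occurs in a name, not only at the start of the last name.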

2) In the example above, only two values (Jones, Jobs) are replaced, but not Smith. We can use the when() function to replace column values conditionally.

# Update the 'Full_Name' column with a different replacement depending on a condition:
# names ending in 'th' get 'Smith' -> 'Griffith', all other names get 'Jo' -> 'Ba'
modified_dfFromRDD3 = dfFromRDD2.withColumn(
    "Full_Name",
    when(col('Full_Name').endswith('th'), regexp_replace('Full_Name', 'Smith', 'Griffith'))
    .otherwise(regexp_replace('Full_Name', 'Jo', 'Ba'))
)

# Display the modified DataFrame. Every row is now updated according to the branch it falls into
modified_dfFromRDD3.show()

+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|   Sam A Griffith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+
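The branching in when(...).otherwise(...) can be pictured as an ordinary if/else applied to each row. The following plain-Python sketch assumes a hypothetical helper `replace_conditionally` and a `names` list standing in for the column; it illustrates the logic only and does not run on Spark.

```python
import re

# Stand-in for the Full_Name column (values copied from the sample records).
names = ["Sam A Smith", "Alex Wesley Jones", "Steve Paul Jobs"]


def replace_conditionally(name):
    """Hypothetical helper mirroring the when/otherwise branches."""
    # First branch: applies only to names ending in 'th'.
    if name.endswith("th"):
        return re.sub("Smith", "Griffith", name)
    # Fallback branch: every other name.
    return re.sub("Jo", "Ba", name)


conditional = [replace_conditionally(name) for name in names]

print(conditional)
```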

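3) We can also use a regular expression as the pattern. As an example, we set the two decimal digits at the end of each Salary value to '00' and append ' $'.

# '\.\d\d$' matches a literal dot followed by two digits at the end of the value.
# Raw strings keep the backslashes intact; '\$' in the replacement is a literal dollar sign
modified_dfFromRDD4 = dfFromRDD2.withColumn("Salary", regexp_replace('Salary', r'\.\d\d$', r'.00 \$'))

# Display the modified DataFrame to see how the Salary column is updated
modified_dfFromRDD4.show(truncate=False)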

+-----------------+------------+-----------------+---------------------+
|Full_Name        |Salary      |Last_Name_Pattern|Last_Name_Replacement|
+-----------------+------------+-----------------+---------------------+
|Sam A Smith      |1,000.00 $  |Sm               |Griffi               |
|Alex Wesley Jones|120,000.00 $|Jo               |Ba                   |
|Steve Paul Jobs  |5,000.00 $  |Jo               |Bo                   |
+-----------------+------------+-----------------+---------------------+
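The pattern itself can be tried out in plain Python before running it on a cluster. This is a sketch under the assumption that Python's `re` and Spark's Java-based engine agree on this simple pattern; the `salaries` list is a hypothetical stand-in for the Salary column.

```python
import re

# Stand-in for the Salary column (values copied from the sample records).
salaries = ["1,000.01", "120,000.89", "5,000.90"]

# r"\.\d\d$" matches a literal dot followed by exactly two digits at the
# end of the string. In re.sub a '$' in the replacement is a plain
# character, so it needs no escape here; the PySpark call escapes it
# because Java's replacement syntax uses '$' for group references.
rounded = [re.sub(r"\.\d\d$", ".00 $", salary) for salary in salaries]

print(rounded)
```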

4) Next, we use another regex to replace a variable number of characters. Here we replace each run of lowercase characters in the Full_Name column with '--'.

# Replace each run of lowercase characters in Full_Name with '--'
modified_dfFromRDD5 = dfFromRDD2.withColumn("Full_Name", regexp_replace('Full_Name', '[a-z]+', '--'))

# Display the modified DataFrame. All lowercase characters are replaced,
# while the uppercase characters are the same as before
modified_dfFromRDD5.show()

+-----------+----------+-----------------+---------------------+
|  Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------+----------+-----------------+---------------------+
|  S-- A S--|  1,000.01|               Sm|               Griffi|
|A-- W-- J--|120,000.89|               Jo|                   Ba|
|S-- P-- J--|  5,000.90|               Jo|                   Bo|
+-----------+----------+-----------------+---------------------+
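The `+` quantifier is what makes each run of lowercase letters collapse into a single '--'. A small plain-Python comparison (a sketch using `re.sub`, not PySpark) shows the difference between matching a run and matching one character at a time:

```python
import re

name = "Alex Wesley Jones"

# '[a-z]+' matches a whole run of lowercase letters, so each run
# is replaced by a single '--'.
per_run = re.sub(r"[a-z]+", "--", name)

# '[a-z]' matches one letter at a time, so every lowercase letter
# is replaced by its own '--'.
per_char = re.sub(r"[a-z]", "--", name)

print(per_run)
print(per_char)
```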

5) We can also use regexp_replace with expr to take the pattern and the replacement from other columns, i.e. 'regexp_replace(col1, col2, col3)'. Here we replace the characters in column 1 that match the pattern in column 2 with the characters from column 3. Note that this example starts from modified_dfFromRDD2 (the result of step 1), where 'Jo' has already been replaced with 'Ba', so only the 'Sm' pattern in the first row still finds a match.

# Update the 'Full_Name' column by replacing the characters that match the value
# in 'Last_Name_Pattern' with the value in 'Last_Name_Replacement'
modified_dfFromRDD6 = modified_dfFromRDD2.withColumn(
    "Full_Name",
    expr("regexp_replace(Full_Name, Last_Name_Pattern, Last_Name_Replacement)")
)

# Display the modified DataFrame.
# The Full_Name column has been updated with characters from Last_Name_Replacement
modified_dfFromRDD6.show()

+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|  Sam A Griffiith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+
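Per-row patterns can be pictured as a loop in which each row supplies its own arguments. The plain-Python sketch below applies the idea to the original, unmodified names; `rows` is a hypothetical stand-in for the three columns involved.

```python
import re

# (Full_Name, Last_Name_Pattern, Last_Name_Replacement) from the sample records.
rows = [
    ("Sam A Smith", "Sm", "Griffi"),
    ("Alex Wesley Jones", "Jo", "Ba"),
    ("Steve Paul Jobs", "Jo", "Bo"),
]

# Each row uses its own pattern and replacement.
per_row = [
    re.sub(pattern, replacement, name)
    for name, pattern, replacement in rows
]

print(per_row)
```

Applied to the original names, the third row becomes 'Steve Paul Bobs'. The table above shows 'Babs' because the PySpark call reads from modified_dfFromRDD2, where 'Jo' had already been replaced with 'Ba'.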

PySpark translate()

translate(): This function performs character-by-character replacement of column values. Each character of the first string is mapped to the character at the same position in the second string.

# Update the 'Full_Name' column by replacing lowercase characters as follows:
# each 'a' is replaced by 0, 'b' by 1, 'c' by 2, ..., 'i' by 8 and 'j' by 9
alphabets = 'abcdefghij'
digits = '0123456789'
modified_dfFromRDD7 = dfFromRDD2.withColumn("Full_Name", translate('Full_Name', alphabets, digits))

# Display the modified DataFrame. The replacements have been made character by character
modified_dfFromRDD7.show(truncate=False)

+-----------------+----------+-----------------+---------------------+
|Full_Name        |Salary    |Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|S0m A Sm8t7      |1,000.01  |Sm               |Griffi               |
|Al4x W4sl4y Jon4s|120,000.89|Jo               |Ba                   |
|St4v4 P0ul Jo1s  |5,000.90  |Jo               |Bo                   |
+-----------------+----------+-----------------+---------------------+
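Python's built-in `str.translate` works on the same positional idea, which makes it a convenient way to preview a mapping. This is a sketch of the concept rather than Spark's implementation, and it assumes the intended mapping runs from 'a' through 'j'; the `names` list is a hypothetical stand-in for the column.

```python
# Positional mapping: 'a' -> '0', 'b' -> '1', ..., 'j' -> '9'.
alphabets = "abcdefghij"
digits = "0123456789"
table = str.maketrans(alphabets, digits)

# Stand-in for the Full_Name column (values copied from the sample records).
names = ["Sam A Smith", "Alex Wesley Jones", "Steve Paul Jobs"]

# Characters outside the mapping (uppercase letters, spaces, 'k'..'z')
# pass through unchanged.
translated = [name.translate(table) for name in names]

print(translated)
```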

PySpark overlay()

overlay(src_col, replace_col, src_start_pos, src_char_len <default -1>): This function replaces src_char_len characters of src_col, starting at the 1-indexed position src_start_pos, with the value from replace_col. When src_char_len is omitted (default -1), the number of characters replaced equals the length of the replacement value.

# Replace the first two characters with the string in the Last_Name_Replacement column
modified_dfFromRDD8 = dfFromRDD2.select(
    'Full_Name',
    overlay("Full_Name", "Last_Name_Replacement", 1, 2).alias("FullName_Overlayed")
)

# Display the modified DataFrame
modified_dfFromRDD8.show()

+-----------------+------------------+
|        Full_Name|FullName_Overlayed|
+-----------------+------------------+
|      Sam A Smith|   Griffim A Smith|
|Alex Wesley Jones| Baex Wesley Jones|
|  Steve Paul Jobs|   Boeve Paul Jobs|
+-----------------+------------------+

# Replace characters starting at position 5 (1-indexed); the number of characters
# replaced equals the length of the replacement string
modified_dfFromRDD9 = dfFromRDD2.select(
    'Full_Name',
    overlay("Full_Name", "Last_Name_Replacement", 5).alias("FullName_Overlayed")
)

# Display the modified DataFrame
modified_dfFromRDD9.show()

+-----------------+------------------+
|        Full_Name|FullName_Overlayed|
+-----------------+------------------+
|      Sam A Smith|       Sam Griffih|
|Alex Wesley Jones| AlexBaesley Jones|
|  Steve Paul Jobs|   StevBoPaul Jobs|
+-----------------+------------------+
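The position and length arithmetic of overlay() can be worked through with string slicing. The function below, `overlay_sketch`, is a hypothetical plain-Python model of the behaviour described above, assuming a start position of 1 or greater; it is not Spark's implementation.

```python
def overlay_sketch(src, replace, pos, length=-1):
    """Hypothetical model of overlay(): pos is 1-indexed, and a negative
    length means 'replace as many characters as the replacement has'."""
    if length < 0:
        length = len(replace)
    start = pos - 1  # convert the 1-indexed position to a 0-indexed offset
    # Keep everything before the start, insert the replacement,
    # then resume the source after the replaced characters.
    return src[:start] + replace + src[start + length:]


# Two characters replaced from position 1.
first_two = overlay_sketch("Sam A Smith", "Griffi", 1, 2)

# Default length: six characters replaced from position 5.
from_five = overlay_sketch("Sam A Smith", "Griffi", 5)

print(first_two)
print(from_five)
```

Both calls reproduce the first row of the two tables above; the other rows follow the same arithmetic.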

spark.stop()

John Ludhi/nbshare.io: PySpark Replace Values In DataFrames
