Quantcast
Channel: Planet Python
Viewing all articles
Browse latest Browse all 22849

John Ludhi/nbshare.io: PySpark Replace Values In DataFrames

$
0
0

PySpark Replace Values In DataFrames Using regexp_replace(), translate() and Overlay() Functions

regexp_replace(), translate(), and overlay() functions can be used to replace values in PySpark Dataframes.

First we load the important libraries

In [1]:
frompyspark.sqlimportSparkSessionfrompyspark.sql.functionsimport(col,regexp_replace,translate,overlay,when,expr)
In [25]:
# initializing spark session instancespark=SparkSession.builder.appName('snippets').getOrCreate()

Then load our initial records

In [3]:
columns=["Full_Name","Salary","Last_Name_Pattern","Last_Name_Replacement"]data=[('Sam A Smith','1,000.01','Sm','Griffi'),('Alex Wesley Jones','120,000.89','Jo','Ba'),('Steve Paul Jobs','5,000.90','Jo','Bo')]
In [4]:
# converting data to rddsrdd=spark.sparkContext.parallelize(data)
In [5]:
# Then creating a dataframe from our rdd variabledfFromRDD2=spark.createDataFrame(rdd).toDF(*columns)
In [6]:
# visualizing current data before manipulationdfFromRDD2.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|      Sam A Smith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Jones|120,000.89|               Jo|                   Ba|
|  Steve Paul Jobs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

PySpark regex_replace

regex_replace: we will use the regex_replace(col_name, pattern, new_value) to replace character(s) in a string column that match the pattern with the new_value

1) Here we are replacing the characters 'Jo' in the Full_Name with 'Ba'

In [7]:
# here we update the column called 'Full_Name' by replacing some characters in the name that fit the criteriamodified_dfFromRDD2=dfFromRDD2.withColumn("Full_Name",regexp_replace('Full_Name','Jo','Ba'))
In [8]:
# visualizing the modified dataframe. We see that only the last two names are updated as those meet our criteriamodified_dfFromRDD2.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|      Sam A Smith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

2) In the above example, we see that only two values (Jones, Jobs) are replaced but not Smith. We can use when function to replace column values conditionally

In [9]:
# Here we update the column called 'Full_Name' by replacing some characters in the name that fit the criteria# based on the conditionsmodified_dfFromRDD3=dfFromRDD2.withColumn("Full_Name",when(col('Full_Name').endswith('th'),regexp_replace('Full_Name','Smith','Griffith'))\
                                                         .otherwise(regexp_replace('Full_Name','Jo','Ba')))
In [10]:
# visualizing the modified dataframe we see how all the column values are updated based on the conditions providedmodified_dfFromRDD3.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|   Sam A Griffith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

3) We can also use a regex to replace characters. As an example we are making the decimal digits in the salary column to '00'.

In [11]:
modified_dfFromRDD4=dfFromRDD2.withColumn("Salary",regexp_replace('Salary','\\.\d\d$','.00 \\$'))
In [12]:
# visualizing the modified dataframe, we see how the Salary column is updatedmodified_dfFromRDD4.show(truncate=False)
+-----------------+------------+-----------------+---------------------+
|Full_Name        |Salary      |Last_Name_Pattern|Last_Name_Replacement|
+-----------------+------------+-----------------+---------------------+
|Sam A Smith      |1,000.00 $  |Sm               |Griffi               |
|Alex Wesley Jones|120,000.00 $|Jo               |Ba                   |
|Steve Paul Jobs  |5,000.00 $  |Jo               |Bo                   |
+-----------------+------------+-----------------+---------------------+

4) Now we will use another regex example to replace varialbe number of characters where the pattern matches regex. Here we replace all lower case characters in the Full_Name column with '--'

In [13]:
# Replace only the lowercase characters in the Full_Name with --modified_dfFromRDD5=dfFromRDD2.withColumn("Full_Name",regexp_replace('Full_Name','[a-z]+','--'))
In [14]:
# visualizing the modified data frame. We see that all the lowercase characters are replaced.# The uppercase characters are same as they were beforemodified_dfFromRDD5.show()
+-----------+----------+-----------------+---------------------+
|  Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------+----------+-----------------+---------------------+
|  S-- A S--|  1,000.01|               Sm|               Griffi|
|A-- W-- J--|120,000.89|               Jo|                   Ba|
|S-- P-- J--|  5,000.90|               Jo|                   Bo|
+-----------+----------+-----------------+---------------------+

5) We can also use regex_replace with expr to replace a column's value with a match pattern from a second column with the values from third column i.e 'regexp_replace(col1, col2, col3)'. Here we are going to replace the characters in column 1, that match the pattern in column 2 with characters from column 3.

In [15]:
# Here we update the column called 'Full_Name' by replacing some characters in the 'Full_Name' that match the values# in 'Last_Name_Pattern' with characters in 'Last_Name_Replacement'modified_dfFromRDD6=modified_dfFromRDD2.withColumn("Full_Name",expr("regexp_replace(Full_Name, Last_Name_Pattern, Last_Name_Replacement)"))
In [16]:
# visualizing the modified dataframe. # The Full_Name column has been updated with some characters from Last_Name_Replacementmodified_dfFromRDD6.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|  Sam A Griffiith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

PySpark translate()

translate(): This function is used to do character by character replacement of column values

In [17]:
# here we update the column called 'Full_Name' by replacing the lowercase characters in the following way:# each 'a' is replaced by 0, 'b' by 1, 'c' by 2, .....'i' by 8 and j by 9alphabets='abcdefjhij'digits='0123456789'modified_dfFromRDD7=dfFromRDD2.withColumn("Full_Name",translate('Full_Name',alphabets,digits))
In [18]:
# visualizing the modified dataframe we see the replacements has been done character by charactermodified_dfFromRDD7.show(truncate=False)
+-----------------+----------+-----------------+---------------------+
|Full_Name        |Salary    |Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|S0m A Sm8t7      |1,000.01  |Sm               |Griffi               |
|Al4x W4sl4y Jon4s|120,000.89|Jo               |Ba                   |
|St4v4 P0ul Jo1s  |5,000.90  |Jo               |Bo                   |
+-----------------+----------+-----------------+---------------------+

PySpark overlay()

overlay(src_col, replace_col, src_start_pos, src_char_len <default -1>): This function is used to replace the values in a src_col column from src_start_pos with values from replace_col. This replacement starts from src_start_pos and replaces src_char_len characters (by default replaces replace_col length characters)

In [19]:
# Here the first two characters are replaced by the replacement string in Last_Name_Replacement columnmodified_dfFromRDD8=dfFromRDD2.select('Full_Name',overlay("Full_Name","Last_Name_Replacement",1,2).alias("FullName_Overlayed"))
In [20]:
# Visualizing the modified dataframemodified_dfFromRDD8.show()
+-----------------+------------------+
|        Full_Name|FullName_Overlayed|
+-----------------+------------------+
|      Sam A Smith|   Griffim A Smith|
|Alex Wesley Jones| Baex Wesley Jones|
|  Steve Paul Jobs|   Boeve Paul Jobs|
+-----------------+------------------+

In [21]:
# Here we replace characters starting from position 5 (1-indexed) and replace characters equal to the # length of the replacement stringmodified_dfFromRDD9=dfFromRDD2.select('Full_Name',overlay("Full_Name","Last_Name_Replacement",5).alias("FullName_Overlayed"))
In [22]:
# Visualizing the modified dataframemodified_dfFromRDD9.show()
+-----------------+------------------+
|        Full_Name|FullName_Overlayed|
+-----------------+------------------+
|      Sam A Smith|       Sam Griffih|
|Alex Wesley Jones| AlexBaesley Jones|
|  Steve Paul Jobs|   StevBoPaul Jobs|
+-----------------+------------------+

In [23]:
spark.stop()

Viewing all articles
Browse latest Browse all 22849

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>