Channel: Planet Python

Shannon -jj Behrens: Python: My Favorite Python Tricks for LeetCode Questions


I've been spending a lot of time practicing on LeetCode recently, so I thought I'd share some of my favorite intermediate-level Python tricks. I'll also cover some newer features of Python you may not have started using yet. I'll start with basic tips and then move to more advanced ones.

Get help()

Python's documentation is pretty great, and some of these examples are taken from there.

For instance, if you just google "heapq", you'll see the official docs for heapq, which are often enough.

However, it's also helpful to sometimes just quickly use help() in the shell. Here, I can't remember that push() is actually called append().

>>> help([])

>>> dir([])

>>> help([].append)

enumerate()

If you need to loop over a list, you can use enumerate() to get both the item as well as the index. As a mnemonic, I like to think for (i, x) in enumerate(...):

for (i, x) in enumerate(some_list):
    ...

items()

Similarly, you can get both the key and the value at the same time when looping over a dict using items():

for (k, v) in some_dict.items():
    ...

[] vs. get()

Remember, when you use [] with a dict, if the value doesn't exist, you'll get a KeyError. Rather than see if an item is in the dict and then look up its value, you can use get():

val = some_dict.get(key)  # It defaults to None.
if val is None:
    ...

Similarly, .setdefault() is sometimes helpful.

Some people prefer to just use [] and handle the KeyError since exceptions aren't as expensive in Python as they are in other languages.
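
For example, here's a small sketch (my own, not from the original post) of get() with a default value and of setdefault():

counts = {}
val = counts.get('missing', 0)  # Returns 0 instead of raising a KeyError.

groups = {}
groups.setdefault('evens', []).append(2)  # Creates the list the first time.
groups.setdefault('evens', []).append(4)  # Reuses it afterward.
# groups == {'evens': [2, 4]}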

range() is smarter than you think

# Loop over the items directly.
for item in items:
    ...

# Or loop over the indexes.
for index in range(len(items)):
    ...

# Count by 2s.
for i in range(0, 100, 2):
    ...

# Count backward from 100 to 0 inclusive.
for i in range(100, -1, -1):
    ...

# Okay, Mr. Smarty Pants, I'm sure you knew all that, but did you know
# that you can pass a range object around, and it knows how to reverse
# itself via slice notation? :-P
r = range(100)
r = r[::-1] # range(99, -1, -1)

print(f'') debugging

Have you switched to Python's f-strings yet? They're more convenient and safer (from injection vulnerabilities) than % and .format(). They even have a syntax for outputting the expression as well as its value:

# Got 2+2=4
print(f'Got {2+2=}')

for else

Python has a feature that I haven't seen in other programming languages. Both for and while can be followed by an else clause, which is useful when you're searching for something.

for item in some_list:
    if is_what_im_looking_for(item):
        print(f"Yay! It's {item}.")
        break
else:
    print("I couldn't find what I was looking for.")

Use a list as a stack

The cost of using a list as a stack is (amortized) O(1):

elements = []
elements.append(element) # Not push
element = elements.pop()

Note that inserting something at the beginning of the list or in the middle is more expensive because it has to shift everything to the right--see deque below.

sort() vs. sorted()

# sort() sorts a list in place.
my_list.sort()

# Whereas sorted() returns a sorted *copy* of an iterable:
my_sorted_list = sorted(some_iterable)

And, both of these can take a key function if you need to sort objects.
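
For instance, here's a quick sketch (with made-up data) of sorting with a key function:

people = [('Alice', 31), ('Bob', 27)]

# Sort the list in place by age.
people.sort(key=lambda person: person[1])

# Or get a sorted copy, youngest first.
youngest_first = sorted(people, key=lambda person: person[1])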

set and frozenset

Sets are so useful for so many problems! Just in case you didn't know some of these tricks:

# There is now syntax for creating sets.
s = {'Von'}

# There are set "comprehensions" which are like list comprehensions, but for sets.
s2 = {f'{name} the III' for name in s}
{'Von the III'}

# If you can't remember how to use union, intersection, difference, etc.
help(set())

# If you need an immutable set, for instance, to use as a dict key, use frozenset.
frozenset((1, 2, 3))

deque

If you find yourself needing a queue or a list that you can push and pop from either side, use a deque:

>>> from collections import deque
>>>
>>> d = deque()
>>> d.append(3)
>>> d.append(4)
>>> d.appendleft(2)
>>> d.appendleft(1)
>>> d
deque([1, 2, 3, 4])
>>> d.popleft()
1
>>> d.pop()
4

Using a stack instead of recursion

Instead of using recursion (which is limited to a depth of about 1000 frames by default), you can use a while loop and manually manage a stack yourself. Here's a slightly contrived example:

work = [create_initial_work()]
while work:
    work_item = work.pop()
    result = process(work_item)
    if is_done(result):
        return result
    work.append(result.pieces[0])
    work.append(result.pieces[1])

Using yield from

If you don't know about yield, you can go spend some time learning about that. It's awesome.

Sometimes, when you're in one generator, you need to call another generator. Python now has yield from for that:

def my_generator():
    yield 1
    yield from some_other_generator()
    yield 6

So, here's an example of backtracking:

class Solution:
    def problem(self, digits: str) -> List[str]:
        def generate_possibilities(work_so_far, remaining_work):
            if not remaining_work:
                if work_so_far:
                    yield work_so_far
                return
            first_part, remaining_part = remaining_work[0], remaining_work[1:]
            for i in things_to_try:
                yield from generate_possibilities(work_so_far + i, remaining_part)

        output = list(generate_possibilities(no_work_so_far, its_all_remaining_work))
        return output

This is appropriate if you have fewer than 1000 "levels" but a ton of possibilities for each of those levels. It won't work if you're going to need more than 1000 layers of recursion. In that case, switch to "Using a stack instead of recursion".

Pre-initialize your list

If you know how long your list is going to be ahead of time, you can avoid needing to resize it multiple times by just pre-initializing it:

dp = [None] * len(items)

collections.Counter()

How many times have you used a dict to count up something? It's built-in in Python:

>>> from collections import Counter
>>> c = Counter('abcabcabcaaa')
>>> c
Counter({'a': 6, 'b': 3, 'c': 3})

defaultdict

Similarly, there's defaultdict:

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> d['girls'].append('Jocylenn')
>>> d['boys'].append('Greggory')
>>> d
defaultdict(<class 'list'>, {'girls': ['Jocylenn'], 'boys': ['Greggory']})

Notice that I didn't need to set d['girls'] to an empty list before I started appending to it.

heapq

I had heard of heaps in school, but I didn't really know what they were. Well, it turns out they're pretty helpful for several of the problems, and Python has a list-based heap implementation built-in.

If you don't know what a heap is, I recommend this video and this video. They'll explain what a heap is and how to implement one using a list.

The heapq module is a built-in module for managing a heap. It builds on top of an existing list:

import heapq

some_list = ...
heapq.heapify(some_list)

# The head of the heap is some_list[0].
# The len of the heap is still len(some_list).

heapq.heappush(some_list, item)
head_item = heapq.heappop(some_list)

The heapq module also has nlargest and nsmallest built-in so you don't have to implement those things yourself.
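
For instance (a quick sketch of my own):

import heapq

scores = [5, 1, 9, 3, 7]
print(heapq.nlargest(2, scores))   # [9, 7]
print(heapq.nsmallest(2, scores))  # [1, 3]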

Keep in mind that heapq is a minheap. Let's say that what you really want is a maxheap, and you're not working with ints, you're working with objects. Here's how to tweak your data to get it to fit heapq's way of thinking:

heap = []
heapq.heappush(heap, (-obj.value, obj))

(ignored, first_obj) = heapq.heappop(heap)

Here, I'm using - to make it a maxheap. I'm wrapping things in a tuple so that it's sorted by the obj.value, and I'm including the obj as the second value so that I can get it.

bisect

I'm sure you've implemented binary search before. Python has it built-in. It even has keyword arguments that you can use to search in only part of the list:

import bisect

insertion_point = bisect.bisect_left(sorted_list, some_item, lo=lo, hi=hi)

Pay attention to the key argument, which is sometimes useful but may take a little work to behave the way you want.

namedtuple and dataclasses

Tuples are great, but it can be a pain to deal with remembering the order of the elements or unpacking just a single element in the tuple. That's where namedtuple comes in.

>>> from collections import namedtuple
>>> Point = namedtuple('Point', ['x', 'y'])
>>> p = Point(5, 7)
>>> p
Point(x=5, y=7)
>>> p.x
5
>>> q = p._replace(x=92)
>>> p
Point(x=5, y=7)
>>> q
Point(x=92, y=7)

Keep in mind that tuples are immutable. I particularly like using namedtuples for backtracking problems. In that case, the immutability is actually a huge asset. I use a namedtuple to represent the state of the problem at each step. I have this much stuff done, this much stuff left to do, this is where I am, etc. At each step, you take the old namedtuple and create a new one in an immutable way.
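
Here's a tiny sketch (the field names are my own) of that pattern, using _replace() to derive a new state rather than mutating the old one:

from collections import namedtuple

State = namedtuple('State', ['done', 'remaining'])

def step(state):
    # Build a brand new State; the old one is untouched.
    return state._replace(done=state.done + state.remaining[:1],
                          remaining=state.remaining[1:])

start = State(done='', remaining='abc')
print(step(start))  # State(done='a', remaining='bc')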

If you need something mutable, use a dataclass instead:

from dataclasses import dataclass

@dataclass
class InventoryItem:
    """Class for keeping track of an item in inventory."""
    name: str
    unit_price: float
    quantity_on_hand: int = 0

    def total_cost(self) -> float:
        return self.unit_price * self.quantity_on_hand

item = InventoryItem(name='Box', unit_price=19, quantity_on_hand=2)

dataclasses are great when you want a little class to hold some data, but you don't want to waste much time writing one from scratch.

int, decimal, math.inf, etc.

Thankfully, Python's int type supports arbitrarily large values by default:

>>> 1 << 128
340282366920938463463374607431768211456

There's also the decimal module if you need to work with things like money where a float isn't accurate enough or when you need a lot of decimal places of precision.
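
Here's a quick illustration (my own, not from the post) of why decimal helps:

from decimal import Decimal

print(0.1 + 0.2)                        # 0.30000000000000004
print(Decimal('0.1') + Decimal('0.2'))  # 0.3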

Sometimes, they'll say the range is -2 ^ 32 to 2 ^ 32 - 1. You can get those values via bitshifting:

>>> -(2 ** 32) == -(1 << 32)
True
>>> (2 ** 32) - 1 == (1 << 32) - 1
True

Sometimes, it's useful to initialize a variable with math.inf (i.e. infinity) and then try to find new values less than that.
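
For example, a minimal sketch (with a made-up costs list) of the "start at infinity" pattern:

import math

costs = [7, 3, 9]
best = math.inf
for cost in costs:
    if cost < best:
        best = cost
# best is now 3.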

Closures

I'm not sure every interviewer is going to like this, but I tend to skip the OOP stuff and use a bunch of local helper functions so that I can access things via closure:

class Solution():  # This is what LeetCode gave me.
    def solveProblem(self, arg1, arg2):  # Why they used camelCase, I have no idea.

        def helper_function():
            # I have access to arg1 and arg2 via closure.
            # I don't have to store them on self or pass them around
            # explicitly.
            return arg1 + arg2

        counter = 0

        def can_mutate_counter():
            # By using nonlocal, I can even mutate counter.
            # I rarely use this approach in practice. I usually pass it in
            # as an argument and return a value.
            nonlocal counter
            counter += 1

        can_mutate_counter()
        return helper_function() + counter

match statement

Did you know Python now has a match statement?

# Taken from: https://learnpython.com/blog/python-match-case-statement/

>>> command = 'Hello, World!'
>>> match command:
...     case 'Hello, World!':
...         print('Hello to you too!')
...     case 'Goodbye, World!':
...         print('See you later')
...     case other:
...         print('No match found')

It's actually much more sophisticated than a switch statement, so take a look, especially if you've never used match in a functional language like Haskell.

OrderedDict

If you ever need to implement an LRU cache, it'll be quite helpful to have an OrderedDict.

Python's dicts are now ordered by default. However, the docs for OrderedDict say that there are still some cases where you might need to use OrderedDict; I can't remember the details. If you ever need your dicts to be ordered, just read the docs and figure out whether you need an OrderedDict or whether a normal dict will do.
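
As a rough sketch (my own, not a full LeetCode solution), an LRU cache built on OrderedDict might look like this:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return -1
        self.data.move_to_end(key)  # Mark as most recently used.
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # Evict the least recently used item.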

@functools.cache

If you need a cache, sometimes you can just wrap your code in a function and use functools.cache:

from functools import cache

@cache
def factorial(n):
    return n * factorial(n - 1) if n else 1

print(factorial(5))
...
factorial.cache_info()  # e.g. CacheInfo(hits=3, misses=8, maxsize=None, currsize=8)

Debugging ListNodes

A lot of the problems involve a ListNode class that's provided by LeetCode. It's not very "debuggable". Add this code temporarily to improve that:

def list_node_str(head):
    seen_before = set()
    pieces = []
    p = head
    while p is not None:
        if p in seen_before:
            pieces.append(f'loop at {p.val}')
            break
        pieces.append(str(p.val))
        seen_before.add(p)
        p = p.next
    joined_pieces = ', '.join(pieces)
    return f'[{joined_pieces}]'


ListNode.__str__ = list_node_str

Saving memory with the array module

Sometimes you need a really long list of simple numeric (or boolean) values. The array module can help with this, and it's an easy way to decrease your memory usage after you've already gotten your algorithm working.

>>> import array
>>> array_of_bytes = array.array('b')
>>> array_of_bytes.frombytes(b'\0' * (array_of_bytes.itemsize * 10_000_000))

Pay close attention to the type of values you configure the array to accept. Read the docs.

I'm sure there's a way to use individual bits for an array of booleans to save even more space, but it'd probably cost more CPU, and I generally care about CPU more than memory.

Using an exception for the success case rather than the error case

A lot of Python programmers don't like this trick because it's equivalent to goto, but I still occasionally find it convenient:

class Eureka(StopIteration):
    """Eureka means "I found it!" """
    pass


def do_something_else():
    some_value = 5
    raise Eureka(some_value)


def do_something():
    do_something_else()


try:
    do_something()
except Eureka as exc:
    print(f'I found it: {exc.args[0]}')

Using VS Code, etc.

VS Code has a pretty nice Python extension. If you highlight the code and hit shift-enter, it'll run it in a shell. That's more convenient than just typing everything directly in the shell. Other editors have something similar, or perhaps you use a Jupyter notebook for this.

Another thing that helps me is that I'll often have separate files open with separate attempts at a solution. I guess you can call this the "fast" approach to branching.

Conclusion

Well, those are my favorite tricks off the top of my head. I'll add more if I think of any.

This is just a single blog post, but if you want more, check out Python 3 Module of the Week.


Anwesha Das: How to add a renew hook for certbot?


After moving the foss.training server to a new location, we found that the TLS certificate had expired. I looked into it and figured out that although certbot had renewed the certificate, it never reloaded nginx.

Now, to make sure that nginx is reloaded next time, we must add a renew hook in /etc/letsencrypt/renewal/foss.training.conf under the [renewalparams] section:

renew_hook = service nginx reload

One must remember to update the path based on their own domain name. Thank you Saptak for pointing out the expired certificate and mentioning that it is a common pain point for people. I hope that this will be helpful for people in the future.

PyBites: The importance of practicing gratitude


Listen now:

This week we talk about gratitude.

Why? We spotted a trend that people are not saying thanks enough. We often assume things are "just" working, forgetting that there is actually a lot of work involved in keeping things running smoothly.

Expressing gratitude takes relatively little effort, yet it can have a big impact on the motivation of others, even their lives.

We hope you enjoy this exercise and don’t forget to practice gratitude (even if it’s only in your own journal, it can boost your happiness).

Bob & Julian

Links:

– For mindset and career tips, subscribe here.

– To become a better skilled, confident developer, check out our PDM program.

– Our previous podcast episodes are here.

– To give us feedback or share topics you’d like to hear discussed: send us an email.

John Ludhi/nbshare.io: PySpark Substr and Substring


PySpark Substr and Substring

substring(col_name, pos, len) - Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.

First we load the important libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, substring)
In [24]:
# initializing spark session instance
spark = SparkSession.builder.appName('snippets').getOrCreate()

Let us load our initial records.

In [3]:
columns=["Full_Name","Salary"]data=[("John A Smith",1000),("Alex Wesley Jones",120000),("Jane Tom James",5000)]
In [4]:
# converting data to rdds
rdd = spark.sparkContext.parallelize(data)
In [5]:
# Then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
In [6]:
# visualizing current data before manipulation
dfFromRDD2.show()
+-----------------+------+
|        Full_Name|Salary|
+-----------------+------+
|     John A Smith|  1000|
|Alex Wesley Jones|120000|
|   Jane Tom James|  5000|
+-----------------+------+

PySpark substring

1) Here we are taking a substring for the first name from the Full_Name Column. The Full_Name contains first name, middle name and last name. We are adding a new column for the substring called First_Name

In [7]:
# here we add a new column called 'First_Name' and use substring() to get partial string from 'Full_Name' column
modified_dfFromRDD2 = dfFromRDD2.withColumn("First_Name", substring('Full_Name', 1, 4))
In [8]:
# visualizing the modified dataframe
modified_dfFromRDD2.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

2) We can also get a substring with select and alias to achieve the same result as above

In [9]:
modified_dfFromRDD3 = dfFromRDD2.select("Full_Name", 'Salary', substring('Full_Name', 1, 4).alias('First_Name'))
In [10]:
# visualizing the modified dataframe after executing the above.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD3.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

3) We can also use substring with selectExpr to get a substring of 'Full_Name' column. selectExpr takes SQL expression(s) in a string to execute. This way we can run SQL-like expressions without creating views.

In [11]:
modified_dfFromRDD4 = dfFromRDD2.selectExpr("Full_Name", 'Salary', 'substring(Full_Name, 1, 4) as First_Name')
In [12]:
# visualizing the modified dataframe after executing the above.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD4.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

4) Here we are going to use substr function of the Column data type to obtain the substring from the 'Full_Name' column and create a new column called 'First_Name'

In [13]:
modified_dfFromRDD5 = dfFromRDD2.withColumn("First_Name", col('Full_Name').substr(1, 4))
In [14]:
# visualizing the modified dataframe yields the same output as seen for all previous examples.
modified_dfFromRDD5.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

5) Let us now consider an example of substring() when the indices go beyond the length of the column value. In that case, the substring() function only returns the characters that fall within the bounds, i.e. (start, start+len). This can be seen in the example below.

In [15]:
# In this example we are going to get the four characters of Full_Name column starting from position 14.
# As can be seen in the example, 4 or fewer characters are returned depending on the string length
modified_dfFromRDD6 = dfFromRDD2.withColumn("Last_Name", substring('Full_Name', 14, 4))
In [16]:
modified_dfFromRDD6.show()
+-----------------+------+---------+
|        Full_Name|Salary|Last_Name|
+-----------------+------+---------+
|     John A Smith|  1000|         |
|Alex Wesley Jones|120000|     ones|
|   Jane Tom James|  5000|        s|
+-----------------+------+---------+

The above method produces the wrong last name. We can fix it with the following approach.

6) Another example of substring() is when we want to get characters relative to the end of the string. In this example, we are going to extract the last name from the Full_Name column.

In [17]:
# In this example we are going to get the five characters of Full_Name column relative to the end of the string.
# As can be seen in the example, the last 5 characters are returned
modified_dfFromRDD7 = dfFromRDD2.withColumn("Last_Name", substring('Full_Name', -5, 5))
In [18]:
modified_dfFromRDD7.show()
+-----------------+------+---------+
|        Full_Name|Salary|Last_Name|
+-----------------+------+---------+
|     John A Smith|  1000|    Smith|
|Alex Wesley Jones|120000|    Jones|
|   Jane Tom James|  5000|    James|
+-----------------+------+---------+

Note that the above approach works only if the last name in each row has a constant character length. If the last names have different lengths, the solution is not that simple.
We will need the index at which the last name starts and also the length of 'Full_Name'. If you are curious, I have provided the solution below without the explanation.

In [19]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, substring, lit, substring_index, length)

Let us create an example with last names having variable character length.

In [20]:
columns=["Full_Name","Salary"]data=[("John A Smith",1000),("Alex Wesley leeper",120000),("Jane Tom kinderman",5000)]rdd=spark.sparkContext.parallelize(data)dfFromRDD2=spark.createDataFrame(rdd).toDF(*columns)dfFromRDD2.show()
+------------------+------+
|         Full_Name|Salary|
+------------------+------+
|      John A Smith|  1000|
|Alex Wesley leeper|120000|
|Jane Tom kinderman|  5000|
+------------------+------+

PySpark substr

In [21]:
dfFromRDD2.withColumn('Last_Name',
                      col("Full_Name").substr((length('Full_Name') - length(substring_index('Full_Name', " ", -1))),
                                              length('Full_Name'))).show()
+------------------+------+----------+
|         Full_Name|Salary| Last_Name|
+------------------+------+----------+
|      John A Smith|  1000|     Smith|
|Alex Wesley leeper|120000|    leeper|
|Jane Tom kinderman|  5000| kinderman|
+------------------+------+----------+

In [22]:
spark.stop()

Real Python: The Real Python Podcast – Episode #122: Configuring a Coding Environment on Windows & Using TOML With Python


Have you attempted to set up a Python development environment on Windows before? Would it be helpful to have an easy-to-follow guide to get you started? This week on the show, Christopher Trudeau is here, bringing another batch of PyCoder's Weekly articles and projects.



Python for Beginners: Check For Superset in Python


In Python, we use sets to store unique immutable objects. In this article, we will discuss what a superset of a set is. We will also discuss ways to check for a superset in Python.

What is a Superset?

A superset of a set is another set that contains all the elements of the given set. In other words, If we have a set A and set B, and each element of set B belongs to set A, then set A is said to be a superset of set B.

Let us consider an example where we are given three sets A, B, and C as follows.

A={1,2,3,4,5,6,7,8}

B={2,4,6,8}

C={0,1,2,3,4}

Here, you can observe that all the elements in set B are present in set A. Hence, set A is a superset of set B. On the other hand, all the elements of set C do not belong to set A. Hence, set A is not a superset of set C.

You can observe that a superset will always have at least as many elements as the original set. Now, let us describe a step-by-step algorithm to check for a superset in Python.

Suggested Reading: Chat Application in Python

How to Check For Superset in Python?

Consider that we are given two sets A and B. Now, we have to check if set B is a superset of set A or not. For this, we will traverse all the elements of set A and check whether they are present in set B or not. If there exists an element in set A that doesn’t belong to set B, we will say that set B is not a superset of set A. Otherwise, set B will be a superset of set A. 

To implement this approach in Python, we will use a for loop and a flag variable isSuperset. We will initialize the isSuperset variable to True denoting that set B is a superset of set A. Now we will traverse set A using a for loop. While traversing the elements in set A, we will check if the element is present in set B or not. 

If we find any element in A that isn’t present in set B, we will assign False to isSuperset showing that set B is not a superset of the set A. 

If we do not find any element in set A that does not belong to set B, the isSuperset variable will contain the value True showing that set B is a superset of set A. The entire logic to check for superset  can be implemented in Python as follows.

def checkSuperset(set1, set2):
    isSuperset = True
    for element in set2:
        if element not in set1:
            isSuperset = False
            break
    return isSuperset


A = {1, 2, 3, 4, 5, 6, 7, 8}
B = {2, 4, 6, 8}
C = {0, 1, 2, 3, 4}
print("Set {} is: {}".format("A", A))
print("Set {} is: {}".format("B", B))
print("Set {} is: {}".format("C", C))
print("Set A is superset of B :", checkSuperset(A, B))
print("Set A is superset of C :", checkSuperset(A, C))
print("Set B is superset of C :", checkSuperset(B, C))

Output:

Set A is: {1, 2, 3, 4, 5, 6, 7, 8}
Set B is: {8, 2, 4, 6}
Set C is: {0, 1, 2, 3, 4}
Set A is superset of B : True
Set A is superset of C : False
Set B is superset of C : False

Check For Superset Using issuperset() Method

We can also use the issuperset() method to check for superset in python. The issuperset() method, when invoked on a set A, accepts a set B as input argument and returns True if set A is a superset of B. Otherwise, it returns False.

You can use the issuperset() method to check for superset in python as follows.

A = {1, 2, 3, 4, 5, 6, 7, 8}
B = {2, 4, 6, 8}
C = {0, 1, 2, 3, 4}
print("Set {} is: {}".format("A", A))
print("Set {} is: {}".format("B", B))
print("Set {} is: {}".format("C", C))
print("Set A is superset of B :", A.issuperset(B))
print("Set A is superset of C :", A.issuperset(C))
print("Set B is superset of C :", B.issuperset(C))

Output:

Set A is: {1, 2, 3, 4, 5, 6, 7, 8}
Set B is: {8, 2, 4, 6}
Set C is: {0, 1, 2, 3, 4}
Set A is superset of B : True
Set A is superset of C : False
Set B is superset of C : False
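
Note that Python's set type also overloads the comparison operators for this check. For example, A >= B is equivalent to A.issuperset(B):

A = {1, 2, 3, 4, 5, 6, 7, 8}
B = {2, 4, 6, 8}
print(A >= B)  # True, same as A.issuperset(B)
print(A > B)   # True only if A is a proper superset of B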

Conclusion

In this article, we have discussed two ways to check for a superset in Python. To learn more about sets, you can read this article on set comprehension in Python. You might also like this article on list comprehension in Python.

The post Check For Superset in Python appeared first on PythonForBeginners.com.

eGenix.com: eGenix Antispam Bot for Telegram 0.4.0 GA


Introduction

eGenix has long been running a local user group meeting in Düsseldorf called Python Meeting Düsseldorf and we are using a Telegram group for most of our communication.

In the early days, the group worked well and we only had a few spammers joining it, which we could easily handle manually.

More recently, this has changed dramatically. We are seeing between 2 and 5 spam signups per day, often at night. Furthermore, the signup accounts are not always easy to spot as spammers, since they often come with profile images, descriptions, etc.

With the bot, we now have a more flexible way of dealing with the problem.

Please see our project page for details and download links.

Features

  • Low impact mode of operation: the bot tries to keep noise in the group to a minimum
  • Several challenge mechanisms to choose from, more can be added as needed
  • Flexible and easy to use configuration
  • Only needs a few MB of RAM, so can easily be put into a container or run on a Raspberry Pi
  • Can handle quite a bit of load due to the async implementation
  • Works with Python 3.9+
  • MIT open source licensed

News

The 0.4.0 release fixes a few bugs and adds more features:

  • Added new challenge MathMultiplyChallenge
  • Made the MathAddChallenge and MathMultiplyChallenge a little more difficult

It has been battle-tested in production for several months already and is proving to be a really useful tool to help with Telegram group administration.

More Information

For more information on the eGenix.com Python products, licensing and download instructions, please write to sales@egenix.com.

Enjoy !

Marc-Andre Lemburg, eGenix.com

Stack Abuse: RetinaNet Object Detection with PyTorch and torchvision


Introduction

Object detection is a large field in computer vision, and one of the more important applications of computer vision "in the wild". On one end, it can be used to build autonomous systems that navigate agents through environments - be it robots performing tasks or self-driving cars, but this requires intersection with other fields. However, anomaly detection (such as defective products on a line), locating objects within images, facial detection and various other applications of object detection can be done without intersecting other fields.

Advice This short guide is based on a small part of a much larger lesson on object detection belonging to our "Practical Deep Learning for Computer Vision with Python" course.

Object detection isn't as standardized as image classification, mainly because most of the new developments are typically done by individual researchers, maintainers and developers, rather than large libraries and frameworks. It's difficult to package the necessary utility scripts in a framework like TensorFlow or PyTorch and maintain the API guidelines that guided the development so far.

This makes object detection somewhat more complex, typically more verbose (but not always), and less approachable than image classification. One of the major benefits of working within an established ecosystem is that it spares you from having to search far for useful information on good practices, tools and approaches to use. With object detection - most have to do way more research on the landscape of the field to get a good grip.

Object Detection with PyTorch/TorchVision's RetinaNet

torchvision is PyTorch's Computer Vision project, and aims to make the development of PyTorch-based CV models easier, by providing transformation and augmentation scripts, a model zoo with pre-trained weights, datasets and utilities that can be useful for a practitioner.

While still in beta and very much experimental - torchvision offers a relatively simple Object Detection API with a few models to choose from:

  • Faster R-CNN
  • RetinaNet
  • FCOS (Fully Convolutional One-Stage detection)
  • SSD (VGG16 backbone... yikes)
  • SSDLite (MobileNetV3 backbone)

While the API isn't as polished or simple as some other third-party APIs, it's a very decent starting point for those who'd still prefer the safety of being in an ecosystem they're familiar with. Before going forward, make sure you install PyTorch and Torchvision:

$ pip install torch torchvision

Let's load in some of the utility functions, such as read_image(), draw_bounding_boxes() and to_pil_image() to make it easier to read, draw on and output images, followed by importing RetinaNet and its pre-trained weights (MS COCO):

from torchvision.io.image import read_image
from torchvision.utils import draw_bounding_boxes
from torchvision.transforms.functional import to_pil_image
from torchvision.models.detection import retinanet_resnet50_fpn_v2, RetinaNet_ResNet50_FPN_V2_Weights

import matplotlib.pyplot as plt

RetinaNet uses a ResNet50 backbone and a Feature Pyramid Network (FPN) on top of it. While the name of the class is verbose, it's indicative of the architecture. Let's fetch an image using the requests library and save it as a file on our local drive:

import requests
response = requests.get('https://i.ytimg.com/vi/q71MCWAEfL8/maxresdefault.jpg')
open("obj_det.jpeg", "wb").write(response.content)

img = read_image("obj_det.jpeg")

With an image in place - we can instantiate our model and weights:

weights = RetinaNet_ResNet50_FPN_V2_Weights.DEFAULT
model = retinanet_resnet50_fpn_v2(weights=weights, score_thresh=0.35)
# Put the model in inference mode
model.eval()
# Get the transforms for the model's weights
preprocess = weights.transforms()

The score_thresh argument defines the threshold at which an object is detected as an object of a class. Intuitively, it's the confidence threshold, and we won't classify an object to belong to a class if the model is less than 35% confident that it belongs to a class.

Let's preprocess the image using the transforms from our weights, create a batch and run inference:

batch = [preprocess(img)]
prediction = model(batch)[0]

That's it, our prediction dictionary holds the inferred object classes and locations! Now, the results aren't very useful for us in this form - we'll want to extract the labels with respect to the metadata from the weights and draw bounding boxes, which can be done via draw_bounding_boxes():

labels = [weights.meta["categories"][i] for i in prediction["labels"]]

box = draw_bounding_boxes(img, boxes=prediction["boxes"],
                          labels=labels,
                          colors="cyan",
                          width=2, 
                          font_size=30,
                          font='Arial')

im = to_pil_image(box.detach())

fig, ax = plt.subplots(figsize=(16, 12))
ax.imshow(im)
plt.show()

This results in:

RetinaNet actually classified the person peeking behind the car! That's a pretty difficult classification.

You can switch RetinaNet out for FCOS (a fully convolutional, anchor-free detector) by replacing retinanet_resnet50_fpn_v2 with fcos_resnet50_fpn, and use the FCOS_ResNet50_FPN_Weights weights:

from torchvision.io.image import read_image
from torchvision.utils import draw_bounding_boxes
from torchvision.transforms.functional import to_pil_image
from torchvision.models.detection import fcos_resnet50_fpn, FCOS_ResNet50_FPN_Weights

import matplotlib.pyplot as plt
import requests
response = requests.get('https://i.ytimg.com/vi/q71MCWAEfL8/maxresdefault.jpg')
open("obj_det.jpeg", "wb").write(response.content)

img = read_image("obj_det.jpeg")
weights = FCOS_ResNet50_FPN_Weights.DEFAULT
model = fcos_resnet50_fpn(weights=weights, score_thresh=0.35)
model.eval()

preprocess = weights.transforms()
batch = [preprocess(img)]
prediction = model(batch)[0]

labels = [weights.meta["categories"][i] for i in prediction["labels"]]

box = draw_bounding_boxes(img, boxes=prediction["boxes"],
                          labels=labels,
                          colors="cyan",
                          width=2, 
                          font_size=30,
                          font='Arial')

im = to_pil_image(box.detach())

fig, ax = plt.subplots(figsize=(16, 12))
ax.imshow(im)
plt.show()

Going Further - Practical Deep Learning for Computer Vision

Your inquisitive nature makes you want to go further? We recommend checking out our Course: "Practical Deep Learning for Computer Vision with Python".

Another Computer Vision Course?

We won't be doing classification of MNIST digits or MNIST fashion. They served their part a long time ago. Too many learning resources are focusing on basic datasets and basic architectures before letting advanced black-box architectures shoulder the burden of performance.

We want to focus on demystification, practicality, understanding, intuition and real projects. Want to learn how you can make a difference? We'll take you on a ride from the way our brains process images to writing a research-grade deep learning classifier for breast cancer to deep learning networks that "hallucinate", teaching you the principles and theory through practical work, equipping you with the know-how and tools to become an expert at applying deep learning to solve computer vision.

What's inside?

  • The first principles of vision and how computers can be taught to "see"
  • Different tasks and applications of computer vision
  • The tools of the trade that will make your work easier
  • Finding, creating and utilizing datasets for computer vision
  • The theory and application of Convolutional Neural Networks
  • Handling domain shift, co-occurrence, and other biases in datasets
  • Transfer Learning and utilizing others' training time and computational resources for your benefit
  • Building and training a state-of-the-art breast cancer classifier
  • How to apply a healthy dose of skepticism to mainstream ideas and understand the implications of widely adopted techniques
  • Visualizing a ConvNet's "concept space" using t-SNE and PCA
  • Case studies of how companies use computer vision techniques to achieve better results
  • Proper model evaluation, latent space visualization and identifying the model's attention
  • Performing domain research, processing your own datasets and establishing model tests
  • Cutting-edge architectures, the progression of ideas, what makes them unique and how to implement them
  • KerasCV - a WIP library for creating state of the art pipelines and models
  • How to parse and read papers and implement them yourself
  • Selecting models depending on your application
  • Creating an end-to-end machine learning pipeline
  • Landscape and intuition on object detection with Faster R-CNNs, RetinaNets, SSDs and YOLO
  • Instance and semantic segmentation
  • Real-Time Object Recognition with YOLOv5
  • Training YOLOv5 Object Detectors
  • Working with Transformers using KerasNLP (industry-strength WIP library)
  • Integrating Transformers with ConvNets to generate captions of images
  • DeepDream

Conclusion

Object Detection is an important field of Computer Vision, and one that's unfortunately less approachable than it should be.

In this short guide, we've taken a look at how torchvision, PyTorch's Computer Vision package, makes it easier to perform object detection on images, using RetinaNet.


Mike Driscoll: Python 101 - Debugging Your Code with pdb (Video)


Learn how to debug your Python programs using Python's built-in debugger, pdb with Mike Driscoll

In this tutorial, you will learn the following:

  • Starting pdb in the REPL
  • Starting pdb on the Command Line
  • Stepping Through Code
  • Adding Breakpoints in pdb
  • Creating a Breakpoint with set_trace()
  • Using the built-in breakpoint() Function
  • Getting Help
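
As a quick taste of those last two topics, here is a minimal sketch (my own example, not from the video) that drops you into pdb using the built-in breakpoint() function:

def buggy_average(numbers):
    total = sum(numbers)
    breakpoint()  # Pauses execution here and opens the pdb prompt (Python 3.7+).
    return total / len(numbers)

buggy_average([2, 4, 6])
# Inside pdb you can print variables (p total), step (n), continue (c), or get help (h).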

This video is based on a chapter from the book, Python 101 by Mike Driscoll

Related Articles

The post Python 101 - Debugging Your Code with pdb (Video) appeared first on Mouse Vs Python.

Stack Abuse: Object Detection and Instance Segmentation in Python with Detectron2


Introduction

Object detection is a large field in computer vision, and one of the more important applications of computer vision "in the wild". On one end, it can be used to build autonomous systems that navigate agents through environments - be it robots performing tasks or self-driving cars, but this requires intersection with other fields. However, anomaly detection (such as defective products on a line), locating objects within images, facial detection and various other applications of object detection can be done without intersecting other fields.

Advice This short guide is based on a small part of a much larger lesson on object detection belonging to our "Practical Deep Learning for Computer Vision with Python" course.

Object detection isn't as standardized as image classification, mainly because most of the new developments are typically done by individual researchers, maintainers and developers, rather than large libraries and frameworks. It's difficult to package the necessary utility scripts in a framework like TensorFlow or PyTorch and maintain the API guidelines that guided the development so far.

This makes object detection somewhat more complex, typically more verbose (but not always), and less approachable than image classification. One of the major benefits of working within an established ecosystem is that it spares you from having to search far for useful information on good practices, tools and approaches to use. With object detection - most have to do way more research on the landscape of the field to get a good grip.

Meta AI's Detectron2 - Instance Segmentation and Object Detection

Detectron2 is Meta AI (formerly FAIR - Facebook AI Research)'s open source object detection, segmentation and pose estimation package - all in one. Given an input image, it can return the labels, bounding boxes, confidence scores, masks and skeletons of objects. This is well-represented on the repository's page:

It's meant to be used as a library on top of which you can build research projects. It offers a model zoo with most implementations relying on Mask R-CNN and R-CNNs in general, alongside RetinaNet. They also have pretty decent documentation. Let's run an example inference script!

First, let's install the dependencies:

$ pip install pyyaml==5.1
$ pip install 'git+https://github.com/facebookresearch/detectron2.git'

Next, we'll import the Detectron2 utilities - this is where framework-domain knowledge comes into play. You can construct a detector using the DefaultPredictor class, by passing in a configuration object that sets it up. The Visualizer offers support for visualizing results. MetadataCatalog and DatasetCatalog belong to Detectron2's data API and offer information on built-in datasets as well as their metadata.

Let's import the classes and functions we'll be using:

import torch, detectron2
from detectron2.utils.logger import setup_logger
setup_logger()

from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog, DatasetCatalog

Using requests, we'll download an image and save it to our local drive:

import cv2
import matplotlib.pyplot as plt
import requests
response = requests.get('http://images.cocodataset.org/val2017/000000439715.jpg')
open("input.jpg", "wb").write(response.content)
    
im = cv2.imread("./input.jpg")
fig, ax = plt.subplots(figsize=(18, 8))
ax.imshow(cv2.cvtColor(im, cv2.COLOR_BGR2RGB))

This results in:

Now, we load the configuration, enact changes if need be (the models run on GPU by default, so if you don't have a GPU, you'll want to set the device to 'cpu' in the config):

cfg = get_cfg()

cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
# If you don't have a GPU and CUDA enabled, the next line is required
# cfg.MODEL.DEVICE = "cpu"

Here, we specify which model we'd like to run from the model_zoo. We've imported an instance segmentation model, based on the Mask R-CNN architecture, and with a ResNet50 backbone. Depending on what you'd like to achieve (keypoint detection, instance segmentation, panoptic segmentation or object detection), you'll load in the appropriate model.

Finally, we can construct a predictor with this cfg and run it on the inputs! The Visualizer class is used to draw predictions on the image (in this case, segmented instances, classes and bounding boxes):

predictor = DefaultPredictor(cfg)
outputs = predictor(im)

v = Visualizer(im[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]), scale=1.2)
out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
fig, ax = plt.subplots(figsize=(18, 8))
ax.imshow(out.get_image()[:, :, ::-1])

Finally, this results in:

Going Further - Practical Deep Learning for Computer Vision

Your inquisitive nature makes you want to go further? We recommend checking out our Course: "Practical Deep Learning for Computer Vision with Python".

Another Computer Vision Course?

We won't be doing classification of MNIST digits or MNIST fashion. They served their part a long time ago. Too many learning resources are focusing on basic datasets and basic architectures before letting advanced black-box architectures shoulder the burden of performance.

We want to focus on demystification, practicality, understanding, intuition and real projects. Want to learn how you can make a difference? We'll take you on a ride from the way our brains process images to writing a research-grade deep learning classifier for breast cancer to deep learning networks that "hallucinate", teaching you the principles and theory through practical work, equipping you with the know-how and tools to become an expert at applying deep learning to solve computer vision.

What's inside?

  • The first principles of vision and how computers can be taught to "see"
  • Different tasks and applications of computer vision
  • The tools of the trade that will make your work easier
  • Finding, creating and utilizing datasets for computer vision
  • The theory and application of Convolutional Neural Networks
  • Handling domain shift, co-occurrence, and other biases in datasets
  • Transfer Learning and utilizing others' training time and computational resources for your benefit
  • Building and training a state-of-the-art breast cancer classifier
  • How to apply a healthy dose of skepticism to mainstream ideas and understand the implications of widely adopted techniques
  • Visualizing a ConvNet's "concept space" using t-SNE and PCA
  • Case studies of how companies use computer vision techniques to achieve better results
  • Proper model evaluation, latent space visualization and identifying the model's attention
  • Performing domain research, processing your own datasets and establishing model tests
  • Cutting-edge architectures, the progression of ideas, what makes them unique and how to implement them
  • KerasCV - a WIP library for creating state of the art pipelines and models
  • How to parse and read papers and implement them yourself
  • Selecting models depending on your application
  • Creating an end-to-end machine learning pipeline
  • Landscape and intuition on object detection with Faster R-CNNs, RetinaNets, SSDs and YOLO
  • Instance and semantic segmentation
  • Real-Time Object Recognition with YOLOv5
  • Training YOLOv5 Object Detectors
  • Working with Transformers using KerasNLP (industry-strength WIP library)
  • Integrating Transformers with ConvNets to generate captions of images
  • DeepDream

Conclusion

Instance segmentation goes one step beyond semantic segmentation, and notes the qualitative difference between individual instances of a class (person 1, person 2, etc...) rather than just whether they belong to one. In a way - it's pixel-level classification.

In this short guide, we've taken a quick look at how Detectron2 makes instance segmentation and object detection easy and accessible through their API, using a Mask R-CNN.

John Ludhi/nbshare.io: PySpark Replace Values In DataFrames


PySpark Replace Values In DataFrames Using regexp_replace(), translate() and Overlay() Functions

regexp_replace(), translate(), and overlay() functions can be used to replace values in PySpark Dataframes.

First we load the important libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, regexp_replace, translate, overlay, when, expr)
In [25]:
# initializing spark session instance
spark = SparkSession.builder.appName('snippets').getOrCreate()

Then load our initial records

In [3]:
columns=["Full_Name","Salary","Last_Name_Pattern","Last_Name_Replacement"]data=[('Sam A Smith','1,000.01','Sm','Griffi'),('Alex Wesley Jones','120,000.89','Jo','Ba'),('Steve Paul Jobs','5,000.90','Jo','Bo')]
In [4]:
# converting data to rdds
rdd = spark.sparkContext.parallelize(data)
In [5]:
# Then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
In [6]:
# visualizing current data before manipulation
dfFromRDD2.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|      Sam A Smith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Jones|120,000.89|               Jo|                   Ba|
|  Steve Paul Jobs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

PySpark regexp_replace

regexp_replace: we will use regexp_replace(col_name, pattern, new_value) to replace character(s) in a string column that match the pattern with the new_value.

1) Here we are replacing the characters 'Jo' in the Full_Name with 'Ba'

In [7]:
# here we update the column called 'Full_Name' by replacing some characters in the name that fit the criteria
modified_dfFromRDD2 = dfFromRDD2.withColumn("Full_Name", regexp_replace('Full_Name', 'Jo', 'Ba'))
In [8]:
# visualizing the modified dataframe. We see that only the last two names are updated as those meet our criteria
modified_dfFromRDD2.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|      Sam A Smith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

2) In the above example, we see that only two values (Jones, Jobs) are replaced but not Smith. We can use the when() function to replace column values conditionally.

In [9]:
# Here we update the column called 'Full_Name' by replacing some characters in the name that fit the criteria
# based on the conditions
modified_dfFromRDD3 = dfFromRDD2.withColumn("Full_Name", when(col('Full_Name').endswith('th'), regexp_replace('Full_Name', 'Smith', 'Griffith'))
                                                         .otherwise(regexp_replace('Full_Name', 'Jo', 'Ba')))
In [10]:
# visualizing the modified dataframe we see how all the column values are updated based on the conditions provided
modified_dfFromRDD3.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|   Sam A Griffith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

3) We can also use a regex to replace characters. As an example, we are changing the decimal digits in the Salary column to '00'.

In [11]:
modified_dfFromRDD4 = dfFromRDD2.withColumn("Salary", regexp_replace('Salary', '\\.\d\d$', '.00 \\$'))
In [12]:
# visualizing the modified dataframe, we see how the Salary column is updated
modified_dfFromRDD4.show(truncate=False)
+-----------------+------------+-----------------+---------------------+
|Full_Name        |Salary      |Last_Name_Pattern|Last_Name_Replacement|
+-----------------+------------+-----------------+---------------------+
|Sam A Smith      |1,000.00 $  |Sm               |Griffi               |
|Alex Wesley Jones|120,000.00 $|Jo               |Ba                   |
|Steve Paul Jobs  |5,000.00 $  |Jo               |Bo                   |
+-----------------+------------+-----------------+---------------------+

4) Now we will use another regex example to replace a variable number of characters wherever the pattern matches. Here we replace each run of lowercase characters in the Full_Name column with '--'.

In [13]:
# Replace only the lowercase characters in the Full_Name with --
modified_dfFromRDD5 = dfFromRDD2.withColumn("Full_Name", regexp_replace('Full_Name', '[a-z]+', '--'))
In [14]:
# visualizing the modified data frame. We see that all the lowercase characters are replaced.
# The uppercase characters are same as they were before
modified_dfFromRDD5.show()
+-----------+----------+-----------------+---------------------+
|  Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------+----------+-----------------+---------------------+
|  S-- A S--|  1,000.01|               Sm|               Griffi|
|A-- W-- J--|120,000.89|               Jo|                   Ba|
|S-- P-- J--|  5,000.90|               Jo|                   Bo|
+-----------+----------+-----------------+---------------------+

5) We can also use regexp_replace with expr to replace values in one column using a pattern from a second column and a replacement from a third column, i.e. 'regexp_replace(col1, col2, col3)'. Here we are going to replace the characters in column 1 that match the pattern in column 2 with the characters from column 3.

In [15]:
# Here we update the column called 'Full_Name' by replacing some characters in the 'Full_Name' that match the values
# in 'Last_Name_Pattern' with characters in 'Last_Name_Replacement'
modified_dfFromRDD6 = modified_dfFromRDD2.withColumn("Full_Name", expr("regexp_replace(Full_Name, Last_Name_Pattern, Last_Name_Replacement)"))
In [16]:
# visualizing the modified dataframe.
# The Full_Name column has been updated with some characters from Last_Name_Replacement
modified_dfFromRDD6.show()
+-----------------+----------+-----------------+---------------------+
|        Full_Name|    Salary|Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|  Sam A Griffiith|  1,000.01|               Sm|               Griffi|
|Alex Wesley Banes|120,000.89|               Jo|                   Ba|
|  Steve Paul Babs|  5,000.90|               Jo|                   Bo|
+-----------------+----------+-----------------+---------------------+

PySpark translate()

translate(): This function is used to do character by character replacement of column values

In [17]:
# here we update the column called 'Full_Name' by replacing the lowercase characters in the following way:
# each 'a' is replaced by 0, 'b' by 1, 'c' by 2, .....'i' by 8 and j by 9
alphabets = 'abcdefjhij'
digits = '0123456789'
modified_dfFromRDD7 = dfFromRDD2.withColumn("Full_Name", translate('Full_Name', alphabets, digits))
In [18]:
# visualizing the modified dataframe we see the replacements have been done character by character
modified_dfFromRDD7.show(truncate=False)
+-----------------+----------+-----------------+---------------------+
|Full_Name        |Salary    |Last_Name_Pattern|Last_Name_Replacement|
+-----------------+----------+-----------------+---------------------+
|S0m A Sm8t7      |1,000.01  |Sm               |Griffi               |
|Al4x W4sl4y Jon4s|120,000.89|Jo               |Ba                   |
|St4v4 P0ul Jo1s  |5,000.90  |Jo               |Bo                   |
+-----------------+----------+-----------------+---------------------+

PySpark overlay()

overlay(src_col, replace_col, src_start_pos, src_char_len <default -1>): This function is used to replace the values in a src_col column from src_start_pos with values from replace_col. This replacement starts from src_start_pos and replaces src_char_len characters (by default replaces replace_col length characters)

In [19]:
# Here the first two characters are replaced by the replacement string in Last_Name_Replacement column
modified_dfFromRDD8 = dfFromRDD2.select('Full_Name', overlay("Full_Name", "Last_Name_Replacement", 1, 2).alias("FullName_Overlayed"))
In [20]:
# Visualizing the modified dataframe
modified_dfFromRDD8.show()
+-----------------+------------------+
|        Full_Name|FullName_Overlayed|
+-----------------+------------------+
|      Sam A Smith|   Griffim A Smith|
|Alex Wesley Jones| Baex Wesley Jones|
|  Steve Paul Jobs|   Boeve Paul Jobs|
+-----------------+------------------+

In [21]:
# Here we replace characters starting from position 5 (1-indexed) and replace characters equal to the
# length of the replacement string
modified_dfFromRDD9 = dfFromRDD2.select('Full_Name', overlay("Full_Name", "Last_Name_Replacement", 5).alias("FullName_Overlayed"))
In [22]:
# Visualizing the modified dataframe
modified_dfFromRDD9.show()
+-----------------+------------------+
|        Full_Name|FullName_Overlayed|
+-----------------+------------------+
|      Sam A Smith|       Sam Griffih|
|Alex Wesley Jones| AlexBaesley Jones|
|  Steve Paul Jobs|   StevBoPaul Jobs|
+-----------------+------------------+

In [23]:
spark.stop()

Stack Abuse: Guide to the K-Nearest Neighbors Algorithm in Python and Scikit-Learn


Introduction

The K-nearest Neighbors (KNN) algorithm is a type of supervised machine learning algorithm used for classification, regression as well as outlier detection. It is extremely easy to implement in its most basic form but can perform fairly complex tasks. It is a lazy learning algorithm since it doesn't have a specialized training phase. Rather, it uses all of the data for training while classifying (or regressing) a new data point or instance.

KNN is a non-parametric learning algorithm, which means that it doesn't assume anything about the underlying data. This is an extremely useful feature since most of the real-world data doesn't really follow any theoretical assumption e.g. linear separability, uniform distribution, etc.

In this guide, we will see how KNN can be implemented with Python's Scikit-Learn library. Before that, we'll first explore how we can use KNN and explain the theory behind it. After that, we'll take a look at the California Housing dataset we'll be using to illustrate the KNN algorithm and several of its variations. First, we'll look at how to implement the KNN algorithm for regression, followed by implementations of KNN classification and outlier detection. In the end, we'll conclude with some of the pros and cons of the algorithm.

When Should You Use KNN?

Suppose you wanted to rent an apartment and recently found out your friend's neighbor might put her apartment for rent in 2 weeks. Since the apartment isn't on a rental website yet, how could you try to estimate its rental value?

Let's say your friend pays $1,200 in rent. Your rent value might be around that number, but the apartments aren't exactly the same (orientation, area, furniture quality, etc.), so, it would be nice to have more data on other apartments.

By asking other neighbors and looking at the apartments from the same building that were listed on a rental website, the closest neighboring apartments rent for $1,200, $1,210, $1,210, and $1,215. Those apartments are on the same block and floor as your friend's apartment.

Other apartments, that are further away, on the same floor, but in a different block have rents of $1,400, $1,430, $1,500, and $1,470. It seems they are more expensive due to having more light from the sun in the evening.

Considering the apartment's proximity, it seems your estimated rent would be around $1,210. That is the general idea of what the K-Nearest Neighbors (KNN) algorithm does! It classifies or regresses new data based on its proximity to already existing data.

Translate the Example into Theory

When the estimated value is a continuous number, such as the rent value, KNN is used for regression. But we could also divide apartments into categories based on the minimum and maximum rent, for instance. When the value is discrete, making it a category, KNN is used for classification.

There is also the possibility of estimating which neighbors are so different from others that they will probably stop paying rent. This is the same as detecting which data points are so far away that they don't fit into any value or category, when that happens, KNN is used for outlier detection.

In our example, we also already knew the rents of each apartment, which means our data was labeled. KNN uses previously labeled data, which makes it a supervised learning algorithm.

KNN is extremely easy to implement in its most basic form, and yet performs quite complex classification, regression, or outlier detection tasks.

Each time there is a new point added to the data, KNN uses just one part of the data for deciding the value (regression) or class (classification) of that added point. Since it doesn't have to look at all the points again, this makes it a lazy learning algorithm.

KNN also doesn't assume anything about the underlying data characteristics, it doesn't expect the data to fit into some type of distribution, such as uniform, or to be linearly separable. This means it is a non-parametric learning algorithm. This is an extremely useful feature since most of the real-world data doesn't really follow any theoretical assumption.

Visualizing Different Uses of the KNN

As it has been shown, the intuition behind the KNN algorithm is one of the most direct of all the supervised machine learning algorithms. The algorithm first calculates the distance of a new data point to all other training data points.

Note: The distance can be measured in different ways. You can use a Minkowski, Euclidean, Manhattan, Mahalanobis or Hamming formula, to name a few metrics. With high dimensional data, Euclidean distance oftentimes starts failing (high dimensionality is... weird), and Manhattan distance is used instead.
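To make those distance formulas concrete, here is a minimal sketch (with two made-up points, not taken from the dataset used later) of the two most common metrics computed with NumPy:

import numpy as np

# Two hypothetical data points, just to illustrate the formulas
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))          # sum of absolute coordinate differences: 7.0

print(euclidean, manhattan)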

After calculating the distance, KNN selects a number of nearest data points - 2, 3, 10, or really, any integer. This number of points (2, 3, 10, etc.) is the K in K-Nearest Neighbors!

In the final step, if it is a regression task, KNN will calculate the average weighted sum of the K-nearest points for the prediction. If it is a classification task, the new data point will be assigned to the class to which the majority of the selected K-nearest points belong.

Let's visualize the algorithm in action with the help of a simple example. Consider a dataset with two variables and a K of 3.

When performing regression, the task is to find the value of a new data point, based on the average weighted sum of the 3 nearest points.

KNN with K = 3, when used for regression:


The KNN algorithm will start by calculating the distance of the new point from all the points. It then finds the 3 points with the least distance to the new point. This is shown in the second figure above, in which the three nearest points, 47, 58, and 79, have been encircled. After that, it calculates the weighted sum of 47, 58 and 79 - in this case the weights are equal to 1 - we are considering all points as equals, but we could also assign different weights based on distance. After calculating the weighted sum, the new point's value is 61.33.
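As a quick sketch of that calculation - the neighbor values come from the example above, while the distances below are made up only to illustrate the distance-weighted variant mentioned in passing:

import numpy as np

neighbor_values = np.array([47, 58, 79])

# Equal weights: the prediction is a plain average
print(neighbor_values.mean())  # 61.33...

# Hypothetical distances to the new point; closer neighbors get larger weights
distances = np.array([1.0, 2.0, 4.0])
print(np.average(neighbor_values, weights=1 / distances))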

And when performing a classification, the KNN task is to classify a new data point into the "Purple" or "Red" class.

KNN with K = 3, when used for classification:


The KNN algorithm will start in the same way as before, by calculating the distance of the new point from all the points, finding the 3 nearest points with the least distance to the new point, and then, instead of calculating a number, it assigns the new point to the class to which the majority of the three nearest points belong - the red class. Therefore, the new data point will be classified as "Red".

The outlier detection process is different from both of the above; we will talk more about it when implementing it after the regression and classification implementations.

Note: The code provided in this tutorial has been executed and tested with the following Jupyter notebook.

The Scikit-Learn California Housing Dataset

We are going to use the California housing dataset to illustrate how the KNN algorithm works. The dataset was derived from the 1990 U.S. census. One row of the dataset represents the census of one block group.

In this section, we'll go over the details of the California Housing Dataset, so you can gain an intuitive understanding of the data we'll be working with. It's very important to get to know your data before you start working on it.

A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data. Besides block group, another term used is household, a household is a group of people residing within a home.

The dataset consists of nine attributes:

  • MedInc - median income in block group
  • HouseAge - median house age in a block group
  • AveRooms - the average number of rooms (provided per household)
  • AveBedrms - the average number of bedrooms (provided per household)
  • Population - block group population
  • AveOccup - the average number of household members
  • Latitude - block group latitude
  • Longitude - block group longitude
  • MedHouseVal - median house value for California districts (hundreds of thousands of dollars)

The dataset is already part of the Scikit-Learn library; we only need to import it and load it as a DataFrame:

from sklearn.datasets import fetch_california_housing
# as_frame=True loads the data in a dataframe format, with other metadata besides it
california_housing = fetch_california_housing(as_frame=True)
# Select only the dataframe part and assign it to the df variable
df = california_housing.frame

Importing the data directly from Scikit-Learn imports more than only the columns and numbers - it also includes the data description as a Bunch object, so we've just extracted the frame. Further details of the dataset are available here.

Let's import Pandas and take a peek at the first few rows of data:

import pandas as pd
df.head()

Executing the code will display the first five rows of our dataset:

	MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup   Latitude  Longitude  MedHouseVal
0 	8.3252 	41.0 	  6.984127 	1.023810   322.0 	   2.555556   37.88 	-122.23    4.526
1 	8.3014 	21.0 	  6.238137 	0.971880   2401.0 	   2.109842   37.86 	-122.22    3.585
2 	7.2574 	52.0 	  8.288136 	1.073446   496.0 	   2.802260   37.85 	-122.24    3.521
3 	5.6431 	52.0 	  5.817352 	1.073059   558.0 	   2.547945   37.85 	-122.25    3.413
4 	3.8462 	52.0 	  6.281853 	1.081081   565.0 	   2.181467   37.85 	-122.25    3.422

In this guide, we will use MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude to predict MedHouseVal. Something similar to our motivation narrative.

Let's now jump right into the implementation of the KNN algorithm for the regression.

Regression with K-Nearest Neighbors with Scikit-Learn

So far, we got to know our dataset and now can proceed to other steps in the KNN algorithm.

Preprocessing Data for KNN Regression

The preprocessing is where the first differences between the regression and classification tasks appear. Since this section is all about regression, we'll prepare our dataset accordingly.

For the regression, we need to predict the median house value. To do so, we will assign MedHouseVal to y and all other columns to X just by dropping MedHouseVal:

y = df['MedHouseVal']
X = df.drop(['MedHouseVal'], axis = 1)

By looking at our variables descriptions, we can see that we have differences in measurements. To avoid guessing, let's use the describe() method to check:

# .T transposes the results, transforming rows into columns
X.describe().T

This results in:

			count 	  mean 		   std 			min 		25% 		50% 		75% 		max
MedInc 		20640.0   3.870671 	   1.899822 	0.499900 	2.563400 	3.534800 	4.743250 	15.000100
HouseAge 	20640.0   28.639486    12.585558 	1.000000 	18.000000 	29.000000 	37.000000 	52.000000
AveRooms 	20640.0   5.429000 	   2.474173 	0.846154 	4.440716 	5.229129 	6.052381 	141.909091
AveBedrms 	20640.0   1.096675 	   0.473911 	0.333333 	1.006079 	1.048780 	1.099526 	34.066667
Population 	20640.0   1425.476744  1132.462122 	3.000000 	787.000000 	1166.000000 1725.000000 35682.000000
AveOccup 	20640.0   3.070655 	   10.386050 	0.692308 	2.429741 	2.818116 	3.282261 	1243.333333
Latitude 	20640.0   35.631861    2.135952 	32.540000 	33.930000 	34.260000 	37.710000 	41.950000
Longitude 	20640.0   -119.569704  2.003532    -124.350000 -121.800000 	-118.490000 -118.010000 -114.310000

Here, we can see that the mean value of MedInc is approximately 3.87 and the mean value of HouseAge is about 28.64, making it 7.4 times larger than MedInc. Other features also have differences in mean and standard deviation - to see that, look at the mean and std values and observe how distant they are from each other. For MedInc, std is approximately 1.9, for HouseAge, std is 12.59, and the same applies to the other features.

We're using an algorithm based on distance and distance-based algorithms suffer greatly from data that isn't on the same scale, such as this data. The scale of the points may (and in practice, almost always does) distort the real distance between values.

To perform feature scaling, we will use Scikit-Learn's StandardScaler class later. If we applied the scaling right now (before a train-test split), the calculation would include test data, effectively leaking test data information into the rest of the pipeline. This sort of data leakage is unfortunately commonly overlooked, resulting in irreproducible or illusory findings.

Advice: If you'd like to learn more about feature scaling - read our "Feature Scaling Data with Scikit-Learn for Machine Learning in Python"

Splitting Data into Train and Test Sets

To be able to scale our data without leakage, but also to evaluate our results and to avoid over-fitting, we'll divide our dataset into train and test splits.

A straightforward way to create train and test splits is the train_test_split method from Scikit-Learn. The split doesn't simply cut the data at some point; it samples X% and Y% of the rows randomly. To make this process reproducible (so that the method always samples the same data points), we'll set the random_state argument to a certain SEED:

from sklearn.model_selection import train_test_split

SEED = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)

This piece of code samples 75% of the data for training and 25% of the data for testing. By changing the test_size to 0.3, for instance, you could train with 70% of the data and test with 30%.

By using 75% of the data for training and 25% for testing, out of 20640 records, the training set contains 15480 and the test set contains 5160. We can inspect those numbers quickly by printing the lengths of the full dataset and of split data:

len(X)       # 20640
len(X_train) # 15480
len(X_test)  # 5160

Great! We can now fit the data scaler on the X_train set, and scale both X_train and X_test without leaking any data from X_test into X_train.

Advice: If you'd like to learn more about the train_test_split() method, the importance of a train-test-validation split, and how to separate out validation sets, read our "Scikit-Learn's train_test_split() - Training, Testing and Validation Sets".

Feature Scaling for KNN Regression

By importing StandardScaler, instantiating it, fitting it according to our train data (preventing leakage), and transforming both train and test datasets, we can perform feature scaling:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit only on X_train
scaler.fit(X_train)

# Scale both X_train and X_test
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Note: Since you'll oftentimes call scaler.fit(X_train) followed by scaler.transform(X_train) - you can call a single scaler.fit_transform(X_train) followed by scaler.transform(X_test) to make the call shorter!
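For instance, the shorter equivalent of the snippet above would look like this:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit_transform() learns the statistics from X_train and scales it in one call
X_train = scaler.fit_transform(X_train)
# The test set is only transformed, using the statistics learned from X_train
X_test = scaler.transform(X_test)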

Now our data is scaled! The scaler maintains only the data points, and not the column names, when applied on a DataFrame. Let's organize the data into a DataFrame again with column names and use describe() to observe the changes in mean and std:

col_names=['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
scaled_df = pd.DataFrame(X_train, columns=col_names)
scaled_df.describe().T

This will give us:

			count 		mean 			std 		min 		25% 		50% 		75% 		max
MedInc 		15480.0 	2.074711e-16 	1.000032 	-1.774632 	-0.688854 	-0.175663 	0.464450 	5.842113
HouseAge 	15480.0 	-1.232434e-16 	1.000032 	-2.188261 	-0.840224 	0.032036 	0.666407 	1.855852
AveRooms 	15480.0 	-1.620294e-16 	1.000032 	-1.877586 	-0.407008 	-0.083940 	0.257082 	56.357392
AveBedrms 	15480.0 	7.435912e-17 	1.000032 	-1.740123 	-0.205765 	-0.108332 	0.007435 	55.925392
Population 	15480.0 	-8.996536e-17 	1.000032 	-1.246395 	-0.558886 	-0.227928 	0.262056 	29.971725
AveOccup 	15480.0 	1.055716e-17 	1.000032 	-0.201946 	-0.056581 	-0.024172 	0.014501 	103.737365
Latitude 	15480.0 	7.890329e-16 	1.000032 	-1.451215 	-0.799820 	-0.645172 	0.971601 	2.953905
Longitude 	15480.0 	2.206676e-15 	1.000032 	-2.380303 	-1.106817 	0.536231 	0.785934 	2.633738

Observe how all standard deviations are now 1 and the means have become smaller. This is what makes our data more uniform! Let's train and evaluate a KNN-based regressor.

Training and Predicting KNN Regression

Scikit-Learn's intuitive and stable API makes training regressors and classifiers very straightforward. Let's import the KNeighborsRegressor class from the sklearn.neighbors module, instantiate it, and fit it to our train data:

from sklearn.neighbors import KNeighborsRegressor
regressor = KNeighborsRegressor(n_neighbors=5)
regressor.fit(X_train, y_train)

In the above code, the n_neighbors is the value for K, or the number of neighbors the algorithm will take into consideration for choosing a new median house value. 5 is the default value for KNeighborsRegressor(). There is no ideal value for K and it is selected after testing and evaluation, however, to start out, 5 is a commonly used value for KNN and was thus set as the default value.

The final step is to make predictions on our test data. To do so, execute the following script:

y_pred = regressor.predict(X_test)

We can now evaluate how well our model generalizes to new data that we have labels (ground truth) for - the test set!

Evaluating the Algorithm for KNN Regression

The most commonly used regression metrics for evaluating the algorithm are mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), and coefficient of determination (R2):

  1. Mean Absolute Error (MAE): We subtract the predicted values from the actual values to obtain the errors, sum the absolute values of those errors, and take their mean. This metric gives a notion of the overall error for each prediction of the model; the smaller (closer to 0) the better:

$$
mae = (\frac{1}{n})\sum_{i=1}^{n}\left | Actual - Predicted \right |
$$

Note: You may also encounter the y and ŷ (read as y-hat) notation in the equations. The y refers to the actual values and the ŷ to the predicted values.

  2. Mean Squared Error (MSE): It is similar to the MAE metric, but it squares the errors. As with MAE, the smaller, or closer to 0, the better. Squaring makes large errors even larger. One thing to pay close attention to is that it is usually a hard metric to interpret due to the size of its values and the fact that they aren't on the same scale as the data.

$$
mse = \frac{1}{n}\sum_{i=1}^{n}(Actual - Predicted)^2
$$

  3. Root Mean Squared Error (RMSE): Tries to solve the interpretation problem raised by the MSE by taking the square root of its final value, so as to scale it back to the same units as the data. It is easier to interpret and good when we need to display or report the error in the units of the data. It shows how much the predictions may vary: if we have an RMSE of 4.35, our model's predictions are typically off by about 4.35 above or below the actual value. The closer to 0, the better as well.

$$
rmse = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(Actual - Predicted)^2}
$$

The mean_absolute_error() and mean_squared_error() methods of sklearn.metrics can be used to calculate these metrics as can be seen in the following snippet:

from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)

print(f'mae: {mae}')
print(f'mse: {mse}')
print(f'rmse: {rmse}')

The output of the above script looks like this:

mae: 0.4460739527131783 
mse: 0.4316907430948294 
rmse: 0.6570317671884894

The R2 can be calculated directly with the score() method:

regressor.score(X_test, y_test)

Which outputs:

0.6737569252627673

The results show that our KNN algorithm's mean absolute error and mean squared error are around 0.44 and 0.43, respectively. Also, the RMSE shows that our predictions can be above or below the actual value by about 0.65. How good is that?

Let's check what the prices look like:

y.describe()
count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Name: MedHouseVal, dtype: float64

The mean is 2.06 and the standard deviation from the mean is 1.15, so our error of ~0.44 isn't really stellar, but isn't too bad.

With the R2, the closer to 1 (or 100%) we get, the better. The R2 tells us how much of the change in the data, or data variance, is being understood or explained by KNN.

$$
R^2 = 1 - \frac{\sum(Actual - Predicted)^2}{\sum(Actual - Actual \ Mean)^2}
$$
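If you'd rather compute it from the predictions directly, sklearn.metrics also provides r2_score(), which gives the same number as score() for a fitted regressor:

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f'r2: {r2}')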

With a value of 0.67, we can see that our model explains 67% of the data variance. It is already more than 50%, which is ok, but not very good. Is there any way we could do better?

We have used a predetermined K with a value of 5, so, we are using 5 neighbors to predict our targets which is not necessarily the best number. To understand which would be an ideal number of Ks, we can analyze our algorithm errors and choose the K that minimizes the loss.

Finding the Best K for KNN Regression

Ideally, you would see which metric fits more into your context - but it is usually interesting to test all metrics. Whenever you can test all of them, do it. Here, we will show how to choose the best K using only the mean absolute error, but you can change it to any other metric and compare the results.

To do this, we will create a for loop and run models that have from 1 to X neighbors. At each iteration, we will calculate the MAE and plot the number of Ks along with the MAE result:

error = []

# Calculating MAE error for K values between 1 and 39
for i in range(1, 40):
    knn = KNeighborsRegressor(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    mae = mean_absolute_error(y_test, pred_i)
    error.append(mae)

Now, let's plot the errors:

import matplotlib.pyplot as plt 

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', 
         linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
         
plt.title('K Value MAE')
plt.xlabel('K Value')
plt.ylabel('Mean Absolute Error')

Looking at the plot, it seems the lowest MAE value is when K is 12. Let's get a closer look at the plot to be sure by plotting less data:

plt.figure(figsize=(12, 6))
plt.plot(range(1, 15), error[:14], color='red', 
         linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('K Value MAE')
plt.xlabel('K Value')
plt.ylabel('Mean Absolute Error')

You can also obtain the lowest error and the index of that point using the built-in min() function (works on lists) or convert the list into a NumPy array and get the argmin() (index of the element with the lowest value):

import numpy as np 

print(min(error))               # 0.43631325936692505
print(np.array(error).argmin()) # 11

We started counting neighbors at 1, while arrays are 0-based, so index 11 corresponds to 12 neighbors!

This means that we need 12 neighbors to be able to predict a point with the lowest MAE error. We can execute the model and metrics again with 12 neighbors to compare results:

knn_reg12 = KNeighborsRegressor(n_neighbors=12)
knn_reg12.fit(X_train, y_train)
y_pred12 = knn_reg12.predict(X_test)
r2 = knn_reg12.score(X_test, y_test) 

mae12 = mean_absolute_error(y_test, y_pred12)
mse12 = mean_squared_error(y_test, y_pred12)
rmse12 = mean_squared_error(y_test, y_pred12, squared=False)
print(f'r2: {r2}, \nmae: {mae12} \nmse: {mse12} \nrmse: {rmse12}')

The following code outputs:

r2: 0.6887495617137436, 
mae: 0.43631325936692505 
mse: 0.4118522151025172 
rmse: 0.6417571309323467

With 12 neighbors our KNN model now explains 69% of the variance in the data, and the errors have dropped slightly, going from 0.44 to 0.43 (MAE), 0.43 to 0.41 (MSE), and 0.65 to 0.64 (RMSE). It is not a very large improvement, but it is an improvement nonetheless.

Note: Going further in this analysis, doing an Exploratory Data Analysis (EDA) along with residual analysis may help to select features and achieve better results.

We have already seen how to use KNN for regression - but what if we wanted to classify a point instead of predicting its value? Now, we can look at how to use KNN for classification.

Classification using K-Nearest Neighbors with Scikit-Learn

In this task, instead of predicting a continuous value, we want to predict the class to which these block groups belong. To do that, we can divide the median house value for districts into groups with different house value ranges or bins.

When you want to use a continuous value for classification, you can usually bin the data. In this way, you can predict groups, instead of values.

Preprocessing Data for Classification

Let's create the data bins to transform our continuous values into categories:

# Creating 4 categories and assigning them to a MedHouseValCat column
df["MedHouseValCat"] = pd.qcut(df["MedHouseVal"], 4, retbins=False, labels=[1, 2, 3, 4])

Then, we can split our dataset into its attributes and labels:

y = df['MedHouseValCat']
X = df.drop(['MedHouseVal', 'MedHouseValCat'], axis = 1)

Since we have used the MedHouseVal column to create the bins, we need to drop both the MedHouseVal and MedHouseValCat columns from X. This way, the DataFrame will contain the first 8 columns of the dataset (i.e. attributes, features) while our y will contain only the MedHouseValCat assigned label.

Note: You can also select columns using .iloc instead of dropping them. When dropping, just be aware that you need to assign the y values before assigning the X values, because you can't assign a dropped column of a DataFrame to another object in memory.
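As a small sketch of that alternative (assuming the column order shown earlier, with the 8 feature columns first):

# Hypothetical equivalent using positional selection instead of drop()
y = df['MedHouseValCat']
X = df.iloc[:, 0:8]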

Splitting Data into Train and Test Sets

As it has been done with regression, we will also divide the dataset into training and test splits. Since we have different data, we need to repeat this process:

from sklearn.model_selection import train_test_split

SEED = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)

We will use the standard Scikit-Learn value of 75% train data and 25% test data again. This means we will have the same train and test number of records as in the regression before.

Feature Scaling for Classification

Since we are dealing with the same unprocessed dataset and its varying measure units, we will perform feature scaling again, in the same way as we did for our regression data:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Training and Predicting for Classification

After binning, splitting, and scaling the data, we can finally fit a classifier on it. For the prediction, we will use 5 neighbors again as a baseline. You can also instantiate the KNeighborsClassifier class without any arguments and it will automatically use 5 neighbors. Here, instead of importing the KNeighborsRegressor, we will import the KNeighborsClassifier class:

from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier()
classifier.fit(X_train, y_train)

After fitting the KNeighborsClassifier, we can predict the classes of the test data:

y_pred = classifier.predict(X_test)

Time to evaluate the predictions! Would predicting classes be a better approach than predicting values in this case? Let's evaluate the algorithm to see what happens.

Evaluating KNN for Classification

For evaluating the KNN classifier, we can also use the score method, but it executes a different metric since we are scoring a classifier and not a regressor. The basic metric for classification is accuracy - it describes how many predictions our classifier got right. The lowest accuracy value is 0 and the highest is 1. We usually multiply that value by 100 to obtain a percentage.

$$
accuracy = \frac{\text{number of correct predictions}}{\text{total number of predictions}}
$$

Note: It is extremely hard to obtain 100% accuracy on any real data, if that happens, be aware that some leakage or something wrong might be happening - there is no consensus on an ideal accuracy value and it is also context-dependent. Depending on the cost of error (how bad it is if we trust the classifier and it turns out to be wrong), an acceptable error rate might be 5%, 10% or even 30%.

Let's score our classifier:

acc =  classifier.score(X_test, y_test)
print(acc) # 0.6191860465116279

By looking at the resulting score, we can deduce that our classifier got ~62% of our classes right. This already helps in the analysis, although by only knowing what the classifier got right, it is difficult to improve it.

There are 4 classes in our dataset - what if our classifier got 90% of classes 1, 2, and 3 right, but only 30% of class 4 right?

Either a systemic failure on some class, or a balanced failure shared between classes, can yield a 62% accuracy score. Accuracy isn't a really good metric for actual evaluation - but it does serve as a good proxy. More often than not, with balanced datasets, a 62% accuracy is relatively evenly spread. Also, more often than not, datasets aren't balanced, so we're back at square one with accuracy being an insufficient metric.

We can look deeper into the results using other metrics to be able to determine that. This step is also different from the regression, here we will use:

  1. Confusion Matrix: To know how much we got right or wrong for each class. The values that were positive and correctly predicted are called true positives; the ones that were predicted as positive but weren't are called false positives. The same nomenclature of true negatives and false negatives is used for negative values;
  2. Precision: To understand what proportion of the values predicted as positive by our classifier were actually correct. Precision divides the true positive values by everything that was predicted as positive;

$$
precision = \frac{\text{true positive}}{\text{true positive} + \text{false positive}}
$$

  3. Recall: To understand how many of the true positives were identified by our classifier. Recall is calculated by dividing the true positives by everything that should have been predicted as positive.

$$
recall = \frac{\text{true positive}}{\text{true positive} + \text{false negative}}
$$

  4. F1 score: The balanced or harmonic mean of precision and recall. The lowest value is 0 and the highest is 1. When the f1-score is equal to 1, it means all classes were correctly predicted - this is a very hard score to obtain with real data (exceptions almost always exist).

$$
\text{f1-score} = 2* \frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}}
$$

Note: A weighted F1 score also exists, and it's just an F1 that doesn't apply the same weight to all classes. The weight is typically dictated by the classes support - how many instances "support" the F1 score (the proportion of labels belonging to a certain class). The lower the support (the fewer instances of a class), the lower the weighted F1 for that class, because it's more unreliable.

The confusion_matrix() and classification_report() methods of the sklearn.metrics module can be used to calculate and display all these metrics. The confusion_matrix is better visualized using a heatmap. The classification report already gives us accuracy, precision, recall, and f1-score, but you could also import each of these metrics from sklearn.metrics.

To obtain metrics, execute the following snippet:

from sklearn.metrics import classification_report, confusion_matrix
# Importing Seaborn to use the heatmap
import seaborn as sns

# Adding classes names for better interpretation
classes_names = ['class 1','class 2','class 3', 'class 4']
cm = pd.DataFrame(confusion_matrix(y_test, y_pred), 
                  columns=classes_names, index = classes_names)
                  
# Seaborn's heatmap to better visualize the confusion matrix
sns.heatmap(cm, annot=True, fmt='d');

print(classification_report(y_test, y_pred))

The output of the above script looks like this:

              precision    recall  f1-score   support

           1       0.75      0.78      0.76      1292
           2       0.49      0.56      0.53      1283
           3       0.51      0.51      0.51      1292
           4       0.76      0.62      0.69      1293

    accuracy                           0.62      5160
   macro avg       0.63      0.62      0.62      5160
weighted avg       0.63      0.62      0.62      5160

The results show that KNN was able to classify all the 5160 records in the test set with 62% accuracy, which is above average. The supports are fairly equal (even distribution of classes in the dataset), so the weighted F1 and unweighted F1 are going to be roughly the same.

We can also see the result of the metrics for each of the 4 classes. From that, we are able to notice that class 2 had the lowest precision, lowest recall, and lowest f1-score. Class 3 is right behind class 2 for having the lowest scores, and then, we have class 1 with the best scores followed by class 4.

By looking at the confusion matrix, we can see that:

  • class 1 was mostly mistaken for class 2 in 238 cases
  • class 2 for class 1 in 256 entries, and for class 3 in 260 cases
  • class 3 was mostly mistaken for class 2, in 374 entries, and for class 4, in 193 cases
  • class 4 was wrongly classified as class 3 for 339 entries, and as class 2 in 130 cases.

Also, notice that the diagonal displays the true positive values; when looking at it, it is plain to see that class 2 and class 3 have the fewest correctly predicted values.

With those results, we could go deeper into the analysis by further inspecting them to figure out why that happened, and also understanding if 4 classes are the best way to bin the data. Perhaps values from class 2 and class 3 were too close to each other, so it became hard to tell them apart.

Always try to test the data with a different number of bins to see what happens.
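For example, a quick sketch of re-binning into 3 or 5 quantile-based categories (the column names here are hypothetical) could look like:

df["MedHouseValCat3"] = pd.qcut(df["MedHouseVal"], 3, labels=[1, 2, 3])
df["MedHouseValCat5"] = pd.qcut(df["MedHouseVal"], 5, labels=[1, 2, 3, 4, 5])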

Besides the arbitrary number of data bins, there is also another arbitrary number that we have chosen, the number of K neighbors. The same technique we applied to the regression task can be applied to the classification when determining the number of Ks that maximize or minimize a metric value.

Finding the Best K for KNN Classification

Let's repeat what has been done for regression and plot the graph of K values and the corresponding metric for the test set. You can also choose which metric better fits your context, here, we will choose f1-score.

In this way, we will plot the f1-score for the predicted values of the test set for all the K values between 1 and 40.

First, we import the f1_score from sklearn.metrics and then calculate its value for all the predictions of a K-Nearest Neighbors classifier, where K ranges from 1 to 40:

from sklearn.metrics import f1_score

f1s = []

# Calculating f1 score for K values between 1 and 40
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    # using average='weighted' to calculate a weighted average for the 4 classes 
    f1s.append(f1_score(y_test, pred_i, average='weighted'))

The next step is to plot the f1_score values against K values. The difference from the regression is that instead of choosing the K value that minimizes the error, this time we will choose the value that maximizes the f1-score.

Execute the following script to create the plot:

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), f1s, color='red', linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=10)
plt.title('F1 Score K Value')
plt.xlabel('K Value')
plt.ylabel('F1 Score')

The output graph looks like this:

From the output, we can see that the f1-score is the highest when the value of the K is 15. Let's retrain our classifier with 15 neighbors and see what it does to our classification report results:

classifier15 = KNeighborsClassifier(n_neighbors=15)
classifier15.fit(X_train, y_train)
y_pred15 = classifier15.predict(X_test)
print(classification_report(y_test, y_pred15))

This outputs:

              precision    recall  f1-score   support

           1       0.77      0.79      0.78      1292
           2       0.52      0.58      0.55      1283
           3       0.51      0.53      0.52      1292
           4       0.77      0.64      0.70      1293

    accuracy                           0.63      5160
   macro avg       0.64      0.63      0.64      5160
weighted avg       0.64      0.63      0.64      5160

Notice that our metrics have improved with 15 neighbors, we have 63% accuracy and higher precision, recall, and f1-scores, but we still need to further look at the bins to try to understand why the f1-score for classes 2 and 3 is still low.

Besides using KNN for regression (determining block values) and for classification (determining block classes), we can also use KNN for detecting which mean block values are different from most - the ones that don't follow what most of the data is doing. In other words, we can use KNN for detecting outliers.

Implementing KNN for Outlier Detection with Scikit-Learn

Outlier detection uses a method that differs from what we did previously for regression and classification.

Here, we will see how far each of the neighbors is from a data point. Let's use the default 5 neighbors. For a data point, we will calculate the distance to each of the K-nearest neighbors. To do that, we will import another KNN algorithm from Scikit-learn which is not specific for either regression or classification called simply NearestNeighbors.

After importing, we will instantiate a NearestNeighbors class with 5 neighbors - you can also instantiate it with 12 neighbors to identify outliers in our regression example or with 15, to do the same for the classification example. We will then fit our train data and use the kneighbors() method to find our calculated distances for each data point and neighbors indexes:

from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors = 5)
nbrs.fit(X_train)
# Distances and indexes of the 5 neighbors 
distances, indexes = nbrs.kneighbors(X_train)

Now we have 5 distances for each data point - the distance between itself and its 5 neighbors, and an index that identifies them. Let's take a peek at the first three results and the shape of the array to visualize this better.

To look at the first three rows of distances and the shape of the array, execute:

distances[:3], distances.shape
(array([[0.        , 0.12998939, 0.15157687, 0.16543705, 0.17750354],
        [0.        , 0.25535314, 0.37100754, 0.39090243, 0.40619693],
        [0.        , 0.27149697, 0.28024623, 0.28112326, 0.30420656]]),
 (3, 5))

Observe that there are 3 rows with 5 distances each, and that the first distance in each row is 0 - each point's nearest neighbor is itself. We can also look at the neighbors' indexes:

indexes[:3], indexes[:3].shape

This results in:

(array([[    0,  8608, 12831,  8298,  2482],
        [    1,  4966,  5786,  8568,  6759],
        [    2, 13326, 13936,  3618,  9756]]),
 (3, 5))

In the output above, we can see the indexes of each of the 5 neighbors. Now, we can continue to calculate the mean of the 5 distances and plot a graph that counts each row on the X-axis and displays each mean distance on the Y-axis:

dist_means = distances.mean(axis=1)
plt.plot(dist_means)
plt.title('Mean of the 5 neighbors distances for each data point')
plt.xlabel('Count')
plt.ylabel('Mean Distances')

Notice that there is a part of the graph in which the mean distances have uniform values. That Y-axis point in which the means aren't too high or too low is exactly the point we need to identify to cut off the outlier values.

In this case, it is where the mean distance is 3. Let's plot the graph again with a horizontal dotted line to be able to spot it:

dist_means = distances.mean(axis=1)
plt.plot(dist_means)
plt.title('Mean of the 5 neighbors distances for each data point with cut-off line')
plt.xlabel('Count')
plt.ylabel('Mean Distances')
plt.axhline(y = 3, color = 'r', linestyle = '--')

This line marks the mean distance above which points are considered outliers. This means that all points with a mean distance above 3 are our outliers. We can find the indexes of those points using np.where(), which returns the indices of the points whose mean distance is above 3:

import numpy as np

# Visually determine cutoff values > 3
outlier_index = np.where(dist_means > 3)
outlier_index

The above code outputs:

(array([  564,  2167,  2415,  2902,  6607,  8047,  8243,  9029, 11892,
        12127, 12226, 12353, 13534, 13795, 14292, 14707]),)

Now we have our outlier point indexes. Let's locate them in the dataframe:

# Filter outlier values
outlier_values = df.iloc[outlier_index]
outlier_values

This results in:

		MedInc 	HouseAge AveRooms 	AveBedrms 	Population 	AveOccup 	Latitude 	Longitude 	MedHouseVal
564 	4.8711 	27.0 	 5.082811 	0.944793 	1499.0 	    1.880803 	37.75 		-122.24 	2.86600
2167 	2.8359 	30.0 	 4.948357 	1.001565 	1660.0 	    2.597809 	36.78 		-119.83 	0.80300
2415 	2.8250 	32.0 	 4.784232 	0.979253 	761.0 	    3.157676 	36.59 		-119.44 	0.67600
2902 	1.1875 	48.0 	 5.492063 	1.460317 	129.0 	    2.047619 	35.38 		-119.02 	0.63800
6607 	3.5164 	47.0 	 5.970639 	1.074266 	1700.0 	    2.936097 	34.18 		-118.14 	2.26500
8047 	2.7260 	29.0 	 3.707547 	1.078616 	2515.0 	    1.977201 	33.84 		-118.17 	2.08700
8243 	2.0769 	17.0 	 3.941667 	1.211111 	1300.0 	    3.611111 	33.78 		-118.18 	1.00000
9029 	6.8300 	28.0 	 6.748744 	1.080402 	487.0 		2.447236 	34.05 		-118.78 	5.00001
11892 	2.6071 	45.0 	 4.225806 	0.903226 	89.0 		2.870968 	33.99 		-117.35 	1.12500
12127 	4.1482 	7.0 	 5.674957 	1.106998 	5595.0 		3.235975 	33.92 		-117.25 	1.24600
12226 	2.8125 	18.0 	 4.962500 	1.112500 	239.0 		2.987500 	33.63 		-116.92 	1.43800
12353 	3.1493 	24.0 	 7.307323 	1.460984 	1721.0 		2.066026 	33.81 		-116.54 	1.99400
13534 	3.7949 	13.0 	 5.832258 	1.072581 	2189.0 		3.530645 	34.17 		-117.33 	1.06300
13795 	1.7567 	8.0 	 4.485173 	1.120264 	3220.0 		2.652389 	34.59 		-117.42 	0.69500
14292 	2.6250 	50.0 	 4.742236 	1.049689 	728.0 		2.260870 	32.74 		-117.13 	2.03200
14707 	3.7167 	17.0 	 5.034130 	1.051195 	549.0 		1.873720 	32.80 		-117.05 	1.80400

Our outlier detection is finished. This is how we spot each data point that deviates from the general data trend. We can see that there are 16 points in our train data that should be looked at further, investigated, maybe treated, or even removed from our data (if they were erroneously input) to improve results. Those points might have resulted from typing errors, inconsistencies in the mean block values, or even both.
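If you decided to simply remove them, a minimal sketch might look like this - it assumes the positions in outlier_index refer to row positions in the scaled training arrays, and that y_train is the matching label Series:

import numpy as np

# Drop the detected outlier rows from the scaled training data and the matching labels
X_train_clean = np.delete(X_train, outlier_index[0], axis=0)
y_train_clean = y_train.drop(y_train.index[outlier_index[0]])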

Pros and Cons of KNN

In this section, we'll present some of the pros and cons of using the KNN algorithm.

Pros

  • It is easy to implement
  • It is a lazy learning algorithm and therefore doesn't require a training phase (it only uses the K-nearest neighbors at prediction time). This makes getting started with KNN and adding new data much faster than with algorithms that must be trained on the whole dataset, such as Support Vector Machines, linear regression, etc.
  • Since KNN requires no training before making predictions, new data can be added seamlessly
  • There are only two parameters required to work with KNN, i.e. the value of K and the distance function

Cons

  • The KNN algorithm doesn't work well with high dimensional data because with a large number of dimensions, the distance between points gets "weird", and the distance metrics we use don't hold up
  • Finally, the KNN algorithm doesn't work well with categorical features since it is difficult to find the distance between dimensions with categorical features

Going Further - Hand-Held End-to-End Project

Your inquisitive nature makes you want to go further? We recommend checking out our Guided Project "Hands-On House Price Prediction - Machine Learning in Python".

In this guided project - you'll learn how to build powerful traditional machine learning models as well as deep learning models, utilize Ensemble Learning and train meta-learners to predict house prices from a bag of Scikit-Learn and Keras models.

Using Keras, the deep learning API built on top of Tensorflow, we'll experiment with architectures, build an ensemble of stacked models and train a meta-learner neural network (level-1 model) to figure out the pricing of a house.

Deep learning is amazing - but before resorting to it, it's advised to also attempt solving the problem with simpler techniques, such as with shallow learning algorithms. Our baseline performance will be based on a Random Forest Regression algorithm. Additionally - we'll explore creating ensembles of models through Scikit-Learn via techniques such as bagging and voting.

This is an end-to-end project, and like all Machine Learning projects, we'll start out with Exploratory Data Analysis, followed by Data Preprocessing and finally Building Shallow and Deep Learning Models to fit the data we've explored and cleaned previously.

Conclusion

KNN is a simple yet powerful algorithm. It can be used for many tasks such as regression, classification, or outlier detection.

KNN has been widely used to find document similarity and pattern recognition. It has also been employed for developing recommender systems and for dimensionality reduction and pre-processing steps for computer vision - particularly face recognition tasks.

In this guide - we've gone through regression, classification and outlier detection using Scikit-Learn's implementation of the K-Nearest Neighbor algorithm.

Talk Python to Me: #378: Flet: Flutter apps in Python

Have you heard of Flutter? It's a modern and polished UI framework to write mobile apps, desktop apps, and even web apps. While interesting, you may have kept your distance because Flutter is a Dart language-based framework. But with the project we're covering today, Flet, many Flutter UIs can now be written in pure Python. Flet is a very exciting development in the GUI space for Python devs. And we have the creator, Feodor Fitsner, here to take us through it.

Links from the show:

  • Feodor on GitHub: https://github.com/FeodorFitsner
  • Flet: https://flet.dev
  • Flutter: https://flutter.dev
  • Dart: https://dart.dev
  • Flet Tutorials: https://flet.dev/docs/tutorials
  • It's All Widgets Showcase: https://itsallwidgets.com
  • Roadmap: https://flet.dev/docs/roadmap/
  • pglet: https://pglet.io/docs/tutorials/python/
  • Flutter Flow Designer: https://flutterflow.io
  • Fluent UI for Flutter Showcase App: https://bdlukaa.github.io/fluent_ui/
  • macOS UI: https://pub.dev/packages/macos_ui
  • Flet Mobile Strategy: https://flet.dev/blog/flet-mobile-strategy/
  • Michael's flutter doctor output: https://talk-python-shared.nyc3.digitaloceanspaces.com/flutter-doctor.png
  • Pyscript: https://pyscript.net
  • Watch this episode on YouTube: https://www.youtube.com/watch?v=kxsLRRY2xZA
  • Episode transcripts: https://talkpython.fm/episodes/transcript/378/flet-flutter-apps-in-python

Stay in touch with us:

  • Subscribe to us on YouTube: https://talkpython.fm/youtube
  • Follow Talk Python on Twitter: @talkpython (https://twitter.com/talkpython)
  • Follow Michael on Twitter: @mkennedy (https://twitter.com/mkennedy)

Sponsors:

  • Sentry's DEX Conference: https://talkpython.fm/sentry-dex-conf
  • IRL Podcast: https://talkpython.fm/irl
  • AssemblyAI: https://talkpython.fm/assemblyai
  • Talk Python Training: https://talkpython.fm/training

Python for Beginners: Insert Element in A Sorted List in Python


Normally, we add elements to the end of a list in Python. However, if we are given a sorted list and we are asked to maintain the order of the elements while inserting a new element, it might become a tedious task. In this article, we will discuss different approaches to insert an element in a sorted list in Python.

How to Insert an Element in a Sorted List?

If we are given a sorted list and we are asked to maintain the order of the elements while inserting a new element, we first need to find the position where the new element can be inserted. After that, we can insert the element into the list using slicing or the insert() method. 

Using slicing

To insert a new element in a sorted list using slicing, we will first find the position at which the element should be inserted. For this, we will find the index of the first element in the list that is greater than the element to be inserted. After that, we will slice the list into two parts in such a way that one slice contains all the elements smaller than the element to be inserted and the other slice contains all the elements greater than or equal to the element to be inserted.

After creating the slices, we will create a list with the element to be inserted as its only element. Thereafter, we will concatenate the slices. In this way, we can create a sorted list that also contains the new element. You can observe this in the following example.

myList = [1, 2, 3, 5, 6, 7, 8, 9, 10]
print("Original list is:", myList)
element = 4
print("The element to be inserted is:", element)
l = len(myList)
index = l  # if no element greater than the new element is found, insert at the end
for i in range(l):
    if myList[i] > element:
        index = i
        break
myList = myList[:index] + [element] + myList[index:]
print("The updated list is:", myList)

Output:

Original list is: [1, 2, 3, 5, 6, 7, 8, 9, 10]
The element to be inserted is: 4
The updated list is: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Using The insert() Method

After finding the index of the element that is greater than the element to be inserted, we can use the insert() method to insert the element in the sorted list. The insert() method, when invoked on a list, takes the index as its first input argument and the element to be inserted as the second input argument. After execution, the element is inserted into the list. 

After finding the element that is greater than the element to be inserted, we will insert the element just before the element using the insert() method as shown below.

myList = [1, 2, 3, 5, 6, 7, 8, 9, 10]
print("Original list is:", myList)
element = 4
print("The element to be inserted is:", element)
l = len(myList)
index = l  # if no element greater than the new element is found, insert at the end
for i in range(l):
    if myList[i] > element:
        index = i
        break
myList.insert(index, element)
print("The updated list is:", myList)

Output:

Original list is: [1, 2, 3, 5, 6, 7, 8, 9, 10]
The element to be inserted is: 4
The updated list is: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Suggested Reading: Regression in Machine Learning With Examples

Insert an Element in a Sorted List Using The bisect Module

The bisect module provides us with the insort() function, with which we can insert an element into a sorted list while keeping it sorted. The insort() function takes the sorted list as its first input argument and the element to be inserted as its second input argument. After execution, the element is inserted into the list. You can observe this in the following example.

import bisect

myList = [1, 2, 3, 5, 6, 7, 8, 9, 10]
print("Original list is:", myList)
element = 4
print("The element to be inserted is:", element)
bisect.insort(myList, element)
print("The updated list is:", myList)

Output:

Original list is: [1, 2, 3, 5, 6, 7, 8, 9, 10]
The element to be inserted is: 4
The updated list is: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Conclusion

In this article, we have discussed different approaches to insert elements in a sorted list in python. To know more about lists, you can read this article on list comprehension in python. You might also like this article on dictionary comprehension in python.

The post Insert Element in A Sorted List in Python appeared first on PythonForBeginners.com.

Python Circle: The Rising Popularity of Python

The rising popularity of Python: why is Python so popular, what is the future of Python, and is Python going to be the top choice by 2025?

PyCharm: PyCharm 2022.2.1 Is Out


The first minor release for PyCharm 2022.2 is available with the following fixes:

  • We’ve enabled the new UI for setting up an interpreter via the Show all popup menu in the Python Interpreter popup window. [PY-53057]
  • Docker: Run options that were set up via a run/debug configuration are no longer ignored if other run options were set up during Docker interpreter configuration. [PY-53638], [PY-53116]
  • Docker: The console and debugger now connect when using a Docker interpreter on Linux. [PY-55338]
  • Docker Compose: Project interpreters configured with previous PyCharm versions now start as expected. [PY-55423]
  • Docker Compose: Port configuration now works for the Docker Compose interpreter. [PY-54302]
  • Docker Compose: Running Django with the Docker Compose interpreter no longer leads to an HTTP error. [PY-55394]
  • PyCharm now recognizes non-default Flask app names and sets up the Flask run configuration accordingly. [PY-55347]
  • Testing: Behave run configurations can run individual scenarios instead of running all available files from the target directory. [PY-55434]
  • The Emulate terminal in output console option no longer leads to unexpected indentations in the console or debugger. [PY-55322]
  • Debugger: The Python console now displays text containing ANSI color sequences correctly. [PY-54599]
  • Debugger: PyCharm no longer produces an error when debugging code containing non-ASCII encoding. [PY-55369]
  • Debugger: Debugging a multiprocessing script no longer leads to an IDE exception error. [PY-55104]

You can download the new version from our website, update directly from the IDE, update via the free Toolbox App, or use snaps for Ubuntu.

The major changes in 2022.2

With PyCharm 2022.2, we introduced an important change to how PyCharm works with remote targets such as the WSL, SSH, and Docker. This change enabled users to create virtual environments on the WSL and SSH directly in PyCharm. It also equipped PyCharm with a new unified wizard for managing remote interpreters, enhancements that were not possible with the old implementation.

Initially, PyCharm focused on local development. As PyCharm matured, it started to cover new approaches and technologies. At first, support was added for the basic scenarios of development with the code deployed on SSH servers. Then, Docker integration was implemented and WSL followed. While different technologies were supported, the codebase responsible for code execution remained the same in its essence, leaving several corner cases and individual features unsupported.

To implement this new functionality, we had to refactor a significant portion of our huge code base. We did internal dogfooding of the updated code for several months before we added the functionality to the Early Access Program builds for the 2022.1 release and then for the 2022.2 release. Unfortunately, there were still some regressions that we didn’t catch, and they ended up appearing in PyCharm 2022.2. This minor release brings fixes that allow you to work more efficiently with remote interpreters again.

What’s next

Our future plans include providing the ability to configure Conda interpreters in WSL and on SSH servers, and adding support for managing Python interpreters configured in Docker running on an SSH server.

For now, as we release PyCharm 2022.2.1, we are already working on the next minor iteration, and a number of reported issues are currently being prioritized for it.

If you notice any issues after upgrading to 2022.2.1, please contact our support team or create a ticket in our issue tracker.

Real Python: How to Check if a Python String Contains a Substring

If you’re new to programming or come from a programming language other than Python, you may be looking for the best way to check whether a string contains another string in Python.

Identifying such substrings comes in handy when you’re working with text content from a file or after you’ve received user input. You may want to perform different actions in your program depending on whether a substring is present or not.

In this tutorial, you’ll focus on the most Pythonic way to tackle this task, using the membership operator in. Additionally, you’ll learn how to identify the right string methods for related, but different, use cases.

Finally, you’ll also learn how to find substrings in pandas columns. This is helpful if you need to search through data from a CSV file. You could use the approach that you’ll learn in the next section, but if you’re working with tabular data, it’s best to load the data into a pandas DataFrame and search for substrings in pandas.
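The pandas part comes later in the full article, but as a rough sketch of the idea (the DataFrame and the column name "quote" below are made up for illustration, not taken from the article's data), pandas exposes the same kind of check through the .str.contains() string accessor:

import pandas as pd

# A toy DataFrame; the "quote" column is purely illustrative.
df = pd.DataFrame({"quote": ["It contains a SECRET secret", "nothing here", "the secret"]})

# .str.contains() returns a boolean Series; case=False ignores capitalization
# (by default the pattern is treated as a regular expression).
mask = df["quote"].str.contains("secret", case=False)

# Keep only the rows whose "quote" column contains the substring.
matches = df[mask]
print(matches)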

Free Download: Click here to download the sample code that you’ll use to check if a string contains a substring.

How to Confirm That a Python String Contains Another String

If you need to check whether a string contains a substring, use Python’s membership operator in. In Python, this is the recommended way to confirm the existence of a substring in a string:

>>> raw_file_content = """Hi there and welcome.
... This is a special hidden file with a SECRET secret.
... I don't want to tell you The Secret,
... but I do want to secretly tell you that I have one."""
>>> "secret" in raw_file_content
True

The in membership operator gives you a quick and readable way to check whether a substring is present in a string. You may notice that the line of code almost reads like English.

Note: If you want to check whether the substring is not in the string, then you can use not in:

>>> "secret" not in raw_file_content
False

Because the substring "secret" is present in raw_file_content, the not in operator returns False.

When you use in, the expression returns a Boolean value:

  • True if Python found the substring
  • False if Python didn’t find the substring

You can use this intuitive syntax in conditional statements to make decisions in your code:

>>> if "secret" in raw_file_content:
...     print("Found!")
...
Found!

In this code snippet, you use the membership operator to check whether "secret" is a substring of raw_file_content. If it is, then you’ll print a message to the terminal. Any indented code will only execute if the Python string that you’re checking contains the substring that you provide.

The membership operator in is your best friend if you just need to check whether a Python string contains a substring.

However, what if you want to know more about the substring? If you read through the text stored in raw_file_content, then you’ll notice that the substring occurs more than once, and even in different variations!

Which of these occurrences did Python find? Does capitalization make a difference? How often does the substring show up in the text? And what’s the location of these substrings? If you need the answer to any of these questions, then keep on reading.
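As a quick preview (the snippet below is just a small illustration, not taken from the article), the built-in str.count() and str.find() methods answer two of those questions:

>>> text = "a secret is a secret"
>>> text.count("secret")   # how many times the substring occurs
2
>>> text.find("secret")    # index of the first occurrence, or -1 if absent
2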

Generalize Your Check by Removing Case Sensitivity

Python strings are case sensitive. If the substring that you provide uses different capitalization than the same word in your text, then Python won’t find it. For example, if you check for the lowercase word "secret" on a title-case version of the original text, the membership operator check returns False:

>>> title_cased_file_content = """Hi There And Welcome.
... This Is A Special Hidden File With A Secret Secret.
... I Don't Want To Tell You The Secret,
... But I Do Want To Secretly Tell You That I Have One."""
>>> "secret" in title_cased_file_content
False

Despite the fact that the word secret appears multiple times in the title-case text title_cased_file_content, it never shows up in all lowercase. That’s why the check that you perform with the membership operator returns False. Python can’t find the all-lowercase string "secret" in the provided text.

Humans have a different approach to language than computers do. This is why you’ll often want to disregard capitalization when you check whether a string contains a substring in Python.

You can generalize your substring check by converting the whole input text to lowercase:

>>> file_content = title_cased_file_content.lower()
>>> print(file_content)
hi there and welcome.
this is a special hidden file with a secret secret.
i don't want to tell you the secret,
but i do want to secretly tell you that i have one.
>>> "secret" in file_content
True

Read the full article at https://realpython.com/python-string-contains-substring/ »


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

Anarcat: Alternatives MPD clients to GMPC

GMPC (GNOME Music Player Client) is an audio player based on MPD (Music Player Daemon) that I've been using as my main audio player for years now.

Unfortunately, it's marked as "unmaintained" in the official list of MPD clients, along with basically every client available in Debian. In fact, if you look closely, all but one of the 5 unmaintained clients are in Debian (ario, cantata, gmpc, and sonata), which is kind of sad. And none of the active ones are packaged.

GMPC status and features

GMPC, in particular, is basically dead. The upstream website domain has been lost and there has been no release in ages. It's built with GTK2 so it's bound to be destroyed in a fire at some point anyways.

Still: it's really an awesome client. It has:

  • cover support
  • lyrics and tabs lookups (although those typically fail now)
  • last.fm lookups
  • high performance: loading thousands of artists or tracks is almost instant
  • repeat/single/consume/shuffle settings (single is particularly nice)
  • (global) keyboard shortcuts
  • file, artist, genre, tag browser
  • playlist editor
  • plugins
  • multi-profile support
  • avahi support
  • shoutcast support

Regarding performance, the only thing that I could find to slow down gmpc is to make it load all of my 40k+ artists in a playlist. That's slow, but it's probably understandable.

It's basically impossible to find a client that satisfies all of those.

But here are the clients that I found, alphabetically. I restrict myself to Linux-based clients.

CoverGrid

CoverGrid looks real nice, but is sharply focused on browsing covers. It's explicitly "not to be a replacement for your favorite MPD client but an addition to get a better album-experience", so probably not good enough for a daily driver. I asked for a FlatHub package so it could be tested.

mpdevil

mpdevil is a nice little client. It supports:

  • repeat, shuffle, single, consume mode
  • playlist support (although it fails to load any of my playlist with a UnicodeDecodeError)
  • nice genre / artist / album cover based browser
  • fails to load "all artists" (or takes too long to (pre-?)load covers?)
  • keyboard shortcuts
  • no file browser

Overall pretty good, but performance issues with large collections, and needs a cleanly tagged collection (which is not my case).

QUIMUP

QUIMUP looks like a simple client, C++, Qt, and mouse-based. No Flatpak, not tested.

SkyMPC

SkyMPC is similar. Ruby, Qt, documentation in Japanese. No Flatpak, not tested.

Xfmpc

Xfmpc is the XFCE client. Minimalist, doesn't seem to have all the features I need. No Flatpak, not tested.

Ymuse

Ymuse is another promising client. It has trouble loading all my artists or albums (and that's without album covers), but it eventually does. It does have a Files browser which saves it... It's noticeably slower than gmpc but does the job.

Cover support is spotty: covers sometimes show up in notifications but not in the player, which is odd. I'm missing a "this track information" thing. It seems to support playlists okay.

I'm missing an album cover browser as well. Overall seems like the most promising.

Written in Golang. It crashed on a library update.

Conclusion

For now, I guess that ymuse is the most promising client, even though it's still lacking some features and performance is suffering compared to gmpc. I'll keep updating this page as I find more information about the projects. I do not intend to package anything yet, and will wait a while to see if a clear winner emerges.

Mike Driscoll: How to Convert Images to PDFs with Python and Pillow (Video)

PyBites: Making plots in your terminal with plotext

In this blog post we share a quick script to plot the frequency of our blog articles in the terminal. It’s good to see that we’re getting back on track 🙂

The code gist is here.

First we import the libraries we are going to use. As always we separate Standard Library modules from 3rd party ones as per PEP8:


from collections import Counter
from datetime import date

from dateutil.parser import parse
import plotext as plt
import requests

Then I define some constants:

API_URL = "https://codechalleng.es/api/articles/"
START_YEAR = 2017
THIS_YEAR = date.today().year
THIS_MONTH = date.today().month
MONTH_RANGE = range(1, 13)

I defined a year-month generator. Why? Because there are months in which we did not post at all, yet I still want to show them on the graph.

def _create_yymm_range():
    for year in range(START_YEAR, THIS_YEAR + 1):
        for month in MONTH_RANGE:
            yield f"{year}-{str(month).zfill(2)}"
            if year == THIS_YEAR and month == THIS_MONTH:
                break
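
A quick way to sanity-check the generator (illustrative REPL output, not part of the original gist; it assumes START_YEAR is 2017 as defined above):

>>> months = list(_create_yymm_range())
>>> months[:3]
['2017-01', '2017-02', '2017-03']
>>> months[-1] == f"{THIS_YEAR}-{str(THIS_MONTH).zfill(2)}"
True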

Then comes the workhorse function that calculates the number of posts per month. We conveniently use a Counter, the right abstraction here:

def get_articles_per_month(url=API_URL):
    ym_range = _create_yymm_range()
    cnt = Counter({ym: 0 for ym in ym_range})
    data = requests.get(url)
    for row in data.json():
        dt = parse(row["publish_date"])
        if dt.year < START_YEAR:
            continue
        ym = dt.strftime("%Y-%m")
        cnt[ym] += 1
    return cnt

We use the requests library to get all the articles which we expose using our platform’s API.

Again, using _create_yymm_range() we make sure that all months are covered, even if there were 0 posts.

(Looking back, cnt could be better named number_posts_per_month. Well, you hardly ever get it right the first time!)
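
To see why the Counter is pre-seeded with zeros, here is a tiny illustration (not from the gist): the pre-seeded keys survive even when nothing gets counted for them, so empty months still show up later in the plot.

>>> from collections import Counter
>>> seeded = Counter({"2022-07": 0, "2022-08": 0})
>>> seeded["2022-08"] += 1
>>> seeded
Counter({'2022-08': 1, '2022-07': 0})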

Making a plot with plotext is as simple as picking the type of graph and feeding it labels and values:

def show_plot(data):
    labels, values = zip(*data.items())
    plt.bar(labels, values)
    plt.title("Pybites articles published per month")
    plt.show()

zip(*data.items()) is a nice trick to extract labels and values from a list of tuples (which a dict’s .items() method gives us):

>>> d = {1: 'a', 2: 'b', 3: 'c'}
>>> list(zip(*d.items()))
[(1, 2, 3), ('a', 'b', 'c')]

We give the plot a title and show it, which will render it in the terminal.

Lastly, we make sure the two functions are called only when the script is run directly, not when it is imported.

You do that by using an if __name__ == "__main__" block:

if __name__ == "__main__":
    data = get_articles_per_month()
    show_plot(data)

And the result:

[Screenshot: Pybites # of blog posts per month since we started.]

 —

Thanks Russell for introducing me to this library the other day. Do you want to see a really cool use case? Check out his blog article 😍
