
John Ludhi/nbshare.io: PySpark Substr and Substring


PySpark Substr and Substring

substring(col_name, pos, len) - the substring starts at pos and is of length len when the column is of String type, or it is the slice of the byte array that starts at pos and is of length len when the column is of Binary type.

First, we load the required libraries.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, substring)
In [24]:
# initializing spark session instance
spark = SparkSession.builder.appName('snippets').getOrCreate()

Let us load our initial records.

In [3]:
columns = ["Full_Name", "Salary"]
data = [("John A Smith", 1000), ("Alex Wesley Jones", 120000), ("Jane Tom James", 5000)]
In [4]:
# converting data to an rdd
rdd = spark.sparkContext.parallelize(data)
In [5]:
# Then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
In [6]:
# visualizing current data before manipulation
dfFromRDD2.show()
+-----------------+------+
|        Full_Name|Salary|
+-----------------+------+
|     John A Smith|  1000|
|Alex Wesley Jones|120000|
|   Jane Tom James|  5000|
+-----------------+------+

PySpark substring

1) Here we take a substring for the first name from the Full_Name column. Full_Name contains the first name, middle name, and last name. We add a new column for the substring, called First_Name.

In [7]:
# here we add a new column called 'First_Name' and use substring() to get a partial string from the 'Full_Name' column
modified_dfFromRDD2 = dfFromRDD2.withColumn("First_Name", substring('Full_Name', 1, 4))
In [8]:
# visualizing the modified dataframe
modified_dfFromRDD2.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+
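
Note that substring() positions are 1-based, so pos = 1 starts at the first character. A minimal sketch on a literal string illustrates this (the aliases first_two and rest are just illustrative names):

from pyspark.sql.functions import substring, lit

# positions are 1-based: pos=1 is 'P', pos=3 is 'S'
spark.range(1).select(
    substring(lit("PySpark"), 1, 2).alias("first_two"),  # 'Py'
    substring(lit("PySpark"), 3, 5).alias("rest")        # 'Spark'
).show()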

2) We can also get a substring with select and alias to achieve the same result as above.

In [9]:
modified_dfFromRDD3 = dfFromRDD2.select("Full_Name", 'Salary', substring('Full_Name', 1, 4).alias('First_Name'))
In [10]:
# visualizing the modified dataframe after executing the above.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD3.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

3) We can also use substring with selectExpr to get a substring of the 'Full_Name' column. selectExpr takes SQL expression(s) as strings to execute, which lets us run SQL-like expressions without creating views.

In [11]:
modified_dfFromRDD4 = dfFromRDD2.selectExpr("Full_Name", 'Salary', 'substring(Full_Name, 1, 4) as First_Name')
In [12]:
# visualizing the modified dataframe after executing the above.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD4.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+
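
For comparison, the same expression can also be run as plain SQL by registering a temporary view first; selectExpr simply saves that step. A minimal sketch, assuming an arbitrary view name people_view:

# equivalent SQL route: register a temp view, then query it with spark.sql
dfFromRDD2.createOrReplaceTempView("people_view")
spark.sql("SELECT Full_Name, Salary, substring(Full_Name, 1, 4) AS First_Name FROM people_view").show()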

4) Here we use the substr function of the Column type to obtain the substring from the 'Full_Name' column and create a new column called 'First_Name'.

In [13]:
modified_dfFromRDD5 = dfFromRDD2.withColumn("First_Name", col('Full_Name').substr(1, 4))
In [14]:
# visualizing the modified dataframe yields the same output as seen for all previous examples.
modified_dfFromRDD5.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+
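
One practical difference worth noting (and used later in the variable-length example): Column.substr() also accepts Column expressions for the start position and length, whereas substring() is typically called with plain integers. A minimal sketch, where First_Half is just a hypothetical column name:

from pyspark.sql.functions import col, lit, length

# substr() with Column arguments: start at 1 and take (roughly) the first half of each name
dfFromRDD2.withColumn(
    "First_Half",
    col("Full_Name").substr(lit(1), (length("Full_Name") / 2).cast("int"))
).show()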

5) Let us now consider an example of substring when the indices go beyond the length of the column. In that case, the substring() function only returns the characters that fall within the bounds, i.e. positions pos through pos + len - 1. This can be seen in the example below.

In [15]:
# In this example we are going to get four characters of the Full_Name column starting from position 14.
# As can be seen in the output, 4 or fewer characters are returned depending on the string length
modified_dfFromRDD6 = dfFromRDD2.withColumn("Last_Name", substring('Full_Name', 14, 4))
In [16]:
modified_dfFromRDD6.show()
+-----------------+------+---------+
|        Full_Name|Salary|Last_Name|
+-----------------+------+---------+
|     John A Smith|  1000|         |
|Alex Wesley Jones|120000|     ones|
|   Jane Tom James|  5000|        s|
+-----------------+------+---------+

The above method produces the wrong last name. We can fix it with the following approach.

6) Another example of substring is when we want to get characters relative to the end of the string. In this example, we extract the last name from the Full_Name column.

In [17]:
# In this example we are going to get five characters of the Full_Name column relative to the end of the string.
# As can be seen in the output, the last 5 characters are returned
modified_dfFromRDD7 = dfFromRDD2.withColumn("Last_Name", substring('Full_Name', -5, 5))
In [18]:
modified_dfFromRDD7.show()
+-----------------+------+---------+
|        Full_Name|Salary|Last_Name|
+-----------------+------+---------+
|     John A Smith|  1000|    Smith|
|Alex Wesley Jones|120000|    Jones|
|   Jane Tom James|  5000|    James|
+-----------------+------+---------+

Note that the above approach works only if the last name in each row has a constant character length. If the last names have different lengths, the solution is not that simple: we need the index at which the last name starts as well as the length of 'Full_Name'. If you are curious, the solution is provided below without the explanation.

In [19]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, substring, lit, substring_index, length)

Let us create an example with last names of variable character length.

In [20]:
columns = ["Full_Name", "Salary"]
data = [("John A Smith", 1000), ("Alex Wesley leeper", 120000), ("Jane Tom kinderman", 5000)]
rdd = spark.sparkContext.parallelize(data)
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
dfFromRDD2.show()
+------------------+------+
|         Full_Name|Salary|
+------------------+------+
|      John A Smith|  1000|
|Alex Wesley leeper|120000|
|Jane Tom kinderman|  5000|
+------------------+------+

PySpark substr

In [21]:
# start position = total length minus the length of the last space-separated word
# (substring_index('Full_Name', ' ', -1) returns everything after the last space)
dfFromRDD2.withColumn(
    'Last_Name',
    col("Full_Name").substr(
        length('Full_Name') - length(substring_index('Full_Name', ' ', -1)),
        length('Full_Name'))
).show()
+------------------+------+----------+
|         Full_Name|Salary| Last_Name|
+------------------+------+----------+
|      John A Smith|  1000|     Smith|
|Alex Wesley leeper|120000|    leeper|
|Jane Tom kinderman|  5000| kinderman|
+------------------+------+----------+
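
Note that the start position computed above lands on the space just before the last name, so the extracted value carries a leading space. If all you need is the last word, a simpler alternative sketch is substring_index() on its own, splitting on the space character:

from pyspark.sql.functions import substring_index

# substring_index(str, delim, -1) keeps everything after the last delimiter,
# i.e. the last space-separated word, with no leading space
dfFromRDD2.withColumn("Last_Name", substring_index("Full_Name", " ", -1)).show()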

In [22]:
spark.stop()
