
John Ludhi/nbshare.io: PySpark Substr and Substring


PySpark Substr and Substring

substring(col_name, pos, len) - the substring starts at pos and is of length len when the column is of String type, or it is the slice of the byte array that starts at pos and is of length len when the column is of Binary type.

First, we load the required libraries.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, substring)
In [24]:
# initializing spark session instance
spark = SparkSession.builder.appName('snippets').getOrCreate()

Let us load our initial records.

In [3]:
columns = ["Full_Name", "Salary"]
data = [("John A Smith", 1000), ("Alex Wesley Jones", 120000), ("Jane Tom James", 5000)]
In [4]:
# converting data to an rdd
rdd = spark.sparkContext.parallelize(data)
In [5]:
# Then creating a dataframe from our rdd variable
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
In [6]:
# visualizing current data before manipulation
dfFromRDD2.show()
+-----------------+------+
|        Full_Name|Salary|
+-----------------+------+
|     John A Smith|  1000|
|Alex Wesley Jones|120000|
|   Jane Tom James|  5000|
+-----------------+------+

PySpark substring

1) Here we take a substring for the first name from the Full_Name column. Full_Name contains the first name, middle name, and last name. We add a new column for the substring, called First_Name.

In [7]:
# here we add a new column called 'First_Name' and use substring() to get a partial string from the 'Full_Name' column
modified_dfFromRDD2 = dfFromRDD2.withColumn("First_Name", substring('Full_Name', 1, 4))
In [8]:
# visualizing the modified dataframe
modified_dfFromRDD2.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+
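
Note that substring() positions are 1-based, so pos = 1 starts at the first character. A minimal sketch on a literal string illustrates this (the aliases first_two and rest are just illustrative names):

from pyspark.sql.functions import substring, lit

# positions are 1-based: pos=1 is 'P', pos=3 is 'S'
spark.range(1).select(
    substring(lit("PySpark"), 1, 2).alias("first_two"),  # 'Py'
    substring(lit("PySpark"), 3, 5).alias("rest")        # 'Spark'
).show()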

2) We can also get a substring with select and alias to achieve the same result as above.

In [9]:
modified_dfFromRDD3 = dfFromRDD2.select("Full_Name", 'Salary', substring('Full_Name', 1, 4).alias('First_Name'))
In [10]:
# visualizing the modified dataframe after executing the above.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD3.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+

3) We can also use substring with selectExpr to get a substring of the 'Full_Name' column. selectExpr takes SQL expression(s) as strings to execute, which lets us run SQL-like expressions without creating views.

In [11]:
modified_dfFromRDD4 = dfFromRDD2.selectExpr("Full_Name", 'Salary', 'substring(Full_Name, 1, 4) as First_Name')
In [12]:
# visualizing the modified dataframe after executing the above.
# As you can see, it is exactly the same as the previous output.
modified_dfFromRDD4.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+
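
For comparison, the same expression can also be run as plain SQL by registering a temporary view first; selectExpr simply saves that step. A minimal sketch, assuming an arbitrary view name people_view:

# equivalent SQL route: register a temp view, then query it with spark.sql
dfFromRDD2.createOrReplaceTempView("people_view")
spark.sql("SELECT Full_Name, Salary, substring(Full_Name, 1, 4) AS First_Name FROM people_view").show()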

4) Here we use the substr function of the Column type to obtain the substring from the 'Full_Name' column and create a new column called 'First_Name'.

In [13]:
modified_dfFromRDD5 = dfFromRDD2.withColumn("First_Name", col('Full_Name').substr(1, 4))
In [14]:
# visualizing the modified dataframe yields the same output as seen for all previous examples.
modified_dfFromRDD5.show()
+-----------------+------+----------+
|        Full_Name|Salary|First_Name|
+-----------------+------+----------+
|     John A Smith|  1000|      John|
|Alex Wesley Jones|120000|      Alex|
|   Jane Tom James|  5000|      Jane|
+-----------------+------+----------+
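
One practical difference worth noting (and used later in the variable-length example): Column.substr() also accepts Column expressions for the start position and length, whereas substring() is typically called with plain integers. A minimal sketch, where First_Half is just a hypothetical column name:

from pyspark.sql.functions import col, lit, length

# substr() with Column arguments: start at 1 and take (roughly) the first half of each name
dfFromRDD2.withColumn(
    "First_Half",
    col("Full_Name").substr(lit(1), (length("Full_Name") / 2).cast("int"))
).show()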

5) Let us now consider an example of substring when the indices go beyond the length of the column. In that case, the substring() function only returns the characters that fall within the bounds, i.e. positions pos through pos + len - 1. This can be seen in the example below.

In [15]:
# In this example we are going to get four characters of the Full_Name column starting from position 14.
# As can be seen in the output, 4 or fewer characters are returned depending on the string length
modified_dfFromRDD6 = dfFromRDD2.withColumn("Last_Name", substring('Full_Name', 14, 4))
In [16]:
modified_dfFromRDD6.show()
+-----------------+------+---------+
|        Full_Name|Salary|Last_Name|
+-----------------+------+---------+
|     John A Smith|  1000|         |
|Alex Wesley Jones|120000|     ones|
|   Jane Tom James|  5000|        s|
+-----------------+------+---------+

The above method produces the wrong last name. We can fix it with the following approach.

6) Another example of substring is when we want to get characters relative to the end of the string. In this example, we extract the last name from the Full_Name column.

In [17]:
# In this example we are going to get five characters of the Full_Name column relative to the end of the string.
# As can be seen in the output, the last 5 characters are returned
modified_dfFromRDD7 = dfFromRDD2.withColumn("Last_Name", substring('Full_Name', -5, 5))
In [18]:
modified_dfFromRDD7.show()
+-----------------+------+---------+
|        Full_Name|Salary|Last_Name|
+-----------------+------+---------+
|     John A Smith|  1000|    Smith|
|Alex Wesley Jones|120000|    Jones|
|   Jane Tom James|  5000|    James|
+-----------------+------+---------+

Note that the above approach works only if the last name in each row has a constant character length. If the last names have different lengths, the solution is not that simple: we need the index at which the last name starts as well as the length of 'Full_Name'. If you are curious, the solution is provided below without the explanation.

In [19]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (col, substring, lit, substring_index, length)

Let us create an example with last names of variable character length.

In [20]:
columns = ["Full_Name", "Salary"]
data = [("John A Smith", 1000), ("Alex Wesley leeper", 120000), ("Jane Tom kinderman", 5000)]
rdd = spark.sparkContext.parallelize(data)
dfFromRDD2 = spark.createDataFrame(rdd).toDF(*columns)
dfFromRDD2.show()
+------------------+------+
|         Full_Name|Salary|
+------------------+------+
|      John A Smith|  1000|
|Alex Wesley leeper|120000|
|Jane Tom kinderman|  5000|
+------------------+------+

PySpark substr

In [21]:
# start position = total length minus the length of the last space-separated word
# (substring_index('Full_Name', ' ', -1) returns everything after the last space)
dfFromRDD2.withColumn(
    'Last_Name',
    col("Full_Name").substr(
        length('Full_Name') - length(substring_index('Full_Name', ' ', -1)),
        length('Full_Name'))
).show()
+------------------+------+----------+
|         Full_Name|Salary| Last_Name|
+------------------+------+----------+
|      John A Smith|  1000|     Smith|
|Alex Wesley leeper|120000|    leeper|
|Jane Tom kinderman|  5000| kinderman|
+------------------+------+----------+
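
Note that the start position computed above lands on the space just before the last name, so the extracted value carries a leading space. If all you need is the last word, a simpler alternative sketch is substring_index() on its own, splitting on the space character:

from pyspark.sql.functions import substring_index

# substring_index(str, delim, -1) keeps everything after the last delimiter,
# i.e. the last space-separated word, with no leading space
dfFromRDD2.withColumn("Last_Name", substring_index("Full_Name", " ", -1)).show()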

In [22]:
spark.stop()
