Overview
Working with your favorite Python packages alongside distributed PySpark jobs on a Hadoop cluster can be difficult: the manual setup and configuration are tedious, and the pain grows as the number of nodes in your cluster increases.
Anaconda makes it easy to manage packages (Python, R and Scala) and their dependencies on an existing Hadoop cluster with PySpark, including packages for data processing, machine learning, image processing and natural language processing.
In a previous post, we demonstrated how you can use libraries in Anaconda to query and visualize 1.7 billion comments on a Hadoop cluster.
In this post, we'll use Anaconda to perform distributed natural language processing with PySpark on a subset of the same data set. We'll show how to configure different enterprise Hadoop distributions, including Cloudera CDH and Hortonworks HDP, so that you can work interactively on your Hadoop cluster with PySpark, Anaconda and a Jupyter Notebook.
In the remainder of this post, we'll:
Install Anaconda and the Jupyter Notebook on an existing Hadoop cluster.
Load the text/language data into HDFS on the cluster.
Configure PySpark to work with Anaconda and the Jupyter Notebook with different enterprise Hadoop distributions.
Perform distributed natural language processing on the data with the NLTK library from Anaconda.
Work locally with a subset of the data using Pandas and Bokeh for data analysis and interactive visualization.
Provisioning Anaconda on a cluster
Because we're installing Anaconda on an existing Hadoop cluster, we can follow the bare-metal cluster setup instructions in Anaconda for cluster management from a Windows, Mac, or Linux machine. We can install and configure conda on each node of the existing Hadoop cluster with a single command:
$ acluster create cluster-hadoop --profile cluster-hadoop
After a few minutes, we'll have a centrally managed installation of conda across our Hadoop cluster in the default location of /opt/anaconda.
Installing Anaconda packages on the cluster
Once we've provisioned conda on the cluster, we can install the packages from Anaconda that we'll need for this example to perform language processing, data analysis and visualization:
$ acluster conda install nltk pandas bokeh
We’ll need to download the NLTK data on each node of the cluster. For convenience, we can do this using the distributed shell functionality in Anaconda for cluster management:
$ acluster cmd 'sudo /opt/anaconda/bin/python -m nltk.downloader -d /usr/share/nltk_data all'
Loading the data into HDFS
In this post, we'll use a subset of the data set that contains reddit comments from January 2015 to August 2015, which is about 242 GB on disk. This data set was made available in July 2015 in a reddit post. The data is in JSON format (one comment per line) and includes the comment body, author, subreddit, timestamp of creation and other fields.
Note that we could convert the data into different formats or load it into various query engines; however, since the focus of this blog post is using libraries with Anaconda, we will be working with the raw JSON data in PySpark.
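As a minimal sketch of what such a conversion could look like (not something we do in this post), once the data is in HDFS and a SparkContext sc is available (as created later in this post), the JSON files could be written back out as Parquet with Spark SQL; the paths here are assumptions based on this post's layout:
>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> # read the raw JSON comments and write them out as Parquet (hypothetical output path)
>>> comments = sqlContext.read.json("/user/ubuntu/*.json")
>>> comments.write.parquet("/user/ubuntu/reddit-parquet")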
We’ll load the reddit comment data into HDFS from the head node. You can SSH into the head node by running the following command from the client machine:
$ acluster ssh
The remaining commands in this section will be executed on the head node. If it doesn’t already exist, we’ll need to create a user directory in HDFS and assign the appropriate permissions:
$ sudo -u hdfs hadoop fs -mkdir /user/ubuntu
$ sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu
We can then move the data by running the following command with valid AWS credentials, which will transfer the reddit comment data from the year 2015 (242 GB of JSON data) from a public Amazon S3 bucket into HDFS on the cluster:
$ hadoop distcp s3n://AWS_KEY:AWS_SECRET@blaze-data/reddit/json/2015/*.json /user/ubuntu/
Replace AWS_KEY and AWS_SECRET in the above command with valid Amazon AWS credentials.
Configuring the spark-submit command with your Hadoop Cluster
To use Python from Anaconda along with PySpark, you can set the PYSPARK_PYTHON environment variable on a per-job basis along with the spark-submit command. If you're using the Anaconda parcel for CDH, you can run a PySpark script (e.g., spark-job.py) using the following command:
$ PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit spark-job.py
If you’re using Anaconda for cluster management with Cloudera CDH or Hortonworks HDP, you can run the PySpark script using the following command (note the different path to Python):
$ PYSPARK_PYTHON=/opt/anaconda/bin/python spark-submit spark-job.py
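For reference, spark-job.py could be as simple as the following sketch, which counts the JSON records we load into HDFS below (the script contents here are illustrative, not the original post's script):
# spark-job.py: a minimal PySpark batch job (illustrative sketch)
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('anaconda-pyspark-batch')
sc = SparkContext(conf=conf)

# count the reddit comment records stored in HDFS
lines = sc.textFile('/user/ubuntu/*.json')
print(lines.count())

sc.stop()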
Installing and Configuring the Notebook with your Hadoop Cluster
Using the spark-submit command is a quick and easy way to verify that our PySpark script works in batch mode. However, it can be tedious to work with our analysis in a non-interactive manner as Java and Python logs scroll by.
Instead, we can use the Jupyter Notebook on our Hadoop cluster to work interactively with our data via Anaconda and PySpark.
Using Anaconda for cluster management, we can install Jupyter Notebook on the head node of the cluster with a single command, then open the notebook interface in our local web browser:
$ acluster install notebook
$ acluster open notebook
Once we’ve opened a new notebook, we’ll need to configure some environment variables for PySpark to work with Anaconda. The following sections include details on how to configure the environment variables for Anaconda to work with PySpark on Cloudera CDH and Hortonworks HDP.
Using the Anaconda Parcel with Cloudera CDH
If you’re using the Anaconda parcel with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and the Anaconda 4.0 parcel.
>>> import os
>>> import sys
>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"
>>> os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"
>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
>>> os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda"
>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")
>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
Using Anaconda for cluster management with Cloudera CDH
If you’re using Anaconda for cluster management with Cloudera CDH, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Cloudera CDH 5.7 running Spark 1.6.0 and Anaconda for cluster management 1.4.0.
>>> import os
>>> import sys
>>> os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-7-oracle-cloudera/jre"
>>> os.environ["SPARK_HOME"] = "/opt/anaconda/parcels/CDH/lib/spark"
>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"
>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")
>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
Using Anaconda for cluster management with Hortonworks HDP
If you’re using Anaconda for cluster management with Hortonworks HDP, you can configure the following settings at the beginning of your Jupyter notebook. These settings were tested with Hortonworks HDP running Spark 1.6.0 and Anaconda for cluster management 1.4.0.
>>> import os
>>> import sys
>>> os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"
>>> os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
>>> os.environ["PYSPARK_PYTHON"] = "/opt/anaconda/bin/python"
>>> sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")
>>> sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
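Whichever distribution you're using, a quick sanity check (a small sketch, not part of the original notebook) is to confirm that the Py4J and PySpark modules resolve from the paths we just added:
>>> import py4j
>>> import pyspark
>>> print(pyspark.__file__)  # should point into the pyspark.zip we added to sys.path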
Initializing the SparkContext
After we’ve configured Anaconda to work with PySpark on our Hadoop cluster, we can initialize a SparkContext that we’ll use for distributed computations. In this example, we’ll be using the YARN resource manager in client mode:
>>> from pyspark import SparkConf
>>> from pyspark import SparkContext
>>> conf = SparkConf()
>>> conf.setMaster('yarn-client')
>>> conf.setAppName('anaconda-pyspark-language')
>>> sc = SparkContext(conf=conf)
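As a quick check that the new SparkContext can reach the cluster, we can run a trivial distributed computation (a hedged sketch; the expected result below is just arithmetic):
>>> sc.version  # '1.6.0' on the distributions tested in this post
>>> sc.parallelize(range(1000)).sum()  # sums 0..999, which is 499500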
Loading the data into memory
Now that we’ve created a SparkContext, we can load the JSON reddit comment data into a Resilient Distributed Dataset (RDD) from PySpark:
>>> lines = sc.textFile("/user/ubuntu/*.json")
Next, we decode the JSON data and filter it down to comments from the movies subreddit:
>>> import json
>>> data = lines.map(json.loads)
>>> movies = data.filter(lambda x: x['subreddit'] == 'movies')
We can then persist the RDD in distributed memory across the cluster so that future computations and queries will be computed quickly from memory. Note that this operation only marks the RDD to be persisted; the data will be persisted in memory after the first computation is triggered:
>>> movies.persist()
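The plain persist() call above keeps the data in memory only; if the filtered subset were too large to fit, one alternative to that call (a sketch, not what we do here) would be to persist with a storage level that can spill to disk:
>>> from pyspark import StorageLevel
>>> movies.persist(StorageLevel.MEMORY_AND_DISK)  # spill partitions to disk if they don't fit in memory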
We can count the total number of comments in the movies subreddit (about 2.9 million comments):
>>> movies.count()
2905085
We can inspect the first comment in the dataset, which shows fields for the author, comment body, creation time, subreddit, etc.:
>>> movies.take(1)
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 113 ms
[{u'archived': False,
u'author': u'kylionsfan',
u'author_flair_css_class': None,
u'author_flair_text': None,
u'body': u'Goonies',
u'controversiality': 0,
u'created_utc': u'1420070402',
u'distinguished': None,
u'downs': 0,
u'edited': False,
u'gilded': 0,
u'id': u'cnas90u',
u'link_id': u't3_2qyjda',
u'name': u't1_cnas90u',
u'parent_id': u't3_2qyjda',
u'retrieved_on': 1425124282,
u'score': 1,
u'score_hidden': False,
u'subreddit': u'movies',
u'subreddit_id': u't5_2qh3s',
u'ups': 1}]
Distributed Natural Language Processing
Now that we’ve filtered a subset of the data and loaded it into memory across the cluster, we can perform distributed natural language computations using Anaconda with PySpark.
First, we define a parse() function that imports the natural language toolkit (NLTK) from Anaconda and tags words in each comment with their corresponding part of speech. Then, we can map the parse() function to the movies RDD:
>>> def parse(record):
... import nltk
... tokens = nltk.word_tokenize(record["body"])
... record["n_words"] = len(tokens)
... record["pos"] = nltk.pos_tag(tokens)
... return record
>>> movies2 = movies.map(parse)
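Before mapping parse() across the whole cluster, we could sanity-check it on a single record pulled back to the driver (a small sketch; it relies on the NLTK data we downloaded to every node, including the head node):
>>> sample = movies.take(1)[0]
>>> parsed = parse(sample)
>>> parsed['n_words'], parsed['pos'][:3]  # word count and the first few (token, tag) pairs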
Let’s take a look at the body of one of the comments:
>>> movies2.take(10)[6]['body']
u'Dawn of the Apes was such an incredible movie, it should be up there in my opinion.'
And the same comment with tagged parts of speech (e.g., nouns, verbs, prepositions):
>>> movies2.take(10)[6]['pos']
[(u'Dawn', 'NN'),
(u'of', 'IN'),
(u'the', 'DT'),
(u'Apes', 'NNP'),
(u'was', 'VBD'),
(u'such', 'JJ'),
(u'an', 'DT'),
(u'incredible', 'JJ'),
(u'movie', 'NN'),
(u',', ','),
(u'it', 'PRP'),
(u'should', 'MD'),
(u'be', 'VB'),
(u'up', 'RP'),
(u'there', 'RB'),
(u'in', 'IN'),
(u'my', 'PRP$'),
(u'opinion', 'NN'),
(u'.', '.')]
We can define a get_NN() function that extracts nouns from the records, filters stopwords, and removes non-words from the data set:
>>> def get_NN(record):
... import re
... from nltk.corpus import stopwords
... all_pos = record["pos"]
... ret = []
... for pos in all_pos:
... if pos[1] == "NN" \
... and pos[0] not in stopwords.words('english') \
... and re.search("^[0-9a-zA-Z]+$", pos[0]) is not None:
... ret.append(pos[0])
... return ret
>>> nouns = movies2.flatMap(get_NN)
We can then map each extracted noun to a (word, 1) pair, which we'll use to compute word counts:
>>> counts = nouns.map(lambda word: (word, 1))
After we've done the heavy lifting of processing, filtering and cleaning the text data with Anaconda and PySpark, we can collect the reduced word count results onto the head node:
>>> top_nouns = counts.countByKey()
>>> top_nouns = dict(top_nouns)
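countByKey() returns the entire dictionary of noun counts to the driver, which is fine here because the number of distinct nouns is modest. An equivalent approach that keeps more of the reduction on the cluster (a sketch of an alternative, not what the notebook does) is to reduce by key and pull back only the top results:
>>> from operator import add
>>> word_counts = counts.reduceByKey(add)
>>> word_counts.takeOrdered(10, key=lambda kv: -kv[1])  # top 10 (noun, count) pairs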
In the next section, we’ll continue our analysis on the head node of the cluster while working with familiar libraries in Anaconda, all in the same interactive Jupyter notebook.
Local analysis with Pandas and Bokeh
Now that we’ve done the heavy lifting using Anaconda and PySpark across the cluster, we can work with the results as a dataframe in Pandas, where we can query and inspect the data as usual:
>>> import pandas as pd
>>> df = pd.DataFrame(top_nouns.items(), columns=['Noun', 'Count'])
Let’s sort the resulting word counts, and view the top 10 nouns by frequency:
>>> df = df.sort_values('Count', ascending=False)
>>> df_top_10 = df.head(10)
>>> df_top_10
Noun      | Count
movie     | 539698
film      | 220366
time      | 157595
way       | 112752
gt        | 105313
http      | 92619
something | 87835
lot       | 85573
scene     | 82229
thing     | 82101
Let’s generate a bar chart of the top 10 nouns using Pandas:
>>> %matplotlib inline
>>> df_top_10.plot(kind='bar', x='Noun', y='Count')
Finally, we can use Bokeh to generate an interactive plot of the data:
>>> from bokeh.charts import Bar, show
>>> from bokeh.io import output_notebook
>>> from bokeh.charts.attributes import cat
>>> output_notebook()
>>> p = Bar(df_top_10,
... label=cat(columns='Noun', sort=False),
... values='Count',
... title='Top N nouns in r/movies subreddit')
>>> show(p)
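If we wanted to share the interactive plot outside of the notebook, we could also write it to a standalone HTML file (a small sketch using Bokeh's output_file, which wasn't part of the original notebook):
>>> from bokeh.io import output_file
>>> output_file('top_nouns.html')  # direct the next show() call to a standalone HTML file
>>> show(p)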
Conclusion
In this post, we used Anaconda with PySpark to perform distributed natural language processing on data stored in HDFS. We configured Anaconda and the Jupyter Notebook to work with PySpark on various enterprise Hadoop distributions (including Cloudera CDH and Hortonworks HDP), which allowed us to work interactively with Anaconda and the Hadoop cluster. This let us use Anaconda for the distributed processing with PySpark, then reduce the data to a size we could analyze on a single machine, all in the same interactive notebook environment. The complete notebook for this example with Anaconda, PySpark, and NLTK can be viewed on Anaconda Cloud.
You can get started with Anaconda for cluster management for free on up to 4 cloud-based or bare-metal cluster nodes by logging in with your Anaconda Cloud account:
$ conda install anaconda-client
$ anaconda login
$ conda install anaconda-cluster -c anaconda-cluster
If you’d like to test-drive the on-premises, enterprise features of Anaconda with additional nodes on a bare-metal, on-premises, or cloud-based cluster, get in touch with us at sales@continuum.io. The enterprise features of Anaconda, including the cluster management functionality and on-premises repository, are certified for use with Cloudera CDH 5.
If you’re running into memory errors, performance issues (related to JVM overhead or Python/Java serialization), problems translating your existing Python code to PySpark, or other limitations with PySpark, stay tuned for a future post about a parallel processing framework in pure Python that works with libraries in Anaconda and your existing Hadoop cluster, including HDFS and YARN.