Earlier this year, as part of our partnership with Cloudera, we announced a freely available Anaconda parcel for Cloudera CDH based on Python 2.7 and the Anaconda Distribution. The parcel has been very well received by both Anaconda and Cloudera users, because it makes it easier for data scientists and analysts to use the Anaconda libraries they know and love with Hadoop and Spark on Cloudera CDH.
Since then, we’ve had significant interest from Anaconda Enterprise users asking how they can create and use custom Anaconda parcels with Cloudera CDH. Our users want to deploy Anaconda with different versions of Python and custom conda packages that are not included in the freely available Anaconda parcel. Using parcels to manage multiple Anaconda installations across a Cloudera CDH cluster is convenient, because it works natively with Cloudera Manager without the need to install additional software or services on the cluster nodes.
We’re excited to announce a new self-service feature of the Anaconda platform that can be used to generate custom Anaconda parcels and installers. This functionality is now available in the Anaconda platform as part of the Anaconda Scale and Anaconda Repository platform components.
Deploying multiple custom versions of Anaconda on a Cloudera CDH cluster with Hadoop and Spark has never been easier! Let’s take a closer look at how we can create and install a custom Anaconda parcel using Anaconda Repository and Cloudera Manager.
Generating Custom Anaconda Parcels and Installers
For this example, we’ve installed Anaconda Repository (which is part of the Anaconda Enterprise subscription) and created an on-premises mirror of more than 600 conda packages that are available in the Anaconda distribution. We’ve also installed Cloudera CDH 5.8.2 with Spark on a cluster.
In Anaconda Repository, we can see a new feature for Installers, which can be used to generate custom Anaconda parcels for Cloudera CDH or standalone Anaconda installers.
The Installers page gives an overview of how to get started with custom Anaconda installers and parcels, and it describes how we can create custom Anaconda parcels that are served directly from Anaconda Repository from a Remote Parcel Repository URL.
After choosing Create new installer, we can specify the packages to include in our custom Anaconda parcel, which we’ve named anaconda_plus.
First, we specify the latest version of Anaconda (4.2.0) and Python 2.7. We’ve added the anaconda package to include all of the conda packages that are included by default in the Anaconda installer. Specifying the anaconda package is optional, but it’s a great way to supercharge your custom Anaconda parcel with more than 200 of the most popular Open Data Science packages, including NumPy, Pandas, SciPy, matplotlib, scikit-learn and more.
We also specified additional conda packages to include in the custom Anaconda parcel, covering natural language processing, visualization, data I/O and other data analytics tasks: azure, bcolz, boto3, datashader, distributed, gensim, hdfs3, holoviews, impyla, seaborn, spacy, tensorflow and xarray.
We could also have included conda packages from other channels in our on-premises installation of Anaconda Repository, including community-built packages from conda-forge or custom conda packages built by other users within our organization.
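For local testing, a client machine can point conda at those on-premises channels through its .condarc file. The sketch below is illustrative only: the hostname and channel names are placeholders, and the exact URL layout depends on how your Anaconda Repository instance serves channels.

```yaml
# Hypothetical .condarc pointing a conda client at an on-premises
# Anaconda Repository; hostname, port and channel names are placeholders.
channels:
  - http://anaconda-repo.example.com:8080/conda/anaconda     # mirrored Anaconda packages
  - http://anaconda-repo.example.com:8080/conda/conda-forge  # mirrored community packages
  - http://anaconda-repo.example.com:8080/conda/internal     # custom-built internal packages
```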
After creating a custom Anaconda parcel, we see a list of parcel files that were generated for all of the Linux distributions supported by Cloudera Manager.
Additionally, Anaconda Repository has already updated the manifest file used by Cloudera Manager with the new parcel information at the existing Remote Parcel Repository URL. Now, we’re ready to install the newly created custom Anaconda parcel using Cloudera Manager.
Installing Custom Anaconda Parcels Using Cloudera Manager
Now that we’ve generated a custom Anaconda parcel, we can install it on our Cloudera CDH cluster and make it available to all of the cluster users for PySpark and SparkR jobs.
From the Cloudera Manager Admin Console, click the Parcels indicator in the top navigation bar.
Click the Configuration button on the top right of the Parcels page.
Click the plus symbol in the Remote Parcel Repository URLs section, and add the repository URL that was provided from Anaconda Repository.
Finally, we can download, distribute and activate the custom Anaconda parcel.
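The download, distribute and activate steps can also be scripted against the Cloudera Manager REST API rather than clicked through in the Admin Console. The sketch below is a rough outline under stated assumptions: the host, cluster name, API version and credentials are placeholders, and you should confirm the parcel command endpoints against the API documentation for your Cloudera Manager version.

```python
import sys

# Placeholder connection details; substitute your own deployment's values.
CM_HOST = "http://cm.example.com:7180"  # hypothetical Cloudera Manager host
API = "/api/v11"                        # assumed API version
CLUSTER = "cluster"                     # hypothetical cluster name


def parcel_command_url(product, version, command):
    """Build the URL for a parcel lifecycle command
    (startDownload, startDistribution, activate)."""
    return "{0}{1}/clusters/{2}/parcels/products/{3}/versions/{4}/commands/{5}".format(
        CM_HOST, API, CLUSTER, product, version, command)


if len(sys.argv) > 1:
    # Pass any argument to actually issue the lifecycle commands
    # (requires the requests package and valid CM credentials).
    import requests
    auth = ("admin", "admin")  # placeholder credentials
    for step in ("startDownload", "startDistribution", "activate"):
        r = requests.post(parcel_command_url("anaconda_plus", "4.2.0", step), auth=auth)
        print(step, r.status_code)
```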
And we’re done! The custom-generated Anaconda parcel is now activated and ready to use with Spark or other distributed frameworks on our Cloudera CDH cluster.
Using the Custom Anaconda Parcel
Now that we’ve generated, installed and activated a custom Anaconda parcel, we can use libraries from our custom Anaconda parcel with PySpark.
You can use spark-submit along with the PYSPARK_PYTHON environment variable to run Spark jobs that use libraries from the Anaconda parcel, for example:
$ PYSPARK_PYTHON=/opt/cloudera/parcels/anaconda_plus/bin/python spark-submit pyspark_script.py
Or, to work with Spark interactively on the Cloudera CDH cluster, we can use Jupyter Notebooks via Anaconda Enterprise Notebooks, which is a multi-user notebook server with collaboration and support for enterprise authentication. You can configure Anaconda Enterprise Notebooks to use different Anaconda parcel installations on a per-job basis.
Get Started with Custom Anaconda Parcels in Your Enterprise
If you’re interested in generating custom Anaconda installers and parcels for Cloudera Manager, we can help! For more information about this functionality and our enterprise Anaconda platform subscriptions, get in touch via our contact page.
If you’d like to test-drive the on-premises, enterprise features of Anaconda on a bare-metal, on-premises or cloud-based cluster, get in touch with us at sales@continuum.io.
The enterprise features of the Anaconda platform, including the distributed functionality in Anaconda Scale and on-premises functionality of Anaconda Repository, are certified by Cloudera for use with Cloudera CDH 5.x.