Developer Blog
In this post, we present two projects for the R programming language that are powered by Anaconda. We will explore how rBokeh allows you to create beautiful interactive visualizations and how easy it is to scale your predictive models with SparkR through Anaconda’s cluster management capabilities.
Bokeh and rBokeh
Bokeh is an interactive visualization framework that targets modern web browsers for presentation. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards and data applications, without having to learn web technologies, such as JavaScript. Bokeh currently provides interfaces in Python, R, Lua and Scala. rBokeh is the R library that allows you to write interactive visualizations in R.
Spark and SparkR
Spark is a popular open source processing framework for large scale in-memory distributed computations. SparkR is an R package that provides a frontend API to use Apache Spark from R.
Getting started with R in Anaconda
Conda, R Essentials and the R channel on Anaconda Cloud.
The easiest way to get started with R in Anaconda is installing R Essentials, a bundle of over 80 of the most used R packages for data science, including dplyr, shiny, ggplot2, tidyr, caret and nnet. R Essentials also includes Jupyter Notebooks and the IRKernel. To learn about conda and Jupyter, visit our previous blog post, "Jupyter and conda for R."
Conda is Anaconda's package, dependency and environment manager. Conda works across platforms (Windows, Linux, OS X) and across languages (R, Python, Scala...). R users can also use conda and benefit from its capabilities: build and install R packages, and manage portable, sandboxed environments that can contain different packages or different versions of packages. An R channel is available on Anaconda Cloud with more than 200 R packages.
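For example, you could create a sandboxed environment just for an R analysis project (the environment name below is only an illustration):

# Create an isolated environment containing R and the R Essentials bundle
# ("r-analysis" is an arbitrary example name)
$ conda create -n r-analysis -c r r-essentials

# Activate the environment and inspect its packages
$ source activate r-analysis
$ conda list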
To install R Essentials run:
$ conda install -c r r-essentials
To learn more about the benefits of conda, how to create R conda packages, and how to manage projects with Python and R dependencies, visit our previous blog post, "Conda for data science."
rBokeh: Interactive Data Visualizations in R
rBokeh is included in R Essentials, but it can also be separately installed from the R channel:
$ conda install -c r r-rbokeh
Once you have rBokeh installed, you can start an R console by typing `r` in your terminal:
$(r-essentials):~/$ r
Import the rBokeh library and start creating your interactive visualizations. The following example draws a scatterplot of the iris dataset with different glyphs and marker colors depending on the Species class, and the hover tool indicates their values as you mouse over the data points:
> library(rbokeh)
> p <- figure() %>%
+   ly_points(Sepal.Length, Sepal.Width, data = iris,
+             color = Species, glyph = Species,
+             hover = list(Sepal.Length, Sepal.Width))
> p
rBokeh plots include a toolbox with the following functionality: panning, box zooming, resizing, wheel zooming, resetting, saving and tooltip hovering.
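If you only need a subset of these tools, you can pass a tools argument to figure(). Here is a minimal sketch; the tool names are assumptions based on the rBokeh documentation rather than something shown in this post:

library(rbokeh)

# Hypothetical example: restrict the toolbox to panning, wheel zooming and saving
p <- figure(tools = c("pan", "wheel_zoom", "save")) %>%
  ly_points(Sepal.Length, Sepal.Width, data = iris, color = Species)
p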
rBokeh and Shiny
Besides the interactivity that is offered through the toolbox, rBokeh also integrates nicely with Shiny, allowing you to create visualizations that can be animated.
Here’s an example of a simple Shiny App using rBokeh that generates a new hexbin plot from randomly sampling two normal distributions (x and y).
library("shiny") library("rbokeh") library("htmlwidgets") ui - fluidPage( rbokehOutput("rbokeh") ) server - function(input, output, session) { output$rbokeh - renderRbokeh({ invalidateLater(1500, session) figure(plot_width = 400, plot_height = 800) %>% ly_hexbin(rnorm(10000), rnorm(10000)) }) } shinyApp(ui, server)
For more information and examples on rBokeh, visit the rBokeh documentation.
Using Anaconda for cluster management
The simplicity that conda brings to package and environment management can be extended to your cluster through Anaconda’s capabilities for cluster management. Anaconda for cluster management is freely available for unlicensed, unsupported use with up to 4 cloud-based or bare-metal cluster nodes.
You can install the cluster management library from the anaconda-cluster channel on Anaconda Cloud. You must have an Anaconda Cloud account and be logged in via anaconda login:

$ conda install anaconda-client
$ anaconda login
$ conda install anaconda-cluster -c anaconda-cluster
For detailed installation instructions, refer to the Installation section in the documentation.
Setting up your cloud cluster
In this example, we will create and provision a 4-node cloud-based cluster on Amazon EC2. After installing the Anaconda cluster management library, run the following command:
$ acluster
This will create the ~/.acluster directory, which contains all of the configuration information. Edit the ~/.acluster/providers.yaml file and add your AWS credentials and key file information.
aws_east:
  cloud_provider: ec2
  keyname: my-private-key
  location: us-east-1
  private_key: ~/.ssh/my-private-key.pem
  secret_id: AKIAXXXXXX
  secret_key: XXXXXXXXXX
Next, create a profile in the ~/.acluster/profiles.d/ directory that defines the cluster and includes the spark-standalone and notebook plugins.
name: aws_sparkr
node_id: ami-d05e75b8
node_type: m3.large
num_nodes: 4
plugins:
  - spark-standalone
  - notebook
provider: aws_east
user: ubuntu
You can now launch and provision your cluster with a single command:
$ acluster create spark_cluster --profile aws_sparkr
Notebooks and R Essentials on your cluster
After the cluster is created and provisioned with conda, Spark, and a Jupyter Notebook server, we can install R Essentials on all of the cluster nodes with:
$ acluster conda install r-essentials -c r
You can open the Jupyter Notebook server that is running on the head node of your cluster with:
$ acluster open notebook
The default notebook password is acluster, which can be customized in your profile. You can now open an R notebook that is running on the head node of your cloud-based cluster.
For example, here’s a simple notebook that runs on a single node with R Essentials:
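(The notebook itself appears as a screenshot in the original post. The snippet below is only a hypothetical sketch of what such a single-node notebook might contain, using ggplot2 and caret from R Essentials; it is not the original notebook's exact contents.)

library(ggplot2)
library(caret)

# Plot the iris data with ggplot2
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point()

# Fit a simple k-nearest-neighbors classifier with caret
fit <- train(Species ~ ., data = iris, method = "knn")
fit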
The full example notebook can be viewed and downloaded from Anaconda Cloud.
Running SparkR on your cluster
You can open the Spark UI in your browser with the following command:
$ acluster open spark-standalone
Now, let’s create a notebook that uses SparkR to distribute a predictive model across your 4-node cluster. Start a new R notebook and execute the following lines:
Sys.setenv(SPARK_HOME = '/opt/anaconda/share/spark')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
library(SparkR)
sc <- sparkR.init("spark://{{ URL }}:7077",
                  sparkEnvir = list(spark.r.command = '/opt/anaconda/bin/Rscript'))
Replace {{ URL }} in the above command with the URL displayed in the Spark UI. In my case, the URL is spark://ip-172-31-56-255:7077.
The following notebook uses three Spark workers across the cluster to fit a predictive model.
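The notebook appears as a screenshot in the original post and can be downloaded from the link below. As a rough sketch, assuming the Spark 1.x-era SparkR API in use at the time rather than the notebook's exact contents, a distributed model fit looks something like this:

# Hypothetical sketch of a SparkR model fit (Spark 1.x-era API assumed)
sqlContext <- sparkRSQL.init(sc)

# Distribute the iris data across the cluster
# (SparkR replaces "." in column names with "_")
df <- createDataFrame(sqlContext, iris)

# Fit a Gaussian GLM on the distributed DataFrame and inspect the coefficients
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(model)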
You can view the running Spark applications in the Spark UI and verify that the three Spark workers are running the application.
You can view and download the SparkR notebook from Anaconda Cloud.
Once you’ve finished, you can easily destroy the cluster with:
$ acluster destroy spark_cluster
Conclusion
We have presented two useful projects for doing data science in R with Anaconda: rBokeh and SparkR. Learn more about these projects in their respective documentation pages: rBokeh and SparkR. I also recommend downloading the Anaconda for cluster management cheat sheet to help you set up, manage, and provision your clusters.
Thanks to Ryan Hafen, the rBokeh developer, and to all of the Spark and SparkR developers.
For more information about scaling up Python and R in your enterprise, or Anaconda's cluster management features, contact sales@continuum.io.