Developer Blog
In this blog series we will showcase Orange, an open source data visualization and data analysis tool, through two simple predictive models and a Monte Carlo Simulation.
Introduction to Orange
Orange is a comprehensive, component-based framework for machine learning and data mining. It is intended for both experienced users and researchers in machine learning, who want to prototype new algorithms while reusing as much of the code as possible, and for those just entering the field who can either write short Python scripts for data analysis or enjoy the powerful, easy-to-use visual programming environment. Orange includes a range of techniques, such as data management and preprocessing, supervised and unsupervised learning, performance analysis and a range of data and model visualization techniques.
Orange has a visual programming front-end for explorative data analysis and visualization called Orange Canvas. Orange Canvas lets us quickly explore and analyze data sets through visual, component-based programming. Orange’s GUI is composed of widgets that communicate through channels; a set of connected widgets is called a schema. Creating schemas is quick and flexible, because widgets are added by dragging and dropping them onto the canvas.
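To make the widget, channel and schema vocabulary concrete, here is a toy sketch in plain Python. It is not Orange's API or how Orange is implemented: the `Schema` class, the widget names and the three sample rows are all made up for illustration. Each widget is modeled as a function, and each channel as a one-way link that passes one widget's output to the next.

```python
from collections import defaultdict


class Schema:
    """Toy model of a schema: named widgets joined by one-way channels.

    Hypothetical illustration only; this is not part of the Orange library.
    """

    def __init__(self):
        self.widgets = {}                  # widget name -> function(input) -> output
        self.channels = defaultdict(list)  # source name -> list of target names

    def add_widget(self, name, func):
        self.widgets[name] = func

    def connect(self, source, target):
        self.channels[source].append(target)

    def run(self, start, value=None):
        """Push a value through `start` and on to every widget downstream."""
        outputs = {}
        pending = [(start, value)]
        while pending:
            name, incoming = pending.pop(0)
            outputs[name] = self.widgets[name](incoming)
            for target in self.channels[name]:
                pending.append((target, outputs[name]))
        return outputs


schema = Schema()
# Illustrative rows of (petal length, petal width); not the full data set.
schema.add_widget("File", lambda _: [(1.4, 0.2), (4.7, 1.4), (6.0, 2.5)])
schema.add_widget("Data Table", lambda rows: len(rows))
schema.add_widget("Scatter Plot", lambda rows: [row[0] for row in rows])
schema.connect("File", "Data Table")
schema.connect("File", "Scatter Plot")

outputs = schema.run("File")
print(outputs["Data Table"], outputs["Scatter Plot"])
```

The point of the sketch is only the shape of the idea: one source widget feeds two downstream widgets, and adding another view of the data is a matter of adding one widget and one connection.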
Orange can also be used as a Python library. Using the Orange library, it is easy to prototype state-of-the-art machine learning algorithms.
Building a Simple Predictive Model in Orange
We start with two simple predictive models in the Orange Canvas and their corresponding Jupyter notebooks.
First, let’s take a look at our Simple Predictive Model - Part 1 notebook. Now, let’s recreate the model in the Orange Canvas. Here is the schema for predicting the results of the Iris data set via a classification tree in Orange:
Notice the toolbar on the left of the canvas: this is where the 100+ widgets can be found and dragged onto the canvas. Now, let’s take a look at how this simple schema works. The schema reads from left to right, with information flowing from widget to widget through the channels. After the Iris data set is loaded, it can be viewed through a variety of widgets. Here, we chose to see the data in a simple data table and a scatter plot. When we click on those two widgets, we see the following:
With just three widgets, we already get a sense of the data we are working with. The scatter plot has an option to “Rank Projections,” determining the best way to view our data. In this case, having the scatter plot as “Petal Width vs Petal Length” allows us to immediately see a potential relationship between the width of a flower’s petal and its iris type. Beyond scatter plots, Orange offers a variety of other widgets to help us visualize our data.
Now, let’s look at how we built our predictive model. We simply connected the data to a Classification Tree widget and can view the tree through a Classification Tree Viewer widget.
We can see exactly how our predictive model works. Now, we connect our model and our data to the “Test and Score” and “Predictions” widgets. The Test and Score widget is one way of seeing how well our Classification Tree performs:
The Predictions widget predicts the type of iris flower given the input data. Instead of looking at a long list of these predictions, we can use a confusion matrix to see our predictions and their accuracy.
Thus, we see that our model misclassified 3 of the 150 data instances, which corresponds to an accuracy of 98%.
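To show how a confusion matrix turns into that figure, here is a small sketch in plain Python. It does not use Orange, and the predictions are hypothetical: the Iris data set has 50 instances per class and our model made 3 errors, but which classes were confused with which is an assumption made for this example.

```python
from collections import Counter

CLASSES = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]


def confusion_matrix(actual, predicted, classes):
    """Return a nested dict: matrix[actual_class][predicted_class] -> count."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in classes} for a in classes}


def accuracy(matrix):
    """Fraction of instances on the diagonal, where predicted == actual."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[c][c] for c in matrix)
    return correct / total


# Hypothetical labels: 50 instances per class and 3 errors in total.
# The placement of the 3 errors is assumed, not read from the widget.
actual = ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50
predicted = (
    ["Iris-setosa"] * 50
    + ["Iris-versicolor"] * 48 + ["Iris-virginica"] * 2
    + ["Iris-virginica"] * 49 + ["Iris-versicolor"] * 1
)

matrix = confusion_matrix(actual, predicted, CLASSES)
misclassified = len(actual) - sum(matrix[c][c] for c in CLASSES)
model_accuracy = accuracy(matrix)

for actual_class in CLASSES:
    print(actual_class, [matrix[actual_class][p] for p in CLASSES])
print(misclassified, model_accuracy)
```

Each row of the matrix is an actual class and each column a predicted class, so every correct prediction lands on the diagonal and every error lands off it.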
We have seen how quickly we can build and visualize a working predictive model in the Orange Canvas. Now, let’s take a look at how the exact same model can once again be built via scripting in Orange, a Python 3 data mining library.
Building a Predictive Model with a Hold Out Test Set in Orange
In our second example of a predictive model, we make the model slightly more complicated by holding out a test set. By doing so, we can use separate data sets to train and test our model, which helps us detect overfitting. Here is the original notebook.
Now, let’s build the same predictive model in the Orange Canvas. The Orange Canvas will allow us to better visualize what we are building.
Orange Schema:
As you can tell, the difference between Part 1 and Part 2 is the Data Sampler widget. This widget randomly sets aside 30% of the data as the testing data set. Thus, we can build the same model but test it more reliably, using data the model has never seen before.
This example shows how easy it is to modify existing schemas. We simply introduced one new widget to vastly improve how we evaluate our model.
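The idea behind a random 70/30 hold-out split can be sketched in a few lines of plain Python. This is not the Data Sampler widget's actual implementation: the `holdout_split` helper, the fixed seed and the use of integer IDs in place of real data rows are all assumptions made for illustration.

```python
import random


def holdout_split(instances, test_fraction=0.3, seed=42):
    """Shuffle a copy of the instances, then split off a test set.

    Hypothetical helper; the seed only makes the split repeatable.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled items become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 150 Iris instances: one integer ID per data row.
instance_ids = range(150)
train, test = holdout_split(instance_ids)
print(len(train), len(test))
```

With 150 instances, this leaves 105 for training and 45 for testing, and no instance appears in both sets.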
Now let’s look at the same model built via the Orange Python 3 library.
Summary
In this blog post, we introduced Orange, an open source data visualization and data analysis tool, and presented two simple predictive models. In our next blog post, we will show how to build a Monte Carlo Simulation with Orange.