Introduction
A common business analytics task is forecasting the future based on known historical data. Forecasting is a complicated topic that relies on an analyst knowing the ins and outs of the domain as well as relatively complex mathematical theories. Because the mathematical concepts can be complex, a lot of business forecasting approaches are “solved” with a little linear regression and “intuition.” More complex models would yield better results but are often seen as too difficult to implement.
Given that background, I was very interested to see that Facebook recently open sourced a python and R library called prophet which seeks to automate the forecasting process with a more sophisticated but easily tunable model. In this article, I’ll introduce prophet and show how to use it to predict the volume of traffic for Practical Business Python over the next year. To make this a little more interesting, I will post the prediction through the end of March so we can take a look at how accurate the forecast is.
Overview of Prophet
For those interested in learning more about prophet, I recommend reading Facebook’s white paper on the topic. The paper is relatively light on math and heavy on the background of forecasting and some of the business challenges associated with building and using forecasting models at scale.
The paper’s introduction contains a good overview of the challenges with current forecasting approaches:
Producing high quality forecasts is not an easy problem for either machines or for most analysts. We have observed two main themes in the practice of creating business forecasts:
1. Completely automatic forecasting techniques can be brittle and they are often too inflexible to incorporate useful assumptions or heuristics.
2. Analysts who can produce high quality forecasts are quite rare because forecasting is a specialized data science skill requiring substantial experience. The result of these themes is that the demand for high quality forecasts often far outstrips the pace at which the organization can produce them.
Prophet seeks to provide a simple-to-use model that is sophisticated enough to provide useful results - even when run by someone without deep knowledge of the mathematical theories of forecasting. The modeling solution also provides several tunable parameters so that analysts can easily make changes to the model based on their unique business needs.
Installation
Before going any further, make sure to install prophet. The complex statistical modeling is handled by the Stan library, which is a prerequisite for prophet. As long as you are using Anaconda, the installation process is pretty simple:
conda install pystan
pip install fbprophet
Starting the Analysis
For this analysis, I will be using a spreadsheet of the actual web traffic volume from pbpython starting in Sept 2014 and going through early March 2017. The data is downloaded from Google Analytics and looks like this:
import pandas as pd
import numpy as np
from fbprophet import Prophet

data_file = "All Web Site Data Audience Overview.xlsx"
df = pd.read_excel(data_file)
df.head()
| | Day Index | Sessions |
|---|---|---|
| 0 | 2014-09-25 | 1 |
| 1 | 2014-09-26 | 4 |
| 2 | 2014-09-27 | 8 |
| 3 | 2014-09-28 | 42 |
| 4 | 2014-09-29 | 233 |
The first thing we need to check is to make sure the Day Index column came through as a datetime type:
df.dtypes
Day Index datetime64[ns]
Sessions int64
dtype: object
Since that looks good, let’s see what kind of insight we can get with just simple pandas plots:
df.set_index('Day Index').plot();
The basic plot is interesting but, like most time series data, it is difficult to get much out of this without doing further analysis. Additionally, if you wanted to add a predicted trend-line, it is a non-trivial task with stock pandas.
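To illustrate that point, here is a minimal sketch of what a hand-rolled linear trend line might look like using numpy’s `polyfit`. This is not part of the original analysis and the data below is made up; it only mirrors the shape of the Google Analytics export:

```python
import numpy as np
import pandas as pd

# Hypothetical data in the same shape as the Google Analytics export
df = pd.DataFrame({
    "Day Index": pd.date_range("2014-09-25", periods=100),
    "Sessions": np.arange(100) * 3 + 50,
})

# Convert dates to ordinal day numbers so polyfit can use them,
# offset from the first observation for numerical stability
x = df["Day Index"].map(pd.Timestamp.toordinal)
x = x - x.min()

# Fit a first-degree polynomial (a straight line)
slope, intercept = np.polyfit(x, df["Sessions"], deg=1)

# Evaluate the fitted trend line at each date
df["Trend"] = slope * x + intercept
```

Even this simple version requires manual date conversion and offers no seasonality or uncertainty estimates, which is exactly the gap prophet fills.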
Before going further, I do want to address the outlier in the July 2015 timeframe. My most popular article is Pandas Pivot Table Explained, which saw the biggest traffic spike on this blog. Since that article represents an outlier in volume, I am going to change those values to `nan` so that it does not unduly influence the projection.
This change is not strictly required but it will be useful to show that prophet can handle this missing data without further manipulation. This process also highlights the need for the analyst to still be involved in the process of making the forecast.
df.loc[(df['Sessions'] > 5000), 'Sessions'] = np.nan
df.set_index('Day Index').plot();
This is pretty good but I am going to do one other data transformation before continuing. I will convert the `Sessions` column to a log value. This article has more information on why a log transform is useful for these types of data sets. From the article:
… logging converts multiplicative relationships to additive relationships, and by the same token it converts exponential (compound growth) trends to linear trends. By taking logarithms of variables which are multiplicatively related and/or growing exponentially over time, we can often explain their behavior with linear models.
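A quick illustration of that point (the numbers here are made up for demonstration): a series that doubles every period is exponential, but after a log transform the step between consecutive points becomes constant, i.e. the series is linear:

```python
import numpy as np

# Sessions doubling every period: multiplicative (exponential) growth
sessions = np.array([100, 200, 400, 800, 1600], dtype=float)

# After the log transform, the difference between consecutive points
# is a constant log(2), so the series is a straight line
log_sessions = np.log(sessions)
diffs = np.diff(log_sessions)
```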
df['Sessions'] = np.log(df['Sessions'])
df.set_index('Day Index').plot();
The data set is almost ready to make a prediction. The final step is to rename the columns to `ds` and `y` in order to comply with the prophet API.
df.columns = ["ds", "y"]
df.head()
| | ds | y |
|---|---|---|
| 0 | 2014-09-25 | 0.000000 |
| 1 | 2014-09-26 | 1.386294 |
| 2 | 2014-09-27 | 2.079442 |
| 3 | 2014-09-28 | 3.737670 |
| 4 | 2014-09-29 | 5.451038 |
Now that the data is cleaned and labeled correctly, let’s see what prophet can do with it.
Making a Prediction
The prophet API is similar to scikit-learn. The general flow is to `fit` the data, then `predict` the future time series. In addition, prophet supports some nice plotting features using `plot` and `plot_components`.
Create the first model (m1) and fit it to our dataframe:
m1 = Prophet()
m1.fit(df)
In order to tell prophet how far to predict into the future, use `make_future_dataframe`. In this example, we will predict out 1 year (365 days).
future1 = m1.make_future_dataframe(periods=365)
Then make the forecast:
forecast1 = m1.predict(future1)
The `forecast1` object is just a pandas dataframe with several columns of data. The predicted value is called `yhat` and the range is defined by `yhat_lower` and `yhat_upper`. To see the last 5 predicted values:
forecast1[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
| | ds | yhat | yhat_lower | yhat_upper |
|---|---|---|---|---|
| 1250 | 2018-02-27 | 7.848040 | 6.625887 | 9.081303 |
| 1251 | 2018-02-28 | 7.787314 | 6.565903 | 9.008327 |
| 1252 | 2018-03-01 | 7.755146 | 6.517481 | 8.948139 |
| 1253 | 2018-03-02 | 7.552382 | 6.309191 | 8.785648 |
| 1254 | 2018-03-03 | 7.011651 | 5.795778 | 8.259777 |
To convert back to the numerical values representing sessions, use `np.exp`:

np.exp(forecast1[['yhat', 'yhat_lower', 'yhat_upper']].tail())
| | yhat | yhat_lower | yhat_upper |
|---|---|---|---|
| 1250 | 2560.709477 | 754.373407 | 8789.412841 |
| 1251 | 2409.836175 | 710.452848 | 8170.840734 |
| 1252 | 2333.549138 | 676.871358 | 7693.563414 |
| 1253 | 1905.275686 | 549.600404 | 6539.712030 |
| 1254 | 1109.484324 | 328.907843 | 3865.233952 |
To make this look nice and impress management, plot the data:
m1.plot(forecast1);
Very cool. The other useful feature is the ability to plot the various components:
m1.plot_components(forecast1);
I really like this view because it is a very simple way to pull out the daily and weekly trends. For instance, the charts make it easy to see that Monday-Thursday are peak times with big fall-offs on the weekend. Additionally, I appear to have bigger jumps in traffic towards the end of the year.
Refining the Model
I hope you’ll agree that the basic process to create a model is relatively straightforward and you can see that the results include more rigor than a simple linear trend line. Where prophet really shines is the ability to iterate the models with different assumptions and inputs.
One of the features that prophet supports is the concept of a “holiday.” The simplest way to think about this idea is the typical up-tick in store sales seen around the Thanksgiving and Christmas holidays. If we have certain known events that have major impacts on our time series, we can define them and the model will use these data points to try to make better future predictions.
For this blog, any time a new article is published, there is an uptick in traffic for about 1 week, then there is a slow decay back to steady state. Therefore, for this analysis, we can define a holiday as a blog post. Since I know that a post drives increased traffic for about 5-7 days, I can define an `upper_window` to encapsulate those 5 days in that holiday window. There is also a corresponding `lower_window` for days leading up to the holiday. For this analysis, I will only look at the `upper_window`.
To capture the holidays, define a holiday dataframe with a datestamp and the description of the holiday:
articles = pd.DataFrame({
    'holiday': 'publish',
    'ds': pd.to_datetime(['2014-09-27', '2014-10-05', '2014-10-14', '2014-10-26',
                          '2014-11-9', '2014-11-18', '2014-11-30', '2014-12-17',
                          '2014-12-29', '2015-01-06', '2015-01-20', '2015-02-02',
                          '2015-02-16', '2015-03-23', '2015-04-08', '2015-05-04',
                          '2015-05-17', '2015-06-09', '2015-07-02', '2015-07-13',
                          '2015-08-17', '2015-09-14', '2015-10-26', '2015-12-07',
                          '2015-12-30', '2016-01-26', '2016-04-06', '2016-05-16',
                          '2016-06-15', '2016-08-23', '2016-08-29', '2016-09-06',
                          '2016-11-21', '2016-12-19', '2017-01-17', '2017-02-06',
                          '2017-02-21', '2017-03-06']),
    'lower_window': 0,
    'upper_window': 5,
})
articles.head()
| | ds | holiday | lower_window | upper_window |
|---|---|---|---|---|
| 0 | 2014-09-27 | publish | 0 | 5 |
| 1 | 2014-10-05 | publish | 0 | 5 |
| 2 | 2014-10-14 | publish | 0 | 5 |
| 3 | 2014-10-26 | publish | 0 | 5 |
| 4 | 2014-11-09 | publish | 0 | 5 |
Astute readers may have noticed that you can include dates in the future. In this instance, I am including today’s blog post in the holiday dataframe.
To use the publish dates in the model, pass them to the model via the `holidays` keyword. Perform the normal `fit`, `make_future_dataframe` (this time we’ll try 90 days), `predict` and `plot`:
m2 = Prophet(holidays=articles).fit(df)
future2 = m2.make_future_dataframe(periods=90)
forecast2 = m2.predict(future2)
m2.plot(forecast2);
Because we have defined holidays, we get a little more information when we plot components:
m2.plot_components(forecast2);
Predictions
Prophet offers a couple of other options for continuing to tweak the model. I encourage you to play around with them to get a feel for how they work and how they can be used in your models. I have included one new option, `mcmc_samples`, in the final example below.
As promised, here is my forecast for website traffic between today and the end of March:
m3 = Prophet(holidays=articles, mcmc_samples=500).fit(df)
future3 = m3.make_future_dataframe(periods=90)
forecast3 = m3.predict(future3)

forecast3["Sessions"] = np.exp(forecast3.yhat).round()
forecast3["Sessions_lower"] = np.exp(forecast3.yhat_lower).round()
forecast3["Sessions_upper"] = np.exp(forecast3.yhat_upper).round()

forecast3[(forecast3.ds > "3-5-2017") & (forecast3.ds < "4-1-2017")][
    ["ds", "yhat", "Sessions_lower", "Sessions", "Sessions_upper"]]
| | ds | yhat | Sessions_lower | Sessions | Sessions_upper |
|---|---|---|---|---|---|
| 892 | 2017-03-06 | 7.845280 | 1432.0 | 2554.0 | 4449.0 |
| 893 | 2017-03-07 | 8.087120 | 1795.0 | 3252.0 | 5714.0 |
| 894 | 2017-03-08 | 7.578796 | 1142.0 | 1956.0 | 3402.0 |
| 895 | 2017-03-09 | 7.556725 | 1079.0 | 1914.0 | 3367.0 |
| 896 | 2017-03-10 | 7.415903 | 917.0 | 1662.0 | 2843.0 |
| 897 | 2017-03-11 | 6.796987 | 483.0 | 895.0 | 1587.0 |
| 898 | 2017-03-12 | 6.627355 | 417.0 | 755.0 | 1267.0 |
| 899 | 2017-03-13 | 7.240586 | 811.0 | 1395.0 | 2341.0 |
The model passes the intuitive test in that there is a big spike anticipated with the publishing of this article. The upper and lower bounds represent a fairly large range but for the purposes of this forecast, that is likely acceptable.
To keep me honest, you can see all of the values in the github notebook.
Final Thoughts
It is always interesting to get insights into the ways big companies use various open source tools in their business. I am impressed with the functionality that Facebook has given us with prophet. The API is relatively simple and, since it uses the standard pandas dataframe and matplotlib for displaying the data, it fits very easily into the python data science workflow. There is a lot of recent github activity for this library, so I expect it to get more useful and powerful over the months ahead.
As Yogi Berra said, “It’s tough to make predictions, especially about the future.” I think this library is going to be very useful for people trying to improve their forecasting approaches. I will be interested to see how well this particular forecast works on this site’s data. Stay tuned for an update where I will compare the prediction against the actuals and we will see what insight can be gained.
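When the actuals do come in, one simple way to score a forecast like this is mean absolute percentage error on the back-transformed session counts. This sketch is not from the original article, and the actual/predicted numbers below are hypothetical:

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

# Hypothetical actual vs. predicted daily session counts
actual = [2600, 3100, 2000]
predicted = [2554, 3252, 1956]
error = mape(actual, predicted)
```

One caveat with MAPE is that it blows up when actual values are near zero, so for low-traffic days a metric like MAE on the log scale may be a better choice.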