Introduction
Several months ago, I participated in my first crowd-sourced Data Science competition in the Twin Cities, run by Analyze This!. In my previous post, I described the benefits of working through the competition and how much I enjoyed the process. I just completed the second challenge and had another great experience that I want to share to (hopefully) encourage others to try these types of practical challenges to build their Data Science/Analytics skills.
In this second challenge, I felt much more comfortable with the actual process of cleaning the data, exploring it, and building and testing models. I found that the python tools continue to serve me well. However, I also identified a lot of things that I need to do better in future challenges or projects in order to be more systematic about my process. I am curious whether the broader community has tips or tricks they can share related to some of the items I cover below. I will also highlight a few of the useful python tools I used throughout the process. This post is focused more on the process and the python tools for Data Science than on a complete code walkthrough.
Background
As mentioned in my previous post, Analyze This! is an organization dedicated to raising awareness of the power of Data Science and to showing the local business community the capabilities that Data Science can bring to their organizations. In order to accomplish this mission, Analyze This! hosts friendly competitions and monthly educational sessions on various Data Science topics.
This specific competition focused on predicting 2015 Major League Baseball Fanduel points. A local company provided ~36,000 rows of data to be used in the analysis. The objective was to use the 116 measures to build a model to predict the actual points a hitter would get in a Fanduel fantasy game. Approximately 10 teams of 3-5 people each participated in the challenge and the top 4 presented at SportCon. I was very proud to be a member of the team that made the final 4 cut and presented at SportCon.
Observations
Going into this challenge, I wanted to leverage the experience from the last one and focus on building a few specific skills. I specifically wanted to spend more time on the exploratory analysis in order to more thoughtfully construct my models. In addition, I wanted to actually build out and try the models on my own. My past experience was very ad-hoc; I wanted this process to be a little more methodical and logical.
Leverage Standards
About a year ago, I took an introductory Business Analytics class which used the book Data Science for Business (Amazon Referral) by Foster Provost and Tom Fawcett as one of the primary textbooks for the course. As I have spent more time working on simple Data Science projects, I have really come to appreciate the insights and perspectives from this book.
In the future, I would like to do a more in-depth review of this book but for the purposes of this article, I used it as a reference to inform the basic process I wanted to follow for the project. Not surprisingly, this book mentions that there is an established methodology for Data Mining/Analysis called the “Cross Industry Standard Process for Data Mining” aka CRISP-DM. Here is a simple graphic showing the various phases:
This process matches my past experience in that it is very iterative as you explore the potential solutions. I plan to continue to use it as a model for approaching data analysis problems.
Business and Data Understanding
For this particular challenge, there were a lot of interesting aspects to the “business” and “data” understanding. From a personal perspective, I was familiar with baseball as a casual fan but did not have any in-depth experience with Fanduel so one of the first things I had to do was learn more about how scores were generated for a given game.
Beyond a basic understanding of the problem, it was a bit of a challenge to interpret some of the various measures, understand how they were calculated, and figure out what they actually represented. It was clear as we went through the final presentations that some groups understood the intricacies of the data in much more detail than others. Interestingly, an in-depth understanding of each data element was not required to actually “win” the competition.
Finally, this phase of the process would typically involve more thought around what data elements to capture. The structure of this specific challenge made that a non-issue since all data was provided and we were not allowed to augment it with other data sources.
Data Preparation
For this particular problem, the data was relatively clean and easily read in via Excel or csv. However, there were three components of the data cleaning that impacted the final model:
- Handling missing data
- Encoding categorical data
- Scaling data
As I worked through the problem, it was clear that managing these three factors required quite a bit of intuition and trial and error to figure out the best approach.
I am generally aware of the options for handling missing data, but I did not have a good intuition for when to apply the various approaches (the short pandas sketch after this list shows each option):
- When is it better to replace a missing value with a numerical substitute like mean, median or mode?
- When should a dummy value like NaN or -1 be used?
- When should the data just be dropped?
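To make those trade-offs concrete, here is a minimal pandas sketch of the three options, using made-up column names rather than the actual competition measures:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the hitter data for illustration only
df = pd.DataFrame({
    "batting_avg": [0.250, np.nan, 0.310, 0.275],
    "home_runs": [10, 25, np.nan, 7],
})

# Option 1: replace missing values with a summary statistic (mean shown here)
df["batting_avg_filled"] = df["batting_avg"].fillna(df["batting_avg"].mean())

# Option 2: use a dummy/sentinel value so "missing" becomes its own signal
df["home_runs_filled"] = df["home_runs"].fillna(-1)

# Option 3: drop the rows with missing values entirely
df_dropped = df.dropna(subset=["batting_avg", "home_runs"])
```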
Categorical data posed somewhat similar challenges. There were approximately 16 categorical variables that could be encoded in several ways (illustrated in the snippet after this list):
- Binary (Day/Night)
- Numerical range (H-M-L converted to 3-2-1)
- One-hot encoding (a separate column for each value)
- Excluded from the model
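Here is a quick sketch of what those encoding options look like in pandas, again with hypothetical column names:

```python
import pandas as pd

# Hypothetical categorical columns for illustration only
df = pd.DataFrame({
    "game_time": ["Day", "Night", "Night", "Day"],
    "rating": ["H", "M", "L", "M"],
    "team": ["MIN", "CHC", "BOS", "MIN"],
})

# Binary: map a two-value column to 0/1
df["game_time_enc"] = df["game_time"].map({"Day": 0, "Night": 1})

# Numerical range: treat H-M-L as the ordered values 3-2-1
df["rating_enc"] = df["rating"].map({"H": 3, "M": 2, "L": 1})

# One-hot: create a separate 0/1 column for each value
df = pd.get_dummies(df, columns=["team"], prefix="team")

# Excluding a column simply means not passing it to the model at all
```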
Finally, the data included many measures with values < 1 as well as measures > 1000. Depending on the model, these differences in scale could over-emphasize some results at the expense of others. Fortunately, scikit-learn has options for mitigating this, but how do you know when to use which option? In my case, I stuck with RobustScaler as my go-to function. This may or may not be the right approach.
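For reference, this is roughly how the scaling looks; StandardScaler is shown alongside RobustScaler only as a point of comparison, not as something I settled on:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Two hypothetical measures on very different scales
X = np.array([[0.25, 1500.0],
              [0.31, 2200.0],
              [0.27,  900.0],
              [0.29, 8000.0]])

# RobustScaler centers on the median and scales by the IQR,
# so a handful of extreme values do not dominate the result
X_robust = RobustScaler().fit_transform(X)

# StandardScaler (mean/standard deviation) is the more common default
X_standard = StandardScaler().fit_transform(X)
```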
The challenge with all these options is that I could not figure out a good systematic way to evaluate each of these data preparation steps and how they impacted the model. The entire process felt like a lot of trial and error.
Ultimately, I believe this is just part of the process but I am interested in understanding how to systematically approach these types of data preparation steps in a methodical manner.
Modeling and Evaluation
For modeling, I used the standard scikit-learn tools augmented with TPOT and ultimately used XGBoost as the model of choice.
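Roughly speaking, the workflow looked something like the sketch below. The data set here is synthetic and the parameter values are chosen purely for illustration, not the settings I actually used:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from tpot import TPOTRegressor
from xgboost import XGBRegressor

# Synthetic data standing in for the prepared baseball features and Fanduel points
X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Let TPOT search over candidate pipelines and export the best one it finds
tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
tpot.export("tpot_baseball_pipeline.py")

# A plain XGBoost regressor for comparison
xgb_model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
xgb_model.fit(X_train, y_train)
print(xgb_model.score(X_test, y_test))
```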
In a similar vein to the challenges with data prep, I struggled to figure out how to choose which model worked best. The data set was not tremendously large but some of the modeling approaches could take several minutes to run. By the time I factored in all of the possible options of data prep + model selection + parameter tuning, it was very easy to get lost in the process.
Scikit-learn has capabilities to tune hyper-parameters, which is helpful. TPOT can also be a great tool for trying a bunch of different approaches. However, these tools don’t always help with the upstream processes of data prep and feature engineering. I plan to investigate more options in this area in future challenges.
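As one example of the kind of hyper-parameter tuning scikit-learn provides, a GridSearchCV over a small XGBoost parameter grid looks something like this (again, synthetic data and illustrative parameter ranges):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Synthetic stand-in for the prepared features
X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)

# Illustrative parameter grid -- not the values I actually settled on
param_grid = {
    "max_depth": [3, 4, 6],
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(XGBRegressor(), param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```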
Tool Sets
In this particular challenge, most groups used either R or python for their solution. I found it interesting that python seemed to be the dominant tool and that most people used the standard python Data Science stack. However, even though everyone used similar tools and processes, we still came up with different approaches to the solution.
I used Jupyter Notebooks quite extensively for my analysis but realized that I need to re-think how to organize them. As I iterated through the various solutions, I started to spend more time struggling to find which notebook contained a certain piece of code I needed. Sorting and searching through the various notebooks is very limited since the notebook name is all that is displayed on the notebook index.
One of my biggest complaints with Jupyter notebooks is that they don’t lend themselves to standard version control the way a standalone python script does. Obviously, storing a notebook in git or mercurial is possible, but it is not very friendly for diff viewing. I recently learned about the nbdime project, which looks very interesting and which I may check out next time.
Speaking of notebooks, I found a lot of useful examples of python code in the Allstate Kaggle Competition kernels. That competition’s data set lent itself to analysis approaches that also worked well for the baseball data, so I used a lot of code snippets and ideas from those kernels. I encourage people to check out all of the kernels that are available on Kaggle. They do a nice job of showing how to approach problems from multiple different perspectives.
Another project I will likely use going forward is the Cookiecutter template for Data Science. The basic structure may be a little overkill for a small project, but I like the idea of enforcing some consistency in the process. Having looked through the template and the basic thought process behind its development, I think it makes a lot of sense and I look forward to trying it in the future.
Another tool that I used in the project was mlxtend, which contains a set of tools that are useful for “day-to-day data science tasks.” I particularly liked the ease of creating a visual plot of a confusion matrix. There are several other useful functions in this package that work quite well with scikit-learn. It’s well worth investigating all the functionality.
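For example, plotting a confusion matrix with mlxtend only takes a couple of lines (the matrix values below are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_confusion_matrix

# Made-up confusion matrix values for illustration
conf_mat = np.array([[50, 10],
                     [ 7, 33]])

fig, ax = plot_confusion_matrix(conf_mat=conf_mat,
                                show_absolute=True,
                                show_normed=True)
plt.show()
```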
Finally, this dataset did have a lot of missing data. I enjoyed using the missingno tool to get a quick visualization of where the missing data was and how prevalent the missing values were. This is a very powerful library for visualizing missing data in a pandas DataFrame.
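Here is a minimal example of the two missingno views I leaned on, using a small hypothetical frame with gaps in place of the real data:

```python
import numpy as np
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt

# Small hypothetical frame standing in for the competition data
df = pd.DataFrame({
    "batting_avg": [0.250, np.nan, 0.310, np.nan, 0.275],
    "opp_pitcher_era": [3.2, 4.1, np.nan, 2.9, 3.8],
    "park_factor": [1.05, np.nan, np.nan, 0.98, 1.10],
})

# Matrix view shows where the gaps fall, row by row
msno.matrix(df)

# Bar chart shows how complete each column is
msno.bar(df)
plt.show()
```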
Conclusion
I have found that the real-life process of analyzing and working through a Data Science challenge is one of the best ways to build up my skills and experience. There are many resources on the web that explain how to use tools like pandas, scikit-learn, XGBoost, etc., but using the tools is just one piece of the puzzle. The real value is knowing how to smartly apply these tools and intuitively understanding how different choices will impact the rest of the downstream processes. This knowledge can only be gained by doing something over and over. Data Science challenges that focus on real-world issues are tremendously useful opportunities to learn and build skills.
Thanks again to all the people that make Analyze This! possible. I feel very fortunate that this type of event is available in my home town and hopefully others can replicate it in their own geographies.