I had the great honor and pleasure of presenting the first tutorial at AnacondaCon 2018, on machine learning with scikit-learn. I spoke to a full room of about 120 enthusiastic data scientists and aspiring data scientists. I would like to thank my colleagues at Anaconda, Inc. who did such a wonderful job of organizing this conference and, even more, all the attendees of my session and of the numerous other tutorials and talks. AnacondaCon is a warm and intellectually stimulating place. Throughout the presentations, question periods, and in the “hallway track,” attendees could make personal and professional connections and learn new and useful things about data science.

The attendees of my session were a very nice group of learners and experts. But I decided I wanted to know even more about these people than I could find by looking at their faces and responding to their questions. So I asked them to complete a slightly whimsical form about three hours into my tutorial. Just who are these people, and what can scikit-learn tell us about which of them benefitted most from the tutorial?

In the interest of open data science, the collection of answers given by attendees is available under a CC-BY-SA 4.0 license. Please credit it as “© Anaconda Inc. 2018” if you use the anonymized data, available as a CSV file. If you wish to code along with the rest of this post, save it locally as data/Learning about Humans learning ML.csv (or adjust your code as needed).

It turns out that data never arrives at the workstation of a data scientist quite clean, no matter how much validation is attempted in the collection process. The respondent data is no exception. Using the familiar facilities in Pandas, we can improve the initial data before applying scikit-learn to it. In particular, I failed to validate the field “Years of post-secondary education (e.g. BA=4; Ph.D.=10)” as a required integer, so the code below strips any text up to an equals sign before converting the values to integers. Also, the “Timestamp” values added by the form interface are gratuitous for these purposes: they are all within a couple of minutes of each other, and their order or spacing is unlikely to have any value to our models.

In [1]:

import pandas as pd
fname = "data/Learning about Humans learning ML.csv"
humans = pd.read_csv(fname)

humans.drop('Timestamp', axis=1, inplace=True)
humans['Education'] = (humans[
    'Years of post-secondary education (e.g. BA=4; Ph.D.=10)']
                       .str.replace(r'.*=', '', regex=True)
                       .astype(int))
humans.drop('Years of post-secondary education (e.g. BA=4; Ph.D.=10)', 
            axis=1, inplace=True)
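
To make the cleaning step concrete, here is a minimal sketch on a handful of made-up answers (the sample values are hypothetical; the survey's actual malformed entries are not shown in this post). It also shows `pd.to_numeric` with `errors="coerce"` as one possible way to survive entries that are not numbers at all, which is an assumption beyond what the cell above does.

```python
import pandas as pd

# Hypothetical free-text answers of the kind an unvalidated field might collect
raw = pd.Series(["4", "BA=4", "Ph.D.=10", "-10", "n/a"])

# Same pattern as the cell above: drop everything up to and including '='
stripped = raw.str.replace(r".*=", "", regex=True)
print(stripped.tolist())

# Optional hardening (an assumption, not part of the original cell):
# non-numeric leftovers become NaN instead of raising an error in astype(int)
numeric = pd.to_numeric(stripped, errors="coerce")
print(numeric.tolist())
```

Note that the pattern is greedy, so it removes text up to the last equals sign in an answer.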

After slight data massaging we can get a picture of the attendee group (most likely a few failed to complete the form, but this should be most of them).

In [2]:

humans.describe(include=['object', 'int', 'float'])

Out[2]:

| | Favorite programming language | Favorite Monty Python movie | Years of Python experience | Have used Scikit-learn | Age | In the Terminator franchise, did you root for the humans or the machines? | Which is the better game? | How successful has this tutorial been so far? | Education |
|---|---|---|---|---|---|---|---|---|---|
| count | 116 | 116 | 116.000000 | 116 | 116.000000 | 116 | 116 | 116.000000 | 116.000000 |
| unique | 7 | 6 | NaN | 2 | NaN | 2 | 4 | NaN | NaN |
| top | Python | Monty Python and the Holy Grail | NaN | Yep! | NaN | Team Humans! | Chess | NaN | NaN |
| freq | 94 | 57 | NaN | 80 | NaN | 88 | 69 | NaN | NaN |
| mean | NaN | NaN | 4.195690 | NaN | 36.586207 | NaN | NaN | 7.051724 | 6.172414 |
| std | NaN | NaN | 5.136187 | NaN | 13.260644 | NaN | NaN | 2.229622 | 3.467303 |
| min | NaN | NaN | 0.000000 | NaN | 3.000000 | NaN | NaN | 1.000000 | -10.000000 |
| 25% | NaN | NaN | 1.000000 | NaN | 28.000000 | NaN | NaN | 5.000000 | 4.000000 |
| 50% | NaN | NaN | 3.000000 | NaN | 34.000000 | NaN | NaN | 8.000000 | 6.000000 |
| 75% | NaN | NaN | 5.000000 | NaN | 43.250000 | NaN | NaN | 9.000000 | 8.000000 |
| max | NaN | NaN | 27.000000 | NaN | 99.000000 | NaN | NaN | 10.000000 | 23.000000 |

You can look into other features of the data yourself, but in the summary view a few data quality issues jump out. This is, again, almost universal to real-world datasets. It seems dubious that two 3-year-olds were in attendance. Perhaps a couple of 30-somethings mistyped their ages. A 99-year-old is possible, but seems more likely to be a placeholder value used by some respondent. While the description of what is meant by the integer “Education” was probably underspecified, it still feels like the -10 years of education is more likely to be a data entry problem than an intended indicator.
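
Suspicious values like these can be surfaced with ordinary boolean masks. The sketch below uses a tiny made-up frame, and the cutoffs (under 16, over 90, negative education) are arbitrary assumptions chosen only for illustration. It flags rows for inspection and does not drop anything.

```python
import pandas as pd

# Made-up rows echoing the oddities visible in the summary above
sample = pd.DataFrame({
    "Age": [3, 34, 99, 28, 3],
    "Education": [4, -10, 6, 8, 23],
})

# Hypothetical plausibility rules; adjust to taste
suspect = (
    (sample["Age"] < 16)
    | (sample["Age"] > 90)
    | (sample["Education"] < 0)
)

flagged = sample[suspect]
print(flagged)
```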

But the data we have is the data we must analyze.

Before we go further, it is usually a good idea to use one-hot encoding of categorical data for machine learning purposes. Most likely this makes less difference for the decision tree and random forest classifiers used in this blog post than it might for other classifiers and regressors, but it rarely hurts. For this post, the encoding is performed with pandas.get_dummies(), but you could equally use sklearn.preprocessing.OneHotEncoder (or sklearn.preprocessing.LabelBinarizer, applied one column at a time) to accomplish the same goal.

In [3]:

human_dummies = pd.get_dummies(humans)
list(human_dummies.columns)

Out[3]:

['Years of Python experience',
 'Age',
 'How successful has this tutorial been so far?',
 'Education',
 'Favorite programming language_C++',
 'Favorite programming language_JavaScript',
 'Favorite programming language_MATLAB',
 'Favorite programming language_Python',
 'Favorite programming language_R',
 'Favorite programming language_Scala',
 'Favorite programming language_Whitespace',
 'Favorite Monty Python movie_And Now for Something Completely Different',
 'Favorite Monty Python movie_Monty Python Live at the Hollywood Bowl',
 'Favorite Monty Python movie_Monty Python and the Holy Grail',
 "Favorite Monty Python movie_Monty Python's Life of Brian",
 "Favorite Monty Python movie_Monty Python's The Meaning of Life",
 'Favorite Monty Python movie_Time Bandits',
 'Have used Scikit-learn_Nope.',
 'Have used Scikit-learn_Yep!',
 'In the Terminator franchise, did you root for the humans or the machines?_Skynet is a WINNER!',
 'In the Terminator franchise, did you root for the humans or the machines?_Team Humans!',
 'Which is the better game?_Chess',
 'Which is the better game?_Go',
 'Which is the better game?_Longing for the sweet release of death',
 'Which is the better game?_Tic-tac-toe (Br. Eng. "noughts and crosses")']
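
For readers who prefer to stay inside scikit-learn, the same kind of encoding can be sketched on a toy column. The column name and values here are illustrative, and `OneHotEncoder` is swapped in as an alternative to `get_dummies()`; it is not what this post uses.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (illustrative values only)
toy = pd.DataFrame({"game": ["Chess", "Go", "Chess", "Tic-tac-toe"]})

# pandas route: one indicator column per category, named <column>_<value>
dummies = pd.get_dummies(toy, dtype=int)
print(list(dummies.columns))

# scikit-learn route: returns a sparse matrix by default, so densify it
encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[["game"]]).toarray()
print(encoder.categories_[0])
print(encoded)
```

One practical difference is that a fitted encoder remembers its categories and can be reused on new data, while `get_dummies()` only sees the values present in the frame it is given.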

It is time to use scikit-learn to model the respondents. In particular, we would like to know whether other features of attendees are a good predictor of how successful they found the tutorial. A very common pattern you will see in machine learning based on starting DataFrames is to drop one column for the X features, and keep that one for the y target.

In my analysis, I felt a binary measure of success was more relevant than a scalar measure initially collected as a 1-10 scale. Moreover, if the target is simplified this way, it becomes appropriate to use a classification algorithm as opposed to a regression algorithm. It would be a mistake to treat the 1-10 scale as a categorical consisting of 10 independent labels—there is something inherently ordinal about these labels, although scikit-learn will happily calculate models as if there is not. This is a place where subject matter judgment is needed by a data scientist.

As the summary statistics above show (a mean of about 7.05 and a median of 8), a threshold of >=8 divides the success scores roughly evenly into “Yes” and “No” categories.

In [4]:

X = human_dummies.drop("How successful has this tutorial been so far?", axis=1)
y = human_dummies["How successful has this tutorial been so far?"] >= 8
y.head()

Out[4]:

0     True
1     True
2     True
3    False
4     True
Name: How successful has this tutorial been so far?, dtype: bool
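
Whenever a scale is collapsed into a binary target, it is worth checking how balanced the two classes come out. A minimal sketch on made-up scores (not the survey data):

```python
import pandas as pd

# Made-up 1-10 ratings
scores = pd.Series([9, 8, 10, 5, 8, 7, 3, 9])

# Same rule as the cell above: 8 or higher counts as a success
target = scores >= 8

# The mean of a boolean Series is the fraction of True values
print(target.sum(), "of", len(target), "are successes")
print(target.mean())
```

Running the same two lines on `y` shows the actual split for the respondents.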

In [5]:

X.iloc[:5,:5]

Out[5]:

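| | Years of Python experience | Age | Education | Favorite programming language_C++ | Favorite programming language_JavaScript |
|---|---|---|---|---|---|
| 0 | 20.0 | 53 | 12 | 0 | 0 |
| 1 | 4.0 | 33 | 5 | 0 | 0 |
| 2 | 1.0 | 31 | 10 | 0 | 0 |
| 3 | 12.0 | 60 | 10 | 0 | 0 |
| 4 | 7.0 | 48 | 6 | 0 | 0 |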

While using sklearn.model_selection.StratifiedKFold is a more rigorous way of evaluating a model, for quick-and-dirty experimentation, using train_test_split is usually the easiest approach.

In [6]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
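
For comparison, the more rigorous route mentioned above might look like the following sketch. It runs on synthetic data from `make_classification` (an assumption, so that the block is self-contained); with the survey loaded you would pass `X` and `y` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same number of rows as the survey
X_demo, y_demo = make_classification(n_samples=116, n_features=10,
                                     random_state=0)

# Five folds, each preserving the overall class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(max_depth=7, random_state=0),
                         X_demo, y_demo, cv=cv)

# One accuracy per fold; the spread hints at how noisy a single split is
print(scores)
print(scores.mean(), scores.std())
```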

An interesting thing happened when I tried a few models. While RandomForestClassifier is incredibly powerful, and very often produces the most accurate predictions, for this particular data a single DecisionTreeClassifier does better. Readers might want to think about why this turns out to be true and/or experiment with hyperparameters to find a more definite explanation; other classifiers might perform better still, of course.

I will note that choosing the best max_depth for decision tree family algorithms is largely a matter of trial and error. You can search the space in a nice high level API using sklearn.model_selection.GridSearchCV, but it often suffices to use a basic Python loop like:

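from sklearn.tree import DecisionTreeClassifier

for n in range(1, 20):
    # Fix random_state so that differences come from max_depth alone
    tree = DecisionTreeClassifier(max_depth=n, random_state=0)
    tree.fit(X_train, y_train)
    print(n, tree.score(X_test, y_test))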
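
The higher-level alternative mentioned above could be sketched as follows. Synthetic data stands in for the survey here (an assumption for self-containment), and the grid simply mirrors the loop's range of depths.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=116, n_features=10,
                                     random_state=0)

# Try every depth from 1 to 19, scoring each with 5-fold cross-validation
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": list(range(1, 20))},
                      cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Unlike the loop, the grid search scores each depth by cross-validation on the data it is given, so the held-out test set is not used to pick the hyperparameter.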

In [7]:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=5, random_state=0)

rf.fit(X_train, y_train)
rf.score(X_test, y_test)

Out[7]:

0.4482758620689655

In [8]:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=7, random_state=0)
tree.fit(X_train, y_train)
tree.score(X_test, y_test)

Out[8]:

0.5862068965517241
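
A caution when comparing these two numbers: the default test split holds out a quarter of the 116 rows, which is 29 observations, so the gap between 0.448 and 0.586 amounts to four respondents. One way to see how much of that is luck is to repeat the split with different seeds. The sketch below does this on synthetic data (an assumption; substitute `X` and `y` to try it on the survey).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same number of rows as the survey
X_demo, y_demo = make_classification(n_samples=116, n_features=10,
                                     random_state=0)

results = {"forest": [], "tree": []}
for seed in range(10):
    # A different train/test partition on every pass
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                              random_state=seed)
    forest = RandomForestClassifier(n_estimators=5, random_state=0)
    tree_demo = DecisionTreeClassifier(max_depth=7, random_state=0)
    results["forest"].append(forest.fit(X_tr, y_tr).score(X_te, y_te))
    results["tree"].append(tree_demo.fit(X_tr, y_tr).score(X_te, y_te))

# Mean and spread of test accuracy across the ten splits
for name, vals in results.items():
    print(name, np.mean(vals), np.std(vals))
```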

Best practice in machine learning is to keep training and testing sets separate. In the end, with sufficiently large datasets, it makes little difference in the trained model parameters whether and how train/test observations are separated. But this is a small dataset, and also reflects a somewhat unique event (many students will learn about machine learning through many channels, but this particular tutorial, with a particular instructor, at a particular conference, will not necessarily generalize to all those channels).

Therefore, simply to see what the “best possible” decision tree for this dataset looks like, I deliberately overfit by training on all the observations. The score below is accuracy on the training data itself, not an estimate of how well the model generalizes.

In [9]:

tree = DecisionTreeClassifier(max_depth=7, random_state=0)
tree.fit(X, y)
tree.score(X, y)

Out[9]:

0.8103448275862069

We can easily look at what features are most important in this trained model, and also use a lovely utility method in sklearn.tree to display the entire tree and its decision cuts.

In [10]:

%matplotlib inline
pd.Series(tree.feature_importances_, index=X.columns).plot.barh(figsize=(18,7));
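
With two dozen one-hot columns, many of them unused by the tree, a sorted view of the importances can be easier to read than the full bar chart. A minimal sketch on synthetic data (the feature names are placeholders):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with named columns
X_arr, y_demo = make_classification(n_samples=116, n_features=6,
                                    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

tree_demo = DecisionTreeClassifier(max_depth=7, random_state=0)
tree_demo.fit(X_demo, y_demo)

# Importances are normalized to sum to 1; show the largest few first
importances = pd.Series(tree_demo.feature_importances_, index=X_demo.columns)
top = importances.sort_values(ascending=False).head(3)
print(top)
```

Calling `.sort_values()` before `.plot.barh()` in the cell above gives the same ordering in the chart.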

In [11]:

import os, sys, subprocess
from sklearn.tree import export_graphviz
from IPython.display import Image

os.makedirs('tmp', exist_ok=True)  # make sure the output directory exists
export_graphviz(tree, feature_names=list(X.columns), class_names=['failure','success'],
                out_file='tmp/ml-good.dot', impurity=False, filled=True)
# Assumes the Graphviz dot binary is installed in the active environment
subprocess.check_call([sys.prefix+'/bin/dot','-Tpng','tmp/ml-good.dot',
                       '-o','tmp/ml-good.png'])
Image('tmp/ml-good.png')

Out[11]:

In the diagram, blue branches reflect those respondents who found the tutorial more successful, and orange branches those who found it less so. The saturation of each box reflects how strongly one class dominates that node: the purer the node, the deeper the color.
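
If the Graphviz `dot` binary is not available, `sklearn.tree.export_text` prints the same decision cuts as indented text. A minimal sketch on a small synthetic tree (the data and feature names are placeholders, not the survey):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic problem so the printed tree stays short
X_demo, y_demo = make_classification(n_samples=116, n_features=4,
                                     random_state=0)
names = [f"feature_{i}" for i in range(4)]

tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_demo.fit(X_demo, y_demo)

# One line per split or leaf, indented by depth
rules = export_text(tree_demo, feature_names=names)
print(rules)
```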

As seems obvious in retrospect, the fans of And Now for Something Completely Different really did not like my tutorial very much. I probably should have provided a disclaimer at the beginning of the session. Years of Python experience is a slightly more important feature, but it follows an oddly stratified pattern wherein several different ranges of years show positive or negative effects—it’s not linear.

And of course, Time Bandits was not a Monty Python film at all: it is a Terry Gilliam film that happened to feature several members of the Monty Python troupe. What on earth were those respondents thinking?!

In [12]:

first_film = human_dummies[
    'Favorite Monty Python movie_And Now for Something Completely Different']
human_dummies[first_film==1].loc[:,"How successful has this tutorial been so far?"]

Out[12]:

11     7
14     4
46     7
50     7
67     7
71     4
72     5
74     5
77     7
86     6
97     5
108    6
110    7
114    1
Name: How successful has this tutorial been so far?, dtype: int64
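
The fourteen scores listed above can be summarized in a couple of lines. The values below are copied from that output, so the block runs without the CSV.

```python
import pandas as pd

# Success scores of the "And Now for Something Completely Different" fans,
# copied from the output above
fan_scores = pd.Series([7, 4, 7, 7, 7, 4, 5, 5, 7, 6, 5, 6, 7, 1])

print(len(fan_scores))              # 14
print(round(fan_scores.mean(), 2))  # 5.57
print((fan_scores >= 8).sum())      # 0
```

None of these respondents reached the >=8 threshold, and their mean of about 5.57 sits well below the overall mean of about 7.05, which is why the tree finds this column so useful.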

Credit for this blog title goes to Dr. Timmy Churches, University of Woolloomooloo.