Jun 17, 2018

Machines Learning about Humans Learning about Machines Learning

Anaconda Team

I had the great honor and pleasure of presenting the first tutorial at AnacondaCon 2018, on machine learning with scikit-learn. I spoke to a full room of about 120 enthusiastic data scientists and aspiring data scientists. I would like to thank my colleagues at Anaconda, Inc. who did such a wonderful job of organizing this conference, and, even more, to thank all the attendees of my session and the numerous other tutorials and talks. AnacondaCon is a warm and intellectually stimulating place. Throughout the presentations, question periods, and in the “hallway track,” attendees could make personal and professional connections and learn new and useful things about data science.

The attendees of my session were a very nice group of learners and experts. But I decided I wanted to know even more about these people than I could find by looking at their faces and responding to their questions. So I asked them to complete a slightly whimsical form at about 3 hours into my tutorial. Just who are these people, and what can scikit-learn tell us about which of them benefitted most from the tutorial?

In the interest of open data science, the collection of answers given by attendees is available under a CC-BY-SA 4.0 license. Please credit it as “© Anaconda Inc. 2018” if you use the anonymized data, available as a CSV file. If you wish to code along with the rest of this post, save it locally as data/Learning about Humans learning ML.csv (or adjust your code as needed).

It turns out that data never arrives at the workstation of a data scientist quite clean, no matter how much validation is attempted in the collection process. The respondent data is no exception. Using the familiar facilities in Pandas, we can improve the initial data before applying scikit-learn to it. In particular, I failed to validate the field “Years of post-secondary education (e.g. BA=4; Ph.D.=10)” as a required integer, so some answers arrived as text such as “Ph.D.=10” rather than as a bare number. Also, the “Timestamp” values added by the form interface are gratuitous for these purposes. They are all within a couple of minutes of each other, and their order or spacing is unlikely to have any value to our models.

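In [1]:

<pre class="language-python"><code class="language-python">import pandas as pd

fname = "data/Learning about Humans learning ML.csv"
humans = pd.read_csv(fname)

# The form timestamps carry no useful signal for the models
humans.drop('Timestamp', axis=1, inplace=True)

# Keep only what follows "=" (e.g. "Ph.D.=10" becomes "10"), then cast to int.
# regex=True is spelled out because newer pandas treats the pattern literally by default.
humans['Education'] = (humans[
    'Years of post-secondary education (e.g. BA=4; Ph.D.=10)']
    .str.replace(r'.*=', '', regex=True)
    .astype(int))
humans.drop('Years of post-secondary education (e.g. BA=4; Ph.D.=10)',
            axis=1, inplace=True)</code></pre>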
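The cast to `int` above works only if every answer reduces to a clean integer. As a minimal sketch of a more forgiving variant, the snippet below uses `pd.to_numeric` with `errors="coerce"` so that unparseable answers become `NaN` instead of raising. The sample answers are hypothetical and are not taken from the survey.

```python
import pandas as pd

# Hypothetical raw answers, imitating what a free-text form field might collect
raw = pd.Series(["4", "Ph.D.=10", "BA=4", "six", None, "-10"])

# Keep whatever follows the last "=", then coerce; unparseable entries become NaN
cleaned = pd.to_numeric(
    raw.str.replace(r".*=", "", regex=True), errors="coerce"
)

print(cleaned.tolist())
print("unparseable answers:", int(cleaned.isna().sum()))
```

Whether to coerce or to fail loudly is a judgment call; coercion keeps the pipeline running but leaves missing values that must be handled before fitting a model.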

After slight data massaging we can get a picture of the attendee group (most likely a few failed to complete the form, but this should be most of them).

In [2]:

humans.describe(include=['object', 'int', 'float'])

Out[2]:

	Favorite programming language	Favorite Monty Python movie	Years of Python experience	Have used Scikit-learn	Age	In the Terminator franchise, did you root for the humans or the machines?	Which is the better game?	How successful has this tutorial been so far?	Education
count	116	116	116.000000	116	116.000000	116	116	116.000000	116.000000
unique	7	6	NaN	2	NaN	2	4	NaN	NaN
top	Python	Monty Python and the Holy Grail	NaN	Yep!	NaN	Team Humans!	Chess	NaN	NaN
freq	94	57	NaN	80	NaN	88	69	NaN	NaN
mean	NaN	NaN	4.195690	NaN	36.586207	NaN	NaN	7.051724	6.172414
std	NaN	NaN	5.136187	NaN	13.260644	NaN	NaN	2.229622	3.467303
min	NaN	NaN	0.000000	NaN	3.000000	NaN	NaN	1.000000	-10.000000
25%	NaN	NaN	1.000000	NaN	28.000000	NaN	NaN	5.000000	4.000000
50%	NaN	NaN	3.000000	NaN	34.000000	NaN	NaN	8.000000	6.000000
75%	NaN	NaN	5.000000	NaN	43.250000	NaN	NaN	9.000000	8.000000
max	NaN	NaN	27.000000	NaN	99.000000	NaN	NaN	10.000000	23.000000

You can look into other features of the data yourself, but in the summary view a few data quality issues jump out. This is, again, almost universal to real-world datasets. It seems dubious that two 3-year-olds were in attendance. Perhaps a couple of 30-somethings mistyped their ages. A 99-year-old is possible, but seems more likely to be a placeholder value used by some respondent. While the description of what is meant by the integer “Education” was probably underspecified, it still feels like the -10 years of education is more likely to be a data entry problem than an intended indicator.
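A quick way to see which rows carry such values, without changing any of them, is to build boolean masks. This is only a sketch: the rows are hypothetical stand-ins, and the plausibility bounds (ages 16 to 90, non-negative education) are assumptions chosen for illustration, not rules applied in this post.

```python
import pandas as pd

# Hypothetical rows that mimic the suspicious values in the summary above
sample = pd.DataFrame({
    "Age": [3, 34, 99, 28],
    "Education": [4, -10, 6, 8],
})

# Assumed plausibility bounds; illustrative only
suspect_age = ~sample["Age"].between(16, 90)
suspect_education = sample["Education"] < 0

# Flag the rows for inspection; nothing is dropped or altered
flagged = sample[suspect_age | suspect_education]
print(flagged)
```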

But the data we have is the data we must analyze.

Before we go further, it is usually a good idea to use one-hot encoding of categorical data for machine learning purposes. Most likely this makes less difference for the decision tree and random forest classifiers used in this blog post than it might for other classifiers and regressors, but it rarely hurts. For this post, the encoding is performed with <a href="https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html">pandas.get_dummies()</a>, but you could equally use <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html#sklearn.preprocessing.LabelBinarizer">sklearn.preprocessing.LabelBinarizer</a> to accomplish the same goal.

In [3]:

<pre class="language-python"><code>human_dummies = pd.get_dummies(humans)
list(human_dummies.columns)</code></pre>

Out[3]:

<pre class="language-python"><code>['Years of Python experience',
 'Age',
 'How successful has this tutorial been so far?',
 'Education',
 'Favorite programming language_C++',
 'Favorite programming language_JavaScript',
 'Favorite programming language_MATLAB',
 'Favorite programming language_Python',
 'Favorite programming language_R',
 'Favorite programming language_Scala',
 'Favorite programming language_Whitespace',
 'Favorite Monty Python movie_And Now for Something Completely Different',
 'Favorite Monty Python movie_Monty Python Live at the Hollywood Bowl',
 'Favorite Monty Python movie_Monty Python and the Holy Grail',
 "Favorite Monty Python movie_Monty Python's Life of Brian",
 "Favorite Monty Python movie_Monty Python's The Meaning of Life",
 'Favorite Monty Python movie_Time Bandits',
 'Have used Scikit-learn_Nope.',
 'Have used Scikit-learn_Yep!',
 'In the Terminator franchise, did you root for the humans or the machines?_Skynet is a WINNER!',
 'In the Terminator franchise, did you root for the humans or the machines?_Team Humans!',
 'Which is the better game?_Chess',
 'Which is the better game?_Go',
 'Which is the better game?_Longing for the sweet release of death',
 'Which is the better game?_Tic-tac-toe (Br. Eng. "noughts and crosses")']</code></pre>
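To make the comparison between the two encoders concrete, here is a small sketch that applies both to a single categorical column. The answers are hypothetical; only `pd.get_dummies` and `LabelBinarizer` themselves come from the text above.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical answers to one categorical question
games = pd.Series(["Chess", "Go", "Chess", "Tic-tac-toe"], name="game")

# pandas: one indicator column per category, named <prefix>_<category>
dummies = pd.get_dummies(games, prefix="game")

# scikit-learn: the same indicators as a NumPy array, with labels kept in .classes_
binarizer = LabelBinarizer()
encoded = binarizer.fit_transform(games)

print(list(dummies.columns))
print(list(binarizer.classes_))
print(encoded)
```

One difference worth knowing: when a column has exactly two categories, `LabelBinarizer` returns a single 0/1 column, whereas `get_dummies` still produces two columns unless `drop_first=True` is passed.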

It is time to use scikit-learn to model the respondents. In particular, we would like to know whether other features of attendees are a good predictor of how successful they found the tutorial. A very common pattern you will see in machine learning based on starting DataFrames is to drop one column for the X features, and keep that one for the y target.

In my analysis, I felt a binary measure of success was more relevant than a scalar measure initially collected as a 1-10 scale. Moreover, if the target is simplified this way, it becomes appropriate to use a classification algorithm as opposed to a regression algorithm. It would be a mistake to treat the 1-10 scale as a categorical consisting of 10 independent labels—there is something inherently ordinal about these labels, although scikit-learn will happily calculate models as if there is not. This is a place where subject matter judgment is needed by a data scientist.

The summary statistics above give a median success score of 8 and a mean of about 7.05, so a threshold of >=8 divides the data approximately evenly into “Yes” and “No” categories.

In [4]:

<pre class="language-python"><code>X = human_dummies.drop("How successful has this tutorial been so far?", axis=1)
y = human_dummies["How successful has this tutorial been so far?"] >= 8
y.head()</code></pre>

Out[4]:

<pre class="language-python"><code>0     True
1     True
2     True
3    False
4     True
Name: How successful has this tutorial been so far?, dtype: bool</code></pre>
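Before committing to a cutoff, it can help to check how balanced the resulting classes are. The sketch below does this for a few candidate thresholds; the scores are hypothetical stand-ins for the survey column, so the printed shares say nothing about the real data.

```python
import pandas as pd

# Hypothetical 1-10 scores standing in for the survey column
scores = pd.Series([9, 8, 10, 5, 8, 7, 3, 9, 6, 8], name="success")

# Share of respondents counted as "successful" at each candidate cutoff
for cutoff in (7, 8, 9):
    share = (scores >= cutoff).mean()
    print(cutoff, round(share, 2))

# The binary target at the chosen cutoff, and its class counts
y_demo = scores >= 8
print(y_demo.value_counts())
```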

In [5]:

X.iloc[:5,:5]

Out[5]:

	Years of Python experience	Age	Education
0	20.0	53	12
1	4.0	33	5
2	1.0	31	10
3	12.0	60	10
4	7.0	48	6

While using sklearn.model_selection.StratifiedKFold is a more rigorous way of evaluating a model, for quick-and-dirty experimentation, using train_test_split is usually the easiest approach.

In [6]:

<pre class="language-python"><code>from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)</code></pre>
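For comparison, the more rigorous route mentioned above might look like the following sketch. It runs on synthetic data from `make_classification` (116 rows, to match the number of respondents), and the choice of classifier, `max_depth=3`, and five folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey features and binary target
X_demo, y_demo = make_classification(n_samples=116, n_features=10, random_state=0)

# Each fold preserves the overall class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_demo, y_demo, cv=cv
)

print(scores)
print("mean accuracy:", scores.mean())
```

With only about a hundred rows, a single train/test split leaves very few test observations, so the spread across folds is often as informative as the mean.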

An interesting thing happened in trying a few models out. While RandomForestClassifier is incredibly powerful, and very often produces the most accurate predictions, for this particular data a single DecisionTreeClassifier does better. Readers might want to think about why this turns out to be true and/or experiment with hyperparameters to find a more definite explanation. Other classifiers might perform better still, of course.

I will note that choosing the best max_depth for decision tree family algorithms is largely a matter of trial and error. You can search the space in a nice high level API using sklearn.model_selection.GridSearchCV, but it often suffices to use a basic Python loop like:

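<pre class="language-python"><code>from sklearn.tree import DecisionTreeClassifier

# Try each depth and report accuracy on the held-out test set
for n in range(1, 20):
    tree = DecisionTreeClassifier(max_depth=n)
    tree.fit(X_train, y_train)
    print(n, tree.score(X_test, y_test))</code></pre>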
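The `GridSearchCV` version of the same search could be sketched as follows. It uses synthetic data so that it runs on its own; the five-fold setting and the fixed `random_state` are assumptions, not settings used in this post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey features and binary target
X_demo, y_demo = make_classification(n_samples=116, n_features=10, random_state=0)

# Search the same depth range as the loop, scored by 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 20))},
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Note the difference in what is being measured: the loop scores each depth against the test set, while the grid search scores by cross-validation within the data it is given, which leaves a separate test set untouched for a final check.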

In [7]:

<pre class="language-python"><code>from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=5, random_state=0)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)</code></pre>

Out[7]:

0.4482758620689655

In [8]:

<pre class="language-python"><code>from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=7, random_state=0)
tree.fit(X_train, y_train)
tree.score(X_test, y_test)</code></pre>

Out[8]:

0.5862068965517241

Best practice in machine learning is to keep training and testing sets separate. In the end, with sufficiently large datasets, it makes little difference in the trained model parameters whether and how train/test observations are separated. But this is a small dataset, and also reflects a somewhat unique event (many students will learn about machine learning through many channels, but this particular tutorial, with a particular instructor, at a particular conference, will not necessarily generalize to all those channels).

Therefore, in order to see simply what is the “best possible” decision tree for this dataset, I deliberately overfit by including all the observations in the model.

In [9]:

<pre class="language-python"><code>tree = DecisionTreeClassifier(max_depth=7, random_state=0)
tree.fit(X, y)
tree.score(X, y)</code></pre>

Out[9]:

<pre class="language-python"><code>0.8103448275862069</code></pre>

We can easily look at what features are most important in this trained model, and also use a lovely utility method in sklearn.tree to display the entire tree and its decision cuts.

In [10]:

%matplotlib inline
pd.Series(tree.feature_importances_, index=X.columns).plot.barh(figsize=(18,7));
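If a plot is not convenient, the same importances can be read as a sorted table. The sketch below fits a small tree on synthetic data; the column names `feature_0` through `feature_5` are hypothetical placeholders for the survey columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical column names
X_arr, y_demo = make_classification(n_samples=116, n_features=6, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

tree_demo = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Importances sum to 1; features never used in a split score exactly 0
importances = pd.Series(tree_demo.feature_importances_, index=X_demo.columns)
top = importances[importances > 0].sort_values(ascending=False)
print(top)
```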

In [11]:

<pre class="language-python"><code>from sklearn.tree import export_graphviz
import sys, subprocess
from IPython.display import Image

# Write the tree as a Graphviz .dot file, then render it to PNG with dot
export_graphviz(tree, feature_names=X.columns,
                class_names=['failure', 'success'],
                out_file='tmp/ml-good.dot',
                impurity=False, filled=True)
subprocess.check_call([sys.prefix + '/bin/dot', '-Tpng', 'tmp/ml-good.dot',
                       '-o', 'tmp/ml-good.png'])
Image('tmp/ml-good.png')</code></pre>

Out[11]:

In the diagram, blue branches reflect those respondents who found the tutorial more successful, and orange branches those who found it less so. The saturation of the displayed boxes reflects the strength of that decision branch.
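When Graphviz is not installed, a plain-text rendering of the decision cuts is one alternative. This sketch uses `sklearn.tree.export_text`, which was added to scikit-learn after this tutorial was given, on a shallow tree fit to synthetic data with hypothetical feature names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data; the feature names are hypothetical
X_demo, y_demo = make_classification(
    n_samples=116, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(4)]

tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# One line per split or leaf, indented by depth
report = export_text(tree_demo, feature_names=names)
print(report)
```

The text form loses the color coding of the diagram, but it shows the same thresholds and leaf classes and can be pasted into an email or a log.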

As seems obvious in retrospect, the fans of And Now for Something Completely Different really did not like my tutorial very much. I probably should have provided a disclaimer at the beginning of the session. Years of Python experience is a slightly more important feature, but it follows an oddly stratified pattern wherein several different ranges of years show positive or negative effects—it’s not linear.

And of course, Time Bandits was not a Monty Python film at all: it is a Terry Gilliam film that happened to cast a number of Monty Python cast members. What on earth were those respondents thinking?!

In [12]:

<pre class="language-python"><code>first_film = human_dummies[
    'Favorite Monty Python movie_And Now for Something Completely Different']
human_dummies[first_film==1].loc[:, "How successful has this tutorial been so far?"]</code></pre>

Out[12]:

11     7
14     4
46     7
50     7
67     7
71     4
72     5
74     5
77     7
86     6
97     5
108    6
110    7
114    1
Name: How successful has this tutorial been so far?, dtype: int64
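To compare one group of fans against the others in a single step, a `groupby` on the un-encoded column is a natural fit. The sketch below uses real film titles but entirely hypothetical scores, and the short column names `movie` and `success` are placeholders for the long survey headers.

```python
import pandas as pd

# Hypothetical scores; only the film titles come from the survey options
demo = pd.DataFrame({
    "movie": [
        "Monty Python and the Holy Grail",
        "And Now for Something Completely Different",
        "Monty Python and the Holy Grail",
        "Monty Python's Life of Brian",
        "And Now for Something Completely Different",
    ],
    "success": [9, 5, 8, 7, 4],
})

# Respondent count and mean score per favorite film, lowest mean first
summary = (demo.groupby("movie")["success"]
               .agg(["count", "mean"])
               .sort_values("mean"))
print(summary)
```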

Credit for this blog title goes to Dr.
