Ed. note: The author presented a terrific webinar on this topic—Getting Started with Machine Learning with scikit-learn—now available on-demand!
By David Mertz, PhD
I had the great honor and pleasure of presenting the first tutorial at AnacondaCon 2018, on machine learning with scikit-learn. I spoke to a full room of about 120 enthusiastic data scientists and aspiring data scientists. I would like to thank my colleagues at Anaconda, Inc. who did such a wonderful job of organizing this conference, and even more so all the attendees of my session and of the numerous other tutorials and talks. AnacondaCon is a warm and intellectually stimulating place. Throughout the presentations, question periods, and in the “hallway track,” attendees could make personal and professional connections and learn new and useful things about data science.
The attendees of my session were a very nice group of learners and experts. But I decided I wanted to know even more about these people than I could find by looking at their faces and responding to their questions. So I asked them to complete a slightly whimsical form at about 3 hours into my tutorial. Just who are these people, and what can scikit-learn tell us about which of them benefitted most from the tutorial?
In the interest of open data science, the collection of answers given by attendees is available under a CC-BY-SA 4.0 license. Please credit it as “© Anaconda Inc. 2018” if you use the anonymized data, available as a CSV file. If you wish to code along with the rest of this post, save it locally as data/Learning about Humans learning ML.csv (or adjust your code as needed).
It turns out that data never arrives at the workstation of a data scientist quite clean, no matter how much validation is attempted in the collection process. The respondent data is no exception. Using the familiar facilities in Pandas, we can improve the initial data before applying scikit-learn to it. In particular, I failed to validate the field “Years of post-secondary education (e.g. BA=4; Ph.D.=10)” as a required integer. Also, the “Timestamp” values added by the form interface are gratuitous for these purposes: they are all within a couple of minutes of each other, and their order or spacing is unlikely to have any value to our models.
import pandas as pd

fname = "data/Learning about Humans learning ML.csv"
humans = pd.read_csv(fname)
humans.drop('Timestamp', axis=1, inplace=True)

# Strip everything up to and including the last "=" before converting to int.
# regex=True is required on recent pandas, where the pattern is literal by default.
humans['Education'] = (
    humans['Years of post-secondary education (e.g. BA=4; Ph.D.=10)']
    .str.replace(r'.*=', '', regex=True)
    .astype(int))
humans.drop('Years of post-secondary education (e.g. BA=4; Ph.D.=10)',
            axis=1, inplace=True)
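To see what the regular expression in that cleanup does, here is a minimal sketch on a handful of made-up answers. The values in `raw` are hypothetical; I am assuming that some respondents echoed the "BA=4" style of the prompt while others typed a bare integer. Passing `regex=True` explicitly matters because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical raw answers (not taken from the survey data)
raw = pd.Series(["4", "Ph.D.=10", "BA=4", "-10"])

# Greedily remove everything up to and including the last "=",
# then convert the remaining text to integers.
education = raw.str.replace(r".*=", "", regex=True).astype(int)
print(education.tolist())
```

Answers without an "=" pass through unchanged, so a stray negative number survives the cleanup as well.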
After slight data massaging we can get a picture of the attendee group (most likely a few failed to complete the form, but this should be most of them).
humans.describe(include=['object', 'int', 'float'])
| | Favorite programming language | Favorite Monty Python movie | Years of Python experience | Have used Scikit-learn | Age | In the Terminator franchise, did you root for the humans or the machines? | Which is the better game? | How successful has this tutorial been so far? | Education |
|---|---|---|---|---|---|---|---|---|---|
| top | Python | Monty Python and the Holy Grail | NaN | Yep! | NaN | Team Humans! | Chess | NaN | NaN |
You can look into other features of the data yourself, but in the summary view a few data quality issues jump out. This is, again, almost universal to real-world datasets. It seems dubious that two 3-year-olds were in attendance. Perhaps a couple of 30-somethings mistyped their ages. A 99-year-old is possible, but seems more likely to be a placeholder value used by some respondent. While the description of what is meant by the integer “Education” was probably underspecified, it still feels like the -10 years of education is more likely to be a data entry problem than an intended indicator.
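If you want to inspect such rows rather than spot them in a summary, a boolean mask is enough. The sketch below uses a small hypothetical frame that mirrors the oddities just described; the cutoffs (ages outside 16 to 90, negative education) are my illustrative assumptions, not rules from the survey.

```python
import pandas as pd

# Hypothetical rows mirroring the oddities described above
sample = pd.DataFrame({
    "Age": [3, 34, 99, 41, 28],
    "Education": [4, 10, 6, -10, 8],
})

# Illustrative thresholds only; choose your own with domain judgment
suspect_age = ~sample["Age"].between(16, 90)
suspect_education = sample["Education"] < 0

# Flag the rows for inspection without dropping or altering them
suspicious = sample[suspect_age | suspect_education]
print(suspicious)
```

Flagging rather than deleting keeps the decision about what to do with these respondents explicit.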
But the data we have is the data we must analyze.
Before we go further, it is usually a good idea to use one-hot encoding of categorical data for machine learning purposes. Most likely this makes less difference for the decision tree and random forest classifiers used in this blog post than it might for other classifiers and regressors, but it rarely hurts. For this post, the encoding is performed with pandas.get_dummies(), but you could equally use sklearn.preprocessing.LabelBinarizer to accomplish the same goal.
human_dummies = pd.get_dummies(humans)
list(human_dummies.columns)
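To make the effect of the encoding concrete, here is a small sketch on a hypothetical two-column frame. The answer values are invented for illustration; only the column name is borrowed from the survey. Numeric columns pass through untouched, while each distinct string value becomes its own indicator column named column_value.

```python
import pandas as pd

# Hypothetical answers; only the question text comes from the survey
toy = pd.DataFrame({
    "Which is the better game?": ["Chess", "Go", "Chess"],
    "Age": [34, 41, 28],
})

# Numeric columns are kept as-is; string columns expand to indicators
toy_dummies = pd.get_dummies(toy)
print(list(toy_dummies.columns))
```

This is the same naming pattern that produces the long column names used for lookups later in this post.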
It is time to use scikit-learn to model the respondents. In particular, we would like to know whether other features of attendees are a good predictor of how successful they found the tutorial. A very common pattern you will see in machine learning based on starting DataFrames is to drop one column for the X features, and keep that one for the y target.
In my analysis, I felt a binary measure of success was more relevant than the scalar measure initially collected on a 1-10 scale. Moreover, if the target is simplified this way, it becomes appropriate to use a classification algorithm as opposed to a regression algorithm. It would be a mistake to treat the 1-10 scale as a categorical variable consisting of 10 independent labels. There is something inherently ordinal about these labels, although scikit-learn will happily calculate models as if there is not. This is a place where subject matter judgment is needed by a data scientist.
You will have noticed from the mean and median of the success scores in the summary data that a threshold of >=8 divides the data approximately evenly into “Yes” and “No” categories.
X = human_dummies.drop("How successful has this tutorial been so far?", axis=1)
y = human_dummies["How successful has this tutorial been so far?"] >= 8
y.head()
0     True
1     True
2     True
3    False
4     True
Name: How successful has this tutorial been so far?, dtype: bool
While using sklearn.model_selection.StratifiedKFold is a more rigorous way of evaluating a model, for quick-and-dirty experimentation, using train_test_split is usually the easiest approach.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
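For comparison, here is a rough sketch of what the more rigorous StratifiedKFold route could look like. It runs on synthetic data from make_classification as a stand-in for X and y, so the numbers it prints say nothing about the survey; the names X_demo and y_demo, the five folds, and the tree depth are my assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y, roughly the size of the audience
X_demo, y_demo = make_classification(n_samples=120, n_features=10,
                                     random_state=0)

# Each fold preserves the class balance of the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = DecisionTreeClassifier(max_depth=7, random_state=0)

# One accuracy score per held-out fold
scores = cross_val_score(model, X_demo, y_demo, cv=cv)
print(scores.mean(), scores.std())
```

With a dataset this small, the spread across folds is often as informative as the mean: a single train/test split reports only one draw from that spread.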
An interesting thing happened in trying a few models out. While RandomForestClassifier is incredibly powerful, and very often produces the most accurate predictions, for this particular data a single DecisionTreeClassifier does better. Readers might want to think about why this turns out to be true and/or experiment with hyperparameters to find a more definite explanation. Other classifiers might perform better still, of course.
I will note that choosing the best max_depth for decision tree family algorithms is largely a matter of trial and error. You can search the space in a nice high level API using sklearn.model_selection.GridSearchCV, but it often suffices to use a basic Python loop like:
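from sklearn.tree import DecisionTreeClassifier

# Try each depth in turn and report accuracy on the held-out split
for n in range(1, 20):
    tree = DecisionTreeClassifier(max_depth=n)
    tree.fit(X_train, y_train)
    print(n, tree.score(X_test, y_test))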
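The GridSearchCV alternative mentioned above might look something like the following sketch. Again this uses synthetic stand-in data (X_demo and y_demo are hypothetical names), and the choice of 5-fold cross-validation is my assumption rather than anything from the tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=120, n_features=10,
                                     random_state=0)

# Same range of depths as the plain loop, scored by cross-validation
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 20))},
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

One practical difference from the loop: the grid search picks max_depth using cross-validation inside the data it is given, so a separate test set can stay untouched until the final evaluation.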
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=5, random_state=0)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=7, random_state=0)
tree.fit(X_train, y_train)
tree.score(X_test, y_test)
Best practice in machine learning is to keep training and testing sets separate. In the end, with sufficiently large datasets, it makes little difference in the trained model parameters whether and how train/test observations are separated. But this is a small dataset, and also reflects a somewhat unique event (many students will learn about machine learning through many channels, but this particular tutorial, with a particular instructor, at a particular conference, will not necessarily generalize to all those channels).
Therefore, in order to see simply what is the “best possible” decision tree for this dataset, I deliberately overfit by including all the observations in the model.
tree = DecisionTreeClassifier(max_depth=7, random_state=0)
tree.fit(X, y)
tree.score(X, y)
We can easily look at what features are most important in this trained model, and also use a lovely utility method in sklearn.tree to display the entire tree and its decision cuts.
%matplotlib inline
pd.Series(tree.feature_importances_, index=X.columns).plot.barh(figsize=(18,7));
from sklearn.tree import export_graphviz
import sys, subprocess
from IPython.display import Image

# Write the tree as a Graphviz .dot file, then render it to PNG with `dot`
export_graphviz(tree, feature_names=X.columns,
                class_names=['failure', 'success'],
                out_file='tmp/ml-good.dot',
                impurity=False, filled=True)
subprocess.check_call([sys.prefix+'/bin/dot', '-Tpng', 'tmp/ml-good.dot',
                       '-o', 'tmp/ml-good.png'])
Image('tmp/ml-good.png')
In the diagram, blue branches reflect those respondents who found the tutorial more successful, and orange branches those who found it less so. The saturation of the displayed boxes reflects the strength of that decision branch.
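If the Graphviz dot binary is not installed, sklearn.tree.export_text (available in newer scikit-learn releases) gives a plain-text view of the same kind of decision cuts. This is a sketch on synthetic data with hypothetical feature names, not the tree from the survey.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(n_samples=120, n_features=4,
                                     random_state=0)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# A shallow tree keeps the printed report short
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Indented text report: one line per split threshold or leaf class
report = export_text(demo_tree, feature_names=feature_names)
print(report)
```

The text form loses the color cues of the diagram, but it needs no external tools and diffs nicely under version control.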
As seems obvious in retrospect, the fans of And Now for Something Completely Different really did not like my tutorial very much. I probably should have provided a disclaimer at the beginning of the session. Years of Python experience is a slightly more important feature, but it follows an oddly stratified pattern wherein several different ranges of years show positive or negative effects; the relationship is not linear.
And of course, Time Bandits was not a Monty Python film at all: it is a Terry Gilliam film that happened to cast a number of Monty Python cast members. What on earth were those respondents thinking?!
first_film = human_dummies[
    'Favorite Monty Python movie_And Now for Something Completely Different']
human_dummies[first_film==1].loc[:, "How successful has this tutorial been so far?"]
11     7
14     4
46     7
50     7
67     7
71     4
72     5
74     5
77     7
86     6
97     5
108    6
110    7
114    1
Name: How successful has this tutorial been so far?, dtype: int64
Credit for this blog title goes to Dr. Timmy Churches, University of Woolloomooloo.