On the left it can be difficult to see the skewed distribution. Baseball, unlike hockey, begins in the spring, and the age cut-off for younger players often falls late in the summer. To make the skew visible, I shifted the graph on the right so that it starts with August and ends with July. Now we can see that the trend echoes that of the NHL.
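
The month shift described here can be sketched in a few lines of Python. This is only an illustration: it assumes the birth dates are available as `datetime.date` objects, the function name `shifted_month_counts` and the sample dates are made up, and the original plots may well have been produced differently.

```python
from collections import Counter
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def shifted_month_counts(birth_dates, first_month=8):
    """Count births per month, ordered so the list starts at first_month.

    first_month=8 starts the list in August and ends it in July.
    (Hypothetical helper, not taken from the original analysis.)
    """
    counts = Counter(d.month for d in birth_dates)
    # Rotate the month numbers 1..12 so that first_month comes first.
    order = [(first_month - 1 + i) % 12 + 1 for i in range(12)]
    return [(MONTHS[m - 1], counts.get(m, 0)) for m in order]

# Made-up birth dates, for illustration only.
sample = [
    date(1985, 8, 3),
    date(1986, 9, 14),
    date(1984, 8, 21),
    date(1985, 7, 30),
    date(1987, 1, 5),
]

shifted = shifted_month_counts(sample)
for month, n in shifted:
    print(f"{month}: {n}")
```

Passing the rotated labels and counts to a bar chart gives a plot that begins just after the cut-off month, which makes any downward trend across the playing year easier to spot.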

So far Gladwell is batting 2 for 2. Now let’s look at the date of birth distribution for sports which shouldn’t have a well-defined structure.

NFL and NBA

From the plots above we don’t see any well-defined trend. Some small features of the graph loosely match the distribution of births across the US (a valley in the early part of the year and peaks in the summer), but overall there is no structure. Why might this be? To hazard a guess, I would say that football, unlike hockey, really selects for size, and eventual size (weight and height) is difficult to predict at an early age. Additionally, as players move from high school to college to the NFL, they often change position as their bodies continue to change. Therefore, no particular month or set of months is favored among players across the NFL.

Similarly, we don’t see any structure in the distribution for the NBA.

Again, while a great many people play basketball from an early age, and the US has many mechanisms (such as the McDonald’s All-American Game) for selecting skilled high-school kids, it is very difficult to end up in the NBA unless you are at least 6 ft. 7 in. So here, as well, eventual size (which is not easily predicted at a young age) matters quite a bit.

UEFA 2012

Lastly, we return to soccer. Gladwell says that the trend he identified in hockey also exists in football-hungry Europe, so let’s look there. There are many leagues and many teams to choose from, but I found it easiest to grab the 16 teams that made up the 2012 European Football Championship (UEFA 2012).

While there definitely is the fingerprint of a downward trend, the two spikes in May and August are a little troublesome. This may be due to the limited amount of player data or odd birth rates per month in Europe. Most likely, the spikes are due to non-standard cut-off dates for play across different European countries.

Conclusion

It’s easy to pass on misinformation. We often hear facts (“chewing gum, if swallowed, will stay in your digestive tract for 7 years”), quotes (“From each according to his ability, to each according to his need,” attributed incorrectly to the US Constitution), and statistics (“you are more likely to die at 11:00 am”) and accept these statements without question, but only one of these, it turns out, is true. Yet we continue to reinforce this misinformation by adding our own voice to the conversation when we repeat false facts and figures over and over again. Thanks to a new focus on open data, however, it’s getting easier to verify this anecdotal information. With open data repositories, collaboratively curated information such as Wikipedia, and common software tools, we can quickly verify the facts and figures we hear thrown around in everyday life.

This kind of analysis should be commonplace. With Python and the tools in Anaconda, it’s easy to analyze many varieties of data. Wakari makes it very easy to share these analyses, with a minimum of installation fuss. With the growing list of public data sets built into Wakari, we invite everyone to find new patterns in the data or verify old statements we continue to repeat.


About the Author

Ben Zaitlen

Data Scientist

Ben Zaitlen has been with the Anaconda Global Inc. team for over 5 years.
