In the winter of 2011 Sascha Mehlhase completed construction on a 1:50 scale model of the ATLAS detector.

Over 48 hours, Sascha painstakingly designed a3D representation of the ATLAS detector and then spent 33 hours building a model of it with Lego bricks. The model has 9,500 pieces and costs around $2,600.

Obviously, I want one — I need one. An avid Lego builder in my youth, I’ve collected 20,000 Legos; and like many former and current Lego builders, all the pieces are now in one giant bucket. I want to know, “Do I have all the pieces required to build the Lego ATLAS?”

To answer the question, I could sort each type of Lego — keeping a count as I sorted; but this would be fraught with errors and would take more time than I’m willing to devote to such a task. If I had a few friends to help sort and count the pieces, my query would be answered much more quickly.

And that’s what MapReduce is for: A defined parallel framework to “distribute Legos” which can be sorted and counted more quickly than single node processing!

What is MapReduce?

Three brief explanations

Here’s another way to understand MapReduce in three increasingly more detailed explanations:

  1. Non-Technical Definition I have a question which a data set can answer. I have lots of data and I have many computers connected to one another. MapReduce takes advantage of the connected computers and decreases the computing time used to answer my question. Specifically, MapReduce maps data in the form of key-value pairs to each machine and reduces the data to an answer.

  2. Moderately Technical Definition I have a question which a data set can answer. I have lots of data and I have a cluster of commodity hardware.  MapReduce is a parallel framework which takes advantage of my cluster by distributing work across each machine. Specifically, MapReduce maps data in the form of key-value pairs, then feeds the pairs to each machine which reduces the data to an “answer” or a list of “answers”.

  3. Highly Technical Definition I have a question which a data set can answer. I have lots of data and I have a cluster of nodes.  MapReduce is a parallel framework which takes advantage of my cluster by distributing work across each node. Specifically, MapReduce maps data in the form of key-value pairs which are then partitioned into buckets. The buckets can be spread easily over all the nodes in the cluster and each node or Reducer, reduces the data to an “answer” or a list of “answers”.

What data can be processed with MapReduce? Any data that can be acted upon independently from other data.

MapReduce Languages

MapReduce has successfully captured the parallel data crunching crowd.  So much so that there exist several implementations in a variety of languages. Examples include:

  • Disco (Python)
  • Hadoop (Java)
  • Amazon Elastic MapReduce (SQL-Like)
  • Skynet (Ruby)
  • Plasma MapReduce (OCaml)

Continuum’s tools utilize the simplicity of Python and leverage lower-level languages for faster calculation — and Disco’s Python MapReduce Framework is a great baseline to build upon.  More on this later.

As Continuum’s data scientist and user advocate, I will be posting novel use cases of MapReduce, Disco tutorials, and general data science thoughts. At Continuum, we believe that yes, Big Data has answers, but just as important, it invites domain experts and amateurs alike to ask more informed questions.  We strive to create the tools used for deep, fast data introspection and exploration. From this, patterns and behaviors emerge and important conclusions can be drawn — like, are there enough Legos in my bucket to build an ATLAS detector…


About the Author

Ben Zaitlen

Data Scientist

Ben Zaitlen has been with the Anaconda Global Inc. team for over 5 years.

Read more

Join the Disucssion