Dip into Data Science with Stipple
In addition to building great open-source and enterprise software, Continuum also consults with data science companies and organizations. One company we recently contracted with is Stipple. Many of those who attended PyData last month met Stipple’s Chief Science Officer, Davin Potts. In Davin’s talks, he described in detail the enormously difficult challenge Stipple is trying to solve. Simply put, Stipple wants to index every image on the web, extract features from those images and identify them. A feature could be anything from a pair of pants or a shirt sold by a large clothing company, the latest electronic device, or a picture of a recent cooking creation tagged by a user. Using Stipple, you could even potentially disambiguate a can of Coke from a puppy.
What’s even more impressive is that they not only have the technology to extract features from billions of images, they can also identify manipulations performed on an already processed image: cropping, rotation, etc. While Stipple is generally more interested in identifying things than famous people, they are interested in associations between what’s in an image and what’s in the surrounding text on a page (or tweet). Face detection and recognition is one aspect of parsing that image, and being able to associate text or sentiment with an object or face offers promising insight. Continuum partnered with Stipple to determine the feasibility of facial recognition and data gathering at scale, and we explored how our products worked together with the following use case.
We decided it would be interesting to identify the two most prominent figures in current politics: Barack Obama and Mitt Romney. Since today is Election Day (Go vote!), images of Obama and Romney are particularly interesting and heavily featured on social networks. (Don’t worry, cats and Justin Bieber will return to the throne of most trending.)
Tools used can be found within Continuum’s AnacondaCE.
We’re going to leverage Twitter to find tweets which link to images, news articles, blog posts, etc. and try to identify Obama or Romney in any of the resulting linked images. We want this to work at scale, so we will need a distributed solution as well. Let’s first work on identifying our subjects.
For a Python equivalent, download the samples from http://code.opencv.org/
The code above not only detects faces, but also crops the face and resizes the resulting image to a width and height of 200×200. We need images of the same size to properly perform comparisons downstream in our image analysis.
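As a rough illustration of that detect, crop, and resize step, here is a minimal sketch written against the current `cv2` Python bindings rather than the OpenCV release this post used. The function name `extract_faces`, the choice of the stock frontal-face Haar cascade, and the `detectMultiScale` parameters are assumptions for illustration, not the original listing.

```python
FACE_SIZE = (200, 200)  # width, height expected by the recognizers downstream


def extract_faces(image_path, face_size=FACE_SIZE):
    """Return a list of grayscale face crops, each resized to face_size."""
    # Imported here so the module still loads where OpenCV is not installed.
    import cv2

    # Assumption: the stock frontal-face Haar cascade shipped with opencv-python.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )

    image = cv2.imread(image_path)
    if image is None:  # unreadable or missing file
        return []

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Each detection is a bounding box (x, y, width, height).
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    faces = []
    for (x, y, w, h) in boxes:
        crop = gray[y:y + h, x:x + w]
        faces.append(cv2.resize(crop, face_size))
    return faces
```

Working in grayscale keeps the crops in the single-channel form the face recognizers expect, and a photo with several people simply yields several 200×200 crops.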
Now that we can create a collection of faces (all images 200×200), we need to train a model to recognize a face. The latest OpenCV provides three supervised learning methods explicitly for face recognition: EigenFaces, FisherFaces and Local Binary Patterns Histograms (LBPH). OpenCV also has a great write-up on the problem of face recognition, as well as code samples for using their API. There is a Python interface, but the code is not well documented.
The various methods all have different quirks and so do not yield the same predictions for a given face.
I found that while EigenFaces is built on a simple algorithm I’m familiar with (PCA), it works only under ideal conditions: images where lighting and alignment are consistent from one image to the next. To me, this seems great for photo ID cards, mug shots, and other tightly controlled and uniform image captures. Another great feature of OpenCV’s face recognition API is that they built in methods for outputting creepy eigenfaces.
You can see the remnants of both Obama and Romney. Additionally, we can infer that if there is a small occlusion or rotation, then facial recognition would easily fail. In the end, I decided not to use EigenFaces and instead opted for a combination of FisherFaces and LBPH.
I’m not going to go into detail for these methods — perhaps in a later post. Nevertheless, FisherFaces and LBPH are a little more forgiving with non-uniform images. It is, however, good to remember that the model is only as good as the training set.
After collecting and labeling a set of Obama images and a set of Romney images, we simply pipe the images to OpenCV and let it build the model for us. Please see the following for a code example.
The results from training a model will be two files: LBPHfaces_at.yml and fisherfaces_at.yml. These files numerically represent what a labeled face is. Pretty cool! We can use these models to predict whether a new face is one of the labeled faces we trained against.
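For readers who want to reproduce the training step, here is a minimal sketch. It assumes the `cv2.face` module from the opencv-contrib-python package (OpenCV 2.4, current when this was written, exposed the same recognizers as `cv2.createFisherFaceRecognizer()` and `cv2.createLBPHFaceRecognizer()`). The folder layout (`<root>/obama`, `<root>/romney`) and the helper names `collect_training_set` and `train_models` are hypothetical; only the two output filenames and the 0/1 labels come from this post.

```python
import os

# Labels used in this post: 0 for Obama, 1 for Romney.
LABELS = {"obama": 0, "romney": 1}


def collect_training_set(root, labels=LABELS):
    """Return (image_path, label) pairs from root/<name>/ folders."""
    samples = []
    for name, label in labels.items():
        folder = os.path.join(root, name)
        if not os.path.isdir(folder):
            continue
        for filename in sorted(os.listdir(folder)):
            samples.append((os.path.join(folder, filename), label))
    return samples


def train_models(samples):
    """Train FisherFaces and LBPH models and save them as YAML files.

    Assumes every sample is a readable 200x200 face crop.
    """
    # Imported lazily: needs numpy and opencv-contrib-python (cv2.face).
    import cv2
    import numpy as np

    images = [cv2.imread(path, cv2.IMREAD_GRAYSCALE) for path, _ in samples]
    labels = np.array([label for _, label in samples], dtype=np.int32)

    fisher = cv2.face.FisherFaceRecognizer_create()
    fisher.train(images, labels)
    fisher.write("fisherfaces_at.yml")

    lbph = cv2.face.LBPHFaceRecognizer_create()
    lbph.train(images, labels)
    lbph.write("LBPHfaces_at.yml")
    return fisher, lbph
```

A call such as `train_models(collect_training_set("training"))` would then write both model files to the working directory, where `"training"` is a placeholder path.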
In our training, we have only two labels: 0 for Obama and 1 for Romney. Each prediction returns two values. The first is a label, where -1 indicates the face was not recognized as either of the trained faces. The second is a distance calculation and can be thought of as a measure of similarity: the closer to 0, the more similar the new face is to the model.
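Since two models each produce a label and a distance, their outputs have to be merged into one verdict. The sketch below assumes a simple agreement rule (both methods must return the same label), which this post does not spell out. The names `combine`, `NAMES`, and the verdict strings are hypothetical.

```python
UNKNOWN = -1
NAMES = {0: "Obama", 1: "Romney"}


def combine(fisher, lbph):
    """Merge two (label, distance) predictions into a single verdict."""
    fisher_label, _ = fisher
    lbph_label, _ = lbph

    # Both methods failed to recognize the face.
    if fisher_label == UNKNOWN and lbph_label == UNKNOWN:
        return "neither"
    # Both methods agree on a trained label.
    if fisher_label == lbph_label:
        return NAMES.get(fisher_label, "undetermined")
    # One method failed or the two disagree.
    return "undetermined"


# Label and distance pairs in the form (label, distance).
print(combine((1, 73.7802083211), (1, 51.7175533424)))  # Romney
print(combine((0, 72.5934366586), (0, 5.01185092339)))  # Obama
```

The distances are ignored here. A stricter rule could also require each distance to fall under a chosen cutoff before accepting a label.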
Because this post is getting a little long, I’m going to break it up and write next week on the details of the Disco job used to sort through all the tweets. The general outline of what I did was to give each node a search term and start collecting tweets (Map Step 1). Each tweet is then parsed, and those which contain URLs are collected (Reduce Step 1). The URLs are then inspected, and a list of image URLs is collected (Map Step 2). The images are then downloaded, cleaned and passed on to the face recognition software (Reduce Step 2). If any face is detected, the results of the facial recognition are delivered to a data store on S3 along with the image URL and originating URL.
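Until that follow-up, the two middle steps of the outline can be sketched in plain Python. This is not the Disco job itself and does not use the Disco API. The function names, the URL regular expression, and the choice to read image URLs from `<img>` tags are all assumptions for illustration.

```python
import re
from html.parser import HTMLParser

URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(tweets):
    """Reduce Step 1: keep only the URLs found in the tweet texts."""
    urls = []
    for text in tweets:
        urls.extend(URL_PATTERN.findall(text))
    return urls


class _ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_urls(page_html):
    """Map Step 2: list the image URLs referenced by one page."""
    collector = _ImageCollector()
    collector.feed(page_html)
    return collector.sources
```

Downloading each page and image, and handing the cleaned images to the recognizers, would sit on either side of these helpers.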
How did we do?
We don’t have access to Twitter’s firehose and are thus rate-limited to 350 requests per hour. As of now, we’ve collected 451 images containing at least one humanoid face over the course of 24 hours. This is a good example of exploratory data analysis. We don’t want to dedicate significant resources to an idea we’re still testing out, but we want a framework which can easily grow. The data below shows that while there are problems, there are also enough encouraging hints to warrant more effort and resources.
91 of the images were predicted to be Romney. Below is an example of a properly detected image of Mitt Romney:
http://act.watchdog.net/petitions/1818?l=JFL9oiYaGsE http://static5.businessinsider.com/image/5057d1dc6bb3f78d3d000006-402-301/mitt-romney-obama-ad.jpg 1 73.7802083211 FISHER 1 51.7175533424 LBPH
17 images were predicted to be Obama. Below is an example of a properly detected image of Barack Obama:
http://soletschat.files.wordpress.com/2012/11/tumblr_md2mm1skk01rw9g6ko1_500.jpeg?w=812 http://soletschat.wordpress.com/2012/11/06/vote/ 0 72.5934366586 FISHER 0 5.01185092339 LBPH
58 images were detected as neither Obama nor Romney. Below is an example of an image which was correctly identified as neither Obama nor Romney:
http://media.salon.com/2012/04/AndrewOhehir_Bio.jpg http://www.salon.com/2012/11/03/lessons_for_obama_from_abe_lincoln/ -1 1.79769313486e+308 FISHER -1 1.79769313486e+308 LBPH
Lastly, 285 images could neither be predicted nor invalidated. This is the case when one method fails outright and the other makes a valid prediction. That 285 of 451 images fall into this group suggests that a person mentioning Obama or Romney is not necessarily also supplying a picture of that person. Below is an example of an image that could neither be predicted nor invalidated:
http://5.mshcdn.com/wp-content/uploads/2012/11/Grover-Campaign-Poster.png http://mashable.com/2012/11/06/obama-vs-romney-fight-video/?utm_source=dlvr.it&utm_medium=twitter&utm_campaign=alexandvevo -1 1.79769313486e+308 FISHER -1 1.79769313486e+308 LBPH
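The four categories above account for every collected image. A quick check of the arithmetic, with each count expressed as a share of the total (the dictionary keys are just labels chosen for this snippet):

```python
counts = {"Romney": 91, "Obama": 17, "neither": 58, "undetermined": 285}

total = sum(counts.values())  # 451 images with at least one detected face
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]} images ({share}%)")
```

Roughly 63% of the images land in the undetermined group, about 20% are labeled Romney, about 13% neither, and under 4% Obama.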
There’s definitely room for improvement. The models would do well with more training images, and continuous updating and validation would be a great addition. I also suspect that while we’re seeing OK results (not many false positives), we’d have better statistics if we consumed more data.