The purpose of my project is to try to bridge that gap. If I can successfully predict what areas in Austin are likely to have high demand, I can tell drivers about those spots and increase the supply to meet that demand. Predicting where these hotspots are going to be means taking in the historical information about when and where ride requests happen and using them to predict where the ride requests are likely to happen next.

In order to do this, the guys over at RideAustin (the same ride-sharing app I used to get my ride that day) were gracious enough to give me their database in a SQL dump. RideAustin is a non-profit ride-sharing service that not only has great drivers and prices, but also allows you round-up your fare to donate the proceeds to a local charity. They boast more than 2 million rides, the details of which I will be using to train my model. Once I had the data, I immediately got my hands wet. 

This post is part of our Galvanize Capstone featured projects. This post was written by Will Johnson and posted here with his permission. 

The Story

On April 13th of 2017, I went to a concert. Specifically, here, in downtown Austin, Texas.

 

After the show was done at 10:00PM, I needed a ride home. I opened up my favorite ride-sharing app and asked for a driver to come pick me up. But, instead of getting a ride, I got a message that read “NO DRIVER AVAILABLE.” I thought that was odd, since this was the middle of downtown and there should be plenty of traffic. So, I gave it a minute and, sure enough, after trying again, I got myself a ride.

I told the driver, “Boy, you guys must be busy if I can’t get a ride downtown.”

He looked back at me and said “Actually, I’ve been driving around for ages looking for a fare.”

What that told me was that there was a disconnect between rider and driver. Here I am with money, there he is with a car, but the only thing keeping us from finding each other is location.

The Goal

The purpose of my project is to try to bridge that gap. If I can successfully predict what areas in Austin are likely to have high demand, I can tell drivers about those spots and increase the supply to meet that demand. Predicting where these hotspots are going to be means taking in the historical information about when and where ride requests happen and using them to predict where the ride requests are likely to happen next.

In order to do this, the guys over at RideAustin (the same ride-sharing app I used to get my ride that day) were gracious enough to give me their database in a SQL dump. RideAustin is a non-profit ride-sharing service that not only has great drivers and prices, but also allows you round-up your fare to donate the proceeds to a local charity. They boast more than 2 million rides, the details of which I will be using to train my model. Once I had the data, I immediately got my hands wet. 

Building the Model

The first thing I did was some research on what other data scientists have done before on this subject. One notable idea I read about was in Japan where a cellular service provider was able to predict ride requests for a taxi company within 500 square feet. Their method was to break down into a grid and, from what I gathered, return the grid points that were the most likely to see the next ride request. What that had that I didn’t is real-time geolocation of users (because they’re a cell service company), so my model works a bit differently.

While the grid-mapping method may make it easier for a model like an ensemble classifier to predict the area with the highest demand, what it also does is decorrelate areas from each other. If I have downtown blocked off into a grid of 4 areas, then a classifier would assume each class is independent of each other, when I know that all four of those areas are highly correlated with each other. Moreover, I wanted to capture a sense of the variability of these areas of demand continuously to see how they may change in certain circumstances. So, I instead decided to plot hotspots using a K-Means classifier.

K-Means Classifier

K-Means is an unsupervised machine learning algorithm that separates a dataset into a certain number (k) of clusters. If I can find the centerpoint of those clusters, known as the centroid, then those points could be good hotspots to suggest to my drivers. The trick is getting the right number of clusters down. I found that 5 clusters makes the most sense for Austin’s ride request data, given the neighborhoods and historical ride demand.

These hotspots are returned as pairs of latitude and longitude coordinates. The trouble with running these kinds of pairs into a regressor is that if I were to predict a certain latitude, how do I know which longitude makes sense to pair it with? Moreover, regressors are designed to predict values with the highest accuracy to out-of sample data, but do I really care about what order my predictions are returned in? No. All I care about is that my predicted hotspots make sense with where the demand is actually going to be. I want to send my drivers to the right spots, not necessarily to the right spots in a particular order.

My Model

In the end, I created an algorithm that would use historical data in order to predict likely ride requests, then run a k-means clustering algorithm on those rides to find where the hotspots are going to be. Breaking it down into six steps, the model works like this:

Step 1: Find a Ride Request

Let’s look at just one ride. I’ve got information about it, like its lat-long location, the time of day, the day of the week, the date and so on. If I go back into my dataset, looking at all the ride requests that came before this one, I can then find requests that are similar to that ride request.

Step 2: Find Similar Ride Requests

The green dots are all similar rides that happened around the same spot, same time of day, and the same day of the week. Each of these rides had at least one ride request to come after it. My thought was that those following rides would follow a historical pattern of where rides were likely to pop up next in similar conditions. So let’s find the requests that came after all these green ones.

Step 3: Find the Following Ride Requests

These green dots are all rides that came after the similar ride requests to my original blue one, down in the bottom left corner. As you can see, most of them occur downtown, the bar scene south of the river, the mall, or the airport. This being 10:00PM on a Thursday night, that makes sense.

So let’s now randomly sample from this distribution of requests and choose that to be our next expected ride request. This is sort of hijacking the Central Limit Theorem in statistics that tells us the more observations we take, the more normal our distribution should be and the more towards the true-mean our observed mean should be. In this case, we have most of our rides downtown, but a good amount elsewhere. An independent random sample of this distribution should reflect the same variation of the ride requests that would follow in reality.

Step 4: Randomly Sample

See that new blue dot? That is our first expected request. It’s highly unlikely that the next actual ride request is going to be in that exact spot, but that’s now what we’re after. We want to know what the general distribution of rides is likely to look like, so we can get a good idea of what areas might see more demand in the coming 30 minutes.

So we repeat the process, finding similar ride requests and their follow ups and plotting the distribution of those, we can see they again favor downtown but also spread out a little.

Step 5: Repeat

If we were to repeat this process enough times, we now have a distribution of what we expect our rides to look like for the next 30 minutes. How did we decide how many to pick? I used a regression model called a Random Forest to predict how many rides we should expect to see for the whole of Austin in the next 30 minutes. That number, we’ll call n, is how many times we repeat this process. If we repeated the same amount every time, then our expected hotspot for any given time period would generally be the average hotspot for that time block. While that might be OK to do, I’m more interested in capturing the variation on any given day.

So now that we’ve got an idea of how we expect the demand to look in the next half hour, let’s run our K-Means clustering algorithm on it.

Step 6: Cluster

Now we’ve got our hotspots!

These are the areas we’ll suggest to our riders to pay attention to. Now, in data science, a model is no good unless we can prove that it works. So, let’s test this method by calculating the distance between the predicted hotspots and the closest actual hotspots that we can generate using the data we’ve withheld from the model so far. We should come up with 5 distances in miles. We’ll use the ‘Manhattan Distance,’ since that is more representative of driving distance for our case. Take the average distance of those 5, and we’ll get a good idea of how well our model predicted hotspots.

How Did We Do?

Let’s go back to April 13th to see how we did. That bigger purple dot is a more precise look at where my model suggested drivers go. The blue dots are ride requests that happened between 10 and 10:30 that night. That blue one in the top right corner? That’s me!

Running Random Tests

Now let’s repeat this process of testing the distance in miles between our predicted hotspots and our actual hotspots. Below is the result of randomly testing my model at 120 different dates and seeing how well this process worked for each date.

Note: I split the dataset in half each time, only using the ‘before’ data to train the model so that there was no leakage of information.

Note2: There were often times when there was not enough predicted traffic to justify saying that there were any hotspots going to happen. If a Monday morning at 3AM only expects to see six ride requests, then I can’t really say that any location would be a ‘hot’ spot to be in.

Via hypothesis test, I can conclude that, on average, my model predicts hotspots within 2 miles of the actual hotspot that occurred afterwards. While 2 miles is good, I believe I could get closer if I added in some more details about the rides.

Building a Tableau Dashboard

So how might I present this information to drivers? If you’ve ever looked at a weather radar, you know that in one setting it shows you the past hour’s worth of the storm as it marches across your screen. If you hit ‘future,’ it will then show you where the model expects the storm to go.

I’ve organized a dashboard using Tableau to do the same. Here I have two gifs demonstrating looking at the past hour’s worth of data and then presenting my expectations of where the demand is likely to be next.

Expected: 

That may work well for any city, but what if I know things specifically about areas in Austin that I want to take advantage of? Below you’ll see how I provide the same previous vs. expectation format to specific areas like downtown, the airport, or the northern part of South Lamar street.

Expected: 

Notice how the northern part of South Lamar street has a high amount of demand, nearly $20 on average for fares (which is a good amount!), but only 30% of the requests are actually getting taken. What that tells me, the driver, is that I could get a good amount of money without having to wait very long, because there is demand there that is not getting met.

Double checking this, we look at the expected demand for that area, and we can see that it is expected to go up, and we can see anywhere between 0 to 28 requests to happen in the next half hour (that’s a 95% confidence interval, by the way). So, now I know that this is a good area to be in, so I’m going to sink my gas money into going there instead of waiting at the airport any longer for a bigger fair but fighting with all the other drivers for the next ride.

Limitations

As I noted before, there are certain time periods where the model does not work. Hard to say any spot could be ‘hot’ if there will only be a handful of ride requests happening in the whole of Austin.

Another downside may be that since I let the hotspots roam wherever they want, the more precise locations can sometime be in the river, or perhaps off road. The general area might still be good, but I need to be careful not to tell drivers to submerge their car in the lake.

Lastly, it’s important to consider Herd-Behavior here. If I tell my drivers that one location is great, and they all flock there, I may be draining the supply of drivers in other locations. This might have an adverse affect on my ability to meet overall demand and instead focus too hard on high demand areas. Testing this for real would tell me more about what to set my threshold at for this.

Where To Go From Here?

One bit of information I could use to improve this model is weather data, like percent chance of precipitation or temperature. Anyone who has ever tried to get a cab in New York City while it’s raining knows that demand skyrockets and all the cabs are taken. Likewise, I’d also like to know more about incoming flights at the airport. If I can predict a certain number of flights coming in at the same time, I may suggest the airport is a better spot than usual.

As I mentioned earlier, a company in Japan was able to use a mobile user’s real-time geolocation data to predict when and where they would ask for a ride. Some taxi drivers saw increases of 20% in business after using the model. While I don’t have that kind of data, if I did, you can bet I would use it.

Something else I never got to try was weighing the unfulfilled rides more. If I care more about taking care of unfulfilled requests, my hotspots should favor those.

Final Thoughts

A huge thank you to the folks at RideAustin for allowing me to use their dataset for this project. Also, big thanks to Tableau for giving me a student license to learn how to use their software to build this awesome interactive app you see above.

If I could, I would go into greater detail in the ways I mentioned earlier with this project. Providing the drivers with this kind of information allows THEM to be the ones making decisions on where to go. They want to give people rides! What my model does is help them find where they are and where they might be next.


About the Author

Will Johnson

Guest Blogger

Will Johnson is a recent graduate from the Data Science Immersive course at Galvanize Austin. He graduated from the University of Richmond in 2014 with a B.S. in Business Administration and spent 2 years in the music business in New York Ci …

Read more

Join the Disucssion