One year of taxi data is about 50 GB. All the visualizations and analytics on this website rely on the Hadoop platform for data aggregation and preparation: AWS EMR, Hadoop, Pig, and Python.
K-means clustering of the drop-off areas, based on features derived from Census data, reveals several interesting findings.
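As a rough illustration of the clustering step, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). The feature names and values are hypothetical stand-ins rather than the Census features actually used, and starting from the first k rows is a simplification chosen only to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm: returns (centroids, labels)."""
    centroids = list(points[:k])  # naive start: the first k rows
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster emptied out
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Hypothetical, already-scaled features per drop-off area:
# (median income, population density). Not real Census values.
areas = [(0.9, 0.8), (0.85, 0.9), (0.95, 0.75), (0.2, 0.3), (0.1, 0.25), (0.15, 0.2)]
centroids, labels = kmeans(areas, k=2)
```

In practice the features would need to be scaled to comparable ranges first, since k-means relies on Euclidean distance.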
An R-tree was used within MapReduce to spatially index the pickup and drop-off lat/lon by ZIP code.
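The lookup that the spatial index speeds up can be sketched as follows. This is not an R-tree: it is a plain linear scan over bounding boxes, shown only to make the point-to-ZIP assignment concrete. The box coordinates, the ZIP labels, and the CSV column order are all assumptions made for illustration.

```python
# Hypothetical bounding boxes: (min_lon, min_lat, max_lon, max_lat).
# Real ZIP boundaries are polygons; a box is only the coarse filter
# that a spatial index such as an R-tree would store.
ZIP_BOXES = {
    "ZIP_A": (-74.01, 40.74, -73.98, 40.76),
    "ZIP_B": (-74.00, 40.68, -73.97, 40.71),
}


def zip_for_point(lon, lat, boxes=ZIP_BOXES):
    """Return the first ZIP label whose box contains the point, else None."""
    for zipcode, (min_lon, min_lat, max_lon, max_lat) in boxes.items():
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            return zipcode
    return None


def map_trip(line):
    """Mapper-style step: tag one CSV record with its ZIP label."""
    record = line.rstrip("\n")
    fields = record.split(",")
    lon, lat = float(fields[0]), float(fields[1])  # assumed column order
    return zip_for_point(lon, lat), record
```

An R-tree replaces the loop in `zip_for_point` with a tree search, which matters when there are many boxes and many millions of points.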
The most common tip amounts fall between $1 and $3.
The most frequently occurring tip amount is $1, followed by $2 and $1.50.
People tend to tip in whole-dollar and half-dollar amounts.
The most common tip percentage range is 20-22%.
It is almost three times as frequent as the second most common range, 22-24%.
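The percentage ranges above can be reproduced with a simple binning pass. This sketch assumes the tip percentage is computed relative to the fare amount and uses 2-point bins to match the ranges quoted; the sample records are made up.

```python
from collections import Counter


def tip_pct_bins(trips, width=2):
    """Count trips per tip-percentage bin; trips are (fare, tip) pairs."""
    counts = Counter()
    for fare, tip in trips:
        if fare <= 0:
            continue  # a percentage is undefined for a zero fare
        pct = 100.0 * tip / fare
        low = int(pct // width) * width
        counts[(low, low + width)] += 1
    return counts


# Made-up sample records, not values from the dataset.
sample = [(10.0, 2.0), (10.0, 2.1), (20.0, 4.5), (8.0, 1.0), (0.0, 1.0)]
bins = tip_pct_bins(sample)
print(bins.most_common(2))
```

Counting raw tip amounts works the same way, with the tip value itself as the key instead of the bin.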
The average speed is between 0 and 10 MPH.
Riders tip around 20%.
The tip decreases as the speed increases, until the speed reaches 38 MPH.
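One way to relate speed and tip is to derive each trip's average speed from its distance and duration, then average the tip percentage per speed bucket. The tuple layout, the 10 MPH bucket width, and the sample values below are assumptions for illustration.

```python
from collections import defaultdict


def mean_tip_pct_by_speed(trips, bucket_mph=10):
    """trips: (distance_miles, duration_seconds, fare, tip) tuples."""
    totals = defaultdict(lambda: [0.0, 0])  # bucket start -> [sum of pct, count]
    for miles, seconds, fare, tip in trips:
        if seconds <= 0 or fare <= 0:
            continue  # skip records that cannot yield a speed or a percentage
        mph = miles * 3600.0 / seconds
        bucket = int(mph // bucket_mph) * bucket_mph
        totals[bucket][0] += 100.0 * tip / fare
        totals[bucket][1] += 1
    return {b: s / n for b, (s, n) in sorted(totals.items())}


# Made-up sample records, not values from the dataset.
sample_trips = [(1.0, 600, 10.0, 2.0), (5.0, 900, 20.0, 3.0), (2.0, 1200, 10.0, 2.2)]
by_speed = mean_tip_pct_by_speed(sample_trips)
```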
Overall trend: the greater the fare amount, the smaller the tip percentage.
For fares above $50, the tip percentage fluctuates.
For fares above $50, the tip percentage is low at fares ending in 0 or 5.
On workdays:
People tend to tip the most during off-work hours (4-7pm).
People tend to tip the least during to-work hours (6-9am).
On weekends:
Tip percentage does not fluctuate as much as on workdays.
It is slightly higher at night (8pm-5am) and in the morning (8-11am) than during the rest of the day.
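The workday and weekend comparison comes down to grouping trips by day type and pickup hour. A minimal sketch, assuming a `YYYY-MM-DD HH:MM:SS` pickup timestamp and treating Monday to Friday as workdays (public holidays are not handled); the sample records are made up.

```python
from collections import defaultdict
from datetime import datetime


def mean_tip_pct_by_hour(trips):
    """trips: (pickup_time, fare, tip). Returns {(day_type, hour): mean pct}."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of pct, count]
    for pickup, fare, tip in trips:
        if fare <= 0:
            continue
        ts = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S")  # assumed format
        day_type = "weekend" if ts.weekday() >= 5 else "workday"
        cell = totals[(day_type, ts.hour)]
        cell[0] += 100.0 * tip / fare
        cell[1] += 1
    return {key: s / n for key, (s, n) in totals.items()}


# Made-up sample records, not values from the dataset.
sample_trips = [
    ("2013-01-07 17:30:00", 10.0, 2.5),  # a Monday evening
    ("2013-01-07 07:15:00", 10.0, 1.5),  # a Monday morning
    ("2013-01-05 22:00:00", 20.0, 4.0),  # a Saturday night
]
by_hour = mean_tip_pct_by_hour(sample_trips)
```

Replacing the key with `ts.month` gives the monthly averages in the same way.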
Average tip percentage peaks in December, January, and August.
Average tip percentage is lower in spring and fall.
Hypothesis: perhaps people tend to tip more when the weather is harsh!
The number of trips peaked in March, April, October, and November.
The number of trips bottomed out in August.
Hypothesis: visitors came to the city in spring and fall; New Yorkers left the city in the summer.