Path: blob/master/big_data/spark_crime.ipynb
Spark Exercise
Note that the following notebook is not a tutorial on the basics of Spark; it assumes you're already somewhat familiar with it or can pick things up quickly by consulting the documentation along the way. Please use the following link to download the dataset.
Question 1
Using SparkSQL, generate a bar chart of average crime events by month, and find an explanation for the results.
Judging from the plot below, we can see that the average crime count tends to be higher during the summer. The reason behind this phenomenon might be:
Elevated temperature makes people more irritable, which leads to a rise in aggressive behavior and crime.
During the summer, people tend to leave their windows open more often to cool down, or spend more time outside enjoying outdoor activities. Both give burglars more opportunities to break into people's homes.
Question 2
Find the top 10 blocks in terms of crime events over the last 3 years
Find the two adjacent beats with the highest correlation in the number of crime events over the last 5 years (this will require looking at a map to determine whether the correlated beats are adjacent to each other)
Determine whether the number of crime events differs between Mayors Daley and Emanuel at a granularity of your choice (not only at the city level), and find an explanation for the results. Side information: Rahm Emanuel is the current mayor of Chicago, and Richard Daley was his predecessor (he left office in 2011)
The following section also contains a small example of how to use the mapPartitions function.
For part 1, the top 10 blocks in terms of number of crime events are located either in downtown Chicago or in the South Side of Chicago. This is probably due to the fact that those areas are densely populated, which increases the chance of crime occurrence.
For the problem of finding the two adjacent beats with the highest correlation in the number of crime events, I decided to use the RDD corr function from the mllib package. Even though there's a DataFrame corr function, I couldn't find an easy way to utilize that API.
After computing the year-level crime count correlation for every pairwise beat combination and outputting the top 100 pairs, the two adjacent beats with the highest correlation are listed below (found by manually inspecting the top 100 pairs).
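The correlation step itself is just Pearson correlation on per-beat yearly count vectors. Here is a plain NumPy sketch of the pairwise computation on toy numbers; the notebook does the equivalent at scale with mllib's corr on RDDs, and the beat ids and counts below are illustrative only.

```python
import numpy as np
from itertools import combinations

# Toy yearly crime counts per beat over 5 years; the notebook builds this
# from the real data with a groupBy on beat and year.
counts = {
    "0111": [120, 115, 130, 125, 140],
    "0112": [118, 116, 128, 126, 138],
    "0511": [300, 280, 260, 240, 220],
}

# Pearson correlation for every beat pair, sorted descending -- the notebook
# then checks the top pairs against a beat map by hand for adjacency.
pairs = sorted(
    ((np.corrcoef(counts[a], counts[b])[0, 1], a, b)
     for a, b in combinations(counts, 2)),
    reverse=True,
)
```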
For part 3, determining whether the number of crime events differs between Mayors Daley and Emanuel, we compute the monthly crime occurrences per beat and perform a paired t-test to evaluate whether the difference is significant.
We observe a large paired t-statistic when comparing monthly crime numbers between Mayors Daley and Emanuel. This indicates the data provide strong evidence against the null hypothesis that there is no difference in the number of crimes between the two mayors. And given the way the statistic is computed, a large negative value means there were significantly fewer crimes while Rahm Emanuel (the current mayor of Chicago) was in office.
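The test described above boils down to a paired t-test on matched per-beat numbers. A sketch with SciPy on toy figures (the numbers are illustrative, not from the data):

```python
import numpy as np
from scipy import stats

# Toy per-beat average monthly crime counts: index i pairs the same beat
# under Daley vs. under Emanuel.
daley   = np.array([210.0, 180.0, 250.0, 195.0, 230.0])
emanuel = np.array([170.0, 150.0, 200.0, 160.0, 185.0])

# Paired t-test: H0 says the per-beat difference (emanuel - daley) has mean
# zero. A negative t-statistic means fewer crimes under Emanuel.
t_stat, p_value = stats.ttest_rel(emanuel, daley)
```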
Question 3
Predict the number of crime events in the next week at the beat level.
Framed the problem as a supervised machine learning problem, so lagged features were created; i.e., we use weeks 1, 2, and 3's counts to predict the count for week 4. We could have framed it as a time series problem, but here we're simply using it as a chance to familiarize ourselves with Spark ML's API
Trained a RandomForest model per beat. This decision is due to the fact that if we were to train only one model, it would require one-hot encoding around 300 distinct beat values, which is often not ideal, as a categorical variable with too many distinct levels often leads to overfitting
For the RandomForest model, 3-fold cross-validation and a grid search on the maxDepth parameter were performed
A quick intro to the SQL window function:
Window functions operate on a set of rows and return a single value for each row from the underlying query. When we use a window function in a query, we define the window using the OVER() clause, which has the following capabilities:
Defines window partitions to form groups of rows (PARTITION BY clause).
Orders rows within a partition (ORDER BY clause).
e.g.
The following query uses the AVG() window function with the PARTITION BY clause to determine the average car sales for each dealer in Q1:
For more information, including the different types of window functions, the following link gives a very nice overview: Drill Documentation: SQL Window Functions Introduction
The following section ties everything together: loop through all the beats, create the lagged features, make the prediction for the next week, and report the prediction and the model's performance.
Question 4
Find patterns of crimes with arrest with respect to time of the day, day of the week, and month.
Based on the results below, we see crimes mostly happen during the evening, during the summer, and on weekdays. The rationale for why summer has the most crimes was already given in question 1, and the fact that most crimes happen during the evening is not surprising either: there are fewer people on the street to help you, and the lack of sunlight gives the criminal an advantage (i.e., you might not notice them approaching or remember what their face looks like).
As for why there is a huge drop in the number of crimes on Sunday, one possible reason is that people tend to spend the evening at home getting ready for Monday, so there's less chance of becoming the target of criminal activity.