First Kaggle
This is a brief summary of my first experience with a Kaggle competition. I entered for the learning experience, and because I wanted to play around with a real problem in machine learning.
The Problem: Crime In San Francisco
I chose the San Francisco Crime Classification competition as it seemed like a fairly well-defined problem, and it also sits in the Playground section of Kaggle, so it is probably a gentler introduction.
You are given a training data set of 878,049 crimes from the past 12 years, each with the following information:
- Date Crime Occurred
- Day of Week
- Police District
- Street Address
- X and Y coordinates
- Type of Crime
- Description
- Resolution
The challenge is to develop an algorithm that can then correctly classify other crimes using only the date and location features mentioned above.
Given a test data set of another 884,262 crimes, for each instance the algorithm needs to assign probabilities to 39 different crime categories.
The output is evaluated using the Multi-Class Logarithmic Loss function (more on this in another post), where the predicted probabilities are compared to the true categories. The goal is to make the loss as small as possible.
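For N crimes and M = 39 categories, the metric is defined as:

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log(p_{ij})$$

where $y_{ij}$ is 1 if crime $i$ actually belongs to category $j$ (and 0 otherwise), and $p_{ij}$ is the predicted probability. Since only the true category's term survives the inner sum, each prediction contributes $-\log$ of the probability it assigned to the correct class.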
Submission 1
Being very eager to jump in and try something out, I naively threw something together using scikit-learn's Random Forest algorithm (chosen because I had read that it is one of the algorithms that most frequently wins Kaggle competitions), with the following features:
- Time of day (the number of seconds since midnight)
- X & Y coordinates
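In code, that first attempt looked something like this (a rough sketch: the column names come from the competition's train.csv, while the forest parameters here are just illustrative):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv", parse_dates=["Dates"])
test = pd.read_csv("test.csv", parse_dates=["Dates"])

# Time of day as seconds since midnight
for df in (train, test):
    df["SecondsOfDay"] = (df["Dates"].dt.hour * 3600
                          + df["Dates"].dt.minute * 60
                          + df["Dates"].dt.second)

features = ["SecondsOfDay", "X", "Y"]
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train[features], train["Category"])

# Naive first pass: one hard label per crime via predict()
predictions = clf.predict(test[features])
```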
Results:
Score of 26.85, ranked 1,150 on the leaderboard.
Submission 2
My second submission was the result of randomly playing around. I cut down the features to only:
- Police precinct
- Time of Day
However, I think the biggest improvement came from using the predict_proba method instead of predict. Whereas predict effectively gives a hard classification for the most likely class only (i.e. 1 for that class and 0 for every other), predict_proba assigns a probability to every single class.
It turns out that Logarithmic Loss penalizes confident but wrong answers very heavily, so it is better to “spread your bets”, as it were, than to predict a single category.
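A quick illustration of both points with sklearn's log_loss (the numbers are made up for the example):

```python
from sklearn.metrics import log_loss

# One crime whose true category is class 0, in a 3-class toy problem
y_true = [0]
labels = [0, 1, 2]

# A confident but wrong prediction is punished severely...
confident_wrong = [[0.001, 0.998, 0.001]]
print(log_loss(y_true, confident_wrong, labels=labels))  # ~6.91

# ...whereas spreading your bets keeps the loss modest
spread_bets = [[0.4, 0.3, 0.3]]
print(log_loss(y_true, spread_bets, labels=labels))  # ~0.92
```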
Results:
Big improvement to score of 4.9, moved up 172 places.
Submission 3
Around this point, I started testing locally before submitting, by splitting the training data approximately 80%/20% into training and validation sets, in order to better measure whether my changes were actually improving my score.
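The local check is straightforward with the current scikit-learn API (here clf and features stand for whatever model and feature set are being tested):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

# Hold back ~20% of the training data as a local validation set
X_train, X_val, y_train, y_val = train_test_split(
    train[features], train["Category"], test_size=0.2, random_state=42)

clf.fit(X_train, y_train)
val_probs = clf.predict_proba(X_val)
print(log_loss(y_val, val_probs, labels=clf.classes_))
```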
For the third submission, I used only two categorical features:
- Hour of day
- Police precinct
However, the biggest improvement came from capping the maximum depth of the Random Forest at 10.
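In scikit-learn that is a one-parameter change (n_estimators is again just a placeholder):

```python
# Capping tree depth regularizes the forest: shallower trees produce
# smoother, less extreme probabilities, which log loss rewards
clf = RandomForestClassifier(n_estimators=100, max_depth=10)
```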
Results:
Improved score to 2.59, moved up 426 places.
Submission 4
Submission 4 was when I realized that pretty much all of my success up until this point had been beginner’s luck, and I needed a more systematic approach. So, after reading quite a bit about feature engineering, I used the following variables:
- Precinct
- Day of Week
- Time of Day
- X and Y coordinates (scaled to the range [0, 1])
For Precinct, Day of Week and Time of Day, I finally began treating them as proper categorical variables (one-hot encoding them using pandas get_dummies).
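A sketch of that encoding (PdDistrict and DayOfWeek are the raw competition columns; Hour is derived from Dates):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train["Hour"] = train["Dates"].dt.hour

# One-hot encode the categorical features
dummies = pd.get_dummies(train[["PdDistrict", "DayOfWeek", "Hour"]],
                         columns=["PdDistrict", "DayOfWeek", "Hour"])

# Scale the coordinates into [0, 1]
xy = MinMaxScaler().fit_transform(train[["X", "Y"]])
xy = pd.DataFrame(xy, columns=["X_scaled", "Y_scaled"], index=train.index)

features = pd.concat([dummies, xy], axis=1)
```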
Results:
Improved score to 2.49, moved up 267 places.
Submission 5
For submission 5 I got some inspiration from comments on the competition forum (thanks papadopc and SatendraKumar).
In particular I did additional feature engineering on the X and Y coordinates to:
- Rotate coordinate frames through 30, 45, 60 and 90 degrees
- Convert to polar coordinates and calculate R for each point
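A sketch of those transformations (centering the coordinates on their mean point before rotating is my own choice here; the rotation itself is the standard 2-D one):

```python
import numpy as np

def add_coordinate_features(df, angles_deg=(30, 45, 60, 90)):
    # Work with coordinates centered on the mean point
    x = df["X"] - df["X"].mean()
    y = df["Y"] - df["Y"].mean()

    # Standard 2-D rotation for each angle
    for deg in angles_deg:
        theta = np.radians(deg)
        df[f"rot{deg}_x"] = x * np.cos(theta) - y * np.sin(theta)
        df[f"rot{deg}_y"] = x * np.sin(theta) + y * np.cos(theta)

    # Radius R of the polar representation
    df["R"] = np.sqrt(x ** 2 + y ** 2)
    return df
```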
I was also inspired by a comment from papadopc to create crime counts per address; that is to say, using the training data:
- Get a list of unique street names
- For each street name count the instances of each crime category
- Use these counts as new features
(Note: I have already seen some issues with this approach, so I will revisit it later on.)
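A minimal sketch of the counting scheme (how to handle addresses that appear only in the test set is my own guess here):

```python
# Count occurrences of each crime category at each address in the training data
address_counts = pd.crosstab(train["Address"], train["Category"])
address_counts.columns = [f"addr_count_{c}" for c in address_counts.columns]

# Merge the counts back in as features; unseen addresses get zero counts
train = train.merge(address_counts, left_on="Address", right_index=True, how="left")
test = test.merge(address_counts, left_on="Address", right_index=True, how="left")
test[address_counts.columns] = test[address_counts.columns].fillna(0)
```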
Results:
Improved score to 2.44, and moved up to current position of 277 on leaderboard.
Submission 6
To be fully honest, I should note that I made an additional submission with an approach pretty similar to Submission 5, but filtering outliers out of the training data; however, this did not improve on my current best score.
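For reference, the most conspicuous outliers in this data set are the handful of rows with placeholder coordinates (Y = 90, nowhere near San Francisco), which can be filtered with something like:

```python
# San Francisco latitudes (Y) are around 37.7-37.8; drop the
# placeholder rows whose coordinates were never filled in properly
train = train[train["Y"] < 40]
```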
Summary and Next steps
So far it has been a really fun and interesting experience, and it is pretty cool that there are so many tools out there that let you make headway so quickly on complex problems.
In some respects this can also be a disadvantage, because it makes it very easy to apply algorithms when you have very little idea of what you are doing or how they work. On balance, though, I think not having to get bogged down in the details of writing efficient implementations of algorithms is very useful, and it enables far faster learning of some difficult concepts.
I am already working on my next model, including:
- Some new categorical features: Season of Year, a Daylight Saving Time indicator, and whether or not the address is an intersection (a quick sketch follows this list)
- Smarter featurization of the address data vs. crime counts (based on papadopc's solution)
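Two of those features are easy to sketch already (the daylight saving indicator is fiddlier, so it is omitted here):

```python
# Meteorological seasons from the month of the crime
month_to_season = {12: "winter", 1: "winter", 2: "winter",
                   3: "spring", 4: "spring", 5: "spring",
                   6: "summer", 7: "summer", 8: "summer",
                   9: "autumn", 10: "autumn", 11: "autumn"}
train["Season"] = train["Dates"].dt.month.map(month_to_season)

# In this data set intersections are written as "STREET A / STREET B"
train["IsIntersection"] = train["Address"].str.contains("/")
```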
Some other approaches I want to try include:
- Using a Neural Network
- Combining predictions from multiple algorithms
- Probability smoothing