Somewhere during and after my ANPR project in 2015.
This post illustrates what I tried for the Kaggle Titanic challenge and how I got to my final score of 79.9xxxx, as well as the approaches I attempted in the Right Whale Recognition competition.
During my license plate recognition project I decided to try some machine learning algorithms and learn some tricks of the trade. Having read Python for Data Analysis by Wes McKinney, I was eager to try out my newly learned skills and dived into the Titanic challenge. After I had extracted the first license plate characters, I figured I'd make a worthy opponent in the Whale Recognition challenge.
Having worked through the Titanic tutorial, which looked very similar to the current one, it seemed that the random forest approach was the most promising. Initially I tweaked the parameters more or less at random, but quickly noticed my score wasn't increasing. I started reading the forums and working on the features: dropping some and adding others. To be honest, it has been quite a while and I can't recall exactly which features I added, but I do recall they didn't improve the score all that much. More notable is that I found my passion for combining models, or simply put: ensembles. I used the following models in the ensemble for my final submission before moving on:
- Random Forest classifier
- Support Vector Machines (SVM)
- Bagging Classifier
- Naive Bayes
- Logistic Regression
From each of these models I took the probability output instead of the binary prediction, averaged the probabilities across the models, and converted that average back into a binary output. This approach, after some tweaking, finally resulted in a score of 79.9xxx.
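A minimal sketch of this probability averaging, assuming scikit-learn and a synthetic dataset standing in for the prepared Titanic features. The model settings, the 0.5 threshold and the helper name `average_probability_vote` are illustrative assumptions, not the settings of the original submission.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic binary data as a stand-in for the Titanic features (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The five model families listed above, with illustrative settings.
models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    SVC(probability=True, random_state=0),  # probability=True enables predict_proba
    BaggingClassifier(random_state=0),
    GaussianNB(),
    LogisticRegression(max_iter=1000),
]


def average_probability_vote(models, X_train, y_train, X_test, threshold=0.5):
    """Fit every model, average P(class 1) across models, then threshold."""
    probabilities = []
    for model in models:
        model.fit(X_train, y_train)
        # Column 1 of predict_proba holds the probability of the positive class.
        probabilities.append(model.predict_proba(X_test)[:, 1])
    mean_probability = np.mean(probabilities, axis=0)
    return mean_probability, (mean_probability >= threshold).astype(int)


mean_prob, predictions = average_probability_vote(models, X_train, y_train, X_test)
print("ensemble accuracy:", (predictions == y_test).mean())
```

scikit-learn's `VotingClassifier` with `voting="soft"` implements the same averaging; the explicit loop is only there to show each step.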
The Right Whale Recognition challenge was a lot more daunting than I initially thought. The dataset was approx. 11GB, while my laptop back then had only 2.99GB of available RAM and no discrete GPU, which would become a major issue, as I learned later on. I tried several approaches to cut out the whale itself and get a smaller image, until I came across this fantastically simple approach by a Kaggle user.
This recursive algorithm cut out the whales at much smaller sizes, so they were more manageable for training the model later on. However, as I learned about other users' approaches, I began to understand the importance of extracting only the head: the part that most clearly distinguishes one whale from another. Luckily, another Kaggle user provided a JSON file with all the head coordinates, ready for extraction. After 8 (!) hours my trusty old laptop had extracted all the whale heads. I then rescaled the images to a tiny (3x256x256) format, as I had read others repeatedly stressing the importance of small image sizes for reasonable run times, and the images were ready for training.
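The head extraction and rescaling step could look roughly like the sketch below, assuming Pillow and NumPy. The annotation layout (`filename`, `x`, `y`, `width`, `height`) and the helper name `crop_head` are hypothetical, since the real JSON file is not reproduced here, and an in-memory dummy image stands in for a whale photo.

```python
import json

import numpy as np
from PIL import Image


def crop_head(image, box, size=(256, 256)):
    """Crop the annotated head box and return a (3, H, W) float array in [0, 1]."""
    left, top = box["x"], box["y"]
    right, bottom = left + box["width"], top + box["height"]
    head = image.crop((left, top, right, bottom)).resize(size)
    # Pillow yields height x width x channels; move channels to the front.
    pixels = np.asarray(head.convert("RGB"), dtype=np.float32)
    return pixels.transpose(2, 0, 1) / 255.0


# Hypothetical annotation entry; the real file's keys may differ.
annotations = json.loads(
    '[{"filename": "whale_0.jpg", "x": 120, "y": 80, "width": 300, "height": 200}]'
)

# Dummy single-colour image standing in for an aerial whale photo.
image = Image.new("RGB", (640, 480), color=(30, 90, 140))

head = crop_head(image, annotations[0])
print(head.shape)  # (3, 256, 256)
```

With real data, the image would be loaded with `Image.open` using the file name from each annotation entry, and the loop over all entries is where the hours of processing time would go.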
Here is an image to give you an idea of what the whale images looked like with an annotated head.
At this stage I wanted to train a generic CNN, as provided by the example in the book [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/chap6.html), just to be able to locate the heads in the test set. However, the required training time seemed to last forever, and with a short training time the performance was horrible when trained on 80% of the train data and tested on the remaining 20%. I considered using an AWS instance for training but decided it was out of scope, as other things started to take up more and more of my time.
In hindsight, I learned a lot from the other Kaggle users and felt that this challenge was a great motivation to learn more about CNNs.
These challenges, in combination with the famous Andrew Ng Machine Learning course and the great Neural Networks and Deep Learning book, provided me with a good foundation for future endeavours. One of the most notable lessons I derived from these challenges is the importance of feature engineering and data preparation. An ensemble is quite capable of compensating for non-optimal parameters, but the greatest performance comes from good features and clean data. Understanding the data and doing some exploration before diving into the preparation is just as important. Writing this some years later, I would like to share one of my favorite data exploration approaches to date, from another Kaggle user on another challenge: link
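The 80/20 hold-out split mentioned above can be sketched with NumPy alone. The helper name `holdout_split` and the sample count are illustrative assumptions, and the returned indices would be used to select images and labels for training and validation.

```python
import numpy as np


def holdout_split(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into train and validation parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    cut = int(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


# Illustrative sample count, not the size of the actual whale dataset.
train_idx, val_idx = holdout_split(1000)
print(len(train_idx), len(val_idx))  # 800 200
```

Fixing the seed keeps the split reproducible, so scores from different training runs are compared on the same validation images.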