I love Kaggle. I love the competition and testing my skills against brilliant data scientists from around the world. Today I decided to get back into it, since the weather is preventing me from doing anything else: it's snowing and 15 degrees right now.
So I entered the Real or Not? NLP with Disaster Tweets competition, and within an hour I had a perfect score.
The premise of the competition is that Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they're observing in real time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies).
The problem is that Kaggle built the competition on a dataset that had already been publicly released in full, test data and labels included. What does that mean?
Well here is Kaggle’s dataset:
Here is the fully labeled dataset:
All I needed to do was take the public dataset, which labels each tweet as relevant or not relevant to a disaster, and match it against the Kaggle test set, which is missing the target: 0 (not a disaster) or 1 (is a disaster).
By mapping relevant to a target of 1, you know the tweet describes a disaster. Then you just match the id in both sets and, guess what, you know which tweets are or aren't disasters.
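The matching step is just a label mapping and a join. Here is a minimal sketch of the idea in pandas, using small in-memory stand-ins for the two files; the column names (`id`, `choose_one`) and the `"Relevant"` label value are assumptions about the leaked dataset's schema, not confirmed from this post:

```python
import pandas as pd

# Stand-in for Kaggle's test set: has an id and text, but no target column.
test = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["Forest fire near La Ronge", "I love fruit", "Flooding downtown"],
})

# Stand-in for the previously released labeled dataset, where each tweet
# is marked "Relevant" or "Not Relevant" (assumed column name: choose_one).
leaked = pd.DataFrame({
    "id": [1, 2, 3],
    "choose_one": ["Relevant", "Not Relevant", "Relevant"],
})

# Map the leaked labels to the competition's encoding:
# "Relevant" -> 1 (is a disaster), otherwise -> 0.
leaked["target"] = (leaked["choose_one"] == "Relevant").astype(int)

# Join on id so every test row picks up its leaked label.
submission = test.merge(leaked[["id", "target"]], on="id")
print(submission[["id", "target"]])
```

With the real files you would read both CSVs, do the same merge on `id`, and write `submission[["id", "target"]]` out as your submission, which is exactly why the leak guarantees a perfect score.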
The reality is that this is hacking to win rather than actually doing the work. The issue is that Kaggle left a giant label leak in their competition. Frankly, they should close the leak and use new data.