Data Wrangling: Cleaning up Ohio Crime Data for Machine Learning

September 12, 2018

Often it seems like the biggest part of machine learning is actually acquiring and cleaning up data. The state of Ohio provides crime data in CSV format; however, the data cannot be used out of the box. I’m sure it is useful to someone, but in its current state it is not suited to running predictions or even to BI tools. Cleaning the data and formatting it into something usable is a daunting task.

Below is an example of the original data (I clipped off the other crimes, as they are not needed to show the cleanup and changes required). First, the data is in separate files by year. You could load those files one at a time and join them, but a full-scale cleanup is better in the long run.
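To make the per-year file problem concrete, here is a minimal sketch of stacking several yearly tables into one, with a year column added so the rows stay distinguishable. The column names, the sample values, and the `combine_years` helper are placeholders made up for illustration; they are not the actual Ohio file layout, and the post's cleanup was done by hand rather than with this code.

```python
import csv
import io

# Hypothetical stand-ins for the per-year CSV files. The real column
# names and values in the Ohio data will differ.
RAW_BY_YEAR = {
    2015: "Agency,Violent crime\nTown A,10\nTown B,7\n",
    2016: "Agency,Violent crime\nTown A,12\nTown B,9\n",
}


def combine_years(raw_by_year):
    """Stack the per-year tables into one list of rows, tagging each with its year."""
    combined = []
    for year, text in sorted(raw_by_year.items()):
        for row in csv.DictReader(io.StringIO(text)):
            row["year"] = year  # added column: which file the row came from
            combined.append(row)
    return combined


rows = combine_years(RAW_BY_YEAR)
print(len(rows), rows[0])
```

With real files you would open each one from disk instead of reading from an in-memory string, but the idea is the same: one combined table plus a year column, rather than one table per file.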

Initial data for 2016:

The cleaned-up version removes empty lines and totals, along with some general housekeeping. The added columns are town, year, and county.
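The same kind of cleanup could also be scripted. The sketch below drops empty rows and totals rows, then adds town, county, and year columns. It assumes a made-up raw layout with an `Agency` column, totals rows whose first field starts with "Total", and a hand-built lookup from agency to town and county. None of those details come from the actual Ohio files, so treat the names as hypothetical.

```python
import csv
import io

# Hypothetical raw export: a blank line, a row of empty fields, and a
# totals row, with no town, county, or year columns.
RAW_2016 = """Agency,Violent crime
Town A Police,12

,
Town B Police,9
Total,21
"""

# Hypothetical lookup from agency name to (town, county).
PLACES = {
    "Town A Police": ("Town A", "County X"),
    "Town B Police": ("Town B", "County Y"),
}


def clean(text, year, places):
    """Drop empty and totals rows, then add town, county, and year columns."""
    cleaned = []
    # DictReader already skips completely blank lines.
    for row in csv.DictReader(io.StringIO(text)):
        agency = (row.get("Agency") or "").strip()
        if not agency:
            continue  # row made up of empty fields only
        if agency.lower().startswith("total"):
            continue  # totals row, not a real record
        town, county = places.get(agency, ("", ""))
        row.update({"town": town, "county": county, "year": year})
        cleaned.append(row)
    return cleaned


cleaned = clean(RAW_2016, 2016, PLACES)
for row in cleaned:
    print(row)
```

Agencies missing from the lookup get empty town and county values here rather than raising an error, which makes gaps easy to spot and fill in afterwards.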

In the end the changes weren’t monumental, but cleaning up five years of data manually was time-consuming. It was worth the work. Next, I’ll start showing some predictions based on the cleaned-up data.

