In a previous post, I discussed cleaning public Ohio crime data. As I get deeper into the data and work back through the years 2016 to 2009, many new issues come to light. The cleanup itself is also valuable, because it gets you thinking about ideas for the analysis.
When cleaning datasets for machine learning or data wrangling, it is a good idea to keep notes: issues, ideas, potential changes, and so on. Below is a running list of things that have come up while cleaning this data, before I have even started any modeling.
- Not all of my locations (cities, sheriff’s departments, etc.) have data for every month.
- Why? This points to another aspect of the data: the locations are not only cities. You’d assume this would just be city data, but there are universities and parks, amongst other things. I don’t want to assume anything, but it seems plausible that some locations, such as a park, aren’t open all year in Ohio. Or it’s just too cold to commit a crime. 😛
- It’s not just universities and parks: theme parks are listed as well, as are sheriff’s departments. I don’t know what the criteria are for these other locations being included, but it is important to note.
- What is missing? It would be great to get each location’s crime data by month, not just the full year. Why? Because then I could make assessments month by month and single out particular times of year in the analysis. I could also determine why data is missing: did a location simply not report some months that year, or was there nothing to report? Without digging into each of these cities, which would be a monumental task, all you can do is guess.
- I might even add a new column called Types. What would that entail? Categories such as park, university, sheriff’s department, city, theme park, etc.
- I will also need to compute some kind of average for all of the locations, because of the months of missing data; otherwise it will be hard to make any apples-to-apples comparison in the analysis.
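To make the first note concrete, here is a small pandas sketch of how I might flag locations with missing reporting months. The DataFrame shape and all column and location names are made up for illustration; the real cleaned data may look different.

```python
import pandas as pd

# Hypothetical long-format sample mirroring the cleaned data: one row
# per location per reporting month (names here are made up).
df = pd.DataFrame({
    "location": ["Columbus", "Columbus", "Hocking Hills State Park"],
    "month": [1, 2, 6],
    "crimes": [120, 95, 3],
})

# Count distinct reporting months per location; anything under 12
# flags a gap worth investigating.
months_reported = df.groupby("location")["month"].nunique()
gaps = months_reported[months_reported < 12]
print(gaps)
```

This only tells you a month is absent, not why; distinguishing “didn’t report” from “nothing to report” still requires checking the source.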
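The Types column idea could start as a keyword heuristic over location names. The categories and keywords below are my own guesses, not anything defined by the dataset:

```python
import pandas as pd

def location_type(name: str) -> str:
    # Keyword heuristics; categories and keywords are guesses.
    lowered = name.lower()
    if "university" in lowered or "college" in lowered:
        return "university"
    if "sheriff" in lowered:
        return "sheriff's department"
    if "park" in lowered:
        return "park"
    return "city"

locations = pd.Series(
    ["Ohio State University", "Franklin County Sheriff's Office",
     "Cedar Point", "Dayton"]
)
types = locations.apply(location_type)
```

Note the limitation: a theme park like Cedar Point has no telltale keyword and would fall through to “city,” so those would need a small manual mapping.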
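And one way to get an apples-to-apples number despite the missing months is to average per reporting month rather than summing the year. Again, the data below is invented for the sketch:

```python
import pandas as pd

# Made-up monthly counts: a city with two reporting months next to a
# park with one.
df = pd.DataFrame({
    "location": ["Columbus", "Columbus", "Hocking Hills State Park"],
    "crimes": [120, 95, 3],
})

# Average crimes per *reporting* month, so partial-year locations can
# be compared with full-year reporters on the same footing.
monthly_avg = df.groupby("location")["crimes"].mean()
```

This still assumes the reported months are representative, which for a seasonal park they may not be.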
Coming soon… we will start analyzing the data, then make predictions for 2017 and 2018 and see how close they come to the actuals.