In a previous post, I discussed cleaning public Ohio crime data. As I get deeper into the data and work back through the years 2016 to 2009, many new issues come to light. The cleanup itself is also valuable, because it gets you thinking about ideas for the analysis.
When cleaning datasets for machine learning or data wrangling, it is a good idea to keep notes: issues, ideas, potential changes, and so on. Below is a running list of things that have come up while cleaning this data, before I have even started any modeling.
- Not all of my locations (cities, sheriff’s departments, etc.) have data for every month.
- Why? This points to another aspect of the data: the locations are not only cities. You’d assume this would just be city data, but there are universities and parks, amongst other things. I don’t want to assume anything, but it seems plausible that some locations, such as a park, aren’t open all year in Ohio. Or it’s just too cold to commit a crime. 😛
- It’s not just universities and parks: theme parks are listed as well, as are sheriff’s departments. I don’t know what the criteria are for these other locations being included, but it is important to note.
- What is missing? It would be great to get each location’s crime data by month, not just the full year. Why? Because then I could make assessments month by month and single out particular times of year in the analysis. I could also determine why data is missing: did a location simply not report some months that year, or was there nothing to report? Without digging into each of these cities, which would be a monumental task, all you can do is guess.
- I might even add a new column called Types. What would that entail? Categories such as park, university, sheriff’s department, city, theme park, etc.
- I will also need to compute some kind of average for all of the locations, because of the months of missing data; otherwise it will be hard to make any apples-to-apples comparison in the analysis.
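To make the first note concrete, here is a small pandas sketch of how I might flag locations with missing reporting months. The DataFrame shape and all column and location names are made up for illustration; the real cleaned data may look different.

```python
import pandas as pd

# Hypothetical long-format sample mirroring the cleaned data: one row
# per location per reporting month (names here are made up).
df = pd.DataFrame({
    "location": ["Columbus", "Columbus", "Hocking Hills State Park"],
    "month": [1, 2, 6],
    "crimes": [120, 95, 3],
})

# Count distinct reporting months per location; anything under 12
# flags a gap worth investigating.
months_reported = df.groupby("location")["month"].nunique()
gaps = months_reported[months_reported < 12]
print(gaps)
```

This only tells you a month is absent, not why; distinguishing “didn’t report” from “nothing to report” still requires checking the source.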
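The Types column idea could start as a keyword heuristic over location names. The categories and keywords below are my own guesses, not anything defined by the dataset:

```python
import pandas as pd

def location_type(name: str) -> str:
    # Keyword heuristics; categories and keywords are guesses.
    lowered = name.lower()
    if "university" in lowered or "college" in lowered:
        return "university"
    if "sheriff" in lowered:
        return "sheriff's department"
    if "park" in lowered:
        return "park"
    return "city"

locations = pd.Series(
    ["Ohio State University", "Franklin County Sheriff's Office",
     "Cedar Point", "Dayton"]
)
types = locations.apply(location_type)
```

Note the limitation: a theme park like Cedar Point has no telltale keyword and would fall through to “city,” so those would need a small manual mapping.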
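And one way to get an apples-to-apples number despite the missing months is to average per reporting month rather than summing the year. Again, the data below is invented for the sketch:

```python
import pandas as pd

# Made-up monthly counts: a city with two reporting months next to a
# park with one.
df = pd.DataFrame({
    "location": ["Columbus", "Columbus", "Hocking Hills State Park"],
    "crimes": [120, 95, 3],
})

# Average crimes per *reporting* month, so partial-year locations can
# be compared with full-year reporters on the same footing.
monthly_avg = df.groupby("location")["crimes"].mean()
```

This still assumes the reported months are representative, which for a seasonal park they may not be.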
Coming soon… we will start analyzing the data, then make predictions for 2017 and 2018 and see how close they come to the actuals.