No dataset is going to come perfect and ready to go. There are always issues, such as bad data or missing fields. Often you will find NaN, which stands for “not a number” in Python. For example, say you ran a survey that asked for age and someone wrote in “thirty”. Well, that’s not a number, in a field where you probably want one, so once the column is converted to numbers that entry typically ends up as NaN.
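To see how an answer like “thirty” can turn into NaN, here is a minimal sketch. It assumes the survey answers arrive as text and are converted with `pd.to_numeric` using `errors="coerce"`; the sample answers are made up for illustration.

```python
import pandas as pd

# Hypothetical survey answers, read in as text
raw_ages = pd.Series(["25", "thirty", "41"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
ages = pd.to_numeric(raw_ages, errors="coerce")

print(ages)
print(ages.isna().sum())  # number of NaN entries
```

Without `errors="coerce"`, the conversion would raise an error on “thirty” instead of producing NaN.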
What do you do? Well, in Pandas you can use the fillna method. But how do you use it, and what do you fill the gaps with? Well, you don’t want to fill them with zero, for example. Why? Because it will distort the true statistics of your data. Imagine you have 100 entries and 25 are NaN. If you made those zero, can you imagine what happens to your mean? It’d look like 25% of your audience hasn’t been born yet, and the mean would skew much younger than it really is.
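The effect of zero-filling is easy to see on a toy example. The sketch below uses made-up numbers (75 respondents aged 40 and 25 missing answers), not real survey data.

```python
import numpy as np
import pandas as pd

# Made-up data: 75 respondents aged 40, plus 25 missing answers
ages = pd.Series([40.0] * 75 + [np.nan] * 25)

mean_ignoring_nan = ages.mean()           # mean() skips NaN by default
mean_zero_filled = ages.fillna(0).mean()  # zeros now count as real ages

print(mean_ignoring_nan)  # 40.0
print(mean_zero_filled)   # 30.0
```

The 25 zeros drag the mean from 40 down to 30, even though nobody in the sample is actually that young.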
A common fix is to fill in the NaN values with the mean of the column. That keeps your mean the same and essentially makes those data points a wash.
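Here is a small sketch of why mean-filling leaves the mean untouched. The ages are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing values
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 25.0])

known_mean = ages.mean()          # mean of the four known ages
filled = ages.fillna(known_mean)  # NaN entries become the mean

print(known_mean)     # 30.0
print(filled.mean())  # 30.0
```

One side effect worth knowing: because the filled values all sit exactly at the center, the spread of the column (for example `filled.std()`) comes out smaller than the spread of the known values alone.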
Let’s look at an example with Titanic data and how to use fillna in Pandas.
First you have to pull in your data.
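A minimal sketch of the loading step follows. To keep it runnable without a download, it reads a few made-up rows from a string; the column names (`Name`, `Age`, `Fare`) and values are assumptions for illustration, not real passenger records. With an actual file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Made-up rows standing in for the Titanic file; not real passenger records
csv_text = """Name,Age,Fare
Passenger A,22,7.25
Passenger B,,71.28
Passenger C,26,7.92
Passenger D,,8.05
Passenger E,36,53.10
"""

# Empty fields in the CSV are read as NaN
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df["Age"].isna().sum())  # how many ages are missing
```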
Then the solution is simple. Add one little line of code and you will fix all those NaN data points.
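As a sketch, the fill might look like the following, assuming the column with missing values is called `Age`. The small DataFrame is a made-up stand-in for the Titanic data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic data, with two missing ages
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C",
             "Passenger D", "Passenger E"],
    "Age": [22.0, np.nan, 26.0, np.nan, 36.0],
})

# The one line that matters: replace NaN ages with the mean of the known ages
df["Age"] = df["Age"].fillna(df["Age"].mean())

print(df["Age"].tolist())  # [22.0, 28.0, 26.0, 28.0, 36.0]
```

Assigning the result back to the column, rather than relying on `inplace=True`, keeps the behavior predictable across Pandas versions.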
Run your code to check whether fillna has managed to clean up your data.
You should end up with data that has no NaN values left in the column you filled.
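One way to run that check is to count the NaN values before and after the fill. This is a sketch on made-up ages, again assuming a column named `Age`.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing values
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 36.0]})

missing_before = int(df["Age"].isna().sum())
df["Age"] = df["Age"].fillna(df["Age"].mean())
missing_after = int(df["Age"].isna().sum())

print(missing_before, missing_after)  # 2 0
print(df)
```

If `missing_after` is zero, every gap in that column has been filled.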