This notebook was created by Jean de Dieu Nyandwi for the machine learning community. For any feedback, errors, or suggestions, he can be reached by email (johnjw7084 at gmail dot com), Twitter, or LinkedIn.
A Recipe: Intro to Data Preparation
You have probably heard that Machine Learning Engineers and Data Scientists spend more than 80% of their time cleaning data. That is no exaggeration, and there is a reason: "a good model comes from good data". In the real world, many datasets are messy, and it takes an enormous amount of time to get the data into good shape.
Preparing data is a process. You can manipulate features, remove them, or create new ones. To elaborate, here are the things you are most likely to do:
- To check and remove duplicate values.
- To decide whether to remove or keep missing values. Missing values can be removed, filled, or left as they are. This depends on the size of the dataset and the goal.
- To remove noise from the data, such as features that don't have good predictive power.
- To convert values/features into their proper format. As a quick example, a numeric feature should never contain text values. Or, if you are working on image-related problems, you should never have images with an extension like `image_name.bmp`; images are usually in `.jpg` or `.png` format.
- To encode categorical features.
- To create additional features. This is a creative process, but when done well, the new features can boost your machine learning model's performance, especially if they have high predictive power.
- To standardize/normalize numeric features: this means scaling the numeric features down to small values. Normalizing scales numeric features to values between 0 and 1, whereas standardizing scales them to have zero mean and unit standard deviation. Feeding scaled features to a machine learning model makes training faster.
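Several of the steps above can be sketched in a few lines of pandas. This is a minimal illustration on a made-up toy dataset (the column names and values are hypothetical, not from any real project): removing duplicates, filling missing values, one-hot encoding a categorical feature, and normalizing a numeric feature to the 0–1 range.

```python
import pandas as pd

# A toy dataset (hypothetical values) with a duplicate row,
# a missing value, and a categorical feature.
df = pd.DataFrame({
    "age": [25, 32, 32, None, 40],
    "city": ["Kigali", "Nairobi", "Nairobi", "Kigali", "Accra"],
})

# 1. Remove duplicate rows.
df = df.drop_duplicates()

# 2. Fill missing numeric values with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# 3. Encode the categorical feature as one-hot (dummy) columns.
df = pd.get_dummies(df, columns=["city"])

# 4. Normalize: scale "age" to values between 0 and 1.
df["age"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

print(df)
```

In practice you would make these choices per feature (mean-filling is only one option for missing values, as noted above), but the overall flow looks much like this.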
The points above are a shallow list, not a step-by-step procedure. There are other things involved as well, such as shuffling or randomizing the dataset and splitting it into training, validation, and test sets.
As this was only a recipe, we will see a proper data preparation workflow in the next chapters, when doing end-to-end machine learning projects.