Did someone just say “Feature Engineering”?

Shivam Dutt Sharma
Published in Analytics Vidhya
8 min read · Mar 14, 2021


FEATURE ENGINEERING

You might be here because someone at your workplace mentioned the term “Feature Engineering” to / around you and it didn’t ring a bell, yeah? It happened to me a long time ago too. Having worked as a Product Manager in most of my previous stints, I had several occasions where I worked closely with Data Scientists, at times wearing the hat myself. This term “Feature Engineering” came up quite frequently. There are a lot of shades & strings attached to it.

The most popular one being :-

Source : https://elitedatascience.com/algorithm-selection

Yep, that is the most popular problem statement for Data Scientists when it comes to Feature Engineering. In fact, the whole point of Feature Engineering is to get rid of this very problem.

Let me see if I can share some knowledge with you so that the next time you come across this fancy term, you know what it means and you can contribute to the discussion too.

Too many features, aye?

Let’s define Feature Engineering……

FEATURE ENGINEERING is the science of preparing relevant & specific data-sets for Machine Learning models so as to get the best performance (results / insights / accuracy / precision) out of them.

Let’s get deeper……

Whenever a Data Scientist or a Machine Learning Engineer works on a Data / ML model, there’s always some input data involved. The input data has certain structured columns to it, which are popularly called features in the Data Science community. It often happens that a raw dataset does not have all its features acting as relevant inputs for the ML model, thereby calling for Feature Engineering, which essentially does the job of retaining features with the right characteristics in the input data on which the ML model will be trained.

The features that you use in your dataset influence the results of a Machine Learning model more than anything else. No algorithm on its own can supplement the information gain that correct Feature Engineering provides.

I came across this article that mentioned an insight presented by the IBM Data Analytics team: “Data Scientists spend 80% of their time on Data Preparation.”

Source : https://www.dataoptimal.com/data-cleaning-with-python-2018/

On a side note; I understand that the length of any article or its reading time is directly proportional to the propensity of the reader to drop off midway. Hence, I would like to stick to the most important aspects associated with Feature Engineering.

I will be touching upon each of these :-

  • Most important techniques used in Feature Engineering.
  • Important Python libraries that are prominently used in Feature Engineering.
  • How these Feature Engineering techniques help Data Scientists & Product Managers in general.

Before we get into each of the above 3 points, let’s import the two most important Python libraries, which are indispensable for Feature Engineering.

import pandas as pd
import numpy as np

Most important techniques used in Feature Engineering :-

  1. IMPUTATION : The standard definition of Imputation in Statistics is “The process of replacing the missing values in a dataset with some substitute values.”
The art of imputation
  • Missing values in datasets are the most common problem that Data Scientists and Machine Learning engineers generally run into when they try to prepare data for Machine Learning.
  • The common reasons for such missing values may be human errors, interruptions in the data flow, privacy concerns, and so on.
  • The missing values have to be taken care of, else they will affect the performance of the Machine Learning model.
  • Some of the Machine Learning platforms automatically drop the rows which include missing values in the model training phase, which further decreases the model performance because of the reduced training size.
  • There are also cases where the machine learning algorithms do not accept datasets that have missing values and generally give an error.
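
Before deciding how to treat missing values, it helps to first quantify how much is actually missing. Here is a minimal sketch; the dataframe and its column names below are made up purely for illustration :-

import pandas as pd
import numpy as np

# A made-up dataframe with a few missing values, purely for illustration
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "score": [90.0, np.nan, 75.0, np.nan],
    "rank": [1.0, 2.0, np.nan, 4.0],
})

# Fraction of missing values per column (0.0 = none, 1.0 = everything missing)
print(df.isnull().mean())

# Fraction of missing values per row
print(df.isnull().mean(axis=1))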

As it turns out, the simplest solution to missing values is to drop the rows or the entire column. There is no standard rule that defines an optimal threshold for dropping, but one can use 75% as an example value and drop the rows and columns whose missing-value rate is higher than this threshold.
This solution is in fact so popularly accepted that it gives the same vibe as the image below :-

Modern problems require modern solutions

Let us look at what dropping null values looks like.

  • The first thing we obviously do is import the necessary libraries :- “pandas” & “numpy”.
  • Next, we read the .csv file that has the data with missing values in it.
Importing needful libraries & reading the .csv file
  • Let’s look at what this .csv file (the sample dataset we will work on to handle the null values) looks like.
Sample data with missing values
  • Next, we define a threshold for dropping the rows with null values. Here, threshold = 0.1 basically means that we will drop all those rows whose missing-value rate is more than 0.1. We do it in a slightly indirect manner: instead of dropping, I create a new dataframe out of the existing data by locating only those rows where the missing-value rate is less than the threshold (see the code sketch after this list).
Dropping the rows with null values
Data with null value rows excluded
  • Similarly, we can also take the approach where one drops the columns instead of rows, wherever the missing-value rate is greater than the threshold (also covered in the sketch after this list). The Python line of code for that should look as follows. The resultant dataframe is printed next, which, if you look at it now, has excluded the rank column because it had null values :-
Dropping the columns with null values & printing the resultant dataframe
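
To tie the steps above together, here is roughly what the code in those screenshots could look like. The file name data.csv is an assumption; the threshold and the loc-based filtering follow the walkthrough above :-

import pandas as pd

# Assumed file name for the sample dataset shown above
data = pd.read_csv("data.csv")

# Keep only the rows whose missing-value rate is below the threshold
threshold = 0.1
data_without_null_rows = data.loc[data.isnull().mean(axis=1) < threshold]
print(data_without_null_rows)

# Alternatively, keep only the columns whose missing-value rate is below the threshold
data_without_null_cols = data[data.columns[data.isnull().mean() < threshold]]
print(data_without_null_cols)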

However, there is a much better and more appreciated approach than just dropping the values, and that is called Numerical Imputation.

Numerical Imputation : What’s so much better about it? It preserves the data size. There is one thing that needs to be considered, though: it is vital to be very sure of the values that you impute in place of the missing ones. Let’s look at some examples :-

- Suppose we have a column that appears to originally contain binary values (0 & 1). However, you spot some NA values, and the other values you see are all 1. Then it is quite likely that the NA values correspond to 0, and hence it is safe to impute those NA values with 0.

- Similarly, let’s say we are looking at data that has a column containing the “total number of daily transactions”, and we see that there are some missing values. We may want to replace those missing values with zero (0), as long as we think that is a sensible solution for our overall problem.
Sometimes, what also happens is that you join a couple of different-sized tables; as a result, you would find it sensible to replace the resulting missing values with 0.

Imputing missing values with 0
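
A minimal sketch of what that zero-imputation could look like in pandas; the “transactions” column name is hypothetical :-

# Fill every missing value in the dataframe with 0
data = data.fillna(0)

# Or restrict the zero-imputation to a single column ("transactions" is a hypothetical name)
data["transactions"] = data["transactions"].fillna(0)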

- Imputing the missing values with a default value clearly comes across as a preferable option. However, there are also times when imputing the missing values with the medians of the columns turns out to be the best imputation approach.
Reasons? The central tendency of any data (mean, median, etc.) generally acts as the spokesperson for that data. Now, when we talk about mean vs. median, the median is still a better fit for imputation because it is insensitive to outliers in the data, while the mean, on the other hand, is sensitive to them.

Imputing missing values with the median
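
A minimal sketch of median imputation, assuming the same sample data as above :-

# Fill the missing values of one column with that column's median
data["rank"] = data["rank"].fillna(data["rank"].median())

# Or do it for all numeric columns at once
data = data.fillna(data.median(numeric_only=True))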

*The above example was to demonstrate how to replace missing values with the median. However, it may not be the best approach for imputing missing rank values. So, as a caution, don’t do median-based imputations on rank data.

Categorical Imputation : This kind of imputation generally comes into the picture when you are looking at a dataset that has a categorical tone to the data it contains. So what do you do in that case? You replace the missing values in a certain column with the most frequently occurring value of that column. This is obviously not the only option, but it is a good option for handling categorical columns.
There may also be times when you see that the values are mostly uniformly distributed and there isn’t a dominant value per se. In that case, you may just want to impute a category like “Other”.

Imputing missing values of categorical column with max
  • The ‘rank’ column above had missing values in rows 2 & 6.
  • The same got imputed with the max value, which is 7.0.
  • *PS : This is again not an ideal approach to fill missing values in a ‘rank’ type column. The above is just for demonstration purposes. Moreover, the column is not really a categorical one. The demonstration is just to show how you impute missing values with the max value.
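
For reference, here is roughly what the demonstration above does, alongside what a proper categorical (most-frequent-value) imputation could look like; the “city” column is hypothetical :-

# What the demonstration above did: fill the missing 'rank' values with the
# column's maximum value (not a recommended strategy for rank-like data)
data["rank"] = data["rank"].fillna(data["rank"].max())

# For a genuinely categorical column, impute the most frequently occurring value
data["city"] = data["city"].fillna(data["city"].value_counts().idxmax())

# If no single value dominates, impute a catch-all category instead
data["city"] = data["city"].fillna("Other")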

To be continued : Handling Outliers & other methods.
