What would Keanu Reeves cherish the most as a Data Scientist?

Shivam Dutt Sharma
12 min read · Mar 23, 2024
Some random street in Auckland

Keanu Reeves is known for portraying characters that have a sense of purpose. He has always been on a mission. Hasn’t he? Unapologetically determined and vengeful, with a quest for redemption.

Keanu Reeves as Neo wanted to know what the Matrix was. He was doing every possible exploratory analysis.

Keanu Reeves as John Wick wanted to fix the pain he had in his heart and hence went on a rampage, seeking vengeance for the death of his dog.

In Point Break, as Johnny Utah, he had to uncover the truth about the robberies.

Say hi to the man!

His larger / long-term / primary goal might be to perform some logistic regression / random forest classification to determine whether a certain suspect in his beloved dog’s death was the real assassin or not.

Or maybe some time-series forecasting, where he wanted to predict how his life, which started off with him as a computer engineer in The Matrix, would find its true / real purpose in the future, and whether the machines would really pose a threat to the world as opposed to how it appeared originally.

However, I think that, more than his end goals, he really enjoyed and was committed to the process. Each of his stratagems was very methodical. He took calculated steps and ensured that each one contributed significantly to his larger / end goal.

I am not really fan-boying over Keanu Reeves and bringing his greatness to light, honestly. Or maybe I am.

However, what I really intend to do is share some practices that Keanu would be keen on doing as daily chores, if he were a Data Scientist, in his pursuit to accomplish his end goals (which could be forecasting, classification, etc.).

Let me re-mention the three phrases that I have highlighted above.

  • Exploratory Analysis : Read this as Exploratory Data Analysis itself.
    We all know how much EDA matters and how much significance it holds for any ML use-case. Regardless of what model you are trying to build, EDA goes a long way in surfacing highly important insights about your data that will help you take the necessary data-preprocessing steps required for the final model building / training and evaluation.
  • fix the : Here, by fixing, I actually wanted to annotate how much fixing the data really matters. Fixing would mean things like standard scaling the numerical data, one-hot encoding the categorical variables, imputing null values, etc. Anything that may be considered part of pre-processing the data (see the short sketch right after this list).
  • uncover : Again, used as a metaphor, uncovering would mean the insights-discovery part, which brings to light those insights that are otherwise not directly evident on the surface of your data-frame. Something like correlations, or checking the linear relationship between the numerical variables.
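As a quick illustration of what that “fixing” can look like in code, here is a minimal sketch on a tiny hypothetical frame (the column names and values below are made up purely for the example):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up frame with one numerical and one categorical column, each with a missing value
toy_df = pd.DataFrame({'age': [50.0, np.nan, 30.0], 'hair_color': ['black', 'brown', np.nan]})

# Impute nulls: column mean for the numerical feature, most frequent value for the categorical one
toy_df['age'] = toy_df['age'].fillna(toy_df['age'].mean())
toy_df['hair_color'] = toy_df['hair_color'].fillna(toy_df['hair_color'].mode()[0])

# Standard-scale the numerical column
toy_df['age_scaled'] = StandardScaler().fit_transform(toy_df[['age']]).ravel()

# One-hot encode the categorical column
toy_df = pd.get_dummies(toy_df, columns=['hair_color'])
print(toy_df)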
Say hi to John Wick

To demonstrate how Keanu would really be doing each of those cherished chores of Data Science, let me take up a use-case.
Let me help his John Wick persona zero in on the suspects in Daisy’s death and classify which of them could be the real assassins behind it.
Now, we all know that Losef Tarasov was the real assassin (and the most hated villain in the John Wick series) who had killed Daisy.
But let’s say John really had to do some investigation into finding the actual thugs of Tarasov who had broken into his house and killed Daisy.
Later, we see that he kills Viggo and Losef, and also goes after Abram.

I don’t really remember every scene and character in the movie, but what I will do is create a sample dataset of all the suspects as per John, and then run some logistic regression on them to classify who could really have been behind Daisy’s death.

Alright, let’s do it!

Yay!

So, here I first created a dataset synthetically, by googling what would be some relevant features to include for a murder-suspect-classification kind of use-case.
And what you’ll see below is the data-frame for the two known primary suspects; I shall later be appending another data-frame with details of the other thugs who would have been working for Losef around the world.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
data_primary_suspects = {'name' : ['Losef Tarasov', 'Viggo Tarasov'], 'age' : [50, 30], 'any_previous_interaction_with_the_dog' : ['Y', 'N'], 'height_in_cms' : [170, 172], 'weight_in_kgs' : [80,75], 'hair_color' : ['black', 'brown'], 'eye_color' : ['black', 'black'], 'tattoo' : ['N', 'N'], 'scar' : ['N', 'N'], 'criminal_history' : ['Y', 'Y'], 'cruel_to_animals_generally' : ['Y', 'N'], 'owns_a_weapon' : ['Y', 'N'], 'owns_a_mercedes' : ['Y', 'Y'], 'has_a_dent_on_the_car' : ['Y', 'N'], 'seen_nervous_on_the_day_of_incident' : ['Y', 'N'], 'suspected_by_police' : ['Y', 'N']}
data_primary_suspects_df = pd.DataFrame(data_primary_suspects)
data_primary_suspects_df

I have just broken the data-frame down into two snapshots (one below the other) for everyone’s perusal.

Now, below I am creating another data-frame of 998 thugs who could potentially be in Losef’s gang, and then concatenating the above data-frame of 2 records with this one. Just for ease of visibility, I am sharing two snapshots of the data-frame by breaking the features into two images.

!pip install faker
from faker import Faker
fake = Faker()
num_names = 998
random_names_list = [fake.name_male() for i in range(num_names)]
more_suspects_thugs_df = pd.DataFrame({'name' : random_names_list})
more_suspects_thugs_df['age'] = [np.random.randint(30,60) for _ in range(998)]
more_suspects_thugs_df['any_previous_interaction_with_the_dog'] = [np.random.choice(['Y', 'N']) for _ in range(998)]
more_suspects_thugs_df['height_in_cms'] = [np.random.randint(150,190) for _ in range(998)]
more_suspects_thugs_df['weight_in_kgs'] = [np.random.randint(60, 100) for _ in range(998)]
more_suspects_thugs_df['hair_color'] = [np.random.choice(['black', 'brown', 'grey', 'golden']) for _ in range(998)]
more_suspects_thugs_df['eye_color'] = [np.random.choice(['black', 'brown', 'grey', 'blue', 'green']) for _ in range(998)]
more_suspects_thugs_df['tattoo'] = [np.random.choice(['Y', 'N']) for _ in range(998)]
more_suspects_thugs_df['scar'] = [np.random.choice(['Y', 'N']) for _ in range(998)]
more_suspects_thugs_df['criminal_history'] = [np.random.choice(['Y', 'N']) for _ in range(998)]
more_suspects_thugs_df['cruel_to_animals_generally'] = [np.random.choice(['Y', 'N']) for _ in range(998)]
more_suspects_thugs_df['owns_a_weapon'] = [np.random.choice(['Y', 'N']) for _ in range(998)]
more_suspects_thugs_df['owns_a_mercedes'] = [np.random.choice(['Y', 'N']) for _ in range(998)]
more_suspects_thugs_df['has_a_dent_on_the_car'] = [np.random.choice(['Y', 'N']) for _ in range(998)]
more_suspects_thugs_df['seen_nervous_on_the_day_of_incident'] = [np.random.choice(['Y', 'N']) for _ in range(998)]
more_suspects_thugs_df['suspected_by_police'] = [np.random.choice(['Y', 'N']) for _ in range(998)]

all_suspects_thugs_df = pd.concat([data_primary_suspects_df, more_suspects_thugs_df], ignore_index = True)
all_suspects_thugs_df
Final concatenated data-frame of 1000 records — part I
Final concatenated data-frame of 1000 records — part II

Following are some visualizations of the data that Google Colab has already generated and recommended to me.

Finding the set of numerical, character and categorical columns in the dataset.

num_cols = all_suspects_thugs_df.select_dtypes(include = 'number').columns.tolist()
char_cols = all_suspects_thugs_df.select_dtypes(include = 'object').columns.tolist()
cat_cols = all_suspects_thugs_df.select_dtypes(include = 'category').columns.tolist()
num_cols, char_cols, cat_cols
numerical, character and categorical columns

Alright, so far, John has prepared his dataset.
What will he do next?

He will carry out each of the three practices mentioned above, the ones that, IMO, he would cherish the most. The thing about Exploratory Data Analysis (exploratory analysis, as afore-mentioned), Data Pre-processing (data fixing, as afore-mentioned) and Insights Discovery (uncovering / un-earthing insights, as afore-mentioned) is that all of it is like re-inventing your soul in the pursuit of your eternal goal. Achieving your end goal (classification, for example, in this case) will always appear difficult if you haven’t taken the right highway. Metaphorically, the highway is the right set of analytical practices and data-handling methods one must adopt to reach the final destination.

Let’s see what John does at each of the three levels :-

Exploratory Data Analysis (exploratory analysis) :-

John shall first begin by simply summarizing the data.

  • info()
all_suspects_thugs_df.info()
info()

The thing about info() is that the analyst / data scientist primarily wants to look at the magnitude of the data he is dealing with. While it is another matter that John prepared this data himself, his concern at this stage would genuinely be the count of thugs who may be suspects in Daisy’s death. Isn’t it true that an army plans its warfare based on its recce and its inspection of the size of the enemy army? The intent here is similar.

Through info(), John gets to know the data types, missing values, memory usage and other potential data issues (formatting, data quality, etc.), in addition to the size / extent of the data-frame, which becomes evident right on the surface.
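For instance, alongside info(), a couple of one-liners (just an illustration) give the same sense of scale:

# Two quick ways to gauge the size of the hunt John is up against
print(all_suspects_thugs_df.shape)                          # (number of suspects, number of features)
print(all_suspects_thugs_df.memory_usage(deep=True).sum())  # approximate memory footprint in bytes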

  • describe()
all_suspects_thugs_df.describe()
describe()

Here, the main idea for John is to see how his data is distributed. In addition to the counts of the various (mainly numerical) features in his dataset, what do the mean, standard deviation, minimum, maximum, etc. look like? For example, he might want an idea of the average age of Daisy’s murder suspects, so that when he is on the lookout, he can shortlist suspects from a distance. The same resonates with the height of the suspects as another factor.
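If he only cares about a couple of those statistics, a narrower cut (my example, not from the original analysis) would be:

# Just the age and height summaries John would use for shortlisting from a distance
all_suspects_thugs_df[['age', 'height_in_cms']].agg(['mean', 'min', 'max'])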

  • head()
all_suspects_thugs_df.head()
first 5 rows of the data-frame — I
first 5 rows of the data-frame — II

John might then want to get into some class distribution analysis. For example, he may want to know how many of the 1000 suspects the police have already identified as potential murderers.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x = 'suspected_by_police', data = all_suspects_thugs_df)
plt.title('Class Distribution')
plt.show()

Alternatively, plotting a pie chart could also help John visually realize the class distribution.

class_count_pie = all_suspects_thugs_df['suspected_by_police'].value_counts()
plt.pie(class_count_pie, labels = class_count_pie.index, autopct='%1.1f%%', startangle=140)
plt.axis('equal')
plt.title("Class Distribution")
plt.show()
Class Distribution

Also, I feel John might want to see the class distribution across the different categorical variables in his dataset. For example, he may be thinking along the lines of how many suspects had black hair vs how many had brown hair, and so on. Or how many of the tattooed guys were already suspected by the police vs how many were not. This is more to know whether he should zero in on the guys with tattoos, or on the ones without. The code for it could be :-

# Note: the Y/N features are stored as object dtype, so cat_cols comes out empty here;
# looping over the character columns (minus the name and the target) instead
for feature in [c for c in char_cols if c not in ('name', 'suspected_by_police')]:
    plt.figure(figsize=(8, 6))
    sns.countplot(x=feature, hue='suspected_by_police', data=all_suspects_thugs_df)
    plt.title(f'Target Variable Distribution across {feature}')
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.legend(title='Class', loc='upper right')
    plt.show()

Next, he may want to do a null-values analysis (for example, is there any information missing about any of the suspects on his list?). However, since the data is primarily curated by himself, and, from a data-science standpoint, it isn’t too huge, the likelihood of spotting any missing values is anyway very low.

null values analysis
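For reference, a quick check John might run at this point could simply be:

# Quick sanity check: does any column contain missing values yet?
all_suspects_thugs_df.isnull().any()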

Though, to demonstrate how John would have handled / imputed null values if there had been any in his data, let me artificially introduce some missing values into his data-frame, all_suspects_thugs_df.

#Selecting random indices and columns for setting them to null

import pandas as pd
import numpy as np
# Defining the proportion of the null values that we want to add
null_proportion = 0.1
# Determining the number of cells to set to null
num_nulls = int(all_suspects_thugs_df.size * null_proportion)
null_cols = list(all_suspects_thugs_df.columns)
null_cols.remove('suspected_by_police')
#Randomly selecting cells to set to null
null_indices = np.random.choice(all_suspects_thugs_df.index, size = num_nulls)
null_columns = np.random.choice(null_cols, size = num_nulls)

PS: I have removed the target variable, ‘suspected_by_police’, from the columns list for null-value introduction, as it shouldn’t have null values.

#Introducing null values in the data-frame

for idx, col_ in zip(null_indices, null_columns):
    all_suspects_thugs_df.at[idx, col_] = np.nan
all_suspects_thugs_df

Now, if we look at the data-frame, we will see that null values have been introduced.

Data-frame with null values

Looking at the sum of the null-values in each of the columns

all_suspects_thugs_df.isnull().sum()
isnull().sum()

Now, John would IMO most probably be wondering how he should take care of each of these columns: all of them have null values, and for his end classification objective, he doesn’t want to jeopardize his chances of getting a good accuracy by simply dropping them.

The best way for him to take a quick decision on whether or not to drop these columns is to look at the correlations in the data-frame.

He shall begin by looking at the correlation matrix of the numerical columns, and spotting any concerning correlation between any of the numerical columns (independent features / predictors) and the target variable (‘suspected_by_police’).

But before he does that, it is imperative that he makes the ‘suspected_by_police’ column numerical in nature.

all_suspects_thugs_df['suspected_by_police'] = all_suspects_thugs_df['suspected_by_police'].replace({'Y' : 1, 'N' : 0})
all_suspects_thugs_df['suspected_by_police'] = all_suspects_thugs_df['suspected_by_police'].astype('int')
num_cols = all_suspects_thugs_df.select_dtypes(include = ['number']).columns.tolist()
all_suspects_thugs_df[num_cols].corr()
correlations on numerical columns

By common convention, highly correlated variables would be those where the correlation coefficient falls below -0.7 or above +0.7. While it is quite evident to John that none of the numerical columns exhibit such high correlations, if there had been any such values, and if, say, the number of variables had been too large to eyeball, John might have wanted to analyse this using a threshold value. As expected, every boolean value comes out as False, as can be seen below.

threshold = 0.7  # Define a threshold for correlation coefficients
highly_correlated = (all_suspects_thugs_df[num_cols].corr().abs() > threshold) & (all_suspects_thugs_df[num_cols].corr() != 1)

# Display highly correlated features
print("\nHighly Correlated Features:")
print(highly_correlated)
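If the feature list ever grew much longer, a small follow-up (my own sketch, reusing the num_cols and threshold defined above) could list only the offending pairs instead of eyeballing the whole boolean matrix:

# List only the feature pairs whose absolute correlation crosses the threshold
corr_matrix = all_suspects_thugs_df[num_cols].corr().abs()
high_pairs = [
    (a, b, round(corr_matrix.loc[a, b], 2))
    for i, a in enumerate(num_cols)
    for b in num_cols[i + 1:]
    if corr_matrix.loc[a, b] > threshold
]
print("Highly correlated pairs:", high_pairs if high_pairs else "None")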

One thing John is clear about is that dropping these columns would not be a precarious idea; however, he would still rather go ahead and impute the missing values. He now wants to decide whether he should go for mean-based imputation or linear-regression-based imputation!
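While he makes up his mind, here is a minimal sketch of what the mean-based route (with a most-frequent fill for the Y/N columns) might look like; this is my own assumption of a first pass, not John’s final choice:

# My sketch of a mean / most-frequent imputation pass (not necessarily what John ends up doing)
for col in all_suspects_thugs_df.columns:
    if col in ('name', 'suspected_by_police'):
        continue  # missing names stay as unknown identities; the target has no nulls anyway
    if all_suspects_thugs_df[col].dtype == 'object':
        # Y/N and other categorical columns: fill with the most frequent value
        all_suspects_thugs_df[col] = all_suspects_thugs_df[col].fillna(all_suspects_thugs_df[col].mode()[0])
    else:
        # numerical columns: fill with the column mean
        all_suspects_thugs_df[col] = all_suspects_thugs_df[col].fillna(all_suspects_thugs_df[col].mean())

all_suspects_thugs_df.isnull().sum()  # only 'name' should still show any nulls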


He just called me and told me he is facing some errors with his LinearRegression model for null values imputation. He is gonna be back tomorrow. Allow me a day, and I shall share the update on his hunt (read as LR model).
