John’s discoveries on Daisy’s fatality

I am assuming you are coming here from — https://shivamdutt606.medium.com/what-would-keanu-reeves-cherish-the-most-as-a-data-scientist-d1621f9eabda
The aforementioned article imagined Keanu's idiosyncrasies if he were a Data Scientist. When you read that piece in its entirety, you get to know what activities Keanu Reeves, as John, would perform as a Data Scientist (or rather, what he would cherish the most).
The piece lists the routine activities that a Data Scientist would carry out for any ML model building / designing use-case.
To give you a quick recap: John has been trying to create a Logistic Regression based classification model that will help him zero in on the potential murderers of his dog Daisy.
So far, John has created a data-frame of the suspects with a good number of features, including a predictor, 'suspected_by_police', which tells whether a suspect has already been accused by the police or not.
And then there is a target variable: 'is_the_potential_murderer'.
Next, he does some Exploratory Data Analysis, where he examines the summary stats and distributions of the numerical and categorical variables. After the EDA, he goes for data pre-processing, starting with null values. As there were no null values initially, he first introduces a few into the data-frame, purely to demonstrate the idea of null-value imputation, so that he can impute them later.
Before going for imputation, he checks whether any columns can be dropped based on their correlation with the target variable, using a threshold of 0.7. However, none of the predictor variables exhibit a correlation with the target that extreme, so he rules out dropping any columns.
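A rough sketch of what that correlation check might look like (this is an assumption about his exact code; the data-frame name comes from the code further below, and the 0.7 threshold comes from the story):
import pandas as pd
# Correlation of each numeric predictor with the target (still numeric at this stage)
corr_with_target = (
    all_suspects_thugs_df
    .select_dtypes(include=['float64', 'int64'])
    .corr()['is_the_potential_murderer']
    .drop('is_the_potential_murderer')
)
# Columns whose absolute correlation crosses the 0.7 threshold; in John's case this comes back empty
extreme = corr_with_target[corr_with_target.abs() >= 0.7]
print(corr_with_target.sort_values(key=abs, ascending=False))
print("Columns crossing the 0.7 threshold:", list(extreme.index))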
Where he got stuck was in deciding whether to go for mean-based null-value imputation or linear-regression-based null-value imputation.
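For reference, the mean-based option is the simpler of the two; here is a minimal sketch using scikit-learn's SimpleImputer (the data-frame name is assumed from the code below):
from sklearn.impute import SimpleImputer
# Replace every missing numeric value with that column's mean
mean_imputer = SimpleImputer(strategy='mean')
num_cols = all_suspects_thugs_df.select_dtypes(include=['float64', 'int64']).columns
all_suspects_thugs_df[num_cols] = mean_imputer.fit_transform(all_suspects_thugs_df[num_cols])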
Since that code still wasn't working, he resorted to KNN (k-Nearest Neighbours) to impute the missing values.
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer
# Separate numerical and categorical columns
numerical_cols = all_suspects_thugs_df.select_dtypes(include=['float64', 'int64']).columns
# categorical_cols = all_suspects_thugs_df.select_dtypes(include=['object']).columns
categorical_cols = cat_cols  # cat_cols: list of categorical column names built earlier in the piece
# Impute missing values in numerical columns using KNN
knn_imputer = KNNImputer(n_neighbors=2)
all_suspects_thugs_df[numerical_cols] = knn_imputer.fit_transform(all_suspects_thugs_df[numerical_cols])
# Impute missing values in categorical columns using the most frequent value
freq_imputer = SimpleImputer(strategy='most_frequent')
all_suspects_thugs_df[categorical_cols] = freq_imputer.fit_transform(all_suspects_thugs_df[categorical_cols])
print(all_suspects_thugs_df)



Just up next, he checks the list of categorical columns in the dataset, since he will need this list when creating dummies.
categorical_cols

The ‘name’ column will be excluded from this list later in the code, as you'll see.
Next up, checking the null values count across all the columns :-
all_suspects_thugs_df.isnull().sum()

Looking closely at the ‘is_the_potential_murderer’ column, Keanu finds that it holds float values. This is the target variable and is supposed to hold values categorical in nature. Since this is a classification use-case and Keanu plans to use a Logistic Regression model for it, the target variable must contain the classes 1 and 0, and in categorical format at that. The following is a snapshot of how the values looked in the is_the_potential_murderer column.

Thus, Keanu drops the column and later creates it again as category type.
import numpy as np
# Drop the float-typed target and recreate it (here with random 0/1 labels, per the demo)
all_suspects_thugs_df = all_suspects_thugs_df.drop(columns='is_the_potential_murderer')
all_suspects_thugs_df['is_the_potential_murderer'] = np.random.choice([0, 1], size=len(all_suspects_thugs_df))
all_suspects_thugs_df['is_the_potential_murderer'] = all_suspects_thugs_df['is_the_potential_murderer'].astype('category')
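A quick sanity check (just a sketch) that the recreated target is categorical and holds only the classes 0 and 1:
print(all_suspects_thugs_df['is_the_potential_murderer'].dtype)
print(all_suspects_thugs_df['is_the_potential_murderer'].value_counts())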
Now comes the most awaited part.
Seeing who is identified as the most likely murderer, using a classification approach through a Logistic Regression model. What a typical Logistic Regression model spits out is coefficients for the various predictors / features, plus probability scores for how likely the various potential murderers (the records in the dataset, in the literal sense) are to be the actual murderer of Keanu's dog Daisy. Since probability is the chance of an event happening, the event in focus here is the target class being 1 (one), and Keanu aims to find the name of the suspect for whom the model spits out the highest probability score.
Here's how Keanu gets to the name of the suspect the model feels is the real murderer (through the probability-score analysis).
Check the code below :-
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
categorical_cols.remove('name')  # else we'd get one dummy per name; excluding it since 'name' is an identifier, not truly categorical
# Encode categorical variables
all_suspects_thugs_df_encoded = pd.get_dummies(all_suspects_thugs_df, columns=categorical_cols)
# Scale numerical features
scaler_ = StandardScaler()
predictors = all_suspects_thugs_df_encoded.drop(columns=['is_the_potential_murderer', 'name'])
scaled_predictors = scaler_.fit_transform(predictors)
X_train, X_test, y_train, y_test, train_indices, test_indices = train_test_split(
    scaled_predictors,
    all_suspects_thugs_df_encoded['is_the_potential_murderer'],
    all_suspects_thugs_df_encoded.index,  # Preserve original indices
    test_size=0.3,
    random_state=42
)
# Step 4: Train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# Step 5: Evaluate the model
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
# Extract coefficients
coefficients = model.coef_[0]
feature_names = predictors.columns
# Create a DataFrame to display the feature names and their corresponding coefficients
coeff_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
})
# Sort the coefficients by their absolute value to see the most impactful predictors
coeff_df['Absolute Coefficient'] = coeff_df['Coefficient'].abs()
coeff_df = coeff_df.sort_values(by='Absolute Coefficient', ascending=False)
print("Feature Coefficients:\n", coeff_df[['Feature', 'Coefficient']])
# Step 6: Making predictions
# Get the probability predictions for the positive class
y_pred_prob = model.predict_proba(X_test)[:, 1]
# Find the index of the most likely murderer
most_likely_murderer_idx = y_pred_prob.argmax()
# Get the index from the test set
most_likely_murderer_index = test_indices[most_likely_murderer_idx]
# Retrieve the most likely murderer from the original DataFrame
most_likely_murderer_record = all_suspects_thugs_df.loc[most_likely_murderer_index]
# Print the name of the most likely murderer
print("Most likely murderer (name):", most_likely_murderer_record['name'])
TIME TO INTERPRET RESULTS

Accuracy?
0.5166666666666667
A bit better than random guessing (that is how an accuracy just above 0.5 is read for a binary classification model).
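One way to ground that reading (a quick sketch reusing the train/test split from above) is to compare against a majority-class baseline via scikit-learn's DummyClassifier:
from sklearn.dummy import DummyClassifier
# A model that always predicts the most frequent class; any useful model should beat this
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))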
Confusion Matrix / Classification Report?

Feature Coefficients?
Keanu, in retrospect, might really want to know which features impacted the classification in the +ve and -ve sense. Coefficients tell you the direction and strength of the relationship between a predictor variable and the target variable. If the coefficient of a predictor is positive (+ve), it means that as the value of that predictor increases, the log-odds of the event happening (in this case, the particular suspect being the actual murderer of Daisy) also increase, by a magnitude equal to the coefficient value; the reverse holds when the coefficient is -ve.
Following is an ordered list (descending by absolute value), with the signs kept intact to show the direction of the relationship.

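If the log-odds framing feels abstract, a common trick (a sketch reusing coeff_df from the code above) is to exponentiate the coefficients into odds ratios:
import numpy as np
# e^coefficient turns a log-odds change into an odds ratio: above 1 raises the odds of class 1, below 1 lowers them
coeff_df['Odds Ratio'] = np.exp(coeff_df['Coefficient'])
print(coeff_df[['Feature', 'Coefficient', 'Odds Ratio']])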
Ok, so you'd see in the code above that Keanu has calculated probability scores too (you can check the formula online). Or wait, here it is:
P = 1 / (1 + e^−(β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ))
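To tie the formula back to the code (a sketch assuming model and X_test from above), the same probabilities can be recomputed by hand and checked against predict_proba:
import numpy as np
# β0 + β1X1 + … + βnXn for every test record
z = model.intercept_[0] + X_test @ model.coef_[0]
manual_probs = 1 / (1 + np.exp(-z))
# Should print True: the hand-rolled sigmoid matches scikit-learn's predict_proba
print(np.allclose(manual_probs, model.predict_proba(X_test)[:, 1]))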
Most likely murderer (name)?
Juan Johnson
Now we can imagine what happens next.
