Probably one of the most interesting EDAs I did

I last edited the Colab file (where the EDA code is hosted) on Oct 10, 2020, but it is still as fresh in my head as it was that day. This was right after I completed my Post Graduation in AI/ML. EDA was my bread and butter for that entire year of building different types of models: Classification, Regression, NLP, you name it.
One fine day, while doing an Alum session for my Engineering college, I had to present something on the concepts and practices of EDA. I chose Roger Federer’s career on a whim. Oh boy, I am an ardent fan. I don’t think I have missed any of his important games since I started watching Tennis. And if you have already started wondering which of his games is your favorite: is it the AO 2017 final against Nadal, by any chance? If yes, we eat the same rice, mate :p
I couldn’t get his match data post ’12, though. So, here’s all the EDA from his 1998–2012 career.
Let me share some pieces of that EDA :-
OBJECTIVE OF THIS EDA
Take a close look at Roger Federer’s extraordinary career statistics from his prime years of tennis, 1998–2012, and perform a full-stack EDA. Later, also perform Predictive Analytics: predict whether he would win Grand Slams in the years that followed, and check whether those predictions turned out right.
Why Roger Federer’s Career data?
I wanted to work on a dataset that is consistent and lends itself to rich visualizations. In the Open Era of Lawn Tennis, there is arguably no better player than Roger Federer when it comes to consistently extraordinary career statistics.
Why Federer’s Career data only from 1998 to 2012?
Roger Federer did get back to his prime in later years (2017 / 2018), adding three more Grand Slams to his then tally of 17 for a total of 20. We will still stick to his career span of 1998 to 2012 for this project, because the data available for this period is cleaner and more consistent than the data for his overall career.
Here’s the link to the Colab file :-
You will need permissions to access this. So please drop a comment if you’d like to look at the code.
I am sharing a few snippets below, nevertheless :-
— — — — IMPORTING LIBRARIES — — — —
- Why am I using Pandas? I don’t think I should be answering that :p
- Let me quickly explain pylab. It is a module that provides a MATLAB-like namespace by importing functions from NumPy and Matplotlib.
- SimpleImputer is a scikit-learn class (from sklearn.impute) used to fill in the missing values in a dataset.
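
To make the SimpleImputer point concrete, here is a minimal sketch of how it fills gaps. It is not the code from the Colab file: the tiny table and its column names (aces, double_faults) are made up for illustration, and the "mean" strategy is just one of the options the class offers.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: the column names and values are invented for
# this example and are not taken from the actual Federer dataset.
matches = pd.DataFrame({
    "aces": [10.0, np.nan, 14.0],
    "double_faults": [2.0, 4.0, np.nan],
})

# Replace every missing value with the mean of its own column.
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(matches), columns=matches.columns)

print(filled)
```

The missing "aces" value becomes the mean of 10 and 14, and the missing "double_faults" value becomes the mean of 2 and 4. For text columns, a strategy such as "most_frequent" would be the one to reach for.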



— — — — EDA BEGINS — — — —
- Looking at the top 2 rows of the DataFrame
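
A quick sketch of that first peek, assuming the data is already in a pandas DataFrame. The frame below is a hypothetical stand-in with placeholder values, not the real match data; in the notebook it would come from reading the dataset file instead.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: the column names and the
# placeholder values are assumptions made only for this example.
df = pd.DataFrame({
    "year": [1998, 1998, 1999],
    "tournament": ["Tournament A", "Tournament B", "Tournament C"],
    "surface": ["Clay", "Hard", "Grass"],
})

# head(n) returns the first n rows; with no argument it returns 5.
top_two = df.head(2)
print(top_two)
```

Two rows are usually enough to check that the column names and data types look the way you expect before going any further.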















