Month 18 — Reaping the benefit of Pandas in my Data Analytics work
This is a summary of the last two months, as sickness and business travel kept me from writing an update last month.
Almost 1.5 years after starting my first programming effort, I was able to apply what I have learnt directly in my main line of work: cleaning up and preparing data for analysis using the Pandas library in Python.
Pandas has been such a blessing: a very comprehensive, thoughtful library with so much functionality. I would compare it to jQuery, as both libraries are built for developer productivity. They are not pedantic, they don't throw silly errors, and even when you make a mistake, they try to make a reasonable guess and proceed.
However, my biggest issue with Pandas has been finding material to really master it. There is a ton of documentation on the Pandas site, I have bought two courses on Pandas, and I have bought the book written by the creator of Pandas. But somehow, none of them gave me what I was looking for. I did not want a listing of function after function with some examples, which is what all of those resources provided. I wanted a way to understand deeply how certain concepts are implemented in Pandas.
I got very lucky with the very first Pandas material I touched.
Brandon is a great teacher, and he explained the fundamentals so well that the concepts really stuck. However, he did not go as deep on certain topics like MultiIndex and groupby. That is why I was looking for material to make sense of it all.
I am not new to groupby, as I have been using databases for 20+ years now. But Pandas implements groupby in a fundamentally different way from every other tool I have encountered. I got a glimpse of this beauty in this wonderful blog post:
Pandas Groupby: Summarising, Aggregating, Grouping in Python
Normally, when you do a groupby, you MUST specify the output columns and the aggregation to run on each of them. But in Pandas, a groupby returns a GroupBy object that even gives you access to the full rows in each group. I was startled to find an example in the above post where I could get the first row of each group, with all its fields and no aggregation applied. This is fantastic, as it lets me do far more, and far faster, than a typical, boring groupby implementation would. You can also apply a single aggregation to the whole group and Pandas will apply it to all the numeric columns: so beautiful and so productive. But I still feel I don't fully understand how it works deep down.
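To make that concrete, here is a small sketch with made-up sales data (the column names and groups are invented for illustration). It shows the three behaviours described above: the first full row of each group with no aggregation, one aggregation applied across the numeric columns, and direct access to a group's complete rows.

```python
import pandas as pd

# Hypothetical sales data, purely for illustration
df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "rep": ["Ann", "Bob", "Cat", "Dan"],
    "units": [10, 20, 5, 15],
    "revenue": [100.0, 250.0, 60.0, 180.0],
})

grouped = df.groupby("region")

# The first row of each group, all fields intact, no aggregation applied
first_rows = grouped.first()

# One aggregation applied to the numeric columns in one go
totals = grouped.sum(numeric_only=True)

# The GroupBy object also hands back the complete rows of any single group
east = grouped.get_group("East")
```

`get_group` is what makes the "full rows in each group" point tangible: `east` here is an ordinary dataframe holding both East rows, untouched by any aggregation.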
But apart from learning, the last two months have been application time: I was using Pandas constantly and benefiting from the automation of so much repetitive manual work. I also enjoyed the great benefit Jupyter notebooks provide in a typical data-exploration project, as the notebook becomes self-documenting while I execute each data cleanup and analysis step. So grateful for a tool like Jupyter.
But as I wrote more and more code in Jupyter, I started wondering how to test it. I began thinking about how to apply TDD to my work: what to do in Jupyter and what to do in VSCode. I am still experimenting, but I have arrived at a combination that works for me for now.
I installed pytest and pytest-watch so that the tests run automatically whenever I save. If I am writing more than a few lines of code in a Jupyter cell, I flag it as a candidate for a reusable function. Then I switch to VSCode, write the tests first, and develop incrementally using TDD.
Though I had been using Pandas for almost two months on real-life projects, the moment I started writing micro-level tests using TDD, I felt my understanding of Pandas deepen: TDD forced me to learn exactly how to compare two dataframes, how to diff them, and so on.
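As a sketch of what such a micro-level test can look like, here is a pytest-style test around a hypothetical `drop_empty_rows` helper (the helper is my invention, not one of the actual functions). The comparison uses `pandas.testing.assert_frame_equal`, which checks values, dtypes, and index and prints a readable diff on failure; `DataFrame.compare` is the companion for cell-level diffs.

```python
import pandas as pd
import pandas.testing as pdt

def drop_empty_rows(df):
    # Hypothetical helper under test: remove rows that are entirely NaN
    return df.dropna(how="all").reset_index(drop=True)

def test_drop_empty_rows():
    raw = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, None, 6.0]})
    expected = pd.DataFrame({"a": [1.0, 3.0], "b": [4.0, 6.0]})
    # Strict dataframe equality: values, dtypes, and index must all match
    pdt.assert_frame_equal(drop_empty_rows(raw), expected)
```

With pytest-watch running, saving this file re-runs the test immediately, which is what makes the red-green TDD loop practical.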
I am enjoying my journey. From this month, my son has joined me in this experiment to apply Pandas to my data-analytics work. So we are building a set of utility functions around Pandas, starting with a function to roll up a given dataset to a monthly level: find the date column and round it to month-end, then keep all the non-numeric columns and apply a default aggregation to all the numeric columns.
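That rollup might be sketched as follows. The function name, the default aggregation, and the column-detection heuristics are my assumptions for illustration, not the actual utility we wrote.

```python
import pandas as pd

def rollup_to_month(df, agg="sum"):
    """Roll a dataset up to a monthly level (sketch; names are assumptions).

    Finds the first datetime column, rounds it to month-end, then groups by
    it and every other non-numeric column, aggregating the numeric columns.
    """
    date_cols = df.select_dtypes(include="datetime").columns
    if len(date_cols) == 0:
        raise ValueError("no datetime column found")
    date_col = date_cols[0]

    out = df.copy()
    # MonthEnd(0) rolls each date forward to the end of its own month
    out[date_col] = out[date_col] + pd.offsets.MonthEnd(0)

    numeric_cols = out.select_dtypes(include="number").columns
    group_cols = [c for c in out.columns if c not in numeric_cols]
    return out.groupby(group_cols, as_index=False).agg(agg)
```

Grouping by the rounded date plus the non-numeric columns keeps those columns in the output, while `agg` collapses the numeric columns, one row per month per category combination.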
Just for the record, here are the two courses I have taken on Pandas:
Course Review — Move from Excel to Python with Pandas by Chris Moffitt from Practical Business…
I bought this course because I happened to read his wonderful article, Pandas Datatypes.
Learn Pandas to understand your data, clean it, visualize, and more!
I wanted to write a detailed review of Matt's course, but I have not found the time yet. Once I do, I will update this post.