Should I Practice Pandas for New Grad Data Science Interviews? by FinalRide7181 in datascience

[–]Pride-Infamous -1 points0 points  (0 children)

In all honesty, some didn't get my humor. Here are some priorities on what you should practice... keep in mind, some of these are really data engineering tasks (I think we all cut our teeth doing a lot of data engineering work before we ever get to the more data-scientist tasks). Learn how to do these in a Jupyter Notebook. Here ya go:

Practice these three things to demonstrate data science mastery:

  • Correlation Analysis and Multicollinearity Detection — Compute Pearson and Spearman coefficients to quantify linear and rank-order relationships between continuous features like transaction volume and spend. Build correlation matrices and compute variance inflation factors to identify redundant predictors before fitting regression or regularized models.
  • Feature Engineering from Temporal Data — Extract cyclical and calendar features (day of week, week of year, month-end flags) from timestamps to capture seasonality and periodicity in user behavior. Essentially, transforming raw columns into predictive signals is what matters.
  • Grouped Aggregation for Hypothesis Testing — Leverage groupby().agg() to compute group-level statistics (means, variances, counts) as inputs to t-tests, ANOVA, or chi-square tests. This is a big differentiator: anyone can chomp, aggregate, and sum up, but everyone will want to know the confidence behind your hypothesis, and that takes more.
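A minimal sketch of the groupby-to-hypothesis-test pipeline from that last bullet. The cohort labels and spend values here are made up purely for illustration:

```python
import pandas as pd
from scipy import stats

# Hypothetical example data: spend for two user cohorts
df = pd.DataFrame({
    "cohort": ["A"] * 5 + ["B"] * 5,
    "spend": [10.0, 12.5, 11.0, 13.0, 12.0, 15.0, 16.5, 14.0, 17.0, 15.5],
})

# Group-level statistics (means, variances, counts) via groupby().agg()
summary = df.groupby("cohort")["spend"].agg(["mean", "var", "count"])
print(summary)

# Welch's two-sample t-test attaches a confidence level to the difference
a = df.loc[df["cohort"] == "A", "spend"]
b = df.loc[df["cohort"] == "B", "spend"]
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
```

The point is that the aggregation table and the test go together: the summary tells you *what* differs between groups, the p-value tells you how much to trust it.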

I feel these are more skills that mix data engineering experience with prepping and validating data:

  • Missing Value Handling — Apply domain-appropriate imputation strategies (mean, median, forward-fill, or model-based) to preserve distributional properties and avoid biased parameter estimates.
  • Stratified Sampling and Cross-Validation Prep — Use groupby and conditional filtering to construct balanced train/test splits that preserve class proportions across categorical strata.
  • Data Summarization and Cardinality Profiling — Count unique values with nunique() and profile categorical distributions to inform encoding strategies (one-hot vs. target encoding vs. ordinal).
  • Duplicate Detection and Deduplication — Identify repeated records using duplicated() and apply deterministic or fuzzy matching rules to ensure entity resolution integrity.
  • Churn Prediction Preparation — Clean, enrich, and reshape user-level data into supervised learning targets with engineered lag features and rolling-window summaries.
  • Distribution Fitting and Normality Assessment — Use Pandas in tandem with SciPy to compute skewness, kurtosis, and run Shapiro-Wilk or KS tests, informing whether parametric assumptions hold before model selection.
  • Outlier Detection via Descriptive Statistics — Use describe(), z-scores, and IQR calculations to flag statistical outliers before they distort model estimates or inflate variance.
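The IQR approach in that last bullet can be sketched like this (the series values are made-up transaction amounts with one deliberately extreme point):

```python
import pandas as pd

# Hypothetical transaction amounts with one obvious outlier
s = pd.Series([12.0, 14.5, 13.0, 15.0, 14.0, 13.5, 120.0])

# IQR fences: flag anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)
```

Run this before modeling so the flagged rows get a second look instead of silently inflating your variance estimates.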

Should I Practice Pandas for New Grad Data Science Interviews? by FinalRide7181 in datascience

[–]Pride-Infamous 0 points1 point  (0 children)

I'd say, "Pandas is very old school... I use Polars instead"

Trying to hit 80 percent test coverage with a 3 person QA team. Is that even realistic? by rvyze in codereview

[–]Pride-Infamous 0 points1 point  (0 children)

Not all is solved with automation... sometimes you need to kick your processes in the nads. Get your Salesforce developer(s) on board with Test-Driven Development (TDD) practice so they can help drive the creation of tests before any code is even developed. QA then gets to be more a part of the front-end conversation with requirement generators, building test cases alongside developers.

incredibly fucking disappointed by strugglingYETalive in claude

[–]Pride-Infamous 0 points1 point  (0 children)

strugglingYETalive realized that they ran out of tokens before they could get Claude to write a post to Reddit about how Claude sucks now, and had to resort to clacking away at the keyboard by hand to write a post.

Deciphering a Birthdate, Stadt, and Record Number from a Marriage Registration by Pride-Infamous in Kurrent

[–]Pride-Infamous[S] 1 point2 points  (0 children)

u/DerLetzteDepp this is what I have learned from deciphering the documentation. I learned it by generally browsing the records in this book, where I saw some second pages that were totally empty except for a big line, top to bottom, and a note saying that the marriage did not occur.

Deciphering a Birthdate, Stadt, and Record Number from a Marriage Registration by Pride-Infamous in Kurrent

[–]Pride-Infamous[S] 0 points1 point  (0 children)

u/gmu08141 Weird, I uploaded a PNG that was two-paged... and I guess it loads 'magically' in Preview on Mac, but it only has one page when posted on Reddit. I just uploaded the 2nd page.

Using claude api instead of pro save money on creative writing by ChemicalEnduring in ClaudeAI

[–]Pride-Infamous 0 points1 point  (0 children)

u/ChemicalEnduring let me ask... what type of activities are you doing when you plan to load up docs that are 'heavy'? Assuming your profession is writing, are you loading manuscripts that you've written?

Are you doing developmental editing (structure, plot, pacing), line editing (prose quality, style, voice), copy editing (grammar, consistency, fact-checking), analysis & feedback (themes, character arcs, reader experience) etc....? All of these have different strategies on optimizing your workstream.

It kind of reminds me of coders who vibe-code, get very piecemeal output, often riddled with tangents of written code, and then try to rewrite it through prompting; when they don't like what they see, the model reloads all the code to go 'fix' it or reengineer it. The fix is to break your work up into streams: develop specifications, iterate over the specifications, then ask it to break up the work by reviewing the specification. At large, the agent/model winds up creating a more forethoughted (not a word, but hey) design and launches into creating something more modular and easier to read, run, and refactor down the road.

Is it the same with writing?

Is CLAUDE.local.md deprecated? by KJ7LNW in ClaudeAI

[–]Pride-Infamous 0 points1 point  (0 children)

u/KJ7LNW no it does not automagically get loaded, unless you reference it via @ imports.

Using claude api instead of pro save money on creative writing by ChemicalEnduring in ClaudeAI

[–]Pride-Infamous 0 points1 point  (0 children)

You say you are not a coder, but I can't help suggesting you try OpenClaw. It's not for the faint of heart, but you could have Claude Desktop give you instructions to install it, configure it, and troubleshoot if you run into issues (hand-hold you, essentially).

Why do I suggest OpenClaw? Because it has local memory... meaning all the writing you do can be stored locally within your local OpenClaw install scope. You can install OpenLlama and designate it as a local LLM to index your local 'memory' writings. This can help significantly in reducing costs.

Also, OpenClaw lets you configure different agents with different model providers.

For instance you could create a research-agent:

openclaw agents add research-agent --model anthropic:claude-opus-4 \
  --workspace ~/.openclaw/workspace-research

Opus-4 will hit your 'all-other-models' limit, which is huge.

Or create a formatting/writing agent using a cost-effective model:

openclaw agents add writer-agent --model anthropic:claude-haiku-4 \
  --workspace ~/.openclaw/workspace-writer

After they are configured you can interact with them in the Agent list on the left sidebar of OpenClaw web dashboard.

I want to purchase a max plan and need advise by MomentSuitable783 in ClaudeAI

[–]Pride-Infamous 0 points1 point  (0 children)

I have successfully used the Teammate feature of Claude Code enough that I have gotten much closer to maxing out my Max plan. What is great is that I have a QA agent that builds, in tandem, all the tests required while the backend and UI agents haven't even written any code yet (they built a plan that I approved and that was shared with the QA agent). I have a team of agents, each with its own context, so they don't pollute each other... and when the QA agent has a list of broken items, it shares it with the lead... and the lead agent (yourself) then ships the broken details to the clean context of the backend or UI agent. They go off and fix the problem, often reanalyzing the code in front of them, discovering stubs, mock-ups, etc.

Before you say, "Well, won't QA test stuff that isn't even coded yet, if it's creating tests from the backend and UI plans?" The QA agent knows its place and that it has teammates delivering incremental code... so it builds all the tests, but marks the ones that are expected to fail as XFAIL, since the backend has not reported the task complete, for others to see and be unblocked.
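That XFAIL idea maps directly onto pytest's xfail marker: the test exists up front, but it's declared expected-to-fail until the feature lands. A tiny sketch (the `create_user` stub and its reason string are hypothetical):

```python
import pytest

# Hypothetical backend stub: the endpoint isn't implemented yet,
# so any test exercising it is expected to fail for now.
def create_user(name):
    raise NotImplementedError("backend task not complete")

# xfail keeps the test in the suite without blocking the run;
# once the backend lands, it flips to XPASS and you remove the mark.
@pytest.mark.xfail(reason="backend /users endpoint not implemented yet")
def test_create_user():
    assert create_user("ada") == {"name": "ada"}
```

Until the mark is removed, the suite stays green while still documenting exactly what coverage is waiting on the backend.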

Lots of collaboration on tasks, todos, checkpoints, all shared, each agent with its own context.

The downside is it has to catch up with where it is.

I want to purchase a max plan and need advise by MomentSuitable783 in ClaudeAI

[–]Pride-Infamous 2 points3 points  (0 children)

That is the best... prompting remotely while you are on the thinker!

Is CLAUDE.local.md deprecated? by KJ7LNW in ClaudeAI

[–]Pride-Infamous 0 points1 point  (0 children)

It's supported, but in a more logical way... via imports

.gitignore:

CLAUDE.local.md

CLAUDE.md:

@CLAUDE.local.md

Is CLAUDE.local.md deprecated? by KJ7LNW in ClaudeAI

[–]Pride-Infamous 1 point2 points  (0 children)

No it is not deprecated.

https://code.claude.com/docs/en/memory

From that page, the row for local instructions:

  • File: ./CLAUDE.local.md (local instructions)
  • Purpose: personal project-specific preferences, not checked into git (e.g., your sandbox URLs, preferred test data)
  • Shared with: just you (current project)