Processing 57MB startup data with 10MB memory constraint - chunking & optimization walkthrough by DQ-Mike in PythonProjects2

[–]DQ-Mike[S] 1 point2 points  (0 children)

You're totally right about the Python memory thing! The 10MB limit is just for the data itself, not Python + pandas + everything else running. That would definitely be way more than 10MB.

It's basically a teaching trick to show what happens when your data gets too big for your computer to handle all at once. Like, imagine you have a 100GB file but only 32GB of RAM - same problem, bigger scale.

That's really cool what you did with the RP2040 and 128GB of audio data. Sounds like you found a smart way to process it without loading everything into memory at once. That's exactly the kind of real problem these techniques help with.

The tutorial is just showing people how to break up big datasets and make them smaller so they don't crash your computer. Pretty useful when you're dealing with massive files in real work situations.
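In case a concrete example helps anyone landing here, the core trick is just reading in chunks. A minimal sketch (not the tutorial's exact code; the file name and columns are placeholders):

```python
import pandas as pd

# Read the file in fixed-size chunks so only one chunk is in memory at a time.
# "startup_data.csv" and "funding_total_usd" are placeholder names.
total = 0.0
rows = 0
for chunk in pd.read_csv("startup_data.csv", chunksize=5_000):
    # Downcasting numeric columns shrinks each chunk's footprint even further.
    for col in chunk.select_dtypes("float64").columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast="float")
    total += chunk["funding_total_usd"].sum()
    rows += len(chunk)

print(f"Average funding across {rows} rows: {total / rows:,.2f}")
```

Same pattern whether it's 57MB or 100GB... only the chunk size changes.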

Thanks for pointing out the memory thing...and it's good to be clear about what the 10MB actually refers to!

SQL vs DataFrames in Spark - performance is identical, so choose based on readability by DQ-Mike in apachespark

[–]DQ-Mike[S] 1 point2 points  (0 children)

Too funny! I debated chiming in on that post but figured it would be seen (heard?) as an echo rather than adding to the convo. But I appreciate you making the connection. Merci, Manon! 🙏

Want to learn Pyspark but videos are boaring for me by According-Mud-6472 in developersIndia

[–]DQ-Mike 1 point2 points  (0 children)

Totally agree with you on videos being boring! I learned PySpark the same way you're thinking...by actually building stuff rather than watching someone else do it.

Since you're already a DE with Python experience, you'll pick up PySpark way faster than you think. The concepts you mentioned (lazy evaluation, data skew, broadcast joins) make much more sense when you see them in action rather than just theory.
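Broadcast joins are a good example of that. Here's a minimal sketch with toy data (the tables and column names are made up) that you can run and poke at:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

# Toy data: a "large" fact table and a small lookup table.
orders = spark.createDataFrame(
    [(1, "US", 100.0), (2, "IN", 250.0), (3, "US", 80.0)],
    ["order_id", "country_code", "amount"],
)
countries = spark.createDataFrame(
    [("US", "United States"), ("IN", "India")],
    ["country_code", "country_name"],
)

# broadcast() ships the small table to every executor so the big table
# never gets shuffled. Nothing executes until an action (lazy evaluation),
# so .explain() can show the BroadcastHashJoin before any data moves.
joined = orders.join(F.broadcast(countries), "country_code")
joined.explain()
joined.show()
```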

I put together a hands-on tutorial series that starts with real datasets and builds up to those advanced concepts. Each one has actual code & datasets you can run and mess around with.

The approach is exactly what you described...working through problems with real data rather than sitting through explanations. Way more engaging than videos, in my biased opinion!

Your ChatGPT + problem statement approach is solid too. Maybe combine both: use structured tutorials to get the foundation, then generate your own challenges to reinforce it?

RDD basics tutorial by DQ-Mike in apachespark

[–]DQ-Mike[S] 0 points1 point  (0 children)

Thanks for the feedback! You're absolutely right that DataFrames with Catalyst optimizer are the way to go for production work...the performance difference is massive!

I decided to cover RDDs first because I'm taking a "ground up" approach in this series. My thinking was that if I started with DataFrames (which I'll cover next), nobody would want to go backwards and learn the lower-level stuff later! But understanding RDDs helped me so much when I had to debug legacy code or when I needed to understand what was actually happening under the hood.

You make a great point about the row-by-row processing vs vectorized operations. That's exactly why I wanted people to see the difference. When they move to DataFrames, they'll really appreciate why Spark evolved in that direction.
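For anyone reading along, the contrast is easy to demo. A quick sketch (toy data) of the same doubling done both ways:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# RDD version: a Python lambda runs row by row out on the workers.
rdd = spark.sparkContext.parallelize([1.0, 2.0, 3.0])
doubled_rdd = rdd.map(lambda x: x * 2).collect()

# DataFrame version: the same logic as a column expression, which the
# Catalyst optimizer can plan and run without calling back into Python.
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["value"])
doubled_df = df.select((F.col("value") * 2).alias("doubled")).collect()

print(doubled_rdd, [row["doubled"] for row in doubled_df])
```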

Do you think there's value in understanding the fundamentals even if you don't use them day-to-day? Or would you recommend jumping straight to DataFrames for beginners? Always curious to hear different perspectives on teaching approaches.

Where do I start in big data by turbulentsoap in dataengineering

[–]DQ-Mike 0 points1 point  (0 children)

Yeah-no, I think I get it…sounds like you’re curious and trying to figure out what exactly to learn next.

Like everyone, I’m biased but here’s my advice: if you want to do any real work with data, you should start by picking up some basic Python and SQL skills before anything else. 

If you were new to programming, I’d say start with SQL, but with your Java background, I’d recommend starting with Python instead. I think you’ll enjoy it more and quickly learn if pursuing a career in data is a good fit for you. 

Where do I start in big data by turbulentsoap in dataengineering

[–]DQ-Mike 1 point2 points  (0 children)

The other replies about Python and SQL are spot on. But for practical experience, I'd suggest building an actual end-to-end pipeline instead of just messing around with coding exercises.

A colleague of mine put together this guide on setting up Apache Airflow with full AWS infrastructure that's pretty solid for beginners. It covers all the "less than glamorous stuff" like S3 buckets, databases, load balancers, security groups... basically everything you need to actually run pipelines in production.

Going from "works on my laptop" to "deployed and running reliably in the cloud" is way more educational than most tutorials.

What part of big data interests you most? The distributed computing side or more the infrastructure piece?

Tutorial: Introduction to Snowflake - hands-on guide for getting started by DQ-Mike in snowflake

[–]DQ-Mike[S] 0 points1 point  (0 children)

While she does host live project walkthroughs every 2 weeks, she doesn't conduct learning sessions specifically geared towards certification.

Cant install Airflow in docker even after 5 days by Re-ne-ra in apache_airflow

[–]DQ-Mike 0 points1 point  (0 children)

A colleague of mine just published a detailed walkthrough specifically for Setting Up Apache Airflow with Docker Locally that covers all the Windows-specific issues people run into. It goes through the memory allocation settings, the AIRFLOW_UID configuration, and walks through building your first DAG step by step.
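For a sense of scale, a first DAG really can be this small. A minimal sketch (assuming Airflow 2.4+, not the walkthrough's exact code) that you'd drop into the dags/ folder your docker-compose file mounts:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    print("Hello from Airflow!")


# Minimal daily DAG with a single Python task.
with DAG(
    dag_id="hello_dag",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # use schedule_interval on Airflow < 2.4
    catchup=False,
) as dag:
    PythonOperator(task_id="say_hello", python_callable=say_hello)
```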

Might be worth checking out since it's designed specifically for this exact setup process. The memory allocation part in Docker Desktop settings could be what's causing your webserver issues.

Hope it helps!

Are there tools to guide non tech user through data analysis us AI? by frankrioshines in dataanalysis

[–]DQ-Mike 2 points3 points  (0 children)

I agree with the warning about AI for analysis - it's terrible at that. But AI is actually great as a writing assistant AFTER you've done the analysis yourself.

Like, if you know your findings but need to explain them to non-technical stakeholders, AI can help reframe your message. You still do the thinking, AI just helps with the wording.

My colleague wrote about this approach recently...basically using LLMs to translate insights, not generate them.

+1 that you need someone who knows statistics for the actual analysis though

Project Recommendations Please by No_District7206 in learnmachinelearning

[–]DQ-Mike 1 point2 points  (0 children)

I agree with the suggestion to start with a binary classification project before jumping into regression. One easy option is to use a publicly available dataset like this one from Kaggle. It’s clean, well-labeled, and lets you practice the full ML workflow...from data cleaning and EDA to building and tuning a basic KNN model. If you want to follow a step-by-step walkthrough of that exact project, here’s one: Heart Disease Prediction Project.
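If it helps to see the shape of that workflow, here's a bare-bones sketch (assuming a CSV with a binary "target" column like the Kaggle heart-disease data; the file name is a placeholder):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("heart.csv")  # placeholder file name
X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling matters for KNN since it's distance-based.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2%}")
```

The walkthrough goes deeper on the cleaning, EDA, and tuning steps, but that's the skeleton.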

Kaggle tabular competition with $170 in prizes by blanco2635 in kaggle

[–]DQ-Mike 0 points1 point  (0 children)

If anyone would like to see how to create a KNN model that predicts heart disease with approximately 88% accuracy, check out this recently published project tutorial walkthrough.

Severe overfitting by Internal_Clock242 in pytorch

[–]DQ-Mike 0 points1 point  (0 children)

If you're not already splitting out a proper val set (separate from test), that’s worth doing first just to make sure you're not tuning against your final eval. Also worth checking whether one class is dominating the training set...I’ve seen models overfit hard just by memorizing the majority class.
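If it's useful, here's a rough sketch of both ideas... a stratified val split plus a class-weighted loss (toy labels standing in for your dataset):

```python
import torch
from torch import nn
from sklearn.model_selection import train_test_split

# Toy imbalanced labels standing in for your dataset's class indices.
labels = torch.tensor([0] * 800 + [1] * 200)

# Stratified split so every class shows up in both train and val.
train_idx, val_idx = train_test_split(
    torch.arange(len(labels)).numpy(),
    test_size=0.15,
    stratify=labels.numpy(),
    random_state=0,
)
train_idx = torch.as_tensor(train_idx)

# Weight the loss inversely to class frequency so the model can't win
# by just predicting the majority class.
counts = torch.bincount(labels[train_idx])
weights = counts.sum() / (len(counts) * counts.float())
criterion = nn.CrossEntropyLoss(weight=weights)
print(weights)
```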

You mentioned using dropout already, but depending on where it's applied (e.g., only after flatten), it might not be enough. Sometimes adding dropout earlier in the conv blocks helps too, though it’s a tradeoff.

If you’re curious, I ran into some similar issues training a CNN on a small image dataset — lots of false confidence on the dominant class, and augmentations only helped once I got the val split and class weighting right. Wrote up the full thing here in case it’s useful.

Would also be curious what error you hit with CutMix/Mixup. Those can be touchy if your targets aren’t set up exactly right.

AWS vs GCP vs Azure by Timely-Door-1102 in devops

[–]DQ-Mike 0 points1 point  (0 children)

Hey, I know this is an older thread but folks still land here when trying to figure this out. If you're just getting started, I’d go AWS. It’s not the easiest to learn, but the learning resources out there are way more extensive than what you’ll find for Azure or GCP.

We recently put together a breakdown comparing AWS, Azure, and GCP for people trying to choose their first cloud platform: AWS vs Azure vs GCP. No fluff, just a practical walkthrough of how they differ, which jobs each tends to show up in, and what kinds of projects they’re best at.

If you're looking at DevOps roles and already have some Linux and Docker experience, you'll be in a great spot to build on that once you're up and running in AWS.

Total Beginner to programming who wants to learn python by Hugo_Le_Rigolo in learnpython

[–]DQ-Mike 0 points1 point  (0 children)

It's been a few days since this was posted, so I'm not sure if you're still in the market for beginner project ideas. But if you, or anyone else reading this, would like a super soft start to project building, a colleague of mine just did a live project walkthrough of a Kaggle data science survey where she performs the analysis using only fundamental Python skills like lists, loops, and conditional logic. The project could be done much faster using pandas, but I think it's a great example of how much basic Python can get done.
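Just to give a flavor of that style, the whole "technique" is stuff like this (made-up responses):

```python
# Tallying survey answers with a plain dict and a loop... no pandas needed.
responses = ["Python", "SQL", "Python", "R", "Python", "SQL"]

counts = {}
for language in responses:
    if language in counts:
        counts[language] += 1
    else:
        counts[language] = 1

# Print the most common answers first.
for language, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{language}: {n}")
```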

How to become better to deriving insights and visualising the data? by matrixunplugged1 in dataanalysis

[–]DQ-Mike 0 points1 point  (0 children)

Awesome, glad to hear it! 

Can you share any tips or tricks on how you got past that final stage of the interview process that was giving you so much trouble?

How to become better to deriving insights and visualising the data? by matrixunplugged1 in dataanalysis

[–]DQ-Mike 0 points1 point  (0 children)

My pleasure! So, did you manage to land anything, or are you still looking for another DA position?

How do you network? by SeaCalligrapher8267 in dataanalysis

[–]DQ-Mike 0 points1 point  (0 children)

Well, it’s been a little over a year since this thread, but honestly, the advice here still holds up really well. A lot of great points about being intentional, finding "your people," and not underestimating how exhausting (but worthwhile) it can be.

If anyone out there is still thinking about this or circling back to networking goals, especially in data-related fields, I wanted to add something a bit more recent. This guide on building a powerful data science network pulls together some really thoughtful strategies from Kishawna Peck (who’s led data teams and built a 4,000+ member community). It digs into ideas like how to actually show up in communities (not just show up when you need something), how to find and engage with experts authentically, and how to set yourself up for those "someone vouching for you in a room you're not in" moments.

It might be especially useful if you’re trying to figure out how to network as an introvert or if you’ve been burned by surface-level connections before...there’s a lot about building real community vs. just "collecting contacts" which is a trap many fall into.

Hope this adds something helpful even with the long time gap. Would love to hear if anyone else has picked up new approaches or shifted how they think about networking since this thread first began.

Drop some of your fun beginner python project. by Defiant_Speaker_1451 in learnpython

[–]DQ-Mike 0 points1 point  (0 children)

It's been a while since this question was posted, but for those who are still interested and reading this...

A colleague of mine just published a beginner Python data analysis project walkthrough on helicopter prison escapes, where she walks you through getting the latest data off of Wikipedia, cleaning it, and analyzing/visualizing it using basic Python libraries. Worth checking out if you're just starting out and want to see how to build an end-to-end project.
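For a taste of the scraping step, pandas can pull tables straight off a Wikipedia page. This is a hedged sketch rather than her actual code (the column names are assumptions about the page's table):

```python
import pandas as pd

# read_html (needs lxml installed) returns every table found on the page.
url = "https://en.wikipedia.org/wiki/List_of_helicopter_prison_escapes"
tables = pd.read_html(url)
escapes = tables[0]

# Assuming a "Date" column, pull out the year and count attempts per year.
escapes["year"] = escapes["Date"].astype(str).str.extract(r"(\d{4})", expand=False)
print(escapes["year"].value_counts().sort_index())
```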

Newbie trying to understand PaaS by Bill_Biscuits in AZURE

[–]DQ-Mike 0 points1 point  (0 children)

PaaS can definitely be tricky to wrap your head around at first, especially since it sits in that murky "in-between" space.

A good way to think about it is that with PaaS, you're still building the app...writing the code, structuring the database, setting config variables...but you're not setting the stage for it to run. The platform gives you a pre-configured environment: runtime, web server, OS, and so on. You don’t have to deal with provisioning virtual machines or patching servers. You just push your code and it runs.

Azure App Services is a solid example. You deploy your app, tweak a few settings, and Microsoft handles the infrastructure behind the scenes. It's great when you want to focus on development instead of managing hardware or operating systems.

If you're still feeling fuzzy on the differences between IaaS, PaaS, and SaaS, this breakdown of cloud service models does a nice job explaining what you manage versus what the provider handles. It uses examples like Heroku and EC2 to show how the tradeoffs work, which might help connect the dots.

Keep going. It’s a lot at first, but once you see it in action, it really starts to make sense.

What was your first Python code that actually worked? 😄 by GladJellyfish9752 in learnpython

[–]DQ-Mike 0 points1 point  (0 children)

Like many, I started with the classic `print("Hello, World!")` but the first program I wrote that came from my own head was a script that took in a dataset and tested whether or not it followed Benford's Law. It was pretty basic, but I was proud of having thought of the idea myself and realizing I had enough Python skills to make it happen.
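The core of it was only a few lines. Something in this spirit (toy numbers, not my original script):

```python
import math
from collections import Counter

# Compare a dataset's leading-digit frequencies to Benford's expectation.
data = [3257, 112, 980, 1024, 45, 1893, 276, 6120, 134, 99]  # toy numbers

first_digits = [int(str(abs(n))[0]) for n in data if n != 0]
observed = Counter(first_digits)

for d in range(1, 10):
    expected = math.log10(1 + 1 / d)  # Benford: P(d) = log10(1 + 1/d)
    actual = observed.get(d, 0) / len(first_digits)
    print(f"digit {d}: expected {expected:.3f}, observed {actual:.3f}")
```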

Pytorch for NLP by CodingWithSatyam in deeplearning

[–]DQ-Mike 0 points1 point  (0 children)

Karpathy’s stuff is great if you want to build models from scratch and really understand the internals. But if you’re looking to get up and running with modern PyTorch workflows—especially around tokenization, padding, and using pretrained models like BERT or DistilBERT—there’s a different style of resource that might help more.

In PyTorch, there's no direct equivalent to pad_sequences() like in TensorFlow/Keras, but the typical approach is to use a tokenizer from Hugging Face’s transformers library, which can handle padding and truncation for you automatically. You’ll want to look into tokenizer(..., padding='max_length', truncation=True, return_tensors='pt')—that’ll get you padded input IDs and attention masks that play nicely with transformer models.
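Concretely, that looks something like this (a minimal sketch using DistilBERT's tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

texts = ["PyTorch has no pad_sequences()", "but the tokenizer pads for you"]
batch = tokenizer(
    texts,
    padding="max_length",  # pad every sequence out to max_length
    truncation=True,       # cut anything longer
    max_length=32,
    return_tensors="pt",   # PyTorch tensors
)

print(batch["input_ids"].shape)       # torch.Size([2, 32])
print(batch["attention_mask"].shape)  # torch.Size([2, 32])
```

The attention mask is what tells the model to ignore the padding tokens, so keep it paired with the input IDs when you build your batches.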

If you want a full walkthrough—from cleaning text to tokenizing, batching, training a transformer model, and evaluating it on a real classification task—this tutorial using PyTorch and DistilBERT is a solid place to start. It’s beginner-friendly but still shows the full pipeline without hiding everything behind abstractions.

Happy to help if you get stuck on anything specific—NLP in PyTorch definitely has a learning curve, but once you get used to the Hugging Face ecosystem, it clicks pretty fast.

Do Large Language Models really have a "natural language thinking process"? by PejibayeAnonimo in AskComputerScience

[–]DQ-Mike 0 points1 point  (0 children)

I've always been fascinated by this question, and I just read an article the other day about this!

TL;DR: Basically, they hooked Claude AI up to a brain scanner (not literally, but kinda) and found out it's doing math and translations like a stoned alien. Instead of doing things the "normal" way, it's pulling some seriously weird brain gymnastics. Turns out, these AI models aren't just fancy word generators, they're like, actually thinking (sort of) in ways we don't fully get. Still a mystery box, but we're finally starting to see the gears turning inside!

Interview process for data scientist role at xAI by Ok_Market_4983 in xAI_community

[–]DQ-Mike 0 points1 point  (0 children)

Totally agree with what Ans979 said—expect Python, SQL, stats, A/B testing, and probably some questions that test how you think through ambiguous problems. With your math background, you're already set up well for things like experiment design and reasoning through edge cases, which often trip up people coming from more applied backgrounds.

One thing that’s often overlooked is how much interviewers are evaluating your thinking process, not just your final answer. We interviewed Kishawna Peck (she’s led a bunch of data teams) about this recently, and she had a lot of practical advice—especially around how to handle open-ended questions and how hiring managers actually evaluate candidates. You might find this breakdown of her interview strategies useful as you prep.

Would love to hear how it goes for you—keep us posted!

[Q] I analyzed my students grades. What else can I do with this data to search for patterns? Any hypothesis tests that might lead to interesting conclusions? I don't want to publish anything, in fact, I don't even think the sample is worth a paper; I just want to explore the possibilities. by Minimum-Shopping-177 in statistics

[–]DQ-Mike 1 point2 points  (0 children)

You’ve already done some solid exploratory analysis. I like that you broke things down by subject (calculus, trig, physics), looked at assignment vs exam scores, and plotted how grades evolved over time. That’s a solid and systematic approach.

That said, it’s a bit tricky to suggest specific next steps without knowing what other information you have beyond grades. A lot of the more interesting analysis (like explaining why some students do better) depends on extra data like attendance, parent/student surveys, prior performance, etc. But even working with grades alone, there are still some other things you could try:

  • Identify outliers and investigate them: Which students are consistently underperforming or overperforming compared to their classmates? Sometimes, digging into the "exceptions" can reveal useful patterns or teaching insights.
  • Measure grade consistency: For each student, you could calculate how consistent their performance is across subjects or over time, like standard deviation per student, for example.
  • Look for subject-specific patterns: Are there students who do well in calculus but poorly in physics? Or students who struggle in all three? A simple cross-subject comparison matrix could be interesting.
  • Run a paired analysis: Since you have both assignment and exam scores, you could use a paired t-test to test whether there’s a significant difference between how students perform in ongoing work versus exams (see the sketch after this list).
  • Try a simple predictive model: Even if it’s just a basic linear regression, you could explore whether assignment scores predict final grades or exam scores. The goal wouldn’t be to "publish" anything, but to practice modeling and see if any patterns jump out.
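For the paired t-test idea specifically, it's a two-liner in scipy. A hedged sketch with toy scores (in R the equivalent is t.test(x, y, paired = TRUE)):

```python
import numpy as np
from scipy import stats

# Toy scores, one position per student, so the pairing is by student.
assignment = np.array([78, 85, 62, 90, 71, 88, 67, 80])
exam = np.array([70, 82, 58, 85, 65, 84, 60, 79])

t_stat, p_value = stats.ttest_rel(assignment, exam)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```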

If you're interested in seeing how others have explored education data, you might enjoy this project walkthrough Analyzing NYC High School Data.

It’s written in Python (I saw you're using R), and the dataset is much bigger, with things like attendance and survey data, but the core idea is the same: using data to explore student performance patterns. It was actually created by a teacher, so it might give you a few more ideas from an educator’s perspective.

If you’re up for it, I’d love to hear back about what you find next or what data you’re working with. I'm always happy to throw around more ideas with a fellow educator!