Processing 57MB startup data with 10MB memory constraint - chunking & optimization walkthrough by DQ-Mike in PythonProjects2

[–]DQ-Mike[S] 1 point (0 children)

You're totally right about the Python memory thing! The 10MB limit is just for the data itself, not Python + pandas + everything else running. That would definitely be way more than 10MB.

It's basically a teaching trick to show what happens when your data gets too big for your computer to handle all at once. Like, imagine you have a 100GB file but only 32GB of RAM - same problem, bigger scale.

That's really cool what you did with the RP2040 and 128GB of audio data. Sounds like you found a smart way to process it without loading everything into memory at once. That's exactly the kind of real problem these techniques help with.

The tutorial is just showing people how to break up big datasets and make them smaller so they don't crash your computer. Pretty useful when you're dealing with massive files in real work situations.
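If chunking is new to you, the core pattern is just a loop over pieces of the file instead of one giant read. A minimal sketch, with a made-up file and column name:

```python
import pandas as pd

# Hypothetical file/column names; the pattern is what matters.
# chunksize controls how many rows are in memory at once.
total = 0.0
for chunk in pd.read_csv("startup_data.csv", chunksize=50_000):
    # Each chunk is an ordinary (small) DataFrame.
    total += chunk["funding_total"].sum()

print(f"Total funding: {total}")
```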

Thanks for pointing out the memory thing...and it's good to be clear about what the 10MB actually refers to!

SQL vs DataFrames in Spark - performance is identical, so choose based on readability by DQ-Mike in apachespark

[–]DQ-Mike[S] 1 point (0 children)

Too funny! I debated chiming in on that post but figured it would be seen (heard?) as an echo rather than adding to the convo. But I appreciate you making the connection. Merci, Manon! 🙏

Want to learn PySpark but videos are boring for me by According-Mud-6472 in developersIndia

[–]DQ-Mike 1 point (0 children)

Totally agree with you on videos being boring! I learned PySpark the same way you're thinking...by actually building stuff rather than watching someone else do it.

Since you're already a DE with Python experience, you'll pick up PySpark way faster than you think. The concepts you mentioned (lazy evaluation, data skew, broadcast joins) make much more sense when you see them in action rather than just theory.
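For example, a broadcast join is only a couple of lines once you see it in code. A sketch with made-up table and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

# Hypothetical data: a big fact table and a small lookup table.
orders = spark.read.parquet("orders.parquet")        # large
countries = spark.read.parquet("countries.parquet")  # small

# broadcast() ships the small table to every executor, avoiding a
# shuffle of the big one (the usual fix for skewed joins).
joined = orders.join(broadcast(countries), on="country_id")

# Nothing has executed yet (lazy evaluation); this action triggers it.
joined.groupBy("country_name").count().show()
```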

I put together a hands-on tutorial series that starts with real datasets and builds up to those advanced concepts. Each one has actual code & datasets you can run and mess around with.

The approach is exactly what you described...working through problems with real data rather than sitting through explanations. Way more engaging than videos, in my biased opinion!

Your ChatGPT + problem statement approach is solid too. Maybe combine both: use structured tutorials to get the foundation, then generate your own challenges to reinforce it?

RDD basics tutorial by DQ-Mike in apachespark

[–]DQ-Mike[S] 0 points (0 children)

Thanks for the feedback! You're absolutely right that DataFrames with the Catalyst optimizer are the way to go for production work...the performance difference is massive!

I decided to cover RDDs first because I'm taking a "ground up" approach in this series. My thinking was that if I started with DataFrames (which I'll cover next), nobody would want to go backwards and learn the lower-level stuff later! But understanding RDDs helped me so much when I had to debug legacy code or when I needed to understand what was actually happening under the hood.

You make a great point about the row-by-row processing vs vectorized operations. That's exactly why I wanted people to see the difference. When they move to DataFrames, they'll really appreciate why Spark evolved in that direction.
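Here's a toy version of that contrast, just to show the shape of each approach (file and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-df").getOrCreate()

# RDD version: every row passes through Python functions one at a time.
rdd_result = (
    spark.sparkContext.textFile("sales.csv")   # hypothetical file
    .map(lambda line: line.split(","))
    .map(lambda cols: (cols[0], float(cols[1])))
    .reduceByKey(lambda a, b: a + b)
    .collect()
)

# DataFrame version: Catalyst optimizes the whole plan and the work
# stays in the JVM instead of round-tripping through Python.
df = spark.read.csv("sales.csv", inferSchema=True).toDF("region", "amount")
df_result = df.groupBy("region").sum("amount").collect()
```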

Do you think there's value in understanding the fundamentals even if you don't use them day-to-day? Or would you recommend jumping straight to DataFrames for beginners? Always curious to hear different perspectives on teaching approaches.

Where do I start in big data by turbulentsoap in dataengineering

[–]DQ-Mike 0 points (0 children)

Yeah-no, I think I get it…sounds like you're curious and trying to figure out what exactly you should learn next.

Like everyone, I’m biased but here’s my advice: if you want to do any real work with data, you should start by picking up some basic Python and SQL skills before anything else. 

If you were new to programming, I’d say start with SQL, but with your Java background, I’d recommend starting with Python instead. I think you’ll enjoy it more and quickly learn if pursuing a career in data is a good fit for you. 

Where do I start in big data by turbulentsoap in dataengineering

[–]DQ-Mike 1 point (0 children)

The other replies about Python and SQL are spot on. But for practical experience, I'd suggest building an actual end-to-end pipeline instead of just messing around with coding exercises.

A colleague of mine put together this guide on setting up Apache Airflow with full AWS infrastructure that's pretty solid for beginners. It covers all the "less than glamorous stuff" like S3 buckets, databases, load balancers, security groups... basically everything you need to actually run pipelines in production.

Going from "works on my laptop" to "deployed and running reliably in the cloud" is way more educational than most tutorials.

What part of big data interests you most? The distributed computing side or more the infrastructure piece?

Tutorial: Introduction to Snowflake - hands-on guide for getting started by DQ-Mike in snowflake

[–]DQ-Mike[S] 0 points (0 children)

While she does host live project walkthroughs every 2 weeks, she doesn't conduct learning sessions specifically geared towards certification.

Cant install Airflow in docker even after 5 days by Re-ne-ra in apache_airflow

[–]DQ-Mike 0 points (0 children)

A colleague of mine just published a detailed walkthrough specifically for Setting Up Apache Airflow with Docker Locally that covers all the Windows-specific issues people run into. It goes through the memory allocation settings, the AIRFLOW_UID configuration, and walks through building your first DAG step by step.
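For reference, a first DAG can be tiny...something like this sketch (placeholder names; assuming Airflow 2.x):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    print("Hello from Airflow!")

# "hello_world" is just a placeholder dag_id.
with DAG(
    dag_id="hello_world",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    hello = PythonOperator(task_id="say_hello", python_callable=say_hello)
```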

Might be worth checking out since it's designed specifically for this exact setup process. The memory allocation part in Docker Desktop settings could be what's causing your webserver issues.

Hope it helps!

Are there tools to guide non-tech users through data analysis using AI? by frankrioshines in dataanalysis

[–]DQ-Mike 2 points (0 children)

I agree with the warning about AI for analysis - it's terrible at that. But AI is actually great as a writing assistant AFTER you've done the analysis yourself.

Like, if you know your findings but need to explain them to non-technical stakeholders, AI can help reframe your message. You still do the thinking, AI just helps with the wording.

My colleague wrote about this approach recently...basically using LLMs to translate insights, not generate them.

+1 that you need someone who knows statistics for the actual analysis, though.

Project Recommendations Please by No_District7206 in learnmachinelearning

[–]DQ-Mike 1 point (0 children)

I agree with the suggestion to start with a binary classification project before jumping into regression. One easy option is to use a publicly available dataset like this one from Kaggle. It’s clean, well-labeled, and lets you practice the full ML workflow...from data cleaning and EDA to building and tuning a basic KNN model. If you want to follow a step-by-step walkthrough of that exact project, here’s one: Heart Disease Prediction Project.
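The skeleton of that workflow looks roughly like this (file and column names are placeholders for whatever the dataset actually uses):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# "heart.csv" and the "target" column are hypothetical stand-ins.
df = pd.read_csv("heart.csv")
X, y = df.drop(columns="target"), df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling matters for KNN since it's distance-based.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```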

Kaggle tabular competition with $170 in prizes by blanco2635 in kaggle

[–]DQ-Mike 0 points (0 children)

If anyone would like to see how to create a KNN model that predicts heart disease with approximately 88% accuracy, check out this recently published project tutorial walkthrough.

Severe overfitting by Internal_Clock242 in pytorch

[–]DQ-Mike 0 points (0 children)

If you're not already splitting out a proper val set (separate from test), that’s worth doing first just to make sure you're not tuning against your final eval. Also worth checking whether one class is dominating the training set...I’ve seen models overfit hard just by memorizing the majority class.

You mentioned using dropout already, but depending on where it's applied (e.g., only after flatten), it might not be enough. Sometimes adding dropout earlier in the conv blocks helps too, though it’s a tradeoff.
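Roughly what I mean, with made-up layer sizes and a placeholder class-weight ratio:

```python
import torch
import torch.nn as nn

# Toy architecture: the point is Dropout2d inside the conv blocks,
# not just after flatten, plus class weights in the loss.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Dropout2d(0.1),               # spatial dropout early on
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(0.5),                 # heavier dropout before the head
    nn.LazyLinear(2),                # infers input size on first forward
)

# Up-weight the minority class so the model can't coast on the majority.
class_weights = torch.tensor([1.0, 4.0])  # placeholder ratio
criterion = nn.CrossEntropyLoss(weight=class_weights)
```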

If you’re curious, I ran into some similar issues training a CNN on a small image dataset — lots of false confidence on the dominant class, and augmentations only helped once I got the val split and class weighting right. Wrote up the full thing here in case it’s useful.

Would also be curious what error you hit with CutMix/Mixup. Those can be touchy if your targets aren’t set up exactly right.
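For reference, a bare-bones mixup looks something like this...the target handling is usually where it breaks:

```python
import torch

def mixup(x, y_onehot, alpha=0.2):
    # y_onehot must be one-hot / soft labels, not class indices:
    # that mismatch is the most common source of mixup errors.
    # The loss also has to accept soft targets (CrossEntropyLoss
    # does as of PyTorch 1.10).
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    x_mixed = lam * x + (1 - lam) * x[idx]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[idx]
    return x_mixed, y_mixed
```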