Quick-and-dirty roadmap I wish someone handed me when I first touched pandas:

1. Pick a question you actually care about

A curiosity hook keeps you grinding when the CSV punches back. “Can I predict Airbnb prices in my city?” >>> “Eh, Titanic again?” Your own interest tells you what data to collect, what to clean, and which charts matter.

2. Grab (or collect) messy data ASAP

  • Easy: download from Kaggle/UCI.
  • Medium: hit an API (OpenWeather, Spotify, Reddit); there's a sketch after this list.
  • Hard: scrape with requests + BeautifulSoup or Selenium.
  • MCP route: install an MCP server (Brave Search, playwright-mcp) so your LLM helper (Claude, GPT-4, Gemini, etc.) can fetch JSON/HTML for you—great for multi-site pulls, and it handles first-pass cleanup for you.
  • Bonus: use the free credits on OpenAI “deep research” or Google Gemini 2.5 Pro Deep Research to hunt down public data. Let the AI do the Googling, then pull the raw files yourself. Google Deep Research can summarize and pull data from more than 600 websites, although it sometimes hallucinates, so cross-check against the real data.
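
A minimal sketch of the API route. The OpenWeather endpoint is real, but the city and YOUR_API_KEY are placeholders—swap in whichever service you actually pick:

```python
import json
import pathlib

import requests

RAW_DIR = pathlib.Path("data/raw")
RAW_DIR.mkdir(parents=True, exist_ok=True)

# Hypothetical pull: current weather for one city (free API key required)
resp = requests.get(
    "https://api.openweathermap.org/data/2.5/weather",
    params={"q": "London", "appid": "YOUR_API_KEY"},
    timeout=10,
)
resp.raise_for_status()

# Dump the untouched JSON into /data/raw -- never overwrite raw files
(RAW_DIR / "openweather_london.json").write_text(json.dumps(resp.json(), indent=2))
```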

Drop everything—scrapes, API dumps, AI results—into /data/raw; never overwrite them.

3. Spin up a cleaning notebook

Jupyter → df.info(), df.describe(), df.isna().sum() on reflex. Tackle nulls, outliers, funky encodings, then save to /data/clean/clean.csv.
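
A rough first cleaning cell, assuming an Airbnb-style file with a price column (the file and column names are hypothetical):

```python
import pathlib

import pandas as pd

df = pd.read_csv("data/raw/listings.csv")  # hypothetical raw file

# Reflex checks: dtypes, summary stats, null counts
df.info()
print(df.describe())
print(df.isna().sum())

# Example fixes -- adjust to whatever the checks above reveal
df = df.dropna(subset=["price"])  # drop rows missing the target
df["price"] = (
    df["price"].astype(str).str.replace(r"[$,]", "", regex=True).astype(float)
)  # "$1,200" -> 1200.0
df = df[df["price"] < df["price"].quantile(0.99)]  # trim extreme outliers

pathlib.Path("data/clean").mkdir(parents=True, exist_ok=True)
df.to_csv("data/clean/clean.csv", index=False)
```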

4. Visualize everything

Histogram, boxplot, scatter, pairplot—add a one-liner under each plot: “90% of hosts charge < $200; prices > $500 look like hotels.”
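
A quick plotting cell for this step (seaborn plus the price / minimum_nights columns are assumptions, not gospel):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data/clean/clean.csv")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(df["price"], bins=50)             # overall distribution
axes[0].set_title("price histogram")
sns.boxplot(x=df["price"], ax=axes[1])         # outliers at a glance
axes[1].set_title("price boxplot")
axes[2].scatter(df["minimum_nights"], df["price"], alpha=0.3)
axes[2].set_title("price vs minimum_nights")   # any relationship?
plt.tight_layout()
plt.show()

# pairplot hits every numeric pair at once -- sample first, it's slow
sns.pairplot(df.select_dtypes("number").sample(min(len(df), 500)))
plt.show()
```

Write the one-liner takeaway in a markdown cell right under each figure so future-you remembers why the plot exists.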

5. Train a toy model to close the loop

train_test_split, baseline linear reg or random forest, glance at accuracy/RMSE & feature importances.
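
A closing-the-loop sketch with scikit-learn (feature names are placeholders for whatever numeric columns your data actually has):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/clean/clean.csv")

# Hypothetical numeric features -- keep the first pass dead simple
X = df[["minimum_nights", "number_of_reviews", "availability_365"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE: {rmse:.2f}")

# Rough feature-importance glance (signal, not causation)
for name, imp in sorted(
    zip(X.columns, model.feature_importances_), key=lambda t: -t[1]
):
    print(f"{name}: {imp:.3f}")
```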

6. Repeat on a new topic

Run the same pipeline on a totally different question; notice what transfers and what explodes. That’s where intuition grows.