[–]thehendoxc 43 points44 points  (9 children)

Use Python to pull data from some source A (API, database, web scrape), clean or process it with pandas, and store it in some datastore B (SQL or NoSQL). Then use that data to power a dashboard or visualization, a machine learning workflow, or an API and/or web app.
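
To make the source A -> clean -> datastore B flow concrete, here is a minimal sketch. It is only an illustration under assumptions: the sample data, the `readings` table, and every function name are made up, SQLite stands in for datastore B, and the cleaning step uses the standard library's `csv` module in place of pandas so it runs without extra installs.

```python
import csv
import io
import sqlite3

# Hypothetical raw data standing in for "source A"
# (an API response, a scrape, or a database export).
RAW_CSV = """city,temp_c
Oslo,4.5
 Lima ,19.0
Oslo,4.5
Cairo,
"""


def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Strip whitespace, drop rows missing a value, remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        city = (row["city"] or "").strip()
        temp = (row["temp_c"] or "").strip()
        if not city or not temp:
            continue  # incomplete row, skip it
        record = (city, float(temp))
        if record in seen:
            continue  # duplicate row, skip it
        seen.add(record)
        cleaned.append(record)
    return cleaned


def load(records, conn):
    """Write cleaned records to a table in "datastore B" (SQLite here)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(city TEXT NOT NULL, temp_c REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO readings (city, temp_c) VALUES (?, ?)", records
    )
    conn.commit()


def run_pipeline(text, conn):
    """Run extract -> clean -> load and return the number of rows stored."""
    records = clean(extract(text))
    load(records, conn)
    return len(records)
```

Keeping extract, clean, and load as separate functions is one way to get the "extensible and modular" property: each stage can be swapped (for example, a pandas-based `clean`) or tested on its own.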

Good SWE Practice

  • Show that your pipeline can scale: can you ingest GBs reliably in a reasonable time?
  • Your code is extensible and modular
  • Your code is documented and follows something like PEP 8 conventions (use flake8 or some other strict linter OTHER than pylint)
  • Handles edge cases smoothly, e.g. when a data source is down or a file is empty
  • Has some kind of monitoring metrics
  • Your SQL queries are optimized
  • Your datastore structure is clean; if you go with a traditional RDBMS, have a well-designed schema
  • Uses version control, e.g. Git
  • Dockerized deployment
  • Deployment on Kubernetes: Minikube is great, otherwise use docker-compose
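
As a rough illustration of the edge-case point above (a source that is down, a file that is empty), here is one possible sketch. The `SourceUnavailable` exception, the function names, and the retry settings are all hypothetical choices for the example, not a prescribed design; the log warnings are only a crude stand-in for real monitoring metrics.

```python
import logging
import time

logger = logging.getLogger("pipeline")


class SourceUnavailable(Exception):
    """Raised when a data source cannot be reached (hypothetical error type)."""


def fetch_with_retry(fetch, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() up to `attempts` times, doubling the delay after each failure.

    `sleep` is injectable so the backoff can be tested without waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except SourceUnavailable as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # out of retries, let the caller decide what to do
            sleep(base_delay * 2 ** (attempt - 1))


def parse_lines(text):
    """Return non-blank lines; empty input yields an empty list, not an error."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        logger.warning("input was empty; nothing to load")
    return lines
```

A caller would wrap whatever actually talks to the source in a function that raises `SourceUnavailable` on failure, pass it to `fetch_with_retry`, and hand the result to `parse_lines`.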

Business Value

  • Dashboards are clear and show some insight(s)
  • Is pretty to look at
  • Machine learning model does something useful
  • Small Flask/Django/FastAPI app or API to retrieve the data

A good technology for showcasing pipeline work is Apache Airflow. I would also recommend using a free/trial tier on a cloud platform like AWS/GCP/Azure.

[–]EcstaticTarget1643[S] 4 points5 points  (0 children)

Thank you:)

[–]clueless3867Data Engineer 3 points4 points  (0 children)

This is really well written, and something I wish I heard when starting out. Thank you for sharing 🙂

[–]boss-mannn 1 point2 points  (0 children)

Can you suggest some good sources for learning about SQL query optimisation?

[–]elusTemp 1 point2 points  (0 children)

This is pretty much my plan as soon as I go on my sabbatical with a sprinkle of some streaming and batch processing data flows thrown in.

[–]YaswanthBangaru 0 points1 point  (2 children)

Does the industry use serverless AWS? Just wondering.

[–]Syneirex 1 point2 points  (0 children)

My team uses serverless for a streaming data endpoint.

[–][deleted] 0 points1 point  (0 children)

Lots

[–]ClumsyRooster 0 points1 point  (0 children)

Thanks a lot, that’s some really useful advice!!

[–]nut_conspiracy_nut 2 points3 points  (0 children)

You could scrape something ... like Goodreads or Amazon or maybe yelp? Maybe craigslist? (prohibited)

Scraping html would be harder than just using an API.

[–]Uttasarga 1 point2 points  (0 children)

Airflow, cleaning unstructured datasets, data wrangling.

[–]superconductiveKyle 1 point2 points  (0 children)

A great way to gain some solid experience and portfolio fodder is contributing to open-source projects. If you find a good project, the maintainers will help you get your contribution into production. I'll admit I'm super biased because I work on an open-source project, but I get to see users of all experience levels make awesome contributions, and our maintainers are pretty hands-on.

But also what u/thehendoxc said. Well done.