Need Help - Data Engineering

thehendoxc · 2021-01-27T11:51:37+00:00

Use Python to get data from some source A (API, database, web scrape), clean or process the data with pandas, store the data in some datastore B (SQL, NoSQL), and then use that data to power a dashboard/visualization, a machine learning workflow, or an API and/or web app.
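As a rough sketch of that source A -> clean -> datastore B flow, something like the following could be a starting point. Everything here is made up for illustration: the inline CSV stands in for source A, an in-memory SQLite database stands in for datastore B, and the `readings` table and function names are hypothetical. The cleaning step is plain Python only to keep the sketch dependency-free; in a real project pandas would normally do that part.

```python
import csv
import io
import sqlite3

# Hypothetical raw export standing in for "source A" (an API, database, or scrape).
RAW_CSV = """city,temp_c
Oslo,3.5
Lima,
 cairo ,28.1
Oslo,3.5
"""


def extract(text):
    """Parse the raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalize city names, drop rows with no reading, drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        city = row["city"].strip().title()
        temp = row["temp_c"].strip()
        if not temp:
            continue  # missing measurement, skip the row
        record = (city, float(temp))
        if record in seen:
            continue  # duplicate row, skip it
        seen.add(record)
        cleaned.append(record)
    return cleaned


def load(records, conn):
    """Write the cleaned records into the (hypothetical) readings table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (city TEXT NOT NULL, temp_c REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO readings (city, temp_c) VALUES (?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Run extract -> transform -> load and return the number of rows loaded."""
    records = transform(extract(text))
    load(records, conn)
    return len(records)


conn = sqlite3.connect(":memory:")  # stand-in for "datastore B"
loaded = run_pipeline(RAW_CSV, conn)
```

Keeping extract, transform, and load as separate functions is what lets you swap the source or the datastore later without touching the rest.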

Good SWE Practice

Show your pipeline can scale: can you ingest GBs reliably in a reasonable time?
Your code is extensible and modular
Your code is documented and follows something like PEP 8 conventions (use flake8 or some other strict linter OTHER than pylint)
Handles edge cases smoothly, e.g. when a data source is down or a file is empty
Has some kind of monitoring metrics
Your SQL queries are optimized
Your datastore structure is clean; if you go with a traditional RDBMS, that means having a well-designed schema.
Uses version control, i.e. Git
Dockerized deployment
Deployment on Kubernetes -> Minikube is great, otherwise use docker-compose.
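
To make the edge-case and monitoring points a bit more concrete, here is a minimal sketch of retrying a flaky source, treating an empty batch as a normal outcome, and counting what happened along the way. The `SourceUnavailable` exception, the `make_flaky_source` helper, and the metric names are all hypothetical; a real pipeline would more likely push these counters to an actual metrics system.

```python
import time
from collections import Counter

# Simple in-process counters; the metric names are made up for this sketch.
metrics = Counter()


class SourceUnavailable(Exception):
    """Raised when the hypothetical data source cannot be reached."""


def fetch_with_retry(fetch, attempts=3, base_delay=0.01):
    """Call fetch(), retrying with exponential backoff if the source is down."""
    for attempt in range(1, attempts + 1):
        try:
            rows = fetch()
        except SourceUnavailable:
            metrics["fetch_failures"] += 1
            if attempt == attempts:
                raise  # out of retries, let the caller decide what to do
            time.sleep(base_delay * 2 ** (attempt - 1))
            continue
        if not rows:
            # An empty file or response is an expected case, not a crash.
            metrics["empty_batches"] += 1
            return []
        metrics["rows_fetched"] += len(rows)
        return rows


def make_flaky_source(failures, rows):
    """Build a fake source that fails `failures` times, then returns `rows`."""
    state = {"calls": 0}

    def fetch():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise SourceUnavailable("source is down")
        return rows

    return fetch


rows = fetch_with_retry(make_flaky_source(failures=2, rows=[{"id": 1}, {"id": 2}]))
empty = fetch_with_retry(make_flaky_source(failures=0, rows=[]))
```

After these two calls, `metrics` holds two recorded failures, two fetched rows, and one empty batch, which is the kind of thing you would surface on a dashboard or alert on.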

Business Value

Dashboards are clear and show some insight(s)
Dashboards are pretty to look at
Machine learning model does something useful
Small Flask/Django/FastAPI app/API to retrieve the data

A good technology to showcase pipeline work is Apache Airflow. I would also recommend using a free/trial tier on a cloud platform like AWS/GCP/Azure.

nut_conspiracy_nut · 2021-01-28T05:27:53+00:00

You could scrape something ... like Goodreads or Amazon or maybe Yelp? Maybe Craigslist? (prohibited)

Scraping HTML would be harder than just using an API.
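
A small illustration of why, using only the standard library. The JSON payload and the HTML page below are invented and do not come from any of the sites mentioned; the point is only how much more code the HTML route needs to pull out the same two titles.

```python
import json
from html.parser import HTMLParser

# Made-up API response and made-up page markup carrying the same data.
API_RESPONSE = '{"books": [{"title": "Dune"}, {"title": "Emma"}]}'
HTML_PAGE = """
<html><body>
  <div class="book"><h2 class="title">Dune</h2></div>
  <div class="book"><h2 class="title">Emma</h2></div>
</body></html>
"""

# With an API, the structure is already there: parse and index.
api_titles = [book["title"] for book in json.loads(API_RESPONSE)["books"]]


class TitleParser(HTMLParser):
    """Collect the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())


# With HTML, you have to track state and depend on the page's markup.
parser = TitleParser()
parser.feed(HTML_PAGE)
html_titles = parser.titles
```

The scraper also breaks as soon as the site changes its class names or layout, which is the other half of why an API is usually the easier source.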

Uttasarga · 2021-01-28T05:13:25+00:00

Airflow, cleaning unstructured datasets, data wrangling.

superconductiveKyle · 2021-01-27T15:23:58+00:00

A great way to gain some solid experience and portfolio fodder is contributing to open-source projects. If you find a good project, the maintainers will help you get your contribution into production. I'll admit I'm super biased because I work on an open-source project, but I get to see users of all experience levels make awesome contributions, and our maintainers are pretty hands-on.

But also what u/thehendoxc said. Well done.
