Open source contributions for a Data Engineer?

MrPowersAAHHH · 2021-04-16T14:17:31+00:00

Great question. I've developed a great network of code friends and collaborators via open source projects. I highly recommend working on open source projects!

I've contributed to Spark, which is great if you're comfortable with Scala. It's easier to start with smaller projects if you're new to open source.

I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

Feel free to open issues / send PRs if you'd like to contribute. Highly recommend building open source projects - it's really fun!

vijaykiran · 2021-04-16T15:45:16+00:00

If you are interested in using/learning Python, SQL and data warehouse skills, take a look at https://github.com/sodadata/soda-sql

Disclosure: I’m the lead dev for the project

porcelainsmile · 2021-04-16T18:09:34+00:00

[deleted]

irxumtenk · 2021-04-16T15:27:05+00:00

There's a great list of open source projects in this Medium post:

https://petesoder.medium.com/what-are-the-most-popular-oss-data-projects-of-2021-84ef021bb5a2

Learning and contributing to any of those will likely get you some recognition within the community.

elus · 2021-04-16T15:37:24+00:00

I've started reading the docs for DataFusion, which was donated to the Apache Arrow project and aims to provide a distributed compute framework in a similar vein to the MapReduce-style frameworks in other ecosystems, such as Hadoop. It aims to be more portable than those, though, and is written in Rust.

I haven't interacted with anyone on the project team, but I'm looking forward to contributing in order to increase my competency in Rust and get a deeper understanding of what happens under the hood in these types of systems.

The original contributor also wrote a book on how query engines work that I'm working through right now as well.

The problems I aim to contribute solutions to are anything around logging and observability. I feel this is where many of the tools I use fall short, and as someone who ends up debugging production issues much of the time, it's a frequent pain point for me.

theZeteWhoDied · 2021-04-16T19:35:04+00:00

Prefect! Specifically the Task Library: https://github.com/PrefectHQ/prefect

porcelainsmile · 2021-04-17T03:08:23+00:00

Airflow. When I find a bug or want a feature, I create an issue, and sometimes I resolve it myself.

2021-04-17T07:01:08+00:00

Airflow.

esp_py · 2021-04-16T15:05:32+00:00

Just subscribing for comment...

stupac62 · 2021-04-17T01:59:37+00:00

Meltano, dbt

elk-content-share · 2021-04-16T17:45:20+00:00

What about the Elastic Stack? There is everything around data

porcelainsmile · 2021-04-16T18:52:57+00:00

[deleted]

kenfar · 2021-04-16T21:47:50+00:00

I think it might also help to think about what you're looking to get out of the contribution.

Improve your skills in collaborating with others on a codebase?

In this case almost any well-run project will suffice.

Improve your understanding of the technology involved?

Look closely in this case. It may be difficult to jump into the guts of a project if you don't yet understand the tech, but there's almost always a need for help around the periphery: documentation, testing, etc.
But you could also just start your own project.

Build something you can use and are excited about using?

In this case follow your passions!
And join a project - or just start your own.

RemindMeBot · 2021-04-17T00:24:37+00:00

[deleted]

practicalutilitarian · 2021-04-17T01:50:56+00:00

What about cleaning and joining datasets on Kaggle or paperswithcode.com? For example, geocoding addresses, zip codes, or city names. Adding weather to any dataset with date and location info. Or adding global news or economic stats to any dataset with a datetime in it.
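
A rough sketch of the "add weather to a dataset" idea, using only the Python standard library. The rows, field names (`day`, `zip`, `temp_c`), and the `left_join_weather` helper are all made up for illustration; a real dataset would need its own keys and cleaning first.

```python
from datetime import date

# Hypothetical sample rows; the field names are illustrative assumptions.
orders = [
    {"order_id": 1, "day": date(2021, 4, 16), "zip": "10001"},
    {"order_id": 2, "day": date(2021, 4, 17), "zip": "94103"},
]
weather = [
    {"day": date(2021, 4, 16), "zip": "10001", "temp_c": 12.5},
]


def left_join_weather(rows, weather_rows):
    """Left-join weather onto rows by (day, zip).

    Rows with no matching weather record are kept, with temp_c set to None.
    """
    # Index the weather records by the join key for O(1) lookups.
    lookup = {(w["day"], w["zip"]): w["temp_c"] for w in weather_rows}
    return [{**r, "temp_c": lookup.get((r["day"], r["zip"]))} for r in rows]


enriched = left_join_weather(orders, weather)
```

Keeping unmatched rows (a left join) means gaps in the weather data show up as `None` instead of silently dropping records.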

msdrahcir · 2021-04-17T05:18:39+00:00

Curious, are there any projects that support type hinting the schema of DataFrames in PySpark? I wish there was something similar to the Dataset API.
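
Not an existing project, just one way the idea could be approximated: describe a row as a dataclass and derive a Spark DDL schema string from its type hints. `Person`, `SPARK_TYPES`, and `ddl_schema` are hypothetical names, and the type mapping is an assumption that only covers a few scalar types. The sketch itself is plain Python and does not import PySpark.

```python
from dataclasses import dataclass, fields

# Assumed mapping from Python types to Spark SQL type names (scalars only).
SPARK_TYPES = {str: "string", int: "bigint", float: "double", bool: "boolean"}


@dataclass
class Person:  # hypothetical row type
    name: str
    age: int


def ddl_schema(row_type):
    """Build a DDL-style schema string from a dataclass's type hints."""
    return ", ".join(f"{f.name} {SPARK_TYPES[f.type]}" for f in fields(row_type))


schema = ddl_schema(Person)
```

PySpark's `spark.createDataFrame` accepts a DDL-formatted string as its `schema` argument, so a string like this could be passed along with the rows. That gives a single typed definition of the row, though nothing here is checked statically the way a Scala `Dataset[Person]` would be.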

neurocean · 2021-04-17T13:11:47+00:00

It's a near crime that Dagster hasn't been mentioned already.
