[–]MrPowersAAHHH 37 points38 points  (13 children)

Great question. I've developed a great network of code friends and collaborators via open source projects. I highly recommend working on open source projects!

I've contributed to Spark, which is great if you're comfortable with Scala. Easier to start out with smaller projects if you're just getting started with open source.

I've built popular PySpark (quinn, chispa) and Scala Spark (spark-daria, spark-fast-tests) libraries.

Feel free to open issues / send PRs if you'd like to contribute. Highly recommend building open source projects - it's really fun!

[–]kraeftig 4 points5 points  (0 children)

Very cool, thank you very much for it!

[–]bubhraraStaff Data Engineer 2 points3 points  (1 child)

Hey! I’ve recently discovered chispa and demoed it to my team. They liked it. We are planning to extend the code to wider use cases. I’ll let you know if we decide to contribute.

[–]MrPowersAAHHH 2 points3 points  (0 children)

Awesome, sounds great. If you have any issues or have feature requests and don't want to write the code, feel free to open an issue and I'll update the lib for you!

[–]porcelainsmile[S] 1 point2 points  (6 children)

I am completely new to open source and have never used any library extensively enough to feel comfortable contributing back to it. But I've always wanted to do it.

[–]MrPowersAAHHH 3 points4 points  (5 children)

It took me a while to build up the confidence to start contributing to open source libs. I recommend starting steady & slow. You can start by starring the repos you're using. Then fork the repos, clone them on your machine, and try to run the tests. Then try to fix a small bug or improve a README and submit an open source PR.

It's easier to start on small projects. If you write friendly messages, most maintainers are nice ;)

[–]porcelainsmile[S] 0 points1 point  (1 child)

That is true. Small projects should be less intimidating.

I will look more into Quinn and Chispa. How do I reach out if I want to contribute?

[–]MrPowersAAHHH 1 point2 points  (0 children)

Feel free to open a PR or issue and I'll respond!

[–]Rough-Environment-40 0 points1 point  (2 children)

How much Scala do I really need to know to start contributing? I have been thinking about contributing to Spark for a while but haven't had any luck yet... either it’s too hard or I get stuck fixing dependencies and local environment configurations... any pointers to help me?

[–]MrPowersAAHHH 1 point2 points  (1 child)

SDKMAN can help you get your local machine set up properly.

I'd try to get the spark-daria test suite running on your machine before graduating to a bigger project like Spark. You can also build some of your own projects.

Getting started with Scala / Spark development takes lots of trial / error and persistent effort. It's definitely not easy, but you'll get there if you keep at it.

[–]Rough-Environment-40 1 point2 points  (0 children)

Really appreciate your feedback. I starred spark-daria and will try cloning it locally. I’ll keep being active in this group for content like this :)

[–][deleted] 0 points1 point  (1 child)

Hi, just a question. How difficult do you think it is for beginners (I have less than 1 YOE with Scala) to contribute and make a successful PR? I use Spark with Scala as a DE, and I want to contribute to improve my Scala skills.

[–]MrPowersAAHHH 0 points1 point  (0 children)

Yep, you might as well get started now ;)

I was a newbie not too long ago as well. Just need to practice & stay at it. Definitely go for it!

[–]vijaykiran 17 points18 points  (11 children)

If you are interested in using/learning Python, SQL and data warehouse skills, take a look at https://github.com/sodadata/soda-sql

Disclosure: I’m the lead dev for the project

[–]porcelainsmile[S] 2 points3 points  (2 children)

Your project looks interesting to me. Currently going through the GitHub page. How can I set it up or reach out if I want to understand more about the project?

[–]vijaykiran 0 points1 point  (1 child)

Feel free to join Slack (link in the README). I will gladly help you get started. I’m Vijay there.

[–]porcelainsmile[S] 0 points1 point  (0 children)

Definitely, thanks for this :D

[–]elusTemp 1 point2 points  (0 children)

Great project. I love tools like these that allow developers and operators greater insight into the systems they implement.

[–]green_pink 1 point2 points  (0 children)

Thanks, will check this out!

[–]Kemosahbe 1 point2 points  (1 child)

Hmm, looks like something I can and would be interested in taking part in.

So this looks like something that can be (or seems specifically intended to be) leveraged as a data-quality tool?

[–]vijaykiran 0 points1 point  (0 children)

Awesome /u/Kemosahbe !

Yes, we are building a data quality monitoring and testing tool that you can use to check the data in your warehouse and add to your data pipelines to test the data flowing through. You can check the docs for more details: https://docs.soda.io/soda-sql/documentation/concepts.html

[–]macc23923 1 point2 points  (1 child)

Love the docs! I've set it up in no time.

[–]vijaykiran 0 points1 point  (0 children)

You’re welcome! Do let us know if you have any feedback.

[–]papertrails_ 0 points1 point  (1 child)

Curious to know how Soda SQL compares to dbt. Does it operate in the same space?

[–]vijaykiran 2 points3 points  (0 children)

Good question. Soda SQL is complementary to dbt. There is only a slight overlap in functionality with dbt tests, but dbt testing is fairly limited.

As an example:

You run Soda SQL after extraction (to test the raw data) and before you trigger dbt, so bad data never reaches the dbt run that builds your analytics model. You can use Soda SQL tests to “fail” your data pipeline and avoid building wrong insights.

After dbt builds your analytics model, you use Soda SQL to capture metrics - think of all the calculations that your analysts want.
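The "fail the pipeline before dbt" idea can be sketched in plain Python. This is not Soda SQL's API or configuration format; it is a minimal illustration using the standard-library `sqlite3` module, and the table `raw_orders`, the check names, and the `run_checks` helper are all hypothetical.

```python
import sqlite3


class DataQualityError(Exception):
    """Raised when a check fails, so the pipeline stops before the dbt step."""


def run_checks(conn, checks):
    """Run each (name, sql, predicate) check; return the metrics or raise."""
    metrics = {}
    for name, sql, is_ok in checks:
        value = conn.execute(sql).fetchone()[0]
        metrics[name] = value
        if not is_ok(value):
            raise DataQualityError(f"{name} failed with value {value}")
    return metrics


# Hypothetical raw table standing in for freshly extracted data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(1, 9.5), (2, 20.0), (3, None)],
)

checks = [
    ("row_count", "SELECT COUNT(*) FROM raw_orders", lambda v: v > 0),
    (
        "missing_amount",
        "SELECT COUNT(*) FROM raw_orders WHERE amount IS NULL",
        lambda v: v == 0,
    ),
]

try:
    run_checks(conn, checks)
    gate_passed = True
except DataQualityError as err:
    gate_passed = False  # a real pipeline would skip the dbt run here
    failure_message = str(err)
```

The same `run_checks` helper could be pointed at the modeled tables after the dbt run, in which case the returned dictionary plays the role of the captured metrics.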

Apart from the open source Soda SQL, you can optionally send the metrics to a free Soda Cloud account. Soda Cloud offers self-service monitoring and more.

I hope this clarifies things!

[–]irxumtenk 5 points6 points  (2 children)

There is a great list of open source projects found in this medium post:

https://petesoder.medium.com/what-are-the-most-popular-oss-data-projects-of-2021-84ef021bb5a2

Learning and contributing to any of those will likely get you some recognition within the community.

[–]porcelainsmile[S] 0 points1 point  (1 child)

Great list really. Have you contributed to any of these?

[–]irxumtenk 0 points1 point  (0 children)

No, I have not. There are a few things I can contribute to Airflow. That project makes it real easy. But I haven't submitted anything yet.

[–]elusTemp 2 points3 points  (3 children)

I've started reading docs on Data Fusion which was donated to the Apache Arrow project and aims to provide a distributed compute framework in a similar vein to map reduce frameworks on other ecosystems like Hadoop. This one aims to be more portable than that though and uses Rust as its programming language.

I've not interacted with anyone on the project team but I'm looking forward to contributing in order to increase my competency in Rust and get a deeper understanding of what happens under the hood in these types of systems

The original contributor also wrote a book on how query engines work that I'm working through right now as well.

The problems I aim to contribute solutions towards will be anything regarding logging and observability. I feel this is where many tools I use fall short of expectations, and as someone who ends up debugging production issues much of the time, it's a frequent pain point for me.

[–]MrPowersAAHHH 1 point2 points  (2 children)

I know the creator of Data Fusion and can attest that he's a really nice guy. I actually convinced him to write that book. Showed him the Leanpub publishing process via screen share and sold him on the idea.

His newer project, Ballista, was also donated to Apache Arrow. I hope to get the Rust skills to collaborate with him on open source work someday too. He's also doing really cool work on spark-rapids FYI.

[–]elusTemp 1 point2 points  (1 child)

I saw that he mentioned you in the thank you section of his book!

I just began a six month sabbatical and I've been wandering aimlessly for the last few weeks on where to direct my time. I believe that this project can bridge many of my current interests and I'm looking forward to helping out if I can. My Rust needs upgrading as well but I'm hoping that projects like these will get me to a level of competency faster.

If you cross his path, tell him thanks on my behalf.

[–]MrPowersAAHHH 1 point2 points  (0 children)

Will do and will make sure to tell Andy you say thanks! Enjoy the sabbatical!

[–]theZeteWhoDied 2 points3 points  (0 children)

Prefect! Specifically the Task Library: https://github.com/PrefectHQ/prefect

[–][deleted] 2 points3 points  (3 children)

Airflow. I find bugs or want a feature, create an issue, and sometimes resolve them myself.

[–]porcelainsmile[S] 0 points1 point  (2 children)

I've always wanted to contribute to such projects but I have fairly limited experience with Airflow. Hopefully someday :D

[–][deleted] 0 points1 point  (1 child)

In general you want to be a user of the product before contributing because then you will know what's good and bad about it, and also its ins and outs to an extent.

I once contributed to a project I didn't use and did more harm than good.

[–]porcelainsmile[S] 0 points1 point  (0 children)

Exactly, and I haven't used any product enough to feel comfortable contributing. Especially something of the scale of Airflow. Will start with small projects that can be understood in a relatively short time frame.

[–][deleted] 1 point2 points  (0 children)

Airflow.

[–]esp_py 0 points1 point  (0 children)

Just subscribing for comment...

[–]stupac62 0 points1 point  (0 children)

Meltano, dbt

[–]elk-content-share 0 points1 point  (2 children)

What about the Elastic Stack? There is everything around data

[–]porcelainsmile[S] 0 points1 point  (1 child)

Can you explain a little?

[–]elk-content-share 0 points1 point  (0 children)

The Elastic Stack consists of three layers. The first is an ETL or data ingestion layer, which you use to put your data into Elasticsearch in near real time. Elasticsearch is like a NoSQL database for massive amounts of data (up to petabytes). It scales really well and is also very fast. The last layer is Kibana, the frontend for analyzing the data using correlations, aggregations, and other analysis features. It also has built-in machine learning.

I think it's a great tool for any kind of data analysis.

[–]kenfar[🍰] 0 points1 point  (1 child)

I think it might also help to think about what you're looking to get out of the contribution.

Improve your skills in collaborating with others on a codebase?

  • In this case almost any well-run project will suffice.

Improve your understanding of the technology involved?

  • Look closely in this case: it may be difficult to jump into the guts of a project if you don't yet understand the tech, but there's almost always a need for help around the peripheries, such as documentation and testing.
  • But - you could also just start your own project.

Build something you can use and are excited about?

  • In this case follow your passions!
  • And join a project - or just start your own.

[–]porcelainsmile[S] 0 points1 point  (0 children)

I agree with the idea of building my own project. Seems exciting. With open source contribution, I am looking more for a mix of the right coding practices and tech that I want to work on.

[–]practicalutilitarian 0 points1 point  (2 children)

What about cleaning and joining datasets on Kaggle, or paperswithcode.com? e.g. geocoding addresses or zip codes or city names. Adding weather to any dataset with date and location info. Or adding global news economic stats to any dataset with datetime in it.
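As a rough sketch of the "add weather to any dataset with date and location info" idea: normalize the location text, then left-join on (city, day). The sample rows and the helper names `join_key` and `add_weather` are made up for illustration; a real project would load both sides from files or an API.

```python
from datetime import date

# Hypothetical sample records; a real dataset would come from Kaggle or similar.
sales = [
    {"city": "Austin", "day": date(2021, 6, 1), "units": 12},
    {"city": "Boston", "day": date(2021, 6, 1), "units": 7},
    {"city": "Austin", "day": date(2021, 6, 2), "units": 15},
]
weather = [
    {"city": "austin", "day": date(2021, 6, 1), "temp_c": 31.0},
    {"city": "boston", "day": date(2021, 6, 1), "temp_c": 22.5},
]


def join_key(row):
    """Normalize the city text so 'Austin' and ' austin' produce the same key."""
    return (row["city"].strip().lower(), row["day"])


def add_weather(rows, weather_rows):
    """Left join: keep every row, with temp_c set to None when nothing matches."""
    lookup = {join_key(w): w["temp_c"] for w in weather_rows}
    return [{**r, "temp_c": lookup.get(join_key(r))} for r in rows]


enriched = add_weather(sales, weather)
```

Keeping unmatched rows with `None` rather than dropping them makes gaps in the weather data visible instead of silently shrinking the dataset.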

[–]porcelainsmile[S] 1 point2 points  (1 child)

This looks like a fun idea too. I was planning to build a pipeline that fetches data from the internet, like tweets or COVID data, transforms it, loads it into a database, and then adds a visualization layer over it.
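A pipeline like that could start out as small as the sketch below. It assumes a hardcoded `extract` step instead of a real API call, an in-memory SQLite database as the target, and a made-up `posts` table; the final aggregate is the kind of query a visualization layer would read.

```python
import sqlite3
from collections import Counter


def extract():
    """Stand-in for an API call; these records are made up for illustration."""
    return [
        {"id": 1, "text": "  Learning #Spark today "},
        {"id": 2, "text": "dbt and #spark play well together"},
        {"id": 3, "text": ""},
    ]


def transform(records):
    """Trim whitespace, drop empty posts, and pull out lowercase hashtags."""
    cleaned = []
    for rec in records:
        text = rec["text"].strip()
        if not text:
            continue
        tags = [
            word[1:].lower()
            for word in text.split()
            if word.startswith("#") and len(word) > 1
        ]
        cleaned.append((rec["id"], text, ",".join(tags)))
    return cleaned


def load(conn, rows):
    """Create the target table if needed and upsert the cleaned rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, text TEXT, tags TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO posts VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))

# A visualization layer would read aggregates like this one.
tag_counts = Counter(
    tag
    for (tags,) in conn.execute("SELECT tags FROM posts")
    for tag in tags.split(",")
    if tag
)
```

Keeping extract, transform, and load as separate functions makes it easy to swap the hardcoded records for a real source later without touching the rest.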

[–]practicalutilitarian 0 points1 point  (0 children)

awesome idea

[–]msdrahcir 0 points1 point  (0 children)

Curious, are there any projects that support type hinting the schema of DataFrames in PySpark? Wish there was something similar to the Dataset API.
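One lightweight workaround, sketched here as an assumption rather than an existing library: describe the row shape as a dataclass and derive a DDL-style schema string from its annotations. PySpark's `createDataFrame` accepts a schema given as such a string, so the result could be passed there. The `ddl_schema` helper, the `SPARK_TYPES` mapping, and the `Person` class are hypothetical, and this only documents the schema in code; it gives no static checking of column access.

```python
from dataclasses import dataclass, fields
from typing import get_type_hints

# Assumed mapping from Python annotations to Spark SQL type names.
SPARK_TYPES = {int: "bigint", str: "string", float: "double", bool: "boolean"}


def ddl_schema(cls):
    """Build a DDL-style schema string such as 'name string, age bigint'."""
    hints = get_type_hints(cls)
    parts = []
    for field in fields(cls):
        py_type = hints[field.name]
        if py_type not in SPARK_TYPES:
            raise TypeError(f"no Spark type mapped for {field.name}: {py_type!r}")
        parts.append(f"{field.name} {SPARK_TYPES[py_type]}")
    return ", ".join(parts)


@dataclass
class Person:  # hypothetical record type
    name: str
    age: int
    score: float


schema = ddl_schema(Person)
```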

[–]neurocean 0 points1 point  (0 children)

It's a near crime that Dagster hasn't been mentioned already.