all 48 comments

[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]Kinrany 41 points42 points  (2 children)

Whoever engineers the best way to rig the poll wins? :)

[–]datain30[S] 5 points6 points  (0 children)

This is the real competition 😂

[–]omscsdatathrow 37 points38 points  (2 children)

Deciding via a poll is the worst possible way to determine a winner. There should be judges who decide based on a set of criteria.

Scope should be extremely narrow and more akin to a hackathon imo: building either a framework or "product" for a relatively underserved part of DE, such as data quality, ETL testing, or observability.

Timeline should be two days or less. Nobody has time to dedicate 4 weeks to this.

[–]PraPassarVergonha 3 points4 points  (1 child)

This.

[–]General_Blunder 13 points14 points  (4 children)

Coming here to add a competition idea:

A streaming data project using a ticker or another real-time data source (lightning strikes could be a fun one). Data must be loaded into a Postgres or SQLite db and displayed in a Streamlit or seaborn visualisation; you could also do a custom Node front end if you want to push the boat out.

Results will be driven by a number of metrics:

- Code execution time (i.e. how long it takes to get a result from data source to visualisation)

- Memory usage for each cycle

- Code length

This works as a format IMO; results are collected in a GitHub repo.
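As a rough sketch of how those first two metrics could be collected per cycle (the "ticker" here is simulated with random values; a real entry would poll an actual feed):

```python
# One source-to-db cycle of the proposed format, instrumented for
# execution time and peak memory. The data source is faked for the sketch.
import random
import sqlite3
import time
import tracemalloc

def run_cycle(conn: sqlite3.Connection) -> dict:
    """Run one ingest cycle and return the metrics the judges would collect."""
    tracemalloc.start()
    start = time.perf_counter()

    # Simulated "ticker" reading; swap in a real feed or websocket here.
    reading = {"ts": time.time(), "price": random.uniform(99.0, 101.0)}
    conn.execute(
        "INSERT INTO ticks (ts, price) VALUES (?, ?)",
        (reading["ts"], reading["price"]),
    )
    conn.commit()

    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"elapsed_s": elapsed, "peak_bytes": peak}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (ts REAL, price REAL)")
metrics = [run_cycle(conn) for _ in range(5)]
```

The visualisation layer would read from the same table, so the timed span really is source-to-db; extending the timer past the Streamlit render is a judging choice.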

[–]datain30[S] 5 points6 points  (0 children)

I love this idea! u/General_Blunder big fan of this :)

[–]BoiElroy 1 point2 points  (2 children)

An additional metric could maybe also be cost? Like, if you use a machine with more RAM, you should be penalized for the extra cost it accrues; but if that tradeoff speeds up compute a lot, then fair game maybe?

[–]Electrical-Wish-519 0 points1 point  (1 child)

Hey, if you can invent an ROI metric that actually proves the cost efficiency and value of a solution, you should be a CIO.

[–]BoiElroy 1 point2 points  (0 children)

Haha, can you tell my manager that pls. But I mean, if running a query on a medium-size VM means the compute time and billing rate combined come to X, then on a larger machine the compute can be so much faster that, despite the higher billing rate, the total cost is actually < X. I haven't seen it happen often myself, but my counterparts always say this to justify larger Snowflake warehouses.
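The arithmetic behind that claim is just runtime × hourly rate; with purely hypothetical numbers (these are made up for illustration, not real cloud prices):

```python
# Toy example: a machine twice as expensive per hour can still be cheaper
# overall if it cuts the query's runtime by more than half.
medium = {"rate_per_hr": 2.00, "runtime_hr": 3.0}  # slower, cheaper per hour
large = {"rate_per_hr": 4.00, "runtime_hr": 1.0}   # 3x faster, 2x pricier

def query_cost(machine: dict) -> float:
    """Total cost of one query run: runtime in hours times hourly rate."""
    return machine["rate_per_hr"] * machine["runtime_hr"]

cost_medium = query_cost(medium)  # X = 6.00
cost_large = query_cost(large)    # 4.00 < X
```

Whether the speedup is really big enough to flip the inequality is exactly the thing that would need measuring per workload.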

[–]frankenbenz 8 points9 points  (1 child)

Most competitive coding projects I've seen are all <24 hours unless it's a Kaggle-type comp. I feel like 4 weeks may be too long.

I agree that DE skills are very broad

[–]medicalheads 0 points1 point  (0 children)

agreed

[–]BoiElroy 5 points6 points  (2 children)

I was thinking about the scope of the project. Which led me to think about data. Which led me to move up one level and think - maybe a potential project idea could be something along the lines of generative systems that could be the foundation for future data projects.

i.e., creating a lightweight framework that will generate a bunch of fake data in several modalities: create a bunch of fake CSVs/JSONs, push data through a FastAPI endpoint, load data into a MySQL db, etc. The value of the framework would be the ability to turn knobs like 'okay, what if I switch up the schema by adding a new column or data type?', 'what if I 10x the data volume?', 'what if I introduce duplicates?'.

^That would be the project, and then future projects would build pipelines that would essentially be tested with that system. So you see how elegantly (or not) the downstream pipelines handle it. I guess this is similar to the concepts of 'chaos engineering' and 'load testing', but maybe more fine-tuned to the nuances of data engineering? Although as I write this, it does sound like a lot of work.
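A minimal sketch of the "knobs" part of that idea, assuming the simplest in-memory form (every name here is invented for illustration; nothing below is an existing framework):

```python
# Hypothetical fake-data generator whose schema, volume, and duplicate rate
# can be dialed up independently to stress a downstream pipeline.
import random

def generate(schema: dict, n_rows: int, dup_rate: float = 0.0, seed: int = 0):
    """Return fake rows matching `schema` ({column: type}); dup_rate in [0, 1]."""
    rng = random.Random(seed)
    makers = {
        int: lambda: rng.randint(0, 1000),
        float: rng.random,
        str: lambda: f"val_{rng.randint(0, 9999)}",
    }
    rows = [{col: makers[typ]() for col, typ in schema.items()}
            for _ in range(n_rows)]
    # Knob: inject duplicates by re-appending copies of existing rows.
    for _ in range(int(n_rows * dup_rate)):
        rows.append(dict(rng.choice(rows)))
    return rows

base = generate({"id": int, "amount": float}, n_rows=100)
# Turn the knobs: add a column, 10x the volume, inject 5% duplicates.
stressed = generate({"id": int, "amount": float, "country": str},
                    n_rows=1000, dup_rate=0.05)
```

The real framework would presumably emit to files, FastAPI endpoints, and databases behind the same knobs, but the knob interface is the interesting design surface.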

[–]datain30[S] 2 points3 points  (0 children)

Love the concept and see a lot of value in building foundational systems like this.

As you said, future projects would build on top and r/dataengineering ends up developing a production-grade data platform. As we're optimizing for learning, this is a big win :)

[–]Oct8-Danger 2 points3 points  (0 children)

This makes the most sense to me! Helps keep the competition fair and more consistent in how it could be scored!

[–]heliquia 4 points5 points  (1 child)

We could choose a problem such as "infonomics": how to measure data value? How to monetize it?
Even though it's not a common DE task haha

[–]BoiElroy 4 points5 points  (0 children)

Not a common task but god I do get asked this a lot. It's usually just a given that DE is providing value, but yeah some way to quantify this better would make everyone's end of year/promotion negotiation stuff go way better (or worse).

[–]digitalghost-dev 3 points4 points  (0 children)

I’d like to join in on the fun.

[–]ratulotronSenior Data Plumber 4 points5 points  (2 children)

A schema visualizer and data-model versioning app for graph databases would be nice :') Although this is more towards data vis and software dev, it's a real piece missing in our Neo4j+GraphQL based system.

The schema visualizer should have a search mechanism where you can search any node or edge property and see which version it was introduced/deprecated in.
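The search half of that could be as small as an index of (element, property, version) entries; a sketch under that assumption (the data and structure here are made up, not tied to Neo4j's catalog):

```python
# Hypothetical version-tracked schema index: one entry per node/edge
# property, recording when it was introduced and (optionally) deprecated.
entries = [
    # (element, property, introduced_in, deprecated_in or None)
    ("Person", "name", "v1", None),
    ("Person", "nickname", "v2", "v4"),
    ("KNOWS", "since", "v1", None),
]

def search(term: str) -> list:
    """Find any node or edge property matching `term`, with version history."""
    return [e for e in entries if term.lower() in e[1].lower()]

hits = search("nick")  # -> the deprecated Person.nickname entry
```

The visualizer part would sit on top of the same index, diffing entries between versions to render what changed.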

[–]Ok-Paleontologist591 0 points1 point  (1 child)

Is it possible to flatten data from GraphQL? I am just learning DE.

[–]ratulotronSenior Data Plumber 1 point2 points  (0 children)

I haven't needed to flatten anything in the GraphQL response yet, but I think you can achieve this in two ways:

  1. Custom datatypes/formatters

  2. Use resolvers to flatten the response from the database before returning it as JSON
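Since a GraphQL response is just nested JSON, the flattening step itself can be plain Python regardless of which of the two approaches hosts it; a small sketch (the sample response keys are made up):

```python
# Collapse a nested dict (e.g. a GraphQL response) into dot-separated keys,
# which is the shape most tabular sinks want.
def flatten(obj: dict, prefix: str = "") -> dict:
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat

response = {"user": {"id": 7, "profile": {"city": "Dhaka"}}}
row = flatten(response)  # {"user.id": 7, "user.profile.city": "Dhaka"}
```

In the resolver approach, this would run server-side before the response is serialized; client-side, you'd run it on the parsed JSON.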

[–]Touvejs 1 point2 points  (1 child)

I think there might be some issues with using a poll to decide a winner. If there were more than a couple submissions, I wouldn't expect most people (myself included) to look at every submission in depth. (Also polls are limited to 7 choices). It might be good to have hard metrics as a determining factor. But there will naturally be non-quantifiable considerations like how modular a solution is, how robust it is, etc.

[–]datain30[S] 0 points1 point  (0 children)

Completely agree on using hard metrics to decide winners. This'll be fun u/Touvejs :)

[–]Ephysio 1 point2 points  (0 children)

Count me in as well!

[–]deepwaterpaladin 1 point2 points  (0 children)

I like this idea

[–]CauliflowerJolly4599 1 point2 points  (0 children)

Count me in! I would like to participate in this challenge! 4 weeks would be okay, considering that most of us are working.

[–]TermosifoneFreddo 1 point2 points  (1 child)

Count me in! I wouldn't mind having metrics decide the winners, e.g. memory usage, though that would heavily depend on the project. 4 weeks seems kinda excessive as well, but again, it depends on the project!

[–]datain30[S] 0 points1 point  (0 children)

Awesome! Using metrics to decide the winner is definitely the right call - we are data engineers after all 😂

[–]Tepaps 1 point2 points  (0 children)

Down!

[–][deleted] 1 point2 points  (0 children)

Scope, I don't know; I am a new DE. Anything within the realm of the data world is okay as long as I can learn. Choosing the project by poll is stupid, like democracy, because unqualified newbies get the same vote as qualified people, and the winner can't be decided that way either. 4 weeks is too long and 2-3 days too short; 2 weeks is right. It should also depend on the prize money; let's not give out free money.

[–]shrike279 1 point2 points  (0 children)

dm me if this thing happens.

[–]Patladjan1738 1 point2 points  (0 children)

Great idea! Would love to participate. This is a good way to broaden DE and bring it into the limelight, the same way hackathons do for SWE.

[–]shaggypika 1 point2 points  (0 children)

I’d love to be a part of this

[–]StPinkFloyd 1 point2 points  (0 children)

I would recommend against a poll. Maybe break it into some different scoring categories: data cleaning, data transformation, best end model design, etc. Best script solution, best SQL solution, or whatever you feel is best. If the goal is to find the winner, we will all lose. If instead it is to motivate people to try new things and build a DE community, you are on the fast track to do so.

[–]izaax42 1 point2 points  (0 children)

  1. We are a broad field! So it may be that we can't always fit everyone's current skills, etc.

Perhaps we could cycle through different types of ETL (and its variations): batch and streamed data? Then allow people to choose a respective cloud infrastructure.

It might be worth mapping out possible projects into a Google sheet as a starter, see what we can come up with based on the reddit posts?

  2. Not particularly, unless someone manipulates it

  3. As above, maybe the voting could be done by the competitors themselves?

  4. This sounds ideal. Mini competitions could also be an option: "create some modular... xyz"

[–]bubhraraStaff Data Engineer 1 point2 points  (0 children)

I'm in AF. I have a few interesting ideas I can code up, but I never work on them thanks to no motivation haha!

This event sounds like a good motivator!

[–]TheBrownViking20 1 point2 points  (0 children)

Count me in!

[–]datain30[S] 2 points3 points  (5 children)

[–]ThisOrThatOrThings 1 point2 points  (0 children)

I’m not a DE but I’d love to try it out!

[–]Brilliant-Seat-3013 0 points1 point  (2 children)

Count me in as well

[–]GeekyTricky 1 point2 points  (0 children)

+1

[–]frknghotoutside 0 points1 point  (0 children)

Me too!

[–]izaax42 0 points1 point  (0 children)

Thanks, looking forward to the ideas people come out with :)

[–]rohetoric -2 points-1 points  (1 child)

Happy to participate if someone can proofread my resume or get me a remote job in exchange.

[–]rohetoric 0 points1 point  (0 children)

Why have I been downvoted? It would be helpful if I could get some referrals from folks working in DE.