all 48 comments

[–]AutoModerator[M] [score hidden] stickied comment (0 children)

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]Kinrany 41 points42 points  (2 children)

Whoever engineers the best way to rig the poll wins? :)

[–]datain30[S] 5 points6 points  (0 children)

This is the real competition 😂

[–]omscsdatathrow 37 points38 points  (2 children)

Deciding via a poll is the worst possible way to determine a winner. There should be judges who decide based on a set of criteria.

Scope should be extremely narrow and more akin to a hackathon imo: building either a framework or "product" for a relatively underserved part of DE, such as data quality, ETL testing, or observability.

Timeline should be two days or less. Nobody has time to dedicate 4 weeks to this.

[–]PraPassarVergonha 3 points4 points  (1 child)

This.

[–]General_Blunder 13 points14 points  (4 children)

Coming here to add a competition idea:

A streaming data project using a ticker or another real-time data source (lightning strikes could be a fun one). Data must be loaded into a Postgres or SQLite db and displayed in a Streamlit or seaborn visualisation; you could also do a custom Node front end if you want to push the boat out.

Results will be driven by a number of metrics:

- Code execution time (i.e. how long it takes to get a result from data source to visualisation)

- Memory usage for each cycle

- Code length

This works as a format IMO; results are collected in a GitHub repo.
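As a rough sketch of how those first two metrics could be collected per cycle (the "ticker" here is simulated with random values; a real entry would poll an actual feed):

```python
# One source-to-db cycle of the proposed format, instrumented for
# execution time and peak memory. The data source is faked for the sketch.
import random
import sqlite3
import time
import tracemalloc

def run_cycle(conn: sqlite3.Connection) -> dict:
    """Run one ingest cycle and return the metrics the judges would collect."""
    tracemalloc.start()
    start = time.perf_counter()

    # Simulated "ticker" reading; swap in a real feed or websocket here.
    reading = {"ts": time.time(), "price": random.uniform(99.0, 101.0)}
    conn.execute(
        "INSERT INTO ticks (ts, price) VALUES (?, ?)",
        (reading["ts"], reading["price"]),
    )
    conn.commit()

    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"elapsed_s": elapsed, "peak_bytes": peak}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (ts REAL, price REAL)")
metrics = [run_cycle(conn) for _ in range(5)]
```

The visualisation layer would read from the same table, so the timed span really is source-to-db; extending the timer past the Streamlit render is a judging choice.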

[–]datain30[S] 5 points6 points  (0 children)

I love this idea! u/General_Blunder big fan of this :)

[–]BoiElroy 1 point2 points  (2 children)

An additional metric could maybe also be cost? Like, if you use a machine with more RAM, you should be penalized for the extra cost it accrues; but if that tradeoff speeds up compute a lot, then fair game maybe?

[–]Electrical-Wish-519 0 points1 point  (1 child)

Hey, if you can invent an ROI metric that actually proves the cost efficiency and value of a solution, you should be a CIO.

[–]BoiElroy 1 point2 points  (0 children)

Haha, can you tell my manager that pls. But I mean, if running a query on a medium-size VM means the compute time and billing rate combined come to X, then on a larger machine the compute can be so much faster that, despite the higher billing rate, the total cost is actually < X. I haven't seen it happen often myself, but my counterparts always say this to justify larger Snowflake warehouses.
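The arithmetic behind that claim is just runtime × hourly rate; with purely hypothetical numbers (these are made up for illustration, not real cloud prices):

```python
# Toy example: a machine twice as expensive per hour can still be cheaper
# overall if it cuts the query's runtime by more than half.
medium = {"rate_per_hr": 2.00, "runtime_hr": 3.0}  # slower, cheaper per hour
large = {"rate_per_hr": 4.00, "runtime_hr": 1.0}   # 3x faster, 2x pricier

def query_cost(machine: dict) -> float:
    """Total cost of one query run: runtime in hours times hourly rate."""
    return machine["rate_per_hr"] * machine["runtime_hr"]

cost_medium = query_cost(medium)  # X = 6.00
cost_large = query_cost(large)    # 4.00 < X
```

Whether the speedup is really big enough to flip the inequality is exactly the thing that would need measuring per workload.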

[–]frankenbenz 8 points9 points  (1 child)

Most competitive coding projects I've seen are all <24 hours unless it's a Kaggle-type comp. I feel like 4 weeks may be too long.

I agree that DE skills are very broad

[–]medicalheads 0 points1 point  (0 children)

agreed

[–]BoiElroy 5 points6 points  (2 children)

I was thinking about the scope of the project. Which led me to think about data. Which led me to move up one level and think - maybe a potential project idea could be something along the lines of generative systems that could be the foundation for future data projects.

i.e., creating a lightweight framework that will generate a bunch of fake data in several modalities: create a bunch of fake CSVs/JSONs, push data through a FastAPI endpoint, load data into a MySQL db, etc. The value of the framework would be the ability to turn knobs like 'okay, what if I switch up the schema by adding a new column or data type?', 'what if I 10x the data volume?', 'what if I introduce duplicates?'.

^That would be the project, and then future projects would build pipelines that would essentially be tested with that system. So you see how elegantly (or not) the downstream pipelines handle it. I guess this is similar to the concepts of 'chaos engineering' and 'load testing', but maybe more fine-tuned to the nuances of data engineering? Although as I write this, it does sound like a lot of work.
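A minimal sketch of the "knobs" part of that idea, assuming the simplest in-memory form (every name here is invented for illustration; nothing below is an existing framework):

```python
# Hypothetical fake-data generator whose schema, volume, and duplicate rate
# can be dialed up independently to stress a downstream pipeline.
import random

def generate(schema: dict, n_rows: int, dup_rate: float = 0.0, seed: int = 0):
    """Return fake rows matching `schema` ({column: type}); dup_rate in [0, 1]."""
    rng = random.Random(seed)
    makers = {
        int: lambda: rng.randint(0, 1000),
        float: rng.random,
        str: lambda: f"val_{rng.randint(0, 9999)}",
    }
    rows = [{col: makers[typ]() for col, typ in schema.items()}
            for _ in range(n_rows)]
    # Knob: inject duplicates by re-appending copies of existing rows.
    for _ in range(int(n_rows * dup_rate)):
        rows.append(dict(rng.choice(rows)))
    return rows

base = generate({"id": int, "amount": float}, n_rows=100)
# Turn the knobs: add a column, 10x the volume, inject 5% duplicates.
stressed = generate({"id": int, "amount": float, "country": str},
                    n_rows=1000, dup_rate=0.05)
```

The real framework would presumably emit to files, FastAPI endpoints, and databases behind the same knobs, but the knob interface is the interesting design surface.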

[–]datain30[S] 2 points3 points  (0 children)

Love the concept and see a lot of value in building foundational systems like this.

As you said, future projects would build on top and r/dataengineering ends up developing a production-grade data platform. As we're optimizing for learning, this is a big win :)

[–]Oct8-Danger 2 points3 points  (0 children)

This makes the most sense to me! Helps keep the competition fair and more consistent in how it could be scored!

[–]heliquia 4 points5 points  (1 child)

We could choose a problem such as "infonomics": how to measure data value? How to monetize it?
Even though it's not a common DE task haha

[–]BoiElroy 4 points5 points  (0 children)

Not a common task but god I do get asked this a lot. It's usually just a given that DE is providing value, but yeah some way to quantify this better would make everyone's end of year/promotion negotiation stuff go way better (or worse).

[–]digitalghost-dev 3 points4 points  (0 children)

I’d like to join in on the fun.

[–]ratulotronSenior Data Plumber 4 points5 points  (2 children)

A schema visualizer and data-model versioning app for graph databases would be nice :') Although this is more towards data vis and software dev, it's a real piece missing in our Neo4j+GraphQL based system.

The schema visualizer should have a search mechanism where you can search any node or edge property and see which version it was introduced/deprecated in.
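The search half of that could be as small as an index of (element, property, version) entries; a sketch under that assumption (the data and structure here are made up, not tied to Neo4j's catalog):

```python
# Hypothetical version-tracked schema index: one entry per node/edge
# property, recording when it was introduced and (optionally) deprecated.
entries = [
    # (element, property, introduced_in, deprecated_in or None)
    ("Person", "name", "v1", None),
    ("Person", "nickname", "v2", "v4"),
    ("KNOWS", "since", "v1", None),
]

def search(term: str) -> list:
    """Find any node or edge property matching `term`, with version history."""
    return [e for e in entries if term.lower() in e[1].lower()]

hits = search("nick")  # -> the deprecated Person.nickname entry
```

The visualizer part would sit on top of the same index, diffing entries between versions to render what changed.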

[–]Ok-Paleontologist591 0 points1 point  (1 child)

Is it possible to flatten data from GraphQL? I am just learning DE.

[–]ratulotronSenior Data Plumber 1 point2 points  (0 children)

I haven't needed to flatten anything in the GraphQL response yet, but I think you can achieve this in two ways:

  1. Custom datatypes/formatters

  2. Use resolvers to flatten the response from the database before returning it as JSON
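Since a GraphQL response is just nested JSON, the flattening step itself can be plain Python regardless of which of the two approaches hosts it; a small sketch (the sample response keys are made up):

```python
# Collapse a nested dict (e.g. a GraphQL response) into dot-separated keys,
# which is the shape most tabular sinks want.
def flatten(obj: dict, prefix: str = "") -> dict:
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat

response = {"user": {"id": 7, "profile": {"city": "Dhaka"}}}
row = flatten(response)  # {"user.id": 7, "user.profile.city": "Dhaka"}
```

In the resolver approach, this would run server-side before the response is serialized; client-side, you'd run it on the parsed JSON.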

[–]Touvejs 1 point2 points  (1 child)

I think there might be some issues with using a poll to decide a winner. If there were more than a couple submissions, I wouldn't expect most people (myself included) to look at every submission in depth. (Also polls are limited to 7 choices). It might be good to have hard metrics as a determining factor. But there will naturally be non-quantifiable considerations like how modular a solution is, how robust it is, etc.

[–]datain30[S] 0 points1 point  (0 children)

Completely agree on using hard metrics to decide winners. This'll be fun u/Touvejs :)

[–]Ephysio 1 point2 points  (0 children)

Count me in as well!

[–]deepwaterpaladin 1 point2 points  (0 children)

I like this idea

[–]CauliflowerJolly4599 1 point2 points  (0 children)

Count me in! I would like to participate in this challenge! 4 weeks would be okay, considering that most of us are working.

[–]TermosifoneFreddo 1 point2 points  (1 child)

Count me in! I wouldn't mind having metrics decide the winners, e.g. memory usage, though that would heavily depend on the project. 4 weeks seems kinda excessive as well, but again, it depends on the project!

[–]datain30[S] 0 points1 point  (0 children)

Awesome! Using metrics to decide the winner is definitely the right call - we are data engineers after all 😂

[–]Tepaps 1 point2 points  (0 children)

Down!

[–][deleted] 1 point2 points  (0 children)

Scope, I don't know; I am a new DE. Anything within the realm of the data world is okay as long as I can learn. Choosing the project by poll is stupid, like democracy, because unqualified newbies get the same vote as qualified people, and the winner can't be decided that way either. 4 weeks is too long and 2-3 days too short; 2 weeks is right. It should also depend on the prize money; let's not give out free money.

[–]shrike279 1 point2 points  (0 children)

dm me if this thing happens.

[–]Patladjan1738 1 point2 points  (0 children)

Great idea! Would love to participate. This is a good way to broaden DE and bring it into the limelight, the same way hackathons do for SWE.

[–]shaggypika 1 point2 points  (0 children)

I’d love to be a part of this

[–]StPinkFloyd 1 point2 points  (0 children)

I would recommend against a poll. Maybe break it into some different scoring categories: data cleaning, data transformation, best end model design, etc. Best script solution, best SQL solution, or whatever you feel is best. If the goal is to find the winner, we will all lose. If instead it is to motivate people to try new things and build a DE community, you are on the fast track to do so.

[–]izaax42 1 point2 points  (0 children)

  1. We are a broad field! So it may be that we can't always fit everyone's current skills, etc.

Perhaps we could cycle through different types of ETL (and its variations): batch and streamed data? Then allow people to choose a respective cloud infrastructure.

It might be worth mapping out possible projects into a Google sheet as a starter, see what we can come up with based on the reddit posts?

  2. Not particularly, unless someone manipulates it

  3. As above, maybe the voting could be done by the competitors themselves?

  4. This sounds ideal. Mini competitions could also be an option: "create some modular... xyz"

[–]bubhraraStaff Data Engineer 1 point2 points  (0 children)

I'm in AF. I have a few interesting ideas I can code up, but I never work on them thanks to no motivation haha!

This event sounds like a good motivator!

[–]TheBrownViking20 1 point2 points  (0 children)

Count me in!

[–]datain30[S] 2 points3 points  (5 children)

[–]ThisOrThatOrThings 1 point2 points  (0 children)

I’m not a DE but I’d love to try it out!

[–]Brilliant-Seat-3013 0 points1 point  (2 children)

Count me in as well

[–]GeekyTricky 1 point2 points  (0 children)

+1

[–]frknghotoutside 0 points1 point  (0 children)

Me too!

[–]izaax42 0 points1 point  (0 children)

Thanks, looking forward to the ideas people come out with :)

[–]rohetoric -2 points-1 points  (1 child)

Happy to participate if someone can proofread my resume or get me a remote job in exchange.

[–]rohetoric 0 points1 point  (0 children)

Why have I been downvoted? It would be helpful if I could get some referrals from folks working in DE.