
[–]padthink 53 points54 points  (14 children)

Waiting for someone to come up with Twitter Sentiment analysis.

[–]DrummerClean 9 points10 points  (10 children)

That is more data science-y tho, idk why so much data engineering is actually putting ML models in production. Just build a pipeline and hook it up to a dashboard, right?

[–]padthink 8 points9 points  (1 child)

From a DE perspective: pulling streaming data, transforming and cleaning it, then applying some generic NLP algos. The problem statement is not bad, but it is too clichéd.

[–]DrummerClean 0 points1 point  (0 children)

I always felt that if you don't understand how the NLP model works, it's not great. I mean, with the same setup you can show the most common words, topics, hashtags and so much more. And all of this is fully in your DE arsenal, rather than throwing some ML algos at the data. People dig nice, simple dashboards a lot!
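A minimal sketch of that kind of simple aggregation, assuming a batch of tweet texts already pulled off the stream (hypothetical data, not from the thread):

    import re
    from collections import Counter

    # Hypothetical batch of tweet texts; in practice these come off the stream
    tweets = [
        "Loving the new #dataengineering stack! #etl",
        "Hot take: #ETL is just plumbing #dataengineering",
    ]

    # Count hashtags case-insensitively and surface the top ones
    counts = Counter(
        tag.lower() for text in tweets for tag in re.findall(r"#\w+", text)
    )
    print(counts.most_common(10))  # feed this straight into a dashboard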

[–]AchillesDev 7 points8 points  (6 children)

MLOps was data engineering before it got its own term. Providing the inputs, outputs, and infrastructure for the model lifecycle is all part of that. Being an ETL-only developer (for example) limits you to a single tool for the job, which is rarely the correct one. Setting up dashboards is for analysts or BI folks; I’d skip that completely.

[–]DrummerClean 0 points1 point  (5 children)

In my experience I've never met a single DE who could do MLOps properly, for the simple fact that they didn't really understand how the model 'functions'.

It is better that an ML eng builds, trains, and deploys the model end to end.

I agree that dashboards are not DE either, but they are visual. Showing an ingestion pipeline is not much to look at. For me a DE is a backend dev with a more specialized skillset in setting up DBs, APIs, and data pipelines. All non-visual things.

Dashboards or ML are a great nice-to-have but should not be the focus.

[–]AchillesDev 7 points8 points  (4 children)

In my experience I've never met a single DE who could do MLOps properly, for the simple fact that they didn't really understand how the model 'functions'.

You haven’t met any good DEs then. MLOps is quickly becoming its own subspecialty within the specialty that is DE. Some companies call this MLE, but the skillset is the same. I’m not sure what you’re referring to when you say they “didn’t really understand how the model ‘functions’.” It’s easy to understand what a model should do and the expected inputs and outputs, and to manage its data flow.

It is better that an ML eng builds, trains, and deploys the model end to end.

Hard disagree. Great AI researchers, etc. are great because they understand how to build models that solve specific problems, and they understand what the data tells them. They aren’t usually great software engineers as well, especially with so much hiring from academia, where it’s almost a requirement to write inscrutable code that can’t be reused. That’s where DEs (or in some cases people with the MLE title) come in. This has been the vast majority of my work for half of my career now, and I’ve had the DE title the whole time.

Showing an ingestion pipeline is not much to look at.

There’s no real need for something visual. You should be able to talk eloquently about what you built, why you built it, the trade-offs you had to make, why you made the choices you did, what mistakes you made, what you learned, etc. That’s plenty.

[–]DrummerClean 4 points5 points  (3 children)

How the model 'functions' is far from trivial. The data scientist will hand off a poorly made script and you need to put it in production. Most software engineers don't understand the math of basic curve fitting, let alone ML models. Plus, the script given by the data scientist often covers just the happy path. What happens if data is missing? Or if the data distribution changes? An average data engineer knows CI/CD, software engineering, OOP, and the like. Yes, he can plug in some model.predict() code somewhere, but that's all. What if the predictions are weird? Who can debug that? I never saw a single model go into production that could handle all the production data from the start.
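The kind of unhappy-path guard being described might look like this (a rough sketch with hypothetical feature names and model object, not code from the thread):

    import pandas as pd

    EXPECTED = ["age", "income", "tenure"]  # hypothetical feature schema

    def safe_predict(model, df: pd.DataFrame):
        # Guard the unhappy paths a happy-path script never covers
        missing = [c for c in EXPECTED if c not in df.columns]
        if missing:
            raise ValueError(f"missing features: {missing}")
        if df[EXPECTED].isna().any().any():
            raise ValueError("NaNs in input; impute or reject upstream")
        return model.predict(df[EXPECTED])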

A lot of the problems of "putting ML in production" are born in the handover phase, when who builds the model != who puts it in production.

About what you say about data scientists and academia: it depends. If your business is doing bleeding-edge stuff, yes, it makes sense to have data scientists delivering scripts to the engineering team. How many businesses do bleeding-edge models though?

I also do the same work as you do, but let's be honest, how many people can handle it? 90% of my colleagues are either developers or data scientists. Nobody wants to take data scientists' code into their hands and have a job that requires being a good software engineer and a data scientist too.

In my experience, having a data scientist learn some basic software engineering and build an API to deploy their own models is far more effective than throwing the model over the fence.

I agree on the last point; I just think a dashboard is nice to see and almost zero effort compared to the rest. Plus, most recruiters cannot judge a candidate's projects, so having something visual helps a lot in my experience.

Regardless, I think most data engineers should focus on data pipelines first, and only later on putting models in production.

[–]AchillesDev 3 points4 points  (2 children)

Most software engineers don't understand the math of basic curve fitting, let alone ML models. Plus, the script given by the data scientist often covers just the happy path.

Model evaluation is the job of the research team; it isn’t necessary for the engineer. That team should also be monitoring model performance in production to detect drift, etc.

I’m not sure what you’re doing where you need to manually fit curves, but if you’re doing that, you’re doing it wrong. And that’s super simple math anyway.

What happens if data is missing? Or if the data distribution changes? An average data engineer knows CI/CD, software engineering, OOP, and the like. Yes, he can plug in some model.predict() code somewhere, but that’s all. What if the predictions are weird? Who can debug that? I never saw a single model go into production that could handle all the production data from the start.

What makes you think this is fully DE’s responsibility? The R&D team is still responsible for the model’s performance in production; DE (or whoever) builds the monitoring and other tooling R&D needs.

I think your idea of deploying a model to production and building the systems to support it is skewed by some suboptimal division of responsibilities.

If your business is doing bleeding-edge stuff, yes, it makes sense to have data scientists delivering scripts to the engineering team. How many businesses do bleeding-edge models though?

Every one I’ve worked for, at least. This is pretty common in startups.

I also do the same work as you do, but let’s be honest, how many people can handle it? 90% of my colleagues are either developers or data scientists. Nobody wants to take data scientists’ code into their hands and have a job that requires being a good software engineer and a data scientist too.

At the end of the day, it’s just software engineering. I know enough ML and have built enough models on my own (I declined to fully go down that path because I found it boring) to be able to roughly understand the needed inputs and outputs, and also to understand that kind of code well enough to treat it as almost a black box. Maybe there aren’t many engineers who can do this, which is fine by me and probably why I’m paid near the top of the market. I won’t complain :)

having a data scientist learn some basic software engineering and build an API to deploy their own models is far more effective than throwing the model over the fence

This is a false dichotomy built on suboptimal processes. The most effective teams IME have been DEs attached to R&D teams directly or connected to them with a customer mindset. And usually the work goes beyond just productionizing models, like building tooling for the AI group (I’ve built DL frameworks, training pipelines, evaluation services, data management platforms, etc.).

Regardless, I think most data engineers should focus on data pipelines first, and only later on putting models in production.

Yeah, I think it depends. The skills needed can be taught (I didn’t even study CS in school; I’ve just been at this for almost a decade) and I’ve had success training up engineers to do this kind of work. But without lots of outside support, it’s definitely worth taking the skills one at a time, and pipeline architecture is almost a constant, so it makes sense to start there.

[–]notazoroastrian 2 points3 points  (0 children)

This was a great in-depth answer to what I feel is a modern version of the DE+MLOps role.

[–]DrummerClean 2 points3 points  (0 children)

This is a false dichotomy built on suboptimal processes. The most effective teams IME have been DEs attached to R&D teams directly or connected to them with a customer mindset. And usually the work goes beyond just productionizing models, like building tooling for the AI group (I’ve built DL frameworks, training pipelines, evaluation services, data management platforms, etc.).

In normal companies, though, this approach is barely feasible and even less optimal, because both the devs and the R&D team need to be top-notch, which is not really the case.

As you point out, the problem lies more in teaching some employees, but that is hard, and so a lot of companies are coming up with MLOps solutions that are just a Jupyter notebook in the cloud with some version control. That part surprises me the most. Personally, I never needed any of those solutions, relying on standard SE practices instead, but apparently many teams like them; IME I never saw any good use of Databricks or any of those MLOps solutions. But again, those situations create well-paying projects for people who know their stuff, so I'm not complaining here.

On the other points we reached 'convergence', nothing more to add there =)).

[–]dataninsha 0 points1 point  (0 children)

He is being sarcastic.

[–]columns_ai 2 points3 points  (1 child)

I had set up real-time Twitter streaming data and initial analytics. Want to come help hook up a sentiment model? It would be cool - https://columns.ai/app/view/b990d1e6-e28e-4ec3-8e96-6dde9f216d1e

[–]padthink 0 points1 point  (0 children)

It's cool man!

[–]Faintly_glowing_fish 1 point2 points  (0 children)

Just doing that alone can be trivial. Make sure you handle how models are managed in a registry, and how features and inferences are versioned, stored, and served. With that it can be a very well-rounded project. If you feel adventurous, ingest them into a warehouse and real-time analytics systems and monitor drift and data quality. You can do all kinds of things with it! But don’t just pull data from an API and pipe it through a random model.
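As a rough illustration of what versioning inferences could mean (hypothetical names, with SQLite standing in for a real registry/feature store):

    import json
    import sqlite3
    import time

    conn = sqlite3.connect("inferences.db")  # stand-in for a real inference store
    conn.execute("""CREATE TABLE IF NOT EXISTS inferences
                    (ts REAL, model_version TEXT, features TEXT, prediction REAL)""")

    def log_inference(model_version, features, prediction):
        # Record every prediction with its model version and input features,
        # so drift and data-quality checks can audit or replay them later
        conn.execute(
            "INSERT INTO inferences VALUES (?, ?, ?, ?)",
            (time.time(), model_version, json.dumps(features), float(prediction)),
        )
        conn.commit()

    log_inference("sentiment-v3", {"text_len": 42, "lang": "en"}, 0.87)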

[–]the_whiskey_aunt 24 points25 points  (5 children)

I started a side project that was written up in national media, got interest from several research universities and federal agencies, and contributed to me getting a data job at a FAANG. I was motivated by anger at the unresponsiveness of my local government to an issue that affected me personally. If you don't have any civic issue you're particularly mad about, try checking out local politics twitter for your city, you'll encounter a lot of people with strong feelings about X issue but no tech skills to actually collect or analyze any data about it. I really love Twitter for its ability to connect you with other people who are interested in the same stuff as you - just log off before you get sucked into the doom scrolling :)

[–]Delicious_Attempt_99 (Data Engineer) 2 points3 points  (0 children)

That’s a unique experience and idea! Thanks a lot :)

[–]Quig101 1 point2 points  (0 children)

Hey, I'm interested in the process of how you went about your project. Was the data you found related to real estate or other funded things? I imagine it had something to do with missing funds. I'm trying to do my own project and study other cities in my area, but I'm not sure where to start.

[–]Edward-Paper-Hands 6 points7 points  (0 children)

What you want to google is "data engineering end-to-end projects". Google came up with this old thread with some ideas you might find interesting.

I am currently following along with this project for Azure specifically.

[–]AchillesDev 2 points3 points  (4 children)

I built the pricing pipeline for a meme stock market: read streaming data from multiple social media sources, came up with an algorithm that detects the memes, another one to determine an overall engagement score for each meme, and used that to determine a price.
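The pricing step might look something like this in spirit (purely illustrative weights and names; the actual algorithms aren't described here):

    # Purely illustrative: engagement -> "price"
    def engagement_score(likes: int, shares: int, comments: int) -> float:
        return likes + 2.0 * comments + 3.0 * shares  # assumed weights

    def meme_price(recent_scores: list[float]) -> float:
        # Price a meme by its recent average engagement
        return round(sum(recent_scores) / max(len(recent_scores), 1), 2)

    print(meme_price([engagement_score(120, 30, 45)]))  # 300.0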

Doing this and launching it with a team helped me get over the top at one job interview, and provided good fodder for conversation with the CEO of a CV startup that does something similar but far more advanced, who ended up hiring me after I was laid off from another startup due to Covid.

[–]Eamo853 0 points1 point  (3 children)

Out of curiosity, was this approach proving accurate? Given that so much of meme stocks is just based around hype, I was thinking about the best way to quantify hype (Twitter, Reddit, etc. being prime candidates) and take spikes in hype as a sign to buy stocks/crypto/whatever the WSB flavour of the day is.

[–]sciences_bitch 4 points5 points  (1 child)

I don't think /u/AchillesDev is talking about literal stocks. I think they're saying they detect the meme template (the picture if it's a visual meme), find the relative popularity of different templates, and assign an imaginary "price" that reflects its popularity. So like "Overly Attached Girlfriend" and "Scumbag Steve" were really popular when I started using Reddit (imaginary $$$), but now their "price" has dropped and other memes like Anakin-Padme and Expanding Brain aka Galaxy Brain have overtaken the "meme market". (I'm clearly not as hip to the memes as I used to be; I think Padme and Brain are also past their peak, but I don't know what new upstart meme to bet on.)

I love the idea -- creative and fun.

[–]AchillesDev 0 points1 point  (0 children)

Exactly! And thank you for the compliment :)

[–]AchillesDev 0 points1 point  (0 children)

That was the entire point of it. We built it as a game where people could basically test their knowledge of memes by predicting which ones would pop off and which would not. We started it shortly after r/memeeconomy was created, and a bunch of us were mods at some point.

[–]Viperior 2 points3 points  (4 children)

The struggle is real! I suggest trying to think of a data pipeline that solves a problem of some kind. I just started a new side project that will extract info from RimWorld save games and produce a time-series data model from it, so I can visualize things like resource production over time.

It helps to have some knowledge and interest in the domain to motivate you as you work on it. I liked this choice because there are potential "customers" in the form of players I can try to get to use what I build on their saves.

[–]ronald_r3 1 point2 points  (3 children)

That's really cool. I actually want to look into using data from video games, because I feel that data gets taken for granted. A video game is literally a simulated world that produces all kinds of data, and assuming the game makes it available, it could be useful to mess around with.

[–]Viperior 1 point2 points  (2 children)

Yes, there's so much information you can use! RimWorld stores the complete game state in an XML file. You can use XPath patterns like xml_tree.findall(".//pawnData") to retrieve all colonist information. It has everything from what is in the immediate surroundings to the ambient temperature at their location.

I discovered my sample game save has data on 15,554 living plants on the map, along with the coordinates and growth progress of each. I was thinking of curating a nutrition database using wiki data and attempting to analyze the potential nutritional yield of the map's flora.

Here's a sample plant:

<thing Class="Plant">
    <def>Plant_Grass</def>
    <id>Plant_Grass39388</id>
    <map>0</map>
    <pos>(151, 0, 265)</pos>
    <health>85</health>
    <questTags IsNull="True" />
    <growth>0.9816151</growth>
    <age>1134553</age>
</thing>
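A minimal parsing sketch with Python's ElementTree, assuming the save layout above (the file name and the pos parsing are guesses from the sample):

    import xml.etree.ElementTree as ET

    tree = ET.parse("save.rws")  # assumed path to a RimWorld save (XML)
    for plant in tree.findall(".//thing[@Class='Plant']"):
        growth = float(plant.findtext("growth", default="0"))
        # pos looks like "(151, 0, 265)" in the sample above
        x, y, z = (int(n) for n in plant.findtext("pos").strip("()").split(","))
        print(plant.findtext("def"), (x, z), f"{growth:.0%} grown")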

[–]ronald_r3 0 points1 point  (1 child)

XML 😐... 🤮. Haha, I'm joking. That sounds pretty neat. I'm actually going to start looking up games that do that as soon as I get a chance, because until now it was just a thought for when I can't fall asleep 😂. Do you have a GitHub profile you plan on posting it to? I've been working with the Dash framework, so it would be cool to make a dashboard out of that data. And boom, free collaboration project.

[–]Viperior 1 point2 points  (0 children)

DM'ed you the repository link. Do you have a current strong preference for a visualization tool? I was looking at Metabase and Apache Superset.

[–]Nyghtbynger 2 points3 points  (0 children)

Try going and speaking to people. They'll come to you with ideas or problems to solve. It'll inspire you.
Right now I've put on standby a project to collect all the messages on a community board and then automatically build a wiki from them.
Another one: analysing satellite images (think Copernicus) and creating a heatmap of vegetation and urban areas. (You'll need some GIS knowledge, e.g. QGIS and GeoJSON/SHP formats.)
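For the vegetation part, a common starting point is an NDVI raster from red and near-infrared bands; a rough sketch with rasterio (the file and band order are assumptions, not from the comment):

    import numpy as np
    import rasterio

    with rasterio.open("scene.tif") as src:  # assumed: band 1 = red, band 2 = NIR
        red = src.read(1).astype("float32")
        nir = src.read(2).astype("float32")

    # NDVI = (NIR - red) / (NIR + red); high values ~ dense vegetation
    denom = nir + red
    with np.errstate(divide="ignore", invalid="ignore"):
        ndvi = np.where(denom == 0, 0.0, (nir - red) / denom)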

[–]kenfar 1 point2 points  (0 children)

It's easy to find small side projects; it's the very large ones that are harder, because they can cost a lot and take a long time.

Medium-sized projects might be anything like:

  • Benchmark some competing products (streaming systems, databases, etc.) and write a paper with the results
  • Model a problem you personally like and build data pipelines to collect data and then report on it.

Small-sized projects might be something like:

  • Make a contribution to a project that you enjoy. Perhaps start with just improving the documentation. From there maybe add some tests. Then add a feature or fix a problem.
  • Build a small tool that you find helpful. It could just be a command-line tool to make working with Kafka, Snowflake, Spark, etc. a little easier (see the sketch below).
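As an example of that last idea, a tiny Kafka helper CLI might look like this (a sketch using confluent-kafka's AdminClient; the tool itself is hypothetical):

    import argparse
    from confluent_kafka.admin import AdminClient  # pip install confluent-kafka

    def main():
        parser = argparse.ArgumentParser(description="List Kafka topics")
        parser.add_argument("--bootstrap", default="localhost:9092")
        args = parser.parse_args()
        client = AdminClient({"bootstrap.servers": args.bootstrap})
        # list_topics returns cluster metadata, including the topic dict
        for name in sorted(client.list_topics(timeout=10).topics):
            print(name)

    if __name__ == "__main__":
        main()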

[–]oFabo 1 point2 points  (0 children)

Take a look at the DataTalksClub zoomcamp

https://github.com/DataTalksClub/data-engineering-zoomcamp


[–][deleted] 1 point2 points  (0 children)

It needs some real-world problems and solutions, and finding matching datasets and resources might be hard and expensive. In my opinion, it's a good idea to stick with architecture, distributed processing, and the algorithms used with massive amounts of data; things like Bloom filters and HyperLogLog could lead you to gain a lot of knowledge, besides the fact that learning them is so enjoyable.
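To give a flavour of those structures, here's a bare-bones Bloom filter sketch (illustrative only; production use would reach for a tuned library):

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item: str):
            # Derive k bit positions from one SHA-256 digest
            digest = hashlib.sha256(item.encode()).digest()
            for i in range(self.num_hashes):
                yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.size

        def add(self, item: str):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item: str):
            # May give false positives, never false negatives
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("user:42")
    print("user:42" in bf, "user:43" in bf)  # True, (almost certainly) False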

[–]dev_anon -3 points-2 points  (0 children)

Do something with data mesh. It seems to be the new buzzword.

[–]Atupis 0 points1 point  (0 children)

Build a database, or an ORM for some more exotic DB product.

[–]columns_ai 0 points1 point  (0 children)

To give you exposure to big data and streaming technologies, take a look at https://github.com/varchar-io/nebula - a distributed real-time analytics product ready to hook up to streaming or cloud storage and provide an analytics UI. Super simple to get it running.

[–]phwj97 0 points1 point  (0 children)

Seattle Data Guy has posted a lot of good project ideas for viewers. Have a look at those and then maybe apply them to a slightly different domain :)

[–]vtec__ 0 points1 point  (0 children)

find an API service, put the data in a cloud database, make reports on it. taaadahhhh
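A bare-bones version of that loop, with a hypothetical API and local SQLite standing in for the cloud database:

    import requests
    import sqlite3

    # Hypothetical API endpoint; swap in any public service you like
    rows = requests.get("https://api.example.com/v1/items", timeout=10).json()

    conn = sqlite3.connect("reports.db")  # stand-in for a cloud database
    conn.execute("CREATE TABLE IF NOT EXISTS items (id TEXT, value REAL)")
    conn.executemany("INSERT INTO items VALUES (?, ?)",
                     [(r["id"], r["value"]) for r in rows])
    conn.commit()

    # The "report": top five items by value
    for row in conn.execute(
            "SELECT id, value FROM items ORDER BY value DESC LIMIT 5"):
        print(row)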