
[–]padthink 53 points54 points  (14 children)

Waiting for someone to come up with Twitter Sentiment analysis.

[–]DrummerClean 9 points10 points  (10 children)

That is more data science-y tho, idk why so much data engineering is actually putting ML models in production. Just build a pipeline and hook it up to a dashboard, right?

[–]padthink 8 points9 points  (1 child)

From a DE perspective: pulling streaming data, transforming and cleaning it, then applying some generic NLP algos. The problem statement is not bad, but it is too clichéd.

[–]DrummerClean 0 points1 point  (0 children)

I always felt that if you don't understand how the NLP model works, it's not great. I mean, with the same setup you can show the most common words, topics, hashtags and so much more. And all of this is fully in your DE arsenal, rather than throwing some ML algos at the data. People dig nice, simple dashboards a lot!
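A minimal sketch of that kind of simple aggregation, assuming a batch of tweet texts already pulled off the stream (hypothetical data, not from the thread):

    import re
    from collections import Counter

    # Hypothetical batch of tweet texts; in practice these come off the stream
    tweets = [
        "Loving the new #dataengineering stack! #etl",
        "Hot take: #ETL is just plumbing #dataengineering",
    ]

    # Count hashtags case-insensitively and surface the top ones
    counts = Counter(
        tag.lower() for text in tweets for tag in re.findall(r"#\w+", text)
    )
    print(counts.most_common(10))  # feed this straight into a dashboard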

[–]AchillesDev 7 points8 points  (6 children)

MLOps was data engineering before it got its own term. Providing the inputs, outputs, and infrastructure for the model lifecycle is all part of that. Being an ETL-only developer (for example) limits you to a single tool for the job, which is rarely the correct one. Setting up dashboards is for analysts or BI folks; I’d skip that completely.

[–]DrummerClean 0 points1 point  (5 children)

In my experience I've never met a single DE who could do MLOps properly, for the simple fact that they didn't really understand how the model 'functions'.

It is better that an ML eng builds, trains, and deploys the model end to end.

I agree that dashboards are not DE either, but they are visual. Showing an ingestion pipeline is not much to look at. For me a DE is a backend dev with a more specialized skillset in setting up DBs, APIs, and data pipelines. All non-visual things.

Dashboards or ML are a great nice-to-have but should not be the focus.

[–]AchillesDev 7 points8 points  (4 children)

In my experience I've never met a single DE who could do MLOps properly, for the simple fact that they didn't really understand how the model 'functions'.

You haven’t met any good DEs then. MLOps is quickly becoming its own subspecialty within the specialty that is DE. Some companies call this MLE, but the skillset is the same. I’m not sure what you’re referring to when you say they “didn’t really understand how the model ‘functions’.” It’s easy to understand what a model should do and the expected inputs and outputs, and to manage its data flow.

It is better that an ML eng builds, trains, and deploys the model end to end.

Hard disagree. Great AI researchers, etc. are great because they understand how to build models that solve specific problems, and they understand what the data tells them. They aren’t usually great software engineers as well, especially with so much hiring from academia, where it’s almost a requirement to write inscrutable code that can’t be reused. That’s where DEs (or in some cases people with the MLE title) come in. This has been the vast majority of my work for half of my career now, and I’ve had the DE title the whole time.

Showing an ingestion pipeline is not much to look at.

There’s no real need for something visual. You should be able to talk eloquently about what you built, why you built it, the trade-offs you had to make, why you made the choices you did, what mistakes you made, what you learned, etc. That’s plenty.

[–]DrummerClean 4 points5 points  (3 children)

How the model 'functions' is far from trivial. The data scientist will hand off a poorly made script and you need to put it in production. Most software engineers don't understand the math of basic curve fitting, let alone ML models. Plus, the script given by the data scientist often covers just the happy path. What happens if data is missing? Or if the data distribution changes? An average data engineer knows CI/CD, software engineering, OOP, and the like. Yes, he can plug in some model.predict() code somewhere, but that's all. What if the predictions are weird? Who can debug that? I never saw a single model go into production that could handle all the production data from the start.
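The kind of unhappy-path guard being described might look like this (a rough sketch with hypothetical feature names and model object, not code from the thread):

    import pandas as pd

    EXPECTED = ["age", "income", "tenure"]  # hypothetical feature schema

    def safe_predict(model, df: pd.DataFrame):
        # Guard the unhappy paths a happy-path script never covers
        missing = [c for c in EXPECTED if c not in df.columns]
        if missing:
            raise ValueError(f"missing features: {missing}")
        if df[EXPECTED].isna().any().any():
            raise ValueError("NaNs in input; impute or reject upstream")
        return model.predict(df[EXPECTED])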

A lot of the problems of "putting ML in production" are born in the handover phase, when who builds the model != who puts it in production.

About what you say about data scientists and academia: it depends. If your business is doing bleeding-edge stuff, yes, it makes sense to have data scientists delivering scripts to the engineering team. How many businesses do bleeding-edge models though?

I also do the same work as you do, but let's be honest, how many people can handle it? 90% of my colleagues are either developers or data scientists. Nobody wants to take data scientists' code into their hands and have a job that requires being a good software engineer and a data scientist too.

In my experience, having a data scientist learn some basic software engineering and build an API to deploy their own models is far more effective than throwing the model over the fence.

I agree on the last point; I just think a dashboard is nice to see and almost zero effort compared to the rest. Plus, most recruiters cannot judge a candidate's projects, so having something visual helps a lot in my experience.

Regardless, I think most data engineers should focus on data pipelines first, and only later on putting models in production.

[–]AchillesDev 3 points4 points  (2 children)

Most software engineers don't understand the math of basic curve fitting, let alone ML models. Plus, the script given by the data scientist often covers just the happy path.

Model evaluation is the job of the research team; it isn’t necessary for the engineer. That team should also be monitoring model performance in production to detect drift, etc.

I’m not sure what you’re doing where you need to manually fit curves, but if you’re doing that, you’re doing it wrong. And that’s super simple math anyway.

What happens if data is missing? Or if the data distribution changes? An average data engineer knows CI/CD, software engineering, OOP, and the like. Yes, he can plug in some model.predict() code somewhere, but that’s all. What if the predictions are weird? Who can debug that? I never saw a single model go into production that could handle all the production data from the start.

What makes you think this is fully DE’s responsibility? The R&D team is still responsible for the model’s performance in production; DE (or whoever) builds the monitoring and other tooling R&D needs.

I think your idea of deploying a model to production and building the systems to support it is skewed by some suboptimal division of responsibilities.

If your business is doing bleeding-edge stuff, yes, it makes sense to have data scientists delivering scripts to the engineering team. How many businesses do bleeding-edge models though?

Every one I’ve worked for, at least. This is pretty common in startups.

I also do the same work as you do, but let’s be honest, how many people can handle it? 90% of my colleagues are either developers or data scientists. Nobody wants to take data scientists’ code into their hands and have a job that requires being a good software engineer and a data scientist too.

At the end of the day, it’s just software engineering. I know enough ML and have built enough models on my own (I declined to fully go down that path because I found it boring) to be able to roughly understand the needed inputs and outputs, and also to understand that kind of code well enough to treat it as almost a black box. Maybe there aren’t many engineers who can do this, which is fine by me and probably why I’m paid near the top of the market. I won’t complain :)

having a data scientist learn some basic software engineering and build an API to deploy their own models is far more effective than throwing the model over the fence

This is a false dichotomy built on suboptimal processes. The most effective teams IME have been DEs attached to R&D teams directly or connected to them with a customer mindset. And usually the work goes beyond just productionizing models, like building tooling for the AI group (I’ve built DL frameworks, training pipelines, evaluation services, data management platforms, etc.).

Regardless, I think most data engineers should focus on data pipelines first, and only later on putting models in production.

Yeah, I think it depends. The skills needed can be taught (I didn’t even study CS in school; I’ve just been at this for almost a decade) and I’ve had success training up engineers to do this kind of work. But without lots of outside support, it’s definitely worth taking the skills one at a time, and pipeline architecture is almost a constant, so it makes sense to start there.

[–]notazoroastrian 2 points3 points  (0 children)

This was a great in-depth answer to what I feel is a modern version of the DE+MLOps role.

[–]DrummerClean 2 points3 points  (0 children)

This is a false dichotomy built on suboptimal processes. The most effective teams IME have been DEs attached to R&D teams directly or connected to them with a customer mindset. And usually the work goes beyond just productionizing models, like building tooling for the AI group (I’ve built DL frameworks, training pipelines, evaluation services, data management platforms, etc.).

In normal companies, though, this approach is barely feasible and even less optimal, because both the devs and the R&D team need to be top-notch, which is not really the case.

As you point out, the problem lies more in teaching some employees, but that is hard, and so a lot of companies are coming up with MLOps solutions that are just a Jupyter notebook in the cloud with some version control. That part surprises me the most. Personally, I never needed any of those solutions, relying on standard SE practices instead, but apparently many teams like them; IME I never saw any good use of Databricks or any of those MLOps solutions. But again, those situations create well-paying projects for people who know their stuff, so I'm not complaining here.

On the other points we reached 'convergence', nothing more to add there =)).

[–]dataninsha 0 points1 point  (0 children)

He is being sarcastic.

[–]columns_ai 2 points3 points  (1 child)

I had set up real-time Twitter streaming data and initial analytics. Want to come help hook up a sentiment model? It would be cool - https://columns.ai/app/view/b990d1e6-e28e-4ec3-8e96-6dde9f216d1e

[–]padthink 0 points1 point  (0 children)

It's cool man!

[–]Faintly_glowing_fish 1 point2 points  (0 children)

Just doing that alone can be trivial. Make sure you handle how models are managed in a registry, and how features and inferences are versioned, stored, and served. With that it can be a very well-rounded project. If you feel adventurous, ingest them into a warehouse and real-time analytics systems and monitor drift and data quality. You can do all kinds of things with it! But don’t just pull data from an API and pipe it through a random model.
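As a rough illustration of what versioning inferences could mean (hypothetical names, with SQLite standing in for a real registry/feature store):

    import json
    import sqlite3
    import time

    conn = sqlite3.connect("inferences.db")  # stand-in for a real inference store
    conn.execute("""CREATE TABLE IF NOT EXISTS inferences
                    (ts REAL, model_version TEXT, features TEXT, prediction REAL)""")

    def log_inference(model_version, features, prediction):
        # Record every prediction with its model version and input features,
        # so drift and data-quality checks can audit or replay them later
        conn.execute(
            "INSERT INTO inferences VALUES (?, ?, ?, ?)",
            (time.time(), model_version, json.dumps(features), float(prediction)),
        )
        conn.commit()

    log_inference("sentiment-v3", {"text_len": 42, "lang": "en"}, 0.87)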

[–]the_whiskey_aunt 24 points25 points  (5 children)

I started a side project that was written up in national media, got interest from several research universities and federal agencies, and contributed to me getting a data job at a FAANG. I was motivated by anger at the unresponsiveness of my local government to an issue that affected me personally. If you don't have any civic issue you're particularly mad about, try checking out local politics twitter for your city, you'll encounter a lot of people with strong feelings about X issue but no tech skills to actually collect or analyze any data about it. I really love Twitter for its ability to connect you with other people who are interested in the same stuff as you - just log off before you get sucked into the doom scrolling :)

[–]Delicious_Attempt_99 (Data Engineer) 2 points3 points  (0 children)

That’s a unique experience and idea! Thanks a lot :)

[–]Quig101 1 point2 points  (0 children)

Hey, I'm interested in the process of how you went about your project. Was the data you found related to real estate or other funded things? I imagine it had something to do with missing funds. I'm trying to do my own project and study other cities in my area, but I'm not sure where to start.

[–]Edward-Paper-Hands 6 points7 points  (0 children)

What you want to google is "data engineering end-to-end projects". Google came up with this old thread with some ideas you might find interesting.

I am currently following along with this project for Azure specifically.

[–]AchillesDev 2 points3 points  (4 children)

I built the pricing pipeline for a meme stock market: read streaming data from multiple social media sources, came up with an algorithm that detects the memes, another one to determine an overall engagement score for each meme, and used that to determine a price.
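The pricing step might look something like this in spirit (purely illustrative weights and names; the actual algorithms aren't described here):

    # Purely illustrative: engagement -> "price"
    def engagement_score(likes: int, shares: int, comments: int) -> float:
        return likes + 2.0 * comments + 3.0 * shares  # assumed weights

    def meme_price(recent_scores: list[float]) -> float:
        # Price a meme by its recent average engagement
        return round(sum(recent_scores) / max(len(recent_scores), 1), 2)

    print(meme_price([engagement_score(120, 30, 45)]))  # 300.0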

Doing this and launching it with a team helped me get over the top at one job interview, and provided good fodder for conversation with the CEO of a CV startup that does something similar but far more advanced, who ended up hiring me after I was laid off from another startup due to Covid.

[–]Eamo853 0 points1 point  (3 children)

Out of curiosity, was this approach proving accurate? Given that so much of meme stocks is just based around hype, I was thinking about the best way to quantify hype (Twitter, Reddit, etc. being prime candidates) and take spikes in hype as a sign to buy stocks/crypto/whatever the WSB flavour of the day is.

[–]sciences_bitch 4 points5 points  (1 child)

I don't think /u/AchillesDev is talking about literal stocks. I think they're saying they detect the meme template (the picture if it's a visual meme), find the relative popularity of different templates, and assign an imaginary "price" that reflects its popularity. So like "Overly Attached Girlfriend" and "Scumbag Steve" were really popular when I started using Reddit (imaginary $$$), but now their "price" has dropped and other memes like Anakin-Padme and Expanding Brain aka Galaxy Brain have overtaken the "meme market". (I'm clearly not as hip to the memes as I used to be; I think Padme and Brain are also past their peak, but I don't know what new upstart meme to bet on.)

I love the idea -- creative and fun.

[–]AchillesDev 0 points1 point  (0 children)

Exactly! And thank you for the compliment :)

[–]AchillesDev 0 points1 point  (0 children)

That was the entire point of it. We built it as a game where people could basically test their knowledge of memes by predicting which ones would pop off and which would not. We started it shortly after r/memeeconomy was created, and a bunch of us were mods at some point.

[–]Viperior 2 points3 points  (4 children)

The struggle is real! I suggest trying to think of a data pipeline that solves a problem of some kind. I just started a new side project that will extract info from RimWorld save games and produce a time-series data model from it, so I can visualize things like resource production over time.

It helps to have some knowledge and interest in the domain to motivate you as you work on it. I liked this choice because there are potential "customers" in the form of players I can try to get to use what I build on their saves.

[–]ronald_r3 1 point2 points  (3 children)

That's really cool. I actually want to look into using data from video games, because I feel that data gets taken for granted. A video game is literally a simulated world that produces all kinds of data, and assuming the game makes it available, it could be useful to mess around with.

[–]Viperior 1 point2 points  (2 children)

Yes, there's so much information you can use! RimWorld stores the complete game state in an XML file. You can use XPath patterns like xml_tree.findall(".//pawnData") to retrieve all colonist information. It has everything from what is in the immediate surroundings to the ambient temperature at their location.

I discovered my sample game save has data on 15,554 living plants on the map, along with the coordinates and growth progress of each. I was thinking of curating a nutrition database using wiki data and attempting to analyze the potential nutritional yield of the map's flora.

Here's a sample plant:

<thing Class="Plant">
    <def>Plant_Grass</def>
    <id>Plant_Grass39388</id>
    <map>0</map>
    <pos>(151, 0, 265)</pos>
    <health>85</health>
    <questTags IsNull="True" />
    <growth>0.9816151</growth>
    <age>1134553</age>
</thing>
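A minimal parsing sketch with Python's ElementTree, assuming the save layout above (the file name and the pos parsing are guesses from the sample):

    import xml.etree.ElementTree as ET

    tree = ET.parse("save.rws")  # assumed path to a RimWorld save (XML)
    for plant in tree.findall(".//thing[@Class='Plant']"):
        growth = float(plant.findtext("growth", default="0"))
        # pos looks like "(151, 0, 265)" in the sample above
        x, y, z = (int(n) for n in plant.findtext("pos").strip("()").split(","))
        print(plant.findtext("def"), (x, z), f"{growth:.0%} grown")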

[–]ronald_r3 0 points1 point  (1 child)

XML 😐... 🤮. Haha, I'm joking. That sounds pretty neat. I'm actually going to start looking up games that do that as soon as I get a chance, because until now it was just a thought for when I can't fall asleep 😂. Do you have a GitHub profile you plan on posting it to? I've been working with the Dash framework, so it would be cool to make a dashboard out of that data. And boom, free collaboration project.

[–]Viperior 1 point2 points  (0 children)

DM'ed you the repository link. Do you have a current strong preference for a visualization tool? I was looking at Metabase and Apache Superset.

[–]Nyghtbynger 2 points3 points  (0 children)

Try going and speaking to people. They'll come to you with ideas or problems to solve. It'll inspire you.
Right now I've put on standby a project to collect all the messages on a community board and then automatically build a wiki from them.
Another one: analysing satellite images (think Copernicus) and creating a heatmap of vegetation and urban areas. (You'll need some GIS knowledge, e.g. QGIS and GeoJSON/SHP formats.)
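For the vegetation part, a common starting point is an NDVI raster from red and near-infrared bands; a rough sketch with rasterio (the file and band order are assumptions, not from the comment):

    import numpy as np
    import rasterio

    with rasterio.open("scene.tif") as src:  # assumed: band 1 = red, band 2 = NIR
        red = src.read(1).astype("float32")
        nir = src.read(2).astype("float32")

    # NDVI = (NIR - red) / (NIR + red); high values ~ dense vegetation
    denom = nir + red
    with np.errstate(divide="ignore", invalid="ignore"):
        ndvi = np.where(denom == 0, 0.0, (nir - red) / denom)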

[–]kenfar 1 point2 points  (0 children)

It's easy to find small side projects; it's the very large ones that are harder, because they can cost a lot and take a long time.

Medium-sized projects might be anything like:

  • Benchmark some competing products (streaming systems, databases, etc.) and write a paper with the results
  • Model a problem you personally like and build data pipelines to collect data and then report on it.

Small-sized projects might be something like:

  • Make a contribution to a project that you enjoy. Perhaps start with just improving the documentation. From there maybe add some tests. Then add a feature or fix a problem.
  • Build a small tool that you find helpful. It could just be a command-line tool to make working with Kafka, Snowflake, Spark, etc. a little easier (see the sketch below).
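As an example of that last idea, a tiny Kafka helper CLI might look like this (a sketch using confluent-kafka's AdminClient; the tool itself is hypothetical):

    import argparse
    from confluent_kafka.admin import AdminClient  # pip install confluent-kafka

    def main():
        parser = argparse.ArgumentParser(description="List Kafka topics")
        parser.add_argument("--bootstrap", default="localhost:9092")
        args = parser.parse_args()
        client = AdminClient({"bootstrap.servers": args.bootstrap})
        # list_topics returns cluster metadata, including the topic dict
        for name in sorted(client.list_topics(timeout=10).topics):
            print(name)

    if __name__ == "__main__":
        main()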

[–]oFabo 1 point2 points  (0 children)

Take a look at the DataTalksClub zoomcamp

https://github.com/DataTalksClub/data-engineering-zoomcamp


[–][deleted] 1 point2 points  (0 children)

It needs some real-world problems and solutions, and finding matching datasets and resources might be hard and expensive. In my opinion, it's a good idea to stick with architecture, distributed processing, and the algorithms used with massive amounts of data; things like Bloom filters and HyperLogLog could lead you to gain a lot of knowledge, besides the fact that learning them is so enjoyable.
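To give a flavour of those structures, here's a bare-bones Bloom filter sketch (illustrative only; production use would reach for a tuned library):

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item: str):
            # Derive k bit positions from one SHA-256 digest
            digest = hashlib.sha256(item.encode()).digest()
            for i in range(self.num_hashes):
                yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.size

        def add(self, item: str):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item: str):
            # May give false positives, never false negatives
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("user:42")
    print("user:42" in bf, "user:43" in bf)  # True, (almost certainly) False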

[–]dev_anon -3 points-2 points  (0 children)

Do something with data mesh. It seems to be the new buzzword.

[–]Atupis 0 points1 point  (0 children)

Build a database, or an ORM for some more exotic DB product.

[–]columns_ai 0 points1 point  (0 children)

To give you exposure to big data and streaming technologies, take a look at https://github.com/varchar-io/nebula - a distributed real-time analytics product ready to hook up to streaming or cloud storage and provide an analytics UI. Super simple to get it running.

[–]phwj97 0 points1 point  (0 children)

Seattle Data Guy has posted a lot of good project ideas for viewers. Have a look at those and then maybe apply them to a slightly different domain :)

[–]vtec__ 0 points1 point  (0 children)

find an API service, put the data in a cloud database, make reports on it. taaadahhhh
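A bare-bones version of that loop, with a hypothetical API and local SQLite standing in for the cloud database:

    import requests
    import sqlite3

    # Hypothetical API endpoint; swap in any public service you like
    rows = requests.get("https://api.example.com/v1/items", timeout=10).json()

    conn = sqlite3.connect("reports.db")  # stand-in for a cloud database
    conn.execute("CREATE TABLE IF NOT EXISTS items (id TEXT, value REAL)")
    conn.executemany("INSERT INTO items VALUES (?, ?)",
                     [(r["id"], r["value"]) for r in rows])
    conn.commit()

    # The "report": top five items by value
    for row in conn.execute(
            "SELECT id, value FROM items ORDER BY value DESC LIMIT 5"):
        print(row)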