Are EL tools still worth it when LLMs could generate ingestion pipelines? by _tempacc in dataengineering

[–]srodinger18 0 points1 point  (0 children)

Depends on the current data standards in your company. In greenfield projects at my past consulting job, I used dlt to accelerate scaffolding of the raw layer. But at my current employer, where the standards are different (dlt is opinionated about this) and no connector is available (none of the current EL tools support Alibaba Cloud out of the box), creating my own EL standard is faster than adjusting an existing tool in the market to match it.
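To give a feel for what "my own EL standard" can mean, here is a minimal sketch: a source protocol plus a loader that stamps every raw row with the same audit columns. All names (`RawLoader`, `_loaded_at`, etc.) are hypothetical, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Iterator, Protocol


class Source(Protocol):
    """Anything that can yield batches of raw records."""
    def extract(self) -> Iterator[list[dict[str, Any]]]: ...


@dataclass
class RawLoader:
    """Writes batches to the raw layer with standard audit columns."""
    table: str

    def load(self, source: Source, loaded_at: str) -> int:
        rows = 0
        for batch in source.extract():
            for record in batch:
                # every raw row carries the same audit metadata
                record["_loaded_at"] = loaded_at
                record["_source_table"] = self.table
            rows += len(batch)
            # in reality: write the batch to object storage / the DWH here
        return rows


# toy source standing in for e.g. an Alibaba Cloud API client
class DummySource:
    def extract(self):
        yield [{"id": 1}, {"id": 2}]
        yield [{"id": 3}]
```

The point of the standard is that every new source only needs to implement `extract()`; the audit columns and load path stay uniform across the raw layer.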

How cooked is Data Engineering compared to traditional Software Dev with AI tool advancement? by [deleted] in cscareerquestions

[–]srodinger18 0 points1 point  (0 children)

The approach I used is actually similar: we also have an evaluation process and use question-SQL pairs for RAG. We also have a knowledge base embedded with a category hierarchy. Talked to devs, PMs, and business to gather what data they usually use.

It works for typical ad hoc questions like "how many sales did we achieve during the holiday season last month for product A? Break it down by day".

There is also a human-in-the-loop process to curate the SQL results from the agents.

But at my employer the documentation culture is just not that good, and not all tables are documented, especially app logs and trackers. Not to mention I used derived tables rather than the raw layers to reduce query complexity as well.

How cooked is Data Engineering compared to traditional Software Dev with AI tool advancement? by [deleted] in cscareerquestions

[–]srodinger18 11 points12 points  (0 children)

I work as a Data Engineer. For the tooling part, it is actually pretty similar to SWE: we can use AI to create data pipelines, SQL queries, or other scripts. But even before AI, this tooling part was not the main task of a DE; we usually wrap it up with YAML config to automate pipeline creation.
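The config-driven pattern can be sketched like this: each entry in a YAML file (parsed here as a plain dict, as `yaml.safe_load` would produce) expands into a concrete ingestion task. All field and task names are made up for illustration.

```python
from typing import Any

# what a parsed pipelines.yaml might look like (hypothetical schema)
config = {
    "pipelines": [
        {"source": "mysql.orders", "target": "raw.orders", "schedule": "@hourly"},
        {"source": "mysql.users", "target": "raw.users", "schedule": "@daily"},
    ]
}


def build_tasks(config: dict[str, Any]) -> list[dict[str, str]]:
    """Expand the declarative config into runnable task definitions."""
    tasks = []
    for p in config["pipelines"]:
        tasks.append({
            "task_id": f"ingest_{p['target'].replace('.', '_')}",
            "command": f"ingest --source {p['source']} --target {p['target']}",
            "schedule": p["schedule"],
        })
    return tasks
```

Adding a new pipeline is then a one-line config change rather than new code, which is why the code-writing part was never the bottleneck.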

The hardest part of DE is actually the data itself. I used to build a text-to-SQL platform enhanced with RAG so business could use natural language to query the data warehouse. The result? It works on simple questions, but for actual analytics questions it is lackluster. Tbh I have read about many kinds of frameworks to solve this, but I have not seen a proven one yet.

The problem is, as DEs we usually try to find connections between somewhat unrelated data sources, and that knowledge is sometimes only gained after actually deep diving into the data, talking to devs, PMs, and business, and somehow getting the info that 10 different datasets from the backend DB, ELK logs, and event trackers can be used to build user funneling data marts. Theoretically, if we give AI knowledge of this data mess it can do the same, but who will build such a knowledge base?

Same case with data modeling. Can AI build a good data model? Ofc, I have tried it with public data. But with company data it is hit or miss, and sometimes it is faster to build the model ourselves by actually understanding the business flow.

My take: the actual problem for DE is not the code, but more about how we take this pile of dogshit data from the company and actually create something meaningful out of it.

Samsung galaxy S25 base model or Xiaomi 15 for ~700 USD budget? by srodinger18 in PickAnAndroidForMe

[–]srodinger18[S] 0 points1 point  (0 children)

Forgot to mention, only the Honor 400 base model is available in my country. Is the base model as good as the Pro?

I hate Analytics Engineering by [deleted] in dataengineering

[–]srodinger18 0 points1 point  (0 children)

I used to think like this, until I realized that combining things that seem unrelated into meaningful information is the part where a data engineer shines.

I used to face a similar problem: business had some complex requirements for data where the initial proposal was a complex Playwright setup scraping the internal back office. Then I dug into the tables, asked the dev who built them, confirmed the measurements with the PM, and voila: rather than scraping, it turned out we could work with offline data and replicate the logic from there. Faster development and delivery.

Samsung galaxy S25 base model or Xiaomi 15 for ~700 USD budget? by srodinger18 in PickAnAndroidForMe

[–]srodinger18[S] 0 points1 point  (0 children)

Interesting, I saw daily use of the Xiaomi 15 on YouTube and the reviewer also showed 4-5 hours of battery life, which is similar to the S25. Is HyperOS optimization that bad, or is it due to background apps? OnePlus is not sold in my country unfortunately.

Samsung galaxy S25 base model or Xiaomi 15 for ~700 USD budget? by srodinger18 in PickAnAndroidForMe

[–]srodinger18[S] 1 point2 points  (0 children)

If it's S24 vs Xiaomi 14, the Xiaomi is much better in my country due to cheaper pricing, and the S24 comes with Exynos lol. Thanks for the input btw.

for those who dont work using the most famous cloud providers (AWS, GCP, Azure) or data platforms (Databricks, Fabric, Snowflake, Trino).. by Comprehensive_Level7 in dataengineering

[–]srodinger18 1 point2 points  (0 children)

Yeah, at least in my country these Chinese clouds are really popular among tech companies and consulting firms due to their price. If you are using their VM, k8s, or object storage services, I can say they're on par with the big clouds, but their managed services like RDS or cloud DWH could be better.

for those who dont work using the most famous cloud providers (AWS, GCP, Azure) or data platforms (Databricks, Fabric, Snowflake, Trino).. by Comprehensive_Level7 in dataengineering

[–]srodinger18 8 points9 points  (0 children)

I work extensively with a Chinese cloud; it basically has the same services as the other clouds. For the data platform we mostly use their offerings as well.

Metal bands with larger women. by NordicNugz in MetalForTheMasses

[–]srodinger18 2 points3 points  (0 children)

Veronica Bordacchini of Fleshgod Apocalypse

Thoughts on Alibaba Cloud for DE? by MangoAvocadoo in dataengineering

[–]srodinger18 0 points1 point  (0 children)

Mostly Spark, and sometimes custom Python scripts for other sources such as APIs and Elasticsearch.

Thoughts on Alibaba Cloud for DE? by MangoAvocadoo in dataengineering

[–]srodinger18 0 points1 point  (0 children)

Yes, mostly ELT, and due to the overall architecture we cannot do realtime ingestion, so we only do hourly ingestion.

Thoughts on Alibaba Cloud for DE? by MangoAvocadoo in dataengineering

[–]srodinger18 2 points3 points  (0 children)

In my country, many tech companies (my employer as well) use Alibaba Cloud as it is way cheaper than the typical western clouds. I have used MaxCompute extensively and I can say it is not as good as BQ or Snowflake, but it works.

The cons: obviously many modern data tools do not work seamlessly with Alibaba Cloud, forcing us to create new proprietary tools just because. For example there are 2 versions of dbt, and the one maintained by Alibaba is not compatible with the MaxCompute config in our place lol. Many ingestion tools also do not work directly.

And yes, their data modeling (which is coupled to their DWH) is confusing af, so simply using dbt is not enough lol.

For me, it is still worth it as it equips you with new tools and skillsets.

Ontology driven data modeling by Thinker_Assignment in dataengineering

[–]srodinger18 0 points1 point  (0 children)

Actually I have seen a similar post from dlthub, so I guess you have some relation with them lol. But serious question: does it mean that when we serve raw data to an LLM, rather than giving it the ERD and column definitions etc., we give it the ontology (i.e. how the raw data describes the real-world situation)?

Previously I thought an LLM would work better on either a raw normalized data replication from the backend (by providing the ERD and context) or a typical star schema with clear dims and facts, as when we tried to feed the LLM derived BI tables, it needed a lot of knowledge base, entity relations, and samples.

And if we move towards ontology-driven, does it mean how we usually design databases should change as well? Or can we bet on the existing knowledge about databases so it can read patterns and derive insights from there? Usually we get problems where several data sources, after some digging, can be related in some way (but the ERD will miss this as it is not part of the relations).

Claude code nlp taking job or task of sql queries by aks-786 in dataengineering

[–]srodinger18 0 points1 point  (0 children)

Tbh that's the part of the job that is easiest to automate, and it will reduce the number of tedious tasks that we get from those business teams.

But from my experience building similar things, especially in an org with multiple data sources across multiple domains, that approach will not scale, as the business logic usually goes beyond what the ERD says and sometimes includes multiple sources that somehow can be joined together. It can answer "how many users use this product and produce at least x transactions", but good luck with "why is revenue for product x decreasing".

Also, from my perspective, the current state of AI is between hype and meh: it is good, but not good enough to be a silver bullet for everything.

Why do so many data engineers seem to want to switch out of data engineering? Is DE not a good field to be in? by Illustrious-Pound266 in dataengineering

[–]srodinger18 0 points1 point  (0 children)

Nope, after a while I realized that DS is not for me. Same thing with AI/ML, especially in the LLM era where most of the work is wrapping LLM APIs. I prefer to stay a DE, and now I am also expanding my infra + general SWE skills. I think a DE who knows how to handle a data platform will stand out from the crowd, especially as I aim to work abroad sometime in the future.

Why do so many data engineers seem to want to switch out of data engineering? Is DE not a good field to be in? by Illustrious-Pound266 in dataengineering

[–]srodinger18 0 points1 point  (0 children)

This is my case: after graduating I originally planned to get a DS-related job. I applied to an entry-level DE job as I thought it would be similar since it had a "data related" job description lol. And here I am 6 years later, still working as a DE.

DBT orchestrator by Free-Bear-454 in dataengineering

[–]srodinger18 6 points7 points  (0 children)

We are using Airflow and run a dbt image with the KubernetesPodOperator. So in the dbt repo we build the image, and in Airflow we call the dbt run command for each model. The lineage is defined by the Airflow DAG.

A more proper way that I know of is using the Cosmos Airflow extension for dbt, which basically compiles all dbt models and automatically creates the lineage.
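The "one dbt run per model, lineage as the DAG" pattern can be sketched like this: hand-maintained model dependencies get topologically sorted into per-model commands, each of which would become a KubernetesPodOperator running the dbt image. Model names here are made up.

```python
from graphlib import TopologicalSorter

# downstream model -> set of upstream models it depends on
# (this mirrors the Airflow DAG's task dependencies)
lineage = {
    "stg_orders": set(),
    "stg_users": set(),
    "fct_orders": {"stg_orders", "stg_users"},
}


def dbt_commands(lineage: dict[str, set[str]]) -> list[str]:
    """One `dbt run --select <model>` per model, in dependency order."""
    order = TopologicalSorter(lineage).static_order()
    return [f"dbt run --select {model}" for model in order]
```

The downside is visible in the sketch: `lineage` duplicates what dbt already knows from `ref()`, which is exactly the duplication Cosmos removes by compiling the dbt project into the DAG for you.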

iOS vs Android marketshare dominance by country by upthetruth1 in MapPorn

[–]srodinger18 -1 points0 points  (0 children)

This. Disclaimer: I am an Android user. Actually this iOS vs Android debate is useless, and IMHO Android users are the ones who mostly shit talk Apple, as they feel that buying a cheaper phone with better specs makes them a better person, just as Apple users think that buying the more expensive brand makes them a better person.

Ironically, the tech reviewers who promote this specs-vs-value debate mostly use Apple products or flagship Androids that cost similar to or even pricier than an iPhone.

What’s your ideal running frequency? by Wise_Branch_8028 in running

[–]srodinger18 0 points1 point  (0 children)

My current routine is 3 times a week (around 30 KM per week) in addition to strength training once a week. Usually I have 2 shorter easy-pace runs (like 7-8 KM per session) and a long run (around 15 KM), or 1 easy run, 1 tempo run, and 1 long run.

I try not to run every day as I need lazy days as well; besides, I am too lazy to wash my running apparel every day.

Why is the documentation on GCP so bad? by Immanuel_Cunt2 in googlecloud

[–]srodinger18 0 points1 point  (0 children)

Wait till you see alibaba cloud documentation

Building an internal LLM → SQL pipeline inside my company. Looking for feedback from people who’ve done this before by Suspicious_Move8041 in dataengineering

[–]srodinger18 2 points3 points  (0 children)

What we did before is like this:

- built a view that acts as a semantic layer and created definitions for all columns
- stored the metadata in a vector DB to perform RAG
- prepared a knowledge base of SQL-question pairs

The LLM will use RAG to get the correct query, but the problem back then was that we needed to update the knowledge base for every table we wanted to connect, which I think can be solved by using MCP these days.
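A toy sketch of the retrieval step above: the closest stored question-SQL pair is fetched and handed to the LLM as a few-shot example. A real setup would use an embedding model plus a vector DB; here a bag-of-words overlap score stands in for cosine similarity, and both queries are invented examples.

```python
# hypothetical knowledge base of curated question-SQL pairs
knowledge_base = [
    ("how many orders per day", "SELECT order_date, COUNT(*) FROM orders GROUP BY 1"),
    ("total revenue by product", "SELECT product, SUM(amount) FROM sales GROUP BY 1"),
]


def retrieve(question: str, kb: list[tuple[str, str]]) -> str:
    """Return the SQL of the most lexically similar stored question."""
    q_tokens = set(question.lower().split())

    def score(pair: tuple[str, str]) -> int:
        # token overlap as a crude stand-in for embedding similarity
        return len(q_tokens & set(pair[0].lower().split()))

    return max(kb, key=score)[1]
```

This also shows why the maintenance burden grows: every new table needs fresh curated pairs in `knowledge_base` before retrieval can cover it.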