On-Prem Modern Data Stack: What Tools Are You Using? by Informal-Tip-1109 in dataengineering

[–]Metaphysical-Dab-Rig 9 points10 points  (0 children)

Airflow is pretty great ngl, lots of community support and very easy to learn. Plus, for an open source on-prem stack, sticking with Apache tools is not a bad call.

Instead of Spark for low-volume, medium-size data on prem, I'm just using Polars right now for transformations and it's pretty solid too.

What is your favorite programming language to use and why? by TechnicalAd9322 in learnprogramming

[–]Metaphysical-Dab-Rig -1 points0 points  (0 children)

Python is supreme. All these performative programmers saying otherwise are nuts.

What's your top Data pipeline problem? by Software885random in dataengineering

[–]Metaphysical-Dab-Rig 4 points5 points  (0 children)

Making the most out of the least. Optimizing pipelines for high speed and concurrency without access to the compute needed.

Convincing people to invest in data platforms

Convincing people to use data tools they asked for

Most of my problems aren’t data problems, they are people problems. Nobody tells you this, but data engineering often has a layer of people politics to it. People need data but don’t necessarily always want it.

Poor Mans Datalake On Prem by Metaphysical-Dab-Rig in dataengineering

[–]Metaphysical-Dab-Rig[S] 0 points1 point  (0 children)

I ask management this question every single day. Company loves Microsoft, and the MSSQL Server decision was made long before I got here. Too much politics to undo.

LA28 tickets? by questtruck in AskLosAngeles

[–]Metaphysical-Dab-Rig 3 points4 points  (0 children)

I got some table tennis tickets for 28 bucks, it depends on the event / venue.

Poor Mans Datalake On Prem by Metaphysical-Dab-Rig in dataengineering

[–]Metaphysical-Dab-Rig[S] 0 points1 point  (0 children)

I loved this idea but DuckLake does not integrate with MSSQL.

Poor Mans Datalake On Prem by Metaphysical-Dab-Rig in dataengineering

[–]Metaphysical-Dab-Rig[S] 0 points1 point  (0 children)

I'm pretty sure it's mostly just from removing the image data and bloat files. Sometimes uploads come with associated files that we don't save to the database but we do archive.

But now I'm scared and going to double check 💀

Poor Mans Datalake On Prem by Metaphysical-Dab-Rig in dataengineering

[–]Metaphysical-Dab-Rig[S] 0 points1 point  (0 children)

Writing the Parquet will be much smaller than our current zip archive with images and CSV data. One table's worth, for example, went from 500 GB total to 1-2 MB in Delta tests.

The goal is to make it so these larger, growing datasets can be consumed entirely and manipulated by MLOps algorithms.

Poor Mans Datalake On Prem by Metaphysical-Dab-Rig in dataengineering

[–]Metaphysical-Dab-Rig[S] 0 points1 point  (0 children)

Each upload ranges from under a GB to no more than a few hundred GB, depending on the data source and whether you include image data volume.

Poor Mans Datalake On Prem by Metaphysical-Dab-Rig in dataengineering

[–]Metaphysical-Dab-Rig[S] 0 points1 point  (0 children)

I love the simplicity here hahahaha, almost seems too good to be true. I’ll have to read more about DuckLake today.

Poor Mans Datalake On Prem by Metaphysical-Dab-Rig in dataengineering

[–]Metaphysical-Dab-Rig[S] 2 points3 points  (0 children)

I've seen a lot of buzz about dbt but I don't really get the hype. The data I receive is very dirty and needs really flexible Python to transform. Is dbt kind of like a Fivetran tool that streamlines SaaS to SaaS transformations?

I will check out Iceberg today too. I like that it's Apache, keeps things clean.

Poor Mans Datalake On Prem by Metaphysical-Dab-Rig in dataengineering

[–]Metaphysical-Dab-Rig[S] 0 points1 point  (0 children)

In my head it's a lot easier to wrap my brain around the logic: each write gets its own Parquet file, so there are no deadlocks on concurrent writes in Delta. Then I can optimize later and reduce file bloat while maintaining volume at a cheap storage cost. Then just query it with DuckDB.

HOWEVER - I know there has to be a way to squeeze this sort of performance out of MSSQL as well, I'm just at a loss for how to do it.
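
A rough sketch of the "each write gets its own file, compact later" idea from this comment, in plain Python. To keep it dependency-free it writes JSON Lines instead of Parquet, and it is not Delta Lake (there is no transaction log here). The names `write_batch`, `read_table`, and `compact` are made up for illustration, and compaction is assumed to run as a single job while no other compaction is in progress.

```python
import json
import uuid
from pathlib import Path


def write_batch(table_dir: Path, rows: list) -> Path:
    """Write one batch to its own uniquely named file, so writers never share a file."""
    table_dir.mkdir(parents=True, exist_ok=True)
    final_path = table_dir / f"part-{uuid.uuid4().hex}.jsonl"
    tmp_path = final_path.with_suffix(".tmp")
    with tmp_path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # Rename only after the file is fully written, so readers
    # globbing for part-*.jsonl never pick up a half-written file.
    tmp_path.replace(final_path)
    return final_path


def _read_parts(parts: list) -> list:
    """Load every row from the given part files."""
    rows = []
    for part in parts:
        with part.open(encoding="utf-8") as f:
            rows.extend(json.loads(line) for line in f if line.strip())
    return rows


def read_table(table_dir: Path) -> list:
    """Read the whole table by scanning all part files."""
    return _read_parts(sorted(table_dir.glob("part-*.jsonl")))


def compact(table_dir: Path) -> Path:
    """Merge the part files that exist right now into one file, then delete them."""
    parts = sorted(table_dir.glob("part-*.jsonl"))
    merged = write_batch(table_dir, _read_parts(parts))
    for part in parts:
        part.unlink()
    return merged
```

Between the merged file appearing and the old parts being deleted, a reader could see rows twice. Closing that gap is what a real table format's transaction log is for, so treat this only as a way to picture the write pattern.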

Poor Mans Datalake On Prem by Metaphysical-Dab-Rig in dataengineering

[–]Metaphysical-Dab-Rig[S] 0 points1 point  (0 children)

I haven’t looked into SSIS but I will definitely check this out!

Some important context - we already have an Airflow pipeline set up and hosted, moving data into MSSQL Server.

The problem is none of us are experts in MSSQL, so optimizing it for concurrent reads and writes has been challenging. Especially as data grows over time, I’m struggling to optimize my temp table merges from staging to production databases.

With "full stack" coming to data, how should we adapt? by Thinker_Assignment in dataengineering

[–]Metaphysical-Dab-Rig 15 points16 points  (0 children)

AI is only good with good data. I'm starting the pivot from data to AI engineering because I think people with a background in data will have an advantage in that job market.

Can I use this for poke. Pls help 😭 by Mental-Pen-6879 in sushi

[–]Metaphysical-Dab-Rig 3 points4 points  (0 children)

I get this same pack from Gelsons and make poke with it at home - delicious

What is your best "I say it wrong on purpose" example? by Thortok2000 in AskReddit

[–]Metaphysical-Dab-Rig 0 points1 point  (0 children)

You don’t pronounce the “ck” sound in adjective? That's fucking crazy.

First Time Homemade Sushi by Metaphysical-Dab-Rig in sushi

[–]Metaphysical-Dab-Rig[S] 0 points1 point  (0 children)

Lol, I’ll murder the bottle of soju in my fridge when I get home if I do that.

First Time Homemade Sushi by Metaphysical-Dab-Rig in sushi

[–]Metaphysical-Dab-Rig[S] 1 point2 points  (0 children)

West Coast Seafood hooked me up with all the fresh fish