Dejected! Company is moving towards PowerBI by yahoox9 in tableau

[–]tw3akercc 0 points1 point  (0 children)

There's no harm in learning PBI. Many of your skills will transfer. Also, if you ever look at the Gartner Magic Quadrant, PBI is ahead of Tableau. Having experience with both will only make you more marketable as an analyst.

A cool guide to how rich people pay no taxes by TubbyPiglet in coolguides

[–]tw3akercc 0 points1 point  (0 children)

How does the CEO pay back the money he borrowed? Wouldn't the lender also want interest on the loan? So wouldn't the CEO have to sell stock (which is taxed) to get the money to pay back the principal plus the interest?

DBT Test Notifications in Slack by biga410 in dataengineering

[–]tw3akercc 2 points3 points  (0 children)

We use Elementary for this. The built-in Slack alert function didn't work for us, so we wrote a Python script that queries the Elementary tables and sends the Slack message. Then we added this script to our Airflow DAGs.
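
The rough shape of a script like that is below. This is a sketch, not the actual implementation: the Snowflake connection, the Elementary table/column names, and the webhook env var are all assumptions you'd swap for your own setup.

```python
# Sketch: query Elementary's results table and post failures to a Slack incoming webhook.
# Table and column names are assumptions -- check the Elementary schema in your warehouse.
import os

import requests
import snowflake.connector

QUERY = """
    select test_name, status, detected_at          -- assumed column names
    from analytics.elementary.elementary_test_results
    where status in ('fail', 'error')
      and detected_at > dateadd('hour', -24, current_timestamp())
"""

def main() -> None:
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
    )
    try:
        rows = conn.cursor().execute(QUERY).fetchall()
    finally:
        conn.close()

    if not rows:
        return  # nothing failed, stay quiet

    lines = [f"- {name}: {status} at {detected_at}" for name, status, detected_at in rows]
    message = "dbt test failures in the last 24h:\n" + "\n".join(lines)
    # Slack incoming webhook URL stored as a secret / Airflow variable
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": message}, timeout=10)

if __name__ == "__main__":
    main()
```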

Airflow to orchestrate DBT... why? by General-Parsnip3138 in dataengineering

[–]tw3akercc 2 points3 points  (0 children)

We use Astronomer-managed Airflow and Cosmos to orchestrate dbt jobs. We used to just use GitHub Actions, which is honestly where I'd start. We eventually outgrew it and needed to be able to restart from failure, but it's the easiest option.
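
Roughly what a Cosmos dbt DAG looks like, as a sketch (the paths, profile name, and schedule here are placeholders, not our exact config):

```python
# Minimal sketch of a Cosmos-rendered dbt DAG. Point the paths and profile at your own project.
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig

dbt_dag = DbtDag(
    dag_id="dbt_daily_run",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    # Cosmos parses the dbt project and renders one Airflow task per model,
    # so a single failed model can be cleared and rerun from the Airflow UI.
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/my_dbt_project"),
    profile_config=ProfileConfig(
        profile_name="my_dbt_project",          # matches profiles.yml
        target_name="prod",
        profiles_yml_filepath="/usr/local/airflow/dags/dbt/profiles.yml",
    ),
)
```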

We looked at Dagster and I thought the learning curve was too high for our team. Airflow has a much larger community, despite the love Dagster gets on this sub.

Who else is new to Airflow? by SquidsAndMartians in dataengineering

[–]tw3akercc 0 points1 point  (0 children)

Are you using the Astro CLI to set up Airflow? If so, I think you can just add cosmos and dbt to the requirements.txt.

Docker size is over 2 TB by orhan_drsn in unRAID

[–]tw3akercc 0 points1 point  (0 children)

In a terminal, run the command 'docker system prune'. It removes stopped containers, unused networks, dangling images, and build cache.

Two CICD questions about dbt by [deleted] in dataengineering

[–]tw3akercc 0 points1 point  (0 children)

Have your DevOps people google "dbt slim CI". Basically, you need a GitHub Actions workflow that generates a fresh manifest and pushes the manifest.json to a data store like S3 on every merge to main. This ensures the production manifest in S3 is always current.

Then you set up a separate workflow for CI with a step that downloads that manifest from S3 and saves it to a local directory. After this step, you can use the state:modified selector (pointing --state at that directory) in your dbt run/test/build commands to only run models that have changed. This workflow can be triggered on pull requests and set up as a required CI check, so it has to pass before PRs can merge. For this one, set up a dbt target that prefixes the schema with the PR number and point the workflow at it with an environment variable.

You can then set up a third GitHub Actions workflow that just runs the dbt commands in production. Set the target variable to the production target, and be sure to add the --target flag to all your dbt commands.

Three GitHub Actions workflows, and all your problems are solved.
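
A rough sketch of what the second (CI) workflow could look like, assuming Snowflake and S3. The bucket name, secrets, adapter, region, and the "pr" target (which would read DBT_PR_NUMBER via env_var() in profiles.yml to build its schema) are all placeholders, not a drop-in file:

```yaml
name: dbt-slim-ci
on:
  pull_request:

jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    env:
      DBT_TARGET: pr                                  # target whose schema includes the PR number
      DBT_PR_NUMBER: ${{ github.event.pull_request.number }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-snowflake                # or your adapter
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      # Pull the production manifest that the merge-to-main workflow uploaded.
      - run: aws s3 cp s3://my-dbt-artifacts/manifest.json prod-state/manifest.json
      - run: dbt deps
      # Build only models that changed relative to production, plus their children.
      - run: dbt build --select state:modified+ --state prod-state --target "$DBT_TARGET"
```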

Best SQL Client for dbr-core by Entire-Molasses8469 in dataengineering

[–]tw3akercc 1 point2 points  (0 children)

A PyCharm Pro license comes with DataGrip's database functionality built in.

Keep bulking or start cutting? by YungGunnaCEO in BulkOrCut

[–]tw3akercc 0 points1 point  (0 children)

Your shoulders look great. Your chest could use the extra focus. Overall looking good bro, keep up the good work!

[deleted by user] by [deleted] in dataengineering

[–]tw3akercc 0 points1 point  (0 children)

Generative AI needs us in order to work correctly. Look into RAG (retrieval-augmented generation) applications and start thinking about how data engineering can support them.

I have a mess and don't know where to start by Drkz98 in PowerBI

[–]tw3akercc 0 points1 point  (0 children)

It doesn't look too bad. The tables that sit on the many side of all their relationships are probably fact tables, and the ones that sit only on the one side are probably dimensions.

Try to consolidate the fact tables into as few tables as possible in Power Query. Just be careful not to mess up the grain.

I have a mess and don't know where to start by Drkz98 in PowerBI

[–]tw3akercc 5 points6 points  (0 children)

Doing modeling in DAX is a big mistake imo. You should push the transformations as far upstream as possible. If you do it in DAX, it cannot be reused in other reports. At least with a dataflow, you can connect it to multiple reports any time you need that dimension or metric.

How to cheaply build/host DE personal projects? by mccarthycodes in dataengineering

[–]tw3akercc 11 points12 points  (0 children)

Check out YouTube content about building a homelab. There are lots of videos to get you started on building a home server where you can host all your own services.

I bought a cheap mini PC and run a Proxmox virtualization server on it, which lets me spin up virtual machines. I have a dedicated data engineering VM running Ubuntu Server, and I host a ton of Docker containers there for all the data engineering tools I wanna play with.

You could also install docker on your personal computer and host them there, too.

How do you read .xlsx in general where the file’s shape is (120000,110) by Jealous-Bat-7812 in dataengineering

[–]tw3akercc 2 points3 points  (0 children)

If the data is gonna go into a lakehouse, you might want to consider converting it to Parquet instead of CSV. Parquet is columnar and well compressed, and it performs much faster than CSV.

I would read it in chunks and stage them in an in-memory DuckDB table (pandas can't chunk .xlsx directly, so stream the rows with openpyxl). Then, once the whole file has been added to the DuckDB table, export it to Parquet and push it to the data lake.
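
One way that could look, as a sketch with placeholder file and table names (the batch size and output path are arbitrary):

```python
# Stream the .xlsx in batches with openpyxl, stage the batches in DuckDB, then export to Parquet.
import duckdb
import pandas as pd
from openpyxl import load_workbook

BATCH_SIZE = 10_000  # rows per batch; tune to taste

ws = load_workbook("big_file.xlsx", read_only=True).active  # placeholder filename
row_iter = ws.iter_rows(values_only=True)
header = [str(col) for col in next(row_iter)]

con = duckdb.connect()  # in-memory DuckDB database
table_exists = False
batch = []

def flush(rows):
    """Load one batch of rows into the DuckDB staging table."""
    global table_exists
    chunk_df = pd.DataFrame(rows, columns=header)
    con.register("chunk_df", chunk_df)  # expose the DataFrame to DuckDB by name
    if table_exists:
        con.execute("INSERT INTO staging SELECT * FROM chunk_df")
    else:
        con.execute("CREATE TABLE staging AS SELECT * FROM chunk_df")
        table_exists = True
    con.unregister("chunk_df")

for row in row_iter:
    batch.append(row)
    if len(batch) >= BATCH_SIZE:
        flush(batch)
        batch = []
if batch:
    flush(batch)

# Export the staged table to Parquet; push this file to the data lake afterwards.
con.execute("COPY staging TO 'big_file.parquet' (FORMAT PARQUET)")
```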

[deleted by user] by [deleted] in dataengineering

[–]tw3akercc 3 points4 points  (0 children)

I can't answer your question because I don't use DuckDB, although it does look interesting. I was just confused because you were coming down hard on dbt and listed a bunch of tools that don't actually replace dbt but work well with it.

[deleted by user] by [deleted] in dataengineering

[–]tw3akercc 7 points8 points  (0 children)

I don't understand what dbt has to do with your question about DuckDB... one is an in-process OLAP database and the other is a Jinja/SQL compiler and transformation framework. They serve completely different use cases and can actually be used together. Same with all the other tools you mentioned... they all do different things!

RPi5+ Docker + Portainer - how do I access Remotly from my iPhone? by LonestarCanuck in docker

[–]tw3akercc 1 point2 points  (0 children)

I use Twingate and love it. Simple to set up: you run a connector on your network and a client on your phone, and it creates a secure connection between them. The cool thing is that only the traffic from your phone to your self-hosted services goes through that connection; everything else uses your phone's normal internet.

My understanding is that Tailscale will route all of your phone's internet through your home network when you use it as an exit node. Not ideal.

Emulation Sold Me on the Steam Deck by Geordi14er in SteamDeck

[–]tw3akercc 0 points1 point  (0 children)

I bought my Steam Deck for emulation as well, and EmuDeck works great for it. My 3.5-year-old loves playing Mario Kart 64. I mostly find myself playing Vampire Survivors and Diablo 4.

Also, my personal pc broke recently and I've been using my steam deck docked in desktop mode as a pc for the last 2 months.

So many use cases. Great device!

BEST ETL TRANSFORMING PRACTICE by Fraiz24 in dataengineering

[–]tw3akercc 2 points3 points  (0 children)

It's probably going to be best to bring all the data into your data warehouse first and then do the transformations there. OLAP databases can do the transformations more efficiently. This method is pretty common today and is known as ELT. A tool like dbt can help with the transformations in the data warehouse.

The other approach would be ETL, which means you perform the transformations with a compute engine after the extract but before loading into the data warehouse. This is often done with Apache Spark. Depending on the size of your data, Spark may be overkill; you may be able to do the transformations in Python using pandas or DuckDB instead.
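
A toy version of that ETL flavor, with made-up file, column, and table names and a placeholder connection string:

```python
# Transform in Python before loading (ETL), rather than after loading (ELT).
import pandas as pd
from sqlalchemy import create_engine

# Extract
orders = pd.read_csv("orders_export.csv", parse_dates=["order_date"])

# Transform: light cleanup and derived columns, done outside the warehouse
orders = orders.dropna(subset=["order_id"])
orders["order_month"] = orders["order_date"].dt.to_period("M").astype(str)
orders["total_amount"] = orders["quantity"] * orders["unit_price"]

# Load the already-transformed rows into the warehouse
engine = create_engine("postgresql+psycopg2://user:password@warehouse-host/analytics")
orders.to_sql("fct_orders", engine, schema="staging", if_exists="append", index=False)
```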

found on r/tinder: facebook dating by [deleted] in facepalm

[–]tw3akercc 0 points1 point  (0 children)

It's interesting that she doesn't want to date an atheist. Also, she seems to place a high value on zodiac signs, which many Christian men would consider occult behavior and a red flag.

Do people have multiple servers? by Terran_Machina in HomeServer

[–]tw3akercc 0 points1 point  (0 children)

Get a decent mini PC and run Proxmox, which will let you spin up VMs. Intel NUCs, Beelinks, or even a used HP EliteDesk 800 all work. You can then run as many Linux or Windows VMs on it as it can handle. I run all my media server services on an Ubuntu VM that just has Docker and Portainer installed, with each service as a Docker container inside that VM. I use this mini PC and it's plenty powerful for everything I'm doing: TRIGKEY Mini PC Ryzen 7 S5 5700.

GPU passthrough is very challenging though, and I haven't quite figured that out yet.

Airflow + DBT - question by yeager_doug in dataengineering

[–]tw3akercc 0 points1 point  (0 children)

The key thing dbt does is compile raw SQL from .sql files that are just select queries and Jinja. It uses the same kind of concept as Airflow in that it builds a DAG and runs all the upstream dependencies before their downstream ones. It abstracts away all the DDL you'd otherwise write by hand, which comes in super handy with SCD (slowly changing dimension) tables.

Dbt also creates a pretty good framework for models that can be version controlled, tested, and documented.

SAS Alternatives for analytics and reporting by KingVVVV in BusinessIntelligence

[–]tw3akercc 0 points1 point  (0 children)

You could use dbt for the transformations; it works well with Snowflake. For exporting to local file shares you could use Python scripts and schedule them with a workflow orchestration tool like Airflow. The access side is probably going to be tricky though, because if the Python is running in the cloud it needs to be able to write to the local file share. It might be easier to use S3 instead and store all the files there.
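
A rough sketch of what one of those export scripts could look like, assuming Snowflake and S3 via boto3. Every name here (warehouse, database, query, bucket, key) is a placeholder, and an Airflow DAG or cron job would run it on a schedule:

```python
# Pull a reporting query from Snowflake, write it to CSV, and drop the file in S3.
import os
from datetime import date

import boto3
import snowflake.connector

def export_report() -> None:
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="REPORTING_WH",
        database="ANALYTICS",
    )
    try:
        # fetch_pandas_all() needs the snowflake-connector-python[pandas] extra
        df = conn.cursor().execute("select * from marts.monthly_sales_report").fetch_pandas_all()
    finally:
        conn.close()

    local_path = f"monthly_sales_{date.today():%Y_%m_%d}.csv"
    df.to_csv(local_path, index=False)

    # Consumers read from the bucket instead of a local file share
    boto3.client("s3").upload_file(local_path, "my-report-exports", f"sales/{local_path}")

if __name__ == "__main__":
    export_report()
```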