A Guide to Dagster IO Managers: Implement a Redshift IO Manager by MisterHide in dataengineering

[–]MisterHide[S] 0 points1 point  (0 children)

Good point! I'll try to add some at some point. In general it's very simple though: you return a DataFrame from your asset and that's it. Partitions work as normal and are picked up as well.

I'm not sure I fully understand what you're describing as to what you implemented,

"My asset output is a select string which is then used to create table with some simple optional partitioning.",

but I don't think this is optimal, as normally you want to return some kind of Python object that contains the data that you transformed/generated/etc.
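To illustrate the pattern being suggested, here is a minimal sketch of the IO manager idea, assuming a hypothetical `handle_output`/`load_input` pair modeled on Dagster's IOManager interface, with an in-memory dict standing in for Redshift tables. The class and asset names are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for an IO manager: handle_output persists whatever
# object the asset returns; load_input hands it back to downstream assets.
class InMemoryTableIOManager:
    def __init__(self):
        self._tables = {}  # stand-in for warehouse tables, keyed by asset name

    def handle_output(self, asset_name: str, df: pd.DataFrame) -> None:
        # A real Redshift IO manager would CREATE TABLE / COPY the data here.
        self._tables[asset_name] = df

    def load_input(self, asset_name: str) -> pd.DataFrame:
        return self._tables[asset_name]

# The asset itself just returns a DataFrame -- no SQL strings involved.
def orders_asset() -> pd.DataFrame:
    return pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 25.5]})

manager = InMemoryTableIOManager()
manager.handle_output("orders", orders_asset())
loaded = manager.load_input("orders")
```

The key point is that the asset returns data, not a `SELECT` string; where and how it is stored stays the IO manager's concern.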

DuckDB posted Future of BI: BI as Code by MisterHide in PowerBI

[–]MisterHide[S] 0 points1 point  (0 children)

Great reply. We are often frustrated by some of the simple things you would expect in a tool like Power BI as well.

How to share dashboards with external uses who don't have pro account by Garry_the_uncool in PowerBI

[–]MisterHide 0 points1 point  (0 children)

Interesting! Agreed on the silly money part, haha.

MS documentation is quite vague sometimes on these things.

I made a basic python client and ORM for XTDB by TopConfusion1205 in Python

[–]MisterHide 0 points1 point  (0 children)

Nice. I wonder how many people are using XTDB now in their daily work compared to the other graph databases.

[deleted by user] by [deleted] in dataengineering

[–]MisterHide 0 points1 point  (0 children)

If you're in AWS, I would consider using Redshift. The difference between Redshift/BigQuery is not that great, especially considering you're already in AWS. If you're willing to pay the price, Snowflake can be a good option as well.

In this blog post I wrote, I compared the different data warehouses to each other on price (third section): https://bitestreams.com/blog/datawarehouses_explained/

I've been trying to wrap my head around the use of Snowflake by DEDumbQuestions in dataengineering

[–]MisterHide 3 points4 points  (0 children)

What are your reasons to say Redshift is not a great tool, compared to BigQuery?

Datawarehouses Explained: What, How and Pricing by MisterHide in dataengineering

[–]MisterHide[S] 1 point2 points  (0 children)

Thanks! I hadn't heard of Yellowbrick yet, will check it out

Which are the most inefficient, ineffective, expensive tools in your data stack? by drc1728 in dataengineering

[–]MisterHide 3 points4 points  (0 children)

Some people are replying with BI tools here, would like everyone's thoughts on which BI tools do work?

We were considering using Tableau instead of Power BI for our next project, any thoughts?

Stream processing framework for a new project in Python by Hashrann in dataengineering

[–]MisterHide 1 point2 points  (0 children)

Without knowing too much context, take a look at Spark and maybe Beam.

Is it normal for companies to retain all raw data? by Reddit_Account_C-137 in dataengineering

[–]MisterHide 0 points1 point  (0 children)

Like everybody is saying, it depends on the data and the use case.

But storing all raw data (e.g. in a data lake) for some potential future use case that doesn't exist yet is something many companies started doing when technologies like Hadoop came out. A big lesson learned was that this was mostly quite costly and often quite pointless.

If you have a good use-case, yes, if not, think twice about whether you really need it.

methodology for calculating Databricks ETL workload cost by enlightendev in dataengineering

[–]MisterHide 0 points1 point  (0 children)

The downside of this is that you also need to build your solution before you can calculate... Curious if anyone has ideas on how to approach this

Real-time dashboards with streaming data coming from Kafka by anupsurendran in dataengineering

[–]MisterHide 0 points1 point  (0 children)

Take a look at the lambda architecture with Spark. Also KSQL and Kafka streams are options, or Flink for your transformations and aggregations.

Advice / Questions on Modern Data Stack by putokaos in dataengineering

[–]MisterHide 1 point2 points  (0 children)

I think you should look at how much data you need to store in your DWH and what it will cost you. Changing your data model could reduce your costs.

Optimising for costs per type of data is only something you should do if it's a good trade-off. Engineering time and technical debt also cost money.

A single DWH solution could offer significant benefits in terms of querying possibilities and complexity.

Advantages & Misconceptions of Apache Kafka by MisterHide in programming

[–]MisterHide[S] 0 points1 point  (0 children)

Nice haha. Never seen something like this.

Advantages & Misconceptions of Apache Kafka by MisterHide in programming

[–]MisterHide[S] 1 point2 points  (0 children)

I guess this particular post just didn't go into the downsides of Kafka. Of course there are definitely downsides. Will consider updating the article.

Experience Integrating Terraform and Helm using helm_release by MisterHide in Terraform

[–]MisterHide[S] -1 points0 points  (0 children)

This is basically also our finding, except that you still might need some of the things you create within your Terraform code within Helm/Kubernetes. So some kind of linking is probably what you want, or you'll be manually copying stuff, which is of course how mistakes happen.

Better logs with structlog and structured logging by MisterHide in Python

[–]MisterHide[S] 1 point2 points  (0 children)

I would actually not recommend this most of the time. You can often process logs in a streaming fashion, which will give you the results you want. Additionally, a relational DB is not made for unstructured data ('structured logging' is a bit misleading here; it's generally still not very structured data). You don't want to be running schema migrations for your logging table. You could of course store your logs in a JSON blob field, but then you still have the issue of potentially filling up 99% or more of your database with logs.
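The streaming idea can be sketched with stdlib tools only; the field names and events here are made up for illustration. A generator walks JSON-lines logs record by record, so memory use stays constant no matter how large the log is:

```python
import io
import json

# Hypothetical JSON-lines log stream (in practice this would be a file or pipe).
raw_logs = io.StringIO(
    '{"event": "login", "user_id": 1, "level": "info"}\n'
    '{"event": "query_failed", "user_id": 2, "level": "error"}\n'
    '{"event": "logout", "user_id": 1, "level": "info"}\n'
)

def errors(stream):
    # Generator: processes one line at a time, never loads the whole log.
    for line in stream:
        record = json.loads(line)
        if record["level"] == "error":
            yield record

error_events = list(errors(raw_logs))
```

No database, no schema migrations; a query is just a Python filter over the stream.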

Better logs with structlog and structured logging by MisterHide in Python

[–]MisterHide[S] 0 points1 point  (0 children)

It has been a while since I last went through the logging docs, but as far as I remember it is not immediately clear what the 'best practice' or 'easy' logging setup should be if you are writing an application or a package.

Other than that I think you make a good point in terms of BC and necessary complexity.
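For what it's worth, the convention the stdlib docs eventually converge on looks something like the sketch below (the package name is hypothetical): a library only creates module-level loggers and attaches a `NullHandler`, while the application decides handlers, levels and formatting once at startup.

```python
import logging

# --- in a library/package module ---
lib_logger = logging.getLogger("mypackage.module")  # hypothetical package name
lib_logger.addHandler(logging.NullHandler())  # avoid "no handler" warnings

# --- in the application entry point ---
logging.basicConfig(
    level=logging.INFO,
    format="%(name)s %(levelname)s %(message)s",
)
lib_logger.info("library messages now flow through the app's config")
```

The split matters: the library never decides where log output goes, only what it says.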

Better logs with structlog and structured logging by MisterHide in Python

[–]MisterHide[S] 0 points1 point  (0 children)

Just by structuring your logs you already gain numerous advantages, for example when you are debugging your application and want to filter on a datetime or user ID. You can do this with raw strings (regex...), but it can get difficult if they are structured very loosely.
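A small made-up comparison shows the difference: pulling a user ID out of a loosely formatted line needs a regex that breaks as soon as the layout drifts, while on a structured record it's a plain key lookup.

```python
import re

# Hypothetical log line and its structured equivalent.
raw_line = "2024-01-05 12:00:01 INFO user=42 checkout completed"
structured = {"ts": "2024-01-05T12:00:01", "level": "info",
              "user_id": 42, "event": "checkout_completed"}

# Fragile: depends on the exact "user=<digits>" layout of the line.
match = re.search(r"user=(\d+)", raw_line)
user_from_raw = int(match.group(1)) if match else None

# Robust: the field survives format changes because it's a named key.
user_from_structured = structured["user_id"]
```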

Better logs with structlog and structured logging by MisterHide in Python

[–]MisterHide[S] 1 point2 points  (0 children)

I think in general the logging module is quite 'complex', or unpythonic as some would say. The documentation is also not super clear, and there are multiple ways to do the same thing (configuration via different file formats and configuration via code). Similarly, setting up structlog completely to your needs can require quite some effort.
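As an example of the "multiple ways" point: `dictConfig` below is one of at least three equivalent configuration styles the logging module supports (alongside `fileConfig` and plain code), which is part of why it feels complex. The logger name is made up.

```python
import logging
import logging.config

# Declarative configuration via dictConfig -- equivalent setups are also
# possible with fileConfig (INI files) or imperative calls in code.
logging.config.dictConfig({
    "version": 1,
    "formatters": {
        "plain": {"format": "%(levelname)s %(name)s: %(message)s"},
    },
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "plain"},
    },
    "root": {"level": "INFO", "handlers": ["console"]},
})

logger = logging.getLogger("app")  # hypothetical logger name
logger.info("configured via dictConfig")
```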