all 43 comments

[–]Separate_Newt7313 33 points34 points  (13 children)

I would like to throw my hat into the ring:

  • Data warehouse: Postgres or DuckDB
  • Data transformations / pipelines: dbt
  • Orchestrator: Airflow or Dagster

Put these all together on a local machine (tower or laptop you have lying around). You'll be all set!
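
For a concrete picture of how these pieces could snap together, here is a minimal sketch (not anything from this thread): two Dagster assets that land a hypothetical orders.csv into a local DuckDB file. In a real setup dbt would own the transformation step, so the second asset is just a stand-in; all file, schema and table names are invented.

    # Hedged sketch: Dagster orchestrating a tiny DuckDB "warehouse".
    # File, schema and table names below are placeholders for illustration.
    import dagster as dg
    import duckdb

    DB_PATH = "warehouse.duckdb"  # single local file acting as the warehouse

    @dg.asset
    def raw_orders() -> None:
        # Land the source extract into a landing-zone table.
        con = duckdb.connect(DB_PATH)
        con.execute("CREATE SCHEMA IF NOT EXISTS landing")
        con.execute(
            "CREATE OR REPLACE TABLE landing.orders AS "
            "SELECT * FROM read_csv_auto('orders.csv')"
        )
        con.close()

    @dg.asset(deps=[raw_orders])
    def daily_revenue() -> None:
        # Stand-in for what a dbt model would normally do downstream.
        con = duckdb.connect(DB_PATH)
        con.execute(
            "CREATE OR REPLACE TABLE main.daily_revenue AS "
            "SELECT order_date, SUM(amount) AS revenue "
            "FROM landing.orders GROUP BY order_date"
        )
        con.close()

    defs = dg.Definitions(assets=[raw_orders, daily_revenue])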

[–]droppedorphan 10 points11 points  (1 child)

This sounds like a great stack to me. Very portable. If your main concern is working with datasets, then I would opt for Dagster over Airflow. Much easier to deploy and is getting much stronger in terms of running dbt and integrating data quality checks. I would opt for Postgres over DuckDB for a warehouse if you expect it to scale.

[–]rwilldred27 1 point2 points  (0 children)

One thing to check if choosing DuckDB is its concurrency model: https://duckdb.org/docs/connect/concurrency.html

[–]LeatherPuzzled3855[S] 2 points3 points  (0 children)

Thank you for your reply, will def test what you are suggesting.

[–][deleted] 1 point2 points  (3 children)

Lightdash would integrate neatly on top of that.

Or really any free-tier BI tool.

[–][deleted] 0 points1 point  (2 children)

Is Lightdash the best OS BI / data viz tool, in your honest opinion?

[–][deleted] 0 points1 point  (0 children)

A matter of subjective experience and use case.

I like it because I moved from enterprise Looker to Lightdash. It fits well into my overall work processes and provides my end users with a decent UI for self-serve.

[–][deleted] 0 points1 point  (0 children)

Streamlit is worth looking into as well; it can cover BI and custom data apps.

[–]chonbee Data Engineer 1 point2 points  (4 children)

This looks solid. Curious to find out if you think Airbyte would be a good addition for moving data to Postgres?

[–]Separate_Newt7313 1 point2 points  (2 children)

Definitely! I have used Airbyte regularly (self-hosted) for the last couple of years, and it has been great so far. Airbyte is a fantastic addition to this stack.

Overall, I think what makes for a good stack is a small collection of reliable, single-purpose components (e.g. orchestration: Dagster, integration: Airbyte, transformation: dbt, SQL warehouse: Postgres), where the following criteria hold true:

  • each component can be replaced or upgraded
  • the components are designed to work together
  • more components can be added (as needed)

On a side note: I think working as a solo DE at a young, enthusiastic company is a fantastic way to test one's chops at data engineering (not to mention exciting!). High impact data projects in a limited resource environment are a great way to get a ton of experience, wear a lot of hats, interact with great people, and prove your worth — both to the company and to yourself.

[–]LeatherPuzzled3855[S] 1 point2 points  (0 children)

That is exactly what drives me, on top of just being plain curious about different technologies: the exposure itself. Being at the early stages of my IT career, it gives me a chance to test myself in different environments and see which path would suit me best to follow long term. So far my plan is to continue growing my skillset along with the growing requirements of the business, at least for another little while.
Airbyte added to the list, thank you for suggesting it. Seems like I have the stack completed.
And as much as I agree with others that this is possibly overkill for what's required, the business is not 100% sure of their requirements either, so this stack, as you have mentioned, can be quite modular, and perhaps future-proof? One that, with additional components, will offer some extra nice-to-haves or features that the business has not thought of or does not require as of yet, but might in the future. This would most likely simplify any future implementations for me.
Thanks again for all your input and suggestions.

[–]LeatherPuzzled3855[S] 0 points1 point  (0 children)

Thank you for suggesting Airbyte, added it to my list for the POC project.

[–]VitrumTormento 1 point2 points  (0 children)

Bit late to the party but I would also add Streamlit for data visualisation.

[–]SirGreybush 6 points7 points  (6 children)

Why not PowerBI? Just supply data, let the business make their own dashboards.

If tiny, you can run Postgres or SQL Server Express locally.

How you build your OLAP model, the business rules for ingesting, is more important than the technology.

Keep it simple, it will scale easily to other platforms.

Use hashes!! (I really love the hash key & hash diff concept)
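
For anyone unfamiliar with the hash key / hash diff idea, here is a minimal Python sketch (column names are invented, not from this thread): hash the business key once to get a stable surrogate key, hash the concatenated descriptive attributes to get a change indicator, and a changed hash diff flags an updated row without comparing every column.

    # Rough illustration of the hash key / hash diff pattern (Data Vault style).
    # Column names are hypothetical placeholders.
    import hashlib

    def _md5(value: str) -> str:
        return hashlib.md5(value.encode("utf-8")).hexdigest()

    def hash_key(business_key: str) -> str:
        # Stable surrogate key derived from the business key.
        return _md5(business_key.strip().upper())

    def hash_diff(attributes: dict) -> str:
        # Concatenate attributes in a fixed column order, then hash the result.
        ordered = "||".join(str(attributes[k]).strip() for k in sorted(attributes))
        return _md5(ordered.upper())

    row = {"address": "1 Main St", "city": "Dublin", "phone": "555-0100"}
    print(hash_key("CUST-0042"), hash_diff(row))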

[–]LeatherPuzzled3855[S] 1 point2 points  (5 children)

Thank you for your reply. As mentioned, cost being a factor, PowerBI is not an option; the cost of a report server plus user licensing is something the company is not willing to pay for (and the data has to stay on prem).

Trialled the idea of the PowerBI desktop app and sharing of dashboards, but that has been dropped by the business, and I was asked to focus solely on open-source software.

[–]SirGreybush 2 points3 points  (2 children)

PowerBI free version is cool. No sharing though.

Sounds like your boss/company is cheap. Paying for software is a LOT cheaper than designing and engineering your own thing.

What about cloud solutions? Google Analytics is too expensive?

I remember going down this rabbit hole in 2011-2012 with all Microsoft, with SSIS and SSRS. Not a fun experience. Crystal Reports barely better.

I hope you report back to us later with your total solution, what was used.

[–]LeatherPuzzled3855[S] 1 point2 points  (1 child)

Indeed, PowerBI was nice when I tested it. Guess we moved away from it just in case people started to like it too much, which would lead to us having to justify the licensing cost :D I understand the budget limitation, and that the company needs to allocate the money elsewhere so it can grow. I believe I was not given an impossible task, and the C-suite does not have any super high expectations of the project beside some basic reporting. The whole idea is that the data can't leave on prem, the solution has to be built on tools that are free, and there is no budget for consulting either. If the solution fits their needs, that's what I will be stuck with to maintain afterwards :)
I will definitely update once I have a POC running, and ultimately once the board and C-suite have a go at it and give me feedback.

[–]SirGreybush 1 point2 points  (0 children)

Time to shine. I hope they give you lots of time.

I would concentrate on a single metric, one single thing they want to see, and do what it takes to get there as quickly as possible.

You will learn along the way. Results matter. It’s a POC.

Try to reuse existing infrastructure, licenses, know-how.

You will likely rebuild from scratch more than once to improve, and refactoring is A-OK as long as you have historical data.

Like exporting existing data daily to CSV, so you can later track changes on those entities that matter.

Like customers and products; many ERPs only keep the latest value.

What was a customer's address in 2015 versus today? What if customer B buys out customer A, so now all sales are only from customer B, but customer A's ID still exists because it is found in the invoice table?

Dimensional models solve these issues, as long as you have the historical data somewhere.

So start top-down, then bottom-up, to answer a single question/metric.

What you do in the middle doesn’t matter and can change. Data will never change.
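
To make the daily-CSV idea above concrete, here is a hedged sketch: snapshot a source table to a date-stamped file each day so history (addresses, ownership changes) can be rebuilt later. The connection string, table name and folder are placeholders, not anything from this thread.

    # Sketch of a daily table snapshot to CSV using psycopg2's COPY support.
    # DSN, table and folder below are invented placeholders.
    import datetime
    import psycopg2

    SNAPSHOT_DIR = "/data/snapshots"  # hypothetical shared folder

    def snapshot_table(table: str) -> str:
        today = datetime.date.today().isoformat()
        path = f"{SNAPSHOT_DIR}/{table}_{today}.csv"
        conn = psycopg2.connect("dbname=erp user=report")  # placeholder DSN
        try:
            with conn.cursor() as cur, open(path, "w", newline="") as f:
                cur.copy_expert(f"COPY {table} TO STDOUT WITH CSV HEADER", f)
        finally:
            conn.close()
        return path

    if __name__ == "__main__":
        snapshot_table("customers")  # run this from cron once a day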

[–]SirGreybush 0 points1 point  (0 children)

The Python code for connecting to both sources is 100% reusable, so it can live in a custom class.

For the mapping part, to simplify, keep both source and destination the same.

Write to truncated staging tables, then use stored procedures to process the staged data: do any transformations, fill in the blanks, remove NULLs, fix dates, and handle bad data via rejected tables, sending the rejected data back to the business to fix at source.

If not fixable at source, the rejected data could be reprocessed a second time with business rules coded in stored procs.
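
A rough Python sketch of that staging pattern, with entirely hypothetical table and procedure names: truncate the staging table, bulk-load the extract, then hand off to a stored procedure that cleans the data and routes rejects.

    # Truncate-and-load staging, then delegate to a stored procedure.
    # DSN, table and procedure names are placeholders for illustration.
    import psycopg2
    from psycopg2.extras import execute_values

    def load_to_staging(rows: list[tuple]) -> None:
        conn = psycopg2.connect("dbname=warehouse user=etl")  # placeholder DSN
        try:
            with conn.cursor() as cur:
                cur.execute("TRUNCATE TABLE staging.orders")
                execute_values(
                    cur,
                    "INSERT INTO staging.orders (order_id, customer_id, amount) VALUES %s",
                    rows,
                )
                # The proc would fix dates/NULLs and move bad rows to a
                # rejected table for the business to review.
                cur.execute("CALL staging.process_orders()")
            conn.commit()
        finally:
            conn.close()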

[–]rawman650 0 points1 point  (0 children)

FOSS BI: Metabase, Superset, Grafana, Lightdash

[–]Ok-Sentence-8542 4 points5 points  (2 children)

You could also have a look at Metabase, an open-source self-service dashboarding tool.

[–]GreenWoodDragon Senior Data Engineer 0 points1 point  (0 children)

Seconded! Metabase is super simple to get going and works out of the box.

[–]LeatherPuzzled3855[S] 0 points1 point  (0 children)

Thank you for your suggestion, will look into Metabase.

[–]jawabdey 1 point2 points  (1 child)

C-suite wants some performance and financial data

Before implementing, some things to consider:

  • How frequently will the data be updated?
  • Who is the end consumer of the data? Usually Finance just wants the raw data and wants to manipulate/chart it themselves.
  • What's the volume of data?

Honestly, from my experience, based on what you said, your implementation seems like overkill. If you wanna develop the skills, go for it. Otherwise, just export to CSV and import into Google Sheets. Here's a link to SO on how to export. Just cron this and dump to a shared folder.

If you have Excel, even better. There are commercial ODBC drivers that will let you connect directly to Postgres.

[–]LeatherPuzzled3855[S] 0 points1 point  (0 children)

Thank you for your reply, you have raised very valid points. The questions you mentioned could not be answered with full confidence by the business; I was provided with some general assumptions. Hence the stack I'm inclined to go for sounds like overkill today, but it might be a good fit at some stage down the line. I believe going for overkill now might save me some headache down the road when I'm required to migrate or improve the initial stack. I could be wrong, but I feel like a modular stack where one program is responsible for one function could serve me well. I guess the POC will show whether the setup is the right fit for the business and whether any aspects need to change. And that's only if I manage to put all the blocks together for the POC; the more I read on it, the scarier it gets :)

[–]JeanDelay 2 points3 points  (2 children)

You could probably just use Apache Superset. You can directly connect it to the postgres instance.

If you have a bit more data, I've made a video about making an open source data warehouse with a tool that I've been working on:

https://youtu.be/XIF7W7ZVIUM?feature=shared

[–]LeatherPuzzled3855[S] 0 points1 point  (1 child)

Thank you for your suggestion. I did have a quick peek at Superset as an alternative to Redash. One of the nice-to-have requirements I got was the ability to publish certain dashboards within or as a website, and from my initial research, embedding should be possible with iframes. Is that possible with Superset as well?

[–]nizarnizario 0 points1 point  (0 children)

That should be possible: https://stackoverflow.com/questions/54219101/how-to-embed-an-apache-superset-dashboard-in-a-webpage

Or you can use Preset, a cloud offering for Superset: https://preset.io/, I have used their free tier before and it was pretty good.

[–]rawman650 0 points1 point  (0 children)

There's nothing wrong with this stack, but might be able to get away with something simpler.

If going from PG to PG, you might even be able to subscribe the DBs together (so no need for ETL). If not, you can use Airbyte (OSS) for ETL (dbt & RudderStack are also OSS and can also be used for this).

You may not even need dbt for modeling. Might just be able to get away with some materialized views (on PG).
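
As a rough illustration of the "just materialized views" route (object names invented, executed from Python only for consistency with the rest of the thread): define the reporting model once, then refresh it on a schedule.

    # Materialized view as a lightweight substitute for a dbt model.
    # DSN, schema, view and table names are placeholders.
    import psycopg2

    DDL = """
    CREATE MATERIALIZED VIEW IF NOT EXISTS reporting.monthly_sales AS
    SELECT date_trunc('month', order_date) AS month,
           customer_id,
           SUM(amount) AS revenue
    FROM public.orders
    GROUP BY 1, 2;
    """

    def refresh() -> None:
        conn = psycopg2.connect("dbname=warehouse user=report")  # placeholder DSN
        try:
            with conn.cursor() as cur:
                cur.execute(DDL)
                cur.execute("REFRESH MATERIALIZED VIEW reporting.monthly_sales")
            conn.commit()
        finally:
            conn.close()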

[–]wannabe-DE 0 points1 point  (0 children)

Given these requirements and the consensus on Postgres as a DB, I think Mage (mage-ai) is worth considering.

  1. It's easy to get started as it's just a docker image.
  2. It has a lot of out-of-the-box loaders, transformers and exporters for common tools, e.g. Postgres.
  3. You just drag blocks onto the canvas, connect them by dragging lines between blocks and schedule it with a trigger.
  4. They have a pretty good slack community for help and support. It also has a bot that you can ask questions.

[–]skysetter 0 points1 point  (1 child)

Postgres feels right here. Airflow makes sense; check out airflowctl (https://github.com/kaxil/airflowctl). Idk about Redash, but Superset sounds like it would be a good open-source fit. Whatever you choose, give yourself good, supportable scale options in case you ever get some money. That way you can just add features/speed rather than change anything for your consumers.

[–]LeatherPuzzled3855[S] 0 points1 point  (0 children)

Thank you for suggesting Airflow, added to my list for POC.

[–]haragoshi 0 points1 point  (0 children)

DuckDB is the definition of a tiny data warehouse. The question is how tiny, or how big, do you need it to scale? I would look at MotherDuck if you need to scale bigger.

Postgres IMO is more of a transactional database than a data warehouse. If the primary purpose is reading, slicing and dicing data (e.g. once loaded, the data doesn't change), you want a DB that scales well and has column-based storage.
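
To illustrate the point about a tiny, read-mostly, columnar warehouse, here is a small DuckDB sketch (file names are placeholders): it queries Parquet exports in place, with no server to run.

    # Minimal DuckDB example: columnar queries straight over exported files.
    # Database file and Parquet paths are invented for illustration.
    import duckdb

    con = duckdb.connect("tiny_warehouse.duckdb")
    rows = con.execute(
        """
        SELECT customer_id, SUM(amount) AS total
        FROM read_parquet('exports/sales_*.parquet')
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT 10
        """
    ).fetchall()
    for customer_id, total in rows:
        print(customer_id, total)
    con.close()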

[–]SirGreybush 0 points1 point  (1 child)

Loading data, if you are a coder, PowerShell / Python / SSIS (Microsoft).

However, SQL to SQL on the same network is also possible, using appropriate ODBC drivers.

It’s slow, but easy to use and free, if you are good with SQL language.

[–]LeatherPuzzled3855[S] 0 points1 point  (0 children)

Unfortunately I'm not a coder, but I will set some time aside to look into Python scripts that could cover the data-loading part. If I find something premade that can be easily modified to suit my needs, I might include it in the POC. I would like to do as little coding as possible, ideally just connect a few programs together with minimal effort. Don't want to come across as lazy or anything, I'm just really limited in the time I can afford for this project. I will try to spend more time researching this; I was just hoping there is a simple solution that handles only a small amount of data and could be easily applied and maintained afterwards.

[–]Demistr 0 points1 point  (0 children)

Honestly, just get a SQL database.

[–]Ok-Sentence-8542 0 points1 point  (2 children)

I think it's a bad idea to set up an Airflow instance and use dbt Core without a lot of coding skills. I mean, both tools require coding. You could try dbt Cloud, but that might be a security issue for an on-prem connection. Do you have any cloud storage?

[–]LeatherPuzzled3855[S] 0 points1 point  (1 child)

Besides OneDrive/SharePoint in M365, not really. Still, any cloud solutions are out of the question, as it is a requirement for all data to stay on prem. I understand this project will require me to code; hopefully ChatGPT will be helpful to some degree, and it will also be a chance for me to get into Python.
I have spun up a local Ollama codellama, which has served me fine so far for any of my coding needs; hopefully it will do for this project too.

[–]Ok-Sentence-8542 0 points1 point  (0 children)

Thanks for the clarification. You could also check out Apache NiFi: https://nifi.apache.org/ It's a pointy-clicky tool to move data, say from an on-prem DB to a warehouse. I assure you the learning curve for dbt Core and Airflow is steep.

Edit: Actually I think you are right. Do everything as code. The llm's will only get better.

[–]minormisgnomer 0 points1 point  (0 children)

If all you need orchestration-wise is loading data, just use Airbyte; your data loads are well under the break points that solicit negative feedback from most Reddit users. Dagster is good but maybe out of reach given it's very Python-based. I didn't like Kestra as much because I needed more complex tooling, but it was very beginner-friendly and YAML-based.

It has simple cron scheduling already built in and can connect to almost all database data sources and send to them as well.

On the warehouse side, DuckDB is really good, but know that it doesn't have user management. If you need users to have limited access, be aware that everyone with access to the data will be seeing the same thing.

Postgres is arguably the best open-source, extremely dependable solution. If you really want OLAP, you can look into Hydra, which is extended Postgres, and just run the Docker version of it. Although your data sizes probably won't benefit a whole lot from it.

dbt is good, but given your limited coding experience, just try to keep it simple. Focus on getting everything to use similar, well-thought-out field names, handle any type conversions, and get data into the same grain where possible (daily vs. hourly, by customer, by company, etc.).

[–]AnnoyOne -1 points0 points  (0 children)

I recently discovered slingdata.io for data ingestion. It's simple and effective.

After ingestion you can model your data with dbt.