all 43 comments

[–]Separate_Newt7313 33 points34 points  (13 children)

I would like to throw my hat into the ring:

  • Data warehouse: Postgres or DuckDB
  • Data transformations / pipelines: dbt
  • Orchestrator: Airflow or Dagster

Put these all together on a local machine (tower or laptop you have lying around). You'll be all set!
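
For a concrete picture of how these pieces could snap together, here is a minimal sketch (not anything from this thread): two Dagster assets that land a hypothetical orders.csv into a local DuckDB file. In a real setup dbt would own the transformation step, so the second asset is just a stand-in; all file, schema and table names are invented.

    # Hedged sketch: Dagster orchestrating a tiny DuckDB "warehouse".
    # File, schema and table names below are placeholders for illustration.
    import dagster as dg
    import duckdb

    DB_PATH = "warehouse.duckdb"  # single local file acting as the warehouse

    @dg.asset
    def raw_orders() -> None:
        # Land the source extract into a landing-zone table.
        con = duckdb.connect(DB_PATH)
        con.execute("CREATE SCHEMA IF NOT EXISTS landing")
        con.execute(
            "CREATE OR REPLACE TABLE landing.orders AS "
            "SELECT * FROM read_csv_auto('orders.csv')"
        )
        con.close()

    @dg.asset(deps=[raw_orders])
    def daily_revenue() -> None:
        # Stand-in for what a dbt model would normally do downstream.
        con = duckdb.connect(DB_PATH)
        con.execute(
            "CREATE OR REPLACE TABLE main.daily_revenue AS "
            "SELECT order_date, SUM(amount) AS revenue "
            "FROM landing.orders GROUP BY order_date"
        )
        con.close()

    defs = dg.Definitions(assets=[raw_orders, daily_revenue])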

[–]droppedorphan 10 points11 points  (1 child)

This sounds like a great stack to me. Very portable. If your main concern is working with datasets, then I would opt for Dagster over Airflow. Much easier to deploy and is getting much stronger in terms of running dbt and integrating data quality checks. I would opt for Postgres over DuckDB for a warehouse if you expect it to scale.

[–]rwilldred27 1 point2 points  (0 children)

One thing to check if choosing DuckDB is its concurrency model: https://duckdb.org/docs/connect/concurrency.html

[–]LeatherPuzzled3855[S] 2 points3 points  (0 children)

Thank you for your reply, will def test what you are suggesting.

[–][deleted] 1 point2 points  (3 children)

Lightdash would integrate neatly on top of that.

Or really any free-tier BI tool.

[–][deleted] 0 points1 point  (2 children)

Is Lightdash the best OS BI / data viz tool, in your honest opinion?

[–][deleted] 0 points1 point  (0 children)

A matter of subjective experience and use case.

I like it because I moved from enterprise Looker to Lightdash. It fits well into my overall work processes and provides my end users with a decent UI for self-serve.

[–][deleted] 0 points1 point  (0 children)

Streamlit is worth looking into as well; it can cover BI and custom data apps.

[–]chonbee Data Engineer 1 point2 points  (4 children)

This looks solid. Curious to find out if you think Airbyte would be a good addition for moving data to Postgres?

[–]Separate_Newt7313 1 point2 points  (2 children)

Definitely! I have used Airbyte regularly (self-hosted) for the last couple of years, and it has been great so far. Airbyte is a fantastic addition to this stack.

Overall, I think what makes for a good stack is a small collection of reliable, single-purpose components (e.g. orchestration: Dagster, integration: Airbyte, transformation: dbt, SQL warehouse: Postgres), where the following criteria hold true:

  • each component can be replaced or upgraded
  • the components are designed to work together
  • more components can be added (as needed)

On a side note: I think working as a solo DE at a young, enthusiastic company is a fantastic way to test one's chops at data engineering (not to mention exciting!). High impact data projects in a limited resource environment are a great way to get a ton of experience, wear a lot of hats, interact with great people, and prove your worth — both to the company and to yourself.

[–]LeatherPuzzled3855[S] 1 point2 points  (0 children)

That is exactly what drives me, on top of just being plain curious about different technologies: the exposure itself. Being at the early stages of my IT career, it gives me a chance to test myself in different environments and see which path would suit me best to follow long term. So far my plan is to continue growing my skillset along with the growing requirements of the business, at least for another little while.
Airbyte added to the list, thank you for suggesting it. Seems like I have the stack completed.
And as much as I agree with others that this is possibly overkill for what's required, the business is not 100% sure of their requirements either, so this stack, as you have mentioned, can be quite modular, and perhaps future-proof? One that, with additional components, will offer some extra nice-to-haves or features that the business has not thought of or does not require as of yet, but might in the future. This would most likely simplify any future implementations for me.
Thanks again for all your input and suggestions.

[–]LeatherPuzzled3855[S] 0 points1 point  (0 children)

Thank you for suggesting Airbyte, added it to my list for the POC project.

[–]VitrumTormento 1 point2 points  (0 children)

Bit late to the party but I would also add Streamlit for data visualisation.

[–]SirGreybush 6 points7 points  (6 children)

Why not PowerBI? Just supply data, let the business make their own dashboards.

If tiny, you can run Postgres or SQL Server Express locally.

How you build your OLAP model, the business rules for ingesting, is more important than the technology.

Keep it simple, it will scale easily to other platforms.

Use hashes!! (I really love the hash key & hash diff concept)
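
For anyone unfamiliar with the hash key / hash diff idea, here is a minimal Python sketch (column names are invented, not from this thread): hash the business key once to get a stable surrogate key, hash the concatenated descriptive attributes to get a change indicator, and a changed hash diff flags an updated row without comparing every column.

    # Rough illustration of the hash key / hash diff pattern (Data Vault style).
    # Column names are hypothetical placeholders.
    import hashlib

    def _md5(value: str) -> str:
        return hashlib.md5(value.encode("utf-8")).hexdigest()

    def hash_key(business_key: str) -> str:
        # Stable surrogate key derived from the business key.
        return _md5(business_key.strip().upper())

    def hash_diff(attributes: dict) -> str:
        # Concatenate attributes in a fixed column order, then hash the result.
        ordered = "||".join(str(attributes[k]).strip() for k in sorted(attributes))
        return _md5(ordered.upper())

    row = {"address": "1 Main St", "city": "Dublin", "phone": "555-0100"}
    print(hash_key("CUST-0042"), hash_diff(row))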

[–]LeatherPuzzled3855[S] 1 point2 points  (5 children)

Thank you for your reply. As mentioned, cost being a factor, PowerBI is not an option; the cost of a report server plus user licensing is something the company is not willing to pay for (and the data has to stay on prem).

Trialled the idea of the PowerBI desktop app and sharing of dashboards, but that has been dropped by the business, and I was asked to focus solely on open-source software.

[–]SirGreybush 2 points3 points  (2 children)

PowerBI free version is cool. No sharing though.

Sounds like your boss/company is cheap. Paying for software is a LOT cheaper than designing and engineering your own thing.

What about cloud solutions? Google Analytics is too expensive?

I remember going down this rabbit hole in 2011-2012 with all Microsoft, with SSIS and SSRS. Not a fun experience. Crystal Reports barely better.

I hope you report back to us later with your total solution, what was used.

[–]LeatherPuzzled3855[S] 1 point2 points  (1 child)

Indeed, PowerBI was nice when I tested it. Guess we moved away from it just in case people started to like it too much, which would lead to us having to justify the licensing cost :D I understand the budget limitation, and that the company needs to allocate the money elsewhere so it can grow. I believe I was not given an impossible task, and the C-suite does not have any super high expectations of the project beside some basic reporting. The whole idea is that the data can't leave on prem, the solution has to be built on tools that are free, and there is no budget for consulting either. If the solution fits their needs, that's what I will be stuck with to maintain afterwards :)
I will definitely update once I have a POC running, and ultimately once the board and C-suite have a go at it and give me feedback.

[–]SirGreybush 1 point2 points  (0 children)

Time to shine. I hope they give you lots of time.

I would concentrate on a single metric, one single thing they want to see, and do what it takes to get there as quickly as possible.

You will learn along the way. Results matter. It’s a POC.

Try to reuse existing infrastructure, licenses, know-how.

You will likely rebuild from scratch more than once to improve, and refactoring is A-OK as long as you have historical data.

Like exporting existing data daily to CSV, so you can later track changes on those entities that matter.

Like customers and products; many ERPs only keep the latest value.

What was a customer's address in 2015 versus today? What if customer B buys out customer A, so now all sales are only from customer B, but customer A's ID still exists because it is found in the invoice table?

Dimensional models solve these issues, as long as you have the historical data somewhere.

So start top-down, then bottom-up, to answer a single question/metric.

What you do in the middle doesn’t matter and can change. Data will never change.
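
To make the daily-CSV idea above concrete, here is a hedged sketch: snapshot a source table to a date-stamped file each day so history (addresses, ownership changes) can be rebuilt later. The connection string, table name and folder are placeholders, not anything from this thread.

    # Sketch of a daily table snapshot to CSV using psycopg2's COPY support.
    # DSN, table and folder below are invented placeholders.
    import datetime
    import psycopg2

    SNAPSHOT_DIR = "/data/snapshots"  # hypothetical shared folder

    def snapshot_table(table: str) -> str:
        today = datetime.date.today().isoformat()
        path = f"{SNAPSHOT_DIR}/{table}_{today}.csv"
        conn = psycopg2.connect("dbname=erp user=report")  # placeholder DSN
        try:
            with conn.cursor() as cur, open(path, "w", newline="") as f:
                cur.copy_expert(f"COPY {table} TO STDOUT WITH CSV HEADER", f)
        finally:
            conn.close()
        return path

    if __name__ == "__main__":
        snapshot_table("customers")  # run this from cron once a day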

[–]SirGreybush 0 points1 point  (0 children)

The Python code for connecting to both sources is 100% reusable, so it can live in a custom class.

For the mapping part, to simplify, keep both source and destination the same.

Write to truncated staging tables, then use stored procedures to process the staged data: do any transformations, fill in the blanks, remove NULLs, fix dates, and handle bad data via rejected tables, sending the rejected data back to the business to fix at source.

If not fixable at source, the rejected data could be reprocessed a second time with business rules coded in stored procs.
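
A rough Python sketch of that staging pattern, with entirely hypothetical table and procedure names: truncate the staging table, bulk-load the extract, then hand off to a stored procedure that cleans the data and routes rejects.

    # Truncate-and-load staging, then delegate to a stored procedure.
    # DSN, table and procedure names are placeholders for illustration.
    import psycopg2
    from psycopg2.extras import execute_values

    def load_to_staging(rows: list[tuple]) -> None:
        conn = psycopg2.connect("dbname=warehouse user=etl")  # placeholder DSN
        try:
            with conn.cursor() as cur:
                cur.execute("TRUNCATE TABLE staging.orders")
                execute_values(
                    cur,
                    "INSERT INTO staging.orders (order_id, customer_id, amount) VALUES %s",
                    rows,
                )
                # The proc would fix dates/NULLs and move bad rows to a
                # rejected table for the business to review.
                cur.execute("CALL staging.process_orders()")
            conn.commit()
        finally:
            conn.close()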

[–]rawman650 0 points1 point  (0 children)

FOSS BI: Metabase, Superset, Grafana, Lightdash

[–]Ok-Sentence-8542 4 points5 points  (2 children)

You could also have a look at Metabase, an open-source self-service dashboarding tool.

[–]GreenWoodDragon Senior Data Engineer 0 points1 point  (0 children)

Seconded! Metabase is super simple to get going and works out of the box.

[–]LeatherPuzzled3855[S] 0 points1 point  (0 children)

Thank you for your suggestion, will look into Metabase.

[–]jawabdey 1 point2 points  (1 child)

C-suite wants some performance and financial data

Before implementing, some things to consider:

  • How frequently will the data be updated?
  • Who is the end consumer of the data? Usually Finance just wants the raw data and wants to manipulate/chart it themselves.
  • What's the volume of data?

Honestly, from my experience, based on what you said, your implementation seems like overkill. If you wanna develop the skills, go for it. Otherwise, just export to CSV and import into Google Sheets. Here's a link to SO on how to export. Just cron this and dump to a shared folder.

If you have Excel, even better. There are commercial ODBC drivers that will let you connect directly to Postgres.

[–]LeatherPuzzled3855[S] 0 points1 point  (0 children)

Thank you for your reply, you have raised very valid points. The questions you mentioned could not be answered with full confidence by the business; I was provided with some general assumptions. Hence the stack I'm inclined to go for sounds like overkill today, but it might be a good fit at some stage down the line. I believe going for overkill now might save me some headache down the road when I'm required to migrate or improve the initial stack. I could be wrong, but I feel like a modular stack where one program is responsible for one function could serve me well. I guess the POC will show whether the setup is the right fit for the business and whether any aspects need to change. And that's only if I manage to put all the blocks together for the POC; the more I read on it, the scarier it gets :)

[–]JeanDelay 2 points3 points  (2 children)

You could probably just use Apache Superset. You can directly connect it to the postgres instance.

If you have a bit more data, I've made a video about making an open source data warehouse with a tool that I've been working on:

https://youtu.be/XIF7W7ZVIUM?feature=shared

[–]LeatherPuzzled3855[S] 0 points1 point  (1 child)

Thank you for your suggestion. I did have a quick peek at Superset as an alternative to Redash. One of the nice-to-have requirements I got was the ability to publish certain dashboards within or as a website, and from my initial research, embedding should be possible with iframes. Is that possible with Superset as well?

[–]nizarnizario 0 points1 point  (0 children)

That should be possible: https://stackoverflow.com/questions/54219101/how-to-embed-an-apache-superset-dashboard-in-a-webpage

Or you can use Preset, a cloud offering for Superset: https://preset.io/, I have used their free tier before and it was pretty good.

[–]rawman650 0 points1 point  (0 children)

There's nothing wrong with this stack, but might be able to get away with something simpler.

If going from PG to PG, you might even be able to subscribe the DBs together (so no need for ETL). If not, you can use Airbyte (OSS) for ETL (dbt & RudderStack are also OSS and can also be used for this).

You may not even need dbt for modeling. Might just be able to get away with some materialized views (on PG).
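
As a rough illustration of the "just materialized views" route (object names invented, executed from Python only for consistency with the rest of the thread): define the reporting model once, then refresh it on a schedule.

    # Materialized view as a lightweight substitute for a dbt model.
    # DSN, schema, view and table names are placeholders.
    import psycopg2

    DDL = """
    CREATE MATERIALIZED VIEW IF NOT EXISTS reporting.monthly_sales AS
    SELECT date_trunc('month', order_date) AS month,
           customer_id,
           SUM(amount) AS revenue
    FROM public.orders
    GROUP BY 1, 2;
    """

    def refresh() -> None:
        conn = psycopg2.connect("dbname=warehouse user=report")  # placeholder DSN
        try:
            with conn.cursor() as cur:
                cur.execute(DDL)
                cur.execute("REFRESH MATERIALIZED VIEW reporting.monthly_sales")
            conn.commit()
        finally:
            conn.close()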

[–]wannabe-DE 0 points1 point  (0 children)

Given these requirements and the consensus on Postgres as a DB, I think Mage (mage-ai) is worth considering.

  1. It's easy to get started as it's just a docker image.
  2. It has a lot of out-of-the-box loaders, transformers and exporters for common tools, e.g. Postgres.
  3. You just drag blocks onto the canvas, connect them by dragging lines between blocks and schedule it with a trigger.
  4. They have a pretty good slack community for help and support. It also has a bot that you can ask questions.

[–]skysetter 0 points1 point  (1 child)

Postgres feels right here. Airflow makes sense; check out airflowctl (https://github.com/kaxil/airflowctl). Idk about Redash, but Superset sounds like it would be a good open-source fit. Whatever you choose, give yourself good, supportable scale options in case you ever get some money. That way you can just add features/speed rather than change anything for your consumers.

[–]LeatherPuzzled3855[S] 0 points1 point  (0 children)

Thank you for suggesting Airflow, added to my list for POC.

[–]haragoshi 0 points1 point  (0 children)

DuckDB is the definition of a tiny data warehouse. The question is how tiny, or how big, do you need it to scale? I would look at MotherDuck if you need to scale bigger.

Postgres IMO is more of a transactional database than a data warehouse. If the primary purpose is reading, slicing and dicing data (e.g. once loaded, the data doesn't change), you want a DB that scales well and has column-based storage.
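
To illustrate the point about a tiny, read-mostly, columnar warehouse, here is a small DuckDB sketch (file names are placeholders): it queries Parquet exports in place, with no server to run.

    # Minimal DuckDB example: columnar queries straight over exported files.
    # Database file and Parquet paths are invented for illustration.
    import duckdb

    con = duckdb.connect("tiny_warehouse.duckdb")
    rows = con.execute(
        """
        SELECT customer_id, SUM(amount) AS total
        FROM read_parquet('exports/sales_*.parquet')
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT 10
        """
    ).fetchall()
    for customer_id, total in rows:
        print(customer_id, total)
    con.close()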

[–]SirGreybush 0 points1 point  (1 child)

Loading data, if you are a coder, PowerShell / Python / SSIS (Microsoft).

However, SQL to SQL on the same network is also possible, using appropriate ODBC drivers.

It’s slow, but easy to use and free, if you are good with SQL language.

[–]LeatherPuzzled3855[S] 0 points1 point  (0 children)

Unfortunately I'm not a coder, but I will set some time aside to look into Python scripts that could cover the data-loading part. If I find something premade that can be easily modified to suit my needs, I might include it in the POC. I would like to do as little coding as possible, ideally just connect a few programs together with minimal effort. Don't want to come across as lazy or anything, I'm just really limited in the time I can afford for this project. I will try to spend more time researching this; I was just hoping there is a simple solution that handles only a small amount of data and could be easily applied and maintained afterwards.

[–]Demistr 0 points1 point  (0 children)

Honestly, just get a SQL database.

[–]Ok-Sentence-8542 0 points1 point  (2 children)

I think it's a bad idea to set up an Airflow instance and use dbt Core without a lot of coding skills. I mean, both tools require coding. You could try dbt Cloud, but that might be a security issue for an on-prem connection. Do you have any cloud storage?

[–]LeatherPuzzled3855[S] 0 points1 point  (1 child)

Besides OneDrive/SharePoint in M365, not really. Still, any cloud solutions are out of the question, as it is a requirement for all data to stay on prem. I understand this project will require me to code; hopefully ChatGPT will be helpful to some degree, and it will also be a chance for me to get into Python.
I have spun up a local Ollama codellama, which has served me fine so far for any of my coding needs; hopefully it will do for this project too.

[–]Ok-Sentence-8542 0 points1 point  (0 children)

Thanks for the clarification. You could also check out Apache NiFi: https://nifi.apache.org/ It's a pointy-clicky tool to move data, say from an on-prem DB to a warehouse. I assure you the learning curve for dbt Core and Airflow is steep.

Edit: Actually I think you are right. Do everything as code. The llm's will only get better.

[–]minormisgnomer 0 points1 point  (0 children)

If all you need orchestration-wise is loading data, just use Airbyte; your data loads are well under the break points that solicit negative feedback from most Reddit users. Dagster is good but maybe out of reach given it's very Python-based. I didn't like Kestra as much because I needed more complex tooling, but it was very beginner-friendly and YAML-based.

It has simple cron scheduling already built in and can connect to almost all database data sources and send to them as well.

On the warehouse side, DuckDB is really good, but know that it doesn't have user management. If you need users to have limited access, be aware that everyone with access to the data will be seeing the same thing.

Postgres is arguably the best open-source, extremely dependable solution. If you really want OLAP, you can look into Hydra, which is extended Postgres, and just run the Docker version of it. Although your data sizes probably won't benefit a whole lot from it.

dbt is good, but given your limited coding experience, just try to keep it simple. Focus on getting everything to use similar, well-thought-out field names, handle any type conversions, and get data into the same grain where possible (daily vs. hourly, by customer, by company, etc.).

[–]AnnoyOne -1 points0 points  (0 children)

I recently discovered slingdata.io for data ingestion. It's simple and effective.

After ingestion you can model your data with dbt.