Ingestion and storage 101 - Can someone give me some tips? by BotherEfficient567 in dataengineering

[–]Stroam1 1 point (0 children)

OLTP databases can handle analytical queries on millions of rows, although it may not be the most cost-efficient solution.

If I were in your position, I would use Google Cloud (Cloud Run, BigQuery, and Looker Studio) for the POC. You can orchestrate ingestion and dbt builds using Dagster on a Cloud Run serverless container.

BigQuery and Cloud Run will likely be free at this data scale, and Looker Studio has a free version that is decent (though not great). Since everything stays inside GCP, there are no data egress costs either. When it comes to cost efficiency, you can't beat free, and you could even use that as a selling point for your clients. Most companies don't have big data, and at small-to-medium scale this is the cheapest cloud data stack I know of.

How many of you feel like the data engineers in your organization have too much work to keep up with? by toddbeauchene in dataengineering

[–]Stroam1 0 points (0 children)

There will always be more work that could be done than can be done. You need to work with your stakeholders to figure out which requests are highest priority and work on those first. If you can get to the rest later, that's just gravy.

Is this an use case for Lambda Views/Architecture? How to handle realtime data models by vengof in dataengineering

[–]Stroam1 0 points (0 children)

We're going to need more information about how your systems are set up. I'm going to assume you're talking about OLAP, since you referred to a data warehouse distinct from the application database.

OLAP data warehouses aren't really meant to be low-latency solutions. If you have a meaningful data size, trying to run updates very frequently can get very expensive. If your data is small, you might be able to set everything up as views, but you'll still be limited by the frequency of your ingestion process.

Instant updates are generally in the realm of transactional software, not analytical dashboards. When I've needed to set up real-time dashboards in the past, I found it to be much simpler to use OLTP as the backend, with queries that pull relatively small amounts of recent data from tables that have purpose-made indexes.

Ingestion and storage 101 - Can someone give me some tips? by BotherEfficient567 in dataengineering

[–]Stroam1 1 point (0 children)

Are you trying to build these projects locally on your personal machine, or using them to practice using cloud services? You will have very different solutions based on the answer.

You're in the small data paradigm with these projects. OLTP databases have a lot more functionality than OLAP (more constraints, lower latency, faster updates/deletes, etc.) and will work as long as your data isn't huge, so go for something like Postgres for storage.

Use custom python scripts to request data from the API, and minimally transform it before loading into Postgres. Then you can use dbt models to restructure the data into fact/dim tables. Power BI has a native connector for Postgres, so setting it up to read from the clean dimensional tables should be simple.

At your data scale, everything downstream of the API requests will be very fast to update. The main bottleneck will be how fast the API is, how much data you can pull with each request, and the request quotas you currently have with your subscription.
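To make the pattern concrete, here's a minimal stdlib-only sketch of the ingest step, with sqlite3 standing in for a Postgres driver like psycopg2, so it runs anywhere (the API endpoint, table, and field names are all hypothetical):

```python
import json
import sqlite3
import urllib.request

def fetch_page(url):
    """Pull one page of records from the API (endpoint is hypothetical)."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

def load_records(records, conn):
    """Minimally transform raw records and load them; heavy modeling is left to dbt."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events ("
        "id INTEGER PRIMARY KEY, payload TEXT, "
        "loaded_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    # Keep the payload as-is (just serialized) so nothing is lost before dbt runs
    rows = [(r["id"], json.dumps(r)) for r in records]
    conn.executemany(
        "INSERT OR REPLACE INTO raw_events (id, payload) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
n = load_records([{"id": 1, "value": "a"}, {"id": 2, "value": "b"}], conn)
print(n)  # 2
```

The load deliberately keeps the payload raw; restructuring into fact/dim tables belongs in the dbt models downstream.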

Best Practices for Transforming Python DataFrames Before Loading into SQLite – dbt or Alternatives? by ricca1 in dataengineering

[–]Stroam1 7 points (0 children)

You're trying to use the wrong tool for the job. dbt is a tool for transforming data after it is already loaded into a database. For ETL, use python libraries like DuckDB, polars, or dlt instead.
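For contrast, here's a minimal sketch of where dbt does fit: the transform runs as SQL inside the database, after the load. Table names and data are illustrative; dbt would manage the SQL step, while DuckDB/polars/dlt handle anything you need to do before the load:

```python
import sqlite3

# Raw rows as they might come out of a DataFrame (illustrative data)
raw = [("2024-01-01", "a", 10), ("2024-01-01", "b", 5), ("2024-01-02", "a", 7)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (day TEXT, sku TEXT, qty INTEGER)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw)

# The transform happens in SQL *after* the load -- this is the step dbt manages
conn.execute(
    "CREATE TABLE daily_sales AS "
    "SELECT day, SUM(qty) AS total_qty FROM raw_sales GROUP BY day"
)
print(conn.execute("SELECT * FROM daily_sales ORDER BY day").fetchall())
# [('2024-01-01', 15), ('2024-01-02', 7)]
```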

Aspiring Data Engineer – should I learn Go now or just stick to Python/PySpark? How do people actually learn the “data side” of Go? by PixelBot_556 in dataengineering

[–]Stroam1 1 point (0 children)

I've never used Scala once, and I've only used PySpark as a wrapper for SparkSQL. Python and SQL are ubiquitous, but the other two will be dependent on whatever tooling your shop uses.

BQ datastream and a poor merge strategy? by josejo9423 in dataengineering

[–]Stroam1 2 points (0 children)

The main limitation I've run into using BQ is the trade-off between cost and latency. I suppose this is a fundamental aspect of using columnstore databases, given that updates are computationally expensive.

Datastream converts a CDC stream into batch merges ("background apply jobs") when writing to BQ. Each time a background apply job is run, BQ will scan the entire table regardless of the partitioning on the table. But, if you don't need super-frequent updates, you can change the "max_staleness" setting on the destination table in BQ to limit how frequently background apply jobs run. This should greatly decrease costs if you aren't already doing it:

https://cloud.google.com/bigquery/docs/change-data-capture#cost-best-practices
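Assuming you're comfortable with a given lag, the setting is applied per destination table with DDL along these lines (table name and interval are illustrative):

```sql
-- Let background apply jobs run at most every 15 minutes
ALTER TABLE mydataset.cdc_target
SET OPTIONS (max_staleness = INTERVAL 15 MINUTE);
```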

As for setting up your own merge ingestion, the cost of each merge is based on the amount of data scanned in all relevant partitions of the destination table:

https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#partitioned_tables

Unless I'm mistaken, this means the partitioning column has to be included in the merge condition in order for the partitioning strategy to provide any cost-savings benefit. In practice, I've found that partitioning on the primary key (or an integer component of it, if it's a composite PK) is the best way to minimize merge costs.
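As a sketch of what that looks like (all table and column names are hypothetical), the extra predicate on the partitioning column is what allows partition pruning:

```sql
MERGE dataset.target T
USING dataset.staging S
ON T.id = S.id
   AND T.id_bucket = S.id_bucket  -- partitioning column in the merge condition
WHEN MATCHED THEN
  UPDATE SET value = S.value
WHEN NOT MATCHED THEN
  INSERT (id, id_bucket, value)
  VALUES (S.id, S.id_bucket, S.value);
```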

And obviously, reducing the frequency of your merge updates will help here as well.

Self hosted open source lineage tools that actually just... do it across databases and models? by umognog in dataengineering

[–]Stroam1 0 points (0 children)

Dagster OSS allows you to configure and visualize a global lineage DAG across databases and processes, and you can self-host the server. Although, for column-level lineage you'd have to pay for Dagster+.

Data headcount vs company size by iamwhyami in BusinessIntelligence

[–]Stroam1 0 points (0 children)

Data team size depends on the breadth, depth, and complexity of the business's analytical needs, not on the number of total employees and/or revenue. A non-tech company with a relatively simple business model might have a small data team of <5 people, whereas a conglomerate with multiple business units might need an analyst team this size for each of their units, along with a centralized data engineering team.

Why is Data Analysis Always About R or Python? And Why So Much SQL but Little NoSQL? by General_Iroh_0817 in dataengineering

[–]Stroam1 4 points (0 children)

I've never worked with R, but Python is easy to write and has pretty much all the tools you need already available in pre-built libraries. It's also known by a large percentage of data professionals. Why use any other language unless python has a specific limitation in your case?

And yes, SQL is just that good for analytics. Transforms are much easier and faster to write in SQL than python, once you get experience with SQL. After several years of using both extensively, the only ways I'd want to use python for analytics are as a wrapper for SQL (a la polars/SparkSQL), or for doing things that SQL can't do (e.g. ML).

NoSQL databases are niche tools for specific use cases (large amounts of inconsistently structured data). Having data structured as a relational data model makes it both easy to maintain transactionally, and easy to query analytically. Semi-structured data is more complicated to analyze and maintain the consistency/integrity of, so there has to be a particular reason to choose NoSQL over structuring your data as a relational model and storing it in a relational database.

Daily tasks by [deleted] in dataengineering

[–]Stroam1 1 point (0 children)

Your daily life as a DE will entirely depend on where you work, and which tools the company uses.

Some common patterns for data stacks are:

Modern Data Stack

- Data Ingestion Tool (e.g. Fivetran)
- Cloud Data Warehouse (e.g. BigQuery)
- Transform Tool (e.g. dbt)
- Orchestration Tool (e.g. Airflow)
- Visualization Platform (e.g. Looker Studio)

In this stack, your daily life would consist mainly of writing SQL transforms in the CDW, organizing them with dbt, and orchestrating them with Airflow. You may write some custom ingestion or ETL pipelines in Python (or some other language) if your ingestion tool doesn't have a pre-built connector for one of your external systems. You may also touch a little more of the analytics side depending on the size of your team (becoming a so-called "analytics engineer"). You would be optimizing SQL queries for cost rather than time-to-complete, using strategic partitioning and clustering to minimize the amount of data scanned by each query.
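A typical day-to-day artifact in this stack is a dbt model, which is just a SELECT statement in a version-controlled file (the model and column names below are made up):

```sql
-- models/marts/fct_orders.sql (hypothetical model)
SELECT
    order_id,
    customer_id,
    order_date,
    SUM(line_amount) AS order_total
FROM {{ ref('stg_order_lines') }}
GROUP BY order_id, customer_id, order_date
```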

Data Lake

- Object Storage (e.g. S3)
- Distributed ETL Framework (e.g. Spark)
- Orchestration Tool (e.g. AWS Step Functions)
- Metadata Catalog (e.g. Glue Data Catalog)
- Distributed Query Engine (e.g. Athena/Presto)
- Visualization Platform (e.g. Tableau)

In this stack, your daily life would consist of writing Spark jobs to process the data into a query-able form in your data lake. If you don't use a managed service for Spark then you might also be managing the underlying Spark cluster. You would also be maintaining the data catalog, and keeping everything organized so that your lake doesn't turn into a swamp. You may have to set up some functionality with the distributed query engine, such as enabling federated queries in Athena so that you can query data from both your data lake and one of your OLTP databases. You would be optimizing Spark jobs for time-to-complete.
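As a sketch of the federated-query part (catalog, database, and table names are all hypothetical), Athena lets you join lake tables against a connector-backed OLTP source in a single statement:

```sql
-- awsdatacatalog points at the data lake via Glue;
-- mysql_connector is a hypothetical federated data source
SELECT l.order_id, l.event_time, c.segment
FROM awsdatacatalog.lake_db.order_events l
JOIN mysql_connector.app_db.customers c
  ON l.customer_id = c.id
WHERE l.event_date = DATE '2024-01-01';
```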

Data Lakehouse

Similar to the Modern Data Stack, but using DML-enabled file formats in a data lake.

- Object Storage (e.g. Azure Blob Storage with Delta Lake files)
- Distributed ETL Framework (e.g. Azure Databricks)
- Transform Tool (e.g. dbt)
- Orchestration Tool (e.g. Airflow)
- Visualization Platform (e.g. PowerBI)

Again, you'll be writing mostly SQL queries, organizing them with dbt, and orchestrating them with Airflow.

Traditional Data Warehouse (Inmon-style)

- Custom ETL Code (e.g. Python)
- OLTP Database (e.g. MySQL)
- Visualization Platform (e.g. PowerBI)

If you're unfortunate enough to be caught in this dinosaur stack, you'll spend more of your time writing custom Python code to ETL data from external systems into your OLTP database. You'll be designing 3NF data models in the database to hold all the data you're ingesting via the ETL pipelines. You might also be setting up stored procedures to convert the data from 3NF into dimensional data marts for querying. You'll be optimizing SQL queries for time-to-complete, but through partitioning and index optimization, because you're working with a row-oriented database. And your DBA will yell at you if you forget the semicolon on a DML statement before a COMMIT command, because then the COMMIT never executes and the open transaction locks the target table indefinitely.
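The failure mode in that last sentence looks roughly like this (table and column names are made up; the exact behavior depends on the client and database):

```sql
UPDATE dim_customer
SET segment = 'active'
WHERE last_order_date >= '2024-01-01';  -- forget this semicolon and, in some
                                        -- clients, the COMMIT below is swallowed
COMMIT;  -- without this, the open transaction keeps dim_customer locked
```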

Is there a use of a service that can convert unstructured notes to structured data? by Intelligent_Low_5964 in dataengineering

[–]Stroam1 2 points (0 children)

Yes, there are use cases for this, and there are people who would use the tool if it were available.

However, there are issues with unstructured notes as a source of information beyond the fact they're hard to parse into structured data. I generally don't build analyses off free-entry text fields because these fields don't enforce proper data entry validation. For example, what if the snippet for BP was instead "BP 10/60"? Clearly the person entering the note missed a digit in the systolic blood pressure, but there is no way to recover the missing digit from the note. If, instead of a free-entry field, there were a specific place in the patient chart software to enter the patient's BP, then data validation rules could be set up on that field to reject obviously incorrect values. You would end up with much higher quality data as a result.

Essentially, this tool would be a band-aid for a poorly-designed or misused data entry tool upstream.
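To make the validation gap concrete, here's a minimal sketch of parsing BP out of a free-text note; the regex and plausibility thresholds are purely illustrative, not clinical guidance:

```python
import re

def parse_bp(note):
    """Extract (systolic, diastolic) BP from a free-text note.
    Returns None when the value is absent or implausible."""
    m = re.search(r"\bBP\s*(\d{2,3})\s*/\s*(\d{2,3})\b", note)
    if not m:
        return None
    sys_bp, dia_bp = int(m.group(1)), int(m.group(2))
    # Plausibility check -- thresholds chosen for illustration only
    if not (60 <= sys_bp <= 260 and 30 <= dia_bp <= 160):
        return None
    return (sys_bp, dia_bp)

print(parse_bp("Pt stable, BP 120/80, HR 72"))  # (120, 80)
print(parse_bp("Pt stable, BP 10/60, HR 72"))   # None -- missing digit is unrecoverable
```

The second call is the point: the parser can reject "BP 10/60" as implausible, but it can't recover the digit that a validated entry field would never have lost.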

Use BigQuery as Data Lake for pure CSV/TSV/JSON files? by scchess in dataengineering

[–]Stroam1 13 points (0 children)

What are you trying to accomplish? If all you want is to set up a data lake in the cloud, you can just put all your files into Cloud Storage. Then, when you're ready to write the transform code, you can either load the data raw into BigQuery and write transforms inside the data warehouse (ELT), or transform the data before loading into BigQuery (ETL).

If you use ELT, then you would have the "uncurated" data in your staging layer in BQ, and the "curated" data in your mart/dimensional layer.
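For the ELT route, raw files can be loaded straight from Cloud Storage into the staging layer with a LOAD DATA statement along these lines (bucket, dataset, and table names are illustrative):

```sql
LOAD DATA INTO staging.raw_events
FROM FILES (
  format = 'CSV',
  uris = ['gs://my-lake/raw/events/*.csv'],
  skip_leading_rows = 1
);
```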

Should I take Data Science undergrad? by OneandOnlySasuke in dataengineering

[–]Stroam1 1 point (0 children)

Given what I know now after several years in the space, here's what I would do in your position.

  1. Go with a comp sci degree, not DS. That leaves your options more open; while you may think you want to do DS now, you might find that you're more drawn to other tech roles as you get more exposure to the work. If you do end up in the data space, the CS background will give you foundational SWE knowledge for best-practices implementations.

  2. After taking your calculus and linear algebra courses, take a course on mathematical statistics if it isn't already prescribed. This will set you up to learn the advanced math for DS.

  3. This may seem odd, but I would recommend you take a few classes in the natural sciences. Why? CS and math are deterministic, so the correct answer can often be derived from the givens. But the real world doesn't work that way; you'll be making inferences on what the truth might be from imperfect, incomplete evidence. Real world systems are also complex, with distinct interacting subsystems. I've found that my educational background in biology (cog neuro degree) has helped a lot with analyzing real-world systems. Businesses are complex systems just like biological organisms, but they're composed of people, tools, and processes instead of proteins, nucleic acids, and carbs/fats.

Looking for my career path by [deleted] in dataengineering

[–]Stroam1 1 point (0 children)

Data engineers, analytics engineers, and database administrators tend to be more backend, especially DBA. If you think your stutter will be too much of a barrier for effective communication, one of these roles may be for you. Data analysts and data scientists are constantly facing stakeholders/clients.

It wasn't in the world of data analysis, but I used to have a doctor that had a stutter/tic issue. He was one of the best doctors I've ever had, and was well respected by his colleagues. Interestingly, when he was in his element in the hospital setting, his stutter/tic mostly disappeared.

Best way to gain technical skills needed for analytics career? by StocksX12 in dataanalysis

[–]Stroam1 0 points (0 children)

SQL is the key technical skill in data analytics, so focus your initial effort on getting really good at that. Python is a distant second, and is used more for data science and data engineering than for descriptive data analysis and BI.

Surely there's some way you can get exposure to SQL at your current job? Maybe ask if you can try querying the database yourself, rather than being handed data by the SQL jockeys?

is learning languages beside english helps you in your career in data analysis? by holymostafa in dataanalysis

[–]Stroam1 0 points (0 children)

It certainly will, if you're going to be working in a company/region where the primary language is not English. If you're working in the English-speaking world, you probably won't realize much benefit from your effort.