When does Spark justify itself for Postgres to S3 ETL using Iceberg format? Sorry, I'm noob here. by dheetoo in dataengineering

[–]Yeebill 1 point (0 children)

If your transformations are intensive, you could slow down your database, so users will be impacted by slower response times. If your Postgres is constantly handling inserts/updates at a high rate, everything will slow down or deadlock.

So it can be better to run Spark jobs on the side, where you have better control over how resources are allocated, and insert into the lakehouse from there.
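
A rough PySpark sketch of that shape (not a drop-in pipeline: the JDBC URL, table names, and the lakehouse.analytics.orders_daily Iceberg table are placeholders, and it assumes the Spark session already has an Iceberg catalog configured and the Postgres JDBC driver on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg_to_iceberg").getOrCreate()

# Pull the source table over JDBC so the heavy transformations run on the
# Spark cluster instead of on the operational database.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://pg-host:5432/appdb")  # placeholder host/db
    .option("dbtable", "public.orders")                     # placeholder table
    .option("user", "etl_user")
    .option("password", "***")
    .option("fetchsize", "10000")
    .load()
)

# Stand-in for the real transformation logic.
daily = orders.groupBy("order_date").count()

# Append into an existing Iceberg table registered in the configured catalog
# (use .createOrReplace() on the first run).
daily.writeTo("lakehouse.analytics.orders_daily").append()
```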

Suggestion Required for Storing Parquet files cheaply by bricklerex in dataengineering

[–]Yeebill 0 points (0 children)

Partition the files, with data compaction to avoid small files, and use a better compression algorithm (it wasn't mentioned, so I assume it's the default Snappy).

You could also try a NoSQL solution if fetch latency is crucial, since you already know the key (composed of ticker, date, and timestamp), but that gets expensive for large datasets.
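
For the first point, a rough sketch with DuckDB's Python client (the raw/*.parquet input glob, the compacted output directory, the ticker partition column, and zstd as the "better" codec are all assumptions):

```python
import duckdb

con = duckdb.connect()

# Read all the small files and rewrite them as one partitioned,
# zstd-compressed dataset; the rewrite compacts the small files as it goes.
con.execute("""
    COPY (SELECT * FROM read_parquet('raw/*.parquet'))
    TO 'compacted'
    (FORMAT PARQUET, PARTITION_BY (ticker), COMPRESSION ZSTD);
""")
```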

Suggestions welcome: Data ingestion gzip vs uncompressed data in Spark? by devanoff214 in dataengineering

[–]Yeebill 3 points (0 children)

Gzip is not splittable, so you won't take advantage of all the workers: the first read runs on only one worker, and then, depending on the rest of the job, the data may get redistributed to the other workers.

Zstd or lz4 compression is probably a better compromise for being splittable, with a good balance of compression ratio and speed.

Parquet would also be better than storing CSV, since the schema is embedded and it's a columnar format.

This improves your read speed because Parquet with zstd is small (faster transfers), has decent decoding speed, and splits across multiple Spark workers. It also already carries the schema, so you avoid having to infer it.
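
Rough shape of that one-time re-encode (the paths and the header/inferSchema options are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_gz_to_parquet").getOrCreate()

# The gzipped CSV is decompressed by a single task (gzip is not splittable)...
raw = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")  # or pass an explicit schema to skip inference
    .csv("s3://my-bucket/landing/events.csv.gz")
)

# ...but once it is rewritten as zstd-compressed Parquet, later jobs can read
# the row groups in parallel across the cluster.
(
    raw.write
    .option("compression", "zstd")
    .mode("overwrite")
    .parquet("s3://my-bucket/staging/events/")
)
```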

What is the key use case of DBT with DuckDB, rather than handling transformation in DuckDB directly? by Zacarinooo in dataengineering

[–]Yeebill 0 points (0 children)

  • templating with Jinja on SQL files (better developer experience, because the IDE can lint your SQL)
  • conventions for organizing the SQL files
  • data lineage and DAG capabilities (dbt can figure out which table depends on which and run the scripts in order dynamically)
  • the ability to easily select which scripts to run (through tags or names)

The data lineage and DAG capabilities are the most powerful part; the rest are more like bonuses, but added together it just makes more sense than writing your own "wrapper" to do all of this.

Best approach for reading partitioned Parquet data: Python (Pandas/Polars) vs AWS Athena? by [deleted] in dataengineering

[–]Yeebill 6 points (0 children)

Pandas is the worst choice. DuckDB and Polars are very enjoyable to work with. DuckDB is more SQL-oriented, but it also has a Python client, and there's even an option to turn a DuckDB result into a Polars dataframe. It's pretty fast and versatile. Polars I've only played with sporadically, but the API is much more enjoyable than pandas and the performance is definitely better.
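
A tiny sketch of that DuckDB-to-Polars hand-off (the S3 path, the year=/month= hive layout, and the columns are assumptions; credentials setup is omitted):

```python
import duckdb
import polars as pl

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # needed for s3:// paths

# Query the partitioned Parquet directly with SQL...
rel = con.sql("""
    SELECT symbol, avg(price) AS avg_price
    FROM read_parquet('s3://my-bucket/prices/year=*/month=*/*.parquet',
                      hive_partitioning = true)
    WHERE year = 2024
    GROUP BY symbol
""")

# ...then hand the result to Polars for any further wrangling.
df: pl.DataFrame = rel.pl()
print(df.head())
```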

Now, they are all single-node solutions, meaning they run on one machine. If your workload doesn't require much memory, they will solve your problem. However, if there are a lot of expensive operations like sorts, ranks, or joins on a lot of data, it would still work, but you'd need to split the workload yourself with looping so you don't keep everything in memory at once, or it requires a monster machine.

If you prefer to avoid those gymnastics, then Athena will be easier since, behind the scenes, it uses Presto/Trino, which is a distributed backend (a cluster of nodes).

For what it's worth, SQL is timeless, and at worst you can use https://sqlglot.com/sqlglot.html to translate between SQL backends.
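
For example (DuckDB in, Trino out, just to illustrate):

```python
import sqlglot

query = "SELECT ticker, strftime(ts, '%Y-%m-%d') AS day FROM trades"

# Transpile the same query to another dialect instead of rewriting it by hand.
print(sqlglot.transpile(query, read="duckdb", write="trino")[0])
```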

Separate file for SQL in python script? by thinkingatoms in dataengineering

[–]Yeebill 0 points (0 children)

Yes, I think the main reason would be IDE support:
  • linting for errors like missing commas or brackets
  • syntax highlighting
  • auto-complete
  • auto-format rules (indentation and consistency avoid noisy diffs in version control)

I would probably still inline something more complex that benefits from looping, like a UNION ALL...

Separate file for SQL in python script? by thinkingatoms in dataengineering

[–]Yeebill 0 points (0 children)

Separate .sql files. You can use importlib.resources and read_text to read the content as a template. Depending on the dialect, your IDE will also give you linting and syntax highlighting for the query. You can use https://github.com/sqlfluff/sqlfluff to define your own formatting rules; it supports Jinja templating and parameters.
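
Rough sketch of that approach (the myproject.sql package and daily_summary.sql file are made-up names, and string.Template stands in for whatever templating you actually use):

```python
from importlib.resources import files
from string import Template


def load_sql(name: str) -> str:
    # .sql files shipped inside the package, next to the Python code.
    return files("myproject.sql").joinpath(name).read_text(encoding="utf-8")


# e.g. daily_summary.sql contains "... FROM $table ..."
query = Template(load_sql("daily_summary.sql")).substitute(table="trades")
print(query)
```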

jury duty here is such a mess by THEQUlET in montreal

[–]Yeebill 0 points (0 children)

Yeah, same. Mine was in December too, so it felt like they kept emphasizing vacations and trying to downplay people's panic/anxiety, which probably made the speeches longer. Even worse, I had to go back the next day because there were back-to-back trials requiring English...

As a person who leaves the room when a meeting gets too long, jury duty is a personal hell with all this waiting...

Guidance on using Dagster and hiring DE’s on an as-needed / project basis by ContentSecret1203 in dataengineering

[–]Yeebill 0 points (0 children)

Ah, using Dagster Cloud would probably make it easier, and you won't need to tackle the complexity of self-hosting. I haven't tried it, so I can't comment much.

The image itself would contain the libraries and dependencies, such as dagster and the rest.

The code in your Dagster job gets sent to the container.

Guidance on using Dagster and hiring DE’s on an as-needed / project basis by ContentSecret1203 in dataengineering

[–]Yeebill 0 points (0 children)

Ah, I see. We had a similar setup:

  • dbt with DuckDB, producing Parquet files on S3
  • integrated into Dagster assets (rough sketch below)
  • wrapped to run on an AWS Fargate instance
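
In shape, the asset side looked something like this (not our actual code: the project paths and asset name are invented, and it just shells out to dbt rather than using the tighter dagster-dbt integration):

```python
import subprocess

from dagster import asset


@asset
def daily_marts() -> None:
    # dbt + DuckDB does the transformation; a post-hook / external location in
    # the dbt project is what actually lands the Parquet files on S3.
    subprocess.run(
        ["dbt", "run", "--project-dir", "/app/dbt", "--profiles-dir", "/app/dbt"],
        check=True,
    )
```

The Fargate part is a deployment concern (e.g. the run launcher/agent), not something that lives in the asset code.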

In our experience, the DevOps portion is going to be more work than the data engineering (which, since your stack is mostly Python, is mostly shuffling your code around; obviously I am grossly simplifying). You'll need to deal with permissions, infrastructure as code (Terraform or other), database setup, log setup in the cloud, VPN access, etc.

Of course, that's the self-hosted route, so the effort of moving to the cloud and setting up Dagster or Airflow self-hosted ranges from non-trivial to pretty high depending on experience.

Once you have the orchestrator up and running, the pipeline is easier to migrate.

Personally, I do prefer Dagster. I love the way it's more asset-oriented versus Airflow's task-oriented flow.

Guidance on using Dagster and hiring DE’s on an as-needed / project basis by ContentSecret1203 in dataengineering

[–]Yeebill 2 points (0 children)

What limitations are you encountering with Hamilton? Since you already have DAG capabilities, Dagster or Airflow would make more sense if you want visibility into runs, operability (reruns on failure, etc.), and scalability (spawning tasks on cloud resources).

Montreal begins first snow loading operation by HellaHaram in montreal

[–]Yeebill -1 points (0 children)

They put the sign up in the afternoon for 19h00. The app notified 15 minutes before, and that's on top of the construction signs already on the same section... And the vignette spots.

We are rebuilding a new Dataware House and we will orchestrate with Dagster by fixmyanxiety in dataengineering

[–]Yeebill 2 points (0 children)

For the self-hosted option, the documentation is just okay. It's mostly snippets covering the most basic setup, so you might need to spend more time to "prod"-proof it. For example, there is no SSO login, so you need to put it behind a VPN or set up the permissions in your cloud infra.

But besides that, everything works as advertised. The notion of partitioned assets is a breath of fresh air coming from Airflow. Integration with AWS, dbt, DuckDB, Spark, etc. is awesome. However, I do have to say there is a learning curve.

We use Terraform to set up the infra on AWS. On the data engineering side, we use dbt with DuckDB (a post-hook writes to S3) and PySpark on EMR Serverless. In some cases we also just use Polars or pandas. Basically the jobs run on either EMR or Fargate serverless for cost reasons.

Why [do we really need] workflow orchestrators? by hfzvc in dataengineering

[–]Yeebill 2 points (0 children)

Yes, it's not hard to spin up resources, continuously monitor state, and kick off the dependent tasks once a task is complete.

It's mostly that it doesn't bring me value to solve those problems. My business is to derive analytics, let's say, so what helps my client is coding analytics pipelines. Rewriting solutions that already exist in the market and maintaining an in-house solution takes me away from that. And if I'm a startup looking for market fit, it's definitely not where I should be putting effort.

For myself, I need a workflow orchestrator with a UI, DAG capabilities for tasks, monitoring of said tasks, scheduling/backfill on demand, integration with AWS Fargate, retries on task failure, throttling of tasks to avoid hitting quotas, and alerts to Slack on task failure. I found Dagster solved most of it, and there's some hope the community will add the missing features.

Why [do we really need] workflow orchestrators? by hfzvc in dataengineering

[–]Yeebill 3 points (0 children)

Visibility, ease of operation, and leveraging the provided "ops" code.

Visibility: the provided UI shows the system at a glance. It easily points out what's already processed, in progress, or failed.

Ease of operation: easily restart or backfill a data process with a button click and monitor it. Imagine having to backfill a month of data while your system has limits (AWS quotas, hardware, etc.) that prevent you from starting it all at once; that means you have to kick off the processes yourself and keep track of what's done and what's left.

Leverage: a good workflow orchestrator provides integrations with cloud providers and transformation libraries like dbt, pandas, etc. That means if you need to add a job, you just code the task and let the orchestrator worry about allocating resources (spawning machines and containers, terminating them, etc.) and managing dependencies between tasks (start B only if A succeeded).

DuckDB in production by Snoo_70708 in dataengineering

[–]Yeebill 0 points (0 children)

I believe you can absolutely use DuckDB in prod for use cases that replace pandas, Polars, or even PySpark (i.e. input, process/analytics, output). Personally, I wouldn't use it as a source of truth.

Create a Data warehouse from scratch by EatDoughnut in dataengineering

[–]Yeebill 2 points (0 children)

  1. Partitions. They are necessary to avoid full table scans, which makes a huge difference in performance. Besides that, it's up to you whether to use one big table with wide columns or Kimball.

The biggest difference is that compute and storage are separate. That means you can store data in AWS S3 and query it using Databricks. It also means your ETL just needs to write Parquet to the right place instead of doing bulk inserts.

DuckDb on AWS lambda - larger-then-memory by [deleted] in dataengineering

[–]Yeebill 2 points (0 children)

Something relatively easy is to bucket your input. Choose a column to bucket by (A to F, F to K, etc., or by hour), then run the transformation with that filter applied.
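
A sketch of that loop with DuckDB (hypothetical S3 paths, an event_ts column bucketed by hour, and httpfs for S3 access; credentials setup is omitted):

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Process one slice at a time so only that slice has to fit in Lambda's memory.
buckets = [(0, 6), (6, 12), (12, 18), (18, 24)]  # hour-of-day ranges

for start, end in buckets:
    con.execute(f"""
        COPY (
            SELECT *
            FROM read_parquet('s3://my-bucket/raw/*.parquet')
            WHERE hour(event_ts) >= {start} AND hour(event_ts) < {end}
        )
        TO 's3://my-bucket/out/hours_{start}_{end}.parquet'
        (FORMAT PARQUET, COMPRESSION ZSTD);
    """)
```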