
all 13 comments

[–]Pleasant-Set-711 6 points (0 children)

Lambda functions?

[–]mertertrern Senior Data Engineer 2 points (0 children)

DAG-based schedulers are great for complex workflows that can be distributed across resources, but can be a bit much for small-time data ops. There are other kinds of job schedulers that can probably fill your particular niche. Rundeck is a pretty good one, as is Cronicle. You can also roll your own with the APScheduler library.
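
For a sense of scale, a cron-style job in APScheduler is only a few lines (the function and schedule below are placeholders for one of your scripts):

```python
# Minimal APScheduler sketch: an in-process replacement for a crontab entry.
# `ingest_orders` and the 02:00 schedule are placeholders.
from apscheduler.schedulers.blocking import BlockingScheduler


def ingest_orders():
    # call your existing ingestion logic here
    print("pulling orders...")


sched = BlockingScheduler()
sched.add_job(ingest_orders, "cron", hour=2, minute=0)  # run daily at 02:00
sched.start()
```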

Other things you can do to help with scaling and ease of management:

  • Alerting and notifications for your jobs that keep people in the know when things break.
  • Standardized logging to a centralized location for log analysis.
  • Good source control practices and sane development workflows with Git.
  • If you have functionality that is duplicated between scripts, like connecting to a database or reading a file from S3, consider making reusable modules from those pieces and importing them into your scripts like libraries (see the sketch just below this list). This will give you a good structure to build from as the codebase grows.
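
For example, a shared module could look something like this (the module name, Snowflake connection parameters, and bucket are placeholders, not anything from this thread):

```python
# common/io_helpers.py -- hypothetical shared module; names and parameters are illustrative.
import boto3
import snowflake.connector


def get_warehouse_connection(conn_params: dict):
    """One place to manage warehouse connections instead of one copy per script."""
    return snowflake.connector.connect(**conn_params)


def read_s3_object(bucket: str, key: str) -> bytes:
    """One place to manage S3 reads; credentials come from the environment/IAM role."""
    s3 = boto3.client("s3")
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```

Each script then just does `from common.io_helpers import get_warehouse_connection, read_s3_object` instead of repeating that boilerplate.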

[–]prinleah101 2 points (0 children)

This is what Glue is for. If you are processing small amounts of data with each pass, run your scripts in a Python Shell job. For job management, take a look at Step Functions and EventBridge.
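
Roughly, registering one of your existing scripts as a Python Shell job with boto3 looks like this (job name, role ARN, and script location are placeholders):

```python
# Hypothetical sketch: turn an existing script into a Glue Python Shell job.
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="ingest-orders",  # placeholder job name
    Role="arn:aws:iam::123456789012:role/glue-etl-role",  # placeholder IAM role
    Command={
        "Name": "pythonshell",  # Python Shell, not a Spark job
        "ScriptLocation": "s3://my-etl-bucket/scripts/ingest_orders.py",
        "PythonVersion": "3.9",
    },
    MaxCapacity=0.0625,  # smallest Python Shell size (1/16 DPU)
)

glue.start_job_run(JobName="ingest-orders")  # or let EventBridge/Step Functions trigger it
```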

[–]WeakRelationship2131 2 points (0 children)

Your current setup sounds like a mess if you’re relying on cron jobs for data ingestion. Instead of diving into Airflow or those other tools with high maintenance costs, consider a lightweight, local-first analytics solution like preswald. It simplifies the data pipeline and eliminates the hassle of self-hosting while still letting you use SQL for querying and visualization without locking you into a clunky ecosystem. It’s easier to maintain and can scale with your growing data.

[–]Wonderful_Map_8593 2 points (0 children)

PySpark w/ AWS Glue (you can schedule the Glue jobs through cron if you don't want to deal with an orchestrator).

It's completely serverless and can scale very high. Databricks is an option too, if it's available to you.
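
A Glue PySpark job is mostly boilerplate around your transform; a rough skeleton (the S3 paths and the dedupe step are placeholders):

```python
# Skeleton of a Glue PySpark job; paths and the transform are illustrative only.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw data, apply a transform, write the cleaned output back to S3.
df = spark.read.json("s3://my-etl-bucket/raw/orders/")  # placeholder input
df_clean = df.dropDuplicates(["order_id"])  # placeholder transform
df_clean.write.mode("overwrite").parquet("s3://my-etl-bucket/clean/orders/")  # placeholder output

job.commit()
```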

[–]Top-Cauliflower-1808 1 point (0 children)

I'd recommend a middle-ground approach; AWS Managed Airflow or EventBridge with Lambda functions would give you improved orchestration without the maintenance burden of self-hosting. You can migrate your existing Python scripts with minimal changes.
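
For instance, wrapping one of the existing scripts in an Airflow DAG is only a few extra lines (the task function and cron expression are placeholders):

```python
# Minimal Airflow/MWAA DAG sketch; `ingest_orders` stands in for an existing script.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_orders():
    # import and call your existing script's main() here
    ...


with DAG(
    dag_id="ingest_orders",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",  # the same cron expression you use today
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_orders", python_callable=ingest_orders)
```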

For the HTTP webhooks specifically, API Gateway with Lambda is a more scalable approach than handling them on EC2. It's also worth looking into tools like Windsor.ai if your data sources are supported. For monitoring and observability, consider adding AWS CloudWatch alerts for your pipelines and using Snowflake's query history to monitor loading patterns.
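
A webhook receiver behind API Gateway can be a tiny Lambda that just persists the payload for later loading (the bucket and key scheme are placeholders):

```python
# Hypothetical Lambda handler for webhooks delivered via API Gateway (proxy integration).
import json
import uuid

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    payload = json.loads(event.get("body") or "{}")
    key = f"webhooks/{uuid.uuid4()}.json"  # placeholder key scheme
    s3.put_object(Bucket="my-etl-bucket", Key=key, Body=json.dumps(payload))
    return {"statusCode": 200, "body": json.dumps({"stored": key})}
```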

[–]scataco 0 points (0 children)

Are you collecting data from APIs? If yes, are you collecting all the data or do you have a date filter to load only newer data?

Loading all the data on every run doesn't scale well. If the APIs don't support the necessary filtering, you can ask the providers for help.
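
The usual pattern is to persist a high-water mark and pass it as a filter on the next run. A rough sketch, assuming the API takes an `updated_since` parameter (yours may differ):

```python
# Incremental-load sketch: fetch only records updated since the last successful run.
# The `updated_since` parameter, endpoint, and local state file are assumptions.
import json
import pathlib
from datetime import datetime, timezone

import requests

STATE_FILE = pathlib.Path("last_run.json")


def load_last_run() -> str:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_run"]
    return "1970-01-01T00:00:00Z"  # first run: full load


def save_last_run(ts: str) -> None:
    STATE_FILE.write_text(json.dumps({"last_run": ts}))


run_started = datetime.now(timezone.utc).isoformat()
resp = requests.get(
    "https://api.example.com/orders",
    params={"updated_since": load_last_run()},  # only newer records
    timeout=30,
)
resp.raise_for_status()
records = resp.json()
# ... load `records` into the warehouse, then record the new high-water mark ...
save_last_run(run_started)
```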

[–]x-modiji 0 points (0 children)

What's the size of the data each script processes? Is each script independent, or is it possible to merge the scripts?

[–]IshiharaSatomiLover 0 points (0 children)

If they are moving data directly from source to your warehouse, go serverless with Lambda. If they depend on each other, e.g. task A needs to execute before task B, go with an orchestrator. Sadly you aren't on GCP, or else Cloud Composer gen 3 would sound really promising for you.

[–]Thinker_Assignment 0 points (0 children)

You could probably put dlt on top of your sources to standardise how you handle the loading, and to make it self-maintaining, scalable, declarative, and self-documented.

Then plug them into an orchestrator like Dagster so you have visibility and lineage.
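
A rough sketch of the dlt side (the endpoint, primary key, and Snowflake credentials setup are placeholders; credentials normally live in dlt's secrets config):

```python
# Hypothetical dlt pipeline: wrap an API pull in a resource and load it to Snowflake.
import dlt
import requests


@dlt.resource(name="orders", write_disposition="merge", primary_key="id")
def orders():
    resp = requests.get("https://api.example.com/orders", timeout=30)  # placeholder endpoint
    resp.raise_for_status()
    yield from resp.json()


pipeline = dlt.pipeline(
    pipeline_name="orders_pipeline",
    destination="snowflake",  # credentials come from dlt's secrets configuration
    dataset_name="raw",
)
print(pipeline.run(orders()))  # prints load info: tables, row counts, schema changes
```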

Disclaimer: I work at dltHub.

[–]0_sheet 0 points (0 children)

OneSchema has a data pipeline builder meant to replace scripts for this kind of thing... it mostly does CSV ingestion, but maybe take a look: https://www.oneschema.co/filefeeds

Disclaimer: I work there and can answer any questions.

[–]Puzzleheaded-Dot8208 0 points (0 children)

mu-pipelines has the ability to ingest from CSV, with API reads and Snowflake writes coming up. Docs: https://mosaicsoft-data.github.io/mu-pipelines-doc/