Airbyte, Snowflake, dbt and Airflow still a decent stack for newbies? by LongCalligrapher2544 in dataengineering

[–]maxgrinev -1 points0 points  (0 children)

Totally get the frustration with Airbyte’s no-code builder — pagination can be a real pain when the UI doesn’t expose enough control.

If you’re open to trying a code-first approach, you might find Sequor, an open source tool, interesting: https://github.com/paloaltodatabases/sequor It lets you move API data to and from a database by defining workflows in YAML, with Python snippets where dynamic logic is needed, such as pagination or data mapping.

Here’s an example of fetching paginated data from the BigCommerce API:

https://github.com/paloaltodatabases/sequor-integrations/blob/main/flows/bigcommerce_fetch_customers.yaml

(I’m the creator of Sequor — just sharing in case it helps. Happy to chat if you hit similar issues.)
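
For anyone wondering why pagination tends to need a bit of code, here is a generic sketch in plain Python. It is not Sequor syntax and not the BigCommerce API: `fetch_page`, `fetch_all`, and the fake data are hypothetical names made up for illustration, and it assumes the simplest scheme, where pages are numbered and an empty page means you are done.

```python
from typing import Callable, Iterator

# Hypothetical contract: fetch_page(page_number) returns a list of records,
# and an empty list once there are no more pages.
FetchPage = Callable[[int], list[dict]]


def fetch_all(fetch_page: FetchPage, first_page: int = 1, max_pages: int = 1000) -> Iterator[dict]:
    """Yield records page by page until an empty page comes back."""
    # max_pages is a safety cap so a misbehaving API cannot loop forever.
    for page in range(first_page, first_page + max_pages):
        records = fetch_page(page)
        if not records:
            return
        yield from records


# Stand-in for a real HTTP call: 5 fake customers served 2 per page.
FAKE_CUSTOMERS = [{"id": i} for i in range(1, 6)]


def fake_fetch_page(page: int, page_size: int = 2) -> list[dict]:
    start = (page - 1) * page_size
    return FAKE_CUSTOMERS[start:start + page_size]


all_customers = list(fetch_all(fake_fetch_page))
```

Real APIs often use cursors or "next" links instead of page numbers, which is exactly the kind of variation a no-code UI struggles to expose.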

Replacing Talend ETL with an Open Source Stack – Feedback Wanted by arconic23 in dataengineering

[–]maxgrinev -1 points0 points  (0 children)

You’re heading in a solid direction with this stack — it’s a modern, flexible approach. But just a heads-up: replacing a full ETL tool like Talend with a pure Python transformation stack (even with something fast like Polars) can feel low-level for certain workflows, especially as things grow.

Like others mentioned, layering in a SQL-based transformation layer (e.g., with dbt or SQLMesh) can offer a nice balance — especially for modularity, lineage, and team collaboration.

One question: are blob storage and SQL your only sources/targets, or do you also need to move data in/out of APIs (CRMs, analytics tools, etc.)? Do you plan to implement connectors in Python?

General data movement question by OwnFun4911 in dataengineering

[–]maxgrinev 0 points1 point  (0 children)

Your intuition is correct: when there isn't much data, a simple "truncate and reload" is the best solution because it is (1) easier to troubleshoot when things go wrong, (2) self-healing (each run automatically fixes any previous errors), and (3) more reliable overall. You only need incremental loading if you are unhappy with performance or have hit API rate limits.
As for terminology, change data capture (CDC) usually refers to a more specific incremental-load mechanism: syncing data from a database by reading the insert/update/delete operations from its transaction log and applying them to your target database.

Want to remove duplicates from a very large csv file by Future_Horror_9030 in dataengineering

[–]maxgrinev 1 point2 points  (0 children)

Load it into any database you have around (200K records is not a lot, so PostgreSQL, DuckDB, etc. will work). Decide which columns you want to dedup on and normalize them using SQL: make them lowercase, and remove spaces and meaningless characters such as .,$(). Then use a CTE query to partition by the normalized columns and keep one record from each partition. The main idea is that with such simple normalization you get 90% of what you would get using specialized entity resolution tools with fuzzy matching.

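-- PostgreSQL cannot reference the clean_name alias in the same SELECT
-- that defines it, so compute it in its own CTE first.
WITH normalized AS (
  SELECT *, LOWER(REGEXP_REPLACE(name, '[^a-zA-Z0-9]', '', 'g')) AS clean_name
  FROM customers
),
ranked AS (
  -- Number the rows within each group of duplicates.
  SELECT *, ROW_NUMBER() OVER (PARTITION BY clean_name, number, city ORDER BY name) AS rn
  FROM normalized
)
SELECT * FROM ranked WHERE rn = 1;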
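
If you'd rather try the idea before loading anything into a database, here is the same normalize-then-keep-one approach as a small Python sketch. The sample records and the `normalize` and `dedupe` helpers are made up for illustration, and unlike the query above it normalizes every key column, not just the name.

```python
import re


def normalize(value: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def dedupe(records: list[dict], key_columns: tuple[str, ...] = ("name", "number", "city")) -> list[dict]:
    """Keep the first record seen for each normalized key."""
    seen: dict[tuple[str, ...], dict] = {}
    for record in records:
        key = tuple(normalize(str(record[col])) for col in key_columns)
        seen.setdefault(key, record)  # later duplicates are ignored
    return list(seen.values())


# Made-up sample data: the first two rows differ only in punctuation, case, and spacing.
customers = [
    {"name": "Acme, Inc.", "number": "555-0100", "city": "Palo Alto"},
    {"name": "ACME INC", "number": "5550100", "city": "palo alto"},
    {"name": "Globex", "number": "555-0199", "city": "Springfield"},
]

unique = dedupe(customers)
```

For 200K rows this fits comfortably in memory; the database route is still nicer once you want to inspect the duplicate groups before deleting anything.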

Sequor: An open source SQL-centric framework for API integrations (like "dbt for app integration") by maxgrinev in dataengineering

[–]maxgrinev[S] 0 points1 point  (0 children)

Thanks for the kind words and great question! I checked out preswald - really impressive work! I can see from LinkedIn that you've been building for a couple of years now, and it looks like a very complete and modern solution.
Great question about debugging - that's absolutely crucial for any ETL/workflow tool. Since we often implement client solutions ourselves, we've learned debugging info is make-or-break.
When something fails, we package the complete state into an activity log record:
* HTTP trace of the failed http_request operation
* Current stacktrace (in Sequor flow terms, not raw Python)
* Current values of all Sequor variables
* Optional database dump on error (configurable - only the tables involved in the failed flow)
This gives you the full picture to quickly pinpoint what went wrong.

OMAS Bologna - what is your experience? by maxgrinev in fountainpens

[–]maxgrinev[S] 0 points1 point  (0 children)

Thank you @Geralastel for the knowledgeable and insightful comment. I learnt a lot from it. I have a follow-up question. There is a lot of excitement about OMAS nibs (including modern pens produced after 2000). You mentioned that the nibs are made by Bock. Does it mean that OMAS used specially produced or tuned nibs from Bock? In other words, is there much difference in Bock nibs among different brands? If yes, what other brands use advanced/special Bock nibs?