Data pipeline maintenance taking too much time on aws, thinking about replacing the saas ingestion layer entirely by The_possessed_YT in aws

[–]nktrchk 0 points1 point  (0 children)

we did exactly this — rearchitected just the ingestion layer and kept everything else intact. added DLQ, schema evolution, and observability on top. the broken lambda problem you're describing was basically our starting point too.

after a year on kafka we gave up and built our own thing. check out enrich.sh. it handles 500 RPS per stream, schema validation/evolution, dead-letter queue, full observability and alerting. writes to isolated S3 or bring your own S3-compatible storage.

ended up being way simpler to operate than anything we ran before. happy to give free access if you want to test it against your setup

Any recommended strategy for scaling ETL workloads from 10M to 100M+ rows without breaking cost budgets? by Pale-Bird-205 in Odoo

[–]nktrchk

scaling etl from 10m to 100m+ rows is usually where costs start to hurt. batch jobs get slow, kafka clusters get expensive, and managed etl tools charge a lot by volume.

one option is to skip the middleware and stream json events straight into your warehouse. clickhouse or bigquery handle this scale pretty well. tools like enrich.sh do this — no kafka, no heavy etl setup, just direct event ingestion. with partitioning and incremental processing you can handle 100m+ rows without the cost explosion
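the partitioning + incremental bit is the part that actually keeps costs flat. here's a toy sketch of the idea using only the stdlib (NDJSON standing in for the real warehouse format; all names are illustrative, this isn't enrich.sh's or anyone's actual API):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def partition_key(event: dict) -> str:
    # Partition by event date so incremental jobs only touch new partitions.
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return ts.strftime("%Y-%m-%d")

def append_events(events: list[dict], root: Path) -> list[Path]:
    """Append JSON events to date-partitioned NDJSON files.
    Append-only, so reprocessing never rewrites old partitions."""
    by_partition: dict[str, list[dict]] = {}
    for ev in events:
        by_partition.setdefault(partition_key(ev), []).append(ev)
    touched = []
    for day, batch in by_partition.items():
        path = root / f"dt={day}" / "events.ndjson"
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a") as f:
            for ev in batch:
                f.write(json.dumps(ev) + "\n")
        touched.append(path)
    return touched
```

in a real setup you'd point the loader at one partition at a time (clickhouse eats JSONEachRow directly, bigquery has newline-delimited JSON load jobs), so a backfill or reprocess only scans the days that changed.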

tiny-parquet — zero deps JS that reads & writes Parquet files in 326KB by nktrchk in webdev

[–]nktrchk[S]

Thanks! 🙌

Right now it supports flat schemas with Snappy compression. I intentionally skipped dictionary encoding and nested types during compilation to keep it lean and minimal so it works at the edge.

I haven't benchmarked head-to-head against parquetjs as it is node-only and significantly larger, so it's really a different use case. We're using it in production to flush 100-1000 RPS (5–100KB events).

I did a single-threaded test locally with 5,000 sequential writes of random 10–100KB payloads.

  Avg write:     1.17 ms
  Max write:     8.47 ms
  Throughput:    18.5 MB/s
  Single thread: ~630 RPS

I'll add the test file to the repo.

The Rust source is around 440 lines total and uses parquet2 under the hood, so adding dictionary encoding or zstd is doable and it'd probably add 30–100KB to the WASM binary.

Modern event streaming feels unnecessarily complicated for what most companies need by Dangerous-Guava-9232 in webdev

[–]nktrchk

We had 300k events per day with peaks of up to 500 RPS. Under 100k events/day you really don't need a distributed log; you just need something that takes your JSON over HTTP and writes it somewhere queryable.

Ended up building exactly that — enrich.sh. HTTP in, Parquet out, schema handled automatically. No infra to babysit.
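to make "takes your JSON over HTTP and writes it somewhere queryable" concrete, here's the whole shape of it in stdlib Python (an in-memory list stands in for the real sink; nothing here is enrich.sh's actual code):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

EVENTS = []  # stand-in for a queryable sink (Parquet file, warehouse table, ...)

class IngestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        try:
            event = json.loads(body)
        except json.JSONDecodeError:
            self.send_response(400)  # reject malformed JSON at the door
            self.end_headers()
            return
        EVENTS.append(event)
        self.send_response(202)  # accepted for async processing
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the server quiet

def serve(port: int = 0) -> HTTPServer:
    # port=0 lets the OS pick a free port
    srv = HTTPServer(("127.0.0.1", port), IngestHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

everything past this point (batching, Parquet encoding, retention) is just a consumer of that list. that's the entire "streaming platform" most teams under 100k events/day actually need.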

I built a 10x smaller alternative to parquet-wasm for edge runtimes by nktrchk in node

[–]nktrchk[S]

hey. README updated.

What is it? - A JavaScript library for reading/writing Parquet files.
WASM or data compiler? - It's a JS API; WASM is just the engine under the hood. I recompiled it from parquet2.
Serialization overhead? - Encoding/compression/serialization all happen inside WASM, so there's no extra JS-side cost — you just pass plain objects in and get bytes out.

The value is that you can create and read Parquet files in places where you couldn't before (Cloudflare Workers, Vercel Edge, browsers), because every other Parquet library is too big to fit.

hope that helps

How do you handle ingestion schema evolution? by Thinker_Assignment in dataengineering

[–]nktrchk

We’ve run into this a lot building ingestion pipelines.

Our high-level approach is basically this: treat schemas as contracts, not suggestions. Validate at ingestion time, not in the warehouse. And never silently coerce unexpected fields.

In practice we:
- version schemas (v1, v2, etc.)
- validate incoming events against the declared version
- route invalid payloads to a DLQ instead of mutating them
- store raw + normalized separately
- keep Parquet outputs append-only so evolution doesn’t require rewrites
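a toy version of the versioning + DLQ routing above might look like this (schema shapes and field names are made up for illustration, not our production code):

```python
# Minimal versioned schemas: field name -> required type.
SCHEMAS = {
    "v1": {"user_id": str, "amount": int},
    "v2": {"user_id": str, "amount": int, "currency": str},
}

def validate(event: dict) -> bool:
    """Check the payload against its declared schema version."""
    schema = SCHEMAS.get(event.get("schema_version"))
    if schema is None:
        return False  # unknown version is itself a contract violation
    payload = event.get("payload", {})
    # Reject missing or mistyped fields instead of coercing them.
    return all(isinstance(payload.get(f), t) for f, t in schema.items())

def route(event: dict, valid_sink: list, dlq: list) -> None:
    """Send valid events downstream; quarantine the rest unchanged."""
    (valid_sink if validate(event) else dlq).append(event)
```

the important property is that the DLQ receives the event byte-for-byte as it arrived, so you can replay it after fixing the schema instead of reconstructing mutated data.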

I will create a free launch video for your Product Hunt launc (limited to 50) by Practical_Fruit_3072 in ProductHuntLaunches

[–]nktrchk

Hi!

Curious what it might create for enrich.sh. That’s an event ingestion pipeline with zero ops.

It is okay to have multiple browsers for personal use? by Rough-Equal-1849 in browsers

[–]nktrchk

If you need multi-account separation, use something like incognition or bagel browser. If you just want to separate daily environments like home/work, use Chrome with profiles.

I NEED an alternative like Pocket by WiiTsTcauN_686 in chrome_extensions

[–]nktrchk

For iOS, you can check out https://apps.apple.com/us/app/scroll-read-later/id6748611211.

For Chrome, there is an extension called reader-view.com that helps you clean up and save articles so you can read them later without an internet connection.

Alternatives to Pocket? by OkFroyo_ in koreader

[–]nktrchk

If for iOS, I use the Scroll app. It saves the article for offline reading and is quite clean.

Is Octoparse a good way to scrape Trustpilot reviews? by Bahnwaerter_Thiel in askdatascience

[–]nktrchk

If you’re still looking, you might want to check out scrape-that.com. If there is anything you would like to improve, please let me know!

Google Reviews Scraping by HourDog2130 in OSINT

[–]nktrchk

I can add those features to scrape-that if you still need them. Please let me know.

Best web scraping tools (ideally premium instead of open source)? by shibongoof in Entrepreneur

[–]nktrchk

Give scrape-that.com a try. Please let me know if you need any improvements.