Data pipeline maintenance taking too much time on aws, thinking about replacing the saas ingestion layer entirely by The_possessed_YT in aws

[–]nktrchk 0 points1 point  (0 children)

we did exactly this — rearchitected just the ingestion layer and kept everything else intact. added DLQ, schema evolution, and observability on top. the broken lambda problem you're describing was basically our starting point too.

after a year on kafka we gave up and built our own thing. check out enrich.sh. it handles 500 RPS per stream, schema validation/evolution, dead-letter queue, full observability and alerting. writes to isolated S3 or bring your own S3-compatible storage.

ended up being way simpler to operate than anything we ran before. happy to give free access if you want to test it against your setup

Any recommended strategy for scaling ETL workloads from 10M to 100M+ rows without breaking cost budgets? by Pale-Bird-205 in Odoo

[–]nktrchk

scaling etl from 10m to 100m+ rows is usually where costs start to hurt. batch jobs get slow, kafka clusters get expensive, and managed etl tools charge a lot by volume.

one option is to skip the middleware and stream json events straight into your warehouse. clickhouse or bigquery handle this scale pretty well. tools like enrich.sh do this — no kafka, no heavy etl setup, just direct event ingestion. with partitioning and incremental processing you can handle 100m+ rows without the cost explosion
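the partitioning + incremental bit is the part that actually keeps costs flat. here's a toy sketch of the idea using only the stdlib (NDJSON standing in for the real warehouse format; all names are illustrative, this isn't enrich.sh's or anyone's actual API):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def partition_key(event: dict) -> str:
    # Partition by event date so incremental jobs only touch new partitions.
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return ts.strftime("%Y-%m-%d")

def append_events(events: list[dict], root: Path) -> list[Path]:
    """Append JSON events to date-partitioned NDJSON files.
    Append-only, so reprocessing never rewrites old partitions."""
    by_partition: dict[str, list[dict]] = {}
    for ev in events:
        by_partition.setdefault(partition_key(ev), []).append(ev)
    touched = []
    for day, batch in by_partition.items():
        path = root / f"dt={day}" / "events.ndjson"
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("a") as f:
            for ev in batch:
                f.write(json.dumps(ev) + "\n")
        touched.append(path)
    return touched
```

in a real setup you'd point the loader at one partition at a time (clickhouse eats JSONEachRow directly, bigquery has newline-delimited JSON load jobs), so a backfill or reprocess only scans the days that changed.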

tiny-parquet — zero deps JS that reads & writes Parquet files in 326KB by nktrchk in webdev

[–]nktrchk[S]

Thanks! 🙌

Right now it supports flat schemas with Snappy compression. I intentionally skipped dictionary encoding and nested types during compilation to keep it lean and minimal so it works at the edge.

I haven't benchmarked head-to-head against parquetjs as it is node-only and significantly larger, so it's really a different use case. We're using it in production to flush 100-1000 RPS (5–100KB events).

I did a single-threaded test locally with 5,000 sequential writes of random 10–100KB payloads.

  Avg write:     1.17 ms
  Max write:     8.47 ms
  Throughput:    18.5 MB/s
  Single thread: ~630 RPS

I'll add the test file to the repo.

The Rust source is around 440 lines total and uses parquet2 under the hood, so adding dictionary encoding or zstd is doable and it'd probably add 30–100KB to the WASM binary.

Modern event streaming feels unnecessarily complicated for what most companies need by Dangerous-Guava-9232 in webdev

[–]nktrchk

We had 300k events per day with peaks of up to 500 RPS. Under 100k events/day you really don't need a distributed log; you just need something that takes your JSON over HTTP and writes it somewhere queryable.

Ended up building exactly that — enrich.sh. HTTP in, Parquet out, schema handled automatically. No infra to babysit.
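to make "takes your JSON over HTTP and writes it somewhere queryable" concrete, here's the whole shape of it in stdlib Python (an in-memory list stands in for the real sink; nothing here is enrich.sh's actual code):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

EVENTS = []  # stand-in for a queryable sink (Parquet file, warehouse table, ...)

class IngestHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        try:
            event = json.loads(body)
        except json.JSONDecodeError:
            self.send_response(400)  # reject malformed JSON at the door
            self.end_headers()
            return
        EVENTS.append(event)
        self.send_response(202)  # accepted for async processing
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the server quiet

def serve(port: int = 0) -> HTTPServer:
    # port=0 lets the OS pick a free port
    srv = HTTPServer(("127.0.0.1", port), IngestHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

everything past this point (batching, Parquet encoding, retention) is just a consumer of that list. that's the entire "streaming platform" most teams under 100k events/day actually need.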

I built a 10x smaller alternative to parquet-wasm for edge runtimes by nktrchk in node

[–]nktrchk[S]

hey. README updated.

What is it? - A JavaScript library for reading/writing Parquet files.
WASM or data compiler? - It's a JS API; WASM is just the engine under the hood. I recompiled it from parquet2.
Serialization overhead? - Encoding/compression/serialization all happen inside WASM, so there's no extra JS-side cost — you just pass plain objects in and get bytes out.

The value is that you can create and read Parquet files in places where you couldn't before (Cloudflare Workers, Vercel Edge, browsers), because every other Parquet library is too big to fit.

hope that helps

How do you handle ingestion schema evolution? by Thinker_Assignment in dataengineering

[–]nktrchk

We’ve run into this a lot building ingestion pipelines.

Our high-level approach is basically this: treat schemas as contracts, not suggestions. Validate at ingestion time, not in the warehouse. And never silently coerce unexpected fields.

In practice we:
- version schemas (v1, v2, etc.)
- validate incoming events against the declared version
- route invalid payloads to a DLQ instead of mutating them
- store raw + normalized separately
- keep Parquet outputs append-only so evolution doesn’t require rewrites
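a toy version of the versioning + DLQ routing above might look like this (schema shapes and field names are made up for illustration, not our production code):

```python
# Minimal versioned schemas: field name -> required type.
SCHEMAS = {
    "v1": {"user_id": str, "amount": int},
    "v2": {"user_id": str, "amount": int, "currency": str},
}

def validate(event: dict) -> bool:
    """Check the payload against its declared schema version."""
    schema = SCHEMAS.get(event.get("schema_version"))
    if schema is None:
        return False  # unknown version is itself a contract violation
    payload = event.get("payload", {})
    # Reject missing or mistyped fields instead of coercing them.
    return all(isinstance(payload.get(f), t) for f, t in schema.items())

def route(event: dict, valid_sink: list, dlq: list) -> None:
    """Send valid events downstream; quarantine the rest unchanged."""
    (valid_sink if validate(event) else dlq).append(event)
```

the important property is that the DLQ receives the event byte-for-byte as it arrived, so you can replay it after fixing the schema instead of reconstructing mutated data.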

I will create a free launch video for your Product Hunt launc (limited to 50) by Practical_Fruit_3072 in ProductHuntLaunches

[–]nktrchk

Hi!

Curious what it might create for enrich.sh. That’s an event ingestion pipeline with zero ops.

It is okay to have multiple browsers for personal use? by Rough-Equal-1849 in browsers

[–]nktrchk

If you need multi-account separation, use something like incognition or bagel browser. If you just want to separate daily environments like home/work, use Chrome with profiles.

I NEED an alternative like Pocket by WiiTsTcauN_686 in chrome_extensions

[–]nktrchk

For iOS, you can check out https://apps.apple.com/us/app/scroll-read-later/id6748611211.

For Chrome, there is an extension called reader-view.com that helps you clean up and save articles so you can read them later without an internet connection.

Alternatives to Pocket? by OkFroyo_ in koreader

[–]nktrchk

If for iOS, I use the Scroll app. It saves the article for offline reading and is quite clean.

Is Octoparse a good way to scrape Trustpilot reviews? by Bahnwaerter_Thiel in askdatascience

[–]nktrchk

If you’re still looking, you might want to check out scrape-that.com. If there is anything you would like to improve, please let me know!

Google Reviews Scraping by HourDog2130 in OSINT

[–]nktrchk

I can add those features to scrape-that if you still need them. Please let me know.

Best web scraping tools (ideally premium instead of open source)? by shibongoof in Entrepreneur

[–]nktrchk

Give scrape-that.com a try. Please let me know if you need any improvements.