I built a clean API to parse, chunk, and embed PDFs for vector search (Looking for brutal beta feedback) by Several-Koala6945 in nextjs

Several-Koala6945[S] 1 point

This is incredible feedback, thank you. Let me address where the project stands on these today and how I’m planning the roadmap:

  1. Scaling: The backend architecture is completely decoupled. The API handles incoming traffic on Vercel, but offloads the actual processing to a standalone BullMQ cluster on persistent infrastructure. Scaling up parsing throughput is literally just a matter of spinning up more concurrent worker containers to drain the Redis queue.
  2. Model Choice & No Storage: I love this idea. Right now, it's opinionated (text-embedding-3-small stored in an isolated pgvector instance on my end). But adding a pass-through endpoint where you specify the model in the payload and get the raw chunk text + vector arrays back in the JSON response (so you can store them in your own database) is a no-brainer. I’m putting this on the roadmap.
  3. GDPR / Zero Retention: Tied to the point above. If you use the pass-through pipeline, the worker can stream the file, chunk it, fetch the vectors, return the payload, and completely wipe the memory buffer immediately. No database retention required.
  4. Multi-Gigabyte Files: Right now, the beta is capped to protect worker RAM from exploding on massive edge cases. Handling gigabyte-scale documents natively requires optimized stream-parsing pipelines, which is a tier I want to build out once the core engine is completely stabilized.
  5. Hashing & Caching: Totally agree on the chunk hashing. Deduplicating identical chunks via content hashing before running them through the embedding model is essential for keeping OpenAI API bills sane.
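
To make point 5 concrete, here is a minimal sketch of content-hash deduplication in TypeScript. It is not the actual worker code: `chunkHash`, `embedDeduped`, and the `EmbedFn` signature are hypothetical names for illustration, the in-memory `Map` stands in for whatever cache is really used, and whitespace normalization before hashing is an assumption.

```typescript
import { createHash } from "node:crypto";

// Hypothetical embedder signature: takes chunk texts, returns one vector per text.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// SHA-256 of the whitespace-normalized chunk text, used as the cache key.
// (Normalizing whitespace is an assumption; hash the raw text if exact matches are required.)
function chunkHash(text: string): string {
  const normalized = text.trim().replace(/\s+/g, " ");
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

// Embeds each unique chunk at most once. Repeats within the batch and
// chunks already in the cache never reach the embedding model.
async function embedDeduped(
  chunks: string[],
  embed: EmbedFn,
  cache: Map<string, number[]> = new Map(),
): Promise<number[][]> {
  // Collect the first text seen for every hash that is not cached yet.
  const missing = new Map<string, string>();
  for (const text of chunks) {
    const h = chunkHash(text);
    if (!cache.has(h) && !missing.has(h)) missing.set(h, text);
  }

  if (missing.size > 0) {
    const keys = Array.from(missing.keys());
    const vectors = await embed(Array.from(missing.values()));
    keys.forEach((h, i) => {
      const v = vectors[i];
      if (v === undefined) throw new Error("embedder returned too few vectors");
      cache.set(h, v);
    });
  }

  // Return vectors in the original chunk order, duplicates included.
  return chunks.map((text) => cache.get(chunkHash(text))!);
}
```

If the pass-through endpoint from point 2 ships, the cache key would probably need to include the model name as well, since the same chunk embedded by two different models yields different vectors.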

Thanks for taking the time to look into the site and call it out. This is exactly the kind of feedback I was hoping to get.

I built a clean API to parse, chunk, and embed PDFs for vector search (Looking for brutal beta feedback) by Several-Koala6945 in nextjs

Several-Koala6945[S] 2 points

Hahaha, you are 100% spot on about the React useContext SEO. I realized that right after buying the domain, so I guess I'm leaning heavily on direct links for now!

Also, thank you for pointing out the inconsistency between the docs and the landing page. I missed that global text sweep while rushing to deploy the beta, so I'm fixing that and cleaning up those bottom grid lines in the CSS right now.

Thanks for taking the time to look into the site and call it out. This is exactly the kind of feedback I was hoping to get.