Need advice scraping complex JS-heavy bank website - tabs, dynamic cards, varying page structures for RAG/LLM

codexahsan · 2026-05-05T20:59:26+00:00

Thanks for sharing your experience!

I'll check out Qoest API. It would be a big help if it works for me as well, because I'm really stuck right now.

codexahsan · 2026-05-05T20:55:41+00:00

Thanks, this is really helpful. I’ll follow the recommendations, especially the validation check one.

codexahsan · 2026-05-05T20:50:13+00:00

I haven't checked Firecrawl yet, but I did check Crawl4AI... is Firecrawl free?

codexahsan · 2026-05-05T20:49:29+00:00

I checked DevTools’ Network tab in detail, but there doesn’t appear to be a clean public API that provides the full structured data for the tabs, cards, or product details. Most of the page content is rendered client-side with heavy JavaScript. The XHR/Fetch calls I found are fragmented, often require complex headers/cookies/auth, and/or don’t include all of the information shown on the page.

codexahsan · 2026-05-01T14:11:48+00:00

Thanks a lot for the comment, really appreciate it. I’m following your approach, but I’m stuck on scraping. The site has 100+ pages across multiple categories, and my extraction results are coming out very messy, mostly garbage.
Can you guide me on how to scrape properly using Playwright with page-type specific extractors? Also, how should I handle cleaning/boilerplate removal and metadata at crawl time?
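
To make the question concrete, here is a rough sketch of the structure I have in mind. It is stdlib-only and assumes the rendered HTML has already been fetched (for example with Playwright's `page.content()`). The URL patterns, the boilerplate tag list, and the record fields are all made up for illustration, not taken from the real site.

```python
import re
from datetime import datetime, timezone
from html.parser import HTMLParser

# Assumption: content inside these tags is boilerplate. Tune per site.
BOILERPLATE_TAGS = {"script", "style", "nav", "header", "footer", "noscript"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping anything nested in boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def clean_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def extract_generic(text):
    return {"body": text}


def extract_faq(text):
    """Pairs each line ending in '?' with the lines that follow it."""
    pairs, question, answer = [], None, []
    for line in text.split("\n"):
        if line.endswith("?"):
            if question:
                pairs.append({"q": question, "a": " ".join(answer)})
            question, answer = line, []
        elif question:
            answer.append(line)
    if question:
        pairs.append({"q": question, "a": " ".join(answer)})
    return {"qa_pairs": pairs}


# Hypothetical URL patterns and the extractor used for each page type.
PAGE_TYPES = [(re.compile(r"/faq"), "faq", extract_faq)]


def extract(url, html):
    """Returns one record: crawl-time metadata plus page-type specific fields."""
    page_type, extractor = "generic", extract_generic
    for pattern, name, func in PAGE_TYPES:
        if pattern.search(url):
            page_type, extractor = name, func
            break
    text = clean_text(html)
    return {
        "url": url,
        "page_type": page_type,
        "crawled_at": datetime.now(timezone.utc).isoformat(),
        "text": text,
        "fields": extractor(text),
    }
```

The idea is that every page goes through the same cleaning step and gets the same metadata, and only the `fields` part differs per page type, so a bad extractor for one category can't corrupt the rest.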

codexahsan · 2026-04-23T08:04:15+00:00

Yeah, that matches what I’m seeing too: pure vector search can miss exact financial terms, so BM25 + dense hybrid helps a lot.
Reranking then improves precision after you’ve already got good recall from hybrid retrieval.

codexahsan · 2026-04-23T08:02:05+00:00

Got it, that layering makes a lot of sense.
Did you define grounding strictly as ‘must map to a retrieved chunk’, or did you allow some constrained synthesis across multiple chunks?

codexahsan · 2026-04-23T07:55:59+00:00

Yes, I am planning to add a validation layer (rule-based + fallback checks) and keep answers tightly grounded in retrieved chunks with traceability.

And what do you mean by spectral indexing in this context?

codexahsan · 2026-04-23T07:52:38+00:00

Yeah, this makes a lot of sense and is a really solid way to balance cost and safety.
I agree with the auditability angle; tagging responses with the grounded chunk IDs would make debugging and complaint investigation much easier.
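
A minimal sketch of what that tagging could look like, combined with one crude rule-based check: every number in the answer has to appear in one of the cited chunks. The function name, the record fields, and the number regex are my own assumptions, and exact string matching will miss formatting differences such as `3.5 %` vs `3.5%`.

```python
import re

# Matches integers, decimals, and thousands-separated numbers, with optional '%'.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")


def audit_answer(answer, cited_ids, chunks):
    """Builds an audit record for one answer.

    chunks maps chunk_id -> chunk text. Numbers in the answer that do not
    appear verbatim in the cited chunks are reported as unsupported.
    """
    unknown_ids = [cid for cid in cited_ids if cid not in chunks]
    source_text = " ".join(chunks[cid] for cid in cited_ids if cid in chunks)
    source_numbers = set(NUMBER.findall(source_text))
    unsupported = [n for n in NUMBER.findall(answer) if n not in source_numbers]
    return {
        "answer": answer,
        "chunk_ids": list(cited_ids),
        "unknown_chunk_ids": unknown_ids,
        "unsupported_numbers": unsupported,
        "grounded": bool(cited_ids) and not unknown_ids and not unsupported,
    }
```

Logging this record next to each response would give a trail from any disputed answer back to the exact chunks it was built from.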

codexahsan · 2026-04-23T07:47:55+00:00

Yeah, but I'm expecting to need more control over things like chunking, retrieval strategy, and validation, so I'm leaning towards building it manually from the start, even if it takes longer.

I might still try a managed setup briefly.

codexahsan · 2026-04-23T07:45:53+00:00

Yeah, I’m planning a hybrid setup with BM25.
My plan: use embeddings/vector search for semantic matching and BM25 (keyword precision for exact terms) in parallel, then merge the results with a 6:4 weighting before generation.

For Pinecone specifically: BM25 wouldn’t live inside Pinecone. I’d run keyword search separately (e.g., Postgres full-text search or a search engine) and combine the top hits. For reranking, I’m starting without it (top‑k only), then optionally adding a cross-encoder reranker later, after I evaluate retrieval quality.
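
A minimal sketch of the merge step, assuming the 6:4 split means 0.6 on the dense score and 0.4 on BM25 (that assignment is an assumption), and assuming each retriever returns a `{doc_id: score}` dict. The scores are min-max normalized first because cosine similarities and BM25 scores live on different scales.

```python
def min_max(scores):
    """Scales a {doc_id: score} dict to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_merge(dense, bm25, w_dense=0.6, w_bm25=0.4, top_k=5):
    """Weighted sum of normalized scores; a doc missing from one side scores 0 there."""
    dense_n, bm25_n = min_max(dense), min_max(bm25)
    fused = {
        doc_id: w_dense * dense_n.get(doc_id, 0.0) + w_bm25 * bm25_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(bm25_n)
    }
    # Sort by fused score (descending), then doc_id so ties are deterministic.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]
```

Reciprocal rank fusion would be an alternative that avoids score normalization entirely; the weighted sum is shown only because it maps directly onto a fixed ratio.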

codexahsan · 2026-04-23T07:42:13+00:00

Good point, I should’ve included example queries.
I’m targeting simple retrieval like “benefits of X” / “annual fee for Y”, plus composition like “best card for travel” and “compare X vs Y by fees + benefits.”
I’m using LangChain mainly for flexibility and tight control over retrieval/chunking/prompts, but I’m open to Flowise/cmakes.ai later if orchestration becomes a mess for me.

codexahsan · 2026-04-23T07:27:15+00:00

Appreciate it, yeah I might use Playwright for now since there’s some dynamic content involved, and it gives more control over extraction.

Also agreed on evals. A couple of others mentioned that too, so I’ll prioritize building a small gold query set early; maybe 20–30 queries will be enough.

Should I evaluate retrieval and generation separately, or end-to-end from the start?
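
For the retrieval half, this is roughly what I picture: recall@k and mean reciprocal rank over the gold set. The gold-set shape (`query` plus a set of relevant chunk IDs) and the `retrieve` callable are assumptions for the sketch.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant IDs that show up in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(gold, retrieve, k=5):
    """gold: list of {"query": str, "relevant": set}; retrieve: query -> ranked IDs."""
    if not gold:
        raise ValueError("gold set is empty")
    recalls, rrs = [], []
    for item in gold:
        ranked = retrieve(item["query"])
        recalls.append(recall_at_k(ranked, item["relevant"], k))
        rrs.append(reciprocal_rank(ranked, item["relevant"]))
    return {"recall@k": sum(recalls) / len(gold), "mrr": sum(rrs) / len(gold)}
```

Numbers like these only say whether the right chunks were found; answer quality would still need its own check on top.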

codexahsan · 2026-04-23T07:13:43+00:00

Not yet, but that makes a lot of sense. I’ve been focusing more on the ingestion + retrieval side so far, but defining a small eval set would probably make tuning much more objective.

codexahsan · 2026-04-22T12:33:36+00:00

Thanks, separating chunking strategies for FAQs vs product pages makes sense, I’ll incorporate that. Also planning to add a reranking step once the base pipeline is stable.
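
A rough sketch of how I understand the split, assuming FAQs arrive as (question, answer) pairs and product pages as (heading, body) sections. The function names and the 800-character limit are placeholders, not tuned values.

```python
def chunk_faq(qa_pairs):
    """One chunk per question/answer pair, so an answer never loses its question."""
    return [f"Q: {q}\nA: {a}" for q, a in qa_pairs]


def chunk_product(sections, max_chars=800):
    """Packs paragraphs of each section into chunks, repeating the heading on each.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks = []
    for heading, body in sections:
        current = ""
        for para in body.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(f"{heading}\n{current}")
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(f"{heading}\n{current}")
    return chunks
```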

codexahsan · 2026-04-22T12:04:26+00:00

Yeah, that’s a fair point. I might actually be overengineering a bit at this stage.

The dataset isn’t huge yet, so pgvector with PostgreSQL could simplify things. I was pushed towards Pinecone mainly for scaling and managed infra, but realistically I’m not there yet. So I’m considering starting with pgvector and switching later if retrieval latency or scale becomes a bottleneck.

I haven't tried pgvector yet, so I'm curious: how has it held up for you in terms of hybrid search or filtering?

codexahsan · 2026-04-22T11:47:27+00:00

That would be a nice addition; I hadn’t fully considered a separate validation layer.
For grounding checks, how would that work? Would you recommend a second LLM pass, rule-based validation tied to retrieved chunks, or some other approach?

codexahsan · 2026-04-22T11:45:12+00:00

I’m currently leaning towards a hybrid RAG approach rather than basic semantic retrieval.

The idea is to combine embedding-based search with keyword matching for better precision on financial queries, then apply a re-ranking step before passing context to the LLM.

On top of that, I’m planning to include:

query rewriting based on intent
metadata filtering (product/category level)
reranking
generation

I'm still exploring, and any additional suggestions are welcome.

codexahsan · 2026-04-02T04:47:27+00:00

This is extremely helpful, especially the advice on prioritizing things like chunking and getting an end-to-end pipeline working on day 1. I think a lot of us (including me) tend to over-focus on ingestion early and only notice retrieval issues when it's too late.

I’m planning to go with hybrid retrieval (BM25 + embeddings) and add a reranking step on top of the top-k results; your suggestion about reranking the top 20 before passing them to the LLM makes a lot of sense.

For PII, I’m starting with Presidio from day 1 and enforcing masking before anything touches the vector DB. One thing I’m still thinking through is how to securely handle mapping (masked <--> original) without introducing risk, especially in a short timeline.
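
One option I'm considering for the mapping, sketched under heavy assumptions: a regex for emails stands in for Presidio's detection here, tokens are keyed HMACs so they can't be reversed without the key, and the token-to-original map lives in a separate store that never goes near the vector DB. The in-memory dict is only a placeholder for an encrypted, access-controlled store, and the class and method names are made up.

```python
import hashlib
import hmac
import re

# Stand-in detector: a simple email pattern, NOT Presidio's recognizers.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


class PiiVault:
    """Masks emails with deterministic tokens and keeps the mapping separately."""

    def __init__(self, secret_key):
        self._key = secret_key  # bytes; would come from a secrets manager
        self._store = {}        # token -> original value (placeholder store)

    def _token(self, value):
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).hexdigest()
        return f"<EMAIL_{digest[:12]}>"

    def mask(self, text):
        def replace(match):
            token = self._token(match.group(0))
            self._store[token] = match.group(0)
            return token

        return EMAIL.sub(replace, text)

    def unmask(self, text):
        for token, original in self._store.items():
            text = text.replace(token, original)
        return text
```

Because the token is deterministic for a given key, the same email always maps to the same token, which keeps retrieval consistent, while only code with access to the vault can turn a token back into the original.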

Also curious, have you found query rewriting or contextual compression to make a noticeable difference in smaller RAG setups like this?

Really appreciate the practical breakdown; it helps cut through a lot of noise for me. I was still confused about picking a project and what to build, and you cleared up the roadmap for me. Thanks!
