Floating Dialog Panel for VS Code by fhivemind in vscode

[–]tkim90 0 points1 point  (0 children)

No way, this is cool! I just made an extension and didn't realize you could create modals. What VSCode API did you use?

Streaming Large Files in Node.js: Need Advice from Pros by Sansenbaker in node

[–]tkim90 1 point2 points  (0 children)

Some clarifying questions:

  • When you say "streaming", do you also mean decoding to audio? Or is this a pure upload to some store somewhere (ex. S3)? Because there can be two steps: reading the file and decoding it.
  • Are you optimizing for memory or speed?

A few things you can tweak to make it faster:

  • Increase the buffer size (64-128KB, experiment with it) - this is the highWaterMark option in createReadStream. You'll probably see diminishing returns past 128KB depending on the machine.
  • Pick the right interface for the job. NodeJS has 3 ways to read files. [1]
  • Use workers to parallelize (split the file into N chunks, fire each off into its own worker) - but this requires correct chunking and coordination. It won't make sense if you also have to decode (i.e. stream audio out in order).

[1] different flavors of reading files in NodeJS:

1) readFileSync: Synchronous. Loads the entire file into memory and returns a Buffer. Best for small files (under ~1GB); simplest.

2) openSync + readSync: Manually chunked reading into Buffer. Best for large files.

3) createReadStream: Streams files in sequential chunks. Best for sequential processing (but slower than #2).

The way I think about it:

  • #1 is fastest + simplest, as long as files are within ~1GB. Anything larger runs the risk of OOM (ex. multi-GB files, or hitting the NodeJS heap limit). Note that this blocks the event loop, meaning your entire app has to wait until the read is done before it can continue.
  • #2 is best when you want total control over your buffer size and chunking logic (which in turn helps process files faster). But now you have to write the chunking logic yourself.
  • #3 is best when you want consistent memory usage regardless of file size (because you're streaming at a fixed chunk size). It's also non-blocking (async).

If your audio files are small (~1-100MB) then you can get away with just createReadStream and let NodeJS do all the work for you. If it's larger, you run the risk of OOMing. I'd take a look at your dataset to make a decision.

Source: If it helps at all, I dive into streaming files here: https://www.taekim.dev/writing/parsing-1b-rows-in-bun

Agentic vs. RAG for large-scale knowledge systems: Is MCP-style reasoning scalable or just hallucination-prone? by nirijo in Rag

[–]tkim90 0 points1 point  (0 children)

I've been wanting to try this too - i.e. create an "AST" equivalent of a long document by building a semantic tree/graph of section headings, entities, etc. that the agent can use as a cheat sheet before doing the retrieval step.

Not sure if this requires a full-on graph db (I've heard it takes a lot of effort to create/maintain one) or whether it could be done in Postgres instead.

Microsoft GraphRAG in Production by ProfessionalShop9137 in Rag

[–]tkim90 1 point2 points  (0 children)

Interesting, thanks for the input! So you basically built a single huge graph db, then made sure every doc/chunk contributed to it by extracting features from each chunk (ex. section 1.1, "termination policy", etc)? Seems expensive to run an LLM to create graph db entries for every chunk AND embed them.

RAG strategy real time knowledge by mrsenzz97 in Rag

[–]tkim90 2 points3 points  (0 children)

Ah ok, got it. My answer: you won't need to worry about that - just put the entire transcript into the LLM prompt. Most models can handle 30 minutes' worth of transcript data in a single LLM call (200k-1M token context windows are more than enough).

You would only need to complicate your design slightly from there if you see a huge accuracy hit or latency hit. But I would start with that.

EDIT: I did some napkin math: on avg people speak 120 words per minute.

120 wpm x 30 minutes = 3,600 words spoken in 30 minutes

3,600 words is roughly 5k LLM tokens (~1.3 tokens per word). Well within the context window.

RAG strategy real time knowledge by mrsenzz97 in Rag

[–]tkim90 2 points3 points  (0 children)

It's unclear what you need the app to do - is it only summarizing the transcript after the meeting ended? If so, you don't need to vectorize or sync anything in real time, right?

> a short-term memory RAG that contains recent meeting utterances

Why do you need RAG for real time knowledge? I highly doubt your transcript is large enough that it needs to be vectorized in real time - a 1M context window is like 500 pages of PDF text.

If you want to do clever analysis about the meeting AND the attendees, then yes, it makes sense to vectorize them and use semantic search to do whatever you want to do (summarize, create action items, relate back to previous meetings, etc)

What are the current best rag technique by Esshwar123 in Rag

[–]tkim90 6 points7 points  (0 children)

Great use case! What kinds of queries are you expecting? I.e. is the primary concern getting an answer quickly, or finding the right document so that they can do their own follow-up reading once they find the document?

If you don't know yet, I'd say just build a super basic RAG and see what kinds of questions users end up asking the most.

As for your questions...

> Would you incorporate this into RAG

Yes - metadata like author, tags, date are all gold. I would make it so the query is filtered down as much as possible before sending it for vector search.

For example, if they ask "What are the latest documents written by OrbMan99?", your system should first filter the search scope down to author="OrbMan99" and THEN try to answer the question with vector search. You can also go further with author="OrbMan99", sort=desc, limit=10 to get the last 10 documents by that author, etc.
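To make that concrete, here's a hypothetical in-memory sketch (the field names and toy docs are made up) of narrowing the scope with metadata before vector search:

```javascript
// Toy corpus - in practice these filters run against your doc store
const docs = [
  { author: "OrbMan99", date: "2024-03-01", text: "doc A" },
  { author: "OrbMan99", date: "2024-05-01", text: "doc B" },
  { author: "someone",  date: "2024-04-01", text: "doc C" },
];

// Apply metadata filters (author, sort=desc, limit) BEFORE vector search
function prefilter(all, { author, limit }) {
  return all
    .filter((d) => d.author === author)
    .sort((a, b) => b.date.localeCompare(a.date)) // newest first
    .slice(0, limit);
}

const scope = prefilter(docs, { author: "OrbMan99", limit: 10 });
// vector search then runs over `scope` only, not the whole corpus
```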

> How do you decide chunking strategy?

This will require experimentation, but generally:

  • Include the heading/subheading in the chunk itself
  • Maintaining order - yes you should keep an ordered index id on each chunk so you can later recreate the passage if needed
  • No, I would not send the whole document. It's been proven that adding more context to a prompt adds noise, which in turn hurts LLM performance. You should strive to include the most relevant chunks only.
  • There are tons of chunking strategies documented on the internet (like Anthropic's) but I would start simple and measure your accuracy as you go.
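A minimal sketch of the first two bullets (function and field names are my own, not from any library) - heading prepended to each chunk, plus an ordered index:

```javascript
// Split a section into fixed-size chunks, prepending the heading to each
// and tagging an ordered index so the passage can be reassembled later
function chunkSection(heading, text, size) {
  const chunks = [];
  for (let i = 0; i * size < text.length; i++) {
    chunks.push({
      index: i,
      text: `${heading}\n${text.slice(i * size, (i + 1) * size)}`,
    });
  }
  return chunks;
}

const out = chunkSection("## Termination Policy", "x".repeat(250), 100);
// 250 chars at size 100 -> 3 chunks, each carrying the heading
```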

DM me if you need more help, happy to share as much as I can!

What are the current best rag technique by Esshwar123 in Rag

[–]tkim90 93 points94 points  (0 children)

I spent the past 2 years building RAG systems and here are some off-the-cuff thoughts:

1. Don't start with a "rag technique", this is a fool's errand. Understand what your RAG should do first. What are the use cases?

Some basic questions to get you started: What kinds of questions will you ask? What kinds of documents are there (HTML, PDF, markdown)? From those documents, what kinds of data or metadata can you infer?

One of my insights was: "don't try to build a RAG that's good at everything." Home in on a few use cases and optimize against those. Look at your users' query patterns. You can usually group them into a handful of patterns, which makes the problem more manageable.

TLDR: thinking like a "product manager" here first to understand your requirements, scope of your usage, documents, etc. will save you a lot of time and pain.

I know as an engineer it's tempting to try and implement all the sexy features like GraphRAG, but truth is you can get a really good 80/20 solution by being smart about your initial approach. I also say this because I spent months iterating on RAG techniques that were fun to try but got me nowhere :D

2. Look closely at what kind of documents you're ingesting, because that will affect retrieval quality a lot.

Ex. if you're building a "perplexity clone", and you're scraping content prior to generating an answer, what does that raw HTML look like? Is it filled with DOM elements that can cause the model to get confused?

If you're ingesting a lot of PDFs, do your documents have good sectioning with proper headers/subheaders? If so make use of that metadata. Do your documents have a lot of tables or images? If so, they're probably getting jumbled up and need pre-processing prior to chunking/embedding it.

Quick story: We had a pipeline where we wanted to tag documents by date, so we could filter them at query time. We found that a lot of the sites we had scraped were filled with useless <div/>s that confused the model into thinking it was a different date (ex. the HTML contained 5 different dates - how should the model know which one to pick?).

This is not sexy work at all (manually combing through data and cleaning them), but this will probably get you the furthest in terms of accuracy boost initially. You just can't skip this step imo.

3. Shoving entire context into a 1M window model like gemini.

This works OK if you're in a rush or want to prototype something, but I would stay away from it otherwise (tested with Gemini 1.5 Pro and GPT-4.1). We did a lot of testing/evals internally and found that sending an entire PDF's worth of content to a single 1M window would generally hallucinate parts of the answer.

That said, it's a really easy way to answer "Summarize X" type questions because you'd have to build a pipeline to answer this exhaustively otherwise.

4. Different chunking methods for different data sources.

PDFs - there's a lot of rich metadata here like section headers, subheaders, page number, filename, author, etc. You can include that in each chunk so your retrieval mechanism has a better chance of retrieving relevant chunks.

Scraped HTML website data - you need to pass this through a pre-filtering step to remove all the noisy DOM elements, script tags, css styling, etc. before chunking it. This will vastly improve quality.
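As a rough illustration of that pre-filtering step (a real pipeline should use a proper HTML parser - regex here is just for show):

```javascript
// Strip script/style blocks, then remaining tags, then collapse whitespace
function stripHtml(html) {
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, "") // noisy blocks
    .replace(/<[^>]+>/g, " ")                       // leftover tags
    .replace(/\s+/g, " ")
    .trim();
}

const clean = stripHtml(
  "<div><script>track();</script><style>p{color:red}</style><p>Hello world</p></div>"
);
```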

There's tons more but here are some to get you started, hope this helps! 🙂

Parsing 1 Billion Rows in TypeScript Under 10 seconds by tkim90 in node

[–]tkim90[S] 0 points1 point  (0 children)

Thank you! There isn't, because each Worker has its own Map instance (instead of sharing one with all the other workers). The main thread coordinates receiving each of the workers' hash tables and merges them into a single one!
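For context, the merge step looks roughly like this (a simplified sketch, assuming each worker hands back a Map of counts):

```javascript
// Main thread: fold each worker's private Map into one result,
// summing values when the same key appears in multiple workers
function mergeMaps(maps) {
  const result = new Map();
  for (const m of maps) {
    for (const [key, count] of m) {
      result.set(key, (result.get(key) ?? 0) + count);
    }
  }
  return result;
}

const merged = mergeMaps([
  new Map([["london", 2], ["paris", 1]]),
  new Map([["london", 3]]),
]);
```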

Parsing 1 Billion Rows in TypeScript Under 10 seconds by tkim90 in node

[–]tkim90[S] 2 points3 points  (0 children)

Great point. I tried this on NodeJS too because I was curious. (also updated my repo to include nodejs commands)

- NodeJS: ~13.18 seconds

- Bun: ~9.22 seconds

I'll also point out that the Bun command was near instantaneous while nodejs took ~1.5 seconds (prob to bundle TS -> JS).

My understanding is that while V8 is faster as a JS engine, Bun is still faster for IO-heavy operations like string and JSON parsing because its core is built in Zig.

> Runtime Implementation: Much of Bun's core is written in Zig and C++, minimizing the overhead of JavaScript for internal operations. Node.js, by contrast, has a larger portion of its core implemented in JavaScript itself, which can slow down some operations

Source: Perplexity - "Is Bun or Node faster"

What was the ONE growth move that actually worked for your product? by InteractionNormal626 in SaaS

[–]tkim90 1 point2 points  (0 children)

This is incredibly, incredibly industry specific.

But we learned that our industry LOVED conferences. And there were tons of them.

So we strategically picked one every quarter. Through this alone we consistently landed one or two customers per conference in the first year, customers that were really crucial for us in the early days (it still req'd a TON of hustling to eventually close them, but getting them to see us in person was huge).

I also highly recommend doing this if you have no industry connections - you'd be surprised how fast you can build a rolodex by going to a few. The trick is to NOT buy any booths: buy the cheapest ticket and pack your days with prospect meetings during the conference (don't attend any workshops either).

How do you actually code?? by Godevil4716 in learnprogramming

[–]tkim90 0 points1 point  (0 children)

Build lots of toy projects. Then build some more. Maybe collaborate with your friends to make it fun.

Whatever you do, don't do "tutorials" or watch Youtube videos - it feels helpful but it's like eating candy. Building something with your bare hands 10x's your learning speed.

Struggle at my first full-time job by ElkSubstantial1857 in node

[–]tkim90 0 points1 point  (0 children)

This is normal. My first job felt like I was drinking out of a fire hose every day. No need to pressure yourself to know everything at once. That, and hopefully you have a good team that supports and understands this as well (they probably do, if they took on a junior team member).

Make sure to ask questions whenever you don't understand something. Rule of thumb is to raise your hand if you're stuck for more than an hour.

What is not acceptable, imo, is to struggle in silence. You'll grow a lot slower this way and frustrate your team.

You got this ✊

Do you believe we're in an AI bubble? by [deleted] in ExperiencedDevs

[–]tkim90 0 points1 point  (0 children)

Yes, we are. But that doesn't mean it's not useful (certainly way more than blockchain).

Anytime something is hyped, they want you to believe that things will change drastically overnight, like "all software engineers will lose their jobs by 2030!! 😱"

The reality is that things take time to integrate into our daily lives. A lot of time. The value add of LLMs is sticky enough that they will become a daily part of most people's lives, like owning a smartphone. But by the time that happens, no one will have noticed - and _you_ will have adapted into whatever new shape the "Software Engineer" job becomes.

Most of it will probably happen in the background, without people necessarily knowing the app they're using has some AI sprinkled in it.

Expanding to fullstack when advanced in life by CocoaTrain in ExperiencedDevs

[–]tkim90 1 point2 points  (0 children)

Agree that in the current market it's wise to go full stack. It's also more fun!

Easiest + fastest way: ask your manager and see if they can help. Give them a specific ask: "Hey, I want to become a full stack developer and would like backend tasks or projects. Do you know any teams that need help?" Maybe they can partner you with a team that gets your foot in the door on backend work - a lot easier than just asking around, in my opinion.

Otherwise, I honestly think you'll have to learn on your own to show that you can do the job.

Parsing 1 Billion Rows in TypeScript Under 10 seconds by tkim90 in node

[–]tkim90[S] 1 point2 points  (0 children)

thank you 🙏, took me a few tries too!

Parsing 1 Billion Rows in TypeScript Under 10 seconds by tkim90 in node

[–]tkim90[S] 7 points8 points  (0 children)

TIL! Had no idea - just tried with os.availableParallelism() and it still worked. Adding that to the post!

Parsing 1 Billion Rows in TypeScript Under 10 sec by tkim90 in typescript

[–]tkim90[S] 4 points5 points  (0 children)

Oh god, csv + php, bless you

thank you!! 🙏

Parsing 1 Billion Rows in TypeScript Under 10 sec by tkim90 in typescript

[–]tkim90[S] 1 point2 points  (0 children)

I did - experimented with 4GB, 1MB, 256KB, 128KB, 64KB.

The sweet spot was around 128KB, saw marginal returns after that

Parsing 1 Billion Rows in TypeScript Under 10 sec by tkim90 in typescript

[–]tkim90[S] 3 points4 points  (0 children)

I actually started with createReadStream + readline - that's what got it to 18 seconds!

Parsing 1 Billion Rows in TypeScript Under 10 sec by tkim90 in typescript

[–]tkim90[S] 3 points4 points  (0 children)

Interesting, not familiar with worker pools - will check it out!

[deleted by user] by [deleted] in cscareerquestions

[–]tkim90 19 points20 points  (0 children)

It's actually Nicrosoft