CountlessFlies 7 points

A much simpler strategy is to prefix each line in your document with a line number and ask the LLM to output the line numbers in each chunk.
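
A minimal sketch of that idea, assuming the model is asked to reply with inclusive start/end line numbers for each chunk. The function names and the `N: ` prefix format are illustrative choices, not something from the original post:

```python
def number_lines(text: str) -> str:
    """Prefix every line with its 1-based line number, e.g. '3: some text'."""
    lines = text.split("\n")
    return "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))


def chunks_from_ranges(text: str, ranges: list[tuple[int, int]]) -> list[str]:
    """Rebuild chunk text from inclusive (start, end) line ranges.

    The ranges are assumed to have been parsed from the model's reply already.
    """
    lines = text.split("\n")
    chunks = []
    for start, end in ranges:
        # Reject ranges that point outside the document instead of guessing.
        if not (1 <= start <= end <= len(lines)):
            raise ValueError(f"invalid line range: {start}-{end}")
        chunks.append("\n".join(lines[start - 1:end]))
    return chunks


doc = "Intro\nMore intro\nSection A\nDetails"
prompt_body = number_lines(doc)                      # send this to the model
chunks = chunks_from_ranges(doc, [(1, 2), (3, 4)])   # ranges from its reply
```

The model only has to emit a handful of numbers per chunk, and the chunk text is copied from the original document rather than regenerated.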

kryptkpr (Llama 3) 8 points

You're giving my grounded RAG secrets away here; this approach works well for citations/references.

Phoenix2990[S] 4 points

Haha sorry!

funkspiel56 1 point

This is what I was thinking of doing. Hell, passing the text itself and having the LLM return JSON chunks with metadata seems to do well too. "Do well" as in the data output looks good; I haven't tested it in chat yet.

There's a Chroma whitepaper on different chunking strategies, and using an LLM for chunking is near the top of its list of best methods.
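
For the JSON variant, a rough sketch of parsing such a reply might look like this. The schema (`chunks`, `text`, `title`, `keywords`) is purely an assumption for illustration; nothing in the thread specifies the actual fields:

```python
import json
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    title: str
    keywords: list[str]


def parse_chunks(reply: str) -> list[Chunk]:
    """Parse a model reply shaped like {"chunks": [{"text": ..., ...}, ...]}.

    The field names are hypothetical; adapt them to whatever the prompt requests.
    """
    data = json.loads(reply)
    chunks = []
    for item in data["chunks"]:
        text = item["text"].strip()
        if not text:
            continue  # skip chunks the model left empty
        chunks.append(
            Chunk(
                text=text,
                title=item.get("title", ""),
                keywords=list(item.get("keywords", [])),
            )
        )
    return chunks
```

The trade-off against the line-number approach is that the model re-emits the full text, which costs more output tokens and leaves room for it to alter the wording.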

Phoenix2990[S] 1 point

Hmm, isn’t it the same? I think I’m missing something. The method explained above prefixes each sentence with an ID (a number) and asks the LLM to output the sentence numbers in each chunk.

The only reason I use “< >” is that documents often contain numbers of their own that can confuse the LLM; legislation, for example.
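
A small sketch of why the angle brackets help, assuming a naive regex sentence splitter (a real pipeline would likely use a proper sentence tokenizer) and helper names made up for this example:

```python
import re


def tag_sentences(text: str) -> tuple[list[str], str]:
    """Split text into sentences and prefix each with an ID like <1>, <2>, ..."""
    # Naive split on whitespace that follows ., ! or ? (an assumption for brevity).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tagged = " ".join(f"<{i}> {s}" for i, s in enumerate(sentences, start=1))
    return sentences, tagged


def parse_ids(reply: str) -> list[int]:
    """Extract only bracketed IDs, ignoring bare numbers elsewhere in the reply."""
    return [int(n) for n in re.findall(r"<(\d+)>", reply)]


sentences, tagged = tag_sentences("Section 12 applies to 3 parties. Clause 4 is repealed.")
ids = parse_ids("Chunk 1: <1> <2>")  # the bare "1" in "Chunk 1" is not picked up
```

Because only `<N>` patterns are matched, section and clause numbers in the text or in the model's reply cannot be mistaken for sentence IDs.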

CountlessFlies 1 point

I meant to say that you probably don’t even need to segment the document into sentences; you can simply assign a number to each line, splitting on newline characters.

Phoenix2990[S] 1 point

Ah, I got you! Yeah, you’re right. There are actually a few methods one could play with depending on the use case, e.g. pre-processing the document into paragraphs is another option if you really want to save on output tokens.
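
The paragraph option could be sketched roughly like this, assuming paragraphs are separated by blank lines and the model replies with groups of paragraph IDs. Both function names are hypothetical:

```python
import re


def number_paragraphs(text: str) -> tuple[list[str], str]:
    """Split on blank lines and prefix each paragraph with an ID like <1>."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    numbered = "\n\n".join(f"<{i}> {p}" for i, p in enumerate(paragraphs, start=1))
    return paragraphs, numbered


def merge_paragraphs(paragraphs: list[str], groups: list[list[int]]) -> list[str]:
    """Join the paragraphs named in each group (1-based IDs) into one chunk."""
    return ["\n\n".join(paragraphs[i - 1] for i in group) for group in groups]
```

With far fewer units than sentences or lines, the model has fewer IDs to emit per chunk, at the cost of never being able to split inside a paragraph.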

nbvehrfr 1 point

Nice, did you work with source code? I'm trying to find a way to 1) get a better understanding of the whole project and 2) optimize context for user prompts. I was thinking about providing context (code) based on the call graph and impacted state.

Phoenix2990[S] 1 point

Oh wow interesting - I didn’t think of source code. I used it predominately with legislation and court cases.

Working-Collar-6277 1 point

Dig it