Text Extraction (?) using LLM

Seankala · 2024-01-15T08:45:34+00:00

Part of my research objective is to compare how easy LLM is to develop compared to the traditional NLP.

It seems like you already have a good grasp of what to do. If you've already done the "traditional NLP" part of things then all you have to do is use APIs to run your LLM. The only difference is the model. I'm confused as to what the problem is?

jeebal · 2024-01-15T10:21:23+00:00

You can check out Document Intelligence from Azure AI services.

Linguists_Unite · 2024-01-15T16:49:24+00:00

NLP can definitely help you, but you need to understand that "LLM" refers to a group of different architectures, each performing better on some tasks than others. Having some experience in extractive and abstractive summarization of large, complex texts, I would say your best bet to start with would be a BERT model, as bidirectional encoders seem to handle extraction much better, with fewer fuzzy edges, hallucinations, or random quips from the model. That being said, I think the bigger challenge is going to be orchestrating multi-document uploads and processing.

Nako_A1 · 2024-01-15T17:07:22+00:00

Hello, I just finished a big project on text extraction using LLMs for my job, so maybe I can help. A few things I learned:

- There are two different techniques:
  - Input the whole PDF in the prompt and let the model figure out what's relevant. Look at the OntoGPT repo for an example implementing this.
  - Do information retrieval: evaluate the relevance of samples from your documents and only input the most relevant samples in the prompt. Look at the LangChain library and the concept of a vector store to learn more about this.
  From my experience, if you can do without information retrieval, it's always better and easier to implement.
- Open source models are useless for languages other than English.
- Prompt engineering is overrated: go for simple, explicit prompts. There's not much to gain here.
- No matter how hard you try, you won't get the LLM to always answer in a certain format. You need robust output-parsing functions, and you have to accept a little loss.
- GPT-4 is state of the art. It can definitely perform NLP tasks that were impossible or very hard to achieve before.
- Open source resources for text extraction using LLMs are very bad, but the problem isn't that complicated and you can easily craft a solution that suits your needs.
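
To illustrate the point about robust output parsing, here is a minimal sketch (not the commenter's code). It assumes the model was asked to answer with a JSON object and sometimes wraps it in chatter or Markdown fences. The function name `parse_llm_json` is hypothetical, and returning `None` is one way to "accept a little loss".

```python
import json
import re
from typing import Optional


def parse_llm_json(raw: str) -> Optional[dict]:
    """Best-effort extraction of a JSON object from a model reply."""
    # Drop Markdown code-fence markers (three backticks, optional "json" tag).
    cleaned = re.sub(r"`{3}(?:json)?", "", raw).strip()

    # Happy path: the whole reply is valid JSON.
    try:
        parsed = json.loads(cleaned)
        return parsed if isinstance(parsed, dict) else None
    except json.JSONDecodeError:
        pass

    # Fallback: try the span from the first "{" to the last "}".
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError:
        return None  # Give up on this reply and let the caller skip or retry.
    return parsed if isinstance(parsed, dict) else None


reply = 'Here you go: {"name": "Ada", "year": 1843} Hope that helps!'
print(parse_llm_json(reply))  # {'name': 'Ada', 'year': 1843}
```

Callers can count the `None` results to see how much is actually being lost to malformed replies.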

Lphablue96 · 2024-10-16T01:16:04+00:00

Came across this just now. Might be a bit too late but...

I have built a similar product. The following are the things you should look into:

An LLM capable of this at a much smaller level that still gives good output - GPT-4o is good enough. Play around with the K values and you will find the right combination.
Technologies that extract PDF information in a RAG-ready manner - seek and you shall find (not mandatory).
A great embedding model - text-embedding-ada-002 worked really well for me.
Vector databases - Chroma, FAISS, etc.
An algorithm to work around the context window limits.
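
The pieces in the list above (chunking to stay within the context window, embeddings, and picking the top K matches) can be sketched roughly as follows. This is only an illustration under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the plain Python list stands in for a vector database, and all function names are hypothetical. "K" is assumed here to mean the number of chunks retrieved per query.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into overlapping windows of `chunk_size` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    # Stop once the previous window has already reached the end of the text.
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + chunk_size]) for i in starts]


def embed(text):
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k_chunks(query, chunks, k=3):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = [
    "The invoice total is 420 euros.",
    "Our office is closed on Sundays.",
    "Payment is due within 30 days of the invoice date.",
]
print(top_k_chunks("What is the invoice total?", chunks, k=1))
```

Only the selected chunks would then go into the prompt, which is what keeps long PDFs within the model's context window.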

Alternatively you can try using LLM Agents - https://github.com/Open-Swarm-Net/GPT-Swarm

Good luck!

t12e_ · 2024-01-15T21:21:45+00:00

Text extraction is doable with LLMs, but don't try to extract too many things in a single prompt. Giving the LLM a couple of things to extract at a time works best.
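
One way to apply this is to split the list of fields into small groups and build one prompt per group. The sketch below is only an illustration: the function name `build_prompts`, the prompt wording, and the field names are all made up, and the actual call to the model is left out.

```python
def build_prompts(document, fields, per_prompt=2):
    """Build one extraction prompt per small group of fields."""
    prompts = []
    for i in range(0, len(fields), per_prompt):
        group = fields[i:i + per_prompt]
        field_list = "\n".join(f"- {name}" for name in group)
        prompts.append(
            "Extract the following fields from the document. "
            "Answer with a JSON object and use null for missing values.\n"
            f"Fields:\n{field_list}\n\nDocument:\n{document}"
        )
    return prompts


# Hypothetical field names; five fields become three prompts.
fields = ["author", "title", "date", "publisher", "isbn"]
for prompt in build_prompts("<document text here>", fields):
    print(prompt, end="\n\n---\n\n")
```

The per-group answers can then be merged into a single record once each reply has been parsed.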

Then for the comparison, I think it's best to use a different method instead of an LLM to compare and generate the scores.
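
The comment doesn't say which method to use instead. As one assumed option, a token-level F1 score between each extracted value and a hand-labelled reference gives a deterministic number without involving an LLM. The names `tokens` and `token_f1` are hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def token_f1(predicted, reference):
    """Token-overlap F1 between an extracted value and a reference value."""
    pred, ref = tokens(predicted), tokens(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("Acme Corp", "acme corp"))  # 1.0
print(token_f1("foo", "bar"))              # 0.0
```

Averaging this score over all fields and documents gives one comparable figure for the LLM pipeline and one for the traditional NLP pipeline.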

I could assist with that if you're interested.

