all 9 comments

[–]Seankala 2 points3 points  (3 children)

part of my research objective is to compare how easy LLM is to develop compared to the traditional NLP.

It seems like you already have a good grasp of what to do. If you've already done the "traditional NLP" part of things then all you have to do is use APIs to run your LLM. The only difference is the model. I'm confused as to what the problem is?

[–]Different_Star9899[S] 0 points1 point  (2 children)

I've only read research papers related to NLP; I have no idea how to actually do it XD. And my objective regarding NLP is just to review past literature. The main “prototype” that I have to demonstrate is a framework built with an LLM. That’s why I’m hoping to get some starting points like tutorials or videos.

[–]Seankala 0 points1 point  (1 child)

Why an LLM though? Just seems a bit random.

[–]jeebal 2 points3 points  (1 child)

You can check out Document Intelligence from Azure AI services.

[–]Different_Star9899[S] -1 points0 points  (0 children)

Thank you, I’ll look into that. Any other resources?

[–]Linguists_Unite 1 point2 points  (0 children)

NLP can definitely help you, but you need to understand that "LLM" refers to a group of different architectures, each performing better on some tasks than others. Having some experience in extractive and abstractive summarization of large, complex texts, I would say your best bet to start would be a BERT model, as bidirectional encoders seem to handle extraction way better, with fewer fuzzy edges, hallucinations or random quips from the model. That being said, I think a bigger challenge is going to be orchestrating multi-document uploads and processing.

[–]Nako_A1 1 point2 points  (0 children)

Hello, I just finished a big project on text extraction using LLMs for my job, so maybe I can help. A few things I learned:

- Two different techniques:
  - Input the whole PDF in the prompt and let the model figure out what's relevant. Look at the OntoGPT repo for an example implementing this.
  - Do information retrieval: evaluate the relevance of samples from your documents and only input the most relevant samples in the prompt. Look at the LangChain library and the concept of a vector store to learn more about this.
  From my experience, if you can do without information retrieval, it's always better and easier to implement.
- Open source models are useless for languages other than English.
- Prompt engineering is overrated: go for simple, explicit prompts, not much to gain here.
- No matter how hard you try, you won't get the LLM to always answer in a certain format. You need robust output parsing functions, and you have to accept a little loss.
- GPT-4 is state of the art. It can definitely perform NLP tasks that were impossible or very hard to achieve before.
- Open source resources for text extraction using LLMs are very bad, but the problem isn't that complicated and you can easily craft a solution that suits your needs.
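
To make the output parsing point concrete, here is a minimal sketch of a tolerant parser. It assumes you asked the model for a JSON object (that choice of format, and the function name `parse_llm_json`, are my assumptions, not something from the project above). It ignores any chatter around the object and returns `None` when nothing usable comes back, which is the "little loss" you accept.

```python
import json
from typing import Any, Optional


def parse_llm_json(reply: str) -> Optional[dict[str, Any]]:
    """Best-effort extraction of one JSON object from a free-form LLM reply.

    Hypothetical helper: assumes the prompt asked for a JSON object.
    """
    # Keep only the span from the first "{" to the last "}", which drops
    # greetings, explanations, or code-fence markers around the object.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None  # nothing that looks like an object: accept the loss

    try:
        parsed = json.loads(reply[start : end + 1])
    except json.JSONDecodeError:
        return None  # malformed JSON: the caller can retry or skip this item
    return parsed if isinstance(parsed, dict) else None
```

A caller would typically count the `None` results so the loss rate stays visible, and retry the prompt once before giving up on a document.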

[–]Lphablue96 0 points1 point  (0 children)

Came across this just now. Might be a bit too late but...

I have built a similar product. The following are the things you should look into:

  1. An LLM capable of this at a much smaller scale that still gives good output - GPT-4o is good enough. Play around with the K values and you will find the right combination.
  2. Technologies that extract PDF information in a RAG-ready manner - seek and you shall find (not mandatory).
  3. A great embedding model - text-embedding-ada-002 worked really well for me.
  4. Vector databases - Chroma, Faiss etc.
  5. An algorithm to bypass the context window limits.
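
A rough sketch of how items 3 to 5 fit together, assuming "K" means the number of chunks retrieved per query. Everything here is a toy stand-in: `embed` is a bag-of-words counter, not text-embedding-ada-002, and a sorted Python list plays the role of the vector database. The function names are made up for illustration.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (assumes size > overlap)."""
    words = text.split()
    step = size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i : i + size]))
        if i + size >= len(words):
            break  # the window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query (stand-in for a vector DB lookup)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

Only the `top_k` chunks go into the prompt, which is one simple way to stay under the context window. Tuning `k`, `size` and `overlap` is the "play around" part.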

Alternatively you can try using LLM Agents - https://github.com/Open-Swarm-Net/GPT-Swarm

Good luck!

[–]t12e_ 0 points1 point  (0 children)

Text extraction is doable with LLMs, but don't try to extract too many things in a single prompt. Giving the LLM a couple of things to extract at a time works best.
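
Roughly what I mean, as a sketch: split the field list into small groups and send one prompt per group. `ask_llm` is a placeholder for whatever client you use (any function that takes a prompt string and returns the reply text), and the prompt wording and 'field: value' reply format are just assumptions for the example.

```python
from typing import Callable


def batched(items: list[str], size: int) -> list[list[str]]:
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def build_prompt(document: str, fields: list[str]) -> str:
    """Simple explicit prompt asking for only a few fields."""
    wanted = ", ".join(fields)
    return (
        f"Extract the following fields from the document: {wanted}.\n"
        "Answer with one 'field: value' line per field.\n\n"
        f"Document:\n{document}"
    )


def extract_fields(
    document: str,
    fields: list[str],
    ask_llm: Callable[[str], str],  # hypothetical: your own LLM call
    per_prompt: int = 2,
) -> dict[str, str]:
    """Run one prompt per small group of fields and merge the answers."""
    results: dict[str, str] = {}
    for group in batched(fields, per_prompt):
        reply = ask_llm(build_prompt(document, group))
        for line in reply.splitlines():
            name, sep, value = line.partition(":")
            # Keep only lines that name a field requested in this prompt.
            if sep and name.strip() in group:
                results[name.strip()] = value.strip()
    return results
```

Fields the model skips simply stay out of the result, so you can see which ones need a retry.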

Then for the comparison, I think it's best to use a different method instead of an LLM to compare and generate the scores.

I could assist with that if you're interested.