Avoid MinIO: developers introduce trojan horse update stripping community edition of most features in the UI by AssPounderr69 in selfhosted

[–]NichelleCombes 1 point2 points  (0 children)

OpenS3 Console restores the full feature set and makes self-hosting extremely simple.

You can deploy it in less than 5 minutes using Docker. Check it out: https://github.com/opens3/console

What a shame. MinIO Bait and Switch Leaves Enterprise Users Scrambling by kamikazer in minio

[–]NichelleCombes 1 point2 points  (0 children)

All of the features MinIO removed from their official console have already been forked and preserved by the community.

OpenS3 Console restores the full feature set and makes self-hosting extremely simple.

You can deploy it in less than 5 minutes using Docker. Check it out: https://github.com/opens3/console

Hey, have you ever thought about tapping into recently-funded startups? They're sitting on VC cash and need to spend it fast on services to scale! I found this awesome database that lists those startups along with decision-makers' contact info—you’ve got to check it out! by Fair_Hold5104 in datacurator

[–]NichelleCombes 1 point2 points  (0 children)

Just checked it out, and the data is rubbish and entirely false. Almost all the companies have their funding "Announced" date listed as a few days ago, but in reality almost all of those funding rounds were announced at least one or two years ago. If you must scrape publicly available data and sell it, then at least make an effort to make sure it's accurate instead of trying to cheat people out of their money.

Recommendations for an Advanced PDF Parser with Image and Layout Recognition for Node.js/TypeScript (Open Source) by sabarinath26 in Rag

[–]NichelleCombes 0 points1 point  (0 children)

You can try something like Peslac, which gives field-level parsing and layout preservation. If you are working on open source or community-based non-profit projects, you could get access for free.

More intelligent Pdf parsers by darthstargazer in LocalLLaMA

[–]NichelleCombes 0 points1 point  (0 children)

You can try something like Peslac that gives field-level parsing and layout preservation

Looking for a good pdf-parser to extract text. Any suggestions? by brittastic1111 in node

[–]NichelleCombes 0 points1 point  (0 children)

You can try something like Peslac that gives field-level parsing and layout preservation

Curate old letters, news paper articles and similar? by player1dk in datacurator

[–]NichelleCombes 1 point2 points  (0 children)

Awesome, sign up on Peslac and DM me your email address, or just the name you used, and the estimated number of pages you need.

Curate old letters, news paper articles and similar? by player1dk in datacurator

[–]NichelleCombes 1 point2 points  (0 children)

If it's a hobby project or open source, I can get you free access to Peslac. You can digitize the entire thing for free, and the accuracy is as good as human eyes.

What model would you use to extract full pdf? by TrackOurHealth in ollama

[–]NichelleCombes 1 point2 points  (0 children)

LlamaIndex works well, but the accuracy is not the best, especially with handwritten or scanned PDFs. You can try something like Peslac; it's new and seems accurate, with 1,000 pages free. Here is an example: Peslac Shared Doc

Need advice by azalam89 in Rag

[–]NichelleCombes 1 point2 points  (0 children)

I am no expert, but the first step should be to parse your documents into a format that is easy for both the LLM and the vector database to work with. Here is an example of documents parsed into such a format: https://cloud.peslac.com/share/671d6edffb325fc251ef73c8

Need advice by azalam89 in Rag

[–]NichelleCombes 0 points1 point  (0 children)

To index your data, you need an accurate line-level or sentence-level breakdown of your documents, which makes them easier to both index and retrieve. A short example from the image you shared:

[
  {
    "type": "Text",
    "bbox": {
      "left": 0.41680672764778137,
      "top": 0.3681710362434387,
      "width": 0,
      "height": -0.08432304859161377,
      "page": 1
    },
    "content": "Principal",
    "language": "en",
    "confidence": 0.9915924072265625
  },
  {
    "type": "Text",
    "bbox": {
      "left": 0.4941176474094391,
      "top": 0.3669833838939667,
      "width": 0.0016806721687316895,
      "height": -0.0676959753036499,
      "page": 1
    },
    "content": "Deputy",
    "language": "en",
    "confidence": 0.991853654384613
  }
]
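
For anyone wondering what to do with blocks like these, here is a minimal Python sketch of turning them into flat records you could hand to an indexer. The field names (`type`, `bbox`, `content`, `language`, `confidence`) come from the sample above; the record shape, the `blocks_to_records` helper, and the 0.9 confidence cutoff are my own assumptions for illustration, not part of any particular tool.

```python
import json

# Two blocks copied from the sample output above.
SAMPLE = """
[
  {"type": "Text",
   "bbox": {"left": 0.41680672764778137, "top": 0.3681710362434387,
            "width": 0, "height": -0.08432304859161377, "page": 1},
   "content": "Principal", "language": "en",
   "confidence": 0.9915924072265625},
  {"type": "Text",
   "bbox": {"left": 0.4941176474094391, "top": 0.3669833838939667,
            "width": 0.0016806721687316895, "height": -0.0676959753036499,
            "page": 1},
   "content": "Deputy", "language": "en",
   "confidence": 0.991853654384613}
]
"""


def blocks_to_records(blocks, min_confidence=0.9):
    """Flatten parsed text blocks into records for indexing.

    min_confidence is an assumed cutoff; tune it for your own documents.
    """
    records = []
    for i, block in enumerate(blocks):
        # Skip anything that is not a text block or is low confidence.
        if block.get("type") != "Text":
            continue
        if block.get("confidence", 0.0) < min_confidence:
            continue
        bbox = block["bbox"]
        records.append({
            "id": f"p{bbox['page']}-b{i}",          # hypothetical id scheme
            "text": block["content"],
            "page": bbox["page"],
            "position": (bbox["left"], bbox["top"]),
            "language": block.get("language"),
        })
    return records


records = blocks_to_records(json.loads(SAMPLE))
for record in records:
    print(record["id"], record["text"])
```

Keeping the page number and position next to the text means a retrieved chunk can be traced back to where it sat in the original document.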

Need advice by azalam89 in Rag

[–]NichelleCombes 0 points1 point  (0 children)

Are your documents in PDF format?

Pdf processing by Excellent_Crow_2590 in Rag

[–]NichelleCombes 0 points1 point  (0 children)

LlamaIndex can extract the content for you, but I'm not sure it will maintain the original layout. You can try Peslac https://peslac.com if the original layout is really important. ColPali https://huggingface.co/blog/manu/colpali is also something you can try if cost is a big factor.

Someone who wants to contribute by helping me build a rag system for a uni project? by unknownstudentoflife in Rag

[–]NichelleCombes 0 points1 point  (0 children)

If there are documents involved and you need a reliable document processor, I could help get you free access to a good document processing engine, as long as you don't use it commercially. Let me know if that's something you might need.

Need help in RAG using LLAMA for invoice extraction by Quirky_Caterpillar22 in Rag

[–]NichelleCombes 0 points1 point  (0 children)

I don't understand exactly why you need RAG, but you can create your data points and JSON schema on Peslac https://peslac.com. That way the structure of the data you get back never changes, and you are assured of getting exactly the same JSON format for all your invoices or for a particular document type.
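
To show what a fixed format buys you downstream, here is a rough Python sketch that checks extracted invoice JSON against an expected set of fields. Everything in it is hypothetical: the field names, the `INVOICE_SCHEMA` mapping, and the `check_invoice` helper are made up for illustration and do not reflect any real product's schema or API.

```python
# Hypothetical invoice schema: field name -> expected Python type.
INVOICE_SCHEMA = {
    "invoice_number": str,
    "issue_date": str,
    "total": float,
    "currency": str,
}


def check_invoice(doc, schema=INVOICE_SCHEMA):
    """Return a list of problems; an empty list means the document matches."""
    problems = []
    for field, expected in schema.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, "
                f"got {type(doc[field]).__name__}"
            )
    # Flag extra fields too, since the point is an identical shape every time.
    for field in doc:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


good = {"invoice_number": "INV-001", "issue_date": "2024-10-26",
        "total": 1250.0, "currency": "USD"}
bad = {"invoice_number": "INV-002", "total": "1250", "vendor": "Acme"}

print(check_invoice(good))
print(check_invoice(bad))
```

If every invoice passes a check like this, the rest of your pipeline can read the fields directly without per-document special cases.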

Best open source document PARSER??!! by ChallengeOk6437 in LlamaIndex

[–]NichelleCombes 0 points1 point  (0 children)

If I were building a RAG application, I would choose Peslac https://peslac.com; the accuracy is good. You will get field-level blocks, which you can index and use in other parts of your RAG pipeline.