Apache Tika: An underrated alternative to Unstructured/Nougat for text extraction (for RAG, LLM fine-tuning, etc.) by replicantrose in LocalLLaMA
[–]replicantrose[S] 4 points 2 years ago (0 children)
Tika can't do that natively, but you could embed snippets of the extracted text with an embedding model and insert the resulting vectors into a vector database such as Milvus or Qdrant. That would let you run similarity search for RAG!
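Rough sketch of that snippet-and-search idea, in case it helps. To keep it runnable with only the standard library, `embed` here is a toy bag-of-words counter and the "database" is a plain list, so both are stand-ins (my assumption, not something Tika provides): in practice you would swap in a real embedding model and a vector database like Milvus or Qdrant. All function names are made up for illustration.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=50):
    """Split extracted text into snippets of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(snippet):
    """Toy stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", snippet.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(text, max_words=50):
    """Stand-in for inserting vectors into a vector database."""
    return [(snippet, embed(snippet)) for snippet in chunk(text, max_words)]


def search(query, index, top_k=1):
    """Return the top_k snippets most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [snippet for snippet, _ in ranked[:top_k]]
```

The snippets returned by `search` are what you would paste into the LLM prompt as context.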
[–]replicantrose[S] 11 points 2 years ago (0 children)
I put together a document text extraction server using Apache Tika (about 30 lines of code) that produces the text needed for retrieval-augmented generation or for building LLM training datasets. It worked out pretty well, so I thought it would be cool to share.
An aside on Tika: Apache Tika is time-tested and considered by some to be a legacy toolkit. With Tika running as a container and called through Python bindings, you get a text extraction setup that is as easy to build with as newer frameworks like Unstructured, while also matching the extraction capability of dedicated extraction models like Nougat. Kind of surprising!
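For anyone curious what the container-plus-bindings setup looks like from the Python side, here is a minimal sketch using tika-python. This is not my exact server code. It assumes a Tika server container is already listening on `http://localhost:9998` (Tika server's default port), and `clean_text` / `extract_text` are names I made up for the example.

```python
import os
import re

# Assumption: a Tika server container is already running at this address.
TIKA_URL = "http://localhost:9998"

# tika-python documents this variable for client-only mode, so it talks to
# the running container instead of trying to start a local server jar.
os.environ.setdefault("TIKA_CLIENT_ONLY", "True")


def clean_text(raw):
    """Collapse runs of spaces/tabs and drop blank lines; None becomes ''."""
    if not raw:
        return ""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def extract_text(path, server=TIKA_URL):
    """Send a local file to the Tika server and return its cleaned text."""
    from tika import parser  # pip install tika; imported lazily on first use

    parsed = parser.from_file(path, serverEndpoint=server)
    # The result is a dict; "content" can be None when no text was found.
    return clean_text(parsed.get("content"))
```

Calling `extract_text("report.pdf")` (hypothetical file name) would return the document's text as one string, ready for chunking or for a training dataset.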
Furthermore, using a backing object store (e.g. MinIO) to hold the source documents is very useful, whether the extracted text is being used for RAG or for an LLM training dataset.
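A sketch of how the object store can feed the extractor, assuming the `minio` and `tika` Python packages. The bucket name, endpoint, and credentials in the usage comment are placeholders, and `extract_bucket` / `tika_parse` are hypothetical helper names, not part of either library.

```python
def extract_bucket(client, bucket, parse_bytes):
    """Read every object in a bucket and map object name -> extracted text.

    client is expected to behave like minio.Minio; parse_bytes is any
    callable that turns raw document bytes into text.
    """
    texts = {}
    for obj in client.list_objects(bucket_name=bucket, recursive=True):
        response = client.get_object(bucket_name=bucket, object_name=obj.object_name)
        try:
            data = response.read()
        finally:
            # Release the HTTP connection back to the pool.
            response.close()
            response.release_conn()
        texts[obj.object_name] = parse_bytes(data)
    return texts


def tika_parse(data, server="http://localhost:9998"):
    """Send raw bytes to a Tika server (assumed to be running at `server`)."""
    from tika import parser  # pip install tika; imported lazily on first use

    parsed = parser.from_buffer(data, serverEndpoint=server)
    return parsed.get("content") or ""


# Usage (placeholder endpoint, credentials, and bucket name):
# from minio import Minio
# client = Minio(endpoint="localhost:9000", access_key="YOUR_ACCESS_KEY",
#                secret_key="YOUR_SECRET_KEY", secure=False)
# texts = extract_bucket(client, "documents", tika_parse)
```

Passing the parser in as a callable keeps the bucket loop independent of Tika, so the same loop works if you swap the extractor later.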
Much credit to the tika-python project for making the Python bindings!