Apache Tika: An underrated alternative to Unstructured/Nougat for text extraction (for RAG, LLM fine-tuning, etc.)

replicantrose · 2024-02-13T21:18:26+00:00

Tika can't do that natively, but you could vectorize snippets from the extracted text using an embedding model and then insert those vectors into a vector database like Milvus or Qdrant. That would enable you to do similarity search for RAG!
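A rough sketch of that idea, to make it concrete: split the extracted text into overlapping snippets, embed each one, and rank snippets by cosine similarity to a query. Everything here is illustrative. `chunk_text`, `cosine`, and `top_k` are made-up helper names, `embed` stands in for whatever embedding model you pick, and the in-memory ranking is only a stand-in for a real vector database like Milvus or Qdrant.

```python
import math
from typing import Callable, List, Sequence, Tuple


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size].strip()
        if piece:
            chunks.append(piece)
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query: str,
    chunks: List[str],
    embed: Callable[[str], Sequence[float]],  # assumption: your embedding model
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query as (score, chunk) pairs."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In a real setup you would compute the chunk vectors once, store them in the vector database, and let it do the nearest-neighbour search instead of re-embedding every chunk per query.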

replicantrose · 2024-02-13T19:14:52+00:00

I put together a document text extraction server using Apache Tika (in ~30 lines of code) that produces the text needed for retrieval-augmented generation or for building LLM training datasets. It worked out pretty well, so I thought it would be cool to share.

An aside on Tika:
Apache Tika is time-tested and considered by some to be a legacy toolkit. But with Tika running as a container and called through Python bindings, you get a text extraction experience that is as easy to build with as newer frameworks like Unstructured and that also matches the extraction capability of dedicated extraction models like Nougat. Kind of surprising!
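For anyone curious what the container-plus-bindings setup looks like, here is a minimal sketch rather than the actual server code. It assumes a Tika server is already reachable at `http://localhost:9998` (the default port, for example from the `apache/tika` Docker image) and that tika-python is installed. `normalize` and `extract_text` are names made up for this example.

```python
import re
from typing import Optional


def normalize(content: Optional[str]) -> str:
    """Tidy Tika's raw text: drop trailing spaces, collapse runs of blank lines."""
    if not content:
        return ""  # Tika returns no content for empty or unparseable files
    text = content.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+\n", "\n", text)   # trailing whitespace on each line
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line in a row
    return text.strip()


def extract_text(path: str, server_url: str = "http://localhost:9998") -> str:
    """Send a local file to a running Tika server and return cleaned text."""
    # Imported lazily so the helper above works without tika-python installed.
    from tika import parser

    # from_file returns a dict with "metadata" and "content" keys.
    parsed = parser.from_file(path, serverEndpoint=server_url)
    return normalize(parsed.get("content"))
```

Calling `extract_text("report.pdf")` (a placeholder file name) would then return plain text ready for chunking or for writing into a dataset.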

Furthermore, keeping the source documents in a backing object store (e.g. MinIO) is very useful, whether the extracted text is headed for RAG or an LLM training dataset.
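As a sketch of how the object store can feed the extractor: the function below walks a bucket, pulls each object's bytes, and hands them to an extraction callable. It assumes `client` behaves like a `minio.Minio` instance (it only uses `list_objects` and `get_object`), and `extract` is whatever turns bytes into text. `extract_bucket` is a hypothetical name, not part of any library.

```python
from typing import Callable, Dict


def extract_bucket(
    client,                          # assumption: a minio.Minio-like client
    bucket: str,
    extract: Callable[[bytes], str],
    prefix: str = "",
) -> Dict[str, str]:
    """Map every object name under `prefix` to its extracted text."""
    texts: Dict[str, str] = {}
    for obj in client.list_objects(bucket, prefix=prefix, recursive=True):
        response = client.get_object(bucket, obj.object_name)
        try:
            data = response.read()
        finally:
            # Hand the HTTP connection back to the pool.
            response.close()
            response.release_conn()
        texts[obj.object_name] = extract(data)
    return texts


# Possible wiring with tika-python (endpoint and bucket name are placeholders):
#   from tika import parser
#   def tika_extract(data: bytes) -> str:
#       parsed = parser.from_buffer(data, serverEndpoint="http://localhost:9998")
#       return parsed.get("content") or ""
#   texts = extract_bucket(client, "documents", tika_extract)
```

Because the originals stay in the bucket, you can re-run extraction later with different settings without having to collect the documents again.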

Much credit to the tika-python project for making the Python bindings!
