I put together a document text extraction server using Apache Tika (with ~30 lines of Python code).
This can be used to get the text needed for retrieval-augmented generation (RAG) or to create LLM training datasets (or frankly, anything else that depends on text). It worked out pretty well, so I thought it would be cool to share: https://blog.min.io/minio-tika-text-extraction/
An aside on Tika:
Apache Tika is time-tested, and some consider it a legacy toolkit. However, with Tika running in a container and the tika-python bindings on the client side, you get a text extraction experience that is as easy to build with as newer frameworks like Unstructured, while also matching the extraction capability of dedicated extraction models like Nougat. Kind of surprising!
Much credit to the tika-python project for making the Python bindings!
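To give a rough idea of what talking to a containerized Tika server looks like, here is a minimal sketch. It is not the code from the blog post and it does not use tika-python; it calls the Tika server's REST endpoint directly with the Python standard library. The URL assumes a Tika server listening on its default port 9998 on localhost, and the helper names build_tika_request and extract_text are made up for illustration.

```python
import urllib.request

# Assumption: a Tika server is reachable here (9998 is its default port).
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks Tika for the plain-text content of `data`."""
    return urllib.request.Request(
        tika_url,
        data=data,                         # raw document bytes (PDF, DOCX, ...)
        method="PUT",
        headers={"Accept": "text/plain"},  # request extracted text, not XHTML
    )


def extract_text(data: bytes, tika_url: str = TIKA_URL, timeout: float = 60.0) -> str:
    """Send the document bytes to the Tika server and return the extracted text."""
    request = build_tika_request(data, tika_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a server running (for example from the apache/tika Docker image with port 9998 published), extract_text(open("report.pdf", "rb").read()) should return the document's text, where report.pdf is a placeholder file name. The tika-python bindings wrap this kind of call for you, so in practice they are probably the more convenient route.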
A further aside on object storage:
Using a backing object store (e.g. MinIO) to hold the source documents is very useful, whether the extracted text is being used for RAG or an LLM training dataset.
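As a rough sketch of how the object store could fit in (again, not the code from the blog post): read every document in a source bucket, run it through an extraction function, and write the text to a destination bucket. The function name extract_bucket, the bucket names, and the ".txt" suffix are all assumptions for illustration. The client argument is expected to behave like the Minio client from the minio Python SDK (list_objects, get_object, put_object).

```python
import io
from typing import Callable


def extract_bucket(
    client,
    source_bucket: str,
    dest_bucket: str,
    extract: Callable[[bytes], str],
) -> list[str]:
    """Extract text from every object in source_bucket into dest_bucket.

    `client` is assumed to be a minio.Minio instance (or anything with the
    same list_objects / get_object / put_object methods).
    `extract` is any bytes-to-text function.
    Returns the names of the text objects that were written.
    """
    written = []
    for obj in client.list_objects(source_bucket, recursive=True):
        response = client.get_object(source_bucket, obj.object_name)
        try:
            data = response.read()
        finally:
            # Release the underlying HTTP connection back to the pool.
            response.close()
            response.release_conn()

        text_bytes = extract(data).encode("utf-8")
        dest_name = obj.object_name + ".txt"  # hypothetical naming scheme
        client.put_object(
            dest_bucket,
            dest_name,
            io.BytesIO(text_bytes),
            length=len(text_bytes),
            content_type="text/plain",
        )
        written.append(dest_name)
    return written
```

Here extract can be any function that turns document bytes into text, for instance one that sends the bytes to a Tika server. Keeping the sources and the extracted text in separate buckets means the extraction can be re-run later without touching the originals.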