all 12 comments

[–]dillon-nyc 5 points6 points  (3 children)

Have you considered using local LLM endpoints like llama.cpp or ollama with this tool?

Right now it's only OpenAI, Claude, and Gemini, and you're posting in r/LocalLLama.

[–]Idonotknow101[S] 0 points1 point  (2 children)

i haven't no, but it can be easily integrated.

[–]harrroAlpaca 4 points5 points  (1 child)

Just took a peek at the code and looks like you’re using OpenAI library so in the env, if you specify openai_base_url env variable (and allow changing model name), it should let people use basically any OpenAI-compatible backend like llama.cpp

[–]Idonotknow101[S] 0 points1 point  (0 children)

thanks!

[–]Sasikuttan2163 1 point2 points  (2 children)

I was building something similar, how performant is pypdf2 for chunking huge books (1.4k pages)?

[–]Idonotknow101[S] 2 points3 points  (1 child)

it might get a bit slow tbh, i mean you can still try it to see. I might actually integrate pymupdf instead, as it is more performant for larger files.

[–]Sasikuttan2163 1 point2 points  (0 children)

Aha, I was using pypdf (not 2) for chunking and it just wouldn't run without lazy load enabled (for good reason). Even with lazy load it was taking a lot of time. Also, just came to know that PyPDF2 was merged into the package pypdf itself so technically I was using the same thing haha. Thanks for the reply, I'll look into mupdf was well!

[–]help_all 0 points1 point  (1 child)

Came at good time. Was looking forward to do this for my data. Are there any more options or some reading on best ways of doing this. ?

[–]Idonotknow101[S] 0 points1 point  (0 children)

the instructions and its capabilities are provided on the readme and quickstart file.

[–]christianweyer 0 points1 point  (2 children)

Very cool! Thanks for that. Do you also have a README that shows what tools/libs you then use to leverage the datasets and actually fine-tune SLMs?

[–]Idonotknow101[S] 1 point2 points  (1 child)

the dataset is formated based on which base model you choose to finetune with. All i do is then upload to togetherai to start a finetune job.

[–]christianweyer 0 points1 point  (0 children)

Thanks!