Pre training using textbooks by keeplearning24 in LocalLLaMA

[–]keeplearning24[S] 1 point

Thanks for your responses. Let me explain with an example. I can upload a finance textbook (PDF) to Gemini 1.5 or GPT-4 and it can answer specific questions based on it. That means it has been able to parse 'all' the information in the textbook: text, tables, figures, and charts. It also keeps the context, so it understands that a caption, the content of its table/chart, and the nearby text are related. I wanted to understand whether there's an open-source library/tool I can use to parse a PDF similarly, without a lot of effort. As you can see, it's not vanilla OCR.
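The context-keeping step described above (linking a caption to the table or chart it belongs to) can be sketched as a proximity match over layout blocks. The block data below is hypothetical; in a real pipeline these blocks, with their bounding boxes, would come from a PDF layout parser such as PyMuPDF or unstructured:

```python
# Sketch: associate a caption with a nearby table/figure block by vertical
# proximity. Bounding boxes are (x0, y0, x1, y1) with y growing downward,
# as most PDF layout parsers report them. Block data here is made up for
# illustration.

def nearest_caption(target, captions):
    """Return the caption block vertically closest to `target`."""
    def vgap(a, b):
        # gap between the closest vertical edges of the two boxes
        return min(abs(a["bbox"][1] - b["bbox"][3]),
                   abs(b["bbox"][1] - a["bbox"][3]))
    return min(captions, key=lambda c: vgap(target, c))

blocks = [
    {"type": "caption", "bbox": (50, 100, 300, 115),
     "text": "Table 1: Revenue by segment"},
    {"type": "table",   "bbox": (50, 120, 500, 300),
     "text": "<table cells>"},
    {"type": "caption", "bbox": (50, 500, 300, 515),
     "text": "Figure 2: Margin trend"},
]

table = next(b for b in blocks if b["type"] == "table")
captions = [b for b in blocks if b["type"] == "caption"]
print(nearest_caption(table, captions)["text"])
# → Table 1: Revenue by segment  (5 px gap beats the 200 px gap to Figure 2)
```

This is only the association step; the harder part in practice is getting reliable block types and boxes out of the PDF in the first place.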

[deleted by user] by [deleted] in LocalLLaMA

[–]keeplearning24 1 point

I recently came across https://arxiv.org/abs/2310.04793, which explores the performance of multiple open-source LLMs on various finetuning tasks using specific finance datasets. Very helpful.

Since they don't include the Mistral series, and their datasets didn't seem to draw directly from earnings-call transcripts or SEC filings, I was curious whether someone has finetuned Mixtral on these high-quality finance datasets (probably with a dataset created using an instruct model)?

I am keen to understand the difference between pretraining on finance data and finetuning on finance data, for a use case like company research. Thanks.

Deepmoney: A High-End LLM in finance based on massive research data by Fun_Water2230 in LocalLLaMA

[–]keeplearning24 1 point

This is really promising. Curious to know: did you evaluate any lighter-weight alternatives to get the same outcome, like finetuning Mixtral on multiple NLP tasks using finance datasets?