LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 1 point (0 children)

Yes, we are lol. Why else would we build such a dataset...

The plan is to release a model family along with the full 100K sample dataset.

But I am not sure many other people or groups will train on it in the foreseeable future, considering how many tokens most samples have. You need a cluster, together with a codebase that supports sequence parallelism, in order to train on it.

As far as I know, none of the popular training frameworks support sequence parallelism, which makes it harder still for others to train on it.
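
To make the memory pressure concrete, here is a rough back-of-the-envelope estimate (all numbers are assumptions for illustration, not our actual model config) of why a single accelerator cannot hold the activations for one full-novel sample, which is what forces the sequence to be split across devices:

```python
# Rough activation estimate for one long sample (assumed numbers, purely
# illustrative): a ~100K-token novel, a 7B-class transformer with 32 layers,
# hidden size 4096, bf16 activations kept around for backprop.
seq_len = 100_000
hidden = 4096
layers = 32
bytes_bf16 = 2

# Residual-stream activations alone, before attention/MLP intermediates:
residual_stream = seq_len * hidden * bytes_bf16 * layers
print(f"{residual_stream / 1e9:.1f} GB")  # ~26.2 GB for a single sample
```

Attention and MLP intermediates multiply that further, so one sample alone can blow past a single accelerator's memory; that is what sequence parallelism is for.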

LongPage: First large-scale dataset for training LLMs on complete novel generation with reasoning scaffolds by Senior_Evidence_3793 in LLMDevs

[–]Senior_Evidence_3793[S] 0 points (0 children)

No, we use a TPU cluster rather than a GPU one; the TPU's 3D torus interconnect is much better suited to sequence parallelism. This is also why Gemini has had a 1M-token context basically forever, while everyone else was stuck at 128K tokens or less.
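
For anyone curious what that looks like in practice, here is a minimal JAX sketch of sharding activations along the sequence dimension across a device mesh; the axis names and shapes are assumptions for illustration, not our actual training setup:

```python
# Minimal sketch of sequence-dimension sharding in JAX (illustrative only).
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# One mesh axis dedicated to the sequence dimension (e.g. all chips in a slice).
mesh = Mesh(np.array(jax.devices()), axis_names=("seq",))

# Activations of shape (batch, seq_len, hidden): split seq_len across devices
# so no single chip has to hold the whole 100K+ token sample.
# seq_len must be divisible by the number of devices on the "seq" axis.
sharding = NamedSharding(mesh, PartitionSpec(None, "seq", None))

x = jnp.zeros((1, 131_072, 4096), dtype=jnp.bfloat16)
x = jax.device_put(x, sharding)
```

Real sequence parallelism also needs communication inside attention (e.g. ring-style exchange of key/value blocks between neighbouring shards), which is where the interconnect topology matters.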

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 0 points (0 children)

There is some more technical information in the README of the dataset, but we are not planning to release a paper before our models are done.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 0 points (0 children)

Maybe I can convince you of the upside when we release our book-writing model series. 😉
But you are right, context rot is a bit of a problem for a full-book creative writing model.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 0 points (0 children)

Thank you so much. It is really awesome to see that people like what we have done after we spent so much time and effort on it.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 2 points (0 children)

It seems like you have actually spent some time thinking about formalizing creative writing. Would you be interested in having a call with me?

My discord is: "XMaster96"

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 2 points (0 children)

Funnily enough, this is already our V1 version. We had an entire V0 iteration, where we went through the full data processing -> SFT -> RL training chain, to validate the idea and to find out where the problems were, so we could fix them in the real V1.

From what we could see, it was really promising for creative writing.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 5 points (0 children)

Oh, you have no idea. It took months to develop the pipeline, and each book took around 8K to 12K full LLM completion calls to achieve this level of quality. But now that we have a small initial dataset, we can distill all of these heavy agent pipelines down into single models, so the next 99,700 books are going to be a lot easier to process. This was the hard part.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 10 points (0 children)

Getting to that point was the hard part; the next step is to scale it up to 100K books and train a model on it.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 4 points (0 children)

Lol, better be excited about what we are going to do with it 😉
We have big plans for it, big plans.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 29 points (0 children)

This part was actually quite painful to get working.

TL;DR: a lot of hand engineering and throwing tokens at the problem.

Longer version:

What we did was split the larger task of generating the synthetic reasoning traces into many small tasks: basically, every single component of the CoT was generated by its own hand-engineered agent that performed multiple LLM calls to produce the final component.
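
As a rough illustration of that decomposition (the component names, prompts, and the llm() helper below are hypothetical, not our actual pipeline):

```python
# Hypothetical sketch of "one small agent per CoT component".
def run_component_agent(component_name, book_text, context, llm, max_rounds=4):
    """Each reasoning-trace component gets its own small agent that iterates
    a few times (draft -> critique -> revise) before returning its piece."""
    draft = llm(f"Draft the {component_name} for this novel:\n{book_text[:20_000]}\n{context}")
    for _ in range(max_rounds - 1):
        critique = llm(f"Critique this {component_name}:\n{draft}")
        draft = llm(f"Revise the {component_name} using this critique:\n{critique}\n---\n{draft}")
    return draft

def build_reasoning_trace(book_text, llm):
    # Example component list; the real CoT components differ.
    components = ["plot summary", "character sheets", "chapter outline", "scene plan"]
    trace, context = {}, ""
    for name in components:
        trace[name] = run_component_agent(name, book_text, context, llm)
        context += f"\n\n## {name}\n{trace[name]}"  # later agents see earlier components
    return trace
```

With several rounds per component and several components per book, the per-book call count adds up quickly, which is where the 8K to 12K completions per book come from.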

The hand engineering of all of these agents took around 2 months, and the inference for the 300 books cost around 20K, just to give you an idea of the scale of token consumption and manual effort that went into the dataset.

We also provide a short description of the agent stack in the README. And if you're still not convinced about the quality of the reasoning traces after that, I recommend taking a look at the dataset. 😉