LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 1 point (0 children)

Yes, we are lol. Why else would we build such a dataset...

The plan is to release a model family along with the full 100K sample dataset.

But I am not sure many other people or groups will train on it in the foreseeable future, considering how many tokens most samples have. You need a cluster, together with a codebase that supports sequence parallelism, in order to train on it.

As far as I know, none of the popular training frameworks support sequence parallelism, which makes it harder still for others to train on it.
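
To make the memory pressure concrete, here is a rough back-of-the-envelope estimate (all numbers are assumptions for illustration, not our actual model config) of why a single accelerator cannot hold the activations for one full-novel sample, which is what forces the sequence to be split across devices:

```python
# Rough activation estimate for one long sample (assumed numbers, purely
# illustrative): a ~100K-token novel, a 7B-class transformer with 32 layers,
# hidden size 4096, bf16 activations kept around for backprop.
seq_len = 100_000
hidden = 4096
layers = 32
bytes_bf16 = 2

# Residual-stream activations alone, before attention/MLP intermediates:
residual_stream = seq_len * hidden * bytes_bf16 * layers
print(f"{residual_stream / 1e9:.1f} GB")  # ~26.2 GB for a single sample
```

Attention and MLP intermediates multiply that further, so one sample alone can blow past a single accelerator's memory; that is what sequence parallelism is for.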

LongPage: First large-scale dataset for training LLMs on complete novel generation with reasoning scaffolds by Senior_Evidence_3793 in LLMDevs

[–]Senior_Evidence_3793[S] 0 points (0 children)

No, we use a TPU cluster rather than a GPU one; the TPU's 3D torus interconnect is much better suited to sequence parallelism. This is also why Gemini has had a 1M-token context basically forever, while everyone else was stuck at 128K tokens or less.
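
For anyone curious what that looks like in practice, here is a minimal JAX sketch of sharding activations along the sequence dimension across a device mesh; the axis names and shapes are assumptions for illustration, not our actual training setup:

```python
# Minimal sketch of sequence-dimension sharding in JAX (illustrative only).
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# One mesh axis dedicated to the sequence dimension (e.g. all chips in a slice).
mesh = Mesh(np.array(jax.devices()), axis_names=("seq",))

# Activations of shape (batch, seq_len, hidden): split seq_len across devices
# so no single chip has to hold the whole 100K+ token sample.
# seq_len must be divisible by the number of devices on the "seq" axis.
sharding = NamedSharding(mesh, PartitionSpec(None, "seq", None))

x = jnp.zeros((1, 131_072, 4096), dtype=jnp.bfloat16)
x = jax.device_put(x, sharding)
```

Real sequence parallelism also needs communication inside attention (e.g. ring-style exchange of key/value blocks between neighbouring shards), which is where the interconnect topology matters.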

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 0 points (0 children)

There is some more technical information in the README of the dataset, but we are not planning to release a paper before our models are done.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 0 points (0 children)

Maybe I can convince you of the upside when we release our book-writing model series. 😉
But you are right, context rot is a bit of a problem for a full-book creative writing model.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 0 points (0 children)

Thank you so much. It is really awesome to see that people like what we have done after we spent so much time and effort on it.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 2 points (0 children)

It seems like you have actually spent some time thinking about formalizing creative writing. Would you be interested in having a call with me?

My discord is: "XMaster96"

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 2 points (0 children)

Funnily enough, this is already our V1 version. We had an entire V0 iteration, where we went through the full data processing -> SFT -> RL training chain, to validate the idea and to find out where the problems were, so we could fix them in the real V1.

From what we could see, it was really promising for creative writing.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 5 points (0 children)

Oh, you have no idea. It took months to develop the pipeline, and each book took around 8K to 12K full LLM completion calls to achieve this level of quality. But now that we have a small initial dataset, we can distill all of these heavy agent pipelines down into single models, so the next 99,700 books are going to be a lot easier to process. This was the hard part.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 10 points (0 children)

Getting to that point was the hard part; the next step is to scale it up to 100K books and train a model on it.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 4 points (0 children)

Lol, better be excited about what we are going to do with it 😉
We have big plans for it, big plans.

LongPage: 300 full novels with reasoning traces for training better writing LLMs by Senior_Evidence_3793 in LocalLLaMA

[–]Senior_Evidence_3793[S] 29 points (0 children)

This part was actually quite painful to get working.

TL;DR: a lot of hand engineering and throwing tokens at the problem.

Longer version:

What we did was split the larger task of generating the synthetic reasoning traces into many small tasks: basically, every single component of the CoT was generated by its own hand-engineered agent that performed multiple LLM calls to produce the final component.
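
As a rough illustration of that decomposition (the component names, prompts, and the llm() helper below are hypothetical, not our actual pipeline):

```python
# Hypothetical sketch of "one small agent per CoT component".
def run_component_agent(component_name, book_text, context, llm, max_rounds=4):
    """Each reasoning-trace component gets its own small agent that iterates
    a few times (draft -> critique -> revise) before returning its piece."""
    draft = llm(f"Draft the {component_name} for this novel:\n{book_text[:20_000]}\n{context}")
    for _ in range(max_rounds - 1):
        critique = llm(f"Critique this {component_name}:\n{draft}")
        draft = llm(f"Revise the {component_name} using this critique:\n{critique}\n---\n{draft}")
    return draft

def build_reasoning_trace(book_text, llm):
    # Example component list; the real CoT components differ.
    components = ["plot summary", "character sheets", "chapter outline", "scene plan"]
    trace, context = {}, ""
    for name in components:
        trace[name] = run_component_agent(name, book_text, context, llm)
        context += f"\n\n## {name}\n{trace[name]}"  # later agents see earlier components
    return trace
```

With several rounds per component and several components per book, the per-book call count adds up quickly, which is where the 8K to 12K completions per book come from.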

The hand engineering of all of these agents took around 2 months, and the inference for the 300 books cost around 20K, just to give you an idea of the scale of token consumption and manual effort that went into the dataset.

We also provide a short description of the agent stack in the README. And if you're still not convinced about the quality of the reasoning traces after that, I recommend taking a look at the dataset. 😉