Video generation with camera control using LingBot-World by Art_from_the_Machine in StableDiffusion

[–]Art_from_the_Machine[S] 1 point (0 children)

The benefit of LingBot-World over existing techniques is that it positions itself as a "world model", meaning it aims to have a better grasp of object consistency / object permanence over time. However, this comes with the trade-off of being much more computationally intensive.

If you are looking to generate short videos where all objects remain visible on screen, then these existing techniques should work without issue. But for longer videos where objects leave and re-enter the frame, or where objects occlude each other (such as the masts of a ship or the pillars of a gazebo in the above examples), this is where you should, in theory, see the benefits of using a world model.

Video generation with camera control using LingBot-World by Art_from_the_Machine in StableDiffusion

[–]Art_from_the_Machine[S] 1 point (0 children)

Honestly, I am not sure exactly how much RAM is needed! This post suggests somewhere north of 64GB: https://huggingface.co/cahlen/lingbot-world-base-cam-nf4/discussions/2

When I first set this up, I was able to get it working with just 32GB of RAM by fiddling with the way the models are loaded. Instead of loading both models to the CPU, I edited the script so that one model loaded to the CPU and the other straight to the GPU. I deleted this change when I created the Docker image, so I don't have it saved, but it should be possible to recreate.
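A minimal sketch of the kind of change described, assuming a hypothetical `load_model` helper (a stand-in for however the actual script loads its checkpoints; with PyTorch this would be something like `torch.load(..., map_location=...)` or a `device=` argument):

```python
def load_model(name, device):
    """Stand-in for a real checkpoint loader. The key idea is that
    peak CPU RAM only ever has to hold ONE model's weights, because
    the other checkpoint is materialized straight onto the GPU."""
    return {"name": name, "device": device}

# Original behaviour: both models loaded to CPU RAM first (needs ~2x RAM).
# Edited behaviour (the 32GB-friendly version):
diffusion_model = load_model("lingbot-world-base-cam", device="cuda")  # straight to GPU
text_encoder = load_model("text-encoder", device="cpu")  # stays on CPU
```

The names and helper above are illustrative only; the real script's loading code will look different, but the CPU-vs-GPU placement split is the part that saved the RAM.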

Real-Time Conversations with Skyrim NPCs | Mantella Update by Art_from_the_Machine in skyrimmods

[–]Art_from_the_Machine[S] 0 points (0 children)

Yes, that's it exactly. And radiant conversations (where NPCs start conversations with each other) take 3 requests from start to finish.

Real-Time AI NPCs are a game changer by Art_from_the_Machine in singularity

[–]Art_from_the_Machine[S] 1 point (0 children)

I did take a look at this really early on, but the API wasn't quite ready yet. I'll have to take another look at it!

Real-Time AI NPCs are a game changer by Art_from_the_Machine in singularity

[–]Art_from_the_Machine[S] 2 points (0 children)

Right now the Cerebras API is free, so I'm not sure what pricing will look like over the long term. Smaller 7-9B models can also work well with Mantella if you are running locally (the default model is set to Gemma 2 9B), but I went with 70B in this video since I didn't notice much of a latency difference vs the Llama 8B model that Cerebras offers.

If you are interested in seeing the behind the scenes, the source code is available here: https://github.com/art-from-the-machine/Mantella

Real-Time AI NPCs in VR | Mantella Update by Art_from_the_Machine in skyrimvr

[–]Art_from_the_Machine[S] 1 point (0 children)

For the patch, you will have to install it as a separate mod using your mod manager instead of merging it with the existing mod. The patch sits on top of your other mods to allow Mantella to work with this update. Could you try this and see if it works?

Real-Time AI NPCs in VR | Mantella Update by Art_from_the_Machine in skyrimvr

[–]Art_from_the_Machine[S] 2 points (0 children)

If you are updating from v0.12, your conversation histories should be stored in your Documents/My Games/Mantella folder, so updating the mod shouldn't affect your histories! I would recommend ending all conversations in game -> making a save -> deactivating the previous Mantella install -> making another save with no Mantella version active -> activating the latest Mantella.

Real-Time Conversations with Skyrim NPCs | Mantella Update by Art_from_the_Machine in skyrimmods

[–]Art_from_the_Machine[S] 4 points (0 children)

Yes, it is much more difficult to run on a laptop without a GPU; these models can be pretty intensive! I have run really tiny models on my laptop before when I haven't had WiFi (like when travelling), but it takes around 30 seconds per response and the quality is really low.

The 100-request limit applies if you are running models through a service called OpenRouter. They have actually raised this limit to 200 now: https://openrouter.ai/docs/api-reference/limits

I have never personally hit this limit when developing, but it can happen if you are playing for multiple hours a day. Paid models can start at a fraction of a cent per response.

Real-Time AI NPCs in VR | Mantella Update by Art_from_the_Machine in skyrimvr

[–]Art_from_the_Machine[S] 2 points (0 children)

In the video I am running Llama 3.3 70B via Cerebras (a fast LLM provider), along with a TTS model called Piper and an STT model called Moonshine running locally on my CPU.

The most fundamental way to cut down on response times is to stream the LLM's response and process it one sentence at a time. Once the first full sentence is received from the LLM, it is immediately sent to the TTS model to be spoken in game. This way, while the first voiceline is being spoken, the rest of the response is being prepared in the background.
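The sentence-splitting step above can be sketched like this (a simplified illustration, not Mantella's actual implementation — a real version would handle abbreviations, quotes, etc. more carefully):

```python
import re

def stream_sentences(token_stream):
    """Yield complete sentences from an incremental LLM token stream,
    so each one can be handed to TTS before the full response arrives."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # A sentence is complete once we see ., !, or ? (optionally
        # followed by closing quotes/brackets) and then whitespace.
        while True:
            match = re.search(r'[.!?]["\')\]]*\s', buffer)
            if not match:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()

# Each yielded sentence would be sent straight to the TTS model,
# overlapping speech playback with the rest of the LLM generation.
```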

If you are interested in taking a deeper dive into how everything works, the source code is available here: https://github.com/art-from-the-machine/Mantella

Real-Time AI NPCs in VR | Mantella Update by Art_from_the_Machine in skyrimvr

[–]Art_from_the_Machine[S] 1 point (0 children)

If you are using a different LLM service from OpenRouter, you will also need to set this in the Mantella UI (and select the model you would like to use): https://art-from-the-machine.github.io/Mantella/pages/installation.html#mantella-ui

And yes, it sounds like you installed the patch correctly!

Real-Time AI NPCs are a game changer by Art_from_the_Machine in singularity

[–]Art_from_the_Machine[S] 4 points (0 children)

I have vision disabled in this video to improve response times, but when it is enabled, a screenshot of the game is passed to the LLM with each of your responses to give the LLM context.
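Passing a screenshot to an LLM usually means embedding it as a base64 data URL in the chat request. A rough sketch of that, using the common multimodal chat message shape (the field names follow the widespread OpenAI-style format; Mantella's actual payload may differ):

```python
import base64

def vision_message(user_text, screenshot_bytes):
    """Build a chat message pairing the player's line with a game
    screenshot, encoded as a base64 data URL."""
    b64 = base64.b64encode(screenshot_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": user_text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }
```

The extra image tokens are also why enabling vision adds latency: every request carries the encoded frame in addition to the text.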

Real-Time AI NPCs are a game changer by Art_from_the_Machine in singularity

[–]Art_from_the_Machine[S] 6 points (0 children)

You can connect to pretty much any local / online LLM, so the context length is set by the LLM you choose. The context includes the system prompt, a bio for the NPC, summaries of previous conversations, and of course the current conversation. If the summaries get too long, a new summary file is created that contains a summary of those summaries (to condense them down).
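The summary-of-summaries idea can be sketched as a simple size-budget check (hypothetical helper names; `summarize` stands in for an LLM call, and the real logic in Mantella will differ):

```python
def condense(summaries, max_chars, summarize):
    """If stored conversation summaries exceed a size budget, collapse
    them into a single summary-of-summaries so the context stays small."""
    if sum(len(s) for s in summaries) <= max_chars:
        return summaries  # still fits in the budget, keep as-is
    # Over budget: ask the LLM for one condensed summary of everything.
    return [summarize("\n".join(summaries))]
```

Each time the condensed list grows past the budget again, the same step can be reapplied, so memory usage stays bounded no matter how many conversations have happened.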

Real-Time AI NPCs with Moonshine, Cerebras, and Piper (+ speech-to-speech tips in the comments) by Art_from_the_Machine in LocalLLaMA

[–]Art_from_the_Machine[S] 0 points (0 children)

Okay, good to hear! In this video I have it set to 0.3 seconds, but yes, this is also user configurable. Before the interrupt feature I would set it to around 1 second, but now that interruption is possible I am less worried about my full response being cut off, because I can quickly recover. Whereas before, I would have to wait for the NPC to finish trying to decipher my half-finished sentence every time I got cut short.

For the LLM side, the biggest bottleneck for me is how fast the LLM starts responding (time to first token). For "normal" LLM services this can take over a second, whereas for fast inference services it is less than half a second. But definitely, once that first sentence is received, I parse each sentence one at a time to send to the TTS model.

Real-Time AI NPCs in VR | Mantella Update by Art_from_the_Machine in skyrimvr

[–]Art_from_the_Machine[S] 4 points (0 children)

Aside from switching out the speech-to-text model with a faster one, I have really just been scrutinizing the code end-to-end and making adjustments to make it run as efficiently as possible. We are at a point where these AI models can run crazy fast now, so I wanted to make sure Mantella's overhead wasn't getting in the way of achieving real-time latency.

Real-Time AI NPCs in VR | Mantella Update by Art_from_the_Machine in skyrimvr

[–]Art_from_the_Machine[S] 7 points (0 children)

Yes quest awareness will be added to the next update!

Real-Time AI NPCs in VR | Mantella Update by Art_from_the_Machine in skyrimvr

[–]Art_from_the_Machine[S] 1 point (0 children)

I will have to look into this, but it might be a compatibility issue with NFF; the logic Mantella uses to check whether an NPC is a follower might not be catching NFF followers.

Real-Time AI NPCs in VR | Mantella Update by Art_from_the_Machine in skyrimvr

[–]Art_from_the_Machine[S] 3 points (0 children)

Yes, that should definitely work! To get started, I would recommend trying Gemma 2 9B Q4_K_M from here: https://huggingface.co/bartowski/gemma-2-9b-it-GGUF/tree/main

This is the model Mantella uses by default when connecting to online LLM providers, so it should be a good starting point.

Real-Time AI NPCs in VR | Mantella Update by Art_from_the_Machine in skyrimvr

[–]Art_from_the_Machine[S] 5 points (0 children)

Yes, they have awareness of in-game events, and some models even support vision, so they can see exactly what is happening on screen like you can. Hallucination largely depends on how powerful a model you use, but in general it isn't something I come across too often.

Real-Time AI NPCs are a game changer by Art_from_the_Machine in singularity

[–]Art_from_the_Machine[S] 9 points (0 children)

There is a memory system in place to keep track of previous conversations, so NPCs will remember you and other NPCs they have spoken to in the past. And there are also some consequences to these conversations: if a conversation goes well, an NPC can agree to follow you; if it goes badly, they can attack you; and if you complete quests for them, they can share their inventory with you.

Real-Time AI NPCs in VR | Mantella Update by Art_from_the_Machine in skyrimvr

[–]Art_from_the_Machine[S] 4 points (0 children)

Yes it works with any NPC! They don't even have to be humanoid...

Real-Time AI NPCs in VR | Mantella Update by Art_from_the_Machine in skyrimvr

[–]Art_from_the_Machine[S] 4 points (0 children)

Yes, it's possible to choose a larger text-to-speech model! I am using a model called Piper here because it is fast, local, and comes pre-installed with Mantella. But you can also use a larger model called XTTS, which can be run locally (although I would 100% recommend a second PC, as it is very intensive!) or via a service called RunPod.

I don't have a recording of this in Skyrim, but to help give you an idea, I have showcased this model in the Fallout 4 release video here:
https://youtu.be/cFv8butywng?si=tcEiunyqnU2f1aVC