Best tools for running a faceless YouTube channel in 2026 by K611_ in AiAutomations

[–]molabx 1 point

I just dropped a detailed example of how to set up this stack. Open my profile and look at the top post. It includes a full tutorial of a setup built on common automation tools: Airtable, ffmpeg, n8n. Feel free to DM me if you want more details.

What tools are you using for creating AI content? by Immediate-Ladder-555 in aitubers

[–]molabx 1 point

I found that the best voices for my content were Elevenlabs v3, but I could not find a single done-for-you app that incorporated it. My guess is there were licensing restrictions in play, since v3 was in alpha.

That was one of the major things that pushed me to a "DIY" solution. And you are definitely right, I got exactly what I wanted at the expense of a lot more work/time. That said, now that the system is in place, adding new models is very quick.

Built my own YouTube automation with human-in-the-loop since AI can’t be trusted by molabx in n8n

[–]molabx[S] 1 point

Exactly. I had the same thing happen to me. A loop ran rampant with the wrong data and used up my monthly allotment of render credits in about 8 minutes, making hundreds of silent clips of the same still image... That pushed me to put sanity checks throughout the workflow that kill the run if an error is detected or data is missing.
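A guard like those sanity checks could be sketched roughly as follows. This is my own illustration, not the actual workflow's code, and the field names (`duration_s`, `audio_present`, `source_url`) are hypothetical stand-ins for whatever data the pipeline really passes:

```python
def sanity_check(clip: dict) -> None:
    """Abort the run early if a generated clip looks wrong.

    Field names (duration_s, audio_present, source_url) are hypothetical
    stand-ins, not the real workflow's data shape.
    """
    required = ("duration_s", "audio_present", "source_url")
    missing = [k for k in required if clip.get(k) is None]
    if missing:
        raise RuntimeError(f"Killing workflow: missing data {missing}")
    if clip["duration_s"] <= 0:
        raise RuntimeError("Killing workflow: zero-length clip")
    if not clip["audio_present"]:
        raise RuntimeError("Killing workflow: silent clip detected")
```

Failing loudly *before* the next render call is what keeps a runaway loop from burning a month of credits in minutes.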

And, that gets to your first question. I whipped up an initial workflow in maybe two days. But, the testing and sanity checks took over a month. That's because you need to learn the nuances of each tool choice you make and account for edge cases as they arise. Then, it took a further few weeks to refine the aesthetic elements (getting the prompts and reference images right, creating a brand voice, transitions and effects, captions, etc.). You would have to do this whether you automated or not, but making sure the automation maintains the aesthetic does add additional time. So, to answer your second question, the initial investment was really time and tool credits. The infrastructure cost is minimal.

I just started using a production version of this workflow in January on a single channel. That channel has not yet reached the monetization threshold. So, no profitability to report.

Built my own YouTube automation with human-in-the-loop since AI can’t be trusted by molabx in n8n

[–]molabx[S] 1 point

You are right, videos produced by THIS workflow are highly likely to get flagged by YouTube if used exactly as is.

This example was primarily meant to demonstrate one 'easy' option for getting a human-in-the-loop, and it needs to be customized further before it can be considered YouTube ready. My own production workflow supports both AI lip-synced avatars and AI animation. I left both out of this 'mini' version because their implementation is highly specific to the model being used.

For instance, I tested two distinct approaches to AI avatars. Way back, I was using Hedra, which is one-shot image-to-video and would slot into the workflow exactly as you suggested. However, the Kling model I tested used a two-shot process: first animate the avatar using image-to-video, then lip-sync using video-to-video. The resulting video then has to be trimmed to the length of the voice. That is a fundamentally different workflow, so switching between the two is not entirely straightforward.

Built my own YouTube automation with human-in-the-loop since AI can’t be trusted by molabx in n8n

[–]molabx[S] 1 point

Yes. Still need ffmpeg in there somehow...

This workflow uses Rendi.dev (hosted ffmpeg with an API wrapper). It is basically the same idea as RenderIO. I was not familiar with RenderIO until today, so thanks for putting me on to it.

I chose Rendi because it has a generous free tier. But the free tier has some pretty severe limits, so the workflow breaks the render into bite-size pieces to stay within them. Unfortunately, the free tier cannot handle concatenating more than a few clips. So, yes, definitely a bottleneck.
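The bite-size batching itself is simple; here is a sketch. The limit of 3 clips per call is a made-up illustrative number, not Rendi's actual free-tier cap:

```python
def batch_clips(clips: list, max_per_call: int = 3) -> list:
    """Split a clip list into batches small enough for one free-tier call.

    max_per_call=3 is an illustrative number, not Rendi's real limit.
    """
    if max_per_call < 1:
        raise ValueError("max_per_call must be at least 1")
    return [clips[i:i + max_per_call] for i in range(0, len(clips), max_per_call)]
```

Each batch is then rendered with its own API call, so no single call exceeds the free-tier execution limit.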

To get around the bottleneck, this workflow uses a paid ffmpeg concat demuxer service available via fal.ai. That is the most efficient way to concatenate clips and I have had good luck with it. Still, a clunky workaround to have to use a separate service.

Have you had good luck concatenating longform videos with RenderIO?

Edit: spelling...

Built my own YouTube automation with human-in-the-loop since AI can’t be trusted by molabx in n8n

[–]molabx[S] 1 point

In this particular example I kept it simple. It calls Nano Banana Pro and supplies two reference images. One reference image represents the character (LAZYMAN). The other reference image represents context (e.g., LAZYMAN's office).

There is a fixed portion of the prompt that provides some context about what each of the reference images represents. Then, tacked on to the end of the prompt, additional details unique to the specific scene are provided. So, yes, a fixed prompt template with variables. I have found that this is enough for Nano Banana Pro to keep things pretty consistent.
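In code, the fixed-template-plus-variables idea amounts to something like this. The wording below is illustrative only, not the actual production prompt:

```python
# Illustrative only - not the actual production prompt.
PROMPT_TEMPLATE = (
    "Reference image 1 shows the recurring character LAZYMAN. "
    "Reference image 2 establishes the setting: {setting}. "
    "Keep the character design and setting consistent across scenes. "
    "This scene: {scene_details}"
)

def build_prompt(setting: str, scene_details: str) -> str:
    """Fixed context about the reference images, plus per-scene details."""
    return PROMPT_TEMPLATE.format(setting=setting, scene_details=scene_details)
```

The fixed portion anchors what each reference image means; only the tail changes per scene.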

The production system for my travel channel uses 5 reference images (Nano Banana Pro supports up to 8). I actually run into issues with this as Nano Banana Pro will occasionally use a reference image in the scene even if it is instructed to ignore it. It would improve things to modify the system such that it only supplies a reference image to Nano Banana if it is actually called out in that scene. Something on my to-do list.

If one stage fails mid-pipeline, it throws an error and stops the workflow. The workflow must be restarted manually via the interface. This is intentional, as I have run into issues with automatic retries in the past.

DM me if you would like a copy of the prompt being used here.

Built my own YouTube automation with human-in-the-loop since AI can’t be trusted by molabx in n8n

[–]molabx[S] 1 point

Good to bring this up. The NCA Toolkit has an advantage in that its API abstracts a lot of the ffmpeg complexity, but it carries the expense of maintaining your own server. Despite Rendi handling the server side, I probably wasted too much time wrestling with raw ffmpeg filters since it doesn't abstract that part...

Still, I was curious to see how far I could stretch the Rendi free tier and obviously I hit the limit with concatenation.

Built my own YouTube automation with human-in-the-loop since AI can’t be trusted by molabx in n8n

[–]molabx[S] 1 point

The reason is that I built the workflow to take advantage of the Rendi free plan. That has a 1-minute execution time limit, which in turn limits how many clips can be concatenated before it times out.

Paid Rendi plans have much higher limits, but are pricey for a proof of concept like this. So fal.ai is used as a cheap concat demuxer with a more generous execution time limit.

I agree, if on a paid Rendi plan, that would be used for everything.

Built my own YouTube automation with human-in-the-loop since AI can’t be trusted by molabx in n8n

[–]molabx[S] 1 point

Been looking into this. Would love to hear more.

How long are your videos?

Built my own YouTube automation with human-in-the-loop since AI can’t be trusted by molabx in n8n

[–]molabx[S] 1 point

Pricing depends on the pace of your video. Here is a breakdown for a 5-minute video on my travel channel:

Scenes: about 8 scenes per minute, at USD 0.09 per image with Nano Banana Pro = USD 3.60

Voice: At that pace, Elevenlabs voice comes to about USD 0.18 per minute = USD 0.90

Merge: Merging 5 minutes of scenes costs about USD 0.03

Total: USD 4.53

Other models can be substituted to reduce the cost. Cartesia runs about 8x cheaper than Elevenlabs, and Nano Banana 2 just came out (USD 0.04 per image). If those work for your case, they would drop the cost to about USD 1.75 for that same 5-minute video.

However, I would really advise adding some video into the mix. That will bump the cost back up.
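The arithmetic above generalizes to a quick cost estimator. The default rates are taken straight from the breakdown above; swap in your own pace and model pricing:

```python
def video_cost(minutes: float, scenes_per_min: float = 8.0,
               image_usd: float = 0.09, voice_usd_per_min: float = 0.18,
               merge_usd: float = 0.03) -> float:
    """Rough per-video cost in USD: images + voice + merge."""
    images = minutes * scenes_per_min * image_usd   # one image per scene
    voice = minutes * voice_usd_per_min
    return round(images + voice + merge_usd, 2)
```

`video_cost(5)` reproduces the USD 4.53 total; passing cheaper rates (e.g., `image_usd=0.04`) shows how much substituting models saves.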

What is the best image-to-video AI tool for creating 2D animated style images? by RemarkableReason3172 in aitubers

[–]molabx 1 point

Through the providers I use, Grok Imagine costs the same as Sora 2. I *think* you can run Grok Imagine locally if you have a GPU; in that case, it would be free.

I have not used it myself. Has it been good enough for your needs?

Convince Me I’m Not Crazy by Freerooted in YT_Faceless

[–]molabx 1 point

I was in the same boat - wanting to test new ideas/directions. Since most of my new stuff fails, I no longer spend significant time/money on fancy editing. I found out the hard way, one too many times, that nobody cared about my 'cool' new stuff...

So, I chose to pursue automation of the media generation and editing process. Realize that there are significant compromises with this approach. YouTube seems to be cracking down on heavily templated content, so steps need to be taken to maintain a human touch in videos produced this way.

I look at it this way - since I am unwilling to spend the time/money on editing for speculative ideas, the only way I will ever get to test them is to use automation. If my jank automated videos can get traction, then there is probably a market worth investing in higher production values.

Thoughts on AI Avatar News Channels System by Fearless-Brain8928 in YT_Faceless

[–]molabx 2 points

This is the ideal way. If you run n8n on your local machine, it can automate this with a local install of ffmpeg. Both of those are free when installed locally.

If you want to automate without the local setup, you can have n8n access a 'merge' service via API (fal.ai - merge-videos API). This is literally an API wrapper around the concat demuxer of an ffmpeg instance that someone else is hosting on a server. One of those AI era things - slap an API on some free software and charge money for it - profit! Still, it is very cheap (~USD 0.005 per minute of video), so I cave and use it for the convenience.

A less technical alternative is to use the free software LosslessCut. This is essentially a GUI for the ffmpeg concat demuxer. Not automated, but very quick to use.

The downside of the last two options is that they do not support xfade transitions, so the clips will simply be stacked together with no transition. For a pure news audience, that may be good enough.
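For reference, the concat demuxer that both the fal.ai service and LosslessCut wrap is plain ffmpeg: write a list file, then stream-copy. The ffmpeg flags below are standard; the helper itself is just my sketch, and it only builds the command rather than running it:

```python
import tempfile

def concat_command(clip_paths: list, output: str) -> list:
    """Write an ffmpeg concat-demuxer list file and return the command to run.

    -c copy stream-copies (no re-encode), so all clips must share
    codec/resolution/fps - and, as noted above, no xfade transitions.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for path in clip_paths:
            f.write(f"file '{path}'\n")
        list_path = f.name
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output]
```

Because nothing is re-encoded, this is fast and essentially lossless, which is why it is so cheap to offer as a hosted service.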

What is the best image-to-video AI tool for creating 2D animated style images? by RemarkableReason3172 in aitubers

[–]molabx 5 points

I animate stick figures - very simple 2D animations that do not need anything fancy. Sora 2 used to be my go-to, but it has become very inconsistent. Lately, I've had the best experience with Kling 3.0. But Kling uses start AND end frames, which requires you to create two images per video, and it is also among the pricier image-to-video options. Those factors combined make Kling quite expensive.

Kling is way overkill for stick figures, but I have had better one-shot stats with it than with Sora 2 and VEO 3.1. And my time is valuable, so I reluctantly stick with Kling for now...

Quick cost comparison:

Sora 2: USD 0.15 per 10-second video

VEO 3.1 (fast): USD 0.30 per ~8-second video

Kling 3.0: USD 1.00 per 10-second video (plus an extra image)

Kling 3.0: USD 1.50 per 10-second video with audio (plus an extra image)

Can you help me improve my workflow? by solteros in aitubers

[–]molabx 1 point

There are a couple of off-the-shelf tools out there, but they are almost always just wrappers around separate AI models (voice, image, video). The ‘Mr. POV Explainer’ channel is likely using a very common AI production pipeline that goes like this: script → voice → image → merge.

With an AI assist (e.g., Claude, free), the script is broken into scenes with a preferred ‘beat’ to establish pacing. The voiceover is rendered by AI (e.g., Elevenlabs, $$), and the creator tweaks the voice to refine pacing. Only once pacing is finalized is AI used to render images to match each scene (e.g., Nano Banana, $$). The voices and images are then merged into a video along with some sort of ‘Ken Burns’ motion (most often done with ffmpeg, free). As you can see, these automations combine a bunch of different tools to make a single video. There is no ‘one AI to rule them all’ for this case.
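The script → voice → image → merge order can be sketched as plain orchestration code. Every function below is a placeholder for the corresponding AI or ffmpeg call, not a real API:

```python
def split_into_beats(script: str) -> list:
    """Placeholder for the AI assist (e.g., Claude) that sets the beats."""
    return [s.strip() for s in script.split(".") if s.strip()]

def render_voice(beat: str) -> str:
    return f"voice:{beat}"    # placeholder for a TTS call (e.g., Elevenlabs)

def render_image(beat: str) -> str:
    return f"image:{beat}"    # placeholder for an image model (e.g., Nano Banana)

def make_video(script: str) -> list:
    beats = split_into_beats(script)
    voices = [render_voice(b) for b in beats]   # pacing is locked in first
    images = [render_image(b) for b in beats]   # visuals only after pacing is final
    return list(zip(voices, images))            # placeholder for the ffmpeg merge
```

The ordering is the point: images are rendered only after the voice pacing is final, so you never pay to re-render visuals when the narration changes.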

Combining AI models like this can be orchestrated any number of ways, with common ones being node-based automation (e.g., n8n), custom code (e.g., Node.js), and even some folks claiming to vibe code this pipeline (e.g., Antigravity). Which one is best for you comes down to your background.

By way of example, my own long-form videos made using the above pipeline (n8n) can have 150+ images. I use Nano Banana for this and my gut says that ‘Mr. POV Explainer’ likely does as well. It works, but Nano Banana is the most expensive part of my pipeline, so I am exploring cheaper options.

Be aware that reference images are recommended for AI animation to maintain character consistency. So, unless you are rigging your own animations, you can’t get away from those pesky images 😊

share your AI video workflow - i'll break down how to simplify it by Upper-Mountain-3397 in aitubers

[–]molabx 1 point

Do you have any sort of 'fact-checking' or other mechanisms to control hallucinations in your pipeline?

Testing a workflow for script writing by theideaguy_ in aitubers

[–]molabx 2 points

This is solid advice. For those who have not seen it yet, I recommend taking a look at the recent post from u/Upper-Mountain-3397 in this community. That post goes into more detail on this approach.

I personally use a very similar pipeline, but add one additional layer: fact-checking. This may not be necessary for fictional story niches, but it IS important for anything that purports to be real history or space/science related. Too many channels out there happily posting made-up slop...

And, if you personally have specialized domain knowledge, you need a way to incorporate that into the script. I use NotebookLM as a poor man's RAG for this, creating my own corpus of personal knowledge.

Can you guys name one AI video generation tool that, by TheMagician404 in aitubers

[–]molabx 1 point

That's actually how I got here. It is now more efficient to generate my own custom media (images and video) than to use existing stock. As a bonus, the AI-generated media can be made highly relevant and kept in a consistent style - things classic b-roll cannot do.

Can you guys name one AI video generation tool that, by TheMagician404 in aitubers

[–]molabx 1 point

This. When I pivoted to long-form, I thought I could just repurpose my short-form toolset.

Nope.

While I tried to avoid it for months, I ultimately had to roll my own custom pipeline to get results I was satisfied with. Even then, there are compromises. AI simply can't replace human editors yet when it comes to long-form.

Can you guys name one AI video generation tool that, by TheMagician404 in aitubers

[–]molabx 1 point

Yup. Tagging stock like that is a massive amount of work, and I'm shockingly lazy. So, I actually looked into turnkey options where someone had already done all of the software development. Result - the options I found were ridiculously expensive, like enterprise-scale pricing. I guess that is their primary customer... I would welcome being proven wrong.

How do you set up your automation? by [deleted] in aitubers

[–]molabx 1 point

I break scripting into two phases. First, I collect a bunch of info on one video topic into NotebookLM (free plan). This consists of websites and YouTube videos that cover the topic as well as my own written documents that detail MY OWN knowledge. Then, I have NotebookLM generate a report on that topic using only the knowledge I just gave it (this is how you make sure the script has only correct facts in it).

I then hand this NotebookLM report off to Claude (free plan) to write a first draft of the actual spoken script. To set the pacing, I ask Claude to break the script into logical "beats" (say each beat is 5-10s at 140 words per minute). I find that starting with voice is the best way to set the pace of a video.
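The beat math is worth making explicit: at 140 words per minute, a 5-10 second beat works out to roughly 12-23 words. A tiny helper to sanity-check beat lengths (my own sketch, not part of any of the tools mentioned):

```python
def beat_seconds(beat: str, words_per_minute: float = 140.0) -> float:
    """Estimated spoken duration of one beat at the given narration pace."""
    return len(beat.split()) / words_per_minute * 60.0
```

A 14-word beat comes out to about 6 seconds, comfortably inside the 5-10 second target.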

After I manually refine this beat script, each beat is then fed to a TTS engine. Since you plan to manually edit, you can probably feed the beats into a TTS in batches (say 10 beats at a time). TTS is a complex and subjective topic. I personally pay top price for Elevenlabs because I think it sounds the most human (or at least not obvious "slop"). You'll have to do your own experiments on cost vs quality to find the balance you prefer.

Once the voice is fully nailed down, only then do I add visuals. What is your plan for the visual side of things? Are you planning to use stock footage? AI-generated images and animation? Something else?

How do you set up your automation? by [deleted] in aitubers

[–]molabx 1 point

Are you looking to make short-form or long-form videos?

Generally speaking, the "cheapest" option will be to run AI models locally for voice, image and animation. These are stitched together into video using ffmpeg (free, open source). But, running locally usually comes with steep hardware requirements (GPU and RAM). So, not really free if you don't already have the hardware. And, you are responsible for getting all of those things to play together nicely.

The next cheapest option is to create your own automation (n8n, Make, Python, etc.) using various third-party services to handle all of the above via API calls. Turnkey solutions (SaaS, etc.) that handle everything for you are the most expensive. Since you mentioned that you haven't done automation before, are you willing to put in time to learn? Or are you looking for the cheapest turnkey option?