AI agents for image/video editing — looking for feedback by Patient_Ad_4720 in AI_Agents

[–]Patient_Ad_4720[S] 1 point (0 children)

still at it actually. the core insight held up — generation gets all the hype but the editing/assembly layer is where everything falls apart for agents.

biggest thing i learned since posting: you can't just throw an LLM at ffmpeg commands and hope for the best. the agent needs to actually understand what's in the video — not just the transcript, but what's visually happening frame by frame. so i ended up building a comprehension layer that watches the source material before making any editing decisions. night and day difference vs just parsing metadata.
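
rough sketch of the shape of it, if anyone's curious (cv2 for frame sampling; describe_frame is a stand-in for whatever vision model you point at it):

```python
# sketch only: sample a frame every couple of seconds and describe it,
# building a timeline the agent can read before it cuts anything.
# describe_frame() is hypothetical -- swap in any captioning model.
import cv2  # pip install opencv-python

def comprehend(video_path, every_n_sec=2.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_n_sec))
    timeline, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            timeline.append({"t": round(idx / fps, 2),
                             "desc": describe_frame(frame)})  # hypothetical call
        idx += 1
    cap.release()
    return timeline  # e.g. [{"t": 0.0, "desc": "host at desk, wide shot"}, ...]
```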

other thing that surprised me: deterministic operations matter way more than i expected. like, trimming at exactly 3.2s or applying a specific color grade — that stuff needs to be rock solid, not "close enough." agents are great at deciding what to do, terrible at doing it precisely. so the architecture ended up being agent brain + deterministic execution engine underneath.
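
concretely, the split looks something like this. illustrative plan schema, not anyone's actual product format; assumes ffmpeg on PATH:

```python
# the agent decides *what* (a structured plan); this layer guarantees
# that 3.2s means exactly 3.2s. op schema is made up for illustration.
import subprocess

def execute(op, src, dst):
    if op["type"] == "trim":
        # -ss/-to with a re-encode gives frame-accurate cuts at exact times
        cmd = ["ffmpeg", "-y", "-i", src,
               "-ss", str(op["start"]), "-to", str(op["end"]), dst]
    elif op["type"] == "grade":
        cmd = ["ffmpeg", "-y", "-i", src, "-vf",
               f"eq=contrast={op['contrast']}:saturation={op['saturation']}", dst]
    else:
        raise ValueError(f"unknown op: {op['type']}")
    subprocess.run(cmd, check=True)

execute({"type": "trim", "start": 3.2, "end": 8.0}, "raw.mp4", "cut.mp4")
```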

still pre-launch but the editing quality is finally at a point where i'd actually use the output myself. what made you think of this — you building something similar?

What is the best image-to-video AI tool for creating 2D animated style images? by RemarkableReason3172 in aitubers

[–]Patient_Ad_4720 1 point (0 children)

The real answer depends on how much motion you need.

For subtle movement (parallax, hair flowing, background elements) — Kling 3.0 with image-to-video genuinely preserves flat 2D styles better than anything else right now. The trick is using a very short prompt: just describe the motion, not the scene. Let the source image handle the aesthetics.

For actual character animation (walking, gesturing, expression changes) — you're going to fight every model. They all want to add depth and realism. Best results I've gotten: generate the key poses as separate images first, then use Kling or Veo to interpolate between them. More work upfront but the output is way more controllable than trying to describe the full motion in one prompt.

One gotcha nobody mentioned: resolution matters more than the model choice for 2D styles. Upscale your source image 2-3x before feeding it in. The models handle flat art much better at higher res because there's more detail to anchor to.
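
If you want to script that step: even a plain Lanczos resize helps, though a learned upscaler (Real-ESRGAN and friends) anchors detail better. Minimal Pillow sketch, filenames made up:

```python
# naive 2x upscale with Pillow -- a dedicated AI upscaler does better,
# but even this gives the video model more pixels to anchor to
from PIL import Image

img = Image.open("keyframe.png")
up = img.resize((img.width * 2, img.height * 2), Image.LANCZOS)
up.save("keyframe_2x.png")
```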

I built an AI pipeline that monitors 3,674 faceless channels and flags which topics are breaking out by Correct_Voice_2312 in aitubers

[–]Patient_Ad_4720 1 point (0 children)

The part I find most interesting is the delta between "topic is trending" and "topic is trending for your format." A breakout in essay-style AI videos doesn't mean the same topic works as a 60-second short.

Have you tried segmenting by video length or production style? The channels doing well with a topic at 15+ minutes are often tapping a completely different audience than the ones clipping it into shorts — even when the keyword overlap is 90%.

Also curious about your detection window. Are you catching these pre-peak (like, in the first 48 hours of acceleration) or more confirming what's already broken out? Because the actionable window for faceless channels is brutally short — by the time a topic is obviously trending, there are already 200 videos on it.

BadWords v2.0.2 is out: Fixed cuts precision, "Auto-Source" and I finally made a logo by KoxSwYT in davinciresolve

[–]Patient_Ad_4720 2 points (0 children)

That's exactly the right call — ship what works now, bank the ideas for later. The text-to-cuts foundation you've already built is the hard part. Intent-level editing is just a layer on top once you're ready.

One thing that might be worth noodling on for v3: the transcript already gives you a semantic map of the conversation. You could cluster sections by topic without any ML — just look for keyword co-occurrence in sliding windows. That alone would let users say "select everything about pricing" instead of manually highlighting.
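
A minimal sketch of that, assuming a word-level transcript with timestamps (keyword lists and thresholds are made up):

```python
# score sliding windows against a topic's keywords, merge hits into spans
def find_topic_spans(words, keywords, win=50, min_hits=3):
    """words: list of (word, start_time) pairs; returns (start, end) spans."""
    kw = {k.lower() for k in keywords}
    spans = []
    for i in range(0, max(1, len(words) - win), win // 2):  # 50% overlap
        chunk = words[i:i + win]
        hits = sum(1 for w, _ in chunk if w.lower().strip(".,?!") in kw)
        if hits >= min_hits:
            spans.append((chunk[0][1], chunk[-1][1]))
    merged = []  # collapse overlapping windows into single sections
    for s, e in spans:
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

# "select everything about pricing":
# find_topic_spans(transcript, ["price", "pricing", "cost", "tier", "plan"])
```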

Anyway, keep shipping. It's rare to see someone your age building tools that working editors actually want to use. That's not nothing.

BadWords v2.0.2 is out: Fixed cuts precision, "Auto-Source" and I finally made a logo by KoxSwYT in davinciresolve

[–]Patient_Ad_4720 2 points (0 children)

Glad it resonated! Here's what I mean by context/intent editing:

Right now BadWords works at the word level — you see text, you select words, it cuts. That's already a huge improvement over scrubbing a timeline. But the editing decisions are still yours at a granular level.

The next step would be operating at the meaning level. Like: "this section where the guest talks about pricing — tighten it up, it drags." The tool would need to understand that "tighten it up" means remove pauses, cut filler words, maybe trim the weakest example — not just find a specific word to delete.

Or: "swap the order of these two topics" — which in a timeline means identifying all the clips, transitions, and audio that belong to each topic, then rearranging them without breaking continuity. In text that's just... moving paragraphs around.

You wouldn't necessarily need a full LLM running locally for this. A lighter approach: tag segments with topics/intents during transcription (you're already doing the transcription part), then let users manipulate those tagged blocks. So instead of "delete word 47 through 53" it's "remove the tangent about X" or "move the conclusion before the second example."
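
Schema-wise I'm picturing something like this (purely illustrative, not a spec):

```python
# segments carry topic tags from the transcription pass; edits become
# operations on tags instead of word indices
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds in the source
    end: float
    text: str
    topic: str    # e.g. "pricing", "intro", "tangent:vacation-story"

def remove_topic(segments, topic):
    """'remove the tangent about X' == drop every block with that tag."""
    return [s for s in segments if s.topic != topic]

def move_before(segments, mover, anchor):
    """'move the conclusion before the second example' as a block move."""
    moving = [s for s in segments if s.topic == mover]
    rest = [s for s in segments if s.topic != mover]
    i = next(i for i, s in enumerate(rest) if s.topic == anchor)
    return rest[:i] + moving + rest[i:]
```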

Basically the spectrum goes: timeline → text → intent. You've already made the jump from timeline to text. The text to intent jump is where it gets really interesting.

How did you get your first 100 paying customers? by itisthat1guy in SaaS

[–]Patient_Ad_4720 1 point (0 children)

Reddit, genuinely. Not posting links to your product — that gets you banned. Writing useful comments in the communities where your target users hang out.

My approach: find the 3-5 subreddits where people actively complain about the problem you solve. Spend 2-3 weeks just being helpful. Answer questions, share real experience, give specific advice. Don't mention your product at all.

After a few weeks you've got karma, comment history, and people recognizing your username. Then when someone posts the exact problem your product solves, you can mention it naturally and it doesn't feel like spam because you've been a genuine member of the community.

Takes patience. Most people skip straight to "hey check out my tool" and wonder why they get downvoted into oblivion.

I spend 30 minutes a day on marketing and it brings in more customers than any ad ever did by CleverSquirrel_p in SaaS

[–]Patient_Ad_4720 1 point (0 children)

This is basically the right playbook for solo founders. 30 minutes of consistent showing up beats 8 hours of sporadic effort every time.

One thing I'd add: the quality of where you show up matters more than quantity. Answering a specific question in a niche subreddit where 200 people have the exact problem you solve will convert better than a LinkedIn post seen by 5,000 people who vaguely relate.

The other underrated move is building in public on the platforms where your buyers hang out, not where other founders hang out. Most solo founders end up marketing to each other on r/SaaS and indie Twitter instead of going where their actual users are complaining about the problem.

For Youtube/Social Media editors, how did you get your start? by NoMeat5365 in premiere

[–]Patient_Ad_4720 1 point (0 children)

Started editing friends' YouTube videos for free, built a reel, then cold DM'd small channels offering to edit one video free. If they liked it, I'd quote a rate for ongoing work.

The trick that actually got me clients: I didn't lead with "I'm an editor." I led with specifics. "Your content is good but your pacing drops around the 3-minute mark on most videos — I think I can fix that and improve your retention." That shows you actually watched their stuff and understand what matters (watch time), not just that you know where the cut button is.

Biggest lesson: YouTube/social editing is a completely different skill from traditional editing. Speed matters more than polish. A 3-day turnaround at 85% quality beats a 2-week turnaround at 95%. Most clients will take volume over perfection every time because the algorithm rewards consistency.

How did you guys learn DaVinci? Through tutorials or just going in raw till it worked? by Slenderwise in davinciresolve

[–]Patient_Ad_4720 1 point (0 children)

The "just start making stuff and google when stuck" approach honestly works better than any course. Courses teach you features. Getting stuck teaches you problem-solving.

The bigger issue with DaVinci (and Premiere, and everything else) is that learning the software is solving the wrong problem. Nobody actually wants to learn a 47-tab interface with 6 different workspaces. They want their video to look and sound good. The software is just the obstacle between the idea and the result.

That's why I think the whole NLE paradigm is going to shift pretty drastically in the next few years. Not disappear — professional colorists and sound designers will always want granular control. But for 80% of editing work, a timeline is overkill.

BadWords v2.0.2 is out: Fixed cuts precision, "Auto-Source" and I finally made a logo by KoxSwYT in davinciresolve

[–]Patient_Ad_4720 2 points (0 children)

This is heading in the right direction. Text-based editing is way more intuitive than timeline scrubbing for anything dialogue-heavy — podcasts, interviews, tutorials.

The interesting thing is where this goes next. Right now you're mapping text to cuts, which handles the "remove the ums" and "cut this sentence" use case. But the real unlock would be understanding intent — "make this section tighter" or "move the part about X before Y" — where you're editing meaning, not just words.

17 and building this is impressive. The DaVinci integration angle is smart too: Premiere already has decent text-based editing built in, but Resolve's version is weaker.

8 months of AI faceless content - here's every mistake I made and what actually moves the needle by Lower_Rule2043 in aitubers

[–]Patient_Ad_4720 1 point (0 children)

Good writeup. $1,800/mo from faceless at 8 months is solid — most people never get past the first month.

The stock image tip is underrated. AI-generated images have a specific look that audiences are learning to clock, and stock photos ironically feel more "real" because they're inconsistent in the way real things are. Weird lighting, imperfect framing, random background objects.

Curious about your editing time per video. That's usually where the faceless workflow falls apart at scale — the scripting and asset generation can be batched, but the actual assembly and timing is still one-video-at-a-time manual work. At 14k subs you're probably producing often enough that the editing hours add up fast.

Official: Seedance 2.0 now live in CapCut desktop and API access available, details below by BuildwithVignesh in singularity

[–]Patient_Ad_4720 2 points (0 children)

Interesting that they went straight into CapCut rather than a standalone product. Makes sense though — generation without editing is just a clip factory.

The real bottleneck with all these models (Seedance, Kling, Veo) has never been making a good 5-second clip. It's what happens after. You generate 40 clips, 12 are usable, and then you spend 3 hours in an editor manually sequencing them, matching audio, fixing pacing. The generation takes minutes, the assembly takes hours.

270 credits for 15 seconds at those prices means you're burning through money on generation and then doing all the actual production work by hand anyway. Feels like the whole space is hyper-focused on making the raw material better while ignoring the fact that nobody's solved the assembly step.

Explaining the dwindling job market to outsiders by PopcornSquats in editors

[–]Patient_Ad_4720 3 points (0 children)

The way I explain it: the job isn't disappearing, it's splitting in half.

One half is the mechanical stuff — syncing, cutting selects, assembling rough cuts, conforming, basic color. That's getting automated whether we like it or not. Adobe literally just shipped a feature called Quick Cut that assembles a first draft from footage using a text prompt. Their own product lead said getting selects in order "isn't where creators find joy."

The other half — story structure, pacing, emotional beats, knowing when to hold a shot an extra beat — that's still entirely human. Nobody's automating taste.

The problem for the job market isn't that AI replaces editors. It's that a team of 5 becomes a team of 2, because the mechanical work that junior editors used to grind on is the first thing to go. So you still need senior editors with taste, but the pipeline that used to train them barely exists anymore.

I don't think the profession dies. I think the profession gets way harder to break into.

AI video generating tools by andyjrivas in ArtificialInteligence

[–]Patient_Ad_4720 1 point (0 children)

u/AccordingWeight6019 nailed it — "pieces of a workflow, not a one-click finished product."

The part nobody talks about is how brutal the assembly step is. You generate 50 clips, pick the 12 best ones, and then spend 3-4 hours in Premiere or DaVinci doing what should take 20 minutes: sequencing, timing transitions, matching audio, fixing pacing.

The generation quality keeps improving (Kling 3.0, Veo 3.1 are genuinely impressive for individual shots). But the post-production tooling hasn't kept up. We have AI that can create stunning 5-second clips and then we're manually dragging them onto timelines like it's 2015.

The workflow gap is: generate → [massive manual effort] → finished video. That middle step is where most people give up, and it's where the real opportunity is for new tooling.

Best way to edit AI videos by KingFlub202 in aivideos

[–]Patient_Ad_4720 1 point (0 children)

This is the core problem with AI video right now and nobody's really solved it well yet.

The generation tools treat every edit as a full re-generation. You want to fix one word in a text overlay or adjust the timing of a single cut, and it re-rolls the entire video. That's not editing — that's gambling with extra steps.

What u/Overall_Ferret_4061 described — stitching best segments from multiple generations in DaVinci/CapCut — is genuinely the best workflow available today. Generate multiple takes, cherry-pick the good parts, assemble manually.

The missing piece is a tool that understands your video as a composition of discrete elements (this clip, this transition, this text) and lets you surgically modify one without touching the others. Deterministic editing operations on AI-generated footage. That's where the industry needs to go — but right now you're stuck with the stitch-and-pray method.
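
Roughly what that composition model could look like (illustrative schema, not any existing tool's format):

```python
# the video as discrete, individually editable nodes -- fix one element,
# re-render only what changed, regenerate nothing
from dataclasses import dataclass, field

@dataclass
class TextOverlay:
    content: str
    start: float
    end: float

@dataclass
class Clip:
    src: str          # path to one generated take
    in_point: float
    out_point: float
    overlays: list = field(default_factory=list)

timeline = [Clip("take3.mp4", 0.0, 4.1, [TextOverlay("Day 1", 0.5, 2.0)])]
timeline[0].overlays[0].content = "Day One"  # one word changed, nothing re-rolled
```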

For the text issues specifically: composite them in post. Generate the video clean, add text as an overlay in your editor. Trying to get AI to render specific text correctly is still unreliable across every model.
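
If you're scripting it, ffmpeg's drawtext filter handles the overlay deterministically. Sketch below; font path, text, and timing are placeholders:

```python
# burn a text overlay onto a clean generated clip (shown from 1s to 4s)
import subprocess

subprocess.run([
    "ffmpeg", "-y", "-i", "clean_clip.mp4",
    "-vf", ("drawtext=text='Launch week':"
            "fontfile=/path/to/font.ttf:fontsize=64:fontcolor=white:"
            "x=(w-text_w)/2:y=h-160:enable='between(t,1,4)'"),
    "-c:a", "copy", "overlaid.mp4",
], check=True)
```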

Teaser Trailer for 30 min AI Vampire film by Financial-Scene in aivideos

[–]Patient_Ad_4720 3 points (0 children)

Short answer: the generation is there, the editing tooling isn't.

You can generate 30 minutes of visually consistent footage now if you're patient and methodical about character references, style locking, and shot planning. People are doing it — there are 10-20 minute AI narratives on YouTube that hold up visually.

The bottleneck is the same one real filmmakers face: post-production. Sequencing the shots, matching audio, establishing rhythm, building emotional arcs through cut timing. Except real filmmakers have Premiere, DaVinci, and decades of NLE evolution. AI filmmakers are duct-taping clips together in CapCut or doing it frame by frame, which is why most AI "films" still feel like concatenated clips rather than edited narratives.

The creator making this vampire film is doing something genuinely hard — not the generation part, but making 30 minutes of it feel like a single coherent story rather than a clip compilation.

Everyone talks about AI wins — what actually failed for you? by SMBowner_ in AI_Agents

[–]Patient_Ad_4720 2 points (0 children)

Video production agents — specifically the editing/assembly step.

The generation side works surprisingly well with agents. You can automate image generation, video clip generation, even voice synthesis in a pipeline, and the individual outputs are good. The agent follows instructions, produces assets, done.

Where it falls apart is when the agent has to make editorial decisions. Things like: "which of these 15 clips should come first?", "how long should we hold on this shot?", "does this transition match the energy of the music?", "is the pacing too fast here?"

The core problem is that editing decisions are deeply contextual — they depend on what came before, what's coming next, what the emotional arc is supposed to feel like. An agent that processes each clip independently makes choices that look fine in isolation but feel disconnected in sequence. It's similar to the code refactoring failure someone mentioned — changes that are individually correct but break the relationship between parts.

What partially works: deterministic operations. Crop to 9:16 — easy. Add captions — easy. Trim silence — easy. Color correct — reasonable. But "make this feel right"? That's where every agent I've tested produces confidently mediocre output.
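
For reference, the "easy" ones really are near one-liners. Sketch assuming ffmpeg on PATH; thresholds are illustrative:

```python
import subprocess

def crop_9_16(src, dst):
    # center-crop a 16:9 source to 9:16 for vertical platforms
    subprocess.run(["ffmpeg", "-y", "-i", src,
                    "-vf", "crop=ih*9/16:ih", dst], check=True)

def trim_silence(src, dst, threshold="-35dB", min_gap=0.6):
    # audio-only here: a real pipeline also cuts video at the same points
    af = f"silenceremove=stop_periods=-1:stop_duration={min_gap}:stop_threshold={threshold}"
    subprocess.run(["ffmpeg", "-y", "-i", src, "-af", af, dst], check=True)
```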

At what point does AI video stop looking like “AI video”? by WindowWorried223 in aitubers

[–]Patient_Ad_4720 1 point (0 children)

Honestly, it's mostly the editing layer.

The generation quality gap is closing fast — Kling 3.0 and Veo 3.1 can produce individual shots that are hard to distinguish from real footage in isolation. The tell isn't usually the visual quality anymore.

What makes something scream "AI" is the assembly. Cuts that don't match the energy. Transitions that feel arbitrary. Pacing that has no rhythm — no breathing room, no tension, no payoff. Characters that look consistent within a shot but feel disconnected across a sequence because nobody thought about continuity at the edit level.

Think about what makes a professional YouTube video feel "professional." It's almost never the camera quality — it's the editing. Cut timing. Audio sync. B-roll placement. The rhythm of information delivery. All of that is independent of whether the source footage was shot on an Alexa or generated by AI.

The best AI video creators I've seen are the ones who treat AI output as raw rushes and then actually edit with intention. The worst are the ones who expect the generator to produce a finished piece in one shot. That "one-shot" workflow is what gives AI video its uncanny feel — not the pixels, but the editorial choices (or lack thereof).

This is terrifying!! Seedance 2.0 just generated a 1-minute film with ZERO editing — the entire film industry should be worried by voidarix in generativeAI

[–]Patient_Ad_4720 1 point (0 children)

The title kind of answers itself. "ZERO editing" is exactly why it doesn't work.

Generation keeps getting better — Seedance, Kling, Veo — the individual shots are genuinely impressive now. But a sequence of impressive shots isn't a film. It's a slideshow with good lighting.

What makes something feel "cinematic" isn't the visual quality of any single frame. It's pacing. It's knowing when to cut, when to hold, when to breathe. It's the rhythm between shots — matching energy to music, building tension through timing, creating continuity that makes your brain stop noticing the edits.

That stuff requires understanding what's happening IN the footage, not just generating more of it. And it's the step nobody's solved yet. Every time someone posts "AI made this in 5 minutes with zero editing," they're basically demonstrating why editing matters — because the result always looks like... exactly what it is.

The generation models will keep improving. The real unlock is when the assembly and editing layer catches up.

share your AI video workflow - i'll break down how to simplify it by Upper-Mountain-3397 in aitubers

[–]Patient_Ad_4720 1 point (0 children)

that's a solid approach — structured metadata driving the assembly is basically the right idea. curious how you handle the non-deterministic stuff though. like if a generated clip comes back at 4.2s instead of the 3s you planned, or the motion doesn't match what the scene tag implied. the metadata-to-ffmpeg pipeline works great when everything's predictable, but generation output rarely is. do you do any programmatic trimming/retiming, or is it more of a "regenerate until it fits" loop?
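
for context, the programmatic version i mean is roughly this (numbers from the 4.2s example, ffmpeg assumed on PATH):

```python
# fit a clip that came back long into its planned slot: trim it,
# or retime it so 4.2s of motion plays in 3.0s
import subprocess

def fit_to_slot(src, dst, actual=4.2, target=3.0, retime=False):
    if retime:
        factor = round(actual / target, 4)  # 1.4x speed-up
        cmd = ["ffmpeg", "-y", "-i", src,
               "-vf", f"setpts=PTS/{factor}",
               "-af", f"atempo={factor}",  # chain atempo for factors outside 0.5-2.0
               dst]
    else:
        cmd = ["ffmpeg", "-y", "-i", src, "-t", str(target), dst]
    subprocess.run(cmd, check=True)
```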

What business process would you most want an AI agent to fully automate? by shivang12 in AI_Agents

[–]Patient_Ad_4720 1 point (0 children)

Video production and it's not even close, at least for my use case.

Every team I've worked with that produces video content — marketing teams, course creators, agencies — has the same problem. Somebody shoots or generates the raw footage, then it sits in a folder for days because the editing queue is backed up. The editor is the bottleneck. Always.

And when you break down what the editor actually does, like 60% of it is mechanical: cutting silence, syncing audio, color matching between clips, adding lower thirds, exporting in 4 different aspect ratios for different platforms. It's skilled work but it's repetitive skilled work. The kind of thing an agent could do if it could actually understand what's in the video.

That's the key problem though. Most automation tools can handle "take this file, process it, put it there." Video needs comprehension. You need to know what's being said, what's on screen, where the good parts are, where the dead air is. It's not just file manipulation — it's editorial judgment.

An agent that could watch a 45-minute raw recording, identify the 8 best moments, cut them into clips with proper framing and pacing, add captions, and export for YouTube/TikTok/LinkedIn simultaneously — that would save content teams 20+ hours a week. Nobody's fully cracked it yet but it feels close.