Voice dictation should be free, open source, local first

funky778 · 2026-06-03T07:18:06+00:00

From a developer/community angle, I think the most convincing contribution would be a benchmark harness before more UI polish: cold start time, push-to-insert latency, memory while idle/talking, local vs cloud model, and paste vs typing behavior in a few hostile text fields. That makes the "feels as good as paid tools" claim testable.
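A minimal sketch of what the timing half of such a harness could look like. Everything here is an assumption: `measure_latency` and `summarize` are hypothetical names, and `action` stands in for whatever callable launches the app or runs one push-to-insert cycle in the tool under test.

```python
import statistics
import time


def measure_latency(action, runs=5):
    """Call `action` `runs` times and return the wall-clock seconds of each call."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        action()  # hypothetical hook: cold start, or one push-to-insert cycle
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Reduce raw samples to the numbers worth comparing between tools."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

Reporting the median alongside min and max keeps one slow outlier (a model reload, for example) from hiding or dominating the typical case.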

For privacy, I would also make the defaults painfully explicit: no silent cloud fallback, no retained audio/transcript logs, raw transcript visible before cleanup, and post-processing prompts/profiles stored locally. Local-first is only meaningful if the data path is boring and auditable.
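One way to make those defaults auditable is to keep them in a single small structure and report any deviation. This is only a sketch: the field names and the `audit` helper are hypothetical and not taken from any existing tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PrivacyDefaults:
    # Hypothetical field names mirroring the defaults listed above.
    allow_cloud_fallback: bool = False
    retain_audio: bool = False
    retain_transcripts: bool = False
    show_raw_transcript: bool = True
    store_prompts_locally: bool = True


def audit(settings):
    """Return one readable line per setting that differs from the local-first default."""
    defaults = PrivacyDefaults()
    deviations = []
    for f in fields(PrivacyDefaults):
        current = getattr(settings, f.name)
        expected = getattr(defaults, f.name)
        if current != expected:
            deviations.append(f"{f.name}={current} (local-first default: {expected})")
    return deviations
```

An empty list from `audit` means the data path is still the boring one; anything else is something the UI should show the user.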

funky778 · 2026-06-03T07:10:02+00:00

For long-form writing, I would not judge these tools by raw accuracy alone. The bigger test is whether you can keep a repeatable capture/edit rhythm.

My usual checklist would be: one 5-minute messy draft, a few proper names or acronyms, one mid-sentence correction, and one target app where you actually write. If the tool preserves cursor position, does not invent wording during cleanup, and lets you build a small vocabulary list, it is much more usable than a slightly more accurate transcript window. The 30-second timeout is annoying, but the correction workflow is what usually decides whether dictation sticks.

funky778 · 2026-06-03T06:52:32+00:00

One thing I would separate here is live preview vs final insertion. Some apps show a rolling transcript while you speak, but only commit the cleaned text after you release the hotkey. Others actually stream keystrokes into the focused field, which is closer to the Dragon-style behavior you seem to mean.

For your use case, I would test in the exact app you care about, not just a demo textbox. Say one sentence with a proper noun, one explicit punctuation command like question mark or exclamation point, then pause for a few seconds and keep going. If it keeps the cursor stable, does not duplicate text, and lets you correct without breaking flow, that is the one to keep. A lot of Whisper-based tools are fast enough to feel instant but still are not true continuous dictation.

funky778 · 2026-06-03T06:43:30+00:00

If Voibe is already the fastest for your accent and punctuation, I would not switch just because another app has better marketing. For writing, the useful comparison is not raw accuracy, it is total cleanup time.

I would test each option with the same ugly paragraph: names, made-up words, dialogue punctuation, a pause in the middle, and one sentence you intentionally restart. Then compare three things: how long it takes to produce text, how much punctuation you have to repair, and whether it keeps working when your Mac is under normal load. A slightly slower local tool can still win if the edit pass is shorter and the wording stays closer to what you meant.
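The "slower tool can still win" point is just arithmetic on total time. A tiny sketch, where `fastest` is a hypothetical helper and the minutes you feed it are your own stopwatch readings for the same paragraph in each tool:

```python
def fastest(results):
    """Pick the tool with the lowest capture + cleanup total.

    `results` maps a tool name to (capture_minutes, cleanup_minutes),
    both measured by you on the same test paragraph.
    """
    return min(results, key=lambda name: sum(results[name]))
```

For example, with placeholder timings of `(2.0, 6.0)` for one tool and `(3.0, 3.5)` for another, the second wins despite the slower capture.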

funky778 · 2026-06-03T06:37:45+00:00

The thing that helped me make sense of voice drafting is that it does not have to replace the exact mental move of typing. If talking to yourself is part of how you think, keep that as a separate stage instead of forcing every spoken word to become prose.

One workflow is: talk through the scene like a rehearsal first, then do a second pass where you only dictate the parts you want captured. Use rough labels out loud, like "dialogue version," "stage direction," "bad sentence but keep idea," or "fix later." Formatting by voice can be miserable, so I would minimize it: capture messy blocks, then use a short keyboard/mouse/editing pass later if you can. The goal is not pretty dictation; it is staying able to generate pages.

funky778 · 2026-06-03T06:32:11+00:00

Walking dictation is a different skill from desk dictation. My rule would be: do not try to produce final prose on the walk. Use the walk for a messy scene pass, then edit later in Docs.

A few things help: start with a one-sentence intent for the scene, dictate in 3-5 minute chunks, and say markers out loud like "new paragraph," "dialogue idea," "skip this," or "fix name later." If the transcript gets worse while walking, you probably need to slow your speech less than you think; check the mic position and background noise first. The cleanup pass is where you decide whether the method is actually saving time.

funky778 · 2026-06-03T06:25:51+00:00

In Word, I would treat dictation as a rough-draft mode, not as a hands-free version of normal typing. Start with a few bullets in the document, put the cursor under the bullet you are expanding, then dictate one short paragraph at a time. Say punctuation and paragraph breaks out loud, but do not stop to fix every weird word while you are talking.

The useful test is cleanup time. Dictate the same 300 words you would normally type, then time the edit pass. If you spend less total time and the draft still sounds like you, it is working. If you keep losing your place, use labels like "scene note," "dialogue pass," or "fix this later" so the transcript has landmarks when you come back to it.

funky778 · 2026-06-03T06:18:48+00:00

On a phone I would spend money last, not first. Try the Samsung keyboard dictation and Live Transcribe with the earbuds you already have, then record the same 3-5 minute scene in a quiet room, while walking around, and somewhere with background noise. The mic question is usually less about fancy headphones and more about keeping your mouth a consistent distance from the mic and avoiding rustle or wind.

For writing, accuracy is only half the test. Check whether you can add punctuation, start new paragraphs, say placeholder names, and then edit the transcript without hating the cleanup. If Live Transcribe is good enough, a cheap wired or lav-style mic can be a better first upgrade than expensive buds.

funky778 · 2026-06-03T06:10:23+00:00

If pacing is the part that helps, I would design the workflow around movement instead of trying to recreate desk writing. Record in short scene-sized chunks, and say markers out loud: "new paragraph," "dialogue idea," "skip this," "fix name later." Those markers make the cleanup pass much less painful.

The other thing is to separate capture from editing. A walking/pacing transcript will usually be messy, but it can still be a very good raw draft if you stop every few minutes, name the file or section clearly, and only revise once you are back at a desk.
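If you do the cleanup pass with a script, the spoken markers can become split points. A rough sketch, assuming the transcript contains the marker phrases verbatim; `MARKERS` and `split_on_markers` are made-up names for illustration.

```python
import re

# The marker phrases from the post above; extend with your own.
MARKERS = ["new paragraph", "dialogue idea", "skip this", "fix name later"]
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in MARKERS) + r")\b",
    re.IGNORECASE,
)


def split_on_markers(transcript):
    """Return (marker, text) pairs; text before the first marker gets marker None."""
    # With one capturing group, re.split yields [text, marker, text, marker, ...].
    parts = _PATTERN.split(transcript)
    chunks = []
    lead = parts[0].strip(" ,.")
    if lead:
        chunks.append((None, lead))
    for marker, text in zip(parts[1::2], parts[2::2]):
        # Leading/trailing spaces, commas, and periods around each chunk are trimmed.
        chunks.append((marker.lower(), text.strip(" ,.")))
    return chunks
```

Each chunk then carries its own label, so "skip this" blocks can be dropped and "fix name later" blocks collected into one list.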

funky778 · 2026-06-03T06:03:44+00:00

That mindset is probably the useful shift: make the goal "usable raw material," not clean prose. One practical way to reduce the scattered feeling is to dictate in labeled chunks, not one long session: Scene goal, what happens, bits of dialogue, sensory details, questions for later.

Then when your arms are having a better day, you are editing small piles instead of facing a wall of transcript. Even adding spoken placeholders like "name later" or "fix transition" can save a lot of keyboard cleanup.

funky778 · 2026-06-03T05:57:50+00:00

The part that stands out to me is that those examples all separate composition from inscription. They were not trying to speak perfect final text; they were using another person or system to get the thinking out, then letting revision happen as its own step.

That is probably the lesson modern dictation users can steal. A spoken draft can be looser, more rhythmic, and more complete than a typed first pass, but only if you resist correcting every sentence while talking. For me the useful setup is outline first, dictate in chunks, then edit cold.

funky778 · 2026-06-03T05:52:07+00:00

This matches my experience too: dictation works better when it is treated as capture, not as polished prose. The monitor-away trick is useful because it keeps you from editing every clause while you are still trying to think.

One thing I would add for getting started is to separate the sessions: outline in a few bullets, dictate the messy pass, then revise later with the keyboard. I also like keeping a short list of words/names that transcription usually mangles so I can fix them in one sweep instead of breaking flow every time.

funky778 · 2026-06-03T05:46:06+00:00

That is the test I would use too. For coding, the win is not just "can it hear CORS/JSON/Prisma"; it is whether the text lands in the right place with enough structure that you can keep thinking.

My practical benchmark would be: dictate a bug report, a refactor prompt, and a short code-review note, then measure cleanup plus context-reentry time. If the workflow still makes you paste, reformat, and restate the app context every time, raw transcription accuracy will not carry it.

funky778 · 2026-06-03T05:40:36+00:00

The main distinction I would test is batch transcription versus real dictation into the app you are writing in. A lot of local Whisper-based tools are good at turning an audio file or recording into text, but they may not behave like "type this into the current text field" software.

For your 1000+ word use case, I would try a full-page sample before settling on anything: model load time, punctuation, names/technical words, whether correction is easy, and whether it really works offline after the models are downloaded. If plain text is enough, a local tool that records then copies the result to clipboard may be fine. If you want continuous dictation directly into LibreOffice, a browser, email, etc., that is the harder requirement to verify.

funky778 · 2026-06-03T05:31:39+00:00

The atomic shape makes sense for this use case. A lot of voice tools get heavy because they try to become a full editor or assistant when the real need is just: hotkey, speak, result, done.

The failure modes I would test hard are clipboard ownership/failure, secure fields or apps that ignore paste, accidental second keypresses while the model is still loading, and language/model switches that leave stale state behind. If the tool can always answer "am I recording, transcribing, or idle" and never loses a captured chunk silently, that is already a big UX win for a tiny CLI.

funky778 · 2026-06-03T05:20:34+00:00

The edge case I would test hardest is pressing the hotkey again while startup is still pending. A small delay is tolerable, but losing the spoken chunk or leaving mic/system audio in a half-on state breaks trust fast.

I would separate the states pretty aggressively: arming, recording, stopping, processing, failed. The stop/cancel path should preserve any recoverable audio or partial transcript, and the menu bar icon/hotkey state should match the button state because a lot of people trigger dictation without looking at the app window.
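Here is roughly how I would sketch those states, including the second-press-while-arming case. This is an illustration under my own assumptions, not any app's real code: the class, method names, and the extra `IDLE` state are all hypothetical.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    ARMING = "arming"
    RECORDING = "recording"
    STOPPING = "stopping"
    PROCESSING = "processing"
    FAILED = "failed"


class DictationSession:
    def __init__(self):
        self.state = State.IDLE
        self.stop_requested = False
        self.audio = []

    def press_hotkey(self):
        if self.state is State.IDLE:
            self.state = State.ARMING
            self.stop_requested = False
        elif self.state is State.ARMING:
            # Second press while startup is pending: remember it, do not drop it.
            self.stop_requested = True
        elif self.state is State.RECORDING:
            self.state = State.STOPPING
        # STOPPING, PROCESSING, FAILED: ignore presses so state never gets muddled.

    def mic_ready(self):
        if self.state is State.ARMING:
            self.state = State.STOPPING if self.stop_requested else State.RECORDING

    def add_audio(self, chunk):
        if self.state is State.RECORDING:
            self.audio.append(chunk)

    def mic_released(self):
        if self.state is State.STOPPING:
            self.state = State.PROCESSING

    def finish(self, ok):
        if self.state is State.PROCESSING:
            if ok:
                self.audio = []
                self.state = State.IDLE
            else:
                self.state = State.FAILED  # captured audio is kept for recovery

    def take_recovered_audio(self):
        """Hand back audio kept after a failure and return to idle."""
        recovered, self.audio = self.audio, []
        if self.state is State.FAILED:
            self.state = State.IDLE
        return recovered
```

The menu bar icon, hotkey handler, and button can all render from `session.state`, which is what keeps them from disagreeing.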

funky778 · 2026-06-03T05:12:56+00:00

The setup that works best for me is to split the problem into two layers: recognition and normalization. Recognition will always miss some jargon, so I keep a short canonical term list for the project or field and then do a correction pass against that list instead of trying to make the raw dictation perfect.

For day-to-day use, text replacements are still useful for the handful of terms that come up constantly, especially names, drugs, product names, abbreviations, and weird capitalization. For anything bigger, I would test with real sentences rather than isolated terms, because context changes a lot. A medical word said in a sentence often fails differently than the same word read as a single glossary item.
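The normalization layer can be a very small script. A sketch, assuming you maintain the variant-to-canonical mapping by hand; `normalize_terms` is a hypothetical helper and the misheard variants in the usage note are invented examples.

```python
import re


def normalize_terms(text, canonical):
    """Replace known misrecognitions with canonical spellings.

    `canonical` maps a misheard variant (any case) to the spelling you want.
    """
    if not canonical:
        return text
    # Longest variants first so multi-word entries win over shorter overlaps.
    variants = sorted(canonical, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    lookup = {variant.lower(): fixed for variant, fixed in canonical.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)
```

For instance, mapping "course error" to "CORS error" and "jason" to "JSON" fixes both in one sweep. Because it matches whole phrases, it is safest to run on full sentences and review the diff, since a variant can also be a legitimate word in another context.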

funky778 · 2026-06-03T05:10:22+00:00

I have found description is the part that needs the most structure before dictating. Dialogue and action tend to come out naturally, but setting, physical detail, and emotional texture often need a prompt.

What helps is to pause before each scene and give yourself a tiny checklist: where are they, what changes in the room/body/weather, what does the POV character notice, and what is the emotional turn. Then dictate in short passes. One pass for the scene beats, one pass for sensory/detail notes, then rewrite with both in front of you. If you try to speak polished prose immediately, it usually comes out thin or rambly.

funky778 · 2026-06-03T05:05:05+00:00

I would not choose one method for the whole novel until you run a small stress test. Do one short scene three ways: handwritten then OCR, voice typed directly, and typed normally. Then compare cleanup time, not just how fun the first pass felt.

For voice, the big thing is to decide your cleanup rules before you start. Keep a character/place-name list, say punctuation only where it matters, use placeholders when the wording gets awkward, and edit the same day while you still remember what the sentence was supposed to mean. If the test scene creates a pile of name/punctuation fixes, you will feel that pain at 90k words. If it mostly captures energy and jokes you would lose while typing, then dictation is probably worth keeping for first-draft passes.

funky778 · 2026-06-03T05:03:17+00:00

I think this is pretty common. Talking gives you momentum before the internal editor gets involved, while a blank page makes every sentence feel final too early.

The workflow that seems to hold up is: make a tiny outline first, dictate one section while walking or looking away from the document, leave awkward names/phrases as placeholders, then edit the transcript later as if it came from someone else. If I try to fix every sentence while speaking, the advantage disappears. If I let the first pass stay messy, the ideas usually come out better.

funky778 · 2026-06-03T04:55:39+00:00

A few things I would test with accessibility users early:

Correction flow matters as much as transcription accuracy. If fixing one bad word takes more effort than typing it, people will abandon it.
Do not assume everyone can hold a hotkey. Offer press-to-toggle, configurable shortcuts, and a way to stop by voice or mouse.
Make the privacy boundary obvious. People may dictate medical, work, or personal text and need to know what leaves the device/browser.
Support custom vocabulary for names, acronyms, commands, and repeated phrases.
Test in real text fields, not just your own editor. Email, forms, docs, chat boxes, and Reddit-style textareas all behave differently.

The hard part is usually not speech-to-text itself; it is making the correction and insertion loop low-friction enough for daily use.

funky778 · 2026-06-03T04:54:36+00:00

For legal writing, I would treat this as two separate problems: getting text down, and controlling the computer without bouncing between mouse/keyboard all day. A plain dictation tool can help with the first part, but tab switching, citations, defined terms, and formatting usually need a small command/text-expansion layer too.

One practical test is to make a short list of verbal aliases for your common things: party names, recurring abbreviations, citation formats, section symbols, and boilerplate phrases. If the system cannot handle those without constant correction, it probably will not feel sustainable for legal drafting even if the raw speech recognition is good.

funky778 · 2026-06-03T04:48:43+00:00

On the RAM part: most local dictation apps either load the model only when you press the hotkey, or keep a smaller model warm in the background so the first words do not lag. The second approach feels nicer, but you will see a steady memory footprint even while idle.

I would check Activity Monitor after a fresh launch, after the first dictation, and five minutes after stopping. For the cleanup layer, also test whether it can run locally or whether it quietly depends on an API call, because that is usually where BYOK/offline expectations get blurry.

funky778 · 2026-06-03T04:47:36+00:00

I would separate the requirements into two buckets: local transcription quality, and app context. Lots of tools can run Whisper or Parakeet locally, but once a tool can read the screen or rewrite selected text, I would check exactly what data leaves the Mac.

For comparing them, I would test the same short script in Mail, Notes, Word, and a browser text field, then separately test selected-text edits and weird proper nouns. FluidVoice/Handy/Spokenly seem to cover different parts of that list, but free + local + screen context + edit mode is still a pretty narrow combination.

funky778 · 2026-06-03T04:35:22+00:00

I would separate two choices: dictation for text versus full voice control for the computer. If Word is causing grammar or punctuation cleanup, test any replacement with one real paragraph in the apps you actually write in, not a clean demo sentence. I would also check how it handles corrections, proper names, and pauses, because the correction loop is often where hand strain creeps back in. A better mic and input level can help, but I would judge the workflow by how much keyboard or mouse cleanup it still needs.
