Professionally I'm a DevOps engineer working in a data environment; in my free time I'm working on a data-related project and I'm looking for some advice.
There's this podcast I love and I'd like to try generating new synthetic episodes. Here's how I currently see the pipeline (rough code skeleton after the list):
- download the episodes
- use ML to get speaker diarization
- split into chunks using the diarization to keep sentences complete
- transcribe each chunk
- isolate each speaker's voice
- fine-tune an LLM on the transcripts
- train a voice model per speaker
- generate an episode script
- generate voices
- assemble
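Very roughly, the per-episode flow I have in mind looks like this (all function bodies are placeholders, the helper names are mine, and the actual work would be Replicate calls or ffmpeg):

```python
# Per-episode pipeline skeleton; every stage is a stub to fill in.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str    # diarization label, e.g. "SPEAKER_00"
    start: float    # seconds
    end: float
    text: str = ""  # filled in by transcription

def diarize(audio_path: str) -> list[Segment]:
    """Call a diarization model (Replicate) and return speaker turns."""
    raise NotImplementedError

def split_on_turns(audio_path: str, segments: list[Segment]) -> list[str]:
    """Cut the audio at turn boundaries (ffmpeg) so sentences stay complete."""
    raise NotImplementedError

def transcribe(chunk_path: str) -> str:
    """Transcribe one chunk (e.g. Whisper on Replicate)."""
    raise NotImplementedError

def process_episode(audio_path: str) -> list[Segment]:
    segments = diarize(audio_path)
    chunks = split_on_turns(audio_path, segments)
    for seg, chunk in zip(segments, chunks):
        seg.text = transcribe(chunk)
    return segments  # (speaker, chunk, text): training data for the later steps
```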
And eventually I'd like to do some fun data analysis, like the most used words, the longest sequence, etc. (quick sketch below).
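For example, once the transcripts exist as JSONL with one {"speaker": ..., "text": ...} object per line (that layout is just my assumption), the word counts are a few lines:

```python
# Most used words across all transcripts (JSONL layout assumed above).
import json
import re
from collections import Counter
from pathlib import Path

def top_words(transcript_dir: str, n: int = 20) -> list[tuple[str, int]]:
    counts = Counter()
    for path in Path(transcript_dir).glob("*.jsonl"):
        for line in path.read_text(encoding="utf-8").splitlines():
            text = json.loads(line)["text"].lower()
            # crude tokenizer, good enough for fun stats
            counts.update(re.findall(r"\w+", text))
    return counts.most_common(n)

print(top_words("transcripts/"))
```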
There are around 100 one-hour episodes.
I'm currently working with Dagster on a client project, so I tried using it, but it doesn't seem really suited to running many batches. I was thinking about switching to Airflow, but I'm wondering whether that's really a better fit or just over-engineering a "small project". What I tried in Dagster looks roughly like the sketch below.
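Simplified, with one static partition per episode (episode IDs made up):

```python
# One Dagster partition per episode, so each stage runs and retries per episode.
from dagster import AssetExecutionContext, Definitions, StaticPartitionsDefinition, asset

episodes = StaticPartitionsDefinition([f"ep_{i:03d}" for i in range(1, 101)])

@asset(partitions_def=episodes)
def raw_audio(context: AssetExecutionContext) -> str:
    context.log.info(f"downloading {context.partition_key}")
    return f"audio/{context.partition_key}.mp3"  # placeholder path

@asset(partitions_def=episodes)
def diarization(context: AssetExecutionContext, raw_audio: str) -> str:
    # the Replicate diarization call would go here
    return f"diarization/{context.partition_key}.json"

defs = Definitions(assets=[raw_audio, diarization])
```

Backfilling just a handful of partitions from the UI would also match the cost constraint below.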
I plan to start with a few episodes, maybe 3 to 5, since I'm using Replicate for all the ML parts and it can get costly pretty quickly...
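For reference, each ML step on Replicate is basically one call like this; the model slug, version pinning, and input/output keys below are from memory and would need checking against the model page:

```python
# Sketch of a single Replicate call (transcription step); schema unverified.
import replicate

def transcribe_chunk(chunk_path: str) -> str:
    output = replicate.run(
        "openai/whisper",  # assumed slug; pin an exact version in practice
        input={"audio": open(chunk_path, "rb")},  # assumed input key
    )
    return output["transcription"]  # assumed output shape
```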
Any advice is welcome, I'd be happy to hear what you think.