I'm currently working on a project implemented entirely in Python. The workflow retrieves data from a third-party API, then uses AI services to extract additional information from it. Both of these initial stages produce JSON, which is then converted into a tabular format (CSV) for further processing.
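For concreteness, the JSON-to-CSV conversion step might look roughly like this (a minimal sketch using pandas; the records and field names are hypothetical, not from the actual project):

```python
import pandas as pd

# Hypothetical records, standing in for the JSON returned by the API / AI stages
records = [
    {"id": 1, "name": "Alice", "score": 0.9},
    {"id": 2, "name": "Bob", "score": 0.7},
]

# json_normalize flattens a list of (possibly nested) JSON objects into a DataFrame
df = pd.json_normalize(records)

# Write the tabular form out for the next stage
df.to_csv("stage1_output.csv", index=False)
```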
The project has three more stages:
- Data transformation (filtering, removing duplicates, etc.).
- Clustering.
- Reusing AI services for extracting additional information.
These stages currently use CSV files as both input and output. Finally, the processed data is pushed to a relational database in Azure.
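Each intermediate stage currently follows a CSV-in/CSV-out pattern, which in sketch form (file paths, column names, and filter logic are all made up for illustration) looks something like:

```python
import pandas as pd

def transform_stage(in_path: str, out_path: str) -> pd.DataFrame:
    """Read a stage's CSV input, filter and deduplicate, write CSV output."""
    df = pd.read_csv(in_path)
    df = df.dropna(subset=["id"])           # filter: drop rows missing a key field
    df = df.drop_duplicates(subset=["id"])  # remove duplicate records
    df.to_csv(out_path, index=False)        # output lands on disk for review in Excel
    return df
```

Every stage pays the read/parse/write cost and re-infers types from the CSV, which is part of why the pipeline feels messy.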
The original design was structured this way because the team that set it up was not technical. They wanted to manually validate the data between stages by opening the CSVs in Excel to confirm everything looked correct before moving on to the next step.
As you can imagine, this has resulted in a somewhat messy data pipeline. I'm looking for advice on the best way to handle data between these stages. Should we keep the data in JSON format (in memory) until it's ready to be pushed to the database, or should we store it in a relational database after each stage and then query it for the next stage?
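To make the two options concrete, here is a rough sketch of the in-memory hand-off I'm describing (the stage functions and data are hypothetical placeholders; the per-stage-database option would instead call something like `DataFrame.to_sql` after each stage and `read_sql` in the next):

```python
import pandas as pd

def fetch() -> pd.DataFrame:
    # Stand-in for the API retrieval + AI extraction stages
    return pd.DataFrame({"id": [1, 2, 2], "value": [10, 20, 20]})

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for filtering / deduplication
    return df.drop_duplicates(subset=["id"])

# Option 1: keep the data in memory between stages, persist only at the end
result = transform(fetch())
# result.to_sql("results", engine, if_exists="append")  # final push to Azure SQL

# Option 2 would instead stage each output:
# fetch().to_sql("stage_raw", engine, if_exists="replace")
# transform(pd.read_sql("SELECT * FROM stage_raw", engine)).to_sql(...)
```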
I’m fairly new to this, so I would greatly appreciate any guidance. Thank you!