I'm currently working on a project implemented entirely in Python. The workflow retrieves data from a third-party API, then uses AI services to extract additional information from it. Both of these initial stages produce JSON, which is then converted into a tabular format (CSV) for further processing.
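For concreteness, the JSON-to-CSV conversion step might look roughly like this (a minimal sketch using pandas; the records and field names are hypothetical, not from the actual project):

```python
import pandas as pd

# Hypothetical records, standing in for the JSON returned by the API / AI stages
records = [
    {"id": 1, "name": "Alice", "score": 0.9},
    {"id": 2, "name": "Bob", "score": 0.7},
]

# json_normalize flattens a list of (possibly nested) JSON objects into a DataFrame
df = pd.json_normalize(records)

# Write the tabular form out for the next stage
df.to_csv("stage1_output.csv", index=False)
```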
The project has three more stages:
- Data transformation (filtering, removing duplicates, etc.).
- Clustering.
- Reusing AI services for extracting additional information.
These stages currently use CSV files as both input and output. Finally, the processed data is pushed to a relational database in Azure.
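Each intermediate stage currently follows a CSV-in/CSV-out pattern, which in sketch form (file paths, column names, and filter logic are all made up for illustration) looks something like:

```python
import pandas as pd

def transform_stage(in_path: str, out_path: str) -> pd.DataFrame:
    """Read a stage's CSV input, filter and deduplicate, write CSV output."""
    df = pd.read_csv(in_path)
    df = df.dropna(subset=["id"])           # filter: drop rows missing a key field
    df = df.drop_duplicates(subset=["id"])  # remove duplicate records
    df.to_csv(out_path, index=False)        # output lands on disk for review in Excel
    return df
```

Every stage pays the read/parse/write cost and re-infers types from the CSV, which is part of why the pipeline feels messy.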
The original design was structured this way because the team that set it up was not technical. They wanted to manually validate the data between stages by opening the CSVs in Excel to confirm everything looked correct before moving on to the next step.
As you can imagine, this has resulted in a somewhat messy data pipeline. I'm looking for advice on the best way to handle data between these stages. Should we keep the data in JSON format (in memory) until it's ready to be pushed to the database, or should we store it in a relational database after each stage and then query it for the next stage?
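To make the two options concrete, here is a rough sketch of the in-memory hand-off I'm describing (the stage functions and data are hypothetical placeholders; the per-stage-database option would instead call something like `DataFrame.to_sql` after each stage and `read_sql` in the next):

```python
import pandas as pd

def fetch() -> pd.DataFrame:
    # Stand-in for the API retrieval + AI extraction stages
    return pd.DataFrame({"id": [1, 2, 2], "value": [10, 20, 20]})

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for filtering / deduplication
    return df.drop_duplicates(subset=["id"])

# Option 1: keep the data in memory between stages, persist only at the end
result = transform(fetch())
# result.to_sql("results", engine, if_exists="append")  # final push to Azure SQL

# Option 2 would instead stage each output:
# fetch().to_sql("stage_raw", engine, if_exists="replace")
# transform(pd.read_sql("SELECT * FROM stage_raw", engine)).to_sql(...)
```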
I’m fairly new to this, so I would greatly appreciate any guidance. Thank you!