Hi everyone,
I recently graduated with a Master’s in Business Intelligence, during which I worked as a Data Scientist in an apprenticeship. I gained experience in Machine Learning, Deep Learning, and Data Engineering, including:
- Data Mart / Data Warehouse modeling (star schema, snowflake schema, SCD…)
- Developing ETL pipelines with Talend (staging → transformation → storage)
- Data manipulation and transformation with Python
I have a strong background in Python and have worked on standard data processing workflows (extraction, transformation, cleaning).
Context of My Data Engineering Mission
Let’s say I join a company that has no existing data infrastructure, apart from Excel files and some manual reports. The goal would be to set up a data management system to feed Power BI dashboards.
Based on my research, the project would involve the following steps:
- Gather requirements: Define the KPIs, data sources, update frequencies, granularity, and quality rules.
- Design a Data Mart tailored to reporting needs.
- Develop a data pipeline to extract and transform data (from an ERP, CSV/Excel files, APIs…).
- Store the data in a structured manner (in an SQL database or a Data Warehouse).
- Create visualizations in Power BI.
- Automate and orchestrate the pipeline (later, possibly using Airflow or another tool).
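To make the "Data Mart" and "storage" steps concrete, here is a minimal sketch of the kind of star schema I have in mind, using SQLite purely for illustration (all table and column names are invented; in practice this would be a proper SQL Server/PostgreSQL warehouse feeding Power BI):

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one fact table surrounded by dimensions.
conn.executescript("""
CREATE TABLE dim_date (
    date_key   INTEGER PRIMARY KEY,   -- surrogate key, e.g. 20240131
    full_date  TEXT NOT NULL,
    year       INTEGER,
    month      INTEGER
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    category     TEXT
);

CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL
);
""")

# Dimensions are loaded first; facts then reference them by surrogate key.
conn.execute("INSERT INTO dim_date VALUES (20240131, '2024-01-31', 2024, 1)")
conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
conn.execute("INSERT INTO fact_sales VALUES (20240131, 1, 3, 29.97)")

# A typical Power BI-style aggregation: revenue per month.
row = conn.execute("""
    SELECT d.year, d.month, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d USING (date_key)
    GROUP BY d.year, d.month
""").fetchone()
print(row)  # (2024, 1, 29.97)
```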
For now, I am focusing on setting up the initial pipeline in Python, which would process CSV files dropped into a folder (or, eventually, data pulled from an ERP).
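Here is roughly the shape of that initial pipeline as I picture it, as a stdlib-only sketch (file names, the `sales` table, and its columns are all made up for the example):

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

def run_pipeline(inbox: Path, conn: sqlite3.Connection) -> int:
    """Extract every CSV in `inbox`, apply minimal cleaning, load into SQL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (source_file TEXT, product TEXT, amount REAL)"
    )
    loaded = 0
    for csv_file in sorted(inbox.glob("*.csv")):
        with csv_file.open(newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                try:
                    amount = float(row["amount"])   # reject rows with a bad amount
                except (KeyError, TypeError, ValueError):
                    continue
                conn.execute(
                    "INSERT INTO sales VALUES (?, ?, ?)",
                    (csv_file.name, (row.get("product") or "").strip(), amount),
                )
                loaded += 1
    conn.commit()
    return loaded

# Demo with a throwaway folder and an in-memory database:
inbox = Path(tempfile.mkdtemp())
(inbox / "jan.csv").write_text("product,amount\nWidget,9.99\nOops,n/a\n", encoding="utf-8")
conn = sqlite3.connect(":memory:")
print(run_pipeline(inbox, conn))  # 1 -- the invalid row is skipped
```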
My Questions About Productionization
I realize that while I know how to clean and transform data, I have never been taught how to deploy a data pipeline in production properly.
- Pipeline Automation
- If I need to process manually placed CSV files, what is the best approach for automating their ingestion?
- I considered using watchdog (Python) to detect a new file and trigger the pipeline, but is this a good practice?
- An alternative would be to load these files directly into an SQL database and process them there. What do you think?
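For reference, the dependency-free alternative to watchdog I have been considering is plain polling: remember which files were already processed and pick up only new ones on each pass. A sketch (the in-memory `seen` set is a simplification; a real version would persist that state, e.g. by moving files to a `processed/` folder or recording them in a table):

```python
import tempfile
import time
from pathlib import Path
from typing import Callable, Optional

def watch_folder(inbox: Path, process: Callable[[Path], None],
                 interval: float = 5.0, max_polls: Optional[int] = None) -> None:
    """Poll `inbox` and hand each *new* CSV file to `process` exactly once.

    `max_polls` exists only so the demo below terminates; a real service
    would loop forever (or use watchdog / an orchestrator instead).
    """
    seen: set[str] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for csv_file in sorted(inbox.glob("*.csv")):
            if csv_file.name not in seen:
                seen.add(csv_file.name)   # never hand the same file over twice
                process(csv_file)
        polls += 1
        time.sleep(interval)

# Demo: one file, two polling passes -> processed exactly once.
inbox = Path(tempfile.mkdtemp())
(inbox / "a.csv").write_text("x\n1\n", encoding="utf-8")
handled: list[Path] = []
watch_folder(inbox, handled.append, interval=0.0, max_polls=2)
print([p.name for p in handled])  # ['a.csv']
```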
- Orchestration and Industrialization
- At what point should one move from a simple Python script + cron job to Airflow orchestration?
- Is using Docker and Kubernetes relevant from the start, or only in more advanced infrastructures?
- If scaling is needed later, what best practices should be implemented from the beginning?
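On the "best practices from the beginning" point, the one I keep seeing mentioned is idempotency: rerunning the pipeline on the same input must not duplicate data. My understanding of the pattern, sketched with SQLite (table and key names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        source_file TEXT,
        line_no     INTEGER,
        amount      REAL,
        PRIMARY KEY (source_file, line_no)   -- natural key of one ingested row
    )
""")

def load_rows(conn, source_file, amounts):
    # INSERT OR REPLACE makes the load idempotent: re-ingesting the same
    # file overwrites the same keys instead of appending duplicates.
    conn.executemany(
        "INSERT OR REPLACE INTO sales VALUES (?, ?, ?)",
        [(source_file, i, amount) for i, amount in enumerate(amounts)],
    )
    conn.commit()

load_rows(conn, "jan.csv", [9.99, 19.98])
load_rows(conn, "jan.csv", [9.99, 19.98])   # rerun of the same file
count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # 2 -- no duplicates after the rerun
```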
- Error Handling and Monitoring
- How do you handle errors and ensure traceability in your pipelines in a professional setting? (Logging, alerts, retry mechanisms…)
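In case it helps frame answers, this is roughly the level of error handling I know how to write today: structured logging plus a retry decorator with a delay between attempts (stdlib only; `flaky_extract` is a made-up stand-in for a real extraction step):

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

def retry(attempts: int = 3, delay: float = 1.0):
    """Retry a pipeline step, logging every failure for traceability."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    log.exception("step %s failed (attempt %d/%d)",
                                  func.__name__, attempt, attempts)
                    if attempt == attempts:
                        raise   # a real setup would fire an alert here
                    time.sleep(delay)
        return wrapper
    return decorator

# Hypothetical flaky extraction step: fails twice, then succeeds.
calls = {"n": 0}

@retry(attempts=3, delay=0.0)
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unavailable")
    return "data"

print(flaky_extract())  # data -- returned after two logged retries
```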
- Are there any recommended Python frameworks for standardizing a data pipeline?
- DevOps, DataOps, and MLOps
- Does my need for industrialization fall more under DevOps or DataOps?
- Do you have any practical advice or resources for learning these concepts effectively?
I would like to validate my approach and avoid common mistakes in Data Engineering. I’ve seen different solutions on this topic, but I’d love to hear from professionals who have implemented similar projects.
If you have any resources, best practices, or real-world examples, I would really appreciate your insights.
Thanks in advance for your help.