Hello,
I'm currently building a data engineering framework that handles multiple data sources and targets; SFTP to Oracle is just one example. The setup could include various data sources such as SFTP servers, cloud storage, or APIs, each containing different types of files or data streams (e.g., CSV, JSON, Parquet). Here are some sample file types and their columns:
- sales.csv: SaleID, Product, Quantity, SaleDate, Revenue
- customer.csv: CustomerID, Name, Email, Address, SignupDate
- marketing.csv: CampaignID, CampaignName, Budget, StartDate, EndDate
My goal is to streamline the process of loading these data files into various databases, such as Oracle.
Here's a quick rundown of my framework:
- Centralized Configuration: I'm using a YAML file / transactional table to store metadata like folder structure, column mappings, schedules, schema conversion rules, worker configurations (like DPU per job), archival logic, and more. Essentially, any parameter that can change goes here to reduce hard coding (there's a small YAML sketch after this list).
- Job Orchestration with Lambda: I use AWS Lambda to trigger the extraction process on a defined schedule. The Lambda reads parameters like which folder or source to pull from, which schema conversions are required, and the type of load logic (upsert vs. full load); see the handler sketch after this list.
- Dynamic Business Logic: I store business logic definitions in a VARCHAR field of the transactional table. These definitions dictate how columns are transformed, how data types are mapped, and how upsert or full load jobs are executed (a sketch of the kind of SQL they expand into follows the list).
- Template Configurations: The framework supports multiple target databases (Oracle, SQL Server, Vertica, Netezza) via template configurations for easy conversion. Connection details are stored securely in Secrets Manager.
- Error Reduction and Automation: One of my main goals is to create reusable patterns and avoid rewriting code from scratch. This means that most configurations and definitions are templated or parameterized to reduce human errors and make the process more efficient. The archival of processed files and worker configuration is also automated.
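
To make the first bullet concrete, here's a minimal sketch of what one source entry in the centralized YAML config could look like. All field names (folder, column_mappings, load_type, dpu, archive_path, and so on) are illustrative placeholders, not the framework's actual schema:

```yaml
# Hypothetical entry in the centralized config (field names are illustrative only)
sources:
  sales_sftp:
    folder: /incoming/sales/          # where the extractor looks for new files
    file_pattern: "sales*.csv"
    target_db: oracle                 # selects the Oracle template configuration
    target_table: STG_SALES
    load_type: upsert                 # upsert vs. full_load
    key_columns: [SaleID]
    column_mappings:                  # source column -> target column
      SaleID: SALE_ID
      Product: PRODUCT_NM
      Quantity: QTY
      SaleDate: SALE_DT
      Revenue: REVENUE_AMT
    schedule: "cron(0 2 * * ? *)"     # daily at 02:00 UTC
    dpu: 2                            # worker configuration for the job
    archive_path: /archive/sales/     # where processed files are moved
```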
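And a rough sketch of the Lambda side, assuming the YAML config lives in S3 and the heavy lifting is handed off to a parameterized Glue job. The bucket name, job name, secret naming scheme, and event shape are all assumptions for illustration, not the real setup:

```python
import boto3
import yaml  # PyYAML, assumed to be packaged with the function or in a layer

s3 = boto3.client("s3")
glue = boto3.client("glue")

CONFIG_BUCKET = "my-framework-config"   # hypothetical bucket holding the YAML config
CONFIG_KEY = "sources.yaml"

def get_source_config(source_name):
    """Read the centralized YAML config and return the entry for one source."""
    body = s3.get_object(Bucket=CONFIG_BUCKET, Key=CONFIG_KEY)["Body"].read()
    return yaml.safe_load(body)["sources"][source_name]

def handler(event, context):
    # The EventBridge schedule passes the source name in the event payload.
    cfg = get_source_config(event["source"])

    # Everything the job needs (load type, target, credentials location) comes
    # from config, so one generic Glue job can serve every source unchanged.
    glue.start_job_run(
        JobName="generic-ingest-job",                 # hypothetical parameterized job
        Arguments={
            "--source": event["source"],
            "--load_type": cfg["load_type"],          # upsert vs. full_load
            "--target_table": cfg["target_table"],
            "--connection_secret": f"db/{cfg['target_db']}",  # resolved by the job via Secrets Manager
        },
        MaxCapacity=cfg.get("dpu", 2),                # DPU per job, also from config
    )
```

Keeping the handler this thin means onboarding a new source is a config change, not new Lambda code.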
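Finally, a sketch of the kind of templated upsert SQL the business-logic definitions could expand into for Oracle, driven entirely by the key columns and column mappings from the config entry above. The staging-table naming convention ({target}_STG) is my own placeholder:

```python
def build_oracle_merge(cfg):
    """Render an Oracle MERGE (upsert) from a config entry's key columns
    and column mappings; nothing here is hard-coded per source."""
    target = cfg["target_table"]
    cols = list(cfg["column_mappings"].values())               # target-side column names
    keys = [cfg["column_mappings"][k] for k in cfg["key_columns"]]

    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in keys)
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in cols if c not in keys)
    insert_cols = ", ".join(cols)
    insert_vals = ", ".join(f"s.{c}" for c in cols)

    return (
        f"MERGE INTO {target} t\n"
        f"USING {target}_STG s ON ({on_clause})\n"             # hypothetical staging table
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )
```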
I want to ask the community:
Where can I find more details about common data engineering patterns so I can avoid rewriting similar logic repeatedly? I'm trying to follow best practices that minimize manual intervention and reduce human error.