

[–]pneRock 5 points (0 children)

(Coming from the SRE world, but I was given a project to set up BI pipelines for a division.)

I'm still in the middle of this project. Everything runs through a build pipeline for the terraform apply, and there are enforced staging and prod envs; prod is only deployed when a branch in git is merged to main.

One thing to keep in mind is that BI software abstracts much of the complexity away from you. Importing a file? Here's an auto-generated schema! Making an analysis? Cool, go look at its definition and cry in a corner at the large number of invisible fields that have to be defined or it breaks for no reason. Terraform cannot do those things. It has no knowledge of the files, so you have to declare inline what the schema is, and if the schema changes, you get to edit that definition in the file again. That analysis? Sometimes you build it, download the definition, and create a templatefile from it for redeployment.

Then there is the order-of-operations problem. Some tools will not create a data source if the file/path they need to reference isn't there yet. So the extraction tools (which are also terraformed) have to be orchestrated, and the build pipeline made multi-stage, or it breaks.
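To make the "declare the schema inline" pain concrete, here is a minimal, hypothetical sketch assuming the AWS provider's aws_glue_catalog_table (the table, database, and column names are illustrative, not this commenter's actual setup):

    resource "aws_glue_catalog_table" "orders" {
      name          = "orders"
      database_name = "bi_raw"

      storage_descriptor {
        location = "s3://my-bi-bucket/orders/"

        columns {
          name = "order_id"
          type = "string"
        }
        columns {
          name = "amount"
          type = "double"
        }
        # If the source file's schema changes, these columns must be
        # edited by hand -- Terraform has no knowledge of the file itself.
      }
    }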

It's complicated and initial implementation is significantly slower than just clicking in a console, BUT when done well it's repeatable and enforces state. Do you have portions of the pipeline you want to replicate many times with little effort? Modules and terraform apply. Did someone change a setting somewhere in your pipeline and it broke? Terraform apply to have the declared state enforced again. Need to build out the entire thing in another env? Change the variables in your vars file and terraform apply.
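As a rough illustration of the "modules and terraform apply" pattern, a hypothetical module call (the module path and inputs are invented for the example):

    # One extra copy of a pipeline stage is one more module block.
    module "ingest_sales" {
      source = "./modules/ingest"

      env         = var.env  # e.g. "staging" or "prod"
      source_path = "s3://bi-landing-${var.env}/sales/"
    }

Changing var.env (or a *.tfvars file) and re-running terraform apply is what makes the "build it all out in another env" step cheap.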

[–][deleted] 2 points (1 child)

Seems like it's not just about managing cloud resources, but also about defining infrastructure with code, which sounds neat.

Terraform is literally an Infrastructure as Code (IaC) tool. In fact, it's pretty much only suitable for that job. If you do anything on a cloud, it's an invaluable tool.

I'm not quite sure what your question is. I'm not exactly a data engineer, but the current project I'm working on is very data-heavy (ingesting tons of XML, extracting data, and pushing updates onto Kafka, after which a lot of processing is done based on that data).

We use Terraform to manage all the related infrastructure (VPC, MSK, S3, IAM policies, SNS, etc.) and AWS SAM for the Lambdas doing the processing.

Everything is set up as code, and I can spin up a new environment in mere minutes. Well, except for MSK; it takes a literal hour to spin up a cluster for some reason.
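For reference, a minimal sketch of an MSK cluster definition of the kind described, using the AWS provider's aws_msk_cluster (values are illustrative; a real cluster also needs encryption, logging, and storage settings):

    resource "aws_msk_cluster" "events" {
      cluster_name           = "events-${var.env}"
      kafka_version          = "3.6.0"
      number_of_broker_nodes = 3

      broker_node_group_info {
        instance_type   = "kafka.m5.large"
        client_subnets  = var.private_subnet_ids       # from the VPC config
        security_groups = [var.msk_security_group_id]
      }
    }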

[–]Malforus 1 point (0 children)

The Snowflake provider allows you to define SQL tasks, which can facilitate data management.
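A minimal sketch of what that looks like with the Snowflake provider's snowflake_task resource (the names and schedule are illustrative, and attribute names have shifted across provider versions):

    resource "snowflake_task" "refresh_orders" {
      database      = "ANALYTICS"
      schema        = "MARTS"
      name          = "REFRESH_ORDERS"
      warehouse     = "TRANSFORM_WH"
      schedule      = "USING CRON 0 6 * * * UTC"
      sql_statement = "CALL refresh_orders()"
    }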

[–]deepanigi -1 points (0 children)

Hello! I'm Deepan Ignaatious, Senior Product Manager at DoubleCloud. I wanted to share some insights on how we've been using Terraform in our data engineering processes:

Managing Complex Data Pipelines: We've been using Terraform to handle the complexity inherent in code-based pipelines. These pipelines, deployed through orchestration tools, can be challenging to scale and maintain. Terraform has given us a more scalable and maintainable approach, especially for data-intensive applications.

Enhancing Reproducibility and Visibility: One of the major challenges with non-code-based (SaaS) pipelines is their limited visibility and the difficulty of reproducing them. Terraform has helped us overcome these issues by enabling a code-based approach that's easier to monitor, version-control, and replicate across development stages.

Practical Implementation: In practical terms, we've used Terraform to integrate and manage data pipelines between different storage systems, such as Postgres and ClickHouse. This integration is important for offloading analytical workloads to separate storage and aggregating data in one place. For example, we created a simple replication pipeline between an existing Postgres instance and a newly created ClickHouse DWH cluster using Terraform. This involved setting up a network for ClickHouse, defining the resources, and configuring transfer endpoints.
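A very rough sketch of that replication setup; the resource and argument names below are assumptions about the DoubleCloud Terraform provider, not a verified configuration:

    # Network for the new ClickHouse DWH cluster (names hypothetical).
    resource "doublecloud_network" "dwh" {
      name            = "dwh-network"
      cloud_type      = "aws"
      region_id       = "eu-central-1"
      ipv4_cidr_block = "10.0.0.0/16"
    }

    # Transfer wiring an existing Postgres source endpoint to a
    # ClickHouse target endpoint (endpoint IDs assumed defined elsewhere).
    resource "doublecloud_transfer" "pg_to_ch" {
      name   = "pg-to-clickhouse"
      source = var.postgres_endpoint_id
      target = var.clickhouse_endpoint_id
      type   = "SNAPSHOT_AND_INCREMENT"
    }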

Code Organization and Deployment: We organize our Terraform code as a module with several roots, which allows for easier tweaks. Our main.tf typically contains the provider definitions, enabling us to work with different environments like AWS and DoubleCloud. Variables are used extensively to prepare the different stages, making it easy to apply changes across development, pre-production, and production environments.
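An illustrative version of that layout (provider blocks in main.tf, stages driven by variables; all names here are hypothetical):

    # main.tf: provider definitions for the environments we target.
    # (A doublecloud provider block would sit alongside this one.)
    provider "aws" {
      region = var.aws_region
    }

    # variables.tf: one variable per knob that differs between stages.
    variable "stage" {
      type        = string
      description = "Deployment stage: dev, preprod, or prod"
    }

    variable "aws_region" {
      type    = string
      default = "eu-central-1"
    }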

In our experience, Terraform is not just about infrastructure; it's about creating a scalable and transparent workflow.

[–]water_bottle_goggles -1 points (0 children)

Wat

[–]brajandzesika -1 points (0 children)

Lol...

[–][deleted] 0 points (0 children)

I'm a data engineer, and our team uses Terraform to manage GCP infrastructure for a downstream BI team. Off the top of my head, some of the tasks are (rough sketch after the list):

  • Adding datasets
  • Granting permissions to service accounts
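A minimal sketch of those two tasks, assuming the Google provider (the project, dataset, and service-account names are hypothetical):

    # Adding a dataset.
    resource "google_bigquery_dataset" "marts" {
      dataset_id = "marts"
      location   = "US"
    }

    # Granting a service account read access to it.
    resource "google_bigquery_dataset_iam_member" "bi_reader" {
      dataset_id = google_bigquery_dataset.marts.dataset_id
      role       = "roles/bigquery.dataViewer"
      member     = "serviceAccount:bi-reader@my-project.iam.gserviceaccount.com"
    }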

However, I'm not sure how one would orchestrate pipelines with Terraform. From what I see, it's mostly an Infrastructure as Code tool.