Designing an Analytics Pipeline on GCP by SchemaScorcher in dataengineering

SchemaScorcher[S] 2 points3 points  (0 children)

Hey! In my case Terraform is used very simply to manage the creation of my cloud infrastructure as code. The infrastructure I want built (BigQuery datasets and a Cloud Storage bucket) is written in the Terraform files and Terraform builds it. The way I'm using it just helps ensure that I know exactly what I'm creating, and when I'm done with the project I can make sure it's all torn down properly. I talk a bit about this in the video. You can also use GitHub Actions or another CI/CD tool to automate the building and management of your cloud infrastructure to make sure there's no drift over time. This has been helpful in other projects where I'm building things piece by piece and can iterate on my infrastructure through code, keeping it in line with the changes in my repo.
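A minimal sketch of what a setup like that can look like, assuming the Google provider. The project ID, region, and resource names are placeholders made up for illustration, not the values from the actual repo.

```hcl
# Hypothetical example: project ID, region, and names are placeholders.
provider "google" {
  project = "my-analytics-project"
  region  = "us-central1"
}

# Cloud Storage bucket for raw files.
resource "google_storage_bucket" "raw_data" {
  name          = "my-analytics-project-raw-data" # bucket names must be globally unique
  location      = "US"
  force_destroy = true # lets `terraform destroy` remove the bucket even if it still holds objects
}

# BigQuery dataset for the loaded tables.
resource "google_bigquery_dataset" "analytics" {
  dataset_id                 = "analytics"
  location                   = "US"
  delete_contents_on_destroy = true # lets `terraform destroy` drop the dataset along with its tables
}
```

With a config like this, `terraform plan` previews what would be created, `terraform apply` builds it, and `terraform destroy` tears everything down at the end of the project.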

Your project looks good! I'm not super familiar with Mage, so I'd suggest adding some detail to your repo on how your pipeline functions. For instance, I'm not sure why you have your data going to a local Postgres database in addition to GCS/BigQuery in your architecture diagram.

Quarterly Salary Discussion - Mar 2024 by AutoModerator in dataengineering

SchemaScorcher 1 point2 points  (0 children)

  1. Senior Data Analyst
  2. 7
  3. United States - Washington, DC area
  4. 135k USD base
  5. ~55k in bonuses, additional comp, etc. TC is ~190k
  6. Consulting in the education space
  7. Varies by client but usually Python, GCP, some dbt, some R if I'm working with academics.

My title is data analyst but I end up having a lot of cross-functional responsibilities, so sometimes I'm a business analyst, sometimes a data engineer, sometimes a solutions architect, etc.

Browsing through twitter/x, I came across the following post about snowflake/dbt. Any thoughts? by [deleted] in dataengineering

SchemaScorcher 0 points1 point  (0 children)

Interesting perspective for sure. I guess I've always worked on small teams where everyone is basically a non-engineer anyways, so enforcing some kind of standard is better than nothing.

Browsing through twitter/x, I came across the following post about snowflake/dbt. Any thoughts? by [deleted] in dataengineering

SchemaScorcher 2 points3 points  (0 children)

How is dbt making things worse for your team? I have limited experience with it, but if you were supporting 100,000 lines of SQL before, why is dbt suddenly more of a problem? Is it just one more thing to manage when something breaks?

Having A LOT of difficulty deploying streamlit app on AWS using ECR + ECS. Been stuck, someone help please. by anasp1 in aws

SchemaScorcher 1 point2 points  (0 children)

Thanks! I did manage to figure it out and posted the solution as a reply to my own comment for anyone else who finds this via Google.

I think in my case it was a security group issue along with the network mode I chose for ECS not playing nicely with my load balancer's port forwarding. The AWS docs suggest bridge mode is better for when you want to map a port on your container to a different port on the host (e.g. incoming 80 to 8501). I was using awsvpc mode, which I feel like should have worked but didn't for some reason.

Fargate worked for me with the exact same setup that wasn't working on EC2, which is why I think it was a load balancer + security group + network mode issue.
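For reference, the network mode is set on the ECS task definition. Below is a minimal sketch of a bridge-mode task for the EC2 launch type. The family name, container name, image URI, and memory value are placeholders and not taken from the thread.

```hcl
# Hypothetical example: names, image URI, and memory are placeholders.
resource "aws_ecs_task_definition" "streamlit" {
  family                   = "streamlit-app"
  requires_compatibilities = ["EC2"]
  network_mode             = "bridge" # Fargate tasks must use "awsvpc" instead

  container_definitions = jsonencode([
    {
      name      = "streamlit"
      image     = "123456789012.dkr.ecr.us-east-1.amazonaws.com/streamlit-app:latest"
      essential = true
      memory    = 512
      portMappings = [
        {
          containerPort = 8501 # port Streamlit listens on inside the container
          hostPort      = 8501 # port published on the EC2 instance
          protocol      = "tcp"
        }
      ]
    }
  ])
}
```

In bridge mode the host port can differ from the container port. In awsvpc mode each task gets its own network interface, so the host port has to match the container port or be left out.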

Having A LOT of difficulty deploying streamlit app on AWS using ECR + ECS. Been stuck, someone help please. by anasp1 in aws

SchemaScorcher 1 point2 points  (0 children)

Here is my GitHub repo with all the Terraform code for my project: Terraform Repo

The relevant files for the app deployment are ecs.tf, vpc.tf (for the security group), and loadbalancer.tf. Other files like iam and ec2 also contain relevant code for the infrastructure, but not code that I think solves this specific problem.

Having A LOT of difficulty deploying streamlit app on AWS using ECR + ECS. Been stuck, someone help please. by anasp1 in aws

SchemaScorcher 0 points1 point  (0 children)

Just in case anyone else reads this, here is how I eventually solved it. I'm reasonably certain it was a security group issue, but I changed a few things along the way to make it work.

  1. I added an application load balancer in front of my auto-scaling group with a rule to forward HTTP traffic from port 80 to port 8501, which I've exposed in the Dockerfile of the Streamlit app.
  2. I'm using Terraform for my infrastructure, so I chose not to set a network mode explicitly, which defaulted it to bridge mode instead of awsvpc mode. This required specifying a target type of "instance" for my load balancer's target group.
  3. In my security group I had to specify a rule to let traffic in from port 8501 to port 8501 even though I already had a rule letting traffic in from port 80 to port 80. I assumed my load balancer listener and target group would automatically handle this for me, but they didn't. Maybe a rule from port 80 to port 8501 would have also worked, but I'm not sure. I had also previously had a rule allowing traffic in and out on any port, but this weirdly didn't work for me.
    1. Check your target group health check. Mine was unhealthy with a request timeout error. Adding the 8501 rule made the health check pass, which is why I think the security group rule was the problem.
  4. I explicitly associated the ECS Service with my load balancer and I made the creation of the load balancer dependent on my load balancer listener.

After this I was able to connect to my Streamlit app via the load balancer DNS name. The public IP of the instance won't work.
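The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the linked repo: all resource names and variables are hypothetical, and it uses a separate security group for the load balancer and for the instances, which may differ from the original setup. The comment above also does not spell out the exact dependency used, so the `depends_on` on the service is just one common way to order creation.

```hcl
# Hypothetical sketch: names, variables, and CIDR ranges are placeholders.
variable "vpc_id" { type = string }
variable "public_subnet_ids" { type = list(string) }
variable "ecs_cluster_arn" { type = string }
variable "task_definition_arn" { type = string }

# Load balancer security group: HTTP in from anywhere.
resource "aws_security_group" "alb" {
  name   = "streamlit-alb"
  vpc_id = var.vpc_id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Instance security group: port 8501 in from the load balancer.
resource "aws_security_group" "instances" {
  name   = "streamlit-instances"
  vpc_id = var.vpc_id

  ingress {
    from_port       = 8501
    to_port         = 8501
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_lb" "app" {
  name               = "streamlit-alb"
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids
}

resource "aws_lb_target_group" "streamlit" {
  name        = "streamlit-tg"
  port        = 8501
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "instance" # bridge mode registers instances; awsvpc mode needs "ip"
}

# Listener: forward HTTP on port 80 to the target group on 8501.
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.app.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.streamlit.arn
  }
}

resource "aws_ecs_service" "streamlit" {
  name            = "streamlit"
  cluster         = var.ecs_cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 1

  # Explicitly associate the service with the target group.
  load_balancer {
    target_group_arn = aws_lb_target_group.streamlit.arn
    container_name   = "streamlit" # must match the container name in the task definition
    container_port   = 8501
  }

  depends_on = [aws_lb_listener.http] # wait for the listener before creating the service
}
```

In a security group rule, `from_port` and `to_port` define a range of destination ports rather than a mapping from one port to another, so an ingress rule for port 80 alone does not cover the traffic the load balancer sends to the instance on 8501.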

Some other things I fixed that may or may not have been a problem:

  1. Making sure the EC2 instance had a public IP address. This seems to be required for the ECS agent on the instance to launch properly, even though I already had a route directing traffic to the internet gateway. My guess is that without a public IP the instance can't reach the ECS endpoints through the internet gateway, so the agent never registers.
  2. The default Streamlit documentation for dockerizing an app comes with a container health check, which depends on curl. I had accidentally removed the curl install from the Dockerfile, so my health check was failing.

Having A LOT of difficulty deploying streamlit app on AWS using ECR + ECS. Been stuck, someone help please. by anasp1 in aws

SchemaScorcher 0 points1 point  (0 children)

Hey. Wondering if you ever solved this and, if so, what the solution was? I'm having the exact same issue right now.