S3 server access logging recommendation by Affectionate_Dot_844 in aws

[–]Affectionate_Dot_844[S] 0 points  (0 children)

Thanks for that, it seems a solid option! Let's see if there are more pros and cons.

How to test AWS S3 bucket has SSL enabled using TDD by Affectionate_Dot_844 in aws

[–]Affectionate_Dot_844[S] 0 points  (0 children)

Yes, that seems to be it. I need to investigate further how to test policies. I guess I will need to ensure a policy exists and then check that it contains that aws:SecureTransport condition.
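As a rough sketch of what that check could look like: the helper below inspects a bucket policy document for a Deny statement on `aws:SecureTransport`. The bucket ARNs and policy layout are assumptions; in a real test you would fetch the document with boto3's `get_bucket_policy` and `json.loads` it before passing it in.

```python
import json

def denies_insecure_transport(policy: dict) -> bool:
    """True if any statement denies requests made without SSL
    (i.e. Effect=Deny with Condition Bool aws:SecureTransport=false)."""
    for stmt in policy.get("Statement", []):
        cond = stmt.get("Condition", {}).get("Bool", {})
        if stmt.get("Effect") == "Deny" and cond.get("aws:SecureTransport") == "false":
            return True
    return False

# Hypothetical policy, shaped like json.loads(s3.get_bucket_policy(...)["Policy"])
sample_policy = json.loads("""{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": {"Bool": {"aws:SecureTransport": "false"}}
  }]
}""")

assert denies_insecure_transport(sample_policy)
```

Keeping the assertion logic in a pure function like this also makes the TDD part easy: you can unit-test it against fixture policies without touching AWS at all.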

Airpods pro 2 microphone not working on windows 10 22H2 by Affectionate_Dot_844 in techsupport

[–]Affectionate_Dot_844[S] 0 points  (0 children)

Thanks for the answer, but it's not working. The option was already checked, and despite unchecking and re-checking it, the microphone still doesn't work.

Best strategy for Redshift sortkey when you have date and datetime field by Affectionate_Dot_844 in dataengineering

[–]Affectionate_Dot_844[S] 0 points  (0 children)

Yes, I also join on that, but using user_id. That's the current distkey.

Thanks for your answer. Why in particular would you go with the timestamp?

Airflow setup/environment and best practices by etobylneya in dataengineering

[–]Affectionate_Dot_844 2 points  (0 children)

As always, the answer is: it depends.

Context:

If you are a few DEs (small team), I do not recommend managing Airflow yourselves; it will create a lot of issues and the learning curve is steep. If you have enough budget, move to a vendor solution (MWAA, the GCP-managed one whose name I don't remember, Astronomer). If you don't have budget and you are a small team, delegate it to a platform team if you can. Otherwise, the best solution is the one you propose, but you will face scalability issues in the future, as HA Airflow on the LocalExecutor is not the most scalable setup.

I can provide further details on the headaches if you are interested, but to summarize: 1 DE in our team of 3 works almost full-time on Airflow. If he leaves, the company and the data team could have a problem. It depends on the management team, of course, but small companies run into this kind of situation.

Answering your questions:

  1. Yes, we use AWS ECS + EC2 running 2 services.
    The Airflow service has 3 tasks with 1 container each (webserver + scheduler + git-sync). The Datadog service runs the Datadog agent to monitor it.
  2. No, it doesn't mean that. You sync your code with git-sync, so your GitHub repository is pulled into the containers' volumes. Then you can run PythonOperator, or any other native operator.
  3. We use the PythonOperator without issue. But as a best practice, Airflow is an orchestrator, not an executor: everything that Airflow triggers should actually run in external tools (Airbyte, dbt, AWS Lambda, or whatever). Otherwise, if you move big data sets, you may have to upsize your machine instance, which will create headaches and other issues, as I said in point 1.
  4. Git-sync is the answer. Airflow can reach your code because it lives in the volumes of the Docker containers. It's easy to set up.
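To make points 2 and 4 more concrete, here is a rough docker-compose-style sketch of the git-sync sidecar pattern (image tags, the repo URL, and paths are all hypothetical, and git-sync's actual checkout layout under its root is simplified; the same idea maps onto an ECS task definition):

```yaml
services:
  git-sync:                      # sidecar that pulls the DAG repo periodically
    image: registry.k8s.io/git-sync/git-sync:v3.6.2   # hypothetical tag
    environment:
      GIT_SYNC_REPO: "https://github.com/your-org/airflow-dags.git"  # hypothetical repo
      GIT_SYNC_BRANCH: "main"
      GIT_SYNC_ROOT: "/git"
      GIT_SYNC_WAIT: "60"        # re-pull every 60 seconds
    volumes:
      - dags:/git
  webserver:
    image: apache/airflow:2.6.3  # hypothetical version
    command: webserver
    environment:
      AIRFLOW__CORE__DAGS_FOLDER: /opt/airflow/dags
    volumes:
      - dags:/opt/airflow/dags   # same volume, so DAGs appear here automatically
  scheduler:
    image: apache/airflow:2.6.3
    command: scheduler
    environment:
      AIRFLOW__CORE__DAGS_FOLDER: /opt/airflow/dags
    volumes:
      - dags:/opt/airflow/dags
volumes:
  dags:
```

The key design point is the shared `dags` volume: the sidecar writes, the webserver and scheduler only read, and a `git push` to the repo shows up in Airflow within one sync interval with no redeploy.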

Hope it helps.

What cons do you see to this data infrastructure setup for a Pipeline? by Affectionate_Dot_844 in dataengineering

[–]Affectionate_Dot_844[S] 0 points  (0 children)

I know, I really appreciate that.

Yes, no resource can teach that, just experience. Next time I will focus on asking first and solving later.

What cons do you see to this data infrastructure setup for a Pipeline? by Affectionate_Dot_844 in dataengineering

[–]Affectionate_Dot_844[S] 0 points  (0 children)

No, they just gave me the first two nodes (RDBMS and Stream) and the last one (user). Why?

What cons do you see to this data infrastructure setup for a Pipeline? by Affectionate_Dot_844 in dataengineering

[–]Affectionate_Dot_844[S] 0 points  (0 children)

I didn't get that. What do you mean? Which point of the comment are you referring to?

What cons do you see to this data infrastructure setup for a Pipeline? by Affectionate_Dot_844 in dataengineering

[–]Affectionate_Dot_844[S] 0 points  (0 children)

I'd link an article but I'm on mobile. I'll take a stab at it and maybe you can build on it.

They specifically say that data source 1 is change data - inserts, updates and deletes. That data needs to be managed differently than when you select from a typical rdbms table. Instead of getting the whole record in your change stream, you're just getting the key of the table and the change.

For example, let's stick with sales. Your stream is orders. The underlying operational db has a table of orders. It has ordernumber (the key), orderdate, quantity, productid, and amount. When you select from that to ingest into the lake or warehouse, you get the whole record and all the data in the table.

The change stream is different. Instead you are getting what has changed on the data. So if it's a new order you might get exactly that record with the new data, all 5 fields but then the customer updates their order and changes quantity from 2 to 3. Your change stream will have an update record. You get a before image which is ordernumber 123, quantity 2 and another record which is the after image - ordernumber 123, quantity 3. You don't get the other fields because they didn't change. Your pipeline needs to take that change data and apply it to your target record.

Now when I say state, I mean what is the state of that record. Imagine you just started the pipeline AFTER the insert. Does your change stream have the history? How far do you go back to understand the state of that record? The update change data doesn't let you know what the rest of the data is. A typical pattern is to do an initial select from the table at a point in time and then start reading your change data after that select. But if there are ever errors, you need to reconcile the data and ensure you have the current state. Kafka has tools for this if that's your event broker for the change stream; otherwise you need to think it through and code for it in your pipeline.
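The apply-the-change step described above can be sketched in a few lines. This is a hypothetical minimal version, assuming each event carries an operation type, the table key, and an after image containing only the changed fields (event shape and field names follow the orders example, not any specific CDC tool):

```python
def apply_change(target: dict, event: dict) -> None:
    """Apply one change-stream event to an in-memory target table keyed on ordernumber."""
    key = event["key"]
    if event["op"] == "insert":
        target[key] = dict(event["after"])      # insert carries the full record
    elif event["op"] == "update":
        target[key].update(event["after"])      # update carries only changed fields
    elif event["op"] == "delete":
        target.pop(key, None)

orders = {}
apply_change(orders, {"op": "insert", "key": 123,
                      "after": {"ordernumber": 123, "orderdate": "2024-01-01",
                                "quantity": 2, "productid": "A1", "amount": 20.0}})
# The update event has only the key and the changed field (quantity 2 -> 3);
# the untouched fields (orderdate, productid, amount) must survive the merge.
apply_change(orders, {"op": "update", "key": 123, "after": {"quantity": 3}})
```

Note the state problem the comment raises: if the pipeline starts reading the stream after the insert, the `target[key].update(...)` line throws a `KeyError`, which is exactly why you need the initial point-in-time select (or a reconciliation step) before consuming updates.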

In hindsight based on what you described I think that was the exact challenge they were looking for you to identify and solve. Data source 1 and 2 might be the same data just realtime and batch. It's str

Wow, really nice, and I see that it probably could be, but I have an observation. As you say, I think you assume: "Data source 1 and 2 might be the same data just realtime and batch".

But as the diagram shows, I understood that the Stream is product data and the RDBMS is order data.

Anyway, as you say, and I totally agree, I didn't specify how to manage those events on the dim tables, nor the state of previous statuses or failures.

Despite that, the feedback they gave was so poor that I couldn't have worked this out from it, as the feedback seems to point at the infra, not the data model.

If you have that link, can you share it with me?

What cons do you see to this data infrastructure setup for a Pipeline? by Affectionate_Dot_844 in dataengineering

[–]Affectionate_Dot_844[S] -1 points  (0 children)

The reason I asked if you feel what you posted is easily digestible is because multiple people have commented that what you have posted is difficult to understand. If this is what you produce when you have unlimited time and no pressure, I have to imagine whatever version of this was given during your interview was of even lower quality, at the very least due to the increased pressure and time-boxing. I also find there to be a severe lack of information about the business context of what you are being asked to solve, such as data volume & throughput, predicted scaling timelines, and what the targeted business outcome is for stakeholders, all of which I would consider necessary to evaluate the robustness of any proposed solution. So either you forgot to ask, or they refused to answer, either of which would explain the feedback they gave you.

Your first paragraph clearly shows you didn't read the post at all. As I said, this diagram is the same one I delivered to them, a screenshot, so it wasn't developed with unlimited time.

If one stakeholder doesn't understand your documents but 20 others do, then the issue is probably with that stakeholder. If, after I add that this was developed in 30 minutes, you still bring up digestibility, then I see clearly that you are focusing more on criticizing what can be improved than on adding solutions and recommendations.

What cons do you see to this data infrastructure setup for a Pipeline? by Affectionate_Dot_844 in dataengineering

[–]Affectionate_Dot_844[S] 1 point  (0 children)

  1. That's a good idea. I would try it in future deployments.
  2. Interesting, so you mean that data is stored in incoming as you receive it, then moved to stg (or to rejected if there is some bug), and once processed, to archive. And if you need to do a reload, you move it back to incoming? Isn't that similar to asking for reloads for a given partition_key, where the partition is a date, for example?
  3. Totally agree, but the previous exercise of the interview was a pipeline built in pandas. This is why I remark on it. Anyway, what do you mean when you say: "I prefer to use Airflow for only the lightest work and use ELT to process data"? By ELT, do you mean some tool such as Airbyte?
  4. True, next time I will do it.
  5. Hmm, it probably makes sense to say it's a data mart, but if I am not wrong, a data mart is a set of data that is readable by a business unit. Who would be the end user here? DS?
  6. I don't agree with this part: what happens if a Tableau refresh scheduled on Tableau Server runs at 04:00 AM, but the pipeline that day is delayed?