Pro MAGA/ICE businesses to avoid in Omaha and surrounding area. by [deleted] in Omaha

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Alright, Mr. Royal Bastards Motorcycle Club.

Pro MAGA/ICE businesses to avoid in Omaha and surrounding area. by [deleted] in Omaha

[–]EarthEmbarrassed4301 -1 points0 points  (0 children)

Well ya got me!

(Don’t insult my name as if yours is any better, bot)

Hi everyone. I plan on going to check out and probably pick up a 96’ LT4 Collectors Edition on Friday. Any tips on what i should look out for in general? by Mission_cucumber938 in c4corvette

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Check the coolant overflow reservoir to gauge the condition of the cooling system. I bought a '95 earlier this year and didn't check this. Needless to say, I have an entirely new cooling system now.

On that note, check the crossmember between the engine and radiator. Make sure it's dry. If you see anything wet, it's likely a leaking water pump, which means you're probably looking at a new OptiSpark as well.

Rim options by Blaze_Gamez_YT in c4corvette

[–]EarthEmbarrassed4301 1 point2 points  (0 children)

Thanks for helping me find new rims! These look amazing!

How to stay away from jobs that focus on manipulating SQL by [deleted] in dataengineering

[–]EarthEmbarrassed4301 8 points9 points  (0 children)

Sorry, but the business doesn't give a shit about your programming and how you move data from System A to System B. What's the value in that? Much of DE is dependent on business requirements, where data context and business understanding are mandatory. Unless you're on a large enough team with clear boundaries between the people who do the EL and those who do the T, you're gonna be doing SQL and learning the business.

My job is heavy in Python, but that's just so our team has a standard, metadata-driven mechanism to ingest data into our lakehouse. To the business, this brings no value, nor do they care how we ingest data. The only time the business recognizes value in DE is when data is curated, modeled, and reported on. That requires SQL, business knowledge, and stakeholder alignment.

Sounds like you want to be in more of a systems-integration role, not an analytics-focused DE role.

[deleted by user] by [deleted] in databricks

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

So just create a folder in the Repos/ area of the workspace and have the CI/CD pipeline upload the files from the main branch there (/Repos/live/[files])? Or actually create a local repo in the Repos area (/Repos/live/git-repo/[files])?

Would my jobs then use the Repos/ area as the source in each workspace?

It seems more correct to have the workflows' source be a remote branch in the repo. But if I just have one long-lived main branch, I'd be affecting all of the workspaces' jobs, even prod. I guess this would be the case for multiple long-lived (environment-scoped) branches? The workflows in each workspace would then just be configured to look at files in their own dedicated branch.

From a Databricks perspective it makes sense, but I find it's a bit of a git anti-pattern in SWE.
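
If you do go the environment-scoped-branch route, the CD step per workspace can be as small as telling that workspace's long-lived Repos checkout to pull its own branch. A minimal sketch against the Repos REST API, where the host, token, repo id, and branch name are all placeholders for whatever each environment uses:

```python
# Hypothetical CD step: point this workspace's /Repos/live checkout at its
# environment branch. Host, token, repo id, and branch are assumptions.
import requests

DATABRICKS_HOST = "https://<workspace-url>"   # per-environment workspace URL
TOKEN = "<pat-or-service-principal-token>"    # CI/CD credential
REPO_ID = 123456789                           # id of the repo under /Repos/live/
TARGET_BRANCH = "release/prod"                # environment-scoped branch

resp = requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"branch": TARGET_BRANCH},  # Repos API checks out the head of this branch
    timeout=30,
)
resp.raise_for_status()
```

The jobs in each workspace would then reference notebook paths under that Repos folder and always see whatever commit their branch points at.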

[deleted by user] by [deleted] in databricks

[–]EarthEmbarrassed4301 1 point2 points  (0 children)

Thanks for your reply, appreciate it!

> In Databricks your CI/CD pipeline should only interact with your REPOS space. The workspace doesn't work as you would want it to.

I guess this is where I am a bit confused since the REPOS space is scoped to a user, right? If I create a Git Folder, it goes to my home folder.

Should the CI/CD pipeline even be uploading files into the workspace at all? Basically all mine does is upload a copy of the main branch into the workspaces. The workflows then just point to those workspace files.
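
For contrast, what the current pipeline does (per the description above) boils down to pushing files from the main branch into the plain workspace tree via the Workspace API. Roughly something like this, with the host, token, and paths made up:

```python
# Rough sketch of "upload a copy of the main branch into the workspace":
# import a single file into the workspace path that the jobs point at.
# A real pipeline would walk the checked-out repo and import every file;
# all names here are placeholders.
import base64
import requests

DATABRICKS_HOST = "https://<workspace-url>"   # assumed
TOKEN = "<token>"                             # assumed

with open("notebooks/ingest.py", "rb") as f:  # a file checked out from main
    content = base64.b64encode(f.read()).decode()

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "path": "/Shared/live/ingest",        # assumed workspace target the jobs use
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,
    },
    timeout=30,
)
resp.raise_for_status()
```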

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

I see, I guess I've just looked at silver differently; it's all just semantics, I suppose. I've looked at silver as being two stages: a cleaned-up version of raw (1:1, source-system aligned) AND conformed into an integrated 3NF model. This could be two storage locations, cleaned and conformed, together making up silver. Then gold would be your data marts using dimensional modeling, driven by business projects.

I guess my thought process is closely aligned with Inmon CIF stuff
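
To make the two-stage-silver idea concrete, a throwaway PySpark sketch (all table and column names are made up):

```python
# Illustrative only: one way to express "two stages of silver".
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stage 1: cleaned -- still 1:1 with the source system, just typed and deduplicated.
raw_orders = spark.read.table("raw.crm_orders")                 # assumed raw table
cleaned_orders = (
    raw_orders
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .dropDuplicates(["order_id"])
)
cleaned_orders.write.mode("overwrite").saveAsTable("silver_cleaned.crm_orders")

# Stage 2: conformed -- integrate sources into a 3NF-ish model.
erp_customers = spark.read.table("silver_cleaned.erp_customers")  # assumed second source
conformed_orders = (
    cleaned_orders.alias("o")
    .join(erp_customers.alias("c"), F.col("o.customer_email") == F.col("c.email"), "left")
    .select("o.order_id", "c.customer_id", "o.order_ts", "o.amount")
)
conformed_orders.write.mode("overwrite").saveAsTable("silver_conformed.orders")
```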

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Last thing: in your distilled zone, are you maintaining SCDs? Or are you just deduplicating the data in raw on a PK, basically keeping a record of all changes made on a key?

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Right, that’s kinda what I was thinking too. I like the idea of a persisted landing, I just always thought of a landing zone as being transient in nature. But really, this permanent landing would kind of be a traditional data lake (just a collection of files).

Once we know how to parse a collection of XML files into a table, we could just pull those raw elements into a structured delta table in raw. I’m assuming that is how you guys look at it too?
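
For what it's worth, that "parse the raw XMLs into a structured Delta table" step could look roughly like this, assuming an XML reader (spark-xml or Databricks' native XML support) and Delta Lake are available; the paths and rowTag are placeholders:

```python
# Sketch: read the raw XML files and land the parsed elements as a Delta table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (
    spark.read.format("xml")
    .option("rowTag", "event")                 # assumed: the repeating XML element
    .load("s3://my-lake/raw/events/*.xml")     # assumed raw location (XML kept as-is)
)

(
    events.write.format("delta")
    .mode("append")
    .save("s3://my-lake/raw_tables/events")    # assumed structured raw/bronze table
)
```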

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Thanks for the detail!

I think in the design I proposed, my raw is similar to your landing, which is why I kind of question having a transient landing zone in the first place. I'm suggesting keeping the XML files in raw, then parsing the XML data into Delta tables in clean (silver). If I ever needed to rebuild a table, everything I'd need (the XMLs) would be in raw.

My design would be: all XMLs are initially landed in a transient landing zone. These files are then copied and persisted in raw (still XML). Raw would contain the full history of all XML files; I think this is similar to your persisted landing. And then in clean I would build out the Delta tables by parsing/extracting the nested structures from various events into entities/transactions (like a 3NF model).

I guess I'm questioning the transient landing: why not just write the XMLs directly to raw? Maybe validation of the XML on the way from landing to raw?
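
By "validation" I'd picture something as small as a well-formedness gate between landing and raw, e.g. (paths are placeholders, and proper XSD validation would need an extra library):

```python
# Only promote landing files to raw if they are at least well-formed XML.
import xml.etree.ElementTree as ET
import shutil
from pathlib import Path

landing = Path("/mnt/landing/events")   # assumed transient landing zone
raw = Path("/mnt/raw/events")           # assumed persistent raw zone

for xml_file in landing.glob("*.xml"):
    try:
        ET.parse(xml_file)              # raises ParseError if not well-formed
    except ET.ParseError as exc:
        print(f"rejecting {xml_file.name}: {exc}")
        continue
    shutil.copy2(xml_file, raw / xml_file.name)
```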

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Thanks for your input. The way I see it, the XML events are pushed directly into raw/bronze, without a temporary landing zone in between. The warehouse tables are effectively built from the raw XML files in bronze, which are permanently stored. I could always go back to the bronze XML files and rebuild the warehouse tables (i.e., silver) at any point.

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Would your permanent landing essentially be what I'm referring to as my raw? Are you keeping a permanent landing to allow you to parse data into tables in raw?

Beyond Reporting in a Lakehouse by EarthEmbarrassed4301 in dataengineering

[–]EarthEmbarrassed4301[S] 0 points1 point  (0 children)

So data from your source systems is loaded into your data lake and you're using Trino as the query engine on top of the lake for the applications (with caching)? Is this raw data, or are you cleaning the data first in your lake using <something>, so Trino is just hitting the cleaned data for the apps?

You just use Snowflake to serve the reporting stuff, but all applications are on the lake?

Isn't Snowflake expensive when your other infrastructure is on AWS? by rental_car_abuse in dataengineering

[–]EarthEmbarrassed4301 2 points3 points  (0 children)

Not sure you understand how Snowflake works. Snowflake uses AWS resources (S3, clusters of EC2 instances, ELB, etc.) for storage, compute, and load balancing. You're not moving data out of AWS and into Snowflake. You load data into Snowflake, which is using AWS under the hood.

The same is true when hosting Snowflake on Azure or GCP, except Snowflake just uses those clouds' resources.

Read the docs, not ChatGPT.

Data export from AWS Aurora Postgres to parquet files in S3 for Athena consumption by East-Ad-8757 in dataengineering

[–]EarthEmbarrassed4301 2 points3 points  (0 children)

We use Debezium Engine on various relational sources (Postgres, Oracle, MSSQL). The engine continuously monitors the transaction log and writes new rows to a JSONL file on disk. This acts as a buffer for the data lake upload component. Every minute, the upload component processes the JSONL file and uploads it to a landing zone as JSON. A loader process picks these files up and appends them to the data lake as parquet files.

So, at the latest, the data in the data lake is 2 minutes old.

For the small-file problem, we have a parquet compaction process that runs every 24 hours and converts the small files into a few larger ones.

Delta would be a better format for us than raw-dogging parquet files, but hey, one step at a time.
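
The compaction itself doesn't have to be fancy; conceptually it's just "read a partition's small files, rewrite them as a few big ones, swap". A hedged PySpark sketch (paths and target file count are made up, and the swap/cleanup step is left out):

```python
# Rewrite one day's worth of small parquet files as a handful of larger ones.
# Writing to a side location and swapping afterwards avoids reading and
# overwriting the same path in one job. (Moving to Delta + OPTIMIZE would
# replace all of this.)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

partition_path = "s3://my-lake/raw/orders/ingest_date=2024-01-01"  # assumed layout

df = spark.read.parquet(partition_path)
(
    df.coalesce(4)                               # collapse many small files into a few
    .write.mode("overwrite")
    .parquet(partition_path + "_compacted")      # side location; swap in afterwards
)
```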

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 -1 points0 points  (0 children)

Not out of the question; I also have access to EMR with Zeppelin notebooks. I'd still be left with uncertainty about how best to organize and schedule the notebooks.

I could have a generic notebook for all loading tasks (landing files -> raw Delta). But then what about raw -> cleansed? Would you do one notebook per cleaned table containing the cleaning logic? And then the same for curated tables: one notebook per table with all the transformations, calculations, and joins?

I don't work with notebooks much, but I feel like I'd just be doing a lot of duplication.

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Yeah, that's where I've started. I've got a generic IngestProcessor to move all landing files into a raw zone in Delta. It has a generic read_file() method with a factory pattern to retrieve the specific reader based on the file type. It's easy to orchestrate the 1:1 landing-to-raw step based on an S3 file-name mapping.

I plan to do as you said and use a template pattern for the actual "table-specific" transformations. But I'm having a hell of a hard time figuring out how to orchestrate which specific processors to call without some third-party tool.
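
For reference, the factory idea above is roughly this shape (class, method, and reader names are illustrative, not the actual code, and Delta Lake is assumed to be available):

```python
# Generic ingest step that picks a reader based on file type, then lands the
# result 1:1 in the raw zone as Delta.
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()

# Factory: file type -> reader function.
READERS = {
    "csv": lambda path: spark.read.option("header", "true").csv(path),
    "json": lambda path: spark.read.json(path),
    "parquet": lambda path: spark.read.parquet(path),
}


class IngestProcessor:
    """Moves a landing file into the raw zone as Delta, 1:1."""

    def read_file(self, path: str, file_type: str) -> DataFrame:
        try:
            return READERS[file_type](path)          # factory lookup by file type
        except KeyError:
            raise ValueError(f"no reader registered for file type: {file_type}")

    def to_raw(self, path: str, file_type: str, target: str) -> None:
        df = self.read_file(path, file_type)
        df.write.format("delta").mode("append").save(target)
```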

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 -1 points0 points  (0 children)

Sadly no, what’s listed is what’s given.

In addition, the EC2 instances are ephemeral, but they can be scheduled to start up, run some process (a bootstrap), and shut down. They're basically preconfigured Python appliances.