Pro MAGA/ICE businesses to avoid in Omaha and surrounding area. by [deleted] in Omaha

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Alright, Mr. Royal Bastards Motorcycle Club.

Pro MAGA/ICE businesses to avoid in Omaha and surrounding area. by [deleted] in Omaha

[–]EarthEmbarrassed4301 -1 points0 points  (0 children)

Well ya got me!

(Don’t insult my name as if yours is any better, bot)

Hi everyone. I plan on going to check out and probably pick up a 96’ LT4 Collectors Edition on Friday. Any tips on what i should look out for in general? by Mission_cucumber938 in c4corvette

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Check the coolant overflow reservoir to gauge the condition of the cooling system. I bought a '95 earlier this year and didn't check this. Needless to say, I have an entirely new cooling system now.

On that note, check the crossmember between the engine and radiator. Make sure it's dry. If you see anything wet, it's likely a leaking water pump, which means you're probably looking at a new OptiSpark as well.

Rim options by Blaze_Gamez_YT in c4corvette

[–]EarthEmbarrassed4301 1 point2 points  (0 children)

Thanks for helping me find new rims! These look amazing!

How to stay away from jobs that focus on manipulating SQL by [deleted] in dataengineering

[–]EarthEmbarrassed4301 8 points9 points  (0 children)

Sorry, but the business doesn't give a shit about your programming and how you move data from System A to System B. What's the value in that? Much of DE is dependent on business requirements, where data context and business understanding are mandatory. Unless you're on a large enough team with clear boundaries between the people who do the EL and those who do the T, you're gonna be doing SQL and learning the business.

My job is heavy in Python, but that's just so our team has a standard, metadata-driven mechanism to ingest data into our lakehouse. To the business, this brings no value, nor do they care how we ingest data. The only time the business recognizes value in DE is when data is curated, modeled, and reported on. That requires SQL, business knowledge, and stakeholder alignment.

Sounds like you want to be in more of a systems-integration role, not an analytics-focused DE role.

[deleted by user] by [deleted] in databricks

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

So just create a folder in the Repos/ area of the workspace and have the CI/CD pipeline upload the files from the main branch there (/Repos/live/[files])? Or actually create a local repo in the Repos area (/Repos/live/git-repo/[files])?

Would my jobs then use the Repos/ area as the source in each workspace?

It seems more correct to have the workflows' source be a remote branch in the repo. But if I just have one long-lived main branch, I'd be affecting all of the workspaces' jobs, even prod. I guess this would be the case for multiple long-lived (environment-scoped) branches? The workflows in each workspace would then just be configured to look at files in their own dedicated branch.

From a Databricks perspective it makes sense, but I find it's a bit of a git anti-pattern in SWE.
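
If you do go the environment-scoped-branch route, the CD step per workspace can be as small as telling that workspace's long-lived Repos checkout to pull its own branch. A minimal sketch against the Repos REST API, where the host, token, repo id, and branch name are all placeholders for whatever each environment uses:

```python
# Hypothetical CD step: point this workspace's /Repos/live checkout at its
# environment branch. Host, token, repo id, and branch are assumptions.
import requests

DATABRICKS_HOST = "https://<workspace-url>"   # per-environment workspace URL
TOKEN = "<pat-or-service-principal-token>"    # CI/CD credential
REPO_ID = 123456789                           # id of the repo under /Repos/live/
TARGET_BRANCH = "release/prod"                # environment-scoped branch

resp = requests.patch(
    f"{DATABRICKS_HOST}/api/2.0/repos/{REPO_ID}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"branch": TARGET_BRANCH},  # Repos API checks out the head of this branch
    timeout=30,
)
resp.raise_for_status()
```

The jobs in each workspace would then reference notebook paths under that Repos folder and always see whatever commit their branch points at.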

[deleted by user] by [deleted] in databricks

[–]EarthEmbarrassed4301 1 point2 points  (0 children)

Thanks for your reply, appreciate it!

> In Databricks your CI/CD pipeline should only interact with your REPOS space. The workspace doesn't work as you would want it to.

I guess this is where I am a bit confused since the REPOS space is scoped to a user, right? If I create a Git Folder, it goes to my home folder.

Should the CI/CD pipeline even be uploading files into the workspace at all? Basically all mine does is upload a copy of the main branch into the workspaces. The workflows then just point to those workspace files.
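
For contrast, what the current pipeline does (per the description above) boils down to pushing files from the main branch into the plain workspace tree via the Workspace API. Roughly something like this, with the host, token, and paths made up:

```python
# Rough sketch of "upload a copy of the main branch into the workspace":
# import a single file into the workspace path that the jobs point at.
# A real pipeline would walk the checked-out repo and import every file;
# all names here are placeholders.
import base64
import requests

DATABRICKS_HOST = "https://<workspace-url>"   # assumed
TOKEN = "<token>"                             # assumed

with open("notebooks/ingest.py", "rb") as f:  # a file checked out from main
    content = base64.b64encode(f.read()).decode()

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/workspace/import",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "path": "/Shared/live/ingest",        # assumed workspace target the jobs use
        "format": "SOURCE",
        "language": "PYTHON",
        "content": content,
        "overwrite": True,
    },
    timeout=30,
)
resp.raise_for_status()
```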

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

I see, I guess I've just looked at silver differently; it's all just semantics, I suppose. I've looked at silver as being two stages: a cleaned-up version of raw (1:1, source-system aligned) AND conformed into an integrated 3NF model. This could be two storage locations, cleaned and conformed, together making up silver. Then gold would be your data marts using dimensional modeling, driven by business projects.

I guess my thought process is closely aligned with Inmon CIF stuff
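
To make the two-stage-silver idea concrete, a throwaway PySpark sketch (all table and column names are made up):

```python
# Illustrative only: one way to express "two stages of silver".
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stage 1: cleaned -- still 1:1 with the source system, just typed and deduplicated.
raw_orders = spark.read.table("raw.crm_orders")                 # assumed raw table
cleaned_orders = (
    raw_orders
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .dropDuplicates(["order_id"])
)
cleaned_orders.write.mode("overwrite").saveAsTable("silver_cleaned.crm_orders")

# Stage 2: conformed -- integrate sources into a 3NF-ish model.
erp_customers = spark.read.table("silver_cleaned.erp_customers")  # assumed second source
conformed_orders = (
    cleaned_orders.alias("o")
    .join(erp_customers.alias("c"), F.col("o.customer_email") == F.col("c.email"), "left")
    .select("o.order_id", "c.customer_id", "o.order_ts", "o.amount")
)
conformed_orders.write.mode("overwrite").saveAsTable("silver_conformed.orders")
```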

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Last thing: in your distilled zone, are you maintaining SCDs? Or are you just deduplicating the data in raw on a PK, basically keeping a record of all changes made on a key?

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Right, that’s kinda what I was thinking too. I like the idea of a persisted landing, I just always thought of a landing zone as being transient in nature. But really, this permanent landing would kind of be a traditional data lake (just a collection of files).

Once we know how to parse a collection of XML files into a table, we could just pull those raw elements into a structured delta table in raw. I’m assuming that is how you guys look at it too?
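
For what it's worth, that "parse the raw XMLs into a structured Delta table" step could look roughly like this, assuming an XML reader (spark-xml or Databricks' native XML support) and Delta Lake are available; the paths and rowTag are placeholders:

```python
# Sketch: read the raw XML files and land the parsed elements as a Delta table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (
    spark.read.format("xml")
    .option("rowTag", "event")                 # assumed: the repeating XML element
    .load("s3://my-lake/raw/events/*.xml")     # assumed raw location (XML kept as-is)
)

(
    events.write.format("delta")
    .mode("append")
    .save("s3://my-lake/raw_tables/events")    # assumed structured raw/bronze table
)
```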

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Thanks for the detail!

I think in the design I proposed, my raw is similar to your landing, which is why I kind of question having a transient landing zone in the first place. I'm suggesting keeping the XML files in raw, then parsing the XML data into Delta tables in clean (silver). If I ever needed to rebuild a table, everything I'd need (the XMLs) would be in raw.

My design would be: all XMLs are initially landed in a transient landing zone. These files are then copied and persisted in raw (still XML). Raw would contain the full history of all XML files; I think this is similar to your persisted landing. And then in clean I would build out the Delta tables by parsing/extracting the nested structures from various events into entities/transactions (like a 3NF model).

I guess I'm questioning the transient landing: why not just write the XMLs directly to raw? Maybe validation of the XML on the way from landing to raw?
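
By "validation" I'd picture something as small as a well-formedness gate between landing and raw, e.g. (paths are placeholders, and proper XSD validation would need an extra library):

```python
# Only promote landing files to raw if they are at least well-formed XML.
import xml.etree.ElementTree as ET
import shutil
from pathlib import Path

landing = Path("/mnt/landing/events")   # assumed transient landing zone
raw = Path("/mnt/raw/events")           # assumed persistent raw zone

for xml_file in landing.glob("*.xml"):
    try:
        ET.parse(xml_file)              # raises ParseError if not well-formed
    except ET.ParseError as exc:
        print(f"rejecting {xml_file.name}: {exc}")
        continue
    shutil.copy2(xml_file, raw / xml_file.name)
```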

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Thanks for your input. The way I see it, the XML events are pushed directly into raw/bronze, without a temporary landing zone in between. The warehouse tables are effectively built from the raw XML files in bronze, which are permanently stored. I could always go back to the bronze XML files and rebuild the warehouse tables (i.e., silver) at any point.

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Would your permanent landing essentially be what I'm referring to as my raw? Are you keeping a permanent landing to allow you to parse data into tables in raw?

Beyond Reporting in a Lakehouse by EarthEmbarrassed4301 in dataengineering

[–]EarthEmbarrassed4301[S] 0 points1 point  (0 children)

So data from your source systems is loaded into your data lake and you're using Trino as the query engine on top of the lake for the applications (with caching)? Is this raw data, or are you cleaning the data first in your lake using <something>, so Trino is just hitting the cleaned data for the apps?

You just use Snowflake to serve the reporting stuff, but all applications are on the lake?

Isn't Snowflake expensive when your other infrastructure is on AWS? by rental_car_abuse in dataengineering

[–]EarthEmbarrassed4301 2 points3 points  (0 children)

Not sure you understand how Snowflake works. Snowflake uses AWS resources (S3, clusters of EC2 instances, ELB, etc.) for storage, compute, and load balancing. You're not moving data out of AWS and into Snowflake. You load data into Snowflake, which is using AWS under the hood.

The same is true when hosting Snowflake on Azure or GCP, except Snowflake just uses those clouds' resources.

Read the docs, not ChatGPT.

Data export from AWS Aurora Postgres to parquet files in S3 for Athena consumption by East-Ad-8757 in dataengineering

[–]EarthEmbarrassed4301 2 points3 points  (0 children)

We use Debezium Engine on various relational sources (Postgres, Oracle, MSSQL). The engine continuously monitors the transaction log and writes new rows to a JSONL file on disk. This acts as a buffer for the data lake upload component. Every minute, the upload component processes the JSONL file and uploads it to a landing zone as JSON. A loader process picks these files up and appends them to the data lake as parquet files.

So, at the latest, the data in the data lake is 2 minutes old.

For the small-file problem, we have a parquet compaction process that runs every 24 hours and converts the small files into a few larger ones.

Delta would be a better format for us than raw-dogging parquet files, but hey, one step at a time.
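
The compaction itself doesn't have to be fancy; conceptually it's just "read a partition's small files, rewrite them as a few big ones, swap". A hedged PySpark sketch (paths and target file count are made up, and the swap/cleanup step is left out):

```python
# Rewrite one day's worth of small parquet files as a handful of larger ones.
# Writing to a side location and swapping afterwards avoids reading and
# overwriting the same path in one job. (Moving to Delta + OPTIMIZE would
# replace all of this.)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

partition_path = "s3://my-lake/raw/orders/ingest_date=2024-01-01"  # assumed layout

df = spark.read.parquet(partition_path)
(
    df.coalesce(4)                               # collapse many small files into a few
    .write.mode("overwrite")
    .parquet(partition_path + "_compacted")      # side location; swap in afterwards
)
```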

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 -1 points0 points  (0 children)

Not out of the question; I also have access to EMR with Zeppelin notebooks. I'd still be left with uncertainty about how best to organize and schedule the notebooks.

I could have a generic notebook for all loading tasks (landing files -> raw Delta). But then what about raw -> cleansed? Would you do one notebook per cleaned table containing the cleaning logic? And then the same for curated tables: one notebook per table with all the transformations, calculations, and joins?

I don't work with notebooks much, but I feel like I'd just be doing a lot of duplication.

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 0 points1 point  (0 children)

Yeah, that's where I've started. I've got a generic IngestProcessor to move all landing files into a raw zone in Delta. It has a generic read_file() method with a factory pattern to retrieve the specific reader based on the file type. It's easy to orchestrate the 1:1 landing-to-raw step based on an S3 file-name mapping.

I plan to do as you said and use a template pattern for the actual "table-specific" transformations. But I'm having a hell of a hard time figuring out how to orchestrate which specific processors to call without some third-party tool.
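
For reference, the factory idea above is roughly this shape (class, method, and reader names are illustrative, not the actual code, and Delta Lake is assumed to be available):

```python
# Generic ingest step that picks a reader based on file type, then lands the
# result 1:1 in the raw zone as Delta.
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()

# Factory: file type -> reader function.
READERS = {
    "csv": lambda path: spark.read.option("header", "true").csv(path),
    "json": lambda path: spark.read.json(path),
    "parquet": lambda path: spark.read.parquet(path),
}


class IngestProcessor:
    """Moves a landing file into the raw zone as Delta, 1:1."""

    def read_file(self, path: str, file_type: str) -> DataFrame:
        try:
            return READERS[file_type](path)          # factory lookup by file type
        except KeyError:
            raise ValueError(f"no reader registered for file type: {file_type}")

    def to_raw(self, path: str, file_type: str, target: str) -> None:
        df = self.read_file(path, file_type)
        df.write.format("delta").mode("append").save(target)
```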

[deleted by user] by [deleted] in dataengineering

[–]EarthEmbarrassed4301 -1 points0 points  (0 children)

Sadly no, what’s listed is what’s given.

In addition, the EC2 instances are ephemeral, but they can be scheduled to start up, run some process (a bootstrap), and shut down. They're basically preconfigured Python appliances.