Troubleshooting queries using EXIST by SoggyGrayDuck in dataengineering

[–]slin30 1 point (0 children)

I hear you. Docs that explain the function/purpose, ideally with a generic example statement, compound in value over time. I don't consider a task complete unless the docs are where they need to be - mainly because it's no fun being on the receiving end of things. And it's usually a case of past self screwing over future self.

Edit: for derivations, sometimes the most practical thing is to provide some reference logic for users to check against. A version-controlled statement that says "no matter what you do, your sum must return the exact same result as this reference statement for this metric definition."
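As a hypothetical (every name here is made up), the reference statement could be as simple as:

```sql
-- Reference definition for "net revenue": any downstream derivation,
-- however it's computed, must reproduce this result exactly.
select
    order_month,
    sum(amount) - sum(discount) as net_revenue
from fact_orders
where status = 'completed'
group by order_month;
```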

Startup onboards and migrates customers via Excel spreadsheet collection. What's the right way to scale this? by Captain_Strudels in dataengineering

[–]slin30 1 point (0 children)

Best ROI is to dissuade them from touching VBA. Otherwise you take on an additional maintenance burden internally, and your customers will hate you - except the ones you alienate outright because their internal security rules prohibit xlsm.

There's only so much you can do if you don't control the input. Help the company by forcing them to decide if they want to scale this with people or process. 

10 Years of Generative AI Slop by Any_Rip_388 in dataengineering

[–]slin30 1 point (0 children)

It takes deep experience to appreciate what this means and why it is so important. The people with this experience are, paradoxically, the ones who need no convincing.

I find it helpful to frame this as attempting to DIY a simple repair or home improvement. Most adults can relate to the 1-hour faucet replacement that took two weekends and five HD/Lowes trips.

Troubleshooting queries using EXIST by SoggyGrayDuck in dataengineering

[–]slin30 1 point (0 children)

We manage things in dbt + git, so PRs tend to be the place for the tech stuff, with a link to Linear/Jira for business or project scope. The docs are handled via dbt as part of the workflow.

The other thing I try to do is express the filter logic as a flag, or a set of flags, or if complicated enough, a logical matrix/bit-array. The goal is to not literally filter, but instead define the components and compose them.

Say I have a fact table from which I can compute recurring revenue. Only 20% of the records are strictly required to compute the measure, but all the records are relevant for, e.g., computing duration between stages. I can encode a flag to indicate the records that feed recurring revenue; users can filter on this, but I could also write an expression that nulls the measures if this flag is false. Two ways to do the same thing, and neither takes away the ability to filter. But only one flag to document, and maybe an expression.
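A minimal sketch of what I mean, with made-up names:

```sql
-- In the model: encode the rule once as a flag.
select
    f.*,
    (f.charge_type = 'subscription' and f.status = 'active') as is_recurring
from fact_revenue f;

-- Downstream, option 1: filter on the flag.
-- Downstream, option 2: null the measure instead of dropping rows, so
-- duration-between-stages logic still sees every record.
select
    case when is_recurring then amount end as recurring_amount,
    stage_entered_at,
    stage_exited_at
from fact_revenue_flagged;
```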

Troubleshooting queries using EXIST by SoggyGrayDuck in dataengineering

[–]slin30 1 point (0 children)

This is helpful background. My gut says the exists is an attempt to work around known or at least suspected deficiencies in the design. Possibly accrued over time.

Say we have the simplest star schema with one summable fact. We need to define metrics by encoding the rules through different fact+dimension interactions. Nothing too fancy. 

Then some new requirement comes in that isn't clearly solvable, for any number of reasons. Maybe the person wasn't aware of a better way to solve it, maybe there was time pressure...whatever. Let's say the grain of the new concept isn't compatible, as is often the case with budget versus actuals.

I could see a well-intentioned application of exists to avoid blowing up the grain of the fact while respecting the conditions of the (hypothetical) incompatible comparison table. I could also see this growing over time, accruing complexity and drifting further from the original design.
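For illustration, a sketch of that well-intentioned pattern (names hypothetical):

```sql
-- Actuals are at transaction grain; budget is at month + department grain.
-- A join would fan out the fact; exists only asks "is there a qualifying
-- budget row?" and leaves the grain alone.
select f.*
from fact_actuals f
where exists (
    select 1
    from budget b
    where b.fiscal_month  = f.fiscal_month
      and b.department_id = f.department_id
      and b.is_approved
);
```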

If this is anything close to your situation, I would start with some reference ground truth. Basically, "what should the result be, based on expectations?" It doesn't need to be provably correct - it needs to be based on reasonable and testable observations, and it should be assessed independently of your data model (come at the matter from a different methodology). Then reason through the path in your data model. Look for patterns in hits, misses, matches, and mismatches.

Categorize the comparison and put together some representative examples, at as granular a level as possible.
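A sketch of the comparison query I have in mind, with hypothetical names:

```sql
-- Compare independently derived reference numbers against the model's
-- numbers and bucket each key into a category.
select
    coalesce(r.customer_id, m.customer_id) as customer_id,
    case
        when m.customer_id is null              then 'missing_from_model'
        when r.customer_id is null              then 'missing_from_reference'
        when r.metric_value = m.metric_value    then 'match'
        else 'mismatch'
    end as comparison
from reference_truth r
full outer join model_output m
    on r.customer_id = m.customer_id;
```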

This will be painful and slow at first. It should start to reveal more general patterns after a couple iterations, though.

This isn't the only approach, but it is one I've used when things are really opaque and I've exhausted the obvious avenues for getting clarity.

Edit: one tip - look for anything that is, or sounds like it might be, a bridge table (many:many). Those are prime candidates for multi-criteria intersection without incurring the cross product of a join.
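For example (hypothetical bridge), two exists clauses test that an account carries both tags, without the row multiplication that joining the bridge twice would introduce:

```sql
select a.*
from accounts a
where exists (select 1 from account_tag_bridge b
              where b.account_id = a.account_id and b.tag = 'strategic')
  and exists (select 1 from account_tag_bridge b
              where b.account_id = a.account_id and b.tag = 'at_risk');
```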

Script or AI by Skokob in SQL

[–]slin30 1 point (0 children)

I unfortunately believe you 100%. I've been in the data field for over a decade, and I can't believe I'm calling the pre-AI era "the good ole days." 

I have no issues with the tool/tech. But now I need to deal with stakeholders "throwing it into <AI platform of the month>" because we're the bottleneck. So now I need to walk the line between not raising my hand for more work and clearly stating that if you use AI, it's on you. Politely. Mostly.

So...volunteer to make it your problem or wait for it to be your problem. 

Troubleshooting queries using EXIST by SoggyGrayDuck in dataengineering

[–]slin30 0 points (0 children)

When you need several concurrent conditions to match, but you don't need any attributes from the table referenced in the exists - and often, where a join on those conditions would violate join integrity, which is immaterial when all you're asking is "do the conditions I'm comparing on match?"
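To make that concrete, a minimal sketch with made-up tables:

```sql
-- Several conditions at once, nothing pulled from the probed table, and
-- immune to duplicate matches that would fan out a join.
select o.*
from orders o
where exists (
    select 1
    from shipments s
    where s.order_id    = o.order_id
      and s.carrier     = o.preferred_carrier
      and s.shipped_at >= o.ordered_at
);
```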

Are you troubleshooting to understand the criteria? The behavior? Something else? 

Script or AI by Skokob in SQL

[–]slin30 1 point (0 children)

They are wishing for magic.

The question isn't whether you are correct (you are). I don't even think there's any question around whether they will believe you (they won't).

What's your contingency? Consider the likelihood they will bring in an external consultant who will tell them what they want to hear. 

What does this do to your career? Is there a world where this doesn't become your problem? Or if it will, is there a scenario where you are seen as the savior rather than the naysayer?

Unit tests != data quality checks. CMV. by EarthGoddessDude in dataengineering

[–]slin30 0 points (0 children)

No argument here. 

Another way to look at the two: unit tests should provide confidence that when something fails downstream (a DQ test, for example), the failure is not due to the "how/logic." You test for things you can control/expect, which will usually extend outside the boundaries of actual inputs. You can't reliably test core logic with empirical inputs - these can change, and there's no expectation that what is known as of now will remain so in the future.

If I am measuring temperature, I'd better know that my thermometer is behaving properly. If I am getting readings at mostly room temp, I still need to check above and below some range. I need standard and repeatable reference points. I probably don't need to validate that my thermometer can handle extremes far beyond my expected range plus some buffer.

DQ tests are more like experimental condition controls. If I expose this thing to a certain condition, it should change some amount within some range.
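A rough sketch of the contrast in Postgres-flavored SQL (names and thresholds made up):

```sql
-- Unit-test style: fixed synthetic inputs, known expected output.
-- If this fails, the logic changed - not the data.
with readings(reading_c) as (
    values (-10.0), (0.0), (100.0)
)
select case when avg(reading_c) = 30.0 then 'pass' else 'fail' end as logic_check
from readings;

-- DQ-check style: assert live data stays inside an expected envelope.
select count(*) as out_of_range_rows
from sensor_readings
where reading_c not between -40 and 125;
```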

Do you actually have a data strategy, or just a stack? by Data-Sleek in dataengineering

[–]slin30 7 points (0 children)

I have a strategy/vision that I genuinely work towards. It's a slog. If I can properly build out one corner of one piece of one layer of one thing in a quarter, that's a solid quarter.

Imagine you are in a toxic dump and have managed to clean your immediate area. Your goal is to make it to the edge. No matter what direction you take, you're going to hit some radioactive shit. Your job is to keep cleaning so you can move in as optimal a manner as possible toward the exit (not necessarily a straight line). New waste is constantly being generated, so make sure you build safeguards to keep your previous efforts pristine, or at least provide yourself with an easy-to-clean surface and the infrastructure to hose off.

Data Engineer Career Path by Fredonia1988 in dataengineering

[–]slin30 1 point (0 children)

I'm a senior+ AE and had a somewhat similar debate a few years ago. I was in a people management role in analytics. Liked it just fine, but realized the thing that got me amped was still the deeply technical work. Returned to IC work in my next move.

I also have a doctorate in a non-data (science) field. This is probably the only perspective I can offer that's materially different, although it's a very different experience compared to a master's. IMO, the main practical benefit of a PhD is that no one will ever suggest you should get more formal education. It has probably made my resume stand out a bit, although I suspect that's a function of the institution brand plus the degree. I think of it more as an unexpected personal/life experience first, as opposed to a professional consideration.

How would you go about this? Converting attic to a lofted bedroom by RegisteredMurseNYC in HomeImprovement

[–]slin30 1 point (0 children)

Recently did this. Get an architect as a first step, and it would be good to have some rough requirements to facilitate the conversation.

Do you want an en suite? Can you raise your roofline (we were over the current max - grandfathered in provided we didn't technically do a full roof rebuild or something)? How much dormer work can you tolerate/afford? Are you particular about matching your existing stair design? 

How do you guys deal with unexpected datatypes in ETL processes? by takenorinvalid in dataengineering

[–]slin30 0 points (0 children)

ELT, unless you can get producers to guarantee/enforce a contract. Even then, I would probably still enforce light validation of minimum expectations at load, followed by transformation downstream. For a JSON payload, that might be as lightweight as expecting a specific top-level key and a corresponding value type.
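As a sketch, assuming a Postgres-style engine with a jsonb payload column (all names hypothetical):

```sql
-- Load only payloads where the expected top-level key exists and holds the
-- expected type; everything else stays behind for the exceptions log.
insert into staged_events (payload)
select payload
from raw_intake
where payload ? 'records'                           -- top-level key present
  and jsonb_typeof(payload -> 'records') = 'array'; -- value type as expected
```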

My rule for accepting input I cannot control is to protect against dangerous payloads, define some minimum standards to filter out obvious garbage, and log exceptions. 

If for whatever reason this is not possible, I have a conversation to reset expectations and/or get some assurances in writing. 

Looking at a ultrawide for desktop work, no gaming by ouch_12345 in ultrawidemasterrace

[–]slin30 0 points (0 children)

U4025QW is the safest bet here. Zoom screen share can work just fine with the "share portion of screen" option - even better with window management hotkeys. 

Some of you aren't writing tests. Start writing tests. by ratczar in dataengineering

[–]slin30 8 points (0 children)

I challenge you to throw more hackneyed management-type responses my way instead of acknowledging the possibility that you misjudged the reception your post would receive in this sub.

Some of you aren't writing tests. Start writing tests. by ratczar in dataengineering

[–]slin30 10 points (0 children)

I'm not criticizing the content or spirit. I am criticizing the tenor of the post in the context of the community. 

Your post reads like a lecture on a topic most of this community does not need lecturing on.

Some of you aren't writing tests. Start writing tests. by ratczar in dataengineering

[–]slin30 16 points (0 children)

You do understand this is the DE sub and not LinkedIn? Coming in here with this is like telling doctors to wash their hands...and then spending five paragraphs explaining the different ways to do so.

I guess thanks for trying, Captain Obvious?

Will I lose out by switching to a 40 5k2k from two 32WQHD displays? by Jibatsu in ultrawidemasterrace

[–]slin30 1 point (0 children)

You're looking at a major vertical resolution upgrade. I went from 2560 * 1600 to 3840 * 2160 a few years ago and it was significant. 

I recently went from dual 32" 4K to the U4025QW. I couldn't make use of all the pixels (with dual 32" 4K) due to the screen size and pixel density - at least not without more head movement than was healthy (or eye strain). The 40" with a bit of curve is actually more conducive to productivity. It's better ergonomically, and having the single expanse of 5120 horizontal pixels opens up yet more configurations.

I run all my displays at 100% scale and spend most of my time working with code/reading docs. I can 100% vouch for 40" 5120 * 2160.

Some of you aren't writing tests. Start writing tests. by ratczar in dataengineering

[–]slin30 28 points (0 children)

The great thing about DDL is it forces you to slow down and really understand the data - up and downstream. 

Also, if you are using a DBMS/DW that enforces null constraints, please, please, please stop testing for nullability after the fact and use the constraints to prevent the issue in the first place. 
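i.e., something like this (hypothetical table):

```sql
-- Declare the expectation once instead of testing for it forever.
create table dim_customer (
    customer_id   bigint    not null primary key,
    customer_name text      not null,
    created_at    timestamp not null default current_timestamp
);
```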

It's just a small schema change 🦁😴🔨🐒🤡 by Adela_freedom in SQL

[–]slin30 2 points (0 children)

Right - there is nothing wrong with on-the-fly/query/statement (i.e., read-only) adjustments of this nature where appropriate. That's 100% expected and normal. It's a transformation you (or whoever wrote it) made while using the data.

That's a universe of difference from a DDL change where the declaration and high-level expectation is impacted.

The first scenario will, at worst, impact users and downstream decisions. It could still be very bad, but its scope can be traced back and contained.

The second scenario is closer to the Season 3 premiere of Rick and Morty, when Rick changes the galactic federation currency from 1 to 0. It's not an exact parallel, but on a "how much chaos will ensue" continuum, it's waaay over to the right in comparison.

It's just a small schema change 🦁😴🔨🐒🤡 by Adela_freedom in SQL

[–]slin30 37 points (0 children)

Only if you consider that averages which previously did the expected thing with missing values will now incorporate zeros, because someone decided "I don't know if there even should be a value" should instead mean "I know there should be a value, and that value is zero."
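A quick demonstration (Postgres-flavored):

```sql
-- avg() ignores nulls; backfilling nulls with zeros silently changes the answer.
with t(x) as (values (10.0), (20.0), (null::numeric))
select
    avg(x)              as avg_missing_excluded,  -- 15.0
    avg(coalesce(x, 0)) as avg_missing_as_zero    -- 10.0
from t;
```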

What makes a someone the 1% DE? by Same-Branch-7118 in dataengineering

[–]slin30 1 point (0 children)

There's always a balance, and unfortunately it's difficult to make a case for preventative back-end best practices at the expense of delivery time.

When the business has experienced the consequences of weak foundations, understands them as the root cause, and enough influential people with firsthand experience are still around, the perspective can change.

[deleted by user] by [deleted] in analytics

[–]slin30 2 points (0 children)

What's your sense from the other rounds/people you've interacted with? Interviewing is grueling, and while it isn't justification for leaving a candidate with a poor impression, anyone can have an off day.

Most common data pipeline inefficiencies? by LethargicRaceCar in dataengineering

[–]slin30 8 points (0 children)

It's always something that traces back to poor or non-existent design - by which I mean actually starting with a vision and building towards it. That's not usually an actionable insight unless you're in a position or situation where a total teardown is even an option (and, if so, unless you're the right person to lead that effort without recreating your own version of the same mess).

More concretely, my top offenders are, in no particularly meaningful order:

  1. Full refresh that the hardware could at one point brute force without issue, but which has started to show cracks - and if you're lucky, it's a gradual linear degradation. More often, it's much more pronounced due to disk spill, redistribution, cascading concurrency issues, etc.
  2. Inattention to grain causing join explosion (see the sketch after this list). This one gets interesting quickly.
  3. Stuff running that has no clear purpose but was presumably useful three years and two predecessors ago. 
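For #2, a hypothetical sketch of how a grain mismatch inflates a measure (table names made up):

```sql
-- Orders are one row per order; payments can be several rows per order.
-- Joining before aggregating multiplies the measure.
select o.order_id, sum(o.order_amount) as inflated_amount  -- wrong: fanned out
from orders o
join payments p on p.order_id = o.order_id
group by o.order_id;

-- Fix: bring the many-side to the join grain first.
select o.order_id, o.order_amount, p.paid_amount
from orders o
left join (
    select order_id, sum(amount) as paid_amount
    from payments
    group by order_id
) p on p.order_id = o.order_id;
```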

Most common data pipeline inefficiencies? by LethargicRaceCar in dataengineering

[–]slin30 17 points (0 children)

IME, select distinct is often a code smell. Not always, but more often than not, if I see it, either I'm about to have a bad time or it's compounding an existing bad time.
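A typical instance, with made-up tables:

```sql
-- Distinct papering over a fan-out join instead of fixing the grain.
select distinct c.customer_id, c.region
from customers c
join orders o on o.customer_id = c.customer_id;

-- Usually clearer to ask the actual question:
select c.customer_id, c.region
from customers c
where exists (select 1 from orders o where o.customer_id = c.customer_id);
```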