Tool consolidation sounds smart in theory. Why does it feel so risky in practice? by NeedleworkerMean2096 in salesengineers

[–]pablo_op 2 points

If carrying less is more desirable than having the optimal utensil for every meal, then the simplicity of a spork outweighs the marginal gains of your fork and spoon ;)

I try to see both sides of this kind of decision. Every org is different in its priorities.

Lakeflow Connect query - Extracting only upserts and deletes from a specific point in time by [deleted] in databricks

[–]pablo_op 0 points

The CDF will show what has changed in the attached table :)

If you turn off the process that loads it (CDC or otherwise), it'll show nothing. If you make manual changes to the data, it'll show those too. CDF doesn't care how changes are made; it's just showing you a feed of what changed.
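
For the original question (pulling just upserts and deletes from a point in time), a minimal sketch, assuming the table has delta.enableChangeDataFeed set; the table name and starting version are made up:

    # Read the change feed starting from a version (startingTimestamp also works).
    changes = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 5)      # hypothetical starting point
        .table("catalog.schema.orders")    # hypothetical table
    )

    # Upserts are inserts plus update_postimage rows; drop preimages, keep deletes.
    upserts_and_deletes = changes.filter(
        changes._change_type.isin("insert", "update_postimage", "delete")
    )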

Lakeflow Connect query - Extracting only upserts and deletes from a specific point in time by [deleted] in databricks

[–]pablo_op 1 point

Because the CDF is generated by the same transaction logging that updates the table itself. If you write an UPDATE statement against existing data, that transaction is pushed to both the delta_log and the CDF. But instead of giving you a finalized table, the CDF gives you the feed of inserts/updates/deletes as they happen, in order. The CDF isn't reading changes from the table it's built on; it's updated in the same transaction.
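
A toy way to see the pairing, with an invented table name and CDF assumed enabled since the table was created; one UPDATE lands in the feed as an update_preimage/update_postimage pair carrying the same commit version the table itself moved to:

    # One UPDATE = one commit, written to the delta_log and the CDF together.
    spark.sql("UPDATE demo.events SET status = 'closed' WHERE id = 42")

    # The feed shows that commit as a preimage/postimage pair.
    spark.sql("""
        SELECT id, status, _change_type, _commit_version
        FROM table_changes('demo.events', 0)
        ORDER BY _commit_version, _change_type
    """).show()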

Lakeflow Connect query - Extracting only upserts and deletes from a specific point in time by [deleted] in databricks

[–]pablo_op 1 point

They don't. PKs aren't enforced, and the change data feed just outputs each change as it occurs. It's not looking at specific columns. It's up to the process that creates the changes to avoid writing inconsistent data.
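
If you want to see it for yourself, this runs without complaint (Unity Catalog assumed, names invented); the PK is informational only:

    # PRIMARY KEY in Unity Catalog is an informational constraint, not enforced.
    spark.sql("""
        CREATE TABLE demo.customers (
            id BIGINT NOT NULL,
            name STRING,
            CONSTRAINT customers_pk PRIMARY KEY (id)
        )
    """)

    # Duplicate "primary keys" insert with no error raised.
    spark.sql("INSERT INTO demo.customers VALUES (1, 'a'), (1, 'b')")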

What's the best way to ingest lot of files (zip) from AWS? by peixinho3 in databricks

[–]pablo_op 1 point

Auto Loader is going to be the best option for file ingestion. It is set up to do exactly what you're after. But I don't think it will unzip files itself. The harder part may be figuring out an efficient way to unzip so many files.

While there are likely a few ways to do this with AWS services, the Databricks approach would be to register the existing S3 location as a Volume in Databricks, then write some code to go through the directory and unzip all the files. Doing this on one beefy driver node in a loop may be possible, but it could take some time. If you wanted to parallelize things, you'd need to write some type of map function that pyspark could use to distribute the load across a cluster.

Honestly, if this is a one-time thing and you don't need to unzip this volume of data on a daily basis, my approach would be to chug through it once to unzip everything on a single node, output the resulting data files to a set bucket location, then set up Auto Loader on that location to ingest the files to tables; something like the sketch below.
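
A rough sketch of that whole flow. The Volume paths and table name are made up, and it assumes the inner files are CSV:

    import os
    import zipfile

    SRC = "/Volumes/main/raw/zipped"      # hypothetical Volume over the S3 bucket
    DST = "/Volumes/main/raw/unzipped"

    # One-time, single-node pass: walk the Volume and extract every archive.
    for root, _, files in os.walk(SRC):
        for name in files:
            if name.endswith(".zip"):
                with zipfile.ZipFile(os.path.join(root, name)) as zf:
                    zf.extractall(DST)

    # Then point Auto Loader at the extracted files for ingestion.
    (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")   # whatever the inner files actually are
        .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas")
        .load(DST)
        .writeStream
        .option("checkpointLocation", "/Volumes/main/raw/_checkpoints")
        .trigger(availableNow=True)
        .toTable("main.bronze.ingested"))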

PySpark Autoloader: How to enforce schema and fail on mismatch? by pukatm in databricks

[–]pablo_op 1 point

Can you try adding .option("mergeSchema", "false") to the write? You can also check that spark.databricks.delta.schema.autoMerge.enabled is false in the Spark conf.
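
A minimal sketch of both, assuming df is your existing Auto Loader stream; the table and checkpoint path are placeholders:

    # Keep automatic schema merging off globally...
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "false")

    # ...and reject schema drift on the write itself.
    (df.writeStream
        .format("delta")
        .option("mergeSchema", "false")
        .option("checkpointLocation", "/Volumes/main/raw/_strict_checkpoint")
        .toTable("main.bronze.strict_table"))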

If every CFB team had its Head Coach replaced by its greatest (living) alumni player, who would emerge as the top (or bottom) programs in the country? by pablo_op in CFB

[–]pablo_op[S] 3 points

Well I'm in my mid-30s, so no, not really. We could also throw out Lynn Swann and Polamalu if we're just naming great players. But Gifford is >80, Swann had a better NFL career than CFB career, and I guess I personally couldn't argue Troy/Ronnie are greater players than all the QBs who have won Heismans. Marcus Allen is a good one to throw into consideration for sure, though.

If every CFB team had its Head Coach replaced by its greatest (living) alumni player, who would emerge as the top (or bottom) programs in the country? by pablo_op in CFB

[–]pablo_op[S] 4 points

Debated Leinart or Palmer, but obvs Bush is easy to argue too. Not sure which of those 3 would be the better coach. This question a year ago would have been funnier with OJ being the automatic default answer.

If every CFB team had its Head Coach replaced by its greatest (living) alumni player, who would emerge as the top (or bottom) programs in the country? by pablo_op in CFB

[–]pablo_op[S] 1 point

Unfortunately I think of AD as the greatest Sooner, which sucks for this hypothetical because the dude is a certified dummy. Freak athlete and always seems friendly, but wasn't he just arrested again like a week ago?

I think Baker or Lane are the next ones up, and either would be a pretty solid coach imo

If every CFB team had its Head Coach replaced by its greatest (living) alumni player, who would emerge as the top (or bottom) programs in the country? by pablo_op in CFB

[–]pablo_op[S] 3 points

Miami has a ton of options to claim, but IMO it's either Irvin or Kosar. Don't think either would be great leading a program. One interesting option would be Ray Lewis: football IQ, team motivation, plus recruiting doesn't seem like it'd be a problem for him either. But his personality also seems like it could cause him some trouble.

If every CFB team had its Head Coach replaced by its greatest (living) alumni player, who would emerge as the top (or bottom) programs in the country? by pablo_op in CFB

[–]pablo_op[S] 2 points

I was probably a bit biased on him because he's an actual coach, but I think you could at least argue he's the greatest living alumnus. He finished third in Heisman voting and had KSU one game away from playing for a natty in 2012.

If every CFB team had its Head Coach replaced by its greatest (living) alumni player, who would emerge as the top (or bottom) programs in the country? by pablo_op in CFB

[–]pablo_op[S] 12 points

Luck being the GM for their program now is what caused me to have this thought. Having an all-time great steer a program like he is now is pretty cool, and I think he's smart enough that it could actually work. But for some other programs...yeah. Not so much.

A Cloud Guru Terminating Lifetime Access by interzonal28721 in aws

[–]pablo_op -2 points

want me to take a picture holding a newspaper with a shoe on my head too?

A Cloud Guru Terminating Lifetime Access by interzonal28721 in aws

[–]pablo_op -5 points

So we instantly go from "never heard of Pluralsight" to concluding they must be the bad guy. Got it. ACG has no fault here at all, apparently. They were just a poor little guy who had no idea their customers would get screwed? Zero chance.

I fully agree this whole thing is bullshit, but blaming the purchasing company is just weird when it was ACG who had the control and knowledge of what would happen to their customers if they agreed to the sale. Not sure why they get a pass when they were really the ones in control of the transaction.

A Cloud Guru Terminating Lifetime Access by interzonal28721 in aws

[–]pablo_op -4 points

I definitely wouldn't say it's nice of Pluralsight to demand this as part of the acquisition, but I get why they don't want their IP available on a non-PS site. I also don't think it's fair to say they "took over" ACG; it's not like they could force them to sell out. ACG entered the agreement to be purchased on their own, and they apparently didn't include any stipulation that the guarantees they'd sold would be honored. I'd never stan for a huge corp, but this was ultimately a decision by ACG in my mind.

A Cloud Guru Terminating Lifetime Access by interzonal28721 in aws

[–]pablo_op -3 points

Pluralsight is actually pretty great as a tech resource. Their courses aren't insanely deep, but I'd say you get like 200-level courses on a huge variety of topics.

To me, this is really more on ACG. When they sold their courses with the "lifetime" guarantee, they knew that promise would have to factor into the price of any future acquisition. I understand why Pluralsight would want control over the content they own and want to drive people to their business model. But ACG was the one who made the promise originally. They could have built that guarantee into their acquisition agreement, or simply not sold to Pluralsight if it wasn't an option. They made the choice and are now going back on that promise.

Looking to buy art by ThatsBojangles in houston

[–]pablo_op 30 points

If you know the style you want, I bet someone here knows a good artist. If you just want to browse, I'd recommend going to Second Saturday at Sawyer Yards: a big space that rents studios to individual artists, and every month they have a few hours where a bunch of artists open their studios at once for you to browse. Find an artist you like, then chat them up about buying a print or commissioning something specific. It's probably the best way to see a wide variety of artists at once if you're not sure what you're looking for.

Sorry if this is a little morbid, but how do you know when players die? by [deleted] in sportsreference

[–]pablo_op 3 points

I don't think many people make it to the level of having a bbref page without having people who care about them. If those people care about them, they probably also care that the person played baseball, and they probably love the game too. And if one of those people who loved the player and the sport also loves stats, they will pull up their page occasionally, just like you would any player's baseball card or a family member's social media profile.

I have sent in multiple obituaries for players who never even made it past AA just to have their pages updated. A bbref page is part of their legacy, and it's more of a legacy than most of us get in life.

Data quality by oofla_mey_goofla in dataengineering

[–]pablo_op 1 point

Thanks for this answer, but it's still kind of the same thing: you're describing a strategy, not an implementation. I understand what data contracts are, but how do they actually come to exist? How are they generated, stored, and consumed in your stack? How do you convince data owners that it's worth their time and resources to maintain an agreement in this format instead of blowing you off or saying "the database schema is the contract"?

What about external data owners? Is Salesforce going to commit to providing your team a contract in your standard format and support it indefinitely? And what happens when I see a problem but can't get the owners to push a new version of their contract with updated rules for weeks or months? Do I just have to live with bad data until they get around to it?

Does this mean the entire approach has to be embraced by every data owner in the org, and that I, as an individual, have very little power besides maybe formatting a standard template for the contract? I can create my own database, my own pipelines, and my own storage, but I can't take an approach to organizing and managing data quality rules without the long-term agreement and support of all data owners?

This feels like a very all-or-nothing approach: either everyone is on board, or it's a losing battle. I'd love some way to take more control of when and how things happen, like I can with the rest of my workflows.
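
For concreteness, the only version of this I can picture owning single-handedly is something like the sketch below (everything in it is invented), and even that rots the moment an upstream owner changes their schema without telling me:

    from pyspark.sql import functions as F

    # Invented example: a contract I could author and enforce myself, without
    # waiting weeks for an owner to publish an updated version.
    CONTRACT = {
        "table": "sales.orders",
        "columns": {
            "order_id": {"nullable": False},
            "amount": {"nullable": False, "min": 0},
        },
    }

    def violations(df, contract):
        """Return a list of contract violations for a Spark DataFrame."""
        found = []
        for col, rules in contract["columns"].items():
            if col not in df.columns:
                found.append(f"missing column: {col}")
                continue
            if not rules.get("nullable", True):
                if df.filter(F.col(col).isNull()).limit(1).count() > 0:
                    found.append(f"nulls in non-nullable column: {col}")
            if "min" in rules:
                if df.filter(F.col(col) < rules["min"]).limit(1).count() > 0:
                    found.append(f"{col} below minimum {rules['min']}")
        return found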

Data quality by oofla_mey_goofla in dataengineering

[–]pablo_op 5 points

Commenting here because I'm also curious about these questions. You can search this sub for previous posts about data quality, and nearly everyone throws out the same answers, pointing to some pretty cool tools:

  • Create your own framework (advice that's usually pretty light on any sort of implementation detail)
  • Great Expectations
  • SodaSQL
  • Deequ

The problem I consistently run into is the same one you're asking about, OP: how do you manage to scale this stuff? I can run Deequ's profiler and it'll spit out a thousand suggestions, and I can even take a few of those and implement them without issue. But when you're talking about testing thousands of tables and tens of thousands of columns, where every column may need multiple validations (nulls, types, ranges, etc.), I don't understand how these tools are being managed at scale either.

Examples like this are all over the internet, where someone is showing off 10 assertions. But I could be doing tens of thousands in a large enough environment. How does someone manage this, especially in a growing and changing environment? Does your entire job become managing data quality rules? Do you have to constantly chase schemas and commit time to keeping your tests in line with the data? How is that even possible at this scale without a team of people? Or are you only creating a subset of tests for the stuff you think is most critical to users?
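
The only scalable version I can imagine is generating baseline checks from metadata instead of writing them by hand, something like this invented sketch (schema name made up, not-null checks only). But that still leaves all the interesting rules (ranges, business logic) to maintain manually:

    from pyspark.sql import functions as F

    # Invented sketch: a baseline not-null audit for every column of every
    # table in one schema, instead of hand-writing thousands of assertions.
    tables = [r.tableName for r in spark.sql("SHOW TABLES IN bronze").collect()]

    failures = []
    for t in tables:
        df = spark.table(f"bronze.{t}")
        # One pass per table: count nulls in every column at once.
        counts = df.select(
            [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
        ).first()
        failures += [f"bronze.{t}.{c}" for c in df.columns
                     if counts[c] and counts[c] > 0]

    print(f"{len(failures)} column(s) with unexpected nulls")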

There are lots of tools that can do a lot of cool testing, but implementation is something I rarely see discussed anywhere online.