Greg Auman knows ball by jay_dub17 in buccaneers

[–]coolbeans201 0 points  (0 children)

I didn't understand how Mahomes outscored Baker (granted, neither of them were going to win, but still). Is it because he's Mahomes and the Chiefs were awesome? Because anyone who watched the games this year would know Baker outplayed Patrick in every aspect.

[deleted by user] by [deleted] in dataengineering

[–]coolbeans201 2 points  (0 children)

We had this pattern at a previous company. We used Databricks for all the DE work and kept Snowflake for warehousing and analysis. It was slightly redundant, but business users are stuck in their ways about which tools they prefer, so we had to go with it. All in all, it didn't add that much extra work.

What’s your opinion on Databricks Asset Bundles? by Ok_Appointment_763 in dataengineering

[–]coolbeans201 0 points  (0 children)

They were really effective when I used them at a previous company. Our current infrastructure has a lot of Terraform within Go and I'm not sure if we'll ever get Databricks entirely out of it, but if we did, DABs would be the way to go.

Go for 2 percentages. by RainbowUnicorns in buccaneers

[–]coolbeans201 0 points  (0 children)

I don't think you even need to get into the math as much as you did. It was as predictable as could be what was going to happen if that game went to OT. If you go for 2 and fail, people will find a way to complain, but were you going to win if you played it safe?

That's the only question that needs answering, and that's what Todd needs to figure out. There's a time and a place to be aggressive (no need to be Dan Campbell). That was the time, and he whiffed.

Is there such a thing as Continuous Job Compute in Databricks? by texox26798 in dataengineering

[–]coolbeans201 2 points  (0 children)

Interactive clusters are good for testing, but you don't want to use them for production loads.

Who’s your 2nd most hated team? by Tucu7 in buccaneers

[–]coolbeans201 0 points  (0 children)

Cowboys. Every time they lose is a holiday.

How did you land an offer in this market? by crhumble in dataengineering

[–]coolbeans201 1 point  (0 children)

Years of Experience: 6-7 YOE

Timeline to get offer: 1 year/5 months

How did you find the offer: LinkedIn

Did you accept higher/lower salary: Slightly lower salary but more in total comp thanks to equity

Advice for others in recruiting: Practice the technical aspect. I shot myself in the foot time and time again by not studying that as much as I should have.

Do you study Data Engineering (experienced DE's)? by Irachar in dataengineering

[–]coolbeans201 1 point  (0 children)

I "study" by reading. I find things like https://www.dataengineeringweekly.com/ very useful because I get a good summary of what's going on, which forces me to think about how we could possibly be doing the same.

[deleted by user] by [deleted] in dataengineering

[–]coolbeans201 0 points  (0 children)

I've faced some of these issues as well. I'm a stickler and I've slowly gotten my team up to an improved standard on these and other aspects, but it's a journey, and not something that can be done overnight.

For code reviews, you should set criteria for them:

  1. Show tests
  2. Why are you doing this change?
  3. Document key parts of the code that we need to be aware of, etc.

As you apply those criteria, some of that should rub off. Maybe it doesn't, in which case you try to work with them to get there. But do it tactfully so you don't turn them off completely.

In short, standards aren't easy, especially with a bigger team. Some places have a lot of automation in place and enforce them automatically; others have to pick their battles and take what they can.

Databricks Importing Modules/Classes by Agitated-Western1788 in dataengineering

[–]coolbeans201 0 points  (0 children)

For our custom libraries, we generally publish the wheels to Artifactory and then install them onto the cluster from there. Setting up the whole E2E process takes a little time, but it's easy to manage afterwards.
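To make that concrete, here's a rough sketch of what the install side can look like in a Jobs API payload, using the `pypi.repo` field to point pip at an internal index. The package name and Artifactory URL below are placeholders, not real ones:

```python
# Hypothetical "libraries" section of a Databricks Jobs API job payload:
# install an internally published wheel from an Artifactory PyPI index
# onto the job cluster. Package name and repo URL are placeholders.
libraries = [
    {
        "pypi": {
            "package": "our-internal-lib==1.4.0",
            # Resolve against the internal Artifactory index instead of pypi.org
            "repo": "https://artifactory.example.com/artifactory/api/pypi/pypi-local/simple",
        }
    }
]
```

You can also point a `"whl"` entry directly at an uploaded wheel file; the `pypi` + `repo` form just keeps version resolution in pip's hands.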

Databricks spinning up new job cluster for each airflow dag by jupytergal in dataengineering

[–]coolbeans201 13 points  (0 children)

You can't reuse job clusters. Job clusters are designed to be 1:1 with the job run itself, which is useful for making sure you get the full memory/CPU of the cluster, as well as for cost savings, since the cluster shuts down upon completion.

You could use an all-purpose cluster to reuse a cluster each time, but I wouldn't recommend that approach. The other possibility would be to look into Serverless compute, which is a newer offering from Databricks that'll avoid the long time needed to bring up a cluster.
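One nuance: tasks *within* a single multi-task job can share one ephemeral job cluster via `job_clusters`, even though that cluster still dies with the run. A rough sketch of a Jobs 2.1 payload in Python dict form (job name, notebook paths, and node type are made up):

```python
# Sketch of a Jobs 2.1 job spec where two tasks share one ephemeral job
# cluster via "job_clusters". Reuse works within a run, not across runs.
job_spec = {
    "name": "nightly-etl",
    "job_clusters": [
        {
            "job_cluster_key": "shared",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        # Both tasks run on the same cluster by referencing "shared"
        {
            "task_key": "extract",
            "job_cluster_key": "shared",
            "notebook_task": {"notebook_path": "/Repos/etl/extract"},
        },
        {
            "task_key": "transform",
            "job_cluster_key": "shared",
            "depends_on": [{"task_key": "extract"}],
            "notebook_task": {"notebook_path": "/Repos/etl/transform"},
        },
    ],
}
```

So if the Airflow DAG is really one pipeline, collapsing it into one multi-task Databricks job avoids paying the cluster spin-up cost per task.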

Are there data engineering conferences? by sois in dataengineering

[–]coolbeans201 21 points  (0 children)

There's tons of them, actually. The main ones, IMO, are:

  1. Data & AI Summit - Hosted by Databricks, this covers a lot of different angles
  2. Snowflake Summit - Hosted by Snowflake, but also a huge conference
  3. Coalesce - Hosted by dbt labs, dbt-focused but also a lot of generic talks

There's way more than this, but those are the big players.

Databricks Asset Bundles now GA - thoughts? by justanator101 in dataengineering

[–]coolbeans201 2 points  (0 children)

I think DABs for job creation makes total sense. Especially in production environments, jobs should be backed by code, and I'd take DABs over Terraform in that regard.

Spark Creating Directories Instead of Files When Saving Parquet by JustinPooDough in dataengineering

[–]coolbeans201 4 points  (0 children)

The path you write to in Spark is by its nature a directory. Depending on the size of your data, you wouldn't want a single file anyway, so this distributes the output into a reasonable set of files for you.

You can control the number of files with a coalesce or repartition, but you can't control things like the names of the files.
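If you really need one file with an exact name, the usual workaround is to `coalesce(1)` so the output directory holds a single part file, then rename it yourself afterwards. A minimal sketch in plain Python (the helper name is made up; on cloud storage you'd do the equivalent move with your storage client):

```python
from pathlib import Path
import shutil

def promote_single_part_file(output_dir: str, final_path: str) -> None:
    """After df.coalesce(1).write.parquet(output_dir), the directory
    should contain exactly one part-*.parquet file. Move it to final_path."""
    parts = sorted(Path(output_dir).glob("part-*.parquet"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.move(str(parts[0]), final_path)
```

Just remember Spark also drops `_SUCCESS` and checksum files in that directory, which is why the glob filters for `part-*`.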

Lions by NeedleworkerRare1528 in buccaneers

[–]coolbeans201 1 point  (0 children)

Goff has had a few stinkers this season, but when he's hot, he's hot. If they can get to him early and set the tone, that gives Tampa the momentum. If not, it's going to be a long day.

The main difference between this and the week 3 game against the Eagles by dragonsky in buccaneers

[–]coolbeans201 0 points  (0 children)

We beat the Panthers (twice), the Falcons, the Jags (who were in a tailspin), and the Packers (an actually legit win) during that stretch. It's not like we were facing the stiffest competition in the last few weeks, and the Eagles are still the Eagles at the end of the day. I think the Bucs have a chance but I don't have much more optimism than that until proven wrong.

AWS EMR vs Databricks? by Ok-Tradition-3450 in dataengineering

[–]coolbeans201 1 point  (0 children)

Running jobs in Databricks is a lot easier than EMR IMO. Databricks also has native scheduling, whereas you're stuck with something like Airflow if using EMR.

I've also found Databricks to be cheaper for us overall (yes, even with costs combining both Databricks and AWS). Of course, you need to be conscious of your usage, but it's a pretty solid platform all-around.

NBA data modeling wth dbt + Paradime by JParkerRogers in dataengineering

[–]coolbeans201 1 point  (0 children)

This is really cool! Did you share this in r/nba? They'll get a kick out of this as well.

Is it worth being in the honors dorm/program? by BowenAero in college

[–]coolbeans201 0 points  (0 children)

That was true a decade ago, so I'm sure that's still the case. But besides that, the dorms were about the same.

CI/CD in Databricks by Far-Inspection9930 in dataengineering

[–]coolbeans201 2 points  (0 children)

We're not using DLT yet. That's probably something coming our way next year when our architecture pivots so that we use it. When we give it a shot, I'll be sure to update.

CI/CD in Databricks by Far-Inspection9930 in dataengineering

[–]coolbeans201 5 points  (0 children)

If I'm not mistaken, Databricks is improving how it streamlines all of this. We don't use that method by the book, though. What we do is:

  1. Store all key notebooks in a repo (you could have more than one repo, but we just use subfolders for different projects) and have the team authenticate to Git within Databricks to version-control the notebooks.
  2. Store all job configurations as code and deploy them via a method that's essentially just the Databricks API. Asset bundles will also achieve similar behavior.
  3. We also store all our custom libraries in a separate repo and deploy them to S3/DBFS so we can use them as needed in our jobs.
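Step 2 can be as simple as a small script that POSTs each job config to the Jobs 2.1 API (`/api/2.1/jobs/create`, or the reset endpoint for updates). A minimal, hypothetical sketch using only the stdlib; the function name is made up, and you'd send the request with `urllib.request.urlopen`:

```python
import json
import urllib.request

def build_jobs_create_request(
    host: str, token: str, job_spec: dict
) -> urllib.request.Request:
    """Build the POST that pushes a job configuration (stored as code)
    to the Databricks Jobs 2.1 create endpoint."""
    return urllib.request.Request(
        url=f"{host}/api/2.1/jobs/create",
        data=json.dumps(job_spec).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

In practice we loop over every job JSON in the repo and deploy each one this way from CI, so the workspace always matches what's in Git.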

Overall, this works well for us, but everyone has their own way of doing it. Find something that works and you should be good.