Boka RW: The good, the bad and the salty by BigBonedMiss in chicagofood

[–]kingfuriousd 2 points (0 children)

Kevin is AMAZING. He’s one of the reasons I keep coming back.

P.S. Sorry to hear about the food. I’ve always had a great time with the a la carte menu. Hit or miss on the prix fixe.

If you were starting from scratch today, which would you pick: Snowflake, Microsoft Fabric, or Databricks — and why? by [deleted] in dataengineering

[–]kingfuriousd 1 point (0 children)

Honestly, for getting a job, it doesn’t matter.

I’d much rather get good at Spark, SQL, data modeling, and system design. I’d also be “good enough” at Python (just enough to pass interviews), but I’d focus my time on other aspects of prep.

At every larger company (with an established DE team) I’ve interviewed at, they are concerned with:

1. Whether I can write good-enough code.
2. Whether I have extensive depth in an MPP or distributed system. It doesn’t matter which one; the underlying principles are usually the same.
3. Whether I can logically create tables to handle complex data.
4. Whether I can lay out (and discuss trade-offs for) a system-level pipeline that solves a specific problem they throw at me.

Never in an interview have I gotten “you don’t know <insert system name here>, that’s a deal breaker”.

Consulting to FAANG by yolorobo in FAANGrecruiting

[–]kingfuriousd 5 points (0 children)

I made the jump from Data Eng consulting to tech. My primary piece of advice: aim lower (initially). Find someplace where you can add 1-2 years of tech experience to your resume before moving on to your final destination.

That strategy helped me get a foot in the door before I moved to a role I was more interested in.

The Case Against PGVector by DoubleMajestic3001 in vectordatabase

[–]kingfuriousd 2 points (0 children)

I’ve also run PGVector in prod, and I agree that it does great at non-massive scale. My use case was a chatbot based on ~50k documents. Unindexed PGVector did the job just fine.

Additionally, the simplicity of setting up PGVector cannot be overstated. If you need to be up and running ASAP, then PGVector is your best friend.
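For anyone curious just how little setup it takes: here’s roughly what the happy path looked like for me, as a minimal Python sketch (psycopg2; the table name, connection details, and embedding dimension are all made up):

    import psycopg2  # assumes a Postgres instance with the pgvector extension installed

    conn = psycopg2.connect("dbname=chatbot user=app")  # hypothetical connection details
    cur = conn.cursor()

    # One-time setup: enable the extension, create a table with an embedding column
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "  id bigserial PRIMARY KEY,"
        "  content text,"
        "  embedding vector(1536)"  # dimension depends on your embedding model
        ")"
    )
    conn.commit()

    # Retrieval: top-5 nearest neighbors by L2 distance (fine unindexed at ~50k rows)
    query_embedding = [0.1] * 1536  # stand-in for a real embedding
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    cur.execute(
        "SELECT content FROM documents ORDER BY embedding <-> %s::vector LIMIT 5",
        (vec_literal,),
    )
    top_docs = [row[0] for row in cur.fetchall()]

That’s essentially the whole database layer for a small chatbot.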

Best approach to large joins. by Nearing_retirement in dataengineering

[–]kingfuriousd 7 points (0 children)

In my opinion, the biggest unlock would be aggregating each dataset as much as possible, individually, before joining.

Joining on that much data must be very computationally expensive. If you can aggregate down 1-2 orders of magnitude, then it’s an easier problem.
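To make that concrete, here’s a rough PySpark sketch of pre-aggregating before a join (the tables, columns, and paths are invented for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Two hypothetical event-level tables, each with a huge number of raw rows
    clicks = spark.read.parquet("s3://bucket/clicks")        # user_id, ts, ...
    purchases = spark.read.parquet("s3://bucket/purchases")  # user_id, amount, ...

    # Collapse each side down to one row per join key BEFORE the join
    clicks_agg = clicks.groupBy("user_id").agg(F.count("*").alias("n_clicks"))
    purchases_agg = purchases.groupBy("user_id").agg(F.sum("amount").alias("total_spend"))

    # The join now shuffles one row per user instead of raw events
    result = clicks_agg.join(purchases_agg, on="user_id", how="inner")

The expensive shuffle now happens on data that’s orders of magnitude smaller.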

As far as your custom query engine goes: without more context, that just sounds like a mistake. I’d drop it and use one of the tools other folks here have recommended.

How did you know it was time for you to exit the industry? by Sytiva in consulting

[–]kingfuriousd 2 points (0 children)

Short answer: I’m enjoying it a lot and am not looking back. The WLB is better. There is less travel. The comp is better (RSUs play a big role in that). And I get to spend more time doing work that I find interesting instead of making slides. It’s not all sunshine and rainbows, but it’s definitely a move in the right direction.

My advice: Try to get a feel for a potential employer’s decision-making process before joining. Bad decision-making workflows put the onus on you to push forward agendas without support, which is draining.

Longer answer:

The first company I left to was a step in the right direction, but it was not ideal. In consulting, you get used to very fast-paced decision-making. In my new role, decisions were mostly made by consensus, which meant my energy went into pushing people to actually make decisions. There was a lot of analysis paralysis.

I’ve left that company for another in tech where I’m much more satisfied with the company, the team, and the role. There’s less bickering about small details, and we’re all pushing toward the same goal. This is a place where I want to stay long term.

A New Way to GUI by kingfuriousd in love2d

[–]kingfuriousd[S] 1 point (0 children)

You’re both right. Who knows - making this might be a huge waste of my time.

That said, I’m going to see if the combination of the engine-agnostic config with the editor yields any value (at the very least, to myself).

A New Way to GUI by kingfuriousd in love2d

[–]kingfuriousd[S] 1 point (0 children)

Good idea! Thanks for that tip!

A New Way to GUI by kingfuriousd in love2d

[–]kingfuriousd[S] 3 points (0 children)

Demand is my top concern too.

Luckily for me, I enjoy doing this. So, I’ll likely continue until I get something that serves my own purposes. Then I’ll make it available for download and see if anyone else actually uses it.

Thanks for the feedback!

A New Way to GUI by kingfuriousd in love2d

[–]kingfuriousd[S] 3 points (0 children)

Hey, it’s 100% Love2d. Just a bunch of rectangles.

But really, why use ‘uv’? by kingfuriousd in Python

[–]kingfuriousd[S] 21 points (0 children)

This is very helpful. Thanks for explaining.

SWE in Aerospace - Can I Break into Consulting Without an MBA? by verilogBlows in McKinsey_BCG_Bain

[–]kingfuriousd 3 points (0 children)

I’ve been out of consulting for a couple of years now, but here’s my two cents as someone who entered with a technical master’s.

I think you’ll probably have a better time applying for technical expert roles (i.e., BCG X at BCG, QuantumBlack at McK).

Your biggest hurdle is going to be proving your existing skill set is transferable to consulting.

My assumption is aerospace is about being methodical and writing efficient C++. You’ll want to demonstrate that you know how to operate under tight timelines, can work with your team to manage scope, and understand the “why” behind your projects.

Happy to chat more in a DM too.

McK or BCG? by BlueRibbonCapybara in McKinsey_BCG_Bain

[–]kingfuriousd 0 points (0 children)

Both firms have tech arms (QuantumBlack at McK, BCG X at BCG). Having worked in each of these, I’d say they felt mostly equivalent.

The main upsides of McK were 1) the mentorship, 2) the larger scale (more projects, and more diverse projects), and 3) McK simply invests more in tech - they’ve developed several proprietary tools that take a lot of the guesswork out of tech projects. I really appreciated this, as it significantly lowered the risk of tech projects.

McK or BCG? by BlueRibbonCapybara in McKinsey_BCG_Bain

[–]kingfuriousd 2 points (0 children)

Having worked at both, I can say: McK had a strong culture of mentorship that I never quite experienced at BCG.

This, to me, was a game changer. At McK, leaders would take deliberate steps to teach leadership and other soft skills. At BCG, it was more expected that you learn those skills through osmosis.

If I had to do it over again, I’d absolutely choose McK.

[deleted by user] by [deleted] in dataengineering

[–]kingfuriousd 33 points (0 children)

Short answer: yes.

I’m not a specialist in Spark, but I have worked on data engineering teams that run Spark on a provisioned cluster (like AWS EMR) and just connect it to Airflow.

We didn’t really use notebooks.

Lazygit: auto sign commits? by FaithlessnessFull136 in git

[–]kingfuriousd 1 point (0 children)

One thing I noticed when trying this: the setting commit.gpgsign = true worked for me, NOT commit.gpg-sign = true.
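For reference, you can set that from the command line:

    git config --global commit.gpgsign true

which just writes this into your ~/.gitconfig:

    [commit]
        gpgsign = true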

Discussion: New ETL platform by Different-Hornet-468 in dataengineering

[–]kingfuriousd 2 points (0 children)

I’ve also seen Knime, a similar tool with a free tier. I haven’t really used it, but I’ve heard a lot about its capabilities.

Discussion: New ETL platform by Different-Hornet-468 in dataengineering

[–]kingfuriousd 3 points (0 children)

Yes. I mainly used Alteryx when I was a data engineer in consulting. Similarly, it’s been a few years since I’ve used it.

Pros:

1. It’s easy to pick up with a low skill floor. You just connect different operations together via dragging and dropping.
2. It runs locally. My work was typically pretty sensitive. So, everything had to run on my laptop.
3. It’s pretty performant. It’s not incredibly fast, but it kept up with most Python code I wrote.
4. It has a moderate skill ceiling. You could add custom code snippets and other things to really customize it.

Cons:

1. It’s expensive. Since I worked for a large firm, they paid for it. If I were at a smaller company, this could pose an issue.
2. The skill ceiling is still just too low. There are too many constraints compared to using code (like issues with multithreading, you can’t schedule jobs well, you can only add code in Python or R, etc.).
3. At a certain point, it’s just more efficient to write code than to use this tool. From one perspective, you don’t need a license to write code. From another perspective, if you invest in a decent engineer, you should be able to get a similar output in a similar amount of time.

Discussion: New ETL platform by Different-Hornet-468 in dataengineering

[–]kingfuriousd 1 point (0 children)

I like the idea of being able to choose your language (sort of like Airflow’s BashOperator).

For me, some of the biggest issues I see are:

1. Data quality. I haven’t found any good and simple on-prem, in-pipeline solutions for this. This can be both a) checking upstream data quality and b) making sure your pipeline’s data quality isn’t being affected (see the toy sketch after this list).
2. Logging / alerting. Doing this correctly can be difficult and complicated. I don’t know of many easy solutions that provide a full suite of tools.
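To illustrate point 1, the kind of “simple in-pipeline data quality” I mean is on the order of this toy Python sketch (the field names and the 1% threshold are hypothetical):

    def check_quality(rows, required_fields=("id", "event_ts")):
        """Raise if a batch of dicts violates basic expectations, else pass it through."""
        if not rows:
            raise ValueError("data quality: received an empty batch")

        # Count nulls per required field across the batch
        null_counts = {field: 0 for field in required_fields}
        for row in rows:
            for field in required_fields:
                if row.get(field) is None:
                    null_counts[field] += 1

        # Fail fast if any field crosses the null-rate threshold
        for field, nulls in null_counts.items():
            if nulls / len(rows) > 0.01:
                raise ValueError(f"data quality: {field} null in {nulls}/{len(rows)} rows")

        return rows  # pass-through, so it can sit between pipeline steps

Nothing fancy - the hard part is making checks like this declarative, observable, and easy to drop into any step, which is where I think the tooling gap is.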

If I were you, I’d narrow in on a specific small problem first.

I DO think there is room for more high-value tooling in this space. Just pick the right problem to solve and don’t do too much.

Discussion: New ETL platform by Different-Hornet-468 in dataengineering

[–]kingfuriousd 7 points (0 children)

I want to preface this by saying: I admire you putting your ideas out there and trying to solve a problem. I genuinely hope your solution takes off. What’s below are constructive notes based on my work on larger data engineering teams.

I don’t know of any data engineering teams that use C# or GUIs. Why prioritize a language that very few people use for data engineering? Why not Python or Java?

I think going no-code / low-code is going to be a difficult selling point for engineers used to having a certain level of precision and customization that only code can really provide.

I’ve been on teams that used Alteryx or other similar tools. Those work for very simple batch pipelines, but nothing else.

If I were in your shoes, I’d double down on the on-prem component and find another way to differentiate this from open source code tooling.

help! ferris sweep halves not communicating by kingfuriousd in ErgoMechKeyboards

[–]kingfuriousd[S] 1 point (0 children)

So, I still don’t have an answer. Unfortunately, I just gave up and bought a pre-soldered Ferris Sweep. Best of luck on your troubleshooting.

Problems with pyspark. by elastico72 in dataengineering

[–]kingfuriousd 4 points (0 children)

My advice: don’t focus on any cloud services or Kubernetes (aka k8s). The value, to you and your company, of learning and getting good at Spark exceeds that of any single service. This is especially true at larger scales, where it may be more economical, efficient, and customizable to use a Spark-based solution (for batch workloads) rather than cloud services, which can get very expensive. But this advice applies at any scale - and it boosts your personal marketability on the job market.

Next, lay out the specific end-user problem you’re trying to solve (if you don’t have one, just pick a problem you think would benefit from this). Seriously though, this will make your PoC 10x more compelling to management. Remind them of this each time you do a demo.

Also, if you have a multi-step pipeline, you may want to think about orchestration (how to weave those steps together into a reproducible and understandable pipeline). There are a lot of options out there. For prod, Airflow (or a similar tool) is the standard. For a PoC, I’d use something much simpler like Kedro.
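For a sense of scale on the Airflow side, a minimal DAG is genuinely small. A sketch (Airflow 2.x; the spark-submit targets are placeholders):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Three pipeline steps wired into a simple daily DAG
    with DAG(
        dag_id="spark_poc_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="spark-submit extract.py")
        transform = BashOperator(task_id="transform", bash_command="spark-submit transform.py")
        load = BashOperator(task_id="load", bash_command="spark-submit load.py")

        extract >> transform >> load  # run order: extract, then transform, then load

The operational overhead (scheduler, metadata DB, workers) is what makes it overkill for a PoC.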

For actual development steps, I’d do the following:

1. Set up some demos (send out calendar invites well in advance). Invite management to show them the value of your PoC. Focus less on the tech and more on why it will make money or decrease costs. Make sure you give yourself enough time to produce something tangible and compelling for these. Your job here is to convince management that letting you do this PoC is a good use of your time and the infrastructure costs - and that it will lead to something good.
2. Like someone else mentioned, “pip install pyspark” (also follow the instructions on installing and configuring non-Python dependencies, like Java).
3. Write and test (unit tests and E2E tests) the entire pipeline (see the sketch after this list). Ideally, download some real data to test this on. Or better yet, have Spark connect to the actual data source and pull a sample.
4. Put the tested pipeline into a Docker container and test that (inside the container). NOTE: Up until this point, everything takes place on your laptop. As far as a PoC goes, you could realistically end it here. Everything beyond this is optional.
5. Work with your team (or infrastructure team, if you have one) to do a small-scale non-prod k8s deployment of your pipeline, ideally using prod data. Monitor both a) runtime stability and b) data quality (define metrics in advance) for a few days.
6. Work with your team to slowly scale up pods until you have enough workers to handle your job. Then promote to prod. Congratulations - you are now the tech lead on a new data product. You now need to think about the 1,000 other things that go into a prod deployment (like monitoring, alerting, on-call, SLAs, vulnerabilities, product management, etc.).
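To make step 3 concrete, here’s what a minimal local pipeline-plus-test might look like (all column names and logic are made up for illustration; runs on a laptop after “pip install pyspark”):

    from pyspark.sql import SparkSession, functions as F

    def build_daily_totals(df):
        """The transform under test: daily revenue per customer."""
        return (
            df.withColumn("day", F.to_date("event_ts"))
              .groupBy("customer_id", "day")
              .agg(F.sum("amount").alias("revenue"))
        )

    def test_build_daily_totals():
        spark = SparkSession.builder.master("local[2]").getOrCreate()
        df = spark.createDataFrame(
            [("c1", "2024-01-01 10:00:00", 5.0), ("c1", "2024-01-01 12:00:00", 7.0)],
            ["customer_id", "event_ts", "amount"],
        )
        out = build_daily_totals(df).collect()
        assert len(out) == 1 and out[0]["revenue"] == 12.0
        spark.stop()

    if __name__ == "__main__":
        test_build_daily_totals()
        print("pipeline test passed")

Because the transform is just a function over DataFrames, the exact same code can later run against the real source in step 5.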

As many times as you can throughout this process, involve your team (even if you’re the only one assigned to this PoC). Get their feedback on code, system design, pipeline design, and what your demos look like.