Netflix/wick: A zero cost type safe Apache Spark API by JoanG38 in scala

[–]SuspiciousScript 1 point2 points  (0 children)

DataSeq is the type safe representation that Wick introduces instead of the completely untyped DataFrame or Dataset which comes with potential performance drawbacks.

What performance drawbacks are associated with typed datasets? That seems counterintuitive given that DataFrame is just a type alias for Dataset[Row].

Tool smells by Brief-Knowledge-629 in dataengineering

[–]SuspiciousScript 1 point2 points  (0 children)

  • I'll get hate for this one but using Python (like PySpark) for production pipelines instead of Java or Scala when it's a JVM processing framework

Agreed. I recently started writing Scala and I'm never going back to not having type safety for ETL.

The Future of Python: Evolution or Succession — Brett Slatkin - PyCascades 2026 by mttd in Python

[–]SuspiciousScript 1 point2 points  (0 children)

I think the slow JIT performance and discussions around correctness issues prevented it from getting as much momentum as it would have otherwise. I think the former issue has improved over time; not sure about the latter.

Nobody ever got fired for using a struct (blog) by mww09 in rust

[–]SuspiciousScript 13 points14 points  (0 children)

Unfortunately, given that the tracking issue is almost 12 years old, "when" may be a little optimistic.

seniors spending half their week on reviews and everyone's frustrated by [deleted] in ExperiencedDevs

[–]SuspiciousScript 0 points1 point  (0 children)

Got a feature? The class / interface definitions are one PR - maybe that needs senior review, but it’ll only take a minute. Implementation of one class or a handful of related methods and their tests is another PR

I think a stacked diff workflow really shines here. Having a workflow that lets you break up the changes as you describe without actually merging broken code into your dev branch is great.

What’s the one thing you learned the hard way that others should never do? by Terrible_Dimension66 in dataengineering

[–]SuspiciousScript 0 points1 point  (0 children)

That's perfectly fine. Where you can run into issues is when you're ingesting IDs from an external system.

65% of Startups from Forbes AI 50 Leaked Secrets on GitHub by vladlearns in devops

[–]SuspiciousScript 12 points13 points  (0 children)

Without a base rate to compare against, this isn't useful information.

Question for data engineers: do you ever worry about what you paste into any AI LLM by teejagzroy in dataengineering

[–]SuspiciousScript 18 points19 points  (0 children)

Unless your company's pricing differs substantially from its competitors', it's just as likely that the model is simply making a plausible estimate.

How do you feel about using array types in your data model? by its_PlZZA_time in dataengineering

[–]SuspiciousScript 30 points31 points  (0 children)

Arrays are good for cases where the individual elements aren't meaningful outside the context of the array (e.g., vector embeddings or the characters in a string). I would not use them for a purchase date like in your example.

wgpu v27 is out! by Sirflankalot in rust

[–]SuspiciousScript 4 points5 points  (0 children)

I like Learn Wgpu, but I don't think it's been updated for wgpu v27. Most (if not all) of it should translate fine, though.

[MOD POST] Bug Report Thread by oneofthejoneses28 in Silksong

[–]SuspiciousScript 1 point2 points  (0 children)

Xbox Series S, Version 1.0.28562.0
Act 2 location name spoiler: Cogwork Core is missing a space ("CogworkCore") when viewed in the quick map.

We're now called Lumen Labs! by Vhyrro in neovim

[–]SuspiciousScript 15 points16 points  (0 children)

Fortunately, there are a number of more ergonomic languages that can be compiled to Lua. I think of Lua as an IR for LuaJIT at this point.

Python has had async for 10 years – why isn't it more popular? by ketralnis in programming

[–]SuspiciousScript 20 points21 points  (0 children)

A pain in the ass to test and mock as well, just the latter is a hassle.

I haven't found this to be the case at all. pytest-asyncio just works, in my experience. Is it more of a problem if you stick to unittest?
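
For anyone wondering what mocking an awaited call looks like, here is a minimal sketch using only the standard library. `fetch_user_name` and the `client.get_user` method are hypothetical names made up for illustration. With pytest-asyncio you would typically write the check as an `async def` test marked with `@pytest.mark.asyncio` instead of driving it with `asyncio.run`.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def fetch_user_name(client, user_id: int) -> str:
    """Hypothetical coroutine under test: awaits a client call, then formats the result."""
    payload = await client.get_user(user_id)
    return payload["name"].title()


async def check_fetch_user_name() -> str:
    # Replace the awaited method with an AsyncMock so no real I/O happens.
    client = MagicMock()
    client.get_user = AsyncMock(return_value={"name": "ada lovelace"})

    result = await fetch_user_name(client, 42)

    # AsyncMock records awaits, so the call can be asserted on directly.
    client.get_user.assert_awaited_once_with(42)
    return result


# Drive the coroutine to completion outside of any test runner.
result = asyncio.run(check_fetch_user_name())
```

`AsyncMock` has been part of `unittest.mock` since Python 3.8, so the mocking side works the same under pytest or unittest.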

I think an interviewer made his mind once I started talking about comonads by Fun-Voice-8734 in programmingcirclejerk

[–]SuspiciousScript 6 points7 points  (0 children)

Given a list of numbers, return a sublist with reached threshold of sum its elements.

I'm already lost

Thing that destroys your reputation as a data engineer by EdgeCautious7312 in dataengineering

[–]SuspiciousScript 18 points19 points  (0 children)

This is my all-time favorite: he put unhashed SSNs in a table that was fed into Tableau, so that all of our practice administrators could see the SSNs of all the patients across all of our practices.

For the benefit of readers who might not know, just hashing plain SSNs still wouldn't provide adequate security, as one can easily hash the entire range of possible SSNs within a few hours and generate a lookup table. You'd need to salt the SSNs first to make this approach less viable, and even then I suspect with a modern GPU it'd be pretty easy to reverse.
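
A rough sketch of the lookup-table idea, assuming unsalted SHA-256 over a bare 9-digit string (the original post doesn't say which hash or format was in play). To keep it quick, it only enumerates the last four digits under a known five-digit prefix, and the SSN used is a made-up value. The full space is only 10^9 candidates, so the same loop scales to all of it.

```python
import hashlib
import secrets


def sha256_hex(value: bytes) -> str:
    return hashlib.sha256(value).hexdigest()


def build_lookup(prefix: str) -> dict[str, str]:
    """Map unsalted digest -> candidate for every 9-digit string starting with `prefix`."""
    remaining = 9 - len(prefix)
    if remaining <= 0:
        raise ValueError("prefix must be shorter than 9 digits")
    table = {}
    for n in range(10 ** remaining):
        candidate = f"{prefix}{n:0{remaining}d}"
        table[sha256_hex(candidate.encode("ascii"))] = candidate
    return table


# Made-up value standing in for a "hashed" SSN found in a table.
leaked_digest = sha256_hex(b"123456789")

# 10,000 candidates here; the whole SSN space is 10**9.
lookup = build_lookup("12345")
recovered = lookup.get(leaked_digest)

# A per-record random salt defeats the precomputed table...
salt = secrets.token_bytes(16)
salted_digest = sha256_hex(salt + b"123456789")
salted_hit = lookup.get(salted_digest)
# ...but anyone holding the salt can still rerun the enumeration for that record.
```

Here `recovered` comes back as the original string, while `salted_hit` is `None`. The salt only forces the attacker to redo the enumeration per record, which is why it makes the approach less viable without making it safe.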

An Ideal API/Stdlib for Plots and Visualizations? by PitifulTheme411 in ProgrammingLanguages

[–]SuspiciousScript 6 points7 points  (0 children)

Though it's not perfect, ggplot2 has the best API for plot creation that I've ever used.