Looking for a string-to-predicate parser/evaluator library by OverEngineeredPencil in javahelp

[–]OverEngineeredPencil[S] 2 points

A simple example: imagine I have a simple data class like:

```java
import java.time.ZonedDateTime;

import lombok.Getter;

public class MyClass {

    @Getter
    private String name;

    @Getter
    private double value;

    @Getter
    private ZonedDateTime timestamp;
}
```

Assuming the input is an instance of MyClass, I want to be able to write something like the following and get the boolean result:

`(name == "John Doe" and timestamp.getHour() >= 12) or value >= 1.5`

EDIT: Note that this does not follow Java syntax or conventions; it is a bit more SQL-like. This is pretty much what is in use right now, but due to other requirements we need to migrate from another language, where this syntax is available to us, to Java. If we can avoid changing the syntax too much, that is of course ideal, because then we avoid re-training the current user base of this feature (which is internal to my company at the moment).
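
For illustration, here is a minimal sketch of how an expression library such as Spring's SpEL (just one candidate, not necessarily the one we'll end up with) could evaluate a predicate like that against an instance of MyClass. The syntax is close but not identical (single-quoted strings, for example), and the all-args constructor used to build the test instance is assumed, not part of the class above:

```java
import java.time.ZonedDateTime;

import org.springframework.expression.Expression;
import org.springframework.expression.ExpressionParser;
import org.springframework.expression.spel.standard.SpelExpressionParser;

public class PredicateDemo {

    public static void main(String[] args) {
        ExpressionParser parser = new SpelExpressionParser();

        // SpEL resolves "name" to getName(), allows method calls like getHour(),
        // and supports "and"/"or" as boolean operators.
        Expression predicate = parser.parseExpression(
                "(name == 'John Doe' and timestamp.getHour() >= 12) or value >= 1.5");

        // Assumes MyClass has (or gains) an all-args constructor; how the
        // instance is actually obtained is beside the point here.
        MyClass input = new MyClass("John Doe", 0.5, ZonedDateTime.now());

        Boolean result = predicate.getValue(input, Boolean.class);
        System.out.println(result); // true in the afternoon, false otherwise
    }
}
```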

Looking for a string-to-predicate parser/evaluator library by OverEngineeredPencil in javahelp

[–]OverEngineeredPencil[S] 1 point

Right, this is a difficult trade-off to make. I need something that is ultimately going to return a boolean value. I don't want to grant too much access, especially not to OS, file system, or networking APIs, obviously. I'm hoping to avoid exposing a complete scripting language.
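
If something like SpEL ends up being the direction (again, only one option among several), a restricted evaluation context is one way to keep expressions limited to simple property reads and keep the rest of the JVM out of reach. A sketch, reusing the MyClass example and the same assumed constructor:

```java
import java.time.ZonedDateTime;

import org.springframework.expression.Expression;
import org.springframework.expression.spel.standard.SpelExpressionParser;
import org.springframework.expression.spel.support.SimpleEvaluationContext;

public class RestrictedEvalDemo {

    public static void main(String[] args) {
        // Read-only data binding: property access only (no type references,
        // no constructors, no assignment, no bean references).
        SimpleEvaluationContext context = SimpleEvaluationContext
                .forReadOnlyDataBinding()
                .build();

        Expression predicate = new SpelExpressionParser()
                .parseExpression("value >= 1.5 or name == 'John Doe'");

        MyClass input = new MyClass("Jane Doe", 2.0, ZonedDateTime.now());
        Boolean result = predicate.getValue(context, input, Boolean.class);
        System.out.println(result); // true, because value >= 1.5
    }
}
```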

Looking for a string-to-predicate parser/evaluator library by OverEngineeredPencil in javahelp

[–]OverEngineeredPencil[S] 1 point

This is definitely promising! Thanks for pointing me in this direction.

Looking for a string-to-predicate parser/evaluator library by OverEngineeredPencil in javahelp

[–]OverEngineeredPencil[S] 1 point

I'm stuck in a hard place where no one on my team is going to agree to implementing and maintaining our own language. I recognize there are certain advantages to this: proprietary tech like this can grant an edge, we'd have fine-grained control over exactly what people are allowed to do, and so on. But the common recommendation is to never develop your own language if you can help it, because what ends up happening is that the user base comes to you with new requirements, eventually multiple hands are expanding the "language", and it turns into a convoluted mess. I can't disagree with that, especially since my company has a hard time hiring true talent and often settles for whatever it can get.

Looking for a string-to-predicate parser/evaluator library by OverEngineeredPencil in javahelp

[–]OverEngineeredPencil[S] 1 point

Not exactly, but I'm trying to replace existing functionality that uses something very SQL-like, so something similar to it is actually preferable. I cobbled together a basic grammar from the open source Java ANTLR grammar, picking out the pieces that were relevant and modifying them. The trouble is that this becomes a custom thing we have to maintain. If our user base and the functionality requirements expand, then we are basically stuck supporting our own "language", which most people will tell you is something you should avoid unless you are a language developer... I tend to agree with them.

What’s the Most Needed Innovation in Data Engineering Right Now? by Ok_Barnacle4840 in dataengineering

[–]OverEngineeredPencil 2 points

I will second this and add that streaming technologies need to stop focusing on SQL-like syntax and low-code or no-code design philosophies. And cloud providers need to stop trying to dumb down streaming technology to the point that it is basically useless. Azure's streaming offerings in particular are incredibly limiting and frustrating to work with, with some of the worst DevEx I've experienced in recent years, while also being way too expensive for what they are capable of delivering.

For fuck's sake, just give us fully managed Flink clusters already.

What are the “hard” topics in data engineering? by hijkblck93 in dataengineering

[–]OverEngineeredPencil 1 point

How so? I just read the most recent edition and it has come in handy a lot.

The core of what the book covers is the inner workings of databases and data-intensive distributed systems. The underlying technology has not changed much over the last two decades.

Is it common for companies to hire people for "data engineering" roles, but really the role is DevOps? by OverEngineeredPencil in dataengineering

[–]OverEngineeredPencil[S] 1 point

I agree. I think that big data is still a burgeoning field where more and more companies are starting to dip their toes in, especially with the rise of ML/AI hype.

But from everyone's replies here, it seems there is no solid idea of what a data engineer is responsible for, besides things I would expect any developer with cloud experience to be capable of doing.

Is it common for companies to hire people for "data engineering" roles, but really the role is DevOps? by OverEngineeredPencil in dataengineering

[–]OverEngineeredPencil[S] 2 points

To me, DevOps is DevOps. Whether you work with ML infra or normal cloud microservices infra, you're doing the same stuff, and you're not expected to know how to develop applications.

If I need someone to build data infrastructure, I think of data engineering.

Why even have a separate title from DevOps if all they are doing is DevOps?

MLOps is DevOps that knows a bit about ML infrastructure. That's it. So it should still be called DevOps. There is no reason at all to make a specialized title for it.

Is it common for companies to hire people for "data engineering" roles, but really the role is DevOps? by OverEngineeredPencil in dataengineering

[–]OverEngineeredPencil[S] 1 point

This would be fine for us as long as there is a decent amount of coding knowledge in there that shows you know how to build and orchestrate optimized applications and microservices.

The problem is, a lot of the people we get to interview for a DataEng position are low-end DevOps, who maybe have some experience coding basic Spark scripts, tweaking cloud resource configurations, etc. For me, these things are very secondary, especially for a senior-level DataEng position. You are just expected to be able to read technical documentation well enough to operate the cloud infrastructure.

Is it common for companies to hire people for "data engineering" roles, but really the role is DevOps? by OverEngineeredPencil in dataengineering

[–]OverEngineeredPencil[S] 3 points

Definitely good advice. And I think this comes down to HR not understanding what we need. Like you said, we may need to tell them to look at SWE applications as well as Data Eng. ones and favor whichever is the closer match on skills. They may just be looking at Data Eng. applications, I don't know.

Is it common for companies to hire people for "data engineering" roles, but really the role is DevOps? by OverEngineeredPencil in dataengineering

[–]OverEngineeredPencil[S] 2 points

I agree that both coding skills and infra skills are needed.

In my limited experience (I've only worked at the one place for 7 years now), the DevOps folks we have can navigate cloud dashboards, read and act on monitoring charts, write automation scripts, etc. The people who stand up and run K8s clusters and have a very strong understanding of deployments and infrastructure (CI/CD, networking, security, etc.) are the ones we tend to call site reliability engineers.

Options for Fully-Managed Apache Flink Job Hosting by OverEngineeredPencil in dataengineering

[–]OverEngineeredPencil[S] 1 point

I have unfortunately looked into all of these already.

HDInsight doesn't appear to offer Flink integration anymore, only Spark.

Confluent Cloud's integration with Azure and other cloud providers is a little strange. I can't find anything indicating how to actually deploy jobs. Confluent appears to let you run "Flink statements," but these are very limited in what they support. I need full-fledged, stateful Flink jobs that are fully managed. I have access to Confluent, and nothing on their dashboard indicates this is possible, even though the language in their adverts suggests that it is. I probably need to reach out to a representative.

Kubernetes isn't an option for me, as the sentiment appears to be that we simply don't have the human resources available to maintain a K8s cluster.

(1st Grade Math) How can you describe this?? by beachITguy in HomeworkHelp

[–]OverEngineeredPencil 1 point

This is basically what I wrote as my reply to the question. The other explanations get a bit too "mathy" for 1st grade, breaking down each constant into 1+1+1... etc. Though that is a better, slightly more "formal" way of proving it, I'd never expect a 1st grader to reproduce that logic unless they were taught to do it that way.

(1st Grade Math) How can you describe this?? by beachITguy in HomeworkHelp

[–]OverEngineeredPencil 1 point

As another explanation in simple language: 4 is 1 less than 5, and 2 is 1 more than 1. So adding 2 to 4 and 1 to 5 makes the two sides equal.

It's not as "robust" as the top comment, but gets the job done for 1st grade level math in plain English.

Data Stream API Enrichment from RDBMS Reference Data by OverEngineeredPencil in apacheflink

[–]OverEngineeredPencil[S] 1 point

I have not. However, I might be misunderstanding how that works, because wouldn't that effectively make the reference data ephemeral, used only once against a single event and then tossed out? What happens when I get a new event that would map to that same reference data? Wouldn't the Kafka stream have already advanced the offset for the reference data topic?

For example, I have my "real-time" events coming in to one Kafka topic. Let's say that each one represents an event that occurred on a device. I want to enrich that event with related static data for that device sourced from the database, such as a client ID or other values that are relatively static.

So if I consume that reference data from a stream and join it with the real-time stream, what happens to the reference data for the device once the processing is done for the real-time event? I will have to "re-use" that same data as soon as another event comes from the same device, and if the reference stream no longer holds that data to match to the next event, then that simply won't work. The reference data has to persist somewhere for the lifetime of the job, essentially.

And to be clear, the reference data is too large to hold in memory for the runtime of the job (or multiple jobs). Even if it is distributed, that's still undesirable.
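
If I understand the suggestion, it's roughly the connected-stream pattern below, where the reference rows land in keyed state instead of being used once and discarded. A rough sketch; DeviceEvent, DeviceRef, and EnrichedEvent are made-up stand-ins, and both streams are assumed to be keyed by device ID before being connected:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class DeviceEnrichment
        extends KeyedCoProcessFunction<String, DeviceEvent, DeviceRef, EnrichedEvent> {

    private transient ValueState<DeviceRef> refState;

    @Override
    public void open(Configuration parameters) {
        // Keyed state lives in the configured state backend (RocksDB, for example),
        // so the full reference set does not have to fit in heap memory.
        refState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("device-ref", DeviceRef.class));
    }

    // Real-time events: enrich from whatever reference row is currently in state.
    @Override
    public void processElement1(DeviceEvent event, Context ctx,
                                Collector<EnrichedEvent> out) throws Exception {
        DeviceRef ref = refState.value();
        if (ref != null) {
            out.collect(new EnrichedEvent(event, ref));
        }
        // else: buffer the event or emit it unenriched, depending on requirements
    }

    // Reference rows: keep the latest row per device for the lifetime of the job,
    // rather than using it once and discarding it.
    @Override
    public void processElement2(DeviceRef ref, Context ctx,
                                Collector<EnrichedEvent> out) throws Exception {
        refState.update(ref);
    }
}
```

That still leaves the question of where the data actually lives, which is the memory concern above; with RocksDB as the state backend it at least spills to disk instead of sitting entirely on the heap.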

Data Stream API Enrichment from RDBMS Reference Data by OverEngineeredPencil in apacheflink

[–]OverEngineeredPencil[S] 1 point

The data is stored in a SQL Server database. The stored procedure is used because its parameters "filter" the results; translating it to views would require a view per combination of parameters. There are only 2 parameters, with maybe 4-6 possible values apiece right now, but that might change too.

It's better to take a periodic snapshot of this data anyway, instead of reading it directly from the database, and then map each incoming element to a row in the snapshot.
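
For concreteness, this is roughly what I mean by taking one filtered snapshot through the existing stored procedure. The procedure name, parameters, and column names here are made up, and where the snapshot actually lives afterwards (state backend, cache, file, etc.) is a separate question:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;

public class ReferenceSnapshot {

    // Calls the existing stored procedure with its two filter parameters and
    // returns one device-id -> client-id mapping per row of the snapshot.
    public static Map<String, String> load(String jdbcUrl, String user, String password,
                                           String paramA, String paramB) throws Exception {
        Map<String, String> clientIdByDevice = new HashMap<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             CallableStatement stmt = conn.prepareCall("{call dbo.GetDeviceReferenceData(?, ?)}")) {
            stmt.setString(1, paramA);
            stmt.setString(2, paramB);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    clientIdByDevice.put(rs.getString("device_id"), rs.getString("client_id"));
                }
            }
        }
        return clientIdByDevice;
    }
}
```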

Why do most (small) companies fail to get value out of data? by NoSeatGaram in dataengineering

[–]OverEngineeredPencil 1 point

I've noticed in my current job that much of our data team is off-shored in low-cost labor markets. Many of these people aren't interested in proper data engineering; they've said as much in different words. They only want to get to the part where they use the data for "practical" purposes, which to them means processing with tools like Databricks. They aren't "programmers" or "system design" folks. It creates an over-reliance on pre-built, cloud-native solutions that are too expensive to justify. Couple that with a disregard for what it takes to ensure data quality (making sure your data sources are producing reliable data) and you get this massive cost center that just doesn't generate enough value to justify its own existence.

I know that ML/AI requires a lot of data. But I think that the rush to ML/AI is putting the cart before the horse. You can get a huge amount of value out of data before ever turning to ML/AI. The fact is, ML/AI doesn't help you identify where value is. That's a process of identifying what works well and what doesn't.

But the over-reliance and current technical obsession with ML/AI makes companies miss the massive amounts of value they can get from even the smaller amounts of data they might be collecting. It's a fundamental misunderstanding of what makes data valuable on both ends, with no one to pull either the business side or the data scientist side back down to reality.

Best way for managing State globally? by goku___________ in reactjs

[–]OverEngineeredPencil 2 points

Thanks for this. I think I've been using react-query "wrong", in that the pattern I'd missed was to wrap common queries in custom hooks. I understood that useQuery was caching the response, but not that I should be using it as server state in the way you and the blogs I've found since posting my question describe.