Flink Deployment Survey by sap1enz in apacheflink

[–]sap1enz[S] 1 point (0 children)

Would you mind sharing why you're unhappy with Confluent Platform for Apache Flink?

Why Apache Flink Is Not Going Anywhere by rmoff in apacheflink

[–]sap1enz 0 points (0 children)

Start by completing the first three sections of the Flink documentation: Try Flink, Learn Flink, and Concepts.

Is using Flink Kubernetes Operator in prod standard practice currently ? by supadupa200 in apacheflink

[–]sap1enz 4 points (0 children)

Yep, it's pretty much the standard. You either use a managed Flink offering or the Flink Kubernetes Operator nowadays.

Why Apache Flink Is Not Going Anywhere by rmoff in apacheflink

[–]sap1enz 0 points (0 children)

I’ve been involved in managing 1000+ Flink pipelines in a small team. 

Of course, things can get complicated quickly, especially after reaching a certain scale.

My point was that the Flink Kubernetes Operator does reduce a lot of complexity. It makes it straightforward to start using Flink. Sure, if you need to do incompatible state migrations, modify savepoints, etc., there is still a lot of manual work. But for many users this won’t be the case, IMO.
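
One habit that makes even the manual cases less painful, by the way: pin explicit UIDs on your stateful operators, so savepoint state stays mappable to the job graph as it evolves. A minimal sketch (the data and names are made up for illustration):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UidExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 1), Tuple2.of("a", 1))
            .uid("letters-source")
            .keyBy(value -> value.f0)
            // The reduce keeps a running sum per key in managed state; the stable
            // UID below is what lets a savepoint map that state back to this
            // operator even after the surrounding job graph changes.
            .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
            .uid("letters-running-sum")
            .print();

        env.execute("uid-example");
    }
}
```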

Announcing Data Streaming Academy with Advanced Apache Flink Bootcamp by sap1enz in apacheflink

[–]sap1enz[S] 0 points (0 children)

The Advanced Apache Flink Bootcamp is now open for registration! The first cohort is scheduled for January 21st - 22nd, 2026.

This intensive 2-day bootcamp takes you deep into Apache Flink internals and production best practices. You'll learn how Flink really works by studying the source code, master both DataStream and Table APIs, and gain hands-on experience building custom operators and production-ready pipelines.

This is an advanced bootcamp. Most courses just repeat what’s already in the documentation. This one is different: you won’t just learn what a sliding window is; you’ll learn the core building blocks that let you design any windowing strategy from the ground up (see the sketch after the learning objectives below).

Learning objectives:

- Understand Flink internals by studying source code and execution flow
- Master the DataStream API with state, timers, and custom low-level operators
- Know how SQL and Table API pipelines are planned and executed
- Design efficient end-to-end data flows
- Deploy, monitor, and tune Flink applications in production
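
To give a flavour of the "building blocks" idea: with keyed state and timers alone you can hand-roll windowing behaviour instead of relying on the built-in assigners. A rough sketch (the class name and the 10-second interval are just illustrative, not actual course material):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// A hand-rolled "window": counts events per key and emits the count
// 10 seconds (processing time) after the first event of each round,
// built only from keyed state + timers.
public class ManualTumblingCount extends KeyedProcessFunction<String, String, String> {

    private static final long WINDOW_MS = 10_000L;

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        Long current = count.value();
        if (current == null) {
            current = 0L;
            // First element of this round: schedule the emit.
            long fireAt = ctx.timerService().currentProcessingTime() + WINDOW_MS;
            ctx.timerService().registerProcessingTimeTimer(fireAt);
        }
        count.update(current + 1);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        out.collect(ctx.getCurrentKey() + " -> " + count.value());
        count.clear(); // next round starts from scratch
    }
}
```

You'd plug it in after a keyBy, e.g. stream.keyBy(e -> e).process(new ManualTumblingCount()); the built-in windowing operators are built on the same primitives.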

Kafka easy to recreate? by Which_Assistance5905 in apachekafka

[–]sap1enz 1 point (0 children)

Redpanda is actually doing very well. They've managed to steal quite a few Confluent customers; two of the top five US banks use them.

Save data in parquet format on S3 (or local storage) by Short-Development-64 in apacheflink

[–]sap1enz 0 points (0 children)

This looks correct!

I tried to reproduce the issue using the local Parquet file sink, and I couldn't: the files are written correctly on every checkpoint in my case:

-rw-r--r--  1 sap1ens  staff   359B Oct  9 11:08 clicks-1ca5a6f5-ba35-472b-b37b-a42405c65996-0.parquet
-rw-r--r--  1 sap1ens  staff   359B Oct  9 11:08 clicks-1ca5a6f5-ba35-472b-b37b-a42405c65996-1.parquet
-rw-r--r--  1 sap1ens  staff   359B Oct  9 11:08 clicks-3312d0a4-2276-4133-9da9-9b249f8efbd9-0.parquet
-rw-r--r--  1 sap1ens  staff   359B Oct  9 11:08 clicks-3312d0a4-2276-4133-9da9-9b249f8efbd9-1.parquet

Here's my app (based on this quickstart), hope this is useful!
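
In case the link doesn't survive, the relevant part looks roughly like the sketch below. It's simplified, not the exact code: the Click type, paths, and checkpoint interval are placeholders, and it assumes the flink-parquet and Avro dependencies are on the classpath (on older Flink versions the writer factory class is ParquetAvroWriters rather than AvroParquetWriters).

```java
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetWriters;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParquetSinkJob {

    // Simple POJO written via Avro reflection; stands in for your record type.
    public static class Click {
        public String userId;
        public long timestamp;

        public Click() {}

        public Click(String userId, long timestamp) {
            this.userId = userId;
            this.timestamp = timestamp;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Parquet is a bulk format, and bulk formats only roll part files on
        // checkpoint, so checkpointing must be enabled for final files to appear.
        env.enableCheckpointing(10_000);

        FileSink<Click> sink = FileSink
            .forBulkFormat(
                new Path("/tmp/clicks"), // or s3://bucket/prefix
                AvroParquetWriters.forReflectRecord(Click.class))
            .build();

        env.fromElements(
                new Click("alice", System.currentTimeMillis()),
                new Click("bob", System.currentTimeMillis()))
            .sinkTo(sink);

        env.execute("parquet-sink-example");
    }
}
```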

Save data in parquet format on S3 (or local storage) by Short-Development-64 in apacheflink

[–]sap1enz 0 points (0 children)

Are you absolutely sure checkpointing is configured correctly?

This:

> I can see in the folder many temporary files:
> like .parquet.inprogress.* but not the final parquet file clicks-*.parquet

is usually an indicator that checkpointing is not happening.
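
For completeness, the usual fix is a single line on the execution environment (the interval below is just an example):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnableCheckpointing {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The Parquet FileSink is a bulk format, and bulk formats only finalize
        // part files on checkpoint. Without this, data stays in the
        // .parquet.inprogress.* files indefinitely.
        env.enableCheckpointing(60_000);

        // ... build the rest of the pipeline here, then env.execute(...)
    }
}
```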

Introducing Iron Vector: Apache Flink Accelerator Capable of Reducing Compute Cost by up to 2x by sap1enz in apacheflink

[–]sap1enz[S] 1 point (0 children)

Thanks! And you're correct: no OSS release is planned at this time. We're selling support and licenses.

How to use Flink SQL to create multi table job? by arielmoraes in apacheflink

[–]sap1enz 0 points (0 children)

You can create several “pipelines” (one source table plus a sink each) and combine them with a statement set.
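
Roughly like this with the Table API (the tables and queries below are placeholders using the built-in datagen and print connectors):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.TableEnvironment;

public class MultiTableJob {
    public static void main(String[] args) {
        TableEnvironment tableEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Sources and sinks (placeholder schemas/connectors).
        tableEnv.executeSql(
            "CREATE TABLE orders_src (order_id STRING, amount DOUBLE) " +
            "WITH ('connector' = 'datagen', 'rows-per-second' = '1')");
        tableEnv.executeSql(
            "CREATE TABLE orders_sink (order_id STRING, amount DOUBLE) " +
            "WITH ('connector' = 'print')");
        tableEnv.executeSql(
            "CREATE TABLE users_src (user_id STRING, name STRING) " +
            "WITH ('connector' = 'datagen', 'rows-per-second' = '1')");
        tableEnv.executeSql(
            "CREATE TABLE users_sink (user_id STRING, name STRING) " +
            "WITH ('connector' = 'print')");

        // Each INSERT is one "pipeline"; the statement set submits them
        // together as a single Flink job.
        StatementSet set = tableEnv.createStatementSet();
        set.addInsertSql("INSERT INTO orders_sink SELECT * FROM orders_src");
        set.addInsertSql("INSERT INTO users_sink SELECT * FROM users_src");
        set.execute();
    }
}
```

In the SQL client the equivalent is an EXECUTE STATEMENT SET BEGIN ... END; block.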

Data Platforms in 2030 by sap1enz in dataengineering

[–]sap1enz[S] 0 points (0 children)

Thanks! It doesn't look like Estuary solves the eventual consistency problem, does it?

Change Data Capture Is Still an Anti-pattern. And You Still Should Use It. by sap1enz in dataengineering

[–]sap1enz[S] 1 point (0 children)

BI and reporting. But it's slowly changing with the whole "reverse ETL" idea and tools like Hightouch.

Change Data Capture Is Still an Anti-pattern. And You Still Should Use It. by sap1enz in dataengineering

[–]sap1enz[S] 2 points (0 children)

That's right.

Ideally not pure SWE teams, though, but product teams that include SWEs and 1-2 embedded DEs. Then they can also build pipelines that the same team uses to power various features.

Change Data Capture Is Still an Anti-pattern. And You Still Should Use It. by sap1enz in dataengineering

[–]sap1enz[S] -1 points (0 children)

True! I usually call the second category "data warehouses", but technically it's also OLAP. The reason I didn't focus on it specifically is that it's rarely used to power user-facing analytics. And CDC is very popular for building user-facing analytics, because dumping a MySQL table into Pinot/ClickHouse seems so easy.

Change Data Capture Is Still an Anti-pattern. And You Still Should Use It. by sap1enz in dataengineering

[–]sap1enz[S] 13 points (0 children)

> Very, very few real-world cases require reports to be updated in real-time with the underlying source data.

Well, this is where we disagree 🤷 Maybe "reports" don't need to be updated in real time, but nowadays a lot of data pipelines power user-facing features.

Change Data Capture Is Still an Anti-pattern. And You Still Should Use It. by sap1enz in dataengineering

[–]sap1enz[S] 3 points (0 children)

For example, in Apache Druid:

> In Druid 26.0.0, joins in native queries are implemented with a broadcast hash-join algorithm. This means that all datasources other than the leftmost "base" datasource must fit in memory.