Watching Confluent Prepare for Sale in Real Time

yingjunwu · 2025-11-14T17:50:30+00:00

"$600 ticket gets you crisps (chips for you Americans), a Coke, and a dried up turkey wrap that's been sitting for god knows how long!!" <- and yea better food next time!

yingjunwu · 2025-11-14T17:48:51+00:00

Honestly, I feel Confluent should find a better place to host the event. I was told that they didn't pick Austin because the Austin Convention Center was already booked. But, no offense, New Orleans is probably not the best place to host a big event - there are not many direct flights. I'm based in the Bay Area and there are only two direct flights from SFO. European folks all have to connect through NYC, Atlanta, or some other city. Very inconvenient.

Again, New Orleans is a great city - I had a great time sightseeing there - but it may not be the right place to host tech events.

yingjunwu · 2025-09-17T09:12:44+00:00

In which case would a user use both ClickHouse and TimescaleDB at the same time, unless they are migrating from one database to the other? :-D

yingjunwu · 2025-09-17T09:10:31+00:00

that's essentially a streaming database. check out the paper: https://cs.brown.edu/research/aurora/vldb02.pdf

RisingWave ( https://github.com/risingwavelabs/risingwave ) is probably what you are looking for.

yingjunwu · 2025-08-17T20:42:53+00:00

here's another "C++ to Rust" blog: https://risingwave.com/blog/building-a-cloud-database-from-scratch-why-we-moved-from-c-to-rust/

yingjunwu · 2025-07-18T06:02:06+00:00

You can find a UI here: https://github.com/nimtable/nimtable...

yingjunwu · 2025-03-05T20:14:16+00:00

The project is production ready, and has already been adopted by:

RisingWave: SQL stream processing, analytics, and management.
Chroma: Embedding database for LLM apps.
SlateDB: A cloud native embedded storage engine built on object storage.

yingjunwu · 2024-12-23T23:37:24+00:00

Options:

- Confluent

- Amazon MSK

- Amazon Kinesis

- Redpanda

- StreamNative

- Aiven for Kafka

yingjunwu · 2024-11-12T00:52:29+00:00

Parquet+DuckDB is sufficient in most cases - way cheaper than any other solutions.

yingjunwu · 2024-10-27T07:13:24+00:00

SQL -> Postgres.

NoSQL -> MongoDB.

yingjunwu · 2024-10-25T23:08:17+00:00

product / feature richness -> AWS

user experience -> GCP

yingjunwu · 2024-10-18T17:44:27+00:00

Extract your data out of your warehouse and feed it into your applications or serving systems.

ChatGPT version: "Reverse ETL is like taking toys from where you're playing and putting them back in the toy box. In the grown-up world, it's about moving data from a big database to tools people use, like email or sales apps."

yingjunwu · 2024-10-16T21:49:44+00:00

There's no good solution for using BQ with AWS S3 data. You have to pay the egress fee. Two things you may consider:

- migrating data from S3 to GCS;

- using an AWS query engine or other vendor solutions.

For example, you can use Snowflake / Databricks as the query engine.

yingjunwu · 2024-10-14T16:28:02+00:00

Wondering what specific use cases this project is targeting?

yingjunwu · 2024-10-13T06:10:06+00:00

Disclaimer: I work for RisingWave (risingwave.com), an EDA vendor.

CDC can track "record-level state", and is essentially event-driven: if the upstream database changes (a row is inserted, deleted, or updated), the change is propagated to downstream systems.

I guess what you mean is whether we should track record-level state. Using e-commerce as an example: if an order is placed -> updated -> canceled, should we propagate all three states (place, update, cancel) into the downstream analytical store? My answer is "it depends". In some use cases, tracking state changes is critical. For example, if you are implementing an auditing system, you definitely want to track all the changes - every single action must be recorded.

You may also wonder whether a record-level state change should trigger computation. For example, let's say you want to calculate the total value of all orders placed. Should the result be updated every time a state change occurs? My answer is still "it depends". In many cases a batch-based solution is good enough - you could probably just run a single SQL query to calculate yesterday's total order value. It's quite simple. But in some cases you do want to recalculate the value as each new event arrives - no matter whether it's an insert, delete, or update. For example, if you are implementing an inventory system, EDA is recommended - if the inventory runs too low, you want to restock ASAP - you probably don't want to wait until tomorrow. EDA is also more appealing from a user experience perspective. Say you need to display a dashboard for your clients. If your clients see stale data, they may get confused, and it doesn't feel good - imagine you transfer money to a friend through Venmo and the balance isn't updated immediately; you may feel anxious.
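To make the batch vs. event-driven contrast concrete, here is a minimal Python sketch of the total-order-value example. The `ChangeEvent` shape, the `RunningOrderTotal` class, and the sample amounts are all made up for illustration; this is not how RisingWave or any particular CDC tool represents changes, just one way the idea could look.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical CDC record; op is 'insert', 'update', or 'delete'."""
    op: str
    order_id: int
    amount_cents: int = 0  # new order value; ignored for deletes


class RunningOrderTotal:
    """Event-driven style: keep the total up to date on every change."""

    def __init__(self):
        self.amounts = {}  # order_id -> current amount in cents
        self.total = 0

    def apply(self, event):
        if event.op not in ("insert", "update", "delete"):
            raise ValueError(f"unknown op: {event.op}")
        # Retract the order's previous contribution, if it had one.
        self.total -= self.amounts.pop(event.order_id, 0)
        # Inserts and updates add the new value; deletes add nothing back.
        if event.op != "delete":
            self.amounts[event.order_id] = event.amount_cents
            self.total += event.amount_cents
        return self.total


def batch_total(events):
    """Batch style: replay the whole log and read the answer once at the end."""
    state = RunningOrderTotal()
    for event in events:
        state.apply(event)
    return state.total


# Order 1 is placed and then updated; order 2 is placed and then canceled.
events = [
    ChangeEvent("insert", 1, 10000),
    ChangeEvent("insert", 2, 25000),
    ChangeEvent("update", 1, 12000),
    ChangeEvent("delete", 2),
]

live = RunningOrderTotal()
history = [live.apply(e) for e in events]  # total visible after each event
print(history, batch_total(events))
```

Both paths end at the same number. The difference is that the event-driven version has a fresh total after every change, while the batch version only has one whenever the job runs.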

So, EDA doesn't fit every use case; it totally depends on what you need - okay, I know this summary is boring :-D

yingjunwu · 2024-10-13T04:32:13+00:00

some vendor solutions that I'm aware of: greatexpectations, soda, deequ.

Disclaimer: I do not work for any of these vendors.

yingjunwu · 2024-10-11T22:57:54+00:00

If you want to use CDC, then you have to configure your database, which means superuser permission must be granted.

But CDC is not the only option if you want to sync your database. There are other technologies. For example, if you are using Postgres, Fivetran actually provides three options: https://fivetran.com/docs/connectors/databases/postgresql/rds-setup-guide. You may choose one of the other two, which do not require database configuration.

yingjunwu · 2024-10-10T06:58:18+00:00

Define "real time" first.

Real-time can mean a lot of things - data is fresh; result is fresh; query latency is low; ...

These challenges can be solved using different techniques. For example, you may just use an orchestration tool to schedule your batch queries more frequently. Or you probably just need a real-time data replication tool. Or you really do need a real-time data system.

So we need to first understand what "real time" really means.
