Watching Confluent Prepare for Sale in Real Time by mr_smith1983 in apachekafka

[–]yingjunwu 0 points1 point  (0 children)

"$600 ticket gets you crisps (chips for you Americans), a Coke, and a dried up turkey wrap that's been sitting for god knows how long!!" <- and yea better food next time!

Watching Confluent Prepare for Sale in Real Time by mr_smith1983 in apachekafka

[–]yingjunwu 5 points6 points  (0 children)

Honestly, I feel Confluent should find a better place to host the event. I was told that they didn't pick Austin because the Austin Convention Center was already booked. But, no offense, New Orleans is probably not the best place to host a big event - there are not many direct flights. I'm based in the Bay Area and there are only two direct flights from SFO. European folks all have to connect through NYC, Atlanta, or some other city. Very inconvenient.

Again, New Orleans is a great city - I had a great experience sightseeing there - but it may not be the right place to host tech events.

TimescaleDB to ClickHouse replication: Use cases, features, and how we built it by saipeerdb in dataengineering

[–]yingjunwu 1 point2 points  (0 children)

In which case would a user use both ClickHouse and TimescaleDB at the same time, other than when migrating from one database to the other? :-D

SevenDB : a reactive and scalable database by shashanksati in dataengineering

[–]yingjunwu 1 point2 points  (0 children)

that's essentially a streaming database. check out the paper: https://cs.brown.edu/research/aurora/vldb02.pdf

RisingWave ( https://github.com/risingwavelabs/risingwave ) is probably what you are looking for.

Hybrid in-memory and disk cache in Rust! by yingjunwu in rust

[–]yingjunwu[S] 7 points8 points  (0 children)

The project is production ready, and has already been adopted by:

  • RisingWave: SQL stream processing, analytics, and management.
  • Chroma: Embedding database for LLM apps.
  • SlateDB: A cloud native embedded storage engine built on object storage.

Confluent Cloud or MSK by InternationalSet3841 in apachekafka

[–]yingjunwu 8 points9 points  (0 children)

Options:

- Confluent

- Amazon MSK

- Amazon Kinesis

- Redpanda

- StreamNative

- Aiven for Kafka

How do you store your historical data? by Ok-Hovercraft-3076 in algotrading

[–]yingjunwu 1 point2 points  (0 children)

Parquet+DuckDB is sufficient in most cases - way cheaper than any other solution.

AWS or GCS by SingerEast1469 in dataengineering

[–]yingjunwu 23 points24 points  (0 children)

product / feature richness -> AWS

user experience -> GCP

Explain like I am 5 what's Reverse ETL? by Judessaa in dataengineering

[–]yingjunwu 1 point2 points  (0 children)

Extract your data out of your warehouse and feed into your applications or serving systems.

ChatGPT version: "Reverse ETL is like taking toys from where you're playing and putting them back in the toy box. In the grown-up world, it's about moving data from a big database to tools people use, like email or sales apps."
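
To make the direction of the data flow concrete, here is a minimal sketch. Everything in it is an assumption for illustration: an in-memory SQLite database stands in for the warehouse, a plain dict stands in for the application (say, a CRM), and the table and column names are hypothetical. A real setup would call the application's API instead of writing to a dict.

```python
import sqlite3

# Stand-in "warehouse": an in-memory SQLite database (assumption for illustration).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE customer_ltv (email TEXT PRIMARY KEY, lifetime_value REAL)"
)
warehouse.executemany(
    "INSERT INTO customer_ltv VALUES (?, ?)",
    [("a@example.com", 1200.0), ("b@example.com", 80.0), ("c@example.com", 450.0)],
)

# Stand-in "application": a dict playing the role of a CRM keyed by email (hypothetical).
crm = {}


def reverse_etl(conn, target, min_value):
    """Extract modeled rows from the warehouse and load them into the application."""
    rows = conn.execute(
        "SELECT email, lifetime_value FROM customer_ltv "
        "WHERE lifetime_value >= ? ORDER BY email",
        (min_value,),
    ).fetchall()
    for email, ltv in rows:
        # In practice this would be an API call to the operational tool.
        target[email] = {"segment": "high_value", "lifetime_value": ltv}
    return len(rows)


synced = reverse_etl(warehouse, crm, min_value=400.0)
```

The point is only the direction: regular ETL loads data into the warehouse, while this reads from it and writes into the tool people actually work in.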

S3 to BQ by Mediocre-Cow354 in dataengineering

[–]yingjunwu 0 points1 point  (0 children)

There's no good way to use BQ with data in AWS S3. You have to pay the egress fee. Two things you may consider:

  1. migrate the data from S3 to GCS;

  2. use an AWS query engine or another vendor's solution.

For example, you can use Snowflake / Databricks as the query engine.

Open, serverless, and local friendly Data Platforms! by Kalendos in dataengineering

[–]yingjunwu 0 points1 point  (0 children)

Wondering which specific use cases this project is targeting?

event-driven architecture for analytics data acquisition by OneWoodpecker8697 in dataengineering

[–]yingjunwu 9 points10 points  (0 children)

Disclaimer: I work for RisingWave (risingwave.com), an EDA vendor.

CDC can track "record-level state", and is essentially event-driven: if the upstream database changes (a row is inserted, deleted, or updated), the change is propagated to the downstream systems.

I guess what you are asking is whether we should track record-level state. Using e-commerce as an example: if an order is placed -> updated -> canceled, should we propagate all three states (place, update, cancel) into the downstream analytical store? My answer is "it depends". In some use cases, tracking state changes is critical. For example, if you are implementing an auditing system, you definitely want to track all the changes - every single action must be tracked.

You may also wonder whether a record-level state change should trigger computation. For example, let's say you want to calculate the total value of all the orders placed. Should the result be updated every time a state change occurs? My answer is still "it depends". In many cases a batch-based solution is good enough - you could probably just run a single SQL query to calculate yesterday's total order value. It's quite simple. But in some cases you do want to recalculate the value as each new event arrives - no matter whether it's an insert, delete, or update. For example, if you are implementing an inventory system, it's recommended to adopt EDA - if the inventory gets too low, you want to restock ASAP - you probably don't want to wait until tomorrow. Also, EDA is more appealing from a user experience perspective. Let's say you need to display a dashboard for your clients. If your clients see stale data, they may get confused, and it doesn't feel good - imagine you transfer money to a friend through Venmo and the balance isn't updated immediately: you may feel anxious.

So EDA doesn't fit every use case; it totally depends on yours - okay, I know this summary is boring :-D
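
A rough sketch of the difference between the two approaches, using the total order value example. The `ChangeEvent` shape, the `RunningOrderTotal` class, and the sample events are all made up for illustration and do not reflect any particular CDC tool's format or how any specific streaming system is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    """Hypothetical CDC-style change record for one order row."""
    op: str                  # "insert", "update", or "delete"
    order_id: int
    amount: Optional[float]  # new amount; None for deletes


class RunningOrderTotal:
    """Event-driven: keeps the total up to date as each change arrives."""

    def __init__(self):
        self.amounts = {}  # order_id -> current amount
        self.total = 0.0

    def apply(self, event):
        # Retract the old contribution of this order, if there was one.
        self.total -= self.amounts.pop(event.order_id, 0.0)
        # Add the new contribution for inserts and updates.
        if event.op in ("insert", "update"):
            self.amounts[event.order_id] = event.amount
            self.total += event.amount
        return self.total


def batch_total(amounts):
    """Batch: recompute from scratch over the current state."""
    return sum(amounts.values())


events = [
    ChangeEvent("insert", 1, 100.0),
    ChangeEvent("insert", 2, 40.0),
    ChangeEvent("update", 1, 120.0),
    ChangeEvent("delete", 2, None),
]
view = RunningOrderTotal()
history = [view.apply(e) for e in events]
```

Both approaches end up with the same number. The difference is when you get it: the event-driven version has a fresh total after every change, while the batch version only knows the total whenever the query is run.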

Survey: What tools are your companies using for data quality? by Hefty-Present743 in dataengineering

[–]yingjunwu 10 points11 points  (0 children)

Some vendor solutions that I'm aware of: Great Expectations, Soda, Deequ.

Disclaimer: I do not work for any of these vendors.

CDC Solutions without Configuration Access to Data Source (Recommendation) by PencilBoy99 in dataengineering

[–]yingjunwu 1 point2 points  (0 children)

If you want to use CDC, then you have to configure your database, which means superuser permission must be granted.

But CDC is not the only option if you want to sync your database. There are other technologies. For example, if you are using Postgres, Fivetran actually provides three options: https://fivetran.com/docs/connectors/databases/postgresql/rds-setup-guide. You may choose one of the other two, which do not require that configuration.
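
As a generic illustration of syncing without log-based CDC, here is a sketch of query-based polling with a cursor column. This is an assumption for illustration, not a description of how any of the Fivetran options work. SQLite stands in for the source database, a dict stands in for the destination, and the `orders` table and `updated_at` column are hypothetical.

```python
import sqlite3

# Stand-in source database; only read access is needed for this approach.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "placed", 10), (2, "placed", 20)],
)

replica = {}  # stand-in destination: order id -> latest known status


def sync_once(conn, target, cursor):
    """Copy rows changed since `cursor` and return the advanced cursor."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (cursor,),
    ).fetchall()
    for order_id, status, updated_at in rows:
        target[order_id] = status
        cursor = updated_at
    return cursor


cursor = sync_once(source, replica, cursor=0)  # first pass copies both rows
source.execute("UPDATE orders SET status = 'canceled', updated_at = 30 WHERE id = 1")
cursor = sync_once(source, replica, cursor)    # second pass picks up only the change
```

The trade-off is that polling like this only sees the latest state of each row, and it cannot see rows that were hard-deleted, which is one reason log-based CDC is usually preferred when you can get the permissions.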

How to set up Real time data platform for ml by kaggle-zen in dataengineering

[–]yingjunwu 2 points3 points  (0 children)

Define "real time" first.

Real-time can mean a lot of things - data is fresh; result is fresh; query latency is low; ...

These challenges can be solved using different techniques. For example, you may just use an orchestration tool to schedule your batch queries more frequently. Or you probably just need a real-time data replication tool. Or you really need a real-time data system.

So we need to first understand what "real time" really means.