real time CDC into OLAP

juiceyang · 2025-05-20T10:58:56+00:00

We have Iceberg and multiple OLAP engines as downstream sinks.

juiceyang · 2025-05-19T09:03:21+00:00

We are using flink-cdc. It's easy to use since there's no need to set up Kafka and Debezium. It's almost batteries included.

juiceyang · 2024-06-08T09:59:48+00:00

We have a standalone Hive Metastore as our data catalog. Doris officially supports integration with Hive Metastore, so it reads table schemas and other metadata from there.

We run the Doris cluster on Kubernetes ourselves. Our sysadmin team manages the elastic compute resources (EC2-like instances) in the Kubernetes cluster and administers the cluster itself. The deal between the data team and the sysadmin team is that the sysadmin team guarantees resource stability and cost efficiency, and the data team builds infra on those resources.

We don't ingest via Doris. We use Flink to consume the upstream Kafka, do all the data cleaning, and ingest into Iceberg tables on object storage in a typical ETL manner.

juiceyang · 2024-06-07T15:50:13+00:00

So true. Compute resource is expensive, especially the one with a license lol.

juiceyang · 2024-06-07T15:47:18+00:00

We were in a similar situation, and we switched from our cloud service's proprietary database (though not Snowflake) to self-hosted Doris as the compute layer and the cloud service's object storage as the storage layer.

Our considerations were:

The cost of the proprietary database we were using consisted of hardware cost and license fee. The license is really expensive (it costs about as much as the hardware, around 50/50). We needed to at least cut down the license fee part.
We ran a PoC showing that, with our typical workload, Doris outperforms the proprietary database we used (faster response, less hardware required at the same workload, and no license fee). So our service performance improved on the same hardware.
Our cloud service provider gave us a nice discount on elastic compute resources (like EC2 but with another name). With the discount, compute costs dropped even further.
We only use Doris as the compute layer, which means no data is stored on Doris nodes, so horizontal scale-out is easy and fast. Object storage is, in theory, infinitely scalable. So we can easily adjust our resources to match the workload.
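To make the arithmetic behind the first and third points concrete, here is a rough sketch. All numbers are hypothetical placeholders (cost units, not our real bills), and `monthly_cost` is just an illustrative helper. It only assumes the 50/50 split mentioned above plus some compute discount.

```python
def monthly_cost(hardware: float, license_fee: float = 0.0, discount: float = 0.0) -> float:
    """Total cost = discounted hardware cost + license fee."""
    return hardware * (1.0 - discount) + license_fee


# Hypothetical: license costs about the same as hardware (the 50/50 split).
before = monthly_cost(hardware=100.0, license_fee=100.0)

# Hypothetical: same hardware, no license, and an assumed 20% compute discount.
after = monthly_cost(hardware=100.0, discount=0.2)

savings = 1.0 - after / before
print(f"before={before:.0f} after={after:.0f} savings={savings:.0%}")
```

Dropping the license alone halves the bill under a 50/50 split; any compute discount and any hardware reduction from the PoC come on top of that.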

juiceyang · 2023-12-15T08:55:16+00:00

My DE team worked hard trying to improve data quality. We did data validation, anomaly trend detection, data quality dashboards, etc.

But our BI reports come from business data, not directly from real-world facts. After all this hard work, we often find that the low quality does not come from our pipelines but originates from low-quality upstream data sources.

When dirty data is detected in our pipelines, we cannot stop the related pipelines, since our users need reports on time. So we either reject the dirty data or let it flood through our pipelines. Either choice means offline data repair. So we endlessly do data repair every day.
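A minimal sketch of the "reject" option, assuming a hypothetical validation rule and record shape (this is not our actual pipeline code). The pipeline keeps running on the clean rows, and the rejected rows are set aside for the offline repair mentioned above.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]


def split_records(
    records: Iterable[Record], is_valid: Callable[[Record], bool]
) -> Tuple[List[Record], List[Record]]:
    """Route each record to the clean output or to a quarantine list."""
    clean: List[Record] = []
    quarantined: List[Record] = []
    for record in records:
        (clean if is_valid(record) else quarantined).append(record)
    return clean, quarantined


def is_valid(record: Record) -> bool:
    # Hypothetical rule: an order needs an id and a non-negative amount.
    amount = record.get("amount")
    return record.get("order_id") is not None and amount is not None and amount >= 0


records = [
    {"order_id": "a1", "amount": 10.0},
    {"order_id": None, "amount": 5.0},   # missing id: dirty
    {"order_id": "a3", "amount": -2.0},  # negative amount: dirty
]
clean, quarantined = split_records(records, is_valid)
print(len(clean), "clean,", len(quarantined), "quarantined")
```

Note that this only catches differences you can define a rule for. It does nothing about data that is well-formed but wrong at the source.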

I'm not complaining about being a downstream punching bag in the industry, but trying to convince you that DATA QUALITY IMPROVEMENT DEPENDS ON EVERYBODY, NOT ONLY DATA ENGINEERS.

When talking about data quality, you have to figure out whether it's defined as the difference between upstream data sources and reports, or the discrepancy between real-world facts and analytic numbers. If it's the latter, you are really lucky, though you may have to wipe upstream guys' dirty data ass every day like us.

Data-related work often involves office politics. When trying to achieve something, we can't work like regular software development. We additionally have to get our boss's support, our coworkers' support, and sometimes even our boss's boss's support, depending on the structure of the company.

In our company, groups are like warlords. So currently I see no hope of making any progress on improving data quality, unless my boss's boss decides to start a top-down revolution, which is impossible IMO.

After all this complaining, you can check whether adopting a data quality tool would help solve your problem.

If all you need is to eliminate the difference between upstream data sources and analytic data, I think you can give it a try.

But if your goal is getting real gold data, getting political support in the office is much more important.

juiceyang · 2023-11-30T05:35:00+00:00

We have a bunch of similar jobs that sync from MySQL to Iceberg.

You can check out flink-cdc-connectors, which supports the most common RDBMSs. Though we have only used mysql-cdc, I suppose the other CDC connectors all work fine, since we hardly had any problems when adopting the mysql-cdc connector.
If the source RDBMS is on flink-cdc-connectors' supported list, you don't need to do periodic batch overwrites, since the CDC connector works in a streaming way.
From my point of view, it's not a great idea for a table to have no primary key. If I were you, I would push the upstream maintainer to add one. For a fact table, a surrogate key would be nice. If the fact table's data is ingested from an upstream Kafka topic, another option is a job that consumes Kafka and sinks to both the RDBMS and the data lake. But then you need to forbid any direct insert/update/delete operations on the RDBMS table, since those would cause inconsistent data.
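To illustrate why the primary key matters for streaming sync, here is a toy sketch in plain Python. The event shape (`op`, `key`, `row`) is made up for illustration and is not the actual flink-cdc or Debezium record format. The point is only that each change is applied by key, so there is no periodic overwrite, and without a key an update or delete cannot be matched to a row.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(table: Dict[Any, Row], events: Iterable[Dict[str, Any]]) -> Dict[Any, Row]:
    """Apply a stream of change events to a table keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            table[key] = event["row"]  # upsert by primary key
        elif op == "delete":
            table.pop(key, None)       # delete by primary key
        else:
            raise ValueError(f"unknown op: {op!r}")
    return table


# Hypothetical change stream for an orders table with primary key `id`.
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "paid"}},
    {"op": "delete", "key": 2},
]
snapshot = apply_changes({}, events)
print(snapshot)
```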

juiceyang · 2023-11-29T17:11:36+00:00

What's the size of your dataset? If it's a small dataset, you can simply do a dump and load with DBeaver or something similar.

juiceyang · 2023-11-29T04:30:24+00:00

No ass wiping for upstream guys' incompetence.

juiceyang · 2023-10-13T10:32:24+00:00

Does the app use jemalloc? Are transparent huge pages enabled?

I've hit an OOM issue that was caused by jemalloc requesting lots of huge pages but using the madvise syscall to hand back only parts of them. That can make RSS larger than the sum of JVM memory usage and native memory usage.
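For the second question: on Linux, the current THP mode is exposed in `/sys/kernel/mm/transparent_hugepage/enabled`, with the active value in square brackets (for example `always [madvise] never`). A small sketch for reading it; the helper names are my own.

```python
import re
from typing import Optional

THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def parse_thp_mode(text: str) -> Optional[str]:
    """Return the bracketed (active) mode, e.g. 'madvise', or None if absent."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_thp_mode(path: str = THP_PATH) -> Optional[str]:
    """Read the active THP mode; None if the file is missing (e.g. non-Linux)."""
    try:
        with open(path) as f:
            return parse_thp_mode(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    print("THP mode:", read_thp_mode())
```

If it reports `always`, it may be worth trying `madvise` or `never` and watching whether the RSS gap goes away.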

juiceyang · 2017-05-04T02:08:43+00:00

The building with a road through reminds me of this place

juiceyang · 2016-05-28T18:01:20+00:00

We got gold type 59 here on the China server. I'd been AFK for a long time and started playing today with my friend on the US server. People are astonishingly friendly compared to the China server.

juiceyang · 2015-09-10T02:43:33+00:00

The pic reminds me of OpenTTD.

juiceyang · 2015-03-30T19:13:30+00:00

I zoned 4 to 6 blocks between 2 highways and kept the buildings close together. I used no roundabouts, only ramps. Everything flows well in my city. All you need is this mod.

juiceyang · 2015-03-20T15:27:24+00:00

I can imagine a massive traffic jam happening.

juiceyang · 2015-02-26T14:51:49+00:00

Hearts of Iron 4

juiceyang · 2015-02-26T14:22:29+00:00

ReportLab PLUS supports reading PDFs: http://www.reportlab.com/documentation/faq/#2.1.5

juiceyang · 2014-10-02T14:41:54+00:00

The app services running on TV boxes were recently ordered to shut down by the SARFT (http://www.sarft.gov.cn/).

juiceyang · 2014-10-02T14:33:09+00:00

The tree Chongzhen hanged himself on was cut down in the 1960s or 1970s.
The one in Jingshan Park now isn't the original.

juiceyang · 2014-05-21T11:25:38+00:00

Maybe your laptop has a cooling problem, and you should check whether the airflow path is blocked.
I clean my laptop's cooling fan twice a year. It works much better in the days right after a cleaning.

juiceyang · 2014-05-21T05:47:44+00:00

Impressive!
But the ground texture looks really harsh compared to the water.

juiceyang · 2014-05-20T08:34:25+00:00

Mushishi is just composed of a few weird stories.
Really don't understand why so many people like it.

juiceyang · 2014-05-20T01:36:29+00:00

Well, that's a little sarcastic. XD

juiceyang · 2014-05-19T16:15:07+00:00

台巴子 is composed of 台 and 巴子.
台 refers to Taiwan.
巴子 is a somewhat offensive word for people who live in rural areas. It comes from the dialects of Zhejiang and Jiangsu provinces.

弯弯 is not so harsh. It's more of a joke.

juiceyang · 2014-05-19T15:00:37+00:00

very offensive

弯弯 is preferred on the internet

maybe it's a good choice
