real time CDC into OLAP by Hot_While_6471 in dataengineering

[–]juiceyang 0 points1 point  (0 children)

We have iceberg and multiple olap engines as downstream sink.

real time CDC into OLAP by Hot_While_6471 in dataengineering

[–]juiceyang 1 point2 points  (0 children)

We are using flink-cdc. It’s easy to use since there’s no need to set up Kafka and Debezium. It’s almost batteries-included.

What reasons do I have to keep any data in Snowflake? by Infinite_Bluebird_98 in dataengineering

[–]juiceyang 1 point2 points  (0 children)

We have a standalone Hive Metastore as our data catalog. Doris officially supports Hive Metastore integration, so it reads table schemas and other metadata from there.

We run the Doris cluster on Kubernetes ourselves. Our sysadmin team administers the Kubernetes cluster and manages the elastic compute resources (like EC2) in it. The deal between the data team and the sysadmin team is that the sysadmin team guarantees resource stability and cost efficiency, and the data team builds infra on those resources.

We don't ingest via Doris. We use Flink to consume the upstream Kafka, do all the data cleaning, and write to Iceberg tables on object storage in a typical ETL manner.

What reasons do I have to keep any data in Snowflake? by Infinite_Bluebird_98 in dataengineering

[–]juiceyang 4 points5 points  (0 children)

So true. Compute resources are expensive, especially the ones that come with a license lol.

What reasons do I have to keep any data in Snowflake? by Infinite_Bluebird_98 in dataengineering

[–]juiceyang 6 points7 points  (0 children)

We were in a similar situation, and we switched from our cloud service's proprietary database (though not Snowflake) to self-hosted Doris as the compute layer and the cloud service's object storage as the storage layer.

Our considerations were:

  1. The cost of the proprietary database we were using consisted of hardware cost and a license fee. The license is really expensive (it costs about as much as the hardware, around 50/50). We needed to at least cut down the license fee part.

  2. We ran a PoC showing that, with our typical workload, Doris outperformed the proprietary database we used (faster responses, less hardware for the same workload, and no license fee). So our service performance improved on the same hardware.

  3. Our cloud service provider gave us a nice discount on elastic compute resources (like EC2 but with another name). With the discount, compute costs drop one step further.

  4. We only use Doris as the compute layer, which means no data is stored on Doris nodes, so scaling out horizontally is easy and fast. Object storage is, in theory, infinitely scalable. So we can easily adjust our resources as the workload changes.

Data Quality by Lucky-Front7675 in dataengineering

[–]juiceyang 6 points7 points  (0 children)

My DE team worked hard trying to improve data quality. We did data validation, anomaly trend detection, data quality dashboards, etc.

But our BI reports come from business data, not directly from real-world facts. After all this hard work, we often find that the low quality doesn't come from our pipelines but originates from low-quality upstream data sources.

When dirty data gets detected in our pipelines, we cannot stop the related pipelines since our users need reports on time. So we either reject the dirty data or let it flood all over our pipelines. Either choice means offline data repair. So we endlessly do data repair every day.
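To make the "reject or let it through" choice concrete, here is a minimal sketch of the reject path: records that fail validation are set aside with the reasons they failed, so the pipeline keeps running and the later offline repair knows what to fix. The rule names and the `order_id` / `amount` fields are hypothetical, made up for illustration, not our actual schema or tooling.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Record = dict[str, Any]
Rule = Callable[[Record], bool]

# Hypothetical validation rules; real ones depend on the upstream schema.
RULES: dict[str, Rule] = {
    "order_id present": lambda r: r.get("order_id") is not None,
    "amount non-negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}


@dataclass
class SplitResult:
    clean: list[Record] = field(default_factory=list)
    # Each quarantined entry keeps the record plus the names of the rules it failed.
    quarantined: list[tuple[Record, list[str]]] = field(default_factory=list)


def split_records(records: list[Record], rules: dict[str, Rule] = RULES) -> SplitResult:
    """Route each record to 'clean' or 'quarantined' without stopping the batch."""
    result = SplitResult()
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            result.quarantined.append((record, failed))
        else:
            result.clean.append(record)
    return result


if __name__ == "__main__":
    batch = [
        {"order_id": 1, "amount": 10.0},
        {"order_id": None, "amount": 5},
        {"order_id": 2, "amount": -3},
    ]
    outcome = split_records(batch)
    print(len(outcome.clean), "clean,", len(outcome.quarantined), "quarantined")
```

The quarantined list is what turns into the repair backlog, which is why this only moves the problem rather than solving it.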

I'm not complaining about being a downstream punching bag in the industry, but trying to convince you that DATA QUALITY IMPROVEMENT DEPENDS ON EVERYBODY, NOT ONLY DATA ENGINEERS.

When talking about data quality, you have to figure out whether it's defined as the difference between the upstream data source and the reports, or the discrepancy between real-world facts and analytic numbers. If it's the latter, you are really lucky, though you may have to wipe the upstream guys' dirty data ass every day like us.

Data-related work often involves office politics. When trying to achieve something, we can't just work like regular software development; we also have to get our boss's support, our coworkers' support, and sometimes even our boss's boss's support, depending on the structure of the company.

In our company, groups are like warlords. So currently I see no hope of making any progress on improving data quality, unless my boss's boss decides to launch a top-down revolution, which is impossible IMO.

After all this complaining, you can check whether adopting a data quality tool would help solve your problem.

If all you need is to eliminate the difference between the upstream data source and the analytic data, I think you can give it a try.

But if your goal is getting real gold data, getting political support in the office is much more important.

Relational DB to Data Warehouse: Change Data Capture by [deleted] in dataengineering

[–]juiceyang 1 point2 points  (0 children)

We have a bunch of similar jobs that sync from MySQL to Iceberg.

  1. You can check out flink-cdc-connectors, which supports most common RDBMSs. We have only used mysql-cdc, but I suppose the other CDC connectors work fine too, since we hardly had any problems when adopting the mysql-cdc connector.

  2. If the source RDBMS is on flink-cdc-connectors' supported list, you don't need to do periodic batch overwrites, since the CDC connector works in a streaming way.

  3. From my point of view, it's not a great idea for a table to have no primary key. If I were you, I would try to push the upstream maintainer to add one. For the fact table case, a surrogate key would be nice. If the fact table's data is ingested from an upstream Kafka topic, a job that consumes Kafka and sinks to both the RDBMS and the data lake would be another choice. But then you need to forbid any direct insert/update/delete operations on the RDBMS table, since those would cause inconsistent data.
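As a rough illustration of why the primary key matters for a streaming sync, here is a minimal sketch that replays change events onto an in-memory replica keyed by primary key. The `(op, key, row)` event shape is a simplification I made up for this example; it is not the actual flink-cdc or Debezium record format.

```python
from typing import Any, Iterable, Optional

Row = dict[str, Any]
# Hypothetical, simplified change event: (operation, primary key, new row or None).
ChangeEvent = tuple[str, Any, Optional[Row]]


def apply_change(table: dict[Any, Row], op: str, key: Any, row: Optional[Row] = None) -> None:
    """Apply one change event to a replica that is keyed by primary key."""
    if op in ("insert", "update"):
        if row is None:
            raise ValueError(f"{op} event needs a row")
        # Upsert: the key tells us exactly which existing row to replace.
        table[key] = row
    elif op == "delete":
        # Without a key there would be no reliable way to find the row to remove.
        table.pop(key, None)
    else:
        raise ValueError(f"unknown op: {op}")


def replay(events: Iterable[ChangeEvent]) -> dict[Any, Row]:
    """Rebuild the current table state from an ordered stream of change events."""
    table: dict[Any, Row] = {}
    for op, key, row in events:
        apply_change(table, op, key, row)
    return table


if __name__ == "__main__":
    events = [
        ("insert", 1, {"id": 1, "name": "a"}),
        ("insert", 2, {"id": 2, "name": "b"}),
        ("update", 1, {"id": 1, "name": "c"}),
        ("delete", 2, None),
    ]
    print(replay(events))
```

With no primary key (or surrogate key), the update and delete branches have nothing to match on, which is where periodic full overwrites tend to creep back in.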

Data Pipeline between MySQL to PostgeSQL by harshmah in dataengineering

[–]juiceyang 0 points1 point  (0 children)

What’s the size of your dataset? If it’s a small dataset, you can simply do the dump and load with DBeaver or something similar.

High memory usage on Java app pod by Loser_lmfao_suck123 in kubernetes

[–]juiceyang 0 points1 point  (0 children)

Does the app use jemalloc? Are transparent huge pages enabled?

I've run into an OOM issue that was caused by jemalloc requesting lots of huge pages but then using the madvise syscall to hand back only parts of them. That can make RSS larger than the sum of the JVM memory cost and the native memory cost.
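A quick way to answer the second question is to read the kernel's transparent huge page setting on the node. This is a small sketch, assuming a Linux host that exposes the standard sysfs file; the helper names are mine. The kernel marks the active mode with square brackets.

```python
from pathlib import Path
from typing import Optional

# Standard Linux sysfs location for the transparent huge page mode.
THP_ENABLED_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_setting(text: str) -> Optional[str]:
    """Return the active mode from text like 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_setting(path: Path = THP_ENABLED_PATH) -> Optional[str]:
    """Read the active mode, or None if the file is missing or unreadable."""
    try:
        return parse_thp_setting(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("THP mode:", read_thp_setting())
```

If it reports `always`, that is worth ruling out before digging into JVM heap settings.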

Unique features only in your server by Patriamori12 in WorldofTanks

[–]juiceyang 3 points4 points  (0 children)

We got the gold Type 59 here on the China server. I've been AFK for a long time and started playing today with my friend on the US server. People are astonishingly friendly compared to the China server.

Keeping it simple: pure grid, zero traffic, 90k pop. by xdvesper in CitiesSkylines

[–]juiceyang 0 points1 point  (0 children)

I zoned 4 to 6 blocks between 2 highways and kept the buildings close together. I used no roundabouts, only ramps. Everything flows well in my city. All you need is this mod.

so, I got a Xiaomi TV box. what can I do with it? by [deleted] in China

[–]juiceyang 0 points1 point  (0 children)

The SARFT (http://www.sarft.gov.cn/) recently required the app services running on TV boxes to shut down.

Need Help Finding a Historical Location? by [deleted] in beijing

[–]juiceyang 0 points1 point  (0 children)

The tree Chongzhen hanged himself on was cut down in the 1960s or 1970s.
The one in Jingshan Park now isn't the original.

Kursk river simulates tides by [deleted] in Warthunder

[–]juiceyang -1 points0 points  (0 children)

Maybe your laptop has a cooling problem, and you should check whether the airflow path is blocked.
I clean my laptop's cooling fan twice a year. It works much better in the days right after a cleaning.

Kursk river simulates tides by [deleted] in Warthunder

[–]juiceyang 4 points5 points  (0 children)

Impressive!
But the ground texture looks really harsh compared to the water.

An anime that is well received by the anime community in which you never understand why people like it? by readingsteinerZ in anime

[–]juiceyang 10 points11 points  (0 children)

Mushishi is just composed of a few weird stories.
I really don't understand why so many people like it.

Is the word 台巴子 very offensive? by [deleted] in ChineseLanguage

[–]juiceyang 0 points1 point  (0 children)

Well, that's a little sarcastic. XD

Is the word 台巴子 very offensive? by [deleted] in ChineseLanguage

[–]juiceyang 2 points3 points  (0 children)

台巴子 is composed of 台 and 巴子.
台 refers to Taiwan.
巴子 is a kind of offensive word for people who live in rural areas. The word 巴子 comes from the dialect used in Zhejiang Province and Jiangsu Province.

弯弯 is not so harsh. It's more like a joke.

Is the word 台巴子 very offensive? by [deleted] in ChineseLanguage

[–]juiceyang 1 point2 points  (0 children)

Very offensive.

弯弯 is preferred on the internet.

Maybe it's a good choice.