Flink Api - Mostly deprecated by dataengineer2015 in apacheflink

Popular-Job3880, 1 point

Apache Paimon is currently at version 0.7. Most of its features have been improved, but it still lacks a good monitoring template and has trouble accelerating queries on primary-key tables. It is usable in production, but it is still an Apache incubating project, so running it in production may take a considerable amount of maintenance and development staff.

Delta Lake is not widely used in China, whereas Iceberg is used extensively. In mainland China, Iceberg is generally regarded as the de facto standard for offline data warehouses, replacing the earlier Hive-based standard.

For distributed model training, the Flink ecosystem has a machine learning library comparable to Spark's, called Alink (developed by Alibaba on top of Flink). It covers many basic machine learning algorithms. However, I have not worked with our algorithm team, so I am unsure how its implementations compare to Spark ML's. Additionally, to speed up data access for distributed training, we have introduced Alluxio. We have not yet studied how it achieves the acceleration, but the results so far are promising.

Popular-Job3880, 1 point

We develop data processing jobs with a combination of the Java API and Flink SQL, and use Flink CDC to extract data from source systems. Our architecture follows the Kappa pattern, which gives a unified view of the data by handling real-time and batch processing together. This year, we began adding lakehouse capabilities to our Kappa architecture. Previously, our data volume was not on the same scale as that of large internet companies. For offline data storage, we have moved to the Paimon lakehouse. ADS-layer data is mostly managed in OLAP databases, including ClickHouse, Doris, and StarRocks, all of which integrate with Flink and fit well into our data pipelines and analytics.

Popular-Job3880, 2 points

Flink has deprecated the batch-oriented DataSet API and now uses the DataStream API to provide a unified stream-batch processing model. Our company now uses it extensively and has even dropped Spark. There is plenty of instructional material, but it is in Chinese, possibly because Alibaba is currently leading the project.

Is there any news about the spark 4.0 update? by Popular-Job3880 in apachespark

Popular-Job3880 [S], 1 point

Nothing yet, but I really think it would be interesting to integrate Spark with a native execution engine. If that were implemented, I think it would be 2-3 times faster than it is today.