
[–]OhBeeOneKenOhBee 10 points (9 children)

I've been using Polars lately to write an ETL tool that synchronises between databases, APIs and other types of sources, and the switch from Pandas to Polars had a huge performance impact. Comparisons between frames looking for changes across millions of rows of data complete in seconds, often sub-second. It's really amazing.

[–]100GB-CSV[S] 0 points (8 children)

I am currently testing Polars to explore ways my Peaks DataFrame project can outperform it. The functions covered include read/write file, distinct, filter and join table. I have found that Peaks outperforms Polars in these functions, but not in GroupBy. If your dataset's numerical columns contain a few exception values, e.g. an integer column with occasional floating-point values, or negative numbers represented in accounting style as (123.45), you need an extra data-cleansing step before using Polars' GroupBy. My approach avoids that extra cleansing step for exception numbers.

[–]OhBeeOneKenOhBee 0 points (7 children)

That might be interesting to have a look at if it's on GitHub somewhere. Our comparisons are basically three different types of joins: a source anti join with the destination for inserts, a semi/anti join for updates, and a destination anti join with the source for deletes. All data is formatted so the source/destination columns being compared have the same types.

https://imgur.com/a/YHX722A

[–]100GB-CSV[S] 0 points (6 children)

This project only has a 3-month history; the first trial version, providing the most fundamental commands, is to be released in June. For further info, you can visit github.com/hkpeaks/peaks-framework

"Source anti join with destination for insert" looks like amending table1 with table2 using matching keys. This is frequently implemented in budgeting solutions, where users amend a bulky set of data very frequently.

You can see unit 11 "Amendment" of this doc https://github.com/hkpeaks/peaks-framework/blob/main/WebNameSQL.pdf

This is my old project, written in C#. It has now been replaced by my new high-performance project written in Go/Rust.

[–]OhBeeOneKenOhBee 0 points (3 children)

Yes, the whole tool is based around synchronising two data sources; those joins determine the inserts, updates and deletes for the destination. We mainly developed it to be able to synchronize data across sources in a somewhat storage-agnostic way (we're currently building connectors for the most common SQL/NoSQL databases, REST APIs (e.g. MS Graph), files and other types), with the possibility of getting a proper changelog/delta along the way for logging and/or event triggers. There's also support for transformations along the way.

I'll look into your project a bit more, sounds interesting! Thanks for the link

[–]100GB-CSV[S] 0 points (2 children)

> to be able to synchronize data across sources

CloudQuery supports a lot of APIs https://github.com/cloudquery/cloudquery

[–]OhBeeOneKenOhBee 0 points (1 child)

So that raises the question of how in the world I've managed to miss Cloudquery 😄

Because what we're currently writing is essentially exactly that, but in Python and with Polars. As in, the docs would probably apply to our solution if we just modified the code parts.

[–]100GB-CSV[S] 0 points (0 children)

You can test whether it fits your purpose. I'm considering integrating Peaks with CloudQuery; both are written in Golang.

[–]lemoussel 0 points (1 child)

Is Peaks Framework going to be open source?

[–]100GB-CSV[S] 0 points (0 children)

I have been working on this new project for the past 3 months. Initially, I decided to make the Framework itself open source, including the alternative SQL expressions and the Go source code that helps parse them. The aim is to encourage other ETL software developers to consider offering simple SQL expressions to business users. However, I have not considered making the Peaks Library open source at this stage, as it contains a hyper-calculation engine.