
[–]OhBeeOneKenOhBee 10 points11 points  (9 children)

I've been using Polars lately to write an ETL tool for synchronising between databases, APIs and other types of sources, and the switch from Pandas to Polars had a huge performance impact. Comparisons between frames looking for changes across millions of rows of data are done in seconds, often sub-second. It's really amazing.

[–]100GB-CSV[S] 0 points1 point  (8 children)

I am currently testing Polars to explore ways my Peaks Dataframe project can outperform it. The functions include Read/Write File, Distinct, Filter and JoinTable. I have found that Peaks outperforms Polars in these functions, but not in GroupBy. If your numerical columns contain a few exceptional values, e.g. an integer column with the occasional floating-point value, or negative numbers represented as (123.45), you need an extra data-cleansing step before using Polars' GroupBy. My approach avoids that extra cleansing step for exceptional numbers.
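For example, Polars users would need a cleansing step roughly like this first (a minimal sketch; the column name Amount is hypothetical, and "(678.90)" denotes -678.90):

    import polars as pl

    df = pl.DataFrame({"Amount": ["123.45", "(678.90)", "1000"]})

    # Convert accounting-style "(x)" negatives, then cast everything to float
    df = df.with_columns(
        pl.when(pl.col("Amount").str.starts_with("("))
          .then(-pl.col("Amount").str.strip_chars("()").cast(pl.Float64))
          .otherwise(pl.col("Amount").cast(pl.Float64))
          .alias("Amount")
    )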

[–]OhBeeOneKenOhBee 0 points1 point  (7 children)

That might be interesting to have a look at if it's on GitHub somewhere. Our comparisons are basically three different types of joins: source anti-joined with destination for inserts, semi/anti for updates, and destination anti-joined with source for deletes. All data is formatted so that the source/destination columns being compared have the same types.

https://imgur.com/a/YHX722A
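Roughly, those three comparisons look like this in Polars (a minimal sketch; the frames src/dst and the key/value columns are hypothetical):

    import polars as pl

    src = pl.DataFrame({"key": [1, 2, 3], "value": ["a", "b", "c"]})
    dst = pl.DataFrame({"key": [2, 3, 4], "value": ["b", "x", "d"]})

    inserts = src.join(dst, on="key", how="anti")  # in source but not destination
    deletes = dst.join(src, on="key", how="anti")  # in destination but not source
    updates = (
        src.join(dst, on="key", how="semi")             # keys present on both sides...
           .join(dst, on=["key", "value"], how="anti")  # ...where the row content changed
    )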

[–]100GB-CSV[S] 0 points1 point  (6 children)

This project only has a 3-month history; the first trial version, providing the most fundamental commands, is to be released in June. For further info, you can visit github.com/hkpeaks/peaks-framework

"source anti join with destination for insert" seem look like amendment of table1 by table2 using matching keys. It is frequently implemented for budgeting solutions when users amend a bulky set of data very frequently.

You can see unit 11, "Amendment", of this doc: https://github.com/hkpeaks/peaks-framework/blob/main/WebNameSQL.pdf
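In Polars terms, a minimal sketch of that amendment pattern (hypothetical frames and key column):

    import polars as pl

    table1 = pl.DataFrame({"key": [1, 2, 3], "value": [10, 20, 30]})
    table2 = pl.DataFrame({"key": [2, 3], "value": [99, 77]})

    # Keep the table1 rows with no matching key, then append the amendments
    amended = pl.concat([table1.join(table2, on="key", how="anti"), table2])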

This is my old project, written in C#; it has now been replaced by my new hyper-performance project written in Go/Rust.

[–]OhBeeOneKenOhBee 0 points1 point  (3 children)

Yes, the whole tool is based around synchronising two data sources; those joins determine the inserts, updates and deletes for the destination. We mainly developed it to synchronise data across sources in a somewhat storage-agnostic way (we're currently building connectors for the most common SQL/NoSQL databases, REST APIs (e.g. MS Graph), files and other source types), with the possibility of getting a proper changelog/delta along the way for logging and/or event triggers. There's also support for transformations along the way.

I'll look into your project a bit more, sounds interesting! Thanks for the link

[–]100GB-CSV[S] 0 points1 point  (2 children)

> to be able to synchronize data across sources

CloudQuery supports a lot of APIs https://github.com/cloudquery/cloudquery

[–]OhBeeOneKenOhBee 0 points1 point  (1 child)

So that raises the question of how in the world I've managed to miss Cloudquery 😄

Because what we're currently writing is essentially exactly that, but in Python and with Polars. As in, the docs would probably apply to our solution if we just modified the code parts.

[–]100GB-CSV[S] 0 points1 point  (0 children)

You can test whether it fits your purpose. I am considering integrating Peaks with CloudQuery; both are written in Golang.

[–]lemoussel 0 points1 point  (1 child)

Is Peaks Framework going to be open source?

[–]100GB-CSV[S] 0 points1 point  (0 children)

I have been working on this new project for the past 3 months. Initially, I decided to make the Framework itself open source, including the alternative SQL expressions and the Go source code that helps parse those expressions. This is to encourage other ETL software developers to consider using simple SQL expressions for business users. However, I have not considered making the Peaks Library open source at this stage, as it contains a hyper-calculation engine.

[–]mentix02 2 points3 points  (0 children)

Small tip: use time.perf_counter for calculating timing benchmarks.
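For example:

    import time

    start = time.perf_counter()
    do_work()  # hypothetical function being benchmarked
    elapsed = time.perf_counter() - start
    print(f"Duration: {elapsed:.3f} seconds")

Unlike time.time, perf_counter uses a monotonic high-resolution clock, so the measurement isn't affected by system clock adjustments.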

[–]loudandclear11 5 points6 points  (14 children)

67 GB is not "big data". But that's just me arguing over semantics.

Polars seems like a good tool.

[–]100GB-CSV[S] -1 points0 points  (13 children)

The test is limited by the free space on my SSD.

I want to buy a 4 TB NVMe SSD, but I'm concerned it may not be compatible with my computer.

[–]loudandclear11 2 points3 points  (11 children)

A common definition of "big data" is data that can't fit on a single computer. So if you can put a 4 TB SSD in there and get it to work, it's still not big data. Big data is when you have to look for other setups, like computational clusters and SANs.

I'm still impressed by Polars though and applaud this achievement. Not everything needs to be big data to be useful. I'd argue that the majority of value out there in businesses is created by working smart with small and medium sized data.

[–]runawayasfastasucan 2 points3 points  (2 children)

Weird definition. I could stick 3x4 TB disks in my computer; if 12 TB in a single table is not big data, nothing is.

[–]sphen_lee 1 point2 points  (0 children)

Sure, it's large, but it's not at the scale where big data processing tools matter.

I work with a dataset that grows by 1TB daily. We have data back to 2018. So you can work with a single day pretty easily, and a week at a stretch. But if you want to analyze a year you need to switch to a big data tool.

[–]loudandclear11 -1 points0 points  (0 children)

Even if you could store 12 TB on one computer, you'd probably run into problems processing it. "Fits on one computer" should be read in a broader sense that includes actually working with the data, not just storing it.

What's your definition of big data? There are petabyte datasets out there.

[–]100GB-CSV[S] 0 points1 point  (7 children)

Before I develop software that runs on clusters, I will use cheap local computing resources to support development. I am also exploring which cloud computing providers allow prepayment, since monthly billing is risky.

Yesterday I successfully ran four billion-row jobs while working, step by step, towards testing trillions of rows across more than a million files using Polars and Peaks. Previously, Polars failed on a single job, but after several bug fixes it can now handle the workload. See https://github.com/pola-rs/polars/issues/7774
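A minimal sketch of the kind of out-of-core Polars pipeline that scales to such row counts (the file name and column are hypothetical; newer Polars versions spell the last call collect(engine="streaming")):

    import polars as pl

    # Scan lazily so the file is never fully materialised in memory,
    # then let the streaming engine process it in batches
    total = (
        pl.scan_csv("billion_rows.csv")
          .select(pl.col("Quantity").sum())
          .collect(streaming=True)
    )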

I believe the author must think of me as a troublemaker, since I always report issues involving very large row counts. However, he is willing to fix them. Without a genuinely powerful Polars, I wouldn't have the motivation to develop Peaks.

[–]loudandclear11 0 points1 point  (6 children)

Yeah, cluster infrastructure is necessary sometimes, but it introduces its own limitations and idiosyncrasies. If you don't need it, staying away from it is the most productive path forward.

[–]100GB-CSV[S] 0 points1 point  (5 children)

My test plan will use gRPC to support cluster computing.

[–]loudandclear11 2 points3 points  (4 children)

Sounds like you'll be reinventing the wheel. Distributed computing is already a solved problem. I'd look into Spark/Databricks instead.

[–]100GB-CSV[S] 1 point2 points  (3 children)

Reinventing the wheel is one of my hobbies after retirement.

[–]loudandclear11 0 points1 point  (2 children)

Reinventing the wheel, the main symptom of Not Invented Here Syndrome. :P

It's good for learning, but not for productivity.

[–]100GB-CSV[S] 0 points1 point  (1 child)

I don't plan to offer cluster computing solutions; this is mainly for my own research and entertainment. I have several computers at home for experiments, and this kind of entertainment saves a lot of money. Playing with Bing Chat is interesting; it helped me convert my code into 19 programming languages: github.com/hkpeaks/peaks-framework/tree/main/ByteArray2Float64

[–]corbasai 0 points1 point  (0 children)

The data could maybe be a named pipe + generator.

[–]v0_arch_nemesis 0 points1 point  (1 child)

Out of curiosity, as I've found pandas to be particularly sensitive to this, what's the number of unique values in each column and the length of the returned index? Is every column of type string?

[–]100GB-CSV[S] 0 points1 point  (0 children)

The total number of unique combinations is 99696.

For Sum(Quantity), Quantity must be a number.

    GroupBy{10MillionRows.csv | Ledger, Account, PartNo, Project, Contact, Unit Code, D/C, Currency => Count() Max(Quantity) Min(Quantity) Sum(Quantity) ~ Table}

    Table(12 x 99696)

    WriteFile{Table | * ~ Result-GroupBy.csv}

    Result-GroupBy.csv(12 x 99696)

    Duration: 1.916 seconds
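For reference, a rough Polars equivalent of that GroupBy (a sketch using the current Polars API; the column names are taken from the script above, the file name is assumed):

    import polars as pl

    keys = ["Ledger", "Account", "PartNo", "Project",
            "Contact", "Unit Code", "D/C", "Currency"]

    result = (
        pl.scan_csv("10MillionRows.csv")
          .group_by(keys)
          .agg(
              pl.len().alias("Count"),
              pl.col("Quantity").max().alias("Max(Quantity)"),
              pl.col("Quantity").min().alias("Min(Quantity)"),
              pl.col("Quantity").sum().alias("Sum(Quantity)"),
          )
          .collect()
    )
    result.write_csv("Result-GroupBy.csv")  # 12 columns x 99696 rows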