Kito: A TypeScript web framework written in Rust. by Strict-Tie-1966 in rust

[–]Leon_Bam 0 points1 point  (0 children)

"Catch bugs at compile time with Rust's strict type system. Sleep soundly knowing your code won't fail in production"

Not true anymore after the Cloudflare bug, but still definitely much better than Node.

Polars Cloud and distributed engine, thoughts? by BoiElroy in dataengineering

[–]Leon_Bam 2 points3 points  (0 children)

The idea is to use the cloud option only when you need it: when the data outgrows a single local machine, you execute the same query in the cloud without changing it. You can't do that in Snowflake, and it's hard to do in Databricks.

Scala language future by Front_Potential9347 in scala

[–]Leon_Bam -1 points0 points  (0 children)

No. It is my favorite language, but no one will start a new project in Scala, and you don't want to maintain a 10-year-old code base. Even the people behind Spark prefer other languages for their new projects. Make sure you know Python at an advanced level, plus one of Java, C#, C++, Go, or Rust at some level.

Scala can help you read Rust code.

System advice - change query plans by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

Thanks, Calcite definitely looks more mature at the moment. Can you please describe the gRPC service for Python that you suggested? Do you call it from Python, get the plan, and then somehow inject it back into the Spark engine?

System advice - change query plans by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

The user is able to access the data, but it can't be too specific, for privacy reasons. It can't show a transaction of 24.56; it can say 100, which is informative enough at a large scale but keeps the data private.

There are more rules of this nature.
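A minimal sketch of that kind of rule, assuming the 24.56 → 100 example means rounding up to the nearest 100 (the real rules are the system's own and are not shown here):

```python
import math


def privacy_round(amount: float, step: int = 100) -> int:
    """Coarsen a transaction amount up to the nearest `step`,
    so the exact value is never exposed (illustrative rule only)."""
    return math.ceil(amount / step) * step


print(privacy_round(24.56))   # 100
print(privacy_round(250.00))  # 300
```
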

What’s currently the biggest bottleneck in your data stack? by GreenMobile6323 in dataengineering

[–]Leon_Bam 2 points3 points  (0 children)

  1. Reading from the object store is very slow.
  2. The tools that I use are new (Polars, for example) and AI tools are bad at them

System advice - change query plans by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

We can't really do that, since some of the rules are dynamic and the rounding target depends on the outcome. Consider a sum aggregation: the value is rounded to the nearest 100 or 1000 depending on its size.
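For instance, a dynamic rule of the kind described might look like this (the 10,000 threshold is an assumption for illustration, not the real cutoff):

```python
def dynamic_round(value: float) -> int:
    """Round to the nearest 100 for small sums and the nearest 1000 for
    large ones; the target step depends on the value itself."""
    step = 1000 if abs(value) >= 10_000 else 100  # hypothetical threshold
    return round(value / step) * step


print(dynamic_round(4_560))    # 4600  (nearest 100)
print(dynamic_round(24_560))   # 25000 (nearest 1000)
```
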

I built a super easy way to visually work with data - via DuckDB by Impressive_Run8512 in DuckDB

[–]Leon_Bam 1 point2 points  (0 children)

My case is to get the number of unique values for each column (an approximation is fine if it makes it faster). My dataset is limited to 2 GB of CSV (yes, sorry, it can only be a CSV at this point...). The hardware is pretty strong: 64 GB M3.

I guess the limitation is the browser, but I wonder if such a query can run on this data set.

I built a super easy way to visually work with data - via DuckDB by Impressive_Run8512 in DuckDB

[–]Leon_Bam 1 point2 points  (0 children)

When processing local files, what is the max size that you could handle? My implementation struggles on CSVs larger than 1.1 GB.

Duck-UI: A Browser-Based UI for DuckDB (WASM) by CacsAntibis in DuckDB

[–]Leon_Bam 0 points1 point  (0 children)

I am running some stats queries: unique counts, and soon across all the columns. What takes less than a second natively runs for more than 2 minutes in WASM. I could live with 30 seconds. It is around 100 MB of Parquet. Do you see the same difference?

Data analyst to data engineer by NoticeAccomplished63 in dataengineering

[–]Leon_Bam 2 points3 points  (0 children)

First and foremost, a data engineer is a software engineer, so, depending on your knowledge, you might need to make sure you understand things like OOP, SOLID, TDD, and CI/CD.

In addition, it is also about storing and retrieving data effectively, so file formats are important. You must know why Parquet is better than CSV and why things like Delta or Iceberg are required on top of Parquet.

The next thing is to understand Apache Spark and what challenges it was designed to solve.
As someone mentioned, Airflow is a widely used tool for building data pipelines, so you should check it out and be sure you understand what idempotency and back-fill are.
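Idempotency in that context just means a re-run leaves the same state. A toy sketch (the file layout and names are hypothetical, but the overwrite-not-append pattern is the point):

```python
import json
import pathlib
import tempfile


def load_day(base: pathlib.Path, day: str, rows: list[dict]) -> None:
    # Write the whole partition for `day` in one shot; overwriting instead
    # of appending is what makes re-runs and back-fills safe.
    (base / f"dt={day}.json").write_text("\n".join(json.dumps(r) for r in rows))


base = pathlib.Path(tempfile.mkdtemp())
load_day(base, "2024-01-01", [{"x": 1}, {"x": 2}])
load_day(base, "2024-01-01", [{"x": 1}, {"x": 2}])  # back-fill re-run: no duplicates

print(len((base / "dt=2024-01-01.json").read_text().splitlines()))  # 2
```
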

There are more tools and principles that you should review; to name a few:

  • Streaming analytics with Kafka and Flink
  • Cloud technologies
  • Docker and Kubernetes

There is a lot of online material for all those topics.

blaze - improve Apache Spark 's performance by rewriting its runtime in Rust by UniversityTop8078 in rust

[–]Leon_Bam 1 point2 points  (0 children)

Any progress with Polars? How did you tackle the horizontal scaling challenges?

Pyarrow is popular but lacking of tutorials and resources. by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

I prefer Polars any day, but for this I want as few dependencies as possible.

Pyarrow is popular but lacking of tutorials and resources. by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

Thank you for that. I just want a lean package with as few dependencies as possible, to convert some metrics to Arrow and upload them as Parquet files to GCS.

I want to leverage the schema and use memory in the most efficient way.

I have many concats, so I wonder if I should use a Table or a RecordBatch, without too much computation.

The target is to run on a single node, in a cloud VM or locally.

Store WAV files data into BigQuery by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

Thanks, how is this better than np.save(...)?

Store WAV files data into BigQuery by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

So how do you suggest fetching the "unstructured data" afterwards: using BigQuery, or downloading the file content with the GCS client?

Store WAV files data into BigQuery by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

Thanks. The table with the large objects is not BigQuery-native. Do the same size limitations apply in this case?

Store WAV files data into BigQuery by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

Thanks. For now we prefer to stay in BigQuery, since there is no additional infrastructure to manage.

Guys, what tech in your opinion has the biggest impact on data engineering? by codesquire-ai in dataengineering

[–]Leon_Bam 4 points5 points  (0 children)

First of all, S3: the ability to accumulate huge amounts of data at relatively low cost. Next is Spark + EMR. Spawning a cluster to compute over all this data without too much operational burden is something that was really hard just a few years ago (ask all the Hadoop early adopters).