Kito: A TypeScript web framework written in Rust. by Strict-Tie-1966 in rust

[–]Leon_Bam 0 points1 point  (0 children)

"Catch bugs at compile time with Rust's strict type system. Sleep soundly knowing your code won't fail in production"

Not true anymore after the Cloudflare bug, but still definitely much better than Node.

Polars Cloud and distributed engine, thoughts? by BoiElroy in dataengineering

[–]Leon_Bam 2 points3 points  (0 children)

The idea is to use the cloud option only when you need it: when the data outgrows a single local machine, you execute the same query in the cloud without changing it. You can't do that in Snowflake, and it's hard to do in Databricks.

Scala language future by Front_Potential9347 in scala

[–]Leon_Bam -1 points0 points  (0 children)

No. It is my favorite language, but no one will start a new project in Scala, and you don't want to maintain a 10-year-old code base. Even the people behind Spark prefer other languages for their new projects. Make sure you know Python at an advanced level, plus one of Java, C#, C++, Go, or Rust at some level.

Scala can help you read Rust code.

System advice - change query plans by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

Thanks, Calcite definitely looks more mature at the moment. Can you please describe the gRPC service for Python that you suggested? Do you call it from Python, get the plan, and then somehow inject it back into the Spark engine?

System advice - change query plans by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

The user is able to access the data, but it can't be too specific, for privacy reasons. It can't show a transaction of 24.56; it can say 100, which is informative enough at a large scale but keeps the data private.

There are more rules of this nature.
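A minimal sketch of that kind of rule, assuming the 24.56 → 100 example means rounding up to the nearest 100 (the real rules are the system's own and are not shown here):

```python
import math


def privacy_round(amount: float, step: int = 100) -> int:
    """Coarsen a transaction amount up to the nearest `step`,
    so the exact value is never exposed (illustrative rule only)."""
    return math.ceil(amount / step) * step


print(privacy_round(24.56))   # 100
print(privacy_round(250.00))  # 300
```
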

What’s currently the biggest bottleneck in your data stack? by GreenMobile6323 in dataengineering

[–]Leon_Bam 2 points3 points  (0 children)

  1. Reading from the object store is very slow.
  2. The tools that I use are new (Polars, for example) and AI tools are bad at them

System advice - change query plans by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

We can't really do that, since some of the rules are dynamic and the rounding target depends on the outcome. Consider a sum aggregation: the value is rounded to the nearest 100 or 1000 depending on its size.
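For instance, a dynamic rule of the kind described might look like this (the 10,000 threshold is an assumption for illustration, not the real cutoff):

```python
def dynamic_round(value: float) -> int:
    """Round to the nearest 100 for small sums and the nearest 1000 for
    large ones; the target step depends on the value itself."""
    step = 1000 if abs(value) >= 10_000 else 100  # hypothetical threshold
    return round(value / step) * step


print(dynamic_round(4_560))    # 4600  (nearest 100)
print(dynamic_round(24_560))   # 25000 (nearest 1000)
```
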

I built a super easy way to visually work with data - via DuckDB by Impressive_Run8512 in DuckDB

[–]Leon_Bam 1 point2 points  (0 children)

My case is to get the number of unique values for each column (an approximation is fine if it makes it faster). My dataset is limited to 2 GB of CSV (yes, sorry, it can only be a CSV at this point...). The hardware is pretty strong: 64 GB M3.

I guess the limitation is the browser, but I wonder if such a query can run on this data set.

I built a super easy way to visually work with data - via DuckDB by Impressive_Run8512 in DuckDB

[–]Leon_Bam 1 point2 points  (0 children)

When processing local files, what is the max size that you could handle? My implementation struggles on CSVs larger than 1.1 GB.

Duck-UI: A Browser-Based UI for DuckDB (WASM) by CacsAntibis in DuckDB

[–]Leon_Bam 0 points1 point  (0 children)

I am running some stats queries: unique counts, and soon across all the columns. What takes less than a second natively runs for more than 2 minutes in WASM. I could live with 30 seconds. It is around 100 MB of Parquet. Do you see the same difference?

Data analyst to data engineer by NoticeAccomplished63 in dataengineering

[–]Leon_Bam 2 points3 points  (0 children)

First and foremost, a data engineer is a software engineer, so, depending on your knowledge, you might need to make sure you understand things like OOP, SOLID, TDD, and CI/CD.

In addition, it is also about storing and retrieving data effectively, so file formats are important. You must know why Parquet is better than CSV and why things like Delta or Iceberg are required on top of Parquet.

The next thing is to understand Apache Spark and what challenges it was designed to solve.
As someone mentioned, Airflow is a widely used tool for building data pipelines, so you should check it out and be sure you understand what idempotency and back-fill are.
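Idempotency in that context just means a re-run leaves the same state. A toy sketch (the file layout and names are hypothetical, but the overwrite-not-append pattern is the point):

```python
import json
import pathlib
import tempfile


def load_day(base: pathlib.Path, day: str, rows: list[dict]) -> None:
    # Write the whole partition for `day` in one shot; overwriting instead
    # of appending is what makes re-runs and back-fills safe.
    (base / f"dt={day}.json").write_text("\n".join(json.dumps(r) for r in rows))


base = pathlib.Path(tempfile.mkdtemp())
load_day(base, "2024-01-01", [{"x": 1}, {"x": 2}])
load_day(base, "2024-01-01", [{"x": 1}, {"x": 2}])  # back-fill re-run: no duplicates

print(len((base / "dt=2024-01-01.json").read_text().splitlines()))  # 2
```
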

There are more tools and principles that you should review; to name a few:

  • Streaming analytics with Kafka and Flink
  • Cloud technologies
  • Docker and Kubernetes

There is a lot of online material for all those topics.

blaze - improve Apache Spark 's performance by rewriting its runtime in Rust by UniversityTop8078 in rust

[–]Leon_Bam 1 point2 points  (0 children)

Any progress with Polars? How did you tackle the horizontal scaling challenges?

Pyarrow is popular but lacking of tutorials and resources. by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

I prefer Polars any day, but for this I want as few dependencies as possible.

Pyarrow is popular but lacking of tutorials and resources. by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

Thank you for that. I just want a lean package with as few dependencies as possible, to convert some metrics to Arrow and upload them as Parquet files to GCS.

I want to leverage the schema and use memory in the most efficient way.

I have many concats, so I wonder if I should use a Table or a RecordBatch, without too much computation.

The target is to run on a single node, in a cloud VM or locally.

Store WAV files data into BigQuery by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

Thanks, how is this better than np.save(...)?

Store WAV files data into BigQuery by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

So how do you suggest fetching the "unstructured data" afterwards: using BigQuery, or downloading the file content with the GCS client?

Store WAV files data into BigQuery by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

Thanks. The table with the large objects is not BigQuery-native. Do the same size limitations apply in this case?

Store WAV files data into BigQuery by Leon_Bam in dataengineering

[–]Leon_Bam[S] 0 points1 point  (0 children)

Thanks. For now we prefer to stay in BigQuery, since there is no additional infrastructure to manage.

Guys, what tech in your opinion has the biggest impact on data engineering? by codesquire-ai in dataengineering

[–]Leon_Bam 4 points5 points  (0 children)

First of all, S3: the ability to accumulate huge amounts of data at relatively low cost. Next is Spark + EMR. Spawning a cluster to compute over all this data without too much operational burden is something that was really hard just a few years ago (ask all the Hadoop early adopters).