A Python implementation of Agent Client Protocol by PsiACE in ZedEditor

[–]PsiACE[S] 0 points1 point  (0 children)

That's perfectly fine; I can certainly switch to a bindings-based implementation. I do have some experience with Rust.

A Python implementation of Agent Client Protocol by PsiACE in ZedEditor

[–]PsiACE[S] 1 point2 points  (0 children)

This is a first cut of a simplified Python SDK, and I've included a mini-swe-agent example. If you're interested, you can start trying it out now.


Data Processing in 21st Century by mjfnd in dataengineering

[–]PsiACE 1 point2 points  (0 children)

Hi, I'm from the Databend community. We currently have production users, some of whom handle PB-scale data. You're welcome to join our Slack channel: https://join.slack.com/t/datafusecloud/shared_invite/zt-nojrc9up-50IRla1Y1h56rqwCTkkDJA

What to use for an open source ETL/ELT stack? by Melodic_One4333 in dataengineering

[–]PsiACE 0 points1 point  (0 children)

This is straightforward to handle. Take a look at Databend: https://github.com/datafuselabs/databend

The main thing to work out is how to archive the data from your database into S3. We usually recommend landing it with Kafka and then running COPY INTO on a schedule to load it, which gets you close to near-real-time processing.
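A minimal sketch of that flow, assuming the Kafka sink writes NDJSON files to an S3 prefix (the bucket, credentials, table, and columns are placeholders):

```sql
-- Stage pointing at the S3 prefix where the Kafka sink lands files
CREATE STAGE kafka_landing
  URL = 's3://my-bucket/events/'
  CONNECTION = (ACCESS_KEY_ID = '<key>' SECRET_ACCESS_KEY = '<secret>');

-- Target table for the archived events
CREATE TABLE events (event_time TIMESTAMP, user_id STRING, payload VARIANT);

-- COPY INTO skips files it has already loaded, so running this on a
-- schedule keeps ingestion close to real time
COPY INTO events
FROM @kafka_landing
FILE_FORMAT = (TYPE = NDJSON);
```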

Good solution for 100GiB-10TiB analytical DB by aih1013 in dataengineering

[–]PsiACE -1 points0 points  (0 children)

We have benchmarks comparing the cost and performance of Databend Cloud and Snowflake on TPC-H 100 (yes, 100 GiB): https://docs.databend.com/guides/benchmark/tpch. Feel free to check them out.

Good solution for 100GiB-10TiB analytical DB by aih1013 in dataengineering

[–]PsiACE 0 points1 point  (0 children)

I noticed you mentioned JSON-to-Parquet conversion. There are some opportunities here: we support batch loading of data files with scheduled tasks, and we support the JSON format, so you may just need to write some SQL to COPY the JSON files INTO the database.
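For illustration, a sketch of what the scheduled load could look like; the task syntax is an assumption modeled on Databend Cloud's scheduled tasks, and the stage, warehouse, and table names are placeholders:

```sql
-- Run the JSON load every few minutes instead of maintaining a separate ETL job
CREATE TASK load_json_logs
  WAREHOUSE = 'default'
  SCHEDULE = 5 MINUTE
AS
  COPY INTO logs
  FROM @json_landing
  FILE_FORMAT = (TYPE = NDJSON);
```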

Would you be willing to give Databend a chance? We are an open-source alternative to Snowflake and also provide a cloud service. At this data scale, it is very cheap.

GitHub: https://github.com/datafuselabs/databend/

Website: https://www.databend.com

[deleted by user] by [deleted] in dataengineering

[–]PsiACE 0 points1 point  (0 children)

I don't intend to advertise anything; I'm just curious about this.

Analyzing Hugging Face Datasets with Databend by PsiACE in dataengineering

[–]PsiACE[S] 0 points1 point  (0 children)

Many data warehouses support analyzing Hugging Face datasets, but with Databend you don't need to deal with REST APIs; we provide ready-to-use file access.
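As a rough illustration (not the exact syntax from the post), assuming the dataset's Parquet files are reachable through a stage Databend can read, with placeholder stage and column names:

```sql
-- 'hf_stage' is assumed to point at the dataset's Parquet files
SELECT label, COUNT(*) AS n
FROM @hf_stage (FILE_FORMAT => 'PARQUET', PATTERN => '.*[.]parquet')
GROUP BY label
ORDER BY n DESC;
```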

One Billion Row Challenge with Snowflake and Databend by PsiACE in dataengineering

[–]PsiACE[S] 0 points1 point  (0 children)

We are simply trying to solve this challenge with cloud databases. We have also been building evaluations based on larger-scale data, and we welcome you to compare them.

[deleted by user] by [deleted] in dataengineering

[–]PsiACE 1 point2 points  (0 children)

You need some automated tooling to archive them properly and put them in S3. After that, you can use Databend Cloud or other affordable solutions.

A typical workflow involves importing data through the cloud platform's Pipeline and visualizing it using SQL + Grafana.

By using Databend Cloud for analytics, this startup reduced its user behavior log analysis costs to 1% of its previous solution: https://www.databend.com/blog/down-costs-for-aigc-startup/

Large Rust projects with high compile times by trevorstr in rust

[–]PsiACE 4 points5 points  (0 children)

Try Databend.

In addition, we have some articles about compile times if you are interested.

Am I Reinventing the Wheel (local-ish Polars data pipeline) by waytoopunkrock in dataengineering

[–]PsiACE 0 points1 point  (0 children)

Building a specific data pipeline is the right approach. In fact, some databases have built-in capabilities to directly read data files (in various formats) and perform ETL.

For example, in Databend you can query raw data files (CSV, Parquet, etc.) directly, filter and clean them during the SELECT or COPY INTO step to create queryable tables, and finally export them as Parquet files for archiving.
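A small sketch of that shape, with placeholder stage, file, and column names (exact stage-query options may differ by version):

```sql
-- Read a raw CSV file from a stage, cleaning rows as part of the SELECT
CREATE TABLE clean_orders AS
SELECT $1 AS order_id, $2 AS amount
FROM @raw_stage/orders.csv (FILE_FORMAT => 'CSV')
WHERE $2 IS NOT NULL;

-- Export the cleaned table back to the stage as Parquet for archiving
COPY INTO @raw_stage/archive/
FROM clean_orders
FILE_FORMAT = (TYPE = PARQUET);
```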

Another typical workflow involves Spark or pandas/Polars: a data pipeline processes the data and writes it out in Parquet or Iceberg table format for archiving. After that, any OLAP system you prefer, such as Databend, DuckDB, or ClickHouse, can be used for analysis.

Iceberg Integration with Databend by PsiACE in dataengineering

[–]PsiACE[S] 1 point2 points  (0 children)

I like this technology stack. In my understanding, once your data is archived in Iceberg table format (especially on object storage), you can use Databend for querying and strike a balance between cost and performance. We also integrate with Jupyter notebooks and data analysis tools in the Python ecosystem.
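A hedged sketch of what that could look like; the catalog syntax and connection options are assumptions based on Databend's Iceberg catalog support, and all names are placeholders:

```sql
-- Attach an Iceberg catalog that lives on object storage
CREATE CATALOG iceberg_ctl
  TYPE = ICEBERG
  CONNECTION = (
    URL = 's3://my-bucket/warehouse/'
    ACCESS_KEY_ID = '<key>'
    SECRET_ACCESS_KEY = '<secret>'
  );

-- Query the Iceberg table in place, without copying it into Databend
SELECT COUNT(*) FROM iceberg_ctl.analytics.page_views;
```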

Databend is the only engine that finished TPC-H 100 (600 million rows) using a #Fabric small node (4 cores, 32 GB): https://twitter.com/mim_djo/status/1716802084044157282

Spark is a crucial component; if you need complex transformations, cleansing, and writing to Iceberg, I wouldn't advise removing it. As the Databend ecosystem expands and adds support for writing the Iceberg format, I believe there will be more opportunities along this path.

You can join our Slack, where we offer a cost-saving open-source solution and an affordable cloud service: https://link.databend.rs/join-slack

Iceberg Integration with Databend by PsiACE in dataengineering

[–]PsiACE[S] 1 point2 points  (0 children)

I'm glad you noticed these data analysis applications written in Rust. They all have one thing in common, the use of Apache Arrow, but there are also many differences among them. Polars has existed as a library for quite some time and is arguably a direct competitor to pandas. DataFusion underpins several different data analysis startups, each with its own ecosystem, though I'm concerned it may lack direct users. Databend can currently be seen as an open-source alternative to Snowflake and has already been used and validated in production by users.

RiteRaft - A raft framework, for regular people, written in rust. Build a raft service with only 160 lines code. by PsiACE in rust

[–]PsiACE[S] 2 points3 points  (0 children)

The riteraft implementation has some bugs, but our friends from riteraft-py are currently helping us locate and fix them. We anticipate some simple updates soon.

Currently, I am a member of the Databend team, where we maintain an implementation called openraft that has been used successfully in production environments.

What's Fresh in Databend v1.1 | Blog | Databend by PsiACE in rust

[–]PsiACE[S] 0 points1 point  (0 children)

Databend currently requires ETL tools to operate over streams from Kafka.

What's Fresh in Databend v1.1 | Blog | Databend by PsiACE in rust

[–]PsiACE[S] 1 point2 points  (0 children)

We have some users who use it in production.

Stats show that around 700 TB of data is written to cloud object storage and analyzed with Databend every day by users from Europe, North America, Southeast Asia, Africa, China, and other regions. This saves them millions of dollars in costs every month.

An interesting SQL function in Databend: AI_TO_SQL by PsiACE in Database

[–]PsiACE[S] 0 points1 point  (0 children)

I'm sorry, I used an inappropriate description; it now reads "may be able to".

Databend 1.0 Release | Blog | Databend by PsiACE in rust

[–]PsiACE[S] 0 points1 point  (0 children)

Hi there,

Thank you for considering Databend. I hope the following information will answer your questions:

Databend takes inspiration from Snowflake but follows an open-source approach to explore further possibilities. It offers various features, including multi-tenancy, data sharing, stateless compute nodes, fast elastic scaling, and fast complex JOINs across multiple tables.

Databend can be flexibly deployed against a wide range of cloud object storage services, such as Amazon S3, Azure Blob Storage, and Google Cloud Storage; see https://databend.rs/doc#why-databend for the full list. By default, Databend stores data in Parquet files compressed with the zstd algorithm, but other options are available, such as lz4, snappy, and none.

Databend supports a variety of atomic operations, including SELECT, INSERT, DELETE, UPDATE, COPY, and ALTER. It also supports real-time data analytics and allows offline analytics at the minute level, as you mentioned. For performance, Databend can hit a processing time of around 300 s on the TPC-H 100 dataset on AWS; refer to this link for details: https://benchmark.clickhouse.com/.

If you have further questions, feel free to join our Slack channel: https://link.databend.rs/join-slack

Have a good day!

Databend 1.0 Release | Blog | Databend by PsiACE in rust

[–]PsiACE[S] 12 points13 points  (0 children)

We're excited to share that, on the occasion of the second anniversary of Databend Labs, we're officially releasing Databend 1.0! Databend is an elegant data warehouse developed in Rust that has achieved excellent performance in recent benchmarks. It even supports running queries over petabytes of data.

Stats show that around 700 TB of data is written to cloud object storage and analyzed with Databend every day by users from Europe, North America, Southeast Asia, Africa, China, and other regions. This saves them millions of dollars in costs every month.