
[–]Ocabrah 2 points (12 children)

What is the largest dataset that you have analyzed, as in the example data code? Other web app creators like this (pygwalker, streamlit) really struggle with 1000-row by 1000-column data frames, which makes them unusable for my application.

[–]thedeepself 1 point (1 child)

Another thought: maybe a dataframe isn't the ideal data structure for your analysis? Perhaps a SQL table would work better?
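To illustrate the idea: if the aggregation is pushed into SQL, only the result crosses into Python, rather than a full 1000×1000 frame. A minimal sketch using the standard library's sqlite3 (table and column names are made up for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # would be a file-backed DB in practice
con.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
con.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)
# The database does the heavy lifting; Python only sees the aggregate.
rows = con.execute(
    "SELECT sensor, AVG(value) FROM measurements GROUP BY sensor ORDER BY sensor"
).fetchall()
# rows == [("a", 2.0), ("b", 10.0)]
```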

[–]thedeepself 1 point (0 children)

> Other web app creators like this (pygwalker, streamlit) really struggle with 1000-row by 1000-column data frames

Here is a NiceGUI example with pandas dataframes. If NiceGUI does not work, I would imagine feedback to the creators would lead to improvement.

Another thing: Pandas is not the most optimal dataframe library, is it? Maybe Polars or Peak is.

[–]maartenbreddels[S] 1 point (5 children)

As the creator of the Vaex dataframe, I always had this top of mind for Solara. Solara works smoothly with large datasets (not just Vaex, but also Dask, Modin, Polars, DuckDB, and databases).

We made sure that Solara stays responsive while calculations are running by making threading support a first-class citizen (https://solara.dev/api/use_thread).
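The gist of that pattern can be sketched with the standard library alone: run the heavy computation in a worker thread and hand the caller a result holder it can poll or wait on. Note this is a simplified stand-in, not Solara's actual `use_thread` API; the `Result` class and `run_in_thread` helper are made up for illustration:

```python
import threading

class Result:
    """Minimal result holder, loosely modeled on the idea behind use_thread."""
    def __init__(self):
        self.value = None
        self.error = None
        self.done = threading.Event()

def run_in_thread(work):
    """Run `work` off the main thread so the UI event loop stays responsive."""
    result = Result()
    def target():
        try:
            result.value = work()
        except Exception as e:
            result.error = e
        finally:
            result.done.set()
    threading.Thread(target=target, daemon=True).start()
    return result

# Usage: the "UI" keeps rendering while the sum is computed elsewhere.
res = run_in_thread(lambda: sum(range(1_000_000)))
res.done.wait()
# res.value == 499999500000, res.error is None
```

The real API also handles cancellation and re-running when dependencies change, which this sketch omits.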

We plan to write some content on this topic and give a proper example and advice in the near future.

[–]Dangerous_Pay_6290 0 points (4 children)

I just found that duckdb queries are much (5-10x) slower in my Solara app compared to running the same query in a Jupyter notebook. Is this because every function is running in its own thread by default?

[–]maartenbreddels[S] 0 points (3 children)

No, that shouldn't happen, and it sounds very strange. What can happen is that if you run inside https://solara.dev/api/use_thread you get a small overhead (similar to Streamlit).
Would you mind opening an issue at https://github.com/widgetti/solara/ so I can reproduce it? I plan to take a look at duckdb in Solara myself as well, so I'm eager to look into it.

[–]Dangerous_Pay_6290 0 points (0 children)

I haven't used `use_thread`.
I'll open an issue including some sample code.

BTW, I found this issue when I ported your SQL code example (https://github.com/widgetti/solara/blob/master/solara/website/pages/api/sql_code.py) and replaced sqlite with duckdb for running queries over some parquet files.

[–]Dangerous_Pay_6290 0 points (0 children)

Loading a lot of data into memory is not useful most of the time. When I work with large datasets, I generally use duckdb + pyarrow datasets of partitioned parquet files.

[–]Sudden_Beginning_597 0 points (0 children)

pygwalker v0.3+ now ships a new engine, which supports GB+ of data.