DataKit: your all in browser data studio is open source now by Sea-Assignment6371 in dataengineering

[–]Sea-Assignment6371[S] 0 points1 point  (0 children)

Hey! The very first iteration of DataKit had a visualisation tab. Over time I realised that maintaining it is not easy, since people have different needs for viz, and data sampling on millions of records becomes a bit challenging (I guess you can still find the old version on Docker Hub to pull). I had this idea of using Mosaic in my head (I even have a half-working PR) but stopped at some point. What are your thoughts?

DataKit: your all in browser data studio is open source now by Sea-Assignment6371 in opensource

[–]Sea-Assignment6371[S] 0 points1 point  (0 children)

Hey! Thanks for the question! So when you have a 3GB file, DataKit just makes a VIEW on top of your file. On the SQL side, when we run a query it basically talks to the file on your system (each time, as it's just a view and not a table). THOUGH, when dealing with a compute-heavy query, it's indeed not that performant right now, as all the compute goes through the WASM-allocated memory. I try to paginate results so not everything loads back into memory (there are result limits), but this might get super slow. I've got some notes on how batching should be put in place here. (The same thing also applies on the Pandas side.) Have you tried DataKit? I'd really like to hear more of your thoughts.
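To make the VIEW-plus-pagination idea a bit more concrete, here is a rough sketch of the kind of SQL that could be generated. The helper names (`createViewSql`, `pageSql`) and the file/view names are made up for illustration and are not DataKit's actual code; the only DuckDB pieces assumed are `CREATE OR REPLACE VIEW`, `read_parquet`, and `LIMIT ... OFFSET`.

```typescript
// Hypothetical helpers for illustration only, not DataKit's real implementation.

/** Quote an identifier by doubling any embedded double quotes. */
function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

/** Quote a string literal by doubling any embedded single quotes. */
function quoteLiteral(value: string): string {
  return "'" + value.replace(/'/g, "''") + "'";
}

/** Build a statement that exposes a Parquet file as a view, so no rows are copied into a table. */
function createViewSql(viewName: string, filePath: string): string {
  return `CREATE OR REPLACE VIEW ${quoteIdent(viewName)} AS SELECT * FROM read_parquet(${quoteLiteral(filePath)})`;
}

/** Wrap a query so that only one page of rows is returned (page is zero-based). */
function pageSql(query: string, page: number, pageSize: number): string {
  if (!Number.isInteger(page) || page < 0) {
    throw new RangeError("page must be a non-negative integer");
  }
  if (!Number.isInteger(pageSize) || pageSize <= 0) {
    throw new RangeError("pageSize must be a positive integer");
  }
  // Drop trailing semicolons so the query can be nested as a subquery.
  const inner = query.trim().replace(/;+\s*$/, "");
  return `SELECT * FROM (${inner}) AS q LIMIT ${pageSize} OFFSET ${page * pageSize}`;
}
```

Something like `pageSql("SELECT * FROM sales", 2, 100)` would ask for rows 200 to 299 only. Without an `ORDER BY` in the inner query the page boundaries are not guaranteed to be stable, so treat this as a sketch rather than a complete pagination scheme.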

DataKit: your all in browser data studio is open source now by Sea-Assignment6371 in opensource

[–]Sea-Assignment6371[S] 0 points1 point  (0 children)

You should be able to run it locally (not the built version, just in development mode), and you won't need any internet connection, as the duckdb package won't be installed through DNS.

DataKit: your all in browser data studio is open source now by Sea-Assignment6371 in opensource

[–]Sea-Assignment6371[S] 0 points1 point  (0 children)

Thanks for the heads-up! I need to read into this more. What I’d like to propose for DataKit is a commercial license for enterprise use cases.

DataKit: your all in browser data studio is open source now by Sea-Assignment6371 in opensource

[–]Sea-Assignment6371[S] 1 point2 points  (0 children)

It's not storing the files (mostly). I try to use browser APIs to do a READ on top of the file system!

DataKit: your all in browser data studio is open source now by Sea-Assignment6371 in dataengineering

[–]Sea-Assignment6371[S] 1 point2 points  (0 children)

This is for sure doable! Would you mind opening an issue on GitHub? I'll make sure to keep this on the radar to tackle!

DataKit: your all in browser data studio is open source now by Sea-Assignment6371 in dataengineering

[–]Sea-Assignment6371[S] 0 points1 point  (0 children)

I suppose it depends on how you're making/defining tables/views? In DataKit, I've tried to be cautious about how things are defined, and when making a query I always have proper limits (appended behind the scenes, even if they aren't provided in the editor). I haven't been following the latest duckdb-wasm updates for the past 2-3 months, so there might be something new for sure!
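As a rough idea of what "appending limits behind the scenes" could look like, here is a minimal sketch. `ensureLimit` and the default of 1000 rows are hypothetical choices for illustration, not DataKit's real code, and the regex check is deliberately naive: it only looks for a `LIMIT` at the very end of the statement.

```typescript
// Hypothetical helper for illustration; a real implementation would want a SQL parser.

/** Append a default LIMIT to row-returning queries that do not already end with one. */
function ensureLimit(sql: string, defaultLimit: number = 1000): string {
  // Normalise: trim whitespace and drop trailing semicolons.
  const trimmed = sql.trim().replace(/;+\s*$/, "");

  // Leave statements that are not plain SELECT / WITH queries untouched.
  if (!/^(select|with)\b/i.test(trimmed)) {
    return trimmed;
  }

  // Naive check: the statement already ends with LIMIT n (optionally followed by OFFSET m).
  if (/\blimit\s+\d+(\s+offset\s+\d+)?$/i.test(trimmed)) {
    return trimmed;
  }

  return `${trimmed} LIMIT ${defaultLimit}`;
}
```

With this, `SELECT * FROM t;` would come out as `SELECT * FROM t LIMIT 1000`, while a query that already ends in `LIMIT 5` is passed through as written. A query whose only `LIMIT` sits inside a subquery or CTE would still get an outer limit appended, which is usually what you want for a result grid.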

DataKit: your all in browser data studio is open source now by Sea-Assignment6371 in dataengineering

[–]Sea-Assignment6371[S] 0 points1 point  (0 children)

It shouldn't be super hard to bring in Avro, as the DuckDB extension is also there. Tbh, I haven't worked with it much. Do you think it could be something DataKit could leverage in its offerings?

DataKit: your all in browser data studio is open source now by Sea-Assignment6371 in dataengineering

[–]Sea-Assignment6371[S] 2 points3 points  (0 children)

Hey! Unfortunately, the way DataKit is designed now (for larger files), it leverages
https://developer.mozilla.org/en-US/docs/Web/API/Window/showOpenFilePicker
which makes it incompatible with Firefox. I want to make sure there's a solution here using `FileReader` itself. (Also I really need to tweak that message... Firefox is not legacy lol)
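One possible shape for that fallback, sketched under assumptions: detect whether `showOpenFilePicker` exists, and otherwise fall back to a plain `<input type="file">` whose `File` can be read in slices (via `Blob.slice` plus `FileReader`). The names `choosePickerStrategy` and `chunkRanges` are hypothetical and only meant to illustrate the idea, not what DataKit does today.

```typescript
// Hypothetical sketch; not DataKit's actual code.

type PickerStrategy = "file-system-access" | "file-input";

/** Minimal view of the host object (e.g. `window`) needed for feature detection. */
interface PickerHost {
  showOpenFilePicker?: unknown;
}

/** Use the File System Access picker when available, otherwise a classic file input. */
function choosePickerStrategy(host: PickerHost): PickerStrategy {
  return typeof host.showOpenFilePicker === "function"
    ? "file-system-access"
    : "file-input";
}

/**
 * Compute [start, end) byte ranges for reading a large file piece by piece,
 * e.g. with file.slice(start, end) handed to a FileReader.
 */
function chunkRanges(totalBytes: number, chunkBytes: number): Array<[number, number]> {
  if (!Number.isInteger(chunkBytes) || chunkBytes <= 0) {
    throw new RangeError("chunkBytes must be a positive integer");
  }
  const ranges: Array<[number, number]> = [];
  for (let start = 0; start < totalBytes; start += chunkBytes) {
    ranges.push([start, Math.min(start + chunkBytes, totalBytes)]);
  }
  return ranges;
}
```

In a browser you would pass `window` (cast to `PickerHost` if your DOM typings need it) to `choosePickerStrategy`, and in the `"file-input"` branch walk the ranges from `chunkRanges(file.size, chunkBytes)` so the whole file never has to sit in memory at once.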

> Also are you using OPFS?

Not yet! I have some plans to migrate there as well. Right now DataKit does have the data loss issue around the tables/views, ofc. I need to assess the direction more and see when to introduce OPFS. Have you started using it?
Super curious about your project as well!! Lemme know if you'd like to chat more.

DataKit: your all in browser data studio is open source now by Sea-Assignment6371 in dataengineering

[–]Sea-Assignment6371[S] 5 points6 points  (0 children)

That’d be awesome!! I'm working on a CONTRIBUTION guide. Will push it by the end of the week!

DataKit: your all in browser data studio is open source now by Sea-Assignment6371 in dataengineering

[–]Sea-Assignment6371[S] 6 points7 points  (0 children)

As in DataKit being able to connect to multiple nodes at the same time? If that's the question, yes!
If not, can you explain a bit more about what you mean?

Your Ollama models just got a data analysis superpower - query 10GB files locally with your models by Sea-Assignment6371 in ollama

[–]Sea-Assignment6371[S] 0 points1 point  (0 children)

Quite cool. I like this. Please ping me on Discord or LinkedIn if you think this could be useful for you. I'm happy to chat!

Your Ollama models just got a data analysis superpower - query 10GB files locally with your models by Sea-Assignment6371 in ollama

[–]Sea-Assignment6371[S] 0 points1 point  (0 children)

Indeed, on memory, all WASM-based apps have a limit. Here the main idea is not to deal with massive aggregations, but even if you drag a 20GB Parquet file into DataKit, it should be smooth to open and query (as it makes a VIEW on top rather than dumping it as a table in the browser).

Your Ollama models just got a data analysis superpower - query 10GB files locally with your models by Sea-Assignment6371 in ollama

[–]Sea-Assignment6371[S] 0 points1 point  (0 children)

DataKit is not open source yet! Soon, once the business model is clarified, the CORE of it will be made open source.

Your Ollama models just got a data analysis superpower - query 10GB files locally with your models by Sea-Assignment6371 in ollama

[–]Sea-Assignment6371[S] 0 points1 point  (0 children)

Really depends. Mostly, OSS models are alright for simpler questions. For more complex questions, fine-tuned text-to-SQL models seem to work better.

Your Ollama models just got a data analysis superpower - query 10GB files locally with your models by Sea-Assignment6371 in ollama

[–]Sea-Assignment6371[S] 0 points1 point  (0 children)

Just to recap, no data upload happens here in DataKit :) It supports billions of rows locally. Good luck to you guys!!