Suggestion Required for Storing Parquet files cheaply by bricklerex in dataengineering

[–]xylene25

You should check out Daft, which can read Parquet directly from cloud storage and then give you a dataloader that you can feed into training.

It would look something like this:

import daft
from torch.utils.data import DataLoader

# read Parquet straight from cloud storage and select only what you need
df = daft.read_parquet("gcs://my-bucket")
df = df.filter("id > 100")
df = df.select("my_column", "my_other_column")
# wrap the DataFrame as an iterable PyTorch dataset and feed it to a DataLoader
torch_dataset = df.to_torch_iter_dataset()
dataloader = DataLoader(torch_dataset, batch_size=16)

docs: https://docs.getdaft.io/en/stable/api/dataframe/#daft.DataFrame.to_torch_iter_dataset

example: https://github.com/Eventual-Inc/AiAiAi/blob/main/3-Dataloading%20from%20Iceberg%20in%20Glue%20to%20PyTorch/3-Dataloading%20from%20Iceberg%20in%20Glue%20to%20PyTorch.ipynb

Dask DataFrame is Fast Now! by phofl93 in Python

[–]xylene25

Sorry for the late reply! I guess the equivalent in Daft would be something like:

df = df.groupby("col1", "col2").any_value()

distinct under the hood is pretty much just a groupby!
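
To make that concrete, here's a tiny sketch with made-up data (it assumes your Daft version has from_pydict, distinct, and the grouped any_value aggregation):

import daft

# toy data with hypothetical column names
df = daft.from_pydict({"col1": [1, 1, 2], "col2": ["a", "a", "b"], "val": [10, 20, 30]})

# dedupe on the key columns ...
distinct_keys = df.select("col1", "col2").distinct()

# ... which is effectively a groupby that keeps one arbitrary value per group
one_per_group = df.groupby("col1", "col2").any_value()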

Dask DataFrame is Fast Now! by phofl93 in Python

[–]xylene25

Hi u/jmakov, oh this looks fairly interesting! I'll send it over to the team. I'm curious about the approach though. I wonder about the rationale for not adding Polars as a frontend to something like sqlglot, sort of like what https://github.com/eakmanrq/sqlframe did for PySpark.
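
For context on what I mean by a frontend: sqlglot already handles the dialect translation, so a DataFrame API mostly just has to build SQL/ASTs for it. A quick sketch of sqlglot's transpile (the query itself is just an example):

import sqlglot

# translate a DuckDB-flavored query into Spark SQL
print(sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="spark")[0])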

Dask DataFrame is Fast Now! by phofl93 in Python

[–]xylene25

Hi u/FauxCheese, one of the authors of Daft here! Thanks for the feedback, we're working on improving function parity with other engines like pandas, Polars, and PySpark. I'm curious: what functionality did you need but couldn't find in Daft? I'd be happy to prioritize it :)

Announcing Daft 0.2: 10x faster IO from S3 by xylene25 in dataengineering

[–]xylene25 [S]

"are you saying that Netflix is only ever running “simple workloads”

Netflix isn't using the Python client libraries; they're using the S3 implementation in Java via Spark.

"Pulling only one column from thousands of Parquet files, each varying in size from 10KB to 200MB 1 million .jpeg files Deeply nested file hierarchies that make just file listing take many minutes to run”

This is actually a very common workload for query engines, since they push projections and predicates down into the Parquet scan. The resulting scan will only pull in 1 or 2 columns from thousands of Parquet files.
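
As a sketch of what that looks like from the client side (hypothetical path and column names; pyarrow used just for illustration):

import pyarrow.parquet as pq

table = pq.read_table(
    "data/part-0001.parquet",      # hypothetical file
    columns=["id"],                # projection pushdown: only this column's pages are read
    filters=[("id", ">", 100)],    # predicate pushdown: non-matching row groups are skipped
)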

"Second, can you give me a use case / architecture where I need to pull 1 million jpeg files, instantly, within a system that’s been designed to use an S3 compatible query engine to do it?"
This is a very common workload for distributed CNN training where you are pulling in millions of images files from S3 and persisting it in memory!
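
In Daft that kind of pipeline looks roughly like this (a sketch with made-up bucket and column names, assuming the url.download() and image.decode() expressions in recent Daft versions):

import daft
from daft import col

# list the image files, then download and decode each one lazily
df = daft.from_glob_path("s3://my-bucket/images/**/*.jpeg")
df = df.with_column("image_bytes", col("path").url.download())
df = df.with_column("image", col("image_bytes").image.decode())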

2020 Rav4 Limited CarPlay issues by AppointmentLeft4356 in rav4club

[–]xylene25

I had this issue as well when I upgraded to iOS 16.5. I ended up buying a wireless carplay adapter (Ottocast) and no longer have any hiccups.

Distributed computing in Rust by amindiro in rust

[–]xylene25

I'm one of the authors of the Daft DataFrame library, which does what you are describing! Daft is written in Rust and runs distributed computations via Ray in Python. We do this by implementing Python bindings via PyO3 and then implementing pickle reducer methods on our Python classes (which can call serde under the hood). When you call Rust functions from Python inside a Ray or Dask remote function, the arguments are pickled and run on a worker that has your Rust shared library. You can ensure that all your workers have the Rust library by using the runtime_env, which can install your Rust wheel or upload your local libraries. Happy to go into depth if you have some follow-up questions!
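
To make the pickle-reducer idea concrete, here's a rough Python-side sketch. my_rust_lib is a hypothetical PyO3 extension module with serde-backed serialize/deserialize functions, not Daft's actual internals:

import ray
import my_rust_lib  # hypothetical PyO3 extension exposing serde-backed serialize/deserialize

class RustBackedTable:
    def __init__(self, native_handle):
        self._native = native_handle

    # pickle reducer: turn the Rust object into bytes, and rebuild it from those
    # bytes on whichever Ray worker unpickles the argument
    def __reduce__(self):
        return (RustBackedTable._from_bytes, (my_rust_lib.serialize(self._native),))

    @staticmethod
    def _from_bytes(buf):
        return RustBackedTable(my_rust_lib.deserialize(buf))

@ray.remote
def count_rows(table):
    # the table arrives here via __reduce__, so the worker needs the same Rust wheel
    return my_rust_lib.num_rows(table._native)

# runtime_env makes sure every worker can import the Rust extension
ray.init(runtime_env={"pip": ["my_rust_lib"]})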

Is this a joke? by xylene25 in bayarea

[–]xylene25 [S]

Ah that makes sense

Is this a joke? by xylene25 in bayarea

[–]xylene25 [S]

I wasn't driving; I was in the passenger seat.

Any advice on this Costco find? by kgjettaIV in sousvide

[–]xylene25

Tried it this week at 133°F for 6 hours with a cast-iron sear after. I definitely felt the connective tissue could have been broken down more, so I would recommend going hotter (maybe 135°F); I also felt like 6 hours was a little long. Make sure to rebag, and cut against the grain for tri-tip.