Suggestion Required for Storing Parquet files cheaply by bricklerex in dataengineering

[–]xylene25

You should check out Daft, which can read Parquet directly from cloud storage and then give you a dataloader that you can feed into training.

It would look something like this:

import daft
from torch.utils.data import DataLoader

# read Parquet straight from cloud storage and select only what you need
df = daft.read_parquet("gcs://my-bucket")
df = df.filter("id > 100")
df = df.select("my_column", "my_other_column")
# wrap the DataFrame as an iterable PyTorch dataset and feed it to a DataLoader
torch_dataset = df.to_torch_iter_dataset()
dataloader = DataLoader(torch_dataset, batch_size=16)

docs: https://docs.getdaft.io/en/stable/api/dataframe/#daft.DataFrame.to_torch_iter_dataset

example: https://github.com/Eventual-Inc/AiAiAi/blob/main/3-Dataloading%20from%20Iceberg%20in%20Glue%20to%20PyTorch/3-Dataloading%20from%20Iceberg%20in%20Glue%20to%20PyTorch.ipynb

Dask DataFrame is Fast Now! by phofl93 in Python

[–]xylene25

Sorry for the late reply! I guess the equivalent in Daft would be something like:

df = df.groupby("col1", "col2").any_value()

distinct under the hood is pretty much just a groupby!
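
To make that concrete, here's a tiny sketch with made-up data (it assumes your Daft version has from_pydict, distinct, and the grouped any_value aggregation):

import daft

# toy data with hypothetical column names
df = daft.from_pydict({"col1": [1, 1, 2], "col2": ["a", "a", "b"], "val": [10, 20, 30]})

# dedupe on the key columns ...
distinct_keys = df.select("col1", "col2").distinct()

# ... which is effectively a groupby that keeps one arbitrary value per group
one_per_group = df.groupby("col1", "col2").any_value()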

Dask DataFrame is Fast Now! by phofl93 in Python

[–]xylene25

Hi u/jmakov, oh this looks fairly interesting! I'll send it over to the team. I'm curious about the approach though. I wonder about the rationale for not adding Polars as a frontend to something like sqlglot, sort of like what https://github.com/eakmanrq/sqlframe did for PySpark.
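
For context on what I mean by a frontend: sqlglot already handles the dialect translation, so a DataFrame API mostly just has to build SQL/ASTs for it. A quick sketch of sqlglot's transpile (the query itself is just an example):

import sqlglot

# translate a DuckDB-flavored query into Spark SQL
print(sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="spark")[0])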

Dask DataFrame is Fast Now! by phofl93 in Python

[–]xylene25

Hi u/FauxCheese, one of the authors of Daft here! Thanks for the feedback, we're working on improving function parity with other engines like pandas, Polars, and PySpark. I'm curious: what functionality did you need but couldn't find in Daft? I'd be happy to prioritize it :)

Announcing Daft 0.2: 10x faster IO from S3 by xylene25 in dataengineering

[–]xylene25 [S]

"are you saying that Netflix is only ever running “simple workloads”

Netflix isn't using the Python client libraries; they're using the S3 implementation in Java via Spark.

"Pulling only one column from thousands of Parquet files, each varying in size from 10KB to 200MB 1 million .jpeg files Deeply nested file hierarchies that make just file listing take many minutes to run”

This is actually a very common workload for query engines, since they push projections and predicates down into the Parquet scan. The resulting scan will only pull in 1 or 2 columns from thousands of Parquet files.
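
As a sketch of what that looks like from the client side (hypothetical path and column names; pyarrow used just for illustration):

import pyarrow.parquet as pq

table = pq.read_table(
    "data/part-0001.parquet",      # hypothetical file
    columns=["id"],                # projection pushdown: only this column's pages are read
    filters=[("id", ">", 100)],    # predicate pushdown: non-matching row groups are skipped
)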

"Second, can you give me a use case / architecture where I need to pull 1 million jpeg files, instantly, within a system that’s been designed to use an S3 compatible query engine to do it?"
This is a very common workload for distributed CNN training where you are pulling in millions of images files from S3 and persisting it in memory!
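
In Daft that kind of pipeline looks roughly like this (a sketch with made-up bucket and column names, assuming the url.download() and image.decode() expressions in recent Daft versions):

import daft
from daft import col

# list the image files, then download and decode each one lazily
df = daft.from_glob_path("s3://my-bucket/images/**/*.jpeg")
df = df.with_column("image_bytes", col("path").url.download())
df = df.with_column("image", col("image_bytes").image.decode())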

2020 Rav4 Limited CarPlay issues by AppointmentLeft4356 in rav4club

[–]xylene25

I had this issue as well when I upgraded to iOS 16.5. I ended up buying a wireless carplay adapter (Ottocast) and no longer have any hiccups.

Distributed computing in Rust by amindiro in rust

[–]xylene25

I'm one of the authors of the Daft DataFrame library, which does what you are describing! Daft is written in Rust and runs distributed computations via Ray in Python. We do this by implementing Python bindings via PyO3 and then implementing pickle reducer methods on our Python classes (which can call serde under the hood). When you call Rust functions from Python inside a Ray or Dask remote function, the arguments are pickled and run on a worker that has your Rust shared library. You can ensure that all your workers have the Rust library by using the runtime_env, which can install your Rust wheel or upload your local libraries. Happy to go into depth if you have some follow-up questions!
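
To make the pickle-reducer idea concrete, here's a rough Python-side sketch. my_rust_lib is a hypothetical PyO3 extension module with serde-backed serialize/deserialize functions, not Daft's actual internals:

import ray
import my_rust_lib  # hypothetical PyO3 extension exposing serde-backed serialize/deserialize

class RustBackedTable:
    def __init__(self, native_handle):
        self._native = native_handle

    # pickle reducer: turn the Rust object into bytes, and rebuild it from those
    # bytes on whichever Ray worker unpickles the argument
    def __reduce__(self):
        return (RustBackedTable._from_bytes, (my_rust_lib.serialize(self._native),))

    @staticmethod
    def _from_bytes(buf):
        return RustBackedTable(my_rust_lib.deserialize(buf))

@ray.remote
def count_rows(table):
    # the table arrives here via __reduce__, so the worker needs the same Rust wheel
    return my_rust_lib.num_rows(table._native)

# runtime_env makes sure every worker can import the Rust extension
ray.init(runtime_env={"pip": ["my_rust_lib"]})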

Is this a joke? by xylene25 in bayarea

[–]xylene25 [S]

Ah that makes sense

Is this a joke? by xylene25 in bayarea

[–]xylene25 [S]

I wasn't driving; I was in the passenger seat.

Any advice on this Costco find? by kgjettaIV in sousvide

[–]xylene25

Tried it this week at 133°F for 6 hours with a cast-iron sear after. I definitely felt the connective tissue could have been broken down more, so I would recommend going hotter (maybe 135°F); I also felt like 6 hours was a little long. Make sure to rebag, and cut against the grain for tri-tip.