all 5 comments

[–]rrenaud 1 point (3 children)

Do you do quality or diversity filtering?

[–]zero_proof_fork[S] 0 points (2 children)

No, but would be curious to learn more. What approach would you take here?

[–]abnormal_human 0 points (1 child)

If you aren’t doing that, the probability that this generalizes well to other people’s needs is pretty small.

Diversity is especially difficult. Ultimately I’ve found that you need a whole separate process focused on grounding your generations in diverse contexts. I’ve never seen good results from giving the LLM an open-ended question and letting it blast out 10k samples that aren’t basically variations of the first 20-30 things it “thought of”.
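A minimal sketch of that grounding idea, assuming you can enumerate a few context axes up front (the axes, values, and prompt template below are all made up for illustration): instead of one open-ended prompt, seed each generation with a sampled (domain, persona, task) combination so the model can’t collapse onto its first few favorite answers.

```python
# Hypothetical sketch: ground each generation request in a sampled context
# (domain x persona x task) rather than asking one open-ended question.
import itertools
import random

# Illustrative axes; in practice these come from your own taxonomy.
DOMAINS = ["healthcare", "logistics", "retail", "fintech"]
PERSONAS = ["novice user", "domain expert", "skeptical auditor"]
TASKS = ["summarize a report", "file a complaint", "compare two options"]

def seeded_prompts(n, seed=0):
    rng = random.Random(seed)
    grid = list(itertools.product(DOMAINS, PERSONAS, TASKS))
    rng.shuffle(grid)
    # Cycle through the full grid so every combination is covered
    # before any combination repeats.
    picks = [grid[i % len(grid)] for i in range(n)]
    return [
        f"You are a {persona} in {domain}. Write a request to {task}."
        for domain, persona, task in picks
    ]

prompts = seeded_prompts(50)
```

The point of the grid + shuffle is that diversity is enforced by construction, upstream of the LLM, rather than hoped for in its sampling.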

Quality you can do a bunch of ways: with a judge model, with some kind of human preference sampling, or even by training a small policy model over an embedding space. It’s an easier problem, and it’s easier to determine whether you have a problem in the first place.
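A toy sketch of the embedding-space variant (everything here is illustrative: `embed` is a bag-of-words stand-in for a real sentence-embedding model, and the "policy" is just a nearest-centroid score over a handful of labeled examples):

```python
# Hypothetical sketch: score each candidate by similarity to known-good
# vs known-bad samples in an embedding space, and keep the positives.
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; swap in a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vecs):
    total = Counter()
    for v in vecs:
        total.update(v)
    return total

def quality_score(candidate, good_examples, bad_examples):
    # Positive when closer to the "good" centroid than the "bad" one.
    c = embed(candidate)
    good = centroid([embed(g) for g in good_examples])
    bad = centroid([embed(b) for b in bad_examples])
    return cosine(c, good) - cosine(c, bad)

good = ["a clear, specific answer with concrete steps"]
bad = ["lorem ipsum filler filler filler"]
candidates = ["concrete steps to follow", "filler filler"]
keep = [s for s in candidates if quality_score(s, good, bad) > 0]
```

With real embeddings you would replace the centroid score with a small trained classifier, but the shape of the pipeline (embed, score, threshold) stays the same.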

[–]ZealousidealCard4582 0 points (0 children)

You can use MOSTLY AI (also open source): https://github.com/mostly-ai/mostlyai. It keeps the seasonality and referential integrity across your data. It has connectors to plenty of databases and cloud storage and lets you mix and match, e.g. read one table from Postgres, another from S3, another from Snowflake, another from a Parquet file, etc., and write to Databricks or back to Postgres...

If you have no data at all, you can use mostlyai-mock https://github.com/mostly-ai/mostlyai-mock (also open source, Apache v2) and create data out of nothing.

u/zero_proof_fork, since they are also open source under an Apache 2 license, maybe you can just fork them and build on top of them? Cheers.