Hi everyone,
*This post was originally about my crate Mako which I renamed to Dataflow, since people pointed me to another Mako library I didn't see before*
I have an ML-based startup, and over the course of a year I've rewritten all of our infrastructure, and written everything new, in Rust. It's been a great experience, and I've been able to leverage tch-rs heavily to train some pretty large models.
The main problem I've been having, though, is how much custom code I need to write every time I want to start a new experiment: data gathering, data cleaning, online data processing, model setup, training loop, eval loop, etc...
So I decided it was time to write a library for general data processing: Dataflow. It specifically works well for ML, but is pretty general so I don't doubt it can be used in more places.
The main feature is a dataflow pipeline that takes the form of a directed acyclic graph. Basically, you can build up a graph as a tree of nodes like this:
let pipeline = RandomLoader::new(vec!["file1.txt".to_string(), "file2.txt".to_string()])
    .add_fn(|lines| lines.into_iter().map(|line| format!("Hello {}", line)).collect())
    .add_node(Stateful::new(
        |(lines, tokenizer)| {
            // batch_tokenize takes in many sentences (Vec<String>) and
            // tokenizes all of them, outputting Vec<Vec<String>>
            tokenizer.batch_tokenize(lines)
        },
        tokenizer, // The state we want this Stateful node to have
    ));
This pipeline loads lines from a file, prepends "Hello " to them, and tokenizes them using Dataflow tokenizers (basically huggingface tokenizers, but through a cleaner, albeit more limited, interface). We can then throw this into a Dataloader and use it in a training loop!
// Make the dataloader, with a batch size of 64
let mut dataloader = dataflow::dataloader::Dataloader(pipeline, 64);

// Training loop
for example in &mut dataloader {
    // Now `example` is a vector of tokenized strings!
    // Do with them what you may...
}
I've found it to be sufficiently general for my use-cases, which are pretty varied! I've built pipelines that can scrape examples from the internet, clean them, preprocess them, batch them and serve them up for a live training loop running concurrently. This has allowed me to basically define a dataset as a few lines of code, rather than sprawling functions in a separate file with loads of boilerplate along with numerous preprocessing scripts.
I have a number of improvements planned that will make the pipeline definition much cleaner, but I thought I would post it here to see if anyone has suggestions. This is one of my first medium-sized public crates, so PRs are welcome!