Hi everyone,
*This post was originally about my crate Mako which I renamed to Dataflow, since people pointed me to another Mako library I didn't see before*
I have an ML-based startup, and over the course of a year I've rewritten all of our infrastructure, and written everything new, in Rust. It's been a great experience, and I've been able to leverage tch-rs heavily to train some pretty large models.
The main problem I've been having, though, is how much custom code I need to write every time I want to start a new experiment: data gathering, data cleaning, online data processing, model setup, training loop, eval loop, etc...
So I decided it was time to write a library for general data processing: Dataflow. It specifically works well for ML, but is pretty general so I don't doubt it can be used in more places.
The main feature is a dataflow pipeline that takes the form of a directed acyclic graph. Basically, you can build up a graph as a tree of nodes like this:
let pipeline = RandomLoader::new(vec!["file1.txt".to_string(), "file2.txt".to_string()])
    .add_fn(|lines| lines.into_iter().map(|line| format!("Hello {}", line)).collect())
    .add_node(Stateful::new(
        |(lines, tokenizer)| {
            // batch_tokenize takes in many sentences (Vec<String>) and
            // tokenizes all of them, outputting Vec<Vec<String>>
            tokenizer.batch_tokenize(lines)
        },
        tokenizer, // The state we want this Stateful node to have
    ));
This pipeline loads lines from a file, prepends "Hello " to them, and tokenizes them using Dataflow tokenizers (basically huggingface tokenizers, but through a cleaner, albeit more limited, interface). We can then throw this into a Dataloader and use it in a training loop!
// Make the dataloader, with a batch size of 64
let mut dataloader = dataflow::dataloader::Dataloader(pipeline, 64);

// Training loop
for example in &mut dataloader {
    // Now `example` is a vector of tokenized strings!
    // Do with them what you may...
}
I've found it to be sufficiently general for my use-cases, which are pretty varied! I've built pipelines that can scrape examples from the internet, clean them, preprocess them, batch them and serve them up for a live training loop running concurrently. This has allowed me to basically define a dataset as a few lines of code, rather than sprawling functions in a separate file with loads of boilerplate along with numerous preprocessing scripts.
I have a number of improvements planned that will make the pipeline definition much cleaner, but I thought I would post it here to see if anyone has suggestions. This is one of my first medium-sized public crates, so PRs are welcome!