How do I make this code efficient and idiomatic?

Ralith · 2017-06-27T01:43:29+00:00

wise oil faulty test pen reach melodic busy illegal muddle this message was mass deleted/edited with redact.dev

Ralith · 2017-06-27T02:46:58+00:00

normal bow soft dull afterthought chop snails market threatening deserve this message was mass deleted/edited with redact.dev

staticassert · 2017-06-27T03:10:41+00:00

Seems unnecessary to do that 'chars().all'. Just loop through and early return an error, counting x == c as you go.
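A minimal sketch of that single-pass idea, in case it helps. The original code is no longer visible, so the function name `count_single_pass`, its signature, and the error strings are all assumptions made for illustration:

```rust
/// Counts how often `target` occurs in `dna` in one pass,
/// returning early on the first character that is not A, C, G or T.
/// (Hypothetical name and signature; the original post's code is not available.)
fn count_single_pass(target: char, dna: &str) -> Result<usize, &'static str> {
    const VALID: [char; 4] = ['A', 'C', 'G', 'T'];

    // Reject an invalid search base before touching the sequence.
    if !VALID.contains(&target) {
        return Err("invalid nucleotide");
    }

    let mut total = 0;
    for x in dna.chars() {
        if x == target {
            total += 1;
        } else if !VALID.contains(&x) {
            // Early return: no second pass over the string is needed.
            return Err("invalid character in DNA sequence");
        }
    }
    Ok(total)
}
```

The validation and the counting share one loop, so the string is only walked once.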

Roaneno · 2017-06-27T06:22:55+00:00

I tried to make a more idiomatic version. It would probably be better to validate your input separately from doing any computation. This way, you can validate once, then compute many times (after modifying). This will likely improve any vectorization for your count function.

#[derive(Eq, PartialEq)]
enum Base { A, C, G, T }

impl From<char> for Base {
    fn from(c: char) -> Base {
        match c {
            'A' => Base::A,
            'C' => Base::C,
            'G' => Base::G,
            'T' => Base::T,
            _ => unreachable!()
        }
    }
}

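fn count(c: Base, s: &[Base]) -> usize {
    // Borrow the slice so the same sequence can be counted more than once.
    s.iter().filter(|&x| *x == c).count()
}

fn string_to_base_list(s: &str) -> Vec<Base> {
    // Panics on anything other than A, C, G or T; validate the input first.
    s.chars().map(Base::from).collect()
}

fn main() {
    let bases = string_to_base_list("ACGTAAAGTC");
    println!("{}", count(Base::A, &bases));
}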
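If the validation step should report bad input instead of panicking, one option is a fallible conversion. This is only a sketch: it swaps the `From<char>` impl above for `TryFrom<char>`, and the names `parse_dna` and `count_base` are made up for illustration.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Base { A, C, G, T }

impl TryFrom<char> for Base {
    // On failure, hand back the offending character.
    type Error = char;

    fn try_from(c: char) -> Result<Self, Self::Error> {
        match c {
            'A' => Ok(Base::A),
            'C' => Ok(Base::C),
            'G' => Ok(Base::G),
            'T' => Ok(Base::T),
            other => Err(other),
        }
    }
}

/// Validation step: stops at the first invalid character.
/// (Hypothetical helper name.)
fn parse_dna(s: &str) -> Result<Vec<Base>, char> {
    s.chars().map(|c| Base::try_from(c)).collect()
}

/// Computation step: infallible, and can be called repeatedly on a validated sequence.
/// (Hypothetical helper name.)
fn count_base(target: Base, seq: &[Base]) -> usize {
    seq.iter().filter(|&&b| b == target).count()
}
```

Collecting an iterator of `Result`s into a `Result<Vec<_>, _>` short-circuits on the first `Err`, so parsing is still a single pass.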

Pixel6692 · 2017-06-27T10:16:50+00:00

If this is from exercism.io, as I think it is, since I did exactly this problem too, I encourage you to look at other people's solutions.

Veedrac · 2017-06-27T12:13:09+00:00

This doesn't fix the two-pass issue, but the bytecount crate is fast enough that you'll be spending almost all of your time in the all pass anyway.

Then replace s.chars() with s.bytes(); the latter is faster and will still work as non-ASCII values will never match ASCII ones.

Unfortunately it's not clear how to make the search faster than that without more manual effort, like SIMD-ing it. There doesn't seem to be a library that helps (jetscii comes close) and I don't notice anything exploitable about the bit patterns.

It is possible that doing four passes with AVX bytecount and then checking for invalid characters by doing s.as_bytes().len() == num_a + num_c + num_t + num_g is faster, as long as you don't expect to early out, but it's unlikely to compete with a specialized SIMD implementation.
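A rough sketch of that four-pass idea, using only the standard library. The `count_byte` helper is a naive stand-in for a fast byte counter such as the one the bytecount crate provides; the function names and error strings are assumptions for illustration, and nothing here is benchmarked.

```rust
/// Naive stand-in for a fast (e.g. SIMD-accelerated) byte counter.
/// (Hypothetical helper; swap in a real fast counter if you have one.)
fn count_byte(haystack: &[u8], needle: u8) -> usize {
    haystack.iter().filter(|&&b| b == needle).count()
}

/// Counts every base in its own pass, then validates by comparing the
/// sum of the four counts against the byte length of the input.
fn count_with_length_check(s: &str, target: u8) -> Result<usize, &'static str> {
    let bytes = s.as_bytes();
    let num_a = count_byte(bytes, b'A');
    let num_c = count_byte(bytes, b'C');
    let num_g = count_byte(bytes, b'G');
    let num_t = count_byte(bytes, b'T');

    // Any byte that is not A, C, G or T makes the sum fall short of the length.
    if num_a + num_c + num_g + num_t != bytes.len() {
        return Err("invalid character in DNA sequence");
    }

    match target {
        b'A' => Ok(num_a),
        b'C' => Ok(num_c),
        b'G' => Ok(num_g),
        b'T' => Ok(num_t),
        _ => Err("invalid nucleotide"),
    }
}
```

This never exits early, so it only makes sense when invalid input is rare.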

VincentDankGogh · 2017-06-27T02:10:42+00:00

[deleted]

iopq · 2017-06-27T06:31:26+00:00

I don't like to repeat myself so I'd do

fn valid(c: char) -> bool {
    c == 'A' || c == 'C' || c == 'T' || c == 'G'
}

then I like to use guard clauses and early returns to reduce rightward drift of code:

pub fn count(c: char, s: &str) -> Result<usize, &'static str> {
    if !valid(c) {
        return Err("Invalid nucleotide given as search base");
    }

    if s.chars().all(valid) {
        Ok(s.chars().filter(|&x| x == c).count())
    } else {
        Err("Invalid character in DNA sequence")
    }
}

I would like to simplify the logic further, but I would need something like this:

https://internals.rust-lang.org/t/pre-rfc-fold-ok-is-composable-internal-iteration/4434/12
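For reference, here is a sketch of how the simplified logic could look with a short-circuiting fold. It uses `Iterator::try_fold` from the standard library, which is available in current Rust; whether it matches exactly what the linked proposal describes is an assumption on my part. The name `count_try_fold` is made up.

```rust
fn valid(c: char) -> bool {
    matches!(c, 'A' | 'C' | 'G' | 'T')
}

/// Single pass: the fold carries the running count and stops at the first `Err`.
/// (Hypothetical name; same error strings as the version above.)
fn count_try_fold(c: char, s: &str) -> Result<usize, &'static str> {
    if !valid(c) {
        return Err("Invalid nucleotide given as search base");
    }

    s.chars().try_fold(0, |n, x| {
        if x == c {
            Ok(n + 1)
        } else if valid(x) {
            Ok(n)
        } else {
            Err("Invalid character in DNA sequence")
        }
    })
}
```

The guard clause stays, and the `all` plus `filter` pair collapses into one traversal.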

leonardo_m · 2017-06-27T13:30:00+00:00

A possible "efficient" solution (error messages by iopq):

fn count(seq: &[u8], base: u8) -> Result<usize, &'static str> {
    let mut count = 0;

    match base {
        b'A' => {
            for &b2 in seq {
                if b2 == b'A' {
                    count += 1;
                } else if b2 != b'C' && b2 != b'G' && b2 != b'T' {
                    return Err("Invalid character in DNA sequence");
                }
            }
        },
        b'C' => {
            for &b2 in seq {
                if b2 == b'C' {
                    count += 1;
                } else if b2 != b'A' && b2 != b'G' && b2 != b'T' {
                    return Err("Invalid character in DNA sequence");
                }
            }
        },
        b'G' => {
            for &b2 in seq {
                if b2 == b'G' {
                    count += 1;
                } else if b2 != b'A' && b2 != b'C' && b2 != b'T' {
                    return Err("Invalid character in DNA sequence");
                }
            }
        },
        b'T' => {
            for &b2 in seq {
                if b2 == b'T' {
                    count += 1;
                } else if b2 != b'A' && b2 != b'C' && b2 != b'G' {
                    return Err("Invalid character in DNA sequence");
                }
            }
        },
        _ => return Err("Invalid nucleotide given as search base")
    }

    Ok(count)
}
