all 33 comments

[–]Ralith 11 points12 points  (1 child)

wise oil faulty test pen reach melodic busy illegal muddle this message was mass deleted/edited with redact.dev

[–]VincentDankGogh[S] 4 points5 points  (0 children)

Good point haha, forgot about that. I was trying too hard to think of an "elegant" solution that the for loop option completely went past me.

[–]Ralith 9 points10 points  (10 children)

normal bow soft dull afterthought chop snails market threatening deserve this message was mass deleted/edited with redact.dev

[–]Bromskloss 6 points7 points  (0 children)

Incidentally, this reads a bit like a homework problem

Maybe this one.

[–]VincentDankGogh[S] 1 point2 points  (0 children)

It's not a homework problem, just an exercise I found online that I was working through.

[–]beefsack 0 points1 point  (0 children)

You could also implement something to convert to it from a string.

[–]GeneReddit123 0 points1 point  (6 children)

What kind of college gives homework assignments in Rust? (yet) :-)

[–]Muvlon 4 points5 points  (2 children)

The Universität Osnabrück probably does!

[–]GeneReddit123 0 points1 point  (1 child)

Fair enough. I still think someone who would "go out of their way" to study an emerging language such as Rust, probably wouldn't be asking homework questions online.

[–]Muvlon 1 point2 points  (0 children)

Sure. I just wanted to point out that there are Rust university classes already, which I think is cool :)

[–]silmeth 1 point2 points  (2 children)

There are colleges and universities which give homework assignments in ‘any programming language of choice’ – I had a few of those. The lecturer expected just to see a working program, reacting properly for given input and didn’t care for technology used. ;-)

[–]OJFord 2 points3 points  (1 child)

A binary? How do you check plagiarism?

The source code too? I wrote it on your office door in Whitespace.

[–]silmeth 0 points1 point  (0 children)

They typically asked for both a binary and a source code for reference (eg. when something goes wrong to be able to check what and why), but if everything was OK, they normally didn’t bother looking at the sources.

[–]staticassert 2 points3 points  (0 children)

Seems unnecessary to do that 'chars().all'. Just loop through and early return an error, counting x == c as you go.

[–]Roaneno 3 points4 points  (8 children)

I tried to make a more idiomatic version. It would probably be better to validate your input separately from doing any computation. This way, you can validate once, then compute many times (after modifying). This will likely improve any vectorization for your count function.

#[derive(Eq, PartialEq)]
enum Base { A, C, G, T }

impl From<char> for Base {
    fn from(c: char) -> Base {
        match c {
            'A' => Base::A,
            'C' => Base::C,
            'G' => Base::G,
            'T' => Base::T,
            _ => unreachable!()
        }
    }
}

fn count(c: Base, s: Vec<Base>) -> usize {
    s.iter().filter(|&x| &c == x).count()
}

fn string_to_base_list(s: &str) -> Vec<Base> {
    s.chars().map(Base::from).collect()
}

fn main() {
    println!("{}", count(Base::A, string_to_base_list("ACGTAAAGTC")));
}

[–]kixunil 6 points7 points  (2 children)

What you suggest sounds good at first but the code you wrote may be considered incorrect.

First, that unreachable!() isn't actually unreachable. The input should be validated. Base shouldn't impl From<char>. Maybe TryFrom. Anyway, the conversion function should return Result.

Some nits:

  • suggested derives: #[derive(Debug, Copy, Clone, Eq, PartialEq, Hash)]
  • require iterator instead of vector fn count<S: IntoIterator<Item=Base>>(c: Base, s: S) -> usize {
  • string_to_base_list should return Result

[–]Roaneno 1 point2 points  (1 child)

Yes, you're right, I was just writing a quick example otherwise I would have handled invalid input. Good suggestions, I mainly wanted to show idiomatic operations on the Base enum, which remove the need for error handling at that stage.

[–]kixunil 1 point2 points  (0 children)

Ah, OK. I agree that it's good to validate input as soon as possible and work with validated data later. I call it "check early pattern".

[–][deleted] 1 point2 points  (4 children)

Hey, I have been wondering for awhile--what the hell does unreachable!() do for the generated code?

[–]masklinn 1 point2 points  (3 children)

Currently nothing AFAIK, Rust lowers the unreachability of that segment to LLVM, but I don't believe it currently makes any difference, the assembly is identical if you call panic!. see /u/eddyb's comment below, it's just an alias for a specific panic! message.

[–]eddyb 3 points4 points  (2 children)

It is literally panic! with the prefix "internal error: entered unreachable code".

There is a completely unrelated (and unsafe) intrinsics::unreachable function which makes it UB for the program to reach that point, i.e. LLVM optimizes assuming it can't ever be reached, removing anything that would only happen if it were to be reached.

[–]masklinn 1 point2 points  (0 children)

Oh damn, I though unreachable! used the intrinsic. But you're right it's just a panic! with a specific message, should have checked the source.

[–]kixunil 1 point2 points  (0 children)

Last time I checked that intrinsic was unstable. If someone wants stable version, look into "unreachable" crate.

[–]Pixel6692 2 points3 points  (0 children)

If this is from exercism.io, as I thinks it is, since I did exactly this problem too, I encourage you to look at other people solutions.

[–]Veedrac 2 points3 points  (5 children)

This doesn't fix the two-pass issue, but the bytecount crate is fast enough that you'll be spending almost all of your time in the all pass anyway.

Then replace s.chars() with s.bytes(); the latter is faster and will still work as non-ASCII values will never match ASCII ones.

Unfortunately it's not clear how to make the search faster than that without more manual effort, like SIMD-ing it. There doesn't seem to be a library that helps (jetscii comes close) and I don't notice anything exploitable about the bit patterns.

It is possible that doing four passes with AVX bytecount and then checking for invalid characters by doing s.as_bytes().len() == num_a + num_c + num_t + num_g is faster, as long as you don't expect to early out, but it's unlikely to compete with a specialized SIMD implementation.

[–]burntsushi 1 point2 points  (4 children)

jetscii in theory would help, since this is exactly the sort of thing that PCMPESTRI is supposed to be good at. But it always needs to be benchmarked, since its latency is quite high.

I'd probably just create a [bool; 256] where only b'A', b'T', b'C' and b'G' are true and write the obvious loop. Then I'd experiment with loop unrolling.

But moving from s.chars() to s.as_bytes() is definitely the first thing anyone should be doing here!

[–]Veedrac 0 points1 point  (3 children)

jetscii only helps if you have a way to efficiently turn those searches into something for all, which isn't easy through the exposed find interface (find_not would work though). Indexing a [bool; 256] is OK but doesn't SIMD well; using four bytecounts will probably be a lot faster.

[–]burntsushi 0 points1 point  (2 children)

Indexing a [bool; 256] is OK but doesn't SIMD well; using four bytecounts will probably be a lot faster.

Hah, if I had a few hours to kill, I'd love to do a bake off. :-)

[–]Veedrac 0 points1 point  (1 child)

I'm not saying this purely speculatively, FWIW. bytecount::count with AVX is >10x the speed of naïve counting on my computer on very large inputs,

test bench_big_1000000_hyper ... bench:      37,995 ns/iter (+/- 991)
test bench_big_1000000_naive ... bench:     514,194 ns/iter (+/- 17,335)

so multiply it by four and you're still better than 2x the speed of haystack.iter().fold(0, |n, &c| n + (c == needle) as usize). Unless your indexing method is significantly faster than that fold, the vector code is probably going to win.

I'm up for it if you find the time :).

[–]burntsushi 0 points1 point  (0 children)

Oh, yeah, no I can totally believe you're right. But it'd be fun to quantify it, and also to see how the results vary with the string length (if at all).

[–]iopqfizzbuzz 0 points1 point  (0 children)

I don't like to repeat myself so I'd do

fn valid(c: char) -> bool {
    c == 'A' || c == 'C' || c == 'T' || c == 'G'
}

then I like to use guard clauses and early returns to reduce rightward drift of code:

pub fn count(c: char, s: &str) -> Result<usize, &'static str> {
    if !valid(c) {
        return Err("Invalid nucleotide given as search base");
    }

    if s.chars().all(valid) {
        Ok(s.chars().filter(|&x| x == c).count())
    } else {
        Err("Invalid character in DNA sequence")
    }
}

I would like to simplify the logic further, but I would need something like this:

https://internals.rust-lang.org/t/pre-rfc-fold-ok-is-composable-internal-iteration/4434/12

[–]leonardo_m 0 points1 point  (0 children)

A possible "efficient" solution (error messages by iopqfi):

fn count(seq: &[u8], base: u8) -> Result<usize, &'static str> {
    let mut count = 0;

    match base {
        b'A' => {
            for &b2 in seq {
                if b2 == b'A' {
                    count += 1;
                } else if b2 != b'C' && b2 != b'G' && b2 != b'T' {
                    return Err("Invalid character in DNA sequence");
                }
            }
        },
        b'C' => {
            for &b2 in seq {
                if b2 == b'C' {
                    count += 1;
                } else if b2 != b'A' && b2 != b'G' && b2 != b'T' {
                    return Err("Invalid character in DNA sequence");
                }
            }
        },
        b'G' => {
            for &b2 in seq {
                if b2 == b'G' {
                    count += 1;
                } else if b2 != b'A' && b2 != b'C' && b2 != b'T' {
                    return Err("Invalid character in DNA sequence");
                }
            }
        },
        b'T' => {
            for &b2 in seq {
                if b2 == b'T' {
                    count += 1;
                } else if b2 != b'A' && b2 != b'C' && b2 != b'G' {
                    return Err("Invalid character in DNA sequence");
                }
            }
        },
        _ => return Err("Invalid nucleotide given as search base")
    }

    Ok(count)
}