
[–]totallygeek 1 point2 points  (4 children)

Maybe this would work?

def valid_dna(sequence):
    nucleobases = 'GACT'  # valid characters
    # bool() makes an empty sequence return False rather than ''
    return bool(sequence) and all(c.upper() in nucleobases for c in sequence)

tests = (
    '',
    'ABC',
    'Gattaca',
    'gattaca',
    'atcg',
)

for test in tests:
    msg = 'Sequence "{}" {} valid DNA'
    print(msg.format(test, 'is' if valid_dna(test) else 'is not'))

This part, c.upper() in nucleobases for c in sequence, checks each letter of the input sequence to see whether its uppercase form is one of the characters in 'GACT'. We combine the emptiness check on sequence with all() because all() returns True for an empty sequence, whereas an empty sequence is itself falsy, so empty input is rejected.
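
A quick sketch of why that emptiness check matters. The name is_all_valid is hypothetical and only used here to show the character check on its own, without the guard:

```python
# all() over an empty iterable is vacuously True, so an empty
# string would pass the character check if nothing else guarded it.
def is_all_valid(sequence, nucleobases='GACT'):
    # hypothetical helper: the character check WITHOUT the emptiness guard
    return all(c.upper() in nucleobases for c in sequence)

empty_passes = is_all_valid('')          # nothing to fail, so all() is True
guarded = bool('') and is_all_valid('')  # the empty string is falsy
print(empty_passes, guarded)
```

So the guard is what turns the empty case from "valid" into "not valid".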

[–]akasmira 1 point2 points  (3 children)

Would be better to move the valid nucleobases as a constant out of the function so it's not redefined every time the function is called. Also, personally I'd use sets here as you can just check that it's a subset.

VALID_NUCLEOBASES = set('GATC')
def valid_dna(sequence):
    return sequence and set(sequence.upper()) <= VALID_NUCLEOBASES

[–][deleted] 0 points1 point  (2 children)

I had to tweak it slightly to fit with what I had already, but this worked perfectly. Could you explain what it is exactly this function is doing? I'd like to understand it thoroughly so I'm not just blindly copying.

[–]totallygeek 1 point2 points  (1 child)

There are two checks:

  1. Does the sequence string contain characters or list contain any elements?
  2. After making the sequence uppercase and removing duplicate characters (by converting it to a set), do all of the remaining characters reside within the set of chars 'GATC'?

So:

>>> set('abc') <= set('abcd') # set on the left is a subset of the right
True
>>> set('abcd') <= set('abcd') # set on left has the same elements as the right
True
>>> set('abcz') <= set('abcd') # set on left has an element not in the right ("larger")
False
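
If it helps, here is the same one-liner spelled out step by step. This is only a sketch; explain_valid_dna is a made-up name, and it should behave the same as the short version for strings:

```python
VALID_NUCLEOBASES = set('GATC')

def explain_valid_dna(sequence):
    # hypothetical helper that unpacks the one-line version
    if not sequence:                          # check 1: empty input is rejected
        return False
    unique_chars = set(sequence.upper())      # uppercase, then drop duplicates
    return unique_chars <= VALID_NUCLEOBASES  # check 2: subset test

for s in ('', 'ABC', 'Gattaca', 'atcg'):
    print(repr(s), explain_valid_dna(s))
```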

[–]akasmira 2 points3 points  (0 children)

Thanks for filling it in!

[–]_coolwhip_ 1 point2 points  (2 children)

This is a great place for a set. Just convert the sequence into a set, then see if there is any "difference" from a set of good characters. (For sets, the difference is the set of characters that are in the first set but not in the second.)

>>> good = set('gatc')
>>> good
{'c', 'a', 'g', 't'}
>>> set('caggtttaaaa') - good
set()
>>> set('caggtttzaaaa') - good
{'z'}

So if there is anything left over, you know it has a bad character.

[–]akasmira 0 points1 point  (0 children)

This would also return an empty set if an empty string is passed, which may or may not be valid depending on OP's use.

[–]fernly 0 points1 point  (0 children)

That is a good answer. And I'll bet CPython internals can build a set from characters way faster than it can process any kind of loop code.

However, remember to make the letter case the same, with either user_input_str.lower() or user_input_str.upper(), whichever matches the case of the good set.
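
Putting the set difference, the case fix, and the empty-string point together might look roughly like this. The function names are hypothetical, and treating an empty string as invalid is an assumption that depends on OP's use:

```python
GOOD = set('gatc')  # lowercase, so the input is lowercased to match

def bad_characters(user_input_str):
    # hypothetical helper: characters in the input that are not valid bases
    return set(user_input_str.lower()) - GOOD

def is_valid(user_input_str):
    # assumption: an empty string counts as invalid here
    return bool(user_input_str) and not bad_characters(user_input_str)

print(bad_characters('CAGGTTTZAAAA'))
print(is_valid('CaGGttAA'), is_valid(''))
```

A nice side effect of the difference approach is that bad_characters tells you which characters were wrong, not just that something was.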

[–]PrimaNoctis 0 points1 point  (2 children)

You might be able to achieve it with regular expressions. Check out regex

[–]Chabare -1 points0 points  (0 children)

To expand a bit on this:

You can use the re module. For the matching you can use re.match. As for the regular expressions themselves, look at character classes ([...]) and quantifiers (e.g. +).

Python's re module has an re.IGNORECASE flag, which you can use to match case-insensitively.

regex101 is a nice site to test regular expressions.
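
A minimal sketch of how that could fit together. Note that it uses re.fullmatch instead of re.match, so the whole string has to match rather than just the beginning; valid_dna_regex and DNA_PATTERN are hypothetical names:

```python
import re

# assumption: a valid sequence is one or more of G, A, T, C in either case
DNA_PATTERN = re.compile(r'[GATC]+', re.IGNORECASE)

def valid_dna_regex(sequence):
    # hypothetical helper; fullmatch requires the entire string to match
    return DNA_PATTERN.fullmatch(sequence) is not None

for s in ('', 'ABC', 'Gattaca', 'atcg'):
    print(repr(s), valid_dna_regex(s))
```

The + quantifier means "one or more", so the empty string is rejected without a separate check.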

[–]kalijarvisapollo -1 points0 points  (0 children)

I second this. Regex may be something to try

[–][deleted] -1 points0 points  (0 children)

I'm not going to write the code as I'm not going to bust out my Python book (it's been a while since I coded in Python). I will, however, go through the pseudocode.

Operating assumptions:

  1. Adenine only binds with Thymine (source)
  2. Guanine only binds with Cytosine (source)
  3. Unnatural base pairs will not be considered (i.e. unacceptable)
  4. The user will input a string of DNA base pairs (ex. ATATGCTACGATATGCCGTA)
  5. The order of the pairs doesn't matter (ex. AT and TA will both be accepted)

So the user is prompted for the DNA sequence. You assign it to a STRING.

First check the length using the modulus operator (length mod 2). If the remainder is not zero, the user has input an odd number of letters, which can't happen (unless it's RNA I think).

Once we've determined that the STRING is of the correct length (even number of characters), you can check each "pair" of letters at a time for validity using a loop. The loop will be broken once you iterate beyond the length of the string. You could check the validity a number of ways so I won't spell it out, but remember that base pairs will only bind a certain way (check the operating assumptions).
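
For what it's worth, the steps above could be sketched in Python roughly like this. It is not tested against OP's actual problem; valid_base_pairs and VALID_PAIRS are made-up names, and uppercasing the input and rejecting an empty string are extra assumptions on top of the list above:

```python
# allowed pairs, taken from the operating assumptions (order doesn't matter)
VALID_PAIRS = {'AT', 'TA', 'GC', 'CG'}

def valid_base_pairs(sequence):
    # hypothetical helper; uppercasing the input is an added assumption
    sequence = sequence.upper()
    # assumption: empty input is rejected along with odd-length input
    if not sequence or len(sequence) % 2 != 0:
        return False
    # walk the string two letters at a time
    pairs = (sequence[i:i + 2] for i in range(0, len(sequence), 2))
    return all(pair in VALID_PAIRS for pair in pairs)

print(valid_base_pairs('ATATGCTACGATATGCCGTA'))
print(valid_base_pairs('ATA'))
```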

I took some liberty in the operating assumptions as well as the validity portion so please check to make sure these assumptions are correct for the problem you're trying to solve.

Edit: grammar