pythonrecursiondna-sequence

Valid characters in a String


I am given a string and have to return False if there is one or more invalid characters, otherwise True. The caveat is that I can only built-in functions and str operations (for example: in, +, indexing, len) and recursion. What I have so far is not working:

def is_valid_sequence(dna):
""" (str) -> bool

Return True if and only if the DNA sequence is valid
(A, T, C, and G)
:param dna: string sequence
:return: true or false
>>> is_valid_sequence('ATTAC')
True
>>> is_valid_sequence('FBSSS')
False
"""
valid_seq = 0

if len(dna) == 1 and dna in ['A', 'T', 'C', 'G']:
        valid_seq += 1
else:
    return is_valid_sequence(dna[1:])

Obviously, this code doesn't work because of recursion and adding 1 to the valid_seq variable just gets wiped after the next recursive iteration.


Solution

  • For DNA sequences of a small size (around a thousand characters), this is a practical implementation

    def is_valid_sequence (dna):
      # base case
      if len (dna) == 0:
        return True
      # check if first character is valid
      elif dna [0] not in ['A', 'T', 'C', 'G']:
        return False
      # otherwise, recurse on the rest of the characters
      else:
        return is_valid_sequence (dna [1:])
    
    print (is_valid_sequence ('AATGCGA'))  # True
    print (is_valid_sequence ('AATXGCGA')) # False
    

    A word of caution

    Be careful with recursion in python tho – a long dna string will cause a stack overflow. Attempting to validate even a sequence this "large" would fail

    GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA

    You can easily avoid this by implementing is_valid_sequence using a Clojure-style loop/recur mechanism for constant-space recursion

    def recur (*values):
      return (recur, values)
    
    def loop (f):
      acc = f ()
      while type (acc) is tuple and acc [0] is recur:
        acc = f (*acc [1])
      return acc
    
    def is_valid_sequence (dna):
      # stack-safe loop/recur implementation
      # initialize loop state variable `s` to value of `dna`
      return loop (lambda s = dna:
        # base case
        True if len (s) == 0 else
        # check if first character valid
        False if s [0] not in ['A', 'T', 'C', 'G'] else
        # else recurse on the rest of the characters
        recur (s [1:]))
    
    # does not overflow stack
    print (is_valid_sequence ('GATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA'))
    # True
    

    Continued improvements

    The loop/recur implementation exposes an additional opportunity to tune the performance of our function. Slicing the string as we have done with dna[0] and dna[1:] creates new strings in memory; this was only necessary because of the recursion API used in the first function we wrote

    The loop/recur interface allows us to use whatever data type is suitable for computing our output – in this case, a simple integer index will do. Lexical scoping takes care of the rest – dna is accessible within our lambda and does not need to do dna[1:] slicing which will save a lot of time/space for large inputs

    def is_valid_sequence (dna):
      # only create this array once
      valid_chars = ['A', 'T', 'C', 'G']
      # initialize loop state variable `i` with 0
      return loop (lambda i = 0:
        True if len (dna) == i else
        False if dna [i] not in valid_chars else
        recur (i + 1))
    

    Python and the Unruly Lambda

    Notice how we were forced to use a pure expression inside the lambda instead of the traditional if/elif/else statement syntax – this is not a problem for simple programs, but more complex programs might be difficult to express this way in python.

    This works identically to the program above but uses a plain old function instead of a lambda

    def is_valid_sequence (dna):
      valid_chars = ['A', 'T', 'C', 'G']
      # plain old function; replaces lambda
      def myloop (i = 0):
        if len (dna) == 0:
          return True
        elif dna [i] not in valid_chars:
          return False
        else:
          return recur (i + 1)
      # use plain old function instead of lambda
      return loop (myloop)