python pytest structure

Python project file structure with relative imports, or how to properly structure a project like the one described


I have been trying to solve bioinformatics problems from the Rosalind.info website, and I am now facing some trouble when I want to perform some simple testing.

My project is structured in the following way:

Rosalind-problems/
├─ bioinformatics_stronghold/
│  ├─ data/
│  ├─ modules/
│  │  ├─ __init__.py
│  │  ├─ read_fasta.py
│  ├─ CONS.py
│  ├─ IEV.py
├─ tests/
│  ├─ __init__.py
│  ├─ test_CONS.py
│  ├─ test_IEV.py

The goal here is to be able to test all the individual files (CONS.py, IEV.py etc.) in the bioinformatics stronghold folder. The problem that I have encountered however is:

See all the affected files below:

test_IEV.py

import pytest
from bioinformatics_stronghold.IEV import calculate_offspring

def test_calculate_offspring():
    assert calculate_offspring([1, 0, 0, 1, 0, 1]) == 3.5
    assert calculate_offspring([1, 1, 1, 1, 1, 1]) == 8.5

IEV.py

def calculate_offspring(input_list:list[int]) -> float:
    """This function will take an input list of non-negative integers no larger than 20,000. The function will then calculate the expected offspring showing the dominant phenotype.

    Args:
        input_list (list): Input a list of integers representing the number of couples

    Returns:
        float: The expected number of offspring
    """
    expected_dominant_offspring = 0
    
    # For all cases, it is assumed that each couple has exactly 2 offspring
    for index, count in enumerate(input_list):
        print("Index:", index, "   ", "Num couples:", count)
            
        # Case AA-AA, all offspring will be dominant phenotype
        if index == 0:
            expected_dominant_offspring += count * 2 * 1 
            
        # Case AA-Aa, all offspring will be dominant phenotype
        elif index == 1:
            expected_dominant_offspring += count * 2 * 1
            
        # Case AA-aa, all offspring will be dominant phenotype
        elif index == 2:
            expected_dominant_offspring += count * 2 * 1

        # Case Aa-Aa, 3 out of 4 offspring will be dominant phenotype
        elif index == 3:
            expected_dominant_offspring += count * (2 * (3/4))

        # Case Aa-aa, 2 out of 4 offspring will be dominant phenotype
        elif index == 4:
            expected_dominant_offspring += count * (2 * (2/4))

        # Case aa-aa, no offspring will be dominant phenotype
        elif index == 5:
            expected_dominant_offspring += count * 2 * 0

    print(expected_dominant_offspring)
    
    return expected_dominant_offspring

These two work just fine.

Now to the problematic files...

test_CONS.py

import pytest
from bioinformatics_stronghold.CONS import find_consensus_sequence


def test_find_consensus_sequence():
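    # The expected value must be a parenthesized tuple: in a bare
    # "assert x == a, b", everything after the comma is the assert message.
    assert find_consensus_sequence("tests\\data\\CONS_sample_data.fasta") == ([[5, 1, 0, 0, 5, 5, 0, 0], [0, 0, 1, 4, 2, 0, 6, 1], [1, 1, 6, 3, 0, 1, 0, 0], [1, 5, 0, 0, 0, 1, 1, 6]], ['A', 'T', 'G', 'C', 'A', 'A', 'C', 'T'])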

Adding the line from bioinformatics_stronghold.modules.read_fasta import read_fasta_file to the test file just gives me a ModuleNotFoundError. Adding . or .. to the from statement results in ImportError: attempted relative import with no known parent package.

CONS.py

from modules.read_fasta import read_fasta_file

def find_consensus_sequence(fasta_location):
    """ 
    This function will read a given fasta file and extract all sequences using the read_fasta.py module.
    The function will then create a profile matrix as well as a consensus sequence, both as lists.

    Args:
        fasta_location (str): The location of the fasta file as a string.

    Returns:
        profile_matrix (list[lists]): The profile matrix of all given sequences. 
        consensus_sequence (list): The consensus sequences of all given sequences.
    """
    
    fasta_content = read_fasta_file(fasta_location, debug=False)
    
    # Create a matrix with all sequences
    sequence_matrix = []
    for item in fasta_content:
        sequence_matrix.append(list(item.sequence))
    # print(sequence_matrix)
    
    # Create the empty profile matrix
    # [A, C, G, T]
    profile_matrix = [[0]*len(sequence_matrix[0]), [0]*len(sequence_matrix[0]), [0]*len(sequence_matrix[0]), [0]*len(sequence_matrix[0])]
    
    # print(profile_matrix)
    
    # Add to the nucleotide count depending on the sequence
    for sublist in sequence_matrix:
        for index, nucleotide in enumerate(sublist):
            if nucleotide == "A":
                profile_matrix[0][index] += 1
            if nucleotide == "C":
                profile_matrix[1][index] += 1
            if nucleotide == "G":
                profile_matrix[2][index] += 1
            if nucleotide == "T":
                profile_matrix[3][index] += 1
                
    # print(profile_matrix)

    consensus_sequence = []
    # NOTE: Ugly solution, but it seems to work. Quite ineffective, but not sure how to improve at this time.
    # For each position in the sequence, check which "letter" is larger than all other
    for index in range(len(profile_matrix[0])):
        
        if profile_matrix[0][index] > profile_matrix[1][index] and profile_matrix[0][index] > profile_matrix[2][index] and profile_matrix[0][index] > profile_matrix[3][index]:
            consensus_sequence.append("A")
            
        elif profile_matrix[1][index] > profile_matrix[0][index] and profile_matrix[1][index] > profile_matrix[2][index] and profile_matrix[1][index] > profile_matrix[3][index]:
            consensus_sequence.append("C")
            
        elif profile_matrix[2][index] > profile_matrix[0][index] and profile_matrix[2][index] > profile_matrix[1][index] and profile_matrix[2][index] > profile_matrix[3][index]:
            consensus_sequence.append("G")
            
        elif profile_matrix[3][index] > profile_matrix[0][index] and profile_matrix[3][index] > profile_matrix[1][index] and profile_matrix[3][index] > profile_matrix[2][index]:
            consensus_sequence.append("T")
    
    # print(consensus_sequence)
    
    return profile_matrix, consensus_sequence

test_CONS.py just won't work. The problem seems to be that the modules folder cannot be found.

Adding an __init__.py to the bioinformatics_stronghold folder does not solve this problem.

If I move the tests folder into the bioinformatics_stronghold folder, pytest just breaks with no apparent error messages and I cannot set up testing in VSCodium.

My question then is:


Solution

  • I think changing this should do it:

    CONS.py

    from .modules.read_fasta import read_fasta_file
    

    If that doesn't do it, there's probably some sort of import issue in read_fasta.py, and I'd encourage you to comment here with the full error traceback you're seeing, rather than just the error message.
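
As a rough check of the suggested change, the sketch below rebuilds a stripped-down copy of the layout in a temporary directory and imports CONS.py as part of the package. The read_fasta_file body is a hypothetical stand-in (the real read_fasta.py is not shown in the question), and the sketch assumes the project root is the directory that ends up on sys.path, which appears to be the case here since the IEV test already imports bioinformatics_stronghold successfully.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for read_fasta.py; it only echoes its argument.
READ_FASTA_STUB = '''
def read_fasta_file(fasta_location, debug=False):
    return [fasta_location]
'''

# CONS.py reduced to the relative import under discussion plus a thin wrapper.
CONS_STUB = '''
from .modules.read_fasta import read_fasta_file

def find_consensus_sequence(fasta_location):
    return read_fasta_file(fasta_location, debug=False)
'''


def build_project(root: Path) -> None:
    """Create bioinformatics_stronghold/ with a modules/ subpackage under root."""
    package = root / "bioinformatics_stronghold"
    (package / "modules").mkdir(parents=True)
    (package / "modules" / "__init__.py").write_text("")
    (package / "modules" / "read_fasta.py").write_text(READ_FASTA_STUB)
    (package / "CONS.py").write_text(CONS_STUB)


def import_cons_from_root(root: Path):
    """Import CONS as a submodule of the package, with root on sys.path."""
    sys.path.insert(0, str(root))
    try:
        importlib.invalidate_caches()  # the files were created a moment ago
        return importlib.import_module("bioinformatics_stronghold.CONS")
    finally:
        sys.path.remove(str(root))


with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp)
    build_project(project_root)
    cons = import_cons_from_root(project_root)
    # The parent package is known, so the leading dot can be resolved.
    parent_package = cons.__package__
    result = cons.find_consensus_sequence("sample.fasta")

print(parent_package)
print(result)
```

Running CONS.py directly as a script (python bioinformatics_stronghold/CONS.py) would still fail with the relative import, because a file run that way has no parent package. Running python -m bioinformatics_stronghold.CONS from the project root avoids that.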

    Note: Your naming conventions do not follow PEP 8 guidelines.

    Modules should have short, all-lowercase names. Underscores can be used in the module name if it improves readability. For example, CONS.py and IEV.py would become cons.py and iev.py.


    Edit: Here's an example of how to structure your project and make it callable.

    Rosalind-problems/
    ├─ bioinformatics_stronghold/
    │  ├─ data/
    │  ├─ modules/
    │  │  ├─ __init__.py
    │  │  ├─ read_fasta.py
    │  ├─ __main__.py
    │  ├─ CONS.py
    │  ├─ IEV.py
    ├─ tests/
    │  ├─ __init__.py
    │  ├─ test_CONS.py
    │  ├─ test_IEV.py
    

    __main__.py

    from .modules import read_fasta
    
    
    read_fasta.call_a_function()
    

    To execute this, run python -m bioinformatics_stronghold in the terminal from the Rosalind-problems directory. With a single main entry point, you can do all sorts of things, like accepting user input, adding an argparse interface, etc.

    More reading: https://docs.python.org/3/library/__main__.html
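
To make the argparse idea a little more concrete, here is a minimal sketch of what such a __main__.py could look like. The subcommand name iev and the helper names build_parser and main are assumptions made for illustration, and calculate_offspring is inlined in compact form so the snippet runs on its own; in the real package it would be imported with from .IEV import calculate_offspring.

```python
import argparse

# Expected dominant-phenotype offspring per couple (two children each), in the
# same genotype-pair order as IEV.py: AA-AA, AA-Aa, AA-aa, Aa-Aa, Aa-aa, aa-aa.
EXPECTED_PER_COUPLE = (2.0, 2.0, 2.0, 1.5, 1.0, 0.0)


def calculate_offspring(couples):
    """Compact stand-in for IEV.calculate_offspring (same arithmetic, no prints)."""
    return sum(count * expected for count, expected in zip(couples, EXPECTED_PER_COUPLE))


def build_parser():
    """Build a parser with one subcommand per problem (only 'iev' is sketched)."""
    parser = argparse.ArgumentParser(prog="bioinformatics_stronghold")
    subparsers = parser.add_subparsers(dest="problem", required=True)
    iev = subparsers.add_parser("iev", help="expected offspring with the dominant phenotype")
    iev.add_argument("couples", type=int, nargs=6, help="number of couples per genotype pairing")
    return parser


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and dispatch to the chosen problem."""
    args = build_parser().parse_args(argv)
    if args.problem == "iev":
        return calculate_offspring(args.couples)
    raise ValueError(f"unknown problem: {args.problem}")


# In a real __main__.py the last line would be print(main()), reading the
# arguments from the command line; an explicit list is passed here instead.
print(main(["iev", "1", "0", "0", "1", "0", "1"]))
```

With a file like this in place, the equivalent call from the project root would be python -m bioinformatics_stronghold iev 1 0 0 1 0 1.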