Error importing Python modules in a Nextflow script block

I have a similar problem to those described here and here. The code is as follows:


    process q2_predict_dysbiosis {
    publishDir 'results', mode: 'copy'
    
    input:
    path abundance_file
    path species_abundance_file
    path stratified_pathways_table
    path unstratified_pathways_table
    
    output:
    path "${abundance_file.baseName}_q2pd.tsv"
    
    script:
    """
    #!/usr/bin/env python
    
    from q2_predict_dysbiosis import calculate_index
    import pandas as pd
    
    pd.set_option('display.max_rows', None)
    
    taxa = pd.read_csv("${species_abundance_file}", sep="\\t", index_col=0)
    paths_strat = pd.read_csv("${stratified_pathways_table}", sep="\\t", index_col=0)
    paths_unstrat = pd.read_csv("${unstratified_pathways_table}", sep="\\t", index_col=0)
    
    score_df = calculate_index(taxa, paths_strat, paths_unstrat)
    score_df.to_csv("${abundance_file.baseName}_q2pd.tsv", sep="\\t", float_format="%.2f")
    """
    }

Obtained error:

Caused by:
  Process `q2_predict_dysbiosis (1)` terminated with an error exit status (1)


Command executed:

  #!/usr/bin/env python

  from q2_predict_dysbiosis import calculate_index
  import pandas as pd

  pd.set_option('display.max_rows', None)

  taxa = pd.read_csv("abundance1-taxonomy_table.txt", sep="\t", index_col=0)
  paths_strat = pd.read_csv("pathways_stratified.txt", sep="\t", index_col=0)
  paths_unstrat = pd.read_csv("pathways_unstratified.txt", sep="\t", index_col=0)

  score_df = calculate_index(taxa, paths_strat, paths_unstrat)
  score_df.to_csv("abundance1_q2pd.tsv", sep="\t", float_format="%.2f")

Command exit status:
  1

Command output:
  (empty)

Command error:
  Traceback (most recent call last):
    File ".command.sh", line 3, in <module>
      from q2_predict_dysbiosis import calculate_index
  ModuleNotFoundError: No module named 'q2_predict_dysbiosis'

I have followed the instructions in this link, but it still doesn't work. I would like to keep the Python code inline in the script block rather than run a separate script.py file. I am using the code from this repository.

Thanks in advance!

UPDATE

To try to resolve the import error I have done the following:

Created a bin/ directory in the same directory as script.nf. This did not help.
Changed the shebang line. This did not help.

q2_predict_dysbiosis is not installed (it has no installation instructions), but it runs locally. I think the problem is that Nextflow doesn't locate q2_predict_dysbiosis.py, even though it is in the ./bin directory.

Solution

To locate packages and modules to import, Python searches the directories in sys.path, in order. When a script is run, that list is built from the following sources:

The directory containing the script being run: For a Nextflow script block this is the task work directory, where .command.sh is written. (In an interactive session it is the current working directory, i.e. $PWD, instead.)
The PYTHONPATH environment variable: If set, this environment variable can specify additional directories for Python to search for packages and modules.
The installation defaults: These are the standard library and the site-packages directory holding the packages installed system-wide or within a virtual environment.

Because sys.path is an ordinary list, you can also modify it within your code to include additional directories.
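
The effect of PYTHONPATH on this lookup can be checked outside Nextflow. The following is a minimal sketch and not part of the original workflow: hypothetical_mod, can_import and demo are made-up names, and a temporary directory stands in for the folder that holds q2_predict_dysbiosis.py.

```python
import os
import subprocess
import sys
import tempfile


def can_import(module_name, pythonpath_dir, cwd):
    """Return True if a fresh interpreter started in `cwd` can import
    `module_name` when `pythonpath_dir` is supplied through PYTHONPATH."""
    env = dict(os.environ, PYTHONPATH=pythonpath_dir)
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        env=env,
        cwd=cwd,
        capture_output=True,
    )
    return result.returncode == 0


def demo():
    """Try the import with and without the module's directory on PYTHONPATH."""
    with tempfile.TemporaryDirectory() as packages, \
            tempfile.TemporaryDirectory() as elsewhere:
        # Hypothetical module, standing in for q2_predict_dysbiosis.py.
        with open(os.path.join(packages, "hypothetical_mod.py"), "w") as fh:
            fh.write("VALUE = 1\n")
        found = can_import("hypothetical_mod", packages, cwd=elsewhere)
        missing = can_import("hypothetical_mod", elsewhere, cwd=elsewhere)
    return found, missing


if __name__ == "__main__":
    print(demo())
```

The first call should succeed and the second should fail with the same ModuleNotFoundError as in the question, since the only difference between them is the directory placed on PYTHONPATH.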

A quick solution is to simply set the PYTHONPATH environment variable using the env scope in your nextflow.config. For example, with q2_predict_dysbiosis.py in a folder called packages in the root directory of your project repository (i.e. the directory where the main.nf script is located):

env {

    PYTHONPATH = "${projectDir}/packages"
}

Tested using main.nf:

process q2_predict_dysbiosis {

    debug true

    script:
    """
    #!/usr/bin/env python
    import sys
    print(sys.path)

    from q2_predict_dysbiosis import calculate_index

    assert 'q2_predict_dysbiosis' in sys.modules
    """
}

workflow {

    q2_predict_dysbiosis()
}

Results:

$ nextflow run main.nf 

 N E X T F L O W   ~  version 24.10.0

Launching `main.nf` [grave_avogadro] DSL2 - revision: 2f0c31286e

executor >  local (1)
[8f/50976f] q2_predict_dysbiosis [100%] 1 of 1 ✔
[
    '/path/to/project/work/8f/50976fe453d54fd6e11b3501d4b05a',
    '/path/to/project/packages',
    '/usr/lib/python312.zip',
    '/usr/lib/python3.12',
    '/usr/lib/python3.12/lib-dynload',
    '/usr/lib/python3.12/site-packages'
]
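
If editing nextflow.config is not an option, the directory can instead be added to sys.path from inside the script, before the import. The sketch below shows the idea in plain Python. import_from_dir is a hypothetical helper and hypothetical_mod2 a stand-in module; neither comes from the original code. Inside a Nextflow script block the directory would presumably be written as "${projectDir}/packages" so that Nextflow interpolates it, which is an assumption I have not tested here.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_dir(module_name, directory):
    """Import `module_name` after putting `directory` at the front of sys.path."""
    directory = str(Path(directory).resolve())
    if directory not in sys.path:
        sys.path.insert(0, directory)
    # Make sure the import system notices files created after startup.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


def demo():
    """Create a stand-in module in a temporary directory and import it."""
    with tempfile.TemporaryDirectory() as packages:
        # Hypothetical module, standing in for q2_predict_dysbiosis.py.
        Path(packages, "hypothetical_mod2.py").write_text("VALUE = 42\n")
        module = import_from_dir("hypothetical_mod2", packages)
        return module.VALUE
```

This keeps everything in the script block, at the cost of hardcoding a search path in the code rather than in the configuration.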

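A better solution, though, is to refactor. Move your custom code into a separate file (e.g. your_script.py), place it in your bin directory and make it executable (chmod a+x bin/your_script.py). Nextflow adds the project's bin directory to the PATH of every task, and Python puts the directory of the running script first on sys.path, so the script can import modules stored next to it. Also move q2_predict_dysbiosis.py into this directory or into a sub-directory called utils. I use the latter in my example below. Your directory structure might look like: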

$ find .
.
./main.nf
./bin
./bin/utils
./bin/utils/q2_predict_dysbiosis.py
./bin/your_script.py

And your_script.py might look like the following using argparse to provide a user-friendly command-line interface:

#!/usr/bin/env python

import argparse
import pandas as pd

from utils.q2_predict_dysbiosis import calculate_index

pd.set_option('display.max_rows', None)

def custom_help_formatter(prog):
    return argparse.HelpFormatter(prog, max_help_position=80)

def parse_args():
    parser = argparse.ArgumentParser(
        description="Calculate dysbiosis index using abundance and pathways tables.",
        formatter_class=custom_help_formatter,
    )

    parser.add_argument(
        "species_abundance_file",
        help="Path to the species abundance file",
    )
    parser.add_argument(
        "stratified_pathways_table",
        help="Path to the stratified pathways table file",
    )
    parser.add_argument(
        "unstratified_pathways_table",
        help="Path to the unstratified pathways table file",
    )
    parser.add_argument(
        "output_file",
        help="Path to the output file to save the results",
    )

    return parser.parse_args()

def main(
    species_abundance_file,
    stratified_pathways_table,
    unstratified_pathways_table,
    output_file
):
    taxa = pd.read_csv(species_abundance_file, sep="\t", index_col=0)
    paths_strat = pd.read_csv(stratified_pathways_table, sep="\t", index_col=0)
    paths_unstrat = pd.read_csv(unstratified_pathways_table, sep="\t", index_col=0)
    
    score_df = calculate_index(taxa, paths_strat, paths_unstrat)
    score_df.to_csv(output_file, sep="\t", float_format="%.2f")

if __name__ == "__main__":
    args = parse_args()

    main(
        args.species_abundance_file,
        args.stratified_pathways_table,
        args.unstratified_pathways_table,
        args.output_file,
    )

Tested using main.nf:

$ cat main.nf 
process q2_predict_dysbiosis {

    debug true

    script:
    """
    your_script.py --help
    """
}

workflow {

    q2_predict_dysbiosis()
}

Results:

$ nextflow run main.nf 

 N E X T F L O W   ~  version 24.10.0

Launching `main.nf` [peaceful_stonebraker] DSL2 - revision: fea21868c7

executor >  local (1)
[88/538f31] q2_predict_dysbiosis [100%] 1 of 1 ✔
usage: your_script.py [-h] species_abundance_file stratified_pathways_table unstratified_pathways_table output_file

Calculate dysbiosis index using abundance and pathways tables.

positional arguments:
  species_abundance_file       Path to the species abundance file
  stratified_pathways_table    Path to the stratified pathways table file
  unstratified_pathways_table  Path to the unstratified pathways table file
  output_file                  Path to the output file to save the results

options:
  -h, --help                   show this help message and exit

If your dependencies also require certain local files to run, place the required files into a sub-directory in your project repository. Declare these files in your workflow block (e.g. using data_dir = file("${projectDir}/data")) and add matching entries to the input block of each process that needs them. If the names of the input files are hardcoded in your Python script, supply a string value to path (e.g. path 'data') so that Nextflow stages the files under exactly that name. Once the files are staged in the process working directory, Python should be able to find them, provided the paths in your Python script are relative rather than absolute. If they are absolute, you will need to make them relative. A minimal example might look like:

process test_proc {

    debug true

    input:
    path 'data'

    script:
    """
    ls -1 data/{foo,bar,baz}.txt
    """
}

workflow {

    data_dir = "${projectDir}/data"

    test_proc( data_dir )
}

Results:

$ mkdir data
$ touch data/{foo,bar,baz}.txt
$ nextflow run main.nf 

 N E X T F L O W   ~  version 24.10.0

Launching `main.nf` [prickly_nightingale] DSL2 - revision: 83d939e180

executor >  local (1)
[ec/bf1f56] process > test_proc [100%] 1 of 1 ✔
data/bar.txt
data/baz.txt
data/foo.txt
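
On the Python side, the staged directory is then reachable through a relative path, because each task runs with its work directory as the current working directory. Below is a minimal sketch of a pre-flight check, assuming the data/{foo,bar,baz}.txt layout from the example above. missing_inputs and demo are hypothetical helpers, not part of q2_predict_dysbiosis.

```python
import os
import tempfile
from pathlib import Path


def missing_inputs(names, data_dir="data"):
    """Return the entries of `names` that are not files under `data_dir`.

    `data_dir` is deliberately relative, so it is resolved against the
    current working directory (the task work directory under Nextflow).
    """
    base = Path(data_dir)
    return [name for name in names if not (base / name).is_file()]


def demo():
    """Simulate a work directory in which only two of three files are staged."""
    previous = os.getcwd()
    with tempfile.TemporaryDirectory() as work_dir:
        os.chdir(work_dir)
        try:
            Path("data").mkdir()
            for name in ("foo.txt", "bar.txt"):
                Path("data", name).touch()
            return missing_inputs(["foo.txt", "bar.txt", "baz.txt"])
        finally:
            # Leave the temporary directory before it is removed.
            os.chdir(previous)
```

Calling such a check at the top of the script would turn a missing staged file into an explicit message rather than a failure somewhere inside the dependency.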