I have a similar problem to those described here and here. The code is as follows:
process q2_predict_dysbiosis {
    publishDir 'results', mode: 'copy'

    input:
    path abundance_file
    path species_abundance_file
    path stratified_pathways_table
    path unstratified_pathways_table

    output:
    path "${abundance_file.baseName}_q2pd.tsv"

    script:
    """
    #!/usr/bin/env python
    from q2_predict_dysbiosis import calculate_index
    import pandas as pd
    pd.set_option('display.max_rows', None)
    taxa = pd.read_csv("${species_abundance_file}", sep="\\t", index_col=0)
    paths_strat = pd.read_csv("${stratified_pathways_table}", sep="\\t", index_col=0)
    paths_unstrat = pd.read_csv("${unstratified_pathways_table}", sep="\\t", index_col=0)
    score_df = calculate_index(taxa, paths_strat, paths_unstrat)
    score_df.to_csv("${abundance_file.baseName}_q2pd.tsv", sep="\\t", float_format="%.2f")
    """
}
Obtained error:
Caused by:
Process `q2_predict_dysbiosis (1)` terminated with an error exit status (1)
Command executed:
#!/usr/bin/env python
from q2_predict_dysbiosis import calculate_index
import pandas as pd
pd.set_option('display.max_rows', None)
taxa = pd.read_csv("abundance1-taxonomy_table.txt", sep="\t", index_col=0)
paths_strat = pd.read_csv("pathways_stratified.txt", sep="\t", index_col=0)
paths_unstrat = pd.read_csv("pathways_unstratified.txt", sep="\t", index_col=0)
score_df = calculate_index(taxa, paths_strat, paths_unstrat)
score_df.to_csv("abundance1_q2pd.tsv", sep="\t", float_format="%.2f")
Command exit status:
1
Command output:
(empty)
Command error:
Traceback (most recent call last):
  File ".command.sh", line 3, in <module>
    from q2_predict_dysbiosis import calculate_index
ModuleNotFoundError: No module named 'q2_predict_dysbiosis'
I have followed the instructions in this link, but it still doesn't work. I would like to keep the code block like that, and not run a script.py file. I am using the code from this repository.
Thanks in advance!
UPDATE
To try to resolve the import error I have done the following:
Creating a bin/ directory in the same directory as script.nf. No results.
Changing the shebang declaration. No results.
q2_predict_dysbiosis is not installed (it has no installation instructions), but it runs locally. I think the problem is that Nextflow doesn't locate q2_predict_dysbiosis.py, even though it is in the ./bin directory.
The Python import system searches the directories listed in sys.path, in order, to locate packages and modules to import. That list is built from the following sources:
The directory containing the script being run (or the current working directory, i.e. $PWD, when Python is started interactively or with -c). In a Nextflow task this is the task's work directory, where .command.sh lives.
The PYTHONPATH environment variable: if set, it specifies additional directories for Python to search for packages and modules.
The standard library and the system-wide or virtual environment site-packages: these hold the packages that have been installed globally on the system or within a virtual environment.
You can also modify the sys.path list within your code to include additional directories.
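The effect of PYTHONPATH on this search can be checked outside Nextflow. The following is a minimal sketch, not part of the original pipeline: the module demo_mod and the helper try_import are made up for illustration. It writes a throwaway module to a temporary directory and tries to import it from a child interpreter started in a different directory, once without and once with PYTHONPATH set.

```python
import os
import subprocess
import sys
import tempfile


def try_import(module_dir, use_pythonpath):
    """Try to import the hypothetical demo_mod in a child interpreter.

    Returns a (exit_code, stdout) tuple.
    """
    # Start from the current environment, minus any inherited PYTHONPATH.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONPATH"}
    if use_pythonpath:
        env["PYTHONPATH"] = module_dir
    # Run the child from an unrelated directory so the module is not
    # found by accident through the working directory.
    with tempfile.TemporaryDirectory() as elsewhere:
        proc = subprocess.run(
            [sys.executable, "-c", "import demo_mod; print(demo_mod.VALUE)"],
            cwd=elsewhere,
            env=env,
            capture_output=True,
            text=True,
        )
    return proc.returncode, proc.stdout.strip()


with tempfile.TemporaryDirectory() as module_dir:
    with open(os.path.join(module_dir, "demo_mod.py"), "w") as fh:
        fh.write("VALUE = 42\n")
    without_path = try_import(module_dir, use_pythonpath=False)
    with_path = try_import(module_dir, use_pythonpath=True)

print(without_path)  # non-zero exit code: the child raised ModuleNotFoundError
print(with_path)     # (0, '42')
```

The failing case mirrors the error in the question: the module exists on disk, but none of the directories in the child's sys.path contains it.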
A quick solution is to simply set the PYTHONPATH environment variable using the env scope in your nextflow.config. For example, with q2_predict_dysbiosis.py in a folder called packages in the root directory of your project repository (i.e. the directory where the main.nf script is located):
env {
PYTHONPATH = "${projectDir}/packages"
}
Tested using main.nf:
process q2_predict_dysbiosis {
    debug true

    script:
    """
    #!/usr/bin/env python
    import sys
    print(sys.path)
    from q2_predict_dysbiosis import calculate_index
    assert 'q2_predict_dysbiosis' in sys.modules
    """
}

workflow {
    q2_predict_dysbiosis()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 24.10.0
Launching `main.nf` [grave_avogadro] DSL2 - revision: 2f0c31286e
executor > local (1)
[8f/50976f] q2_predict_dysbiosis [100%] 1 of 1 ✔
[
'/path/to/project/work/8f/50976fe453d54fd6e11b3501d4b05a',
'/path/to/project/packages',
'/usr/lib/python312.zip',
'/usr/lib/python3.12',
'/usr/lib/python3.12/lib-dynload',
'/usr/lib/python3.12/site-packages'
]
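If editing nextflow.config is not an option, modifying sys.path within the code, as mentioned earlier, achieves the same thing from inside the script block. The sketch below is only an illustration under assumptions: import_from_dir is a hypothetical helper, and demo_pkg_mod is a throwaway module standing in for q2_predict_dysbiosis.py. Inside a Nextflow script block, the directory would typically be interpolated, e.g. "${projectDir}/packages".

```python
import importlib
import os
import sys
import tempfile


def import_from_dir(module_name, directory):
    """Import module_name after putting directory at the front of sys.path."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    # Make sure the import machinery notices files created very recently.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


# Self-contained demo: the temporary directory plays the role of the
# "packages" folder, and demo_pkg_mod plays the role of the real module.
with tempfile.TemporaryDirectory() as packages_dir:
    with open(os.path.join(packages_dir, "demo_pkg_mod.py"), "w") as fh:
        fh.write("def calculate_index():\n    return 'ok'\n")
    mod = import_from_dir("demo_pkg_mod", packages_dir)
    result = mod.calculate_index()

print(result)  # ok
```

The env scope remains the tidier option, since it keeps path handling out of every script block.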
A better solution, though, is to refactor. Move your custom code into a separate file (e.g. your_script.py), place it in your bin directory and make it executable (chmod a+x bin/your_script.py). Also move q2_predict_dysbiosis.py into this directory or into a sub-directory called utils. I use the latter in my example below. Your directory structure might look like:
$ find .
.
./main.nf
./bin
./bin/utils
./bin/utils/q2_predict_dysbiosis.py
./bin/your_script.py
And your_script.py might look like the following, using argparse to provide a user-friendly command-line interface:
#!/usr/bin/env python
import argparse

import pandas as pd

from utils.q2_predict_dysbiosis import calculate_index

pd.set_option('display.max_rows', None)


def custom_help_formatter(prog):
    return argparse.HelpFormatter(prog, max_help_position=80)


def parse_args():
    parser = argparse.ArgumentParser(
        description="Calculate dysbiosis index using abundance and pathways tables.",
        formatter_class=custom_help_formatter,
    )
    parser.add_argument(
        "species_abundance_file",
        help="Path to the species abundance file",
    )
    parser.add_argument(
        "stratified_pathways_table",
        help="Path to the stratified pathways table file",
    )
    parser.add_argument(
        "unstratified_pathways_table",
        help="Path to the unstratified pathways table file",
    )
    parser.add_argument(
        "output_file",
        help="Path to the output file to save the results",
    )
    return parser.parse_args()


def main(
    species_abundance_file,
    stratified_pathways_table,
    unstratified_pathways_table,
    output_file
):
    taxa = pd.read_csv(species_abundance_file, sep="\t", index_col=0)
    paths_strat = pd.read_csv(stratified_pathways_table, sep="\t", index_col=0)
    paths_unstrat = pd.read_csv(unstratified_pathways_table, sep="\t", index_col=0)
    score_df = calculate_index(taxa, paths_strat, paths_unstrat)
    score_df.to_csv(output_file, sep="\t", float_format="%.2f")


if __name__ == "__main__":
    args = parse_args()
    main(
        args.species_abundance_file,
        args.stratified_pathways_table,
        args.unstratified_pathways_table,
        args.output_file,
    )
Tested using main.nf:
$ cat main.nf
process q2_predict_dysbiosis {
    debug true

    script:
    """
    your_script.py --help
    """
}

workflow {
    q2_predict_dysbiosis()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 24.10.0
Launching `main.nf` [peaceful_stonebraker] DSL2 - revision: fea21868c7
executor > local (1)
[88/538f31] q2_predict_dysbiosis [100%] 1 of 1 ✔
usage: your_script.py [-h] species_abundance_file stratified_pathways_table unstratified_pathways_table output_file
Calculate dysbiosis index using abundance and pathways tables.
positional arguments:
species_abundance_file Path to the species abundance file
stratified_pathways_table Path to the stratified pathways table file
unstratified_pathways_table Path to the unstratified pathways table file
output_file Path to the output file to save the results
options:
-h, --help show this help message and exit
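In a real process, the script block would pass the four staged file names as positional arguments instead of --help. As a rough sketch of how those arguments map onto the parser's attributes, the snippet below rebuilds only the argument definitions (without pandas or calculate_index) and parses a sample argument list. The file names are placeholders, not files from the original pipeline, and build_parser is a hypothetical helper.

```python
import argparse


def build_parser():
    """Rebuild the four positional arguments used by your_script.py."""
    parser = argparse.ArgumentParser(
        description="Calculate dysbiosis index using abundance and pathways tables.",
    )
    for name in (
        "species_abundance_file",
        "stratified_pathways_table",
        "unstratified_pathways_table",
        "output_file",
    ):
        parser.add_argument(name)
    return parser


# Placeholder file names, in the order the script expects them.
argv = ["taxa.tsv", "strat.tsv", "unstrat.tsv", "sample_q2pd.tsv"]
args = build_parser().parse_args(argv)

print(args.species_abundance_file)  # taxa.tsv
print(args.output_file)             # sample_q2pd.tsv
```

Because the arguments are positional, their order on the command line matters: swapping two file names would silently feed the wrong table to the wrong parameter.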
If your dependencies also require certain local files to run, place the required files into a sub-directory in your project repository. Declare these files in your workflow block (e.g. using data_dir = file("${projectDir}/data")) and add a matching entry to the input block of each process that needs them. If the names of the input files are hardcoded in your Python script, supply a string value to path (e.g. using path 'data') to ensure that Nextflow stages the files with the correct filename(s). Once the files are staged in the process working directory, Python should be able to find them. This assumes the path(s) in your Python script are relative and not absolute. If they are absolute paths, you will need to make them relative. A minimal example might look like:
process test_proc {
    debug true

    input:
    path 'data'

    script:
    """
    ls -1 data/{foo,bar,baz}.txt
    """
}

workflow {
    data_dir = "${projectDir}/data"
    test_proc( data_dir )
}
$ mkdir data
$ touch data/{foo,bar,baz}.txt
$ nextflow run main.nf
N E X T F L O W ~ version 24.10.0
Launching `main.nf` [prickly_nightingale] DSL2 - revision: 83d939e180
executor > local (1)
[ec/bf1f56] process > test_proc [100%] 1 of 1 ✔
data/bar.txt
data/baz.txt
data/foo.txt
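On the Python side, the only requirement is that file names are resolved relative to the task's working directory, where Nextflow staged the data directory. A minimal sketch of that idea follows. The helper staged_path is hypothetical and is not part of q2_predict_dysbiosis.

```python
from pathlib import Path


def staged_path(name, data_dir="data"):
    """Return the location of a staged file, relative to the working directory.

    Absolute names are rejected because they would bypass the files that
    Nextflow staged into the task's work directory.
    """
    candidate = Path(name)
    if candidate.is_absolute():
        raise ValueError(f"expected a relative path, got {name!r}")
    return Path(data_dir) / candidate


print(staged_path("foo.txt").as_posix())  # data/foo.txt
```

Code that opens staged_path("foo.txt") then works in any task directory, as long as the process declares path 'data' as an input.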