I am interested in running UniProt's protein description model, ProtNLM, to add some extra descriptors to a large amount of protein sequence data I have.
They have a trial notebook available here.
Here is the full code of the notebook:
!python3 -m pip install -q -U tensorflow==2.8.2
!python3 -m pip install -q -U tensorflow-text==2.8.2
import tensorflow as tf
import tensorflow_text
import numpy as np
import re
import IPython.display
from absl import logging
tf.compat.v1.enable_eager_execution()
logging.set_verbosity(logging.ERROR) # Turn down tensorflow warnings
def print_markdown(string):
    IPython.display.display(IPython.display.Markdown(string))
! mkdir -p protnlm
! wget -nc https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/savedmodel__20221011__030822_1128_bs1.bm10.eos_cpu/saved_model.pb -P protnlm -q --no-check-certificate
! mkdir -p protnlm/variables
! wget -nc https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/savedmodel__20221011__030822_1128_bs1.bm10.eos_cpu/variables/variables.index -P protnlm/variables/ -q --no-check-certificate
! wget -nc https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/savedmodel__20221011__030822_1128_bs1.bm10.eos_cpu/variables/variables.data-00000-of-00001 -P protnlm/variables/ -q --no-check-certificate
imported = tf.saved_model.load(export_dir="protnlm")
infer = imported.signatures["serving_default"]
sequence = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP AVHASLDKFLASVSTVLTSKYR" #@param {type:"string"}
sequence = sequence.replace(' ', '')
names, scores = run_inference(sequence)
for name, score, i in zip(names, scores, range(len(names))):
    print_markdown(f"Prediction number {i+1}: **{name}** with a score of **{score:.03f}**")
The one change I have made is to update the tensorflow version on these lines:
!python3 -m pip install -q -U tensorflow==2.8.2
!python3 -m pip install -q -U tensorflow-text==2.8.2
to be >=2.8.2, since version 2.8.2 could not be installed.
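One thing that may be worth checking after loosening the pins is that tensorflow and tensorflow-text still resolve to the same release line, since tensorflow-text is built against a specific TensorFlow version. Below is a minimal sketch of such a check using the standard library's importlib.metadata; the helper names (major_minor, versions_compatible, installed_version) are hypothetical and not part of the notebook.

```python
from importlib import metadata


def major_minor(version):
    """Return the (major, minor) pair of a version string such as '2.8.2'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def versions_compatible(tf_version, text_version):
    """True when both version strings share the same major.minor pair."""
    return major_minor(tf_version) == major_minor(text_version)


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    tf_v = installed_version("tensorflow")
    text_v = installed_version("tensorflow-text")
    if tf_v is None or text_v is None:
        print("tensorflow and/or tensorflow-text is not installed")
    elif versions_compatible(tf_v, text_v):
        print(f"OK: tensorflow {tf_v} and tensorflow-text {text_v} match")
    else:
        print(f"Mismatch: tensorflow {tf_v} vs tensorflow-text {text_v}")
```

This only compares version numbers. It does not confirm that the saved model loads under the newer release.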
Now I can't run the model at all. The third cell, which takes the sequence as input, can't find the run_inference() function:
NameError
Traceback (most recent call last)
<ipython-input-5-4a7325a0e004> in <cell line: 0>()
8 sequence = sequence.replace(' ', '')
9
---> 10 names, scores = run_inference(sequence)
11
12 for name, score, i in zip(names, scores, range(len(names))):
NameError: name 'run_inference' is not defined
I didn't see this function defined in the notebook, so I assumed it was internal to TensorFlow (maybe only in version 2.8.2, though I couldn't find anything when searching the docs) or otherwise loaded with the model.
How can I get this script running again?
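I checked the notebook, and there is a Show code button below the heading 2. Load the model.
Clicking it reveals a hidden cell that defines run_inference(seq). That cell has to be executed (or its code copied into your script) before the cell that calls run_inference(), otherwise you get the NameError.
The hidden code is: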
#@markdown Please execute this cell by pressing the _Play_ button.
def query(seq):
    return f"[protein_name_in_english] <extra_id_0> [sequence] {seq}"

EC_NUMBER_REGEX = r'(\d+).([\d\-n]+).([\d\-n]+).([\d\-n]+)'

def run_inference(seq):
    labeling = infer(tf.constant([query(seq)]))
    names = labeling['output_0'][0].numpy().tolist()
    scores = labeling['output_1'][0].numpy().tolist()
    beam_size = len(names)
    names = [names[beam_size-1-i].decode().replace('<extra_id_0> ', '') for i in range(beam_size)]
    for i, name in enumerate(names):
        if re.match(EC_NUMBER_REGEX, name):
            names[i] = 'EC:' + name
    scores = [np.exp(scores[beam_size-1-i]) for i in range(beam_size)]
    return names, scores
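To see what run_inference() does with the model output without loading the model, the post-processing can be tried on made-up data. This is only a sketch: the byte strings and log-scores below are invented for illustration, math.exp stands in for np.exp, and postprocess is a hypothetical helper name. The logic mirrors the function above: reverse the beam order, strip the <extra_id_0> marker, prefix names that look like EC numbers with EC:, and exponentiate the scores.

```python
import math
import re

# Same pattern as in the notebook cell.
EC_NUMBER_REGEX = r'(\d+).([\d\-n]+).([\d\-n]+).([\d\-n]+)'


def postprocess(raw_names, raw_scores):
    """Mirror the post-processing in run_inference on plain Python lists."""
    beam_size = len(raw_names)
    # Reverse the order and strip the sentinel token from each decoded name.
    names = [raw_names[beam_size - 1 - i].decode().replace('<extra_id_0> ', '')
             for i in range(beam_size)]
    # Names that look like EC numbers get an 'EC:' prefix.
    for i, name in enumerate(names):
        if re.match(EC_NUMBER_REGEX, name):
            names[i] = 'EC:' + name
    # Reverse the scores the same way and exponentiate them.
    scores = [math.exp(raw_scores[beam_size - 1 - i]) for i in range(beam_size)]
    return names, scores


# Invented stand-ins for labeling['output_0'][0] and labeling['output_1'][0].
raw_names = [b'<extra_id_0> 1.1.1.1',
             b'<extra_id_0> Globin',
             b'<extra_id_0> Hemoglobin subunit alpha']
raw_scores = [-3.0, -1.5, -0.1]

names, scores = postprocess(raw_names, raw_scores)
for name, score in zip(names, scores):
    print(f"{name}: {score:.3f}")
```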
You can see the same code in issue:
I also found this notebook on GitHub, where the same code is visible:
google-research/protnlm/protnlm_use_model_for_inference_uniprot_2022_04.ipynb
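Since the goal is to annotate a lot of sequence data, a loop around run_inference() may be a reasonable next step once the hidden cell has been executed. The following is a minimal sketch, not something taken from the notebook: annotate_sequences and the dictionary input format are hypothetical, and run_inference is passed in as an argument so the helper can be tried with a stub before the model is loaded. It picks the highest-scoring name rather than assuming any particular ordering of the results.

```python
def annotate_sequences(sequences, run_inference_fn):
    """Return {sequence_id: (best_name, best_score)} for a dict of sequences.

    run_inference_fn is expected to behave like the notebook's run_inference:
    it takes a sequence string and returns (names, scores).
    """
    results = {}
    for seq_id, seq in sequences.items():
        # Remove spaces and any other whitespace (the notebook strips spaces).
        cleaned = "".join(seq.split())
        names, scores = run_inference_fn(cleaned)
        if not names:
            results[seq_id] = None
            continue
        # Pick the index with the highest score.
        best = max(range(len(names)), key=lambda i: scores[i])
        results[seq_id] = (names[best], scores[best])
    return results


def fake_run_inference(seq):
    """Stub standing in for run_inference; the returned values are invented."""
    return ["Name A", "Name B"], [0.2, 0.7]


demo = annotate_sequences({"protein_1": "MVLSPADKT NVKAAWGKV"}, fake_run_inference)
print(demo)
```

In the notebook, the call would be annotate_sequences(my_sequences, run_inference), where my_sequences is your own mapping of identifiers to sequences.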