python tensorflow bioinformatics

Cannot run ProtNLM from Uniprot with Tensorflow in colab notebook


I am interested in running UniProt's protein description model, ProtNLM, to add some bonus descriptors to a large batch of protein sequences I have.

They have a trial notebook available here.

Here is the full code of the notebook:

!python3 -m pip install -q -U tensorflow==2.8.2
!python3 -m pip install -q -U tensorflow-text==2.8.2
import tensorflow as tf
import tensorflow_text
import numpy as np
import re

import IPython.display
from absl import logging

tf.compat.v1.enable_eager_execution()

logging.set_verbosity(logging.ERROR)  # Turn down tensorflow warnings

def print_markdown(string):
  IPython.display.display(IPython.display.Markdown(string))

! mkdir -p protnlm

! wget -nc https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/savedmodel__20221011__030822_1128_bs1.bm10.eos_cpu/saved_model.pb -P protnlm -q --no-check-certificate
! mkdir -p protnlm/variables
! wget -nc https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/savedmodel__20221011__030822_1128_bs1.bm10.eos_cpu/variables/variables.index -P protnlm/variables/ -q --no-check-certificate
! wget -nc https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/savedmodel__20221011__030822_1128_bs1.bm10.eos_cpu/variables/variables.data-00000-of-00001 -P protnlm/variables/ -q --no-check-certificate

imported = tf.saved_model.load(export_dir="protnlm")
infer = imported.signatures["serving_default"]

sequence = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP AVHASLDKFLASVSTVLTSKYR" #@param {type:"string"}
sequence = sequence.replace(' ', '')

names, scores = run_inference(sequence)

for name, score, i in zip(names, scores, range(len(names))):
  print_markdown(f"Prediction number {i+1}: **{name}** with a score of **{score:.03f}**")

The one change I have made is to update the tensorflow version on these lines:

!python3 -m pip install -q -U tensorflow==2.8.2
!python3 -m pip install -q -U tensorflow-text==2.8.2

to `>=2.8.2`, since version 2.8.2 itself could no longer be installed.

Now I can't run the model at all. The third cell, which takes in the sequence, can't find the run_inference() function:

NameError                                 Traceback (most recent call last)

<ipython-input-5-4a7325a0e004> in <cell line: 0>()
      8 sequence = sequence.replace(' ', '')
      9 
---> 10 names, scores = run_inference(sequence)
     11 
     12 for name, score, i in zip(names, scores, range(len(names))):

NameError: name 'run_inference' is not defined

I didn't see this function defined in the notebook, so I assumed it was internal to TensorFlow (perhaps only in version 2.8.2, though I couldn't find anything when searching the docs), or otherwise loaded with the model.

How can I get this script running again?


Solution

  • I checked the notebook: there is a Show code button below section 2. Load the model,
    and clicking it reveals a hidden cell that defines run_inference(seq):

    #@markdown Please execute this cell by pressing the _Play_ button.
    
    def query(seq):
      return f"[protein_name_in_english] <extra_id_0> [sequence] {seq}"
    
    EC_NUMBER_REGEX = r'(\d+).([\d\-n]+).([\d\-n]+).([\d\-n]+)'
    
    def run_inference(seq):
      labeling = infer(tf.constant([query(seq)]))
      names = labeling['output_0'][0].numpy().tolist()
      scores = labeling['output_1'][0].numpy().tolist()
      beam_size = len(names)
      names = [names[beam_size-1-i].decode().replace('<extra_id_0> ', '') for i in range(beam_size)]
      for i, name in enumerate(names):
        if re.match(EC_NUMBER_REGEX, name):
          names[i] = 'EC:' + name
      scores = [np.exp(scores[beam_size-1-i]) for i in range(beam_size)]
      return names, scores
    

    You can see the same code in this issue:

    issue with the protnlm_use_model_for_inference_uniprot_2022_04.ipynb colab notebook · Issue #2073 · google-research/google-research


    I also found this notebook on GitHub, where the same code is visible:

    google-research/protnlm/protnlm_use_model_for_inference_uniprot_2022_04.ipynb
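To see what that hidden cell actually does without downloading the model, here is a minimal sketch of just its post-processing step, run against a mocked beam output instead of the real TF signature call (the mocked byte strings and log-scores are invented for illustration): the beam comes back in ascending-score order, so it is reversed, the `<extra_id_0> ` sentinel is stripped, EC numbers get an `EC:` prefix, and log-scores are exponentiated into probabilities.

```python
import re
import numpy as np

# Same regex as the notebook's hidden cell.
EC_NUMBER_REGEX = r'(\d+).([\d\-n]+).([\d\-n]+).([\d\-n]+)'

def postprocess(raw_names, raw_log_scores):
    """Mirror run_inference's post-processing on a mocked beam output."""
    beam_size = len(raw_names)
    # Beam results arrive worst-first, so reverse and strip the sentinel token.
    names = [raw_names[beam_size - 1 - i].decode().replace('<extra_id_0> ', '')
             for i in range(beam_size)]
    # Prefix anything that looks like an EC number.
    for i, name in enumerate(names):
        if re.match(EC_NUMBER_REGEX, name):
            names[i] = 'EC:' + name
    # Log-scores -> probabilities, also reversed to best-first.
    scores = [np.exp(raw_log_scores[beam_size - 1 - i]) for i in range(beam_size)]
    return names, scores

# Mocked model output: worst prediction first, best last, log-probabilities.
raw_names = [b'<extra_id_0> 1.1.1.1', b'<extra_id_0> Hemoglobin subunit alpha']
raw_scores = [-2.0, -0.1]

names, scores = postprocess(raw_names, raw_scores)
print(names)  # ['Hemoglobin subunit alpha', 'EC:1.1.1.1']
```

This is why the hidden cell must run before the inference cell: run_inference (and the infer signature it calls) only exist after it executes, which is exactly what the NameError was reporting.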