pythonpandasdataframecsvbioinformatics

Pandas dataframe is mangled when writing to csv


I have written a pipeline to send queries to uniprot, but am having a strange issue with one of the queries. I've put this into a small test case below.

I am getting the expected dataframe (df) structure (one row and 15 columns, one per field), but when I export this to CSV and open in excel it looks mangled. Specifically, instead of one row I get two, with the second starting partway through the 'Sequence' dataframe column (I've given more details in the bottom comment). This is for one of 99 queries, and the rest were all fine. I suspect this is an issue in my pd.to_csv call, but if anyone could give more details it would be much appreciated.

Thanks! Tim

import requests 
import pandas as pd
import io 

def queries_to_table(base, query, organism_id):
    rest_url = base + f'query=(({query})AND(organism_id:{organism_id}))'
    response = requests.get(rest_url)
    if response.status_code == 200:
        return pd.read_csv(io.StringIO(response.text), 
                           sep = '\t')
    else:
        raise ValueError(f'The uniprot API returned a status code of {response.status_code}.  '\
                         'This was not 200 as expected, which may reflect an issue '\
                         f'with your query:  {query}.\n\nSee here for more '\
                         'information: https://www.uniprot.org/help/rest-api-headers.  '\
                         f'Full url: {rest_url}')

size = 500
fields = 'accession,id,protein_name,gene_names,organism_name,'\
          'length,sequence,go_p,go_c,go,go_f,ft_topo_dom,'\
          'ft_transmem,cc_subcellular_location,ft_intramem'
url_base = f'https://rest.uniprot.org/uniprotkb/search?size={size}&'\
           f'fields={fields}&format=tsv&'
query = '(id:TITIN_HUMAN)'
organism_id = 9606

df = queries_to_table(url_base, query, organism_id)
#-> df looks fine - one row and 15 columns

pd.concat([df]).to_csv('test2_error.csv')
#-> opening in excel this is broken - it splits df['Sequence'] into two rows at 
#the junction between 'RLLANAECQEGQSVCFEIRVSGIPPPTLKWEKDG' and 
#'PLSLGPNIEIIHEGLDYYALHIRDTLPEDTGYY'. In df['Sequence'], this sequence is joined 
#by a 'q' (the below string covers the junction, and has the previously quoted substrings in capitals):
#tdstlrpmfkRLLANAECQEGQSVCFEIRVSGIPPPTLKWEKDGqPLSLGPNIEIIHEGLDYYALHIRDTLPEDTGYYrvtatntags

Solution

  • If I run your code and open the CSV file in a text editor, it has one header line (,Entry,Entry Name,Protein names,Gene Names,Organism,Length,Sequence,Gene Ontology (biological process),Gene Ontology (cellular component),Gene Ontology (GO),Gene Ontology (molecular function),Topological domain,Transmembrane,Subcellular location [CC],Intramembrane) and one line of data.

    Opening the file in Numbers (the Mac default spreadsheet software) also shows it just fine.

    IOW, I think Pandas is fine, and your code is fine ā€“ it's just Excel that's being Excel: a cell can't contain more than 32,767 characters, and the long value has 34,350 characters.

    (Preferably don't use Excel for purposes where you care about data integrity, in my opinion.)

    If you need some format where you can verify the data with human eyes, I might suggest HTML:

    with open('test2_error.html', 'w') as f:
        df.to_html(f)
        f.write("<style>td { word-break: break-all; }</style>")
    

    ā€“ the additional hack to add a <style> makes the long (gene?) sequence wrap onto multiple lines in the browser; otherwise you could just do df.to_html('test2_error.html').