python bash pubchem

Parse a remote xml.gz file of a database without downloading


I need to parse the PubChem database to search for certain markers on compound pages (toxicity codes, to be exact; they look like 'H300') and then add the matching CIDs to the corresponding lists.

The database is here: https://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/XML/

But the xml.gz files there are so big that I can't unpack them on my computer. Is there a way to read these files directly on the PubChem server?


Solution

  • One way I would approach this is to use curl and gunzip and maybe grep:

    Example:

    curl -ks https://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/XML/Compound_000000001_000500000.xml.gz -o - | gunzip | grep someString
    

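    This streams the file and decompresses it on the fly, so grep can search the data as it arrives and the uncompressed file is never stored on disk. The compressed bytes are still transferred over the network; they are just never written to a local file.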
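Since the question is also tagged python, the same streaming idea can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: the names `find_matching_lines`, `scan_remote_file`, and `CODE_PATTERN` are made up for illustration, and the pattern assumes the codes always look like the letter H followed by exactly three digits, as in 'H300'.

```python
import gzip
import re
import urllib.request

# Assumption: codes look like 'H300', i.e. 'H' followed by exactly three digits.
CODE_PATTERN = re.compile(rb"\bH\d{3}\b")


def find_matching_lines(compressed_stream, pattern=CODE_PATTERN):
    """Yield (line_number, text) for each decompressed line matching pattern.

    compressed_stream is any binary file-like object holding gzip data;
    it is decompressed incrementally, one line at a time.
    """
    with gzip.GzipFile(fileobj=compressed_stream) as decompressed:
        for line_number, line in enumerate(decompressed, start=1):
            if pattern.search(line):
                yield line_number, line.decode("utf-8", errors="replace").strip()


def scan_remote_file(url, pattern=CODE_PATTERN):
    """Stream a remote .gz file and yield matching lines without saving it."""
    with urllib.request.urlopen(url) as response:
        yield from find_matching_lines(response, pattern)


# Hypothetical usage (performs a large network transfer, so it is left commented out):
# url = ("https://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/XML/"
#        "Compound_000000001_000500000.xml.gz")
# for line_number, text in scan_remote_file(url):
#     print(line_number, text)
```

Like the curl pipeline, this only reports matching lines. Mapping each hit back to a CID would additionally require tracking which compound record the line belongs to, which depends on the XML layout of the files and is not shown here.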