I need to parse the PubChem database to search the compound pages for certain markers
(toxicity codes, to be exact; they look like 'H300') and then add the matching CIDs to the corresponding lists.
The database is here: https://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/XML/
But the xml.gz files there are so big that they can't be unpacked on my computer. So maybe there is a way to read these files directly on the PubChem server?
One way I would approach this is to use curl, gunzip, and maybe grep.
Example:
curl -ks https://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/XML/Compound_000000001_000500000.xml.gz -o - | gunzip | grep someString
This streams the file down and decompresses it on the fly, so you can grep for what you need in real time without ever storing the unpacked file on disk.
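
If you want the CIDs collected into lists rather than raw grep output, the same streaming idea can be sketched in Python. This is only a rough sketch under a few assumptions: that each record carries its CID in a `PC-CompoundType_id_cid` element which appears before the rest of the record, and that the codes you are after appear literally in the XML text of these files (worth checking on one file first). The names `scan_lines` and `scan_url` are made up for this example.

```python
import gzip
import io
import re
import urllib.request
from collections import defaultdict

# Assumption: this element holds the compound ID and opens each record.
CID_RE = re.compile(r"<PC-CompoundType_id_cid>(\d+)</PC-CompoundType_id_cid>")
# Codes such as H300: 'H' followed by three digits, matched as a whole word.
CODE_RE = re.compile(r"\bH\d{3}\b")


def scan_lines(lines, wanted):
    """Map each wanted code to the CIDs of the records whose text mentions it."""
    hits = defaultdict(list)
    cid = None
    for line in lines:
        m = CID_RE.search(line)
        if m:
            cid = int(m.group(1))  # a new record starts here
        if cid is None:
            continue  # header lines before the first record
        for code in CODE_RE.findall(line):
            # Skip codes we are not tracking and repeats within one record.
            if code in wanted and hits[code][-1:] != [cid]:
                hits[code].append(cid)
    return dict(hits)


def scan_url(url, wanted):
    """Stream a .xml.gz file over HTTP and scan it without saving it to disk."""
    with urllib.request.urlopen(url) as resp:
        with gzip.GzipFile(fileobj=resp) as gz:
            text = io.TextIOWrapper(gz, encoding="utf-8", errors="replace")
            return scan_lines(text, wanted)


if __name__ == "__main__":
    url = ("https://ftp.ncbi.nih.gov/pubchem/Compound/CURRENT-Full/XML/"
           "Compound_000000001_000500000.xml.gz")
    for code, cids in scan_url(url, {"H300"}).items():
        print(code, len(cids), "compounds")
```

As with the curl pipeline, the file is still downloaded in full; it is just never unpacked to disk, so memory use stays small apart from the growing CID lists.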