Tags: python, npm, npmjs

python: getting npm package data from a couchdb endpoint


I want to fetch the npm package metadata. I found this endpoint, which gives me all the metadata I need.

My plan is to select some specific keys and add that data to a database (I could also store it in a JSON file, but the data is huge). I made the following script to fetch the data:

import requests
import json
import sys

db = 'https://replicate.npmjs.com'

r = requests.get('https://replicate.npmjs.com/_all_docs', headers={"include_docs" : "true"})

for line in r.iter_lines():
    # filter out keep-alive new lines
    if line:
        print(line)
        decoded_line = line.decode('utf-8')
        print(json.loads(decoded_line))

Notice that I don't even get the full docs, yet the script gets stuck in what looks like an infinite loop. I think this is because the data is huge.

A look at the head of the output from https://replicate.npmjs.com/_all_docs gives me the following:

{"total_rows":1017703,"offset":0,"rows":[
{"id":"0","key":"0","value":{"rev":"1-5fbff37e48e1dd03ce6e7ffd17b98998"}},
{"id":"0-","key":"0-","value":{"rev":"1-420c8f16ec6584c7387b19ef401765a4"}},
{"id":"0----","key":"0----","value":{"rev":"1-55f4221814913f0e8f861b1aa42b02e4"}},
{"id":"0-1-project","key":"0-1-project","value":{"rev":"1-3cc19950252463c69a5e717d9f8f0f39"}},
{"id":"0-100","key":"0-100","value":{"rev":"1-c4f41a37883e1289f469d5de2a7b505a"}},
{"id":"0-24","key":"0-24","value":{"rev":"1-e595ec3444bc1039f10c062dd86912a2"}},
{"id":"0-60","key":"0-60","value":{"rev":"2-32c17752acfe363fa1be7dbd38212b0a"}},
{"id":"0-9","key":"0-9","value":{"rev":"1-898c1d89f7064e58f052ff492e94c753"}},
{"id":"0-_-0","key":"0-_-0","value":{"rev":"1-d47c142e9460c815c19c4ed3355d648d"}},
{"id":"0.","key":"0.","value":{"rev":"1-11c33605f2e3fd88b5416106fcdbb435"}},
{"id":"0.0","key":"0.0","value":{"rev":"1-5e541d4358c255cbcdba501f45a66e82"}},
{"id":"0.0.1","key":"0.0.1","value":{"rev":"1-ce856c27d0e16438a5849a97f8e9671d"}},
{"id":"0.0.168","key":"0.0.168","value":{"rev":"1-96ab3047e57ca1573405d0c89dd7f3f2"}},
{"id":"0.0.250","key":"0.0.250","value":{"rev":"1-c07ad0ffb7e2dc51bfeae2838b8d8bd6"}}, 

Notice that all the documents start from the second line (i.e. they are elements of the "rows" array). My question is: how do I get only the values of the "rows" key (i.e. all the documents)? I found this repository, which serves a similar purpose, but I can't use or convert it, as I am a total beginner in JavaScript.


Solution

  • If stream=True isn't passed to get(), the whole response is downloaded into memory before the loop over the lines even starts. That is why the script appears to hang.

    Even with streaming there is the problem that the individual lines themselves are not valid JSON, so json.loads() on each line will fail. You'll need an incremental JSON parser like ijson for this. ijson in turn wants a file-like object, which isn't easily obtained from a requests.Response, so I will use urllib from the Python standard library here:

    #!/usr/bin/env python3
    from urllib.request import urlopen
    
    import ijson
    
    
    def main():
        with urlopen('https://replicate.npmjs.com/_all_docs') as json_file:
            # 'rows.item' yields each element of the top-level "rows" array
            # one at a time, without loading the whole response into memory.
            for row in ijson.items(json_file, 'rows.item'):
                print(row)
    
    
    if __name__ == '__main__':
        main()