Tags: node.js, npm

List all public packages in the npm registry


For research purposes, I'd like to list all the packages that are available on npm. How can I do this?

Some old docs at https://github.com/npm/registry/blob/master/docs/REGISTRY-API.md#get-all mention an /-/all endpoint that presumably once worked, but http://registry.npmjs.org/-/all now just returns {"message":"deprecated"}.


Solution

  • Step 1: Getting a list of all package names

    If you're happy to use data that's up to 24 hours out of date and provided by a third party, you can use all-the-package-names. This npm package, updated daily, literally just exports a giant flat array of package names. (The same org and maintainer also publish all-the-package-repos, which additionally has links to each package's GitHub repo. Their other packages for analysing the npm registry have been unmaintained and dead for years as far as I can tell, which is sad.)
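    For illustration, consuming it looks something like this (a sketch, not taken from the package's docs; as far as I recall, its default export is simply the array itself, so check the README if this doesn't match):

    // Hypothetical usage sketch of all-the-package-names (CommonJS). The
    // package's main entry is, to my knowledge, just a giant JSON array of
    // package name strings.
    const allPackageNames = require('all-the-package-names');

    console.log(allPackageNames.length);             // millions of names
    console.log(allPackageNames.includes('lodash')); // true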

    If you want to do it yourself, that is possible too. At https://replicate.npmjs.com is an API that is kind of like a CouchDB API (once, it really was a CouchDB API) but with loads of stuff disabled. We need the _all_docs endpoint. It was once possible (back when I wrote the original version of this answer) to simply hit this endpoint with no query string and get back a single giant response with all the packages in the registry, but no longer; if you try that today, the API will terminate the connection well before you get a complete list of packages. Instead, we need to paginate using the limit and startkey parameters. Here is a simple Node.js script to do that:

    import fs from 'node:fs';
    
    // Max value according to https://github.com/orgs/community/discussions/152515
    const LIMIT = 10000;
    
    const initialResponse = await fetch(
      `https://replicate.npmjs.com/registry/_all_docs?limit=${LIMIT}`,
    
      // Header below only needed temporarily until May 29th 2025.
      // See https://github.com/orgs/community/discussions/152515
      // Essentially Microsoft/GitHub/npm is doing an API migration (mostly
      // disabling stuff that used to work), and is periodically browning out the
      // old, more fully-featured API. This header opts into the new,
      // less-fully-featured API that at least isn't offline half the time.
      // From May 29th, this will be the default (and the only option).
      {headers:{'npm-replication-opt-in': 'true'}}
    );
    const result = (await initialResponse.json()).rows;
    console.log(`Fetched initial ${result.length} packages`);
    
    while (true) {
      const lastKey = result[result.length - 1].key;
      const params = new URLSearchParams({
        limit: LIMIT,
        startkey: JSON.stringify(lastKey),
      });
      const resp = await fetch(
        `https://replicate.npmjs.com/registry/_all_docs?${params}`,
        // Again, remove this line after May 29th 2025:
        {headers:{'npm-replication-opt-in': 'true'}}
      )
      const respJson = await resp.json();
      const respRows = respJson.rows;
    
      // The startkey parameter is inclusive, so the first row we get should be the
      // same as the last row from the previous page. If the replicate.npmjs.com
      // API were a real CouchDB API, we could pass skip=1 to skip that duplicate
      // row, but it isn't, and doesn't support that parameter, so we just need to
      // ignore the first row. We sanity-check it's as expected, first:
      if (respRows[0].key !== lastKey) {
        throw new Error(`Expected first row of request to have key ${lastKey} but it was ${respRows[0].key}`);
      }
    
      if (respRows.length === 1) {
        // We're done! There are no more packages.
        break;
      }
    
      for (const row of respRows.slice(1)) {
        result.push(row);
      }
    
      console.log(
        `Reached offset ${respJson.offset} of ${respJson.total_rows} total rows.`
      )
    }
    
    console.log("Finished! Writing to allpackages.json ...");
    fs.writeFileSync("allpackages.json", JSON.stringify(result));
    

    The JSON file output by the script above will be an array of objects like this, where the id and key are both the package name:

    {"id":"lodash","key":"lodash","value":{"rev":"634-9273a19c245f088da22a9e4acbabc213"}},
    

    At the time of rewriting this answer (14th May 2025), there are 3,542,583 packages in that response and, on my fibre internet in the UK, each request for a batch of 10,000 takes around 5 seconds, for a total download time of around half an hour. The resulting file is 377MB.
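    If all you want from this is the bare package names, a quick post-processing step (a sketch, operating on the allpackages.json file produced above) boils it down considerably:

    import fs from 'node:fs';

    // Each row is of the form {"id": ..., "key": ..., "value": ...}, with the
    // package name in both id and key; keep just the key.
    const rows = JSON.parse(fs.readFileSync('allpackages.json').toString());
    fs.writeFileSync(
      'allpackagenames.json',
      JSON.stringify(rows.map(row => row.key)),
    );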

  • Step 2: Getting more data about packages

    The 2025 rework of the API removes the ability to pass the include_docs parameter to the _all_docs API in order to retrieve metadata about packages in bulk. Instead, for most metadata, you have to make one request per package. For instance, for metadata about react, like its description and release history, you'd hit https://registry.npmjs.org/react. There are some unofficial docs about the https://registry.npmjs.org API at https://www.edoardoscibona.com/exploring-the-npm-registry-api.
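    For example, a single-package metadata request looks like this (a sketch; the fields below are ones I'd expect based on the unofficial docs linked above, so verify them against a real response):

    // Fetch the metadata document for one package. Well-known fields include
    // description, dist-tags (e.g. the current latest version) and versions
    // (one entry per published release).
    const resp = await fetch('https://registry.npmjs.org/react');
    const metadata = await resp.json();
    console.log(metadata.description);
    console.log(metadata['dist-tags'].latest);
    console.log(Object.keys(metadata.versions).length, 'published versions');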

    Yes, this does imply that to download metadata about all packages in the registry, you need to make (as of May 2025) 3.5 million requests. Below is a script to do it. It's somewhat crude (and doesn't handle every imaginable error scenario gracefully), but probably good enough:

    import fs from "node:fs";
    import path from "node:path";
    
    const MAX_SIMULTANEOUS_REQUESTS = 50;
    
    // Reverse so that pop() below consumes packages starting from the front
    // of the original (alphabetical) list.
    const packages = JSON.parse(fs.readFileSync("allpackages.json").toString()).reverse();
    
    async function startFetcherThread() {
      while (packages.length > 0) {
        // Log progress every 1000 packages. The modulus is offset by
        // MAX_SIMULTANEOUS_REQUESTS to compensate for the requests already in
        // flight, so the count we log lands on a round thousand:
        if (packages.length % 1000 === 1000 - MAX_SIMULTANEOUS_REQUESTS) {
          console.log(new Date(), `${packages.length + MAX_SIMULTANEOUS_REQUESTS} packages to go`);
        }
    
        const pkg = packages.pop();
        const packageName = pkg.key;
    
        // Guard against path traversal, since the package name gets used as a
        // file path below:
        if (packageName.split('/').includes('.') || packageName.split('/').includes('..')) {
          console.log(`Skipping ${packageName} because it is playing silly buggers in its package name`);
          continue;
        }
        const outputPath = "metadata/" + packageName;
        if (fs.existsSync(outputPath)) {
          // Presumably we downloaded this on a previous run that we aborted. Skip.
          continue;
        }
    
        let resp;
        try {
          resp = await fetch(`https://registry.npmjs.org/${packageName}`);
        } catch (e) {
          console.error(`Failed to fetch ${packageName}`);
          continue;
        }
        if (resp.status !== 200) {
          console.error(`Got ${resp.status} when trying to get ${packageName}`);
          continue;
        }
        const respJson = await resp.json();
        await fs.promises.mkdir(path.dirname(outputPath), {recursive: true});
        await fs.promises.writeFile(outputPath, JSON.stringify(respJson));
      }
    }
    
    // Kick off the "threads". (Not real threads, of course; just
    // MAX_SIMULTANEOUS_REQUESTS concurrent async loops interleaved on the
    // event loop.)
    for (let i = 0; i < MAX_SIMULTANEOUS_REQUESTS; i++) {
      startFetcherThread();
    }
    

    I haven't tried other values of MAX_SIMULTANEOUS_REQUESTS, so I don't know whether 50 is optimal in any sense, nor whether a bigger number will run into a rate limit. I can say that when I ran this script on my machine, it chewed through 1000 packages every 11 seconds or so for the first few thousand, but later slowed to about 35 seconds per 1000 packages. You can therefore expect a total runtime of over a day.

    If you want download counts, for instance because you want to target some analysis at the top 100 or top 1000 most-downloaded packages, you can get those from an API sort-of-documented at https://github.com/npm/registry/blob/main/docs/download-counts.md. (Note that the entire repo of documentation is officially an archive, and much of what is in REGISTRY-API.md and REPLICATE-API.md is obsolete, but at the time of writing, the docs about download counts still appear to be correct and up to date.)
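    Per those docs, a point query for a single package looks like the sketch below (periods like last-day, last-week and last-month are supported):

    // Fetch the download count for lodash over the last month. The response
    // is of the form {downloads, start, end, package}.
    const resp = await fetch('https://api.npmjs.org/downloads/point/last-month/lodash');
    const { downloads, start, end } = await resp.json();
    console.log(`lodash: ${downloads} downloads between ${start} and ${end}`);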