javascript · web-scraping · next.js · cheerio

Unable to scrape artist data from Beatport using Cheerio in Next.js 14 Server Actions


I'm trying to scrape artist data from Beatport using Cheerio in a Next.js 14 Server Action. The goal is to search for an artist, select the first artist card in the results, and extract the artist's URL. However, my current implementation can't find the artist card, even though I can see it in the HTML when I inspect the page.

Here's the code I'm using:

"use server";

import fetch from "node-fetch";
import { load } from "cheerio";

interface BeatportArtist {
  name: string;
  beatportUrl: string;
  imageUrl: string;
}

const BASE_URL = "https://www.beatport.com";

export async function scrapeBeatportArtist(
  name: string
): Promise<BeatportArtist | null> {
  try {
    const searchUrl = `${BASE_URL}/search?q=${encodeURIComponent(name)}`;
    console.log(`Searching Beatport for artist: ${name}`);
    console.log(`Search URL: ${searchUrl}`);

    const searchResponse = await fetch(searchUrl);
    const searchHtml = await searchResponse.text();
    const $search = load(searchHtml);

    console.log('Search HTML loaded.');

    // Find the first div with the specific class
    const artistCard = $search("div.ArtistCard-style__Wrapper-sc-7ba2494f-10.gdlIrO.show-artist").first();
    console.log('Artist card:', artistCard.html());  // Log the HTML of artistCard

    if (!artistCard.length) {
      console.log(`No artist card found for artist: ${name}`);
      return null;
    }

    // Find the <a> within artistCard with the title corresponding to the artist's name
    const artistLink = artistCard.find(`a.artwork[title="${name}"]`).attr("href");

    console.log('Artist link:', artistLink);  // Log the artistLink

    if (!artistLink) {
      console.log(`No Beatport profile found for artist: ${name}`);
      return null;
    }

    const artistUrl = `${BASE_URL}${artistLink}`;
    console.log(`Found Beatport profile for artist ${name}: ${artistUrl}`);
    const artistResponse = await fetch(artistUrl);
    const artistHtml = await artistResponse.text();
    const $artist = load(artistHtml);

    const imageUrl = $artist(".artist-hero__image img").attr("src") || "";

    return {
      name,
      beatportUrl: artistUrl,
      imageUrl,
    };
  } catch (error) {
    console.error(`Error scraping Beatport for artist ${name}:`, error);
    return null;
  }
}

Issues:

The script logs "No artist card found for artist: [artist name]" even though I can see the artist card in the HTML when inspecting the page. I'm using the class ArtistCard-style__Wrapper-sc-7ba2494f-10.gdlIrO.show-artist to locate the artist card, and then I try to find the <a> tag with the class artwork and a title attribute matching the artist's name. Here's the card markup as it appears in dev tools:

<div class="ArtistCard-style__Wrapper-sc-7ba2494f-10 gdlIrO show-artist" data-testid="artist-card">
  <div class="ArtistCard-style__Meta-sc-7ba2494f-9 bcxGRv">
    <a title="Artist Name" class="artwork" href="/artist/artist-name/123456">
      <div class="ArtistCard-style__Overlay-sc-7ba2494f-7 kSaKRF"></div>
      <span class="ArtistCard-style__Name-sc-7ba2494f-5 derVIL">Artist Name</span>
      <div class="ArtistCard-style__ImageWrapper-sc-7ba2494f-8 hmTKKR">
        <img alt="Artist Name" src="artist-image-url.jpg" />
      </div>
    </a>
  </div>
</div>

Attempts to resolve:

I have confirmed that the artistCard.html() log outputs the expected HTML structure. I have tried using different selectors and checking the loaded HTML to ensure it matches the structure I'm searching for. What could be going wrong here? Any help or suggestions on how to correctly find and extract the artist's URL from the search results would be greatly appreciated.
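
For reference, the same selectors do match when the snippet above is loaded into Cheerio on its own, so the selector syntax itself doesn't seem to be the problem:

import { load } from "cheerio";

// A trimmed copy of the card markup shown above
const snippet = `
  <div class="ArtistCard-style__Wrapper-sc-7ba2494f-10 gdlIrO show-artist" data-testid="artist-card">
    <div class="ArtistCard-style__Meta-sc-7ba2494f-9 bcxGRv">
      <a title="Artist Name" class="artwork" href="/artist/artist-name/123456">
        <span class="ArtistCard-style__Name-sc-7ba2494f-5 derVIL">Artist Name</span>
      </a>
    </div>
  </div>`;

const $ = load(snippet);
const card = $("div.ArtistCard-style__Wrapper-sc-7ba2494f-10.gdlIrO.show-artist").first();
console.log(card.length); // => 1
console.log(card.find('a.artwork[title="Artist Name"]').attr("href")); // => "/artist/artist-name/123456"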


Solution

  • The artist cards are injected into the DOM by client-side JS after the page loads, so what you see in dev tools doesn't match the unhydrated HTML the server sends you. Luckily, the data you want is present in that base HTML, inside a JSON payload:

    <script id="__NEXT_DATA__" type="application/json">{"props":{"pageProps":...
    

    You can extract this JSON chunk and traverse it (I used a tool I wrote to do this) to find the info you want, then build the URL:

    const data = JSON.parse($("#__NEXT_DATA__").text());
    const {artist_id, artist_name} =
      data.props.pageProps.dehydratedState.queries[0].state.data.artists.data[0];
    const url = `https://www.beatport.com/artist/${artist_name}/${artist_id}`;
    console.log(url);
    

    Complete, runnable example:

    const cheerio = require("cheerio"); // ^1.0.0-rc.12
    
    const BASE_URL = "https://www.beatport.com";
    const name = "autechre";
    const searchUrl = `${BASE_URL}/search?q=${encodeURIComponent(name)}`;
    
    fetch(searchUrl)
      .then(res => {
        if (!res.ok) {
          throw Error(res.statusText);
        }
    
        return res.text();
      })
      .then(html => {
        const $ = cheerio.load(html);
        const data = JSON.parse($("#__NEXT_DATA__").text());
        const {artist_id, artist_name} =
          data.props.pageProps.dehydratedState.queries[0].state.data.artists.data[0];
        const url = `${BASE_URL}/artist/${artist_name}/${artist_id}`;
        console.log(url); // => https://www.beatport.com/artist/Autechre/21277
      })
      .catch(err => console.error(err));
    

    For multi-word artists, build the URL slug by splitting the name on spaces and joining the parts with a "-": artist_name.replace(/ +/g, "-") (see the sketch just below). If you run into anomalies, you might need to switch to Puppeteer to run the page's JS and extract the actual URL it builds dynamically, which doesn't appear to be available on the initial load.
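
    A minimal sketch of that slug step, reusing BASE_URL and the artist_id/artist_name fields pulled from __NEXT_DATA__ above:

    // Build the artist URL from the fields extracted from __NEXT_DATA__
    const slug = artist_name.replace(/ +/g, "-"); // e.g. "Boards of Canada" -> "Boards-of-Canada"
    const url = `${BASE_URL}/artist/${slug}/${artist_id}`;
    console.log(url);

    And here's the Puppeteer starter: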

    const puppeteer = require("puppeteer"); // ^22.10.0
    
    // Same searchUrl variable as above
    
    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      await page.goto(searchUrl, {waitUntil: "domcontentloaded"});
      const a = await page.waitForSelector('[href^="/artist/"]');
      console.log(await a.evaluate(el => el.href));
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
    

    It's a good idea to block any unnecessary requests (fonts, images, tracking scripts, stylesheets, etc).
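
    One way to do that, sketched with Puppeteer's request interception (set it up right after getting the page, before page.goto):

    await page.setRequestInterception(true);
    page.on("request", req => {
      // Skip resource types the scrape doesn't need; let everything else through
      if (["image", "font", "stylesheet", "media"].includes(req.resourceType())) {
        req.abort();
      } else {
        req.continue();
      }
    });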

    Or use the API.