web-scraping mongodb-stitch

Web Scraping in React & MongoDB Stitch App


I'm moving a MERN project into React + MongoDB Stitch after seeing it allows for easy user authentication, quick deployment, etc.

However, I am having a hard time understanding where and how I can call a site-scraping function. Previously, I scraped pages in Express.js with cheerio like this:

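const request = require("request");
const cheerio = require("cheerio");

// Assumes `app` is an Express app with JSON body parsing enabled.
app.post("/api/getTitleAtURL", (req, res) => {
  if (!req.body.url) {
    // Respond instead of leaving the request hanging when no URL is sent.
    return res.status(400).send({ message: "THIS IS AN ERROR" });
  }
  request(req.body.url, function(error, response, body) {
    if (!error && response.statusCode == 200) {
      const $ = cheerio.load(body);
      const webpageTitle = $("title").text();
      const metaDescription = $("meta[name=description]").attr("content");
      const webpage = {
        title: webpageTitle,
        metaDescription: metaDescription
      };
      res.send(webpage);
    } else {
      res.status(400).send({ message: "THIS IS AN ERROR" });
    }
  });
});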

But obviously, with Stitch, Node and Express are not needed. Is there a way to fetch another site's content without having to host a Node.js application that serves just that one function?

Thanks


Solution

  • Turns out you can build Functions in MongoDB Stitch that allow you to upload external dependencies.

    However, there are limitations: for example, cheerio didn't work as an uploaded external dependency, while request did. One solution, therefore, is to create a serverless function in AWS Lambda and then connect MongoDB Stitch to it (MongoDB Stitch can connect to many third-party services, including AWS services such as Lambda, S3, and Kinesis).

    AWS Lambda lets you upload any external dependencies. If MongoDB Stitch allowed any dependency, we wouldn't need Lambda, but Stitch's dependency support is still limited. In my case, I had a Node function with cheerio and request as external dependencies. To upload it to Lambda: create an AWS account, create a new Lambda function, and pack your node modules and code into a zip file to upload. Your zip should look like this: [screenshot of the zip contents]

    Your file containing the function should look like:

    const cheerio = require("cheerio");
    const request = require("request");

    // Lambda handler: fetches event.requestURL and returns the page's
    // title and meta description.
    exports.rss = function(event, context, callback) {
      request(event.requestURL, function(error, response, body) {
        if (!error && response.statusCode == 200) {
          const $ = cheerio.load(body);
          const webpageTitle = $("title").text();
          const metaDescription = $("meta[name=description]").attr("content");
          const webpage = {
            title: webpageTitle,
            metaDescription: metaDescription
          };
          // The result goes back through the callback; a return value
          // from this inner function would be ignored.
          callback(null, webpage);
        } else {
          callback(null, { message: "THIS IS AN ERROR" });
        }
      });
    };
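    Before zipping and uploading, it can help to exercise the handler locally. The sketch below wraps a callback-style handler (same `(event, context, callback)` signature as above) in a Promise. `invokeLocally` and `fakeHandler` are hypothetical names introduced here for illustration; they are not part of Lambda or of the original code.

```javascript
// Wrap a callback-style Lambda handler in a Promise for local testing.
// `invokeLocally` is a hypothetical helper, not an AWS API.
function invokeLocally(handler, event) {
  return new Promise((resolve, reject) => {
    // An empty object stands in for the Lambda context argument.
    handler(event, {}, (error, result) => {
      if (error) reject(error);
      else resolve(result);
    });
  });
}

// Stand-in handler with the same shape as the one above, but without
// network access, so the harness can be tried on its own.
function fakeHandler(event, context, callback) {
  if (!event.requestURL) {
    callback(null, { message: "THIS IS AN ERROR" });
    return;
  }
  callback(null, {
    title: "Title of " + event.requestURL,
    metaDescription: undefined
  });
}
```

    With the real handler, the call might look like `invokeLocally(require("./index").rss, { requestURL: "https://example.com" }).then(console.log)`, assuming the file is named `index.js`.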
    

    In MongoDB Stitch, connect to a third-party service: choose AWS and enter the access key ID and secret access key you got from creating an AWS IAM user. Under Rules -> Actions, choose Lambda as the API and allow all actions. Now, in your MongoDB Stitch functions, you can connect to Lambda. In my case, that function looks like this:

    exports = async function(requestURL) {
      const lambda = context.services.get('getTitleAtURL').lambda("us-east-1");

      const result = await lambda.Invoke({
        FunctionName: "getTitleAtURL",
        Payload: JSON.stringify({requestURL: requestURL})
      });

      console.log(result.Payload.text());
      return EJSON.parse(result.Payload.text());
    };
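    From the React side, the Stitch function can then be called through the Stitch browser SDK's `callFunction`. This is a minimal sketch, assuming the Stitch function above was saved under the hypothetical name `getTitleAtURL` and that `client` is an already-authenticated Stitch app client. The `fetchPageInfo` helper is introduced here for illustration only.

```javascript
// Hypothetical helper: calls the Stitch function and normalises the result.
// `client` is assumed to be a logged-in Stitch app client, e.g. one obtained
// via Stitch.initializeDefaultAppClient("<your-app-id>") from the
// mongodb-stitch-browser-sdk package.
async function fetchPageInfo(client, url) {
  // callFunction takes the function name and an array of arguments.
  const result = await client.callFunction("getTitleAtURL", [url]);

  // The Lambda handler reports failures as { message: ... } rather than
  // through its error argument, so check for that shape explicitly.
  if (result && result.message) {
    throw new Error(result.message);
  }
  return result;
}
```

    Keeping the client as a parameter means the helper can be tried with a stub object before wiring it into a component.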
    

    Note: this slowed performance down considerably, though. Calls generally took about twice as long to finish.
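    If the extra Lambda hop is too slow, one alternative worth trying is to skip cheerio and pull the two fields out of the HTML string directly. This is a different technique from the one above (regular expressions instead of a real HTML parser), so treat it as a rough sketch: it assumes reasonably well-formed markup with quoted attributes and will miss unusual cases. `extractPageInfo` is a hypothetical name.

```javascript
// Rough, regex-based stand-in for the two cheerio lookups.
// Not a real HTML parser; assumes quoted attribute values.
function extractPageInfo(html) {
  // First <title>...</title>, allowing attributes and line breaks.
  const titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);

  // Find the first <meta> tag whose name attribute is "description",
  // then read its content attribute regardless of attribute order.
  let metaDescription;
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  for (const tag of metaTags) {
    if (/\sname\s*=\s*(["'])description\1/i.test(tag)) {
      const content = /\scontent\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag);
      if (content) {
        metaDescription = content[1] !== undefined ? content[1] : content[2];
      }
      break;
    }
  }

  return {
    // Unlike cheerio's .text(), surrounding whitespace is trimmed here.
    title: titleMatch ? titleMatch[1].trim() : "",
    metaDescription: metaDescription
  };
}
```

    The HTML string would come from whatever HTTP client is available in the Stitch function; as noted above, request did work there as an uploaded dependency.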