Tags: html, ios, swift, web-scraping, swiftsoup

Scrape product price from any Website using Swift Soup


Inside my app I would like to scrape the price of any product (user types in the wanted URL).

I've searched quite a bit and found that there are a couple of web scrapers; I think I'll use SwiftSoup for now. However, I couldn't find a single tutorial that teaches how to scrape elements with "dynamic" tags. For example, the price of a product looks different on every website:

Example 1:

<div class="price">82 EUR</div>

Example 2:

<span class="gl-price__value">€ 139,95</span>

Example 3:

<span id="priceblock_ourprice" class="a-size-medium a-color-price priceBlockBuyingPriceString">79,99&nbsp;€</span>

I know I can scrape elements like this:

let html: String = "<a id=1 href='?foo=bar&mid&lt=true'>One</a> <a id=2 href='?foo=bar&lt;qux&lg=1'>Two</a>"
let els: Elements = try SwiftSoup.parse(html).select("a")
for element: Element in els.array() {
    print(try element.attr("href"))
}

But what is the best way to scrape dynamically? I couldn't find anything on this, so I'm grateful for any help :)

Update

I managed to get the right price if I know the exact class name:

let url = "https://www.adidas.de/adistar-trikot/CV7089.html"
let className = "gl-price__value"

do {
    let html: String = getHTMLfromURL(url: url)
    let doc: Document = try SwiftSoup.parse(html)

    let price: Element = try doc.getElementsByClass(className).first()!
    let priceText: String = try price.text()

    result.text = priceText
} catch Exception.Error(_, let message) {
    print(message)
} catch {
    print("error")
}

However, I would like to make this work for all three examples above. Right now I am struggling to find the right regex that covers all three... Anyone have an idea?


Solution

  • I don't think there is a way to scrape virtually anything "dynamically". You have no way to anticipate all the possible ways people can write their HTML to show you a price.

    What you could do, though I don't think it would be that easy, is to train a machine learning model to detect the price most of the time. But that's probably outside the scope of this question.

    Another way you could try is to simply look at most sites and add several "generic" algorithms for scraping them. If one doesn't work, you just try another until you either succeed or give up. By avoiding hardcoded class names and the like, you'll at least be able to scrape every site whose structure resembles one of your generic scrapers.

    One way I would approach implementing a "generic" scraper algorithm (though you can probably think of better ones) is to keep a list of regexes matching the class names of price elements, try them all, and then validate the text you extract from the HTML (e.g. is there a number inside the text? Does it contain a symbol like €, $, ...?). I would start with something like .*price.* and similar regexes you can find by looking at most sites.
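    As a sketch of that idea in Swift, assuming SwiftSoup is available (it supports Jsoup-style [attr*=value] selectors): the selector list, the scrapePrice name, and the validation rules here are my own assumptions that you would grow over time, not a definitive implementation.

    ```swift
    import Foundation
    import SwiftSoup

    // Sketch of a "generic" price scraper: try a list of class/id patterns
    // in order and validate whatever text we find.
    func scrapePrice(fromHTML html: String) -> String? {
        // CSS selectors matching elements whose class or id contains a keyword.
        let selectors = ["[class*=price]", "[class*=Price]", "[id*=price]"]

        guard let doc = try? SwiftSoup.parse(html) else { return nil }

        for selector in selectors {
            guard let elements = try? doc.select(selector) else { continue }
            for element in elements.array() {
                guard let text = try? element.text() else { continue }
                // Validate: the text should contain a digit and a currency hint.
                let hasDigit = text.rangeOfCharacter(from: .decimalDigits) != nil
                let hasCurrency = text.range(of: "€|EUR|\\$|USD|£",
                                             options: .regularExpression) != nil
                if hasDigit && hasCurrency {
                    return text
                }
            }
        }
        return nil  // none of the generic patterns matched
    }
    ```

    Keeping the selector list in data rather than in code is what lets you extend it later without touching the scraping logic itself.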

    You will definitely run into sites you didn't think of. When the client detects that it can't find the price on a site, you can send yourself that info, look at the site yourself, and either add more regexes to your list (which will probably need to be updated server-side and downloaded by the client every time it changes), or add another scraper algorithm, or make one of your existing ones generic enough to handle that case too (though this requires a new app release).

    I'm sorry this answer is not very specific, but your question was so broad it was nearly impossible to be more specific.

    PS: I'm not sure this is the best approach (a parser may be better suited for this), but one regex I could quickly come up with that matches all three of your examples is <[^>]*class=".*price.*"[^>]*>([^<]*)<. There is probably something more clever, but with this regex you automatically get the text inside the HTML element in the first capturing group. Then you just need to sanitize it (remove unwanted characters, etc.) and maybe validate it.
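    To sanity-check that regex, here is a Foundation-only sketch (the extractPrice helper name and the &nbsp; handling are just illustrative assumptions) applying it to snippets shaped like the three examples in the question:

    ```swift
    import Foundation

    // Apply the regex from above and return the first capturing group,
    // lightly sanitized.
    func extractPrice(from html: String) -> String? {
        let pattern = "<[^>]*class=\".*price.*\"[^>]*>([^<]*)<"
        guard let regex = try? NSRegularExpression(pattern: pattern) else { return nil }
        let range = NSRange(html.startIndex..., in: html)
        guard let match = regex.firstMatch(in: html, options: [], range: range),
              let group = Range(match.range(at: 1), in: html) else { return nil }
        // Sanitize: decode the one entity that appears in the examples.
        return String(html[group]).replacingOccurrences(of: "&nbsp;", with: " ")
    }

    print(extractPrice(from: "<div class=\"price\">82 EUR</div>") ?? "no match")
    // prints "82 EUR"
    ```

    Note the greedy .* inside the class attribute: it is fine on single-element snippets like these, but on a full page you may want a non-greedy or character-class variant.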