
Does the Google Cloud Natural Language API actually support parsing HTML?


I'm trying to extract the main body content from news sites & blogs.

The docs suggest that documents.analyzeSyntax handles HTML: you pass it a document whose content is the page's raw HTML (UTF-8) and whose type is set to HTML. HTML is explicitly listed as a supported content type.

In practice, however, the resulting sentences and tokens are cluttered with HTML tags, as though the parser treats the input as plain text. As it stands, this rules out the GC NL API for my use case, and presumably for many others, since running natural language processing on web pages is a pretty common task.

For reference, here is an example by Dandelion API of the type of output one would expect given HTML input (or rather in this case a URL to an HTML page as input).

My question, then: am I missing something, possibly invoking the API incorrectly, or does the NL API not support HTML?


Solution

  • Yes, it does.

    I'm not sure which language you are using, but here is an example in Python using the client library:

    from google.cloud import language
    
    client = language.Client()
    
    # document of type PLAIN_TEXT
    text = "hello"
    document_text = client.document_from_text(text)
    syntax_text = document_text.analyze_syntax()
    
    print("\n\ndocument of type PLAIN_TEXT:")
    for token in syntax_text.tokens:
        print(token.__dict__)
    
    # document of type HTML
    html = "<p>hello</p>"
    document_html = client.document_from_html(html)
    syntax_html = document_html.analyze_syntax()
    
    print("\n\ndocument of type HTML:")
    for token in syntax_html.tokens:
        print(token.__dict__)
    
    # HTML content mislabeled as PLAIN_TEXT (the tags get tokenized as text)
    document_mismatch = client.document_from_text(html)
    syntax_mismatch = document_mismatch.analyze_syntax()
    
    print("\n\ndocument of type PLAIN_TEXT but with HTML content:")
    for token in syntax_mismatch.tokens:
        print(token.__dict__)
    

    This works for me: when the document type is set to HTML, the tags <p> and </p> are not processed as natural language.

    If you go through the setup steps on this page, you can quickly experiment with the gcloud command-line tool:

    gcloud beta ml language analyze-syntax --content="<p>hello</p>" --content-type="HTML"
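
If you call the REST method from the question (documents.analyzeSyntax) directly instead of going through a client library, the thing to check is that the request body sets the document's type to HTML. Below is a minimal sketch of that request, assuming the v1 endpoint and API-key authentication. The helper names `build_syntax_request` and `analyze_syntax` are made up for illustration, and only the body-building part runs without network access and credentials.

```python
import json
import urllib.request

# REST endpoint for documents.analyzeSyntax (assuming v1 of the API).
ANALYZE_SYNTAX_URL = "https://language.googleapis.com/v1/documents:analyzeSyntax"


def build_syntax_request(content, content_type="HTML"):
    """Build the JSON body for documents.analyzeSyntax (hypothetical helper)."""
    if content_type not in ("HTML", "PLAIN_TEXT"):
        raise ValueError("content_type must be 'HTML' or 'PLAIN_TEXT'")
    return {
        # The type field is what tells the API to treat tags as markup.
        "document": {"type": content_type, "content": content},
        "encodingType": "UTF8",
    }


def analyze_syntax(content, api_key, content_type="HTML"):
    """POST the request and return the parsed JSON response.

    Hypothetical helper: needs network access and a valid API key.
    """
    body = json.dumps(build_syntax_request(content, content_type)).encode("utf-8")
    request = urllib.request.Request(
        ANALYZE_SYNTAX_URL + "?key=" + api_key,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Print the body only; no request is sent here.
    print(json.dumps(build_syntax_request("<p>hello</p>"), indent=2))
```

If the tokens still contain tags, printing the body like this is a quick way to confirm whether the document was actually sent as HTML or as PLAIN_TEXT.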