I'm trying to extract the main body content from news sites & blogs.
The docs make it seem as though documents.analyzeSyntax would work as expected with HTML by passing it a document with the content as the page's raw HTML (UTF-8) and the document's type set to HTML. The docs definitely include HTML as a supported content type.
In practice, however, the resulting sentences and tokens are cluttered with HTML tags, as though the parser treats the input as plain text. As it stands, this rules out the Google Cloud Natural Language API for my use case, and presumably for many others, since processing web pages with natural language tools is a pretty common task.
For reference, here is an example by Dandelion API of the type of output one would expect given HTML input (or rather in this case a URL to an HTML page as input).
My question, then: am I missing something, possibly invoking the API incorrectly, or does the NL API not support HTML?
Yes, it does.
Not sure what language you were using, but below is an example in Python using the client library:
from google.cloud import language
client = language.Client()
# document of type PLAIN_TEXT
text = "hello"
document_text = client.document_from_text(text)
syntax_text = document_text.analyze_syntax()
print("\n\ndocument of type PLAIN_TEXT:")
for token in syntax_text.tokens:
    print(token.__dict__)
# document of type HTML
html = "<p>hello</p>"
document_html = client.document_from_html(html)
syntax_html = document_html.analyze_syntax()
print("\n\ndocument of type HTML:")
for token in syntax_html.tokens:
    print(token.__dict__)
# document of type PLAIN_TEXT but should be HTML
document_mismatch = client.document_from_text(html)
syntax_mismatch = document_mismatch.analyze_syntax()
print("\n\ndocument of type PLAIN_TEXT but with HTML content:")
for token in syntax_mismatch.tokens:
    print(token.__dict__)
This works for me in that the HTML tags <p> and </p> are not processed as natural language.
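If you are calling the REST method directly rather than going through the client library, one thing worth checking is that the document's type field really is sent as HTML and not PLAIN_TEXT. Below is a minimal sketch of the request body, assuming the v1 documents:analyzeSyntax endpoint and API-key authentication. The helper names build_syntax_request and analyze_syntax are hypothetical, made up for this illustration, and not part of any Google library.

```python
import json
import urllib.request

# Assumption: the v1 REST endpoint; check the API reference for the version you use.
ENDPOINT = "https://language.googleapis.com/v1/documents:analyzeSyntax"


def build_syntax_request(content, is_html=True):
    """Build the JSON body for an analyzeSyntax call (hypothetical helper)."""
    return {
        "document": {
            # Sending "PLAIN_TEXT" here with HTML content is what makes
            # the tags show up as tokens.
            "type": "HTML" if is_html else "PLAIN_TEXT",
            "content": content,
        },
        "encodingType": "UTF8",
    }


def analyze_syntax(content, api_key, is_html=True):
    """POST the request and return the decoded JSON response.

    Assumption: authentication with an API key passed as a query parameter.
    Requires network access and a valid key.
    """
    body = json.dumps(build_syntax_request(content, is_html)).encode("utf-8")
    request = urllib.request.Request(
        f"{ENDPOINT}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print the body only; no network call is made here.
    print(json.dumps(build_syntax_request("<p>hello</p>"), indent=2))
```

Printing the body before sending it is a quick way to confirm that the type is what you expect.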
If you go through the setup steps on this page you can quickly experiment with the gcloud command-line tool:
gcloud beta ml language analyze-syntax --content="<p>hello</p>" --content-type="HTML"