[SOLVED] "United States" not ["United","States"]

I have a text field in Elasticsearch and I want to build a word-cloud visualization from it in Kibana.

The first step is to tokenize the text, and I used the standard tokenizer. With it, the word-cloud visualization shows only single words, so "United" and "States" appear as separate terms.

What I need is for proper nouns like "United States", "United Nations", "Security Council" and so on not to be split apart, so that each one shows up in the word cloud as a single term. These proper nouns or phrases are roughly 2-5 words long (like "the People's Republic of China").

What should I do? Is this related to N-grams?

Example text:

The United States of America is a charter member of the United Nations and one of five permanent members of the UN Security Council.

The United States is host to the headquarters of the United Nations, which includes the usual meeting place of the General Assembly in New York City, the seat of the Security Council and several bodies of the United Nations. The United States is the largest provider of financial contributions to the United Nations, providing 22 percent of the entire UN budget in 2017 (in comparison the next biggest contributor is Japan with almost 10 percent, while EU countries pay a total of above 30 percent). From July 2016 to June 2017, 28.6 percent of the budget used for peacekeeping operations was provided by the United States. The United States had a pivotal role in establishing the UN.

Solution

This is a named-entity recognition (NER) task, not a standard tokenization task. There are plugins that do this inside Elasticsearch, but none of them look promising.

To make this work, you need to preprocess your data on the application side. Use an NLP library (Stanford CoreNLP, spaCy, ...) to extract the named entities from each document. Then create a keyword field in your mapping (call it entities, for example) and store the extracted entities there as an array. Because a keyword field is not analyzed, "United States" stays a single term, and you can use this field to generate your word cloud.
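As a rough sketch of that preprocessing step in Python: the field names text and entities, the mapping layout (the typeless format used by recent Elasticsearch versions) and the helper names are my assumptions, not something from your setup. To keep the example runnable without any model download, extract_entities here is only a crude regex stand-in that picks up runs of capitalised words. In practice you would replace its body with a real NER call, for example spaCy's [ent.text for ent in nlp(text).ents].

```python
import json
import re

# Assumed mapping for a hypothetical index: "text" stays a full-text field,
# "entities" is a keyword field, so every phrase is stored as one unbroken term.
MAPPING = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "entities": {"type": "keyword"},
        }
    }
}

# Crude stand-in for a real NER model: runs of two or more capitalised words.
# With spaCy this would be roughly: [ent.text for ent in nlp(text).ents]
_CAPITALISED_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_entities(text):
    """Return the distinct multi-word capitalised phrases in text, in order."""
    entities = []
    for phrase in _CAPITALISED_RUN.findall(text):
        # Drop a sentence-initial article so "The United States" and
        # "United States" end up as the same term.
        if phrase.startswith("The "):
            phrase = phrase[len("The "):]
        # Keep only phrases that are still multi-word and not seen yet.
        if " " in phrase and phrase not in entities:
            entities.append(phrase)
    return entities


def build_document(text):
    """Build the JSON body to index: the raw text plus its entities array."""
    return {"text": text, "entities": extract_entities(text)}


sample = (
    "The United States is host to the headquarters of the United Nations, "
    "which includes the Security Council."
)
print(json.dumps(build_document(sample), indent=2))
```

You would create the index with MAPPING, index the output of build_document for every document, and then point the Kibana word cloud at the entities field instead of the analyzed text field.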

Good Luck.