elasticsearch, elasticsearch-2.0, elasticsearch-mapping, elasticsearch-template, apache-nifi

Elasticsearch: Indexing tweets - mapping, template or ETL


I am about to index tweets coming from Apache NiFi into Elasticsearch via HTTP POST and want to do the following:

  1. Map the created_at field as a date. Should I use a mapping or an index template for this?

  2. Make some fields not analyzed, such as hashtags, URLs, etc.

  3. Store only some important fields rather than the entire tweet: the text, a few user fields (not all user information), and the hashtags and URLs from entities (the URLs in the post). I don't need the quoted source, etc. What should I use in this case? A template? Or should I pre-process the tweets with some ETL process to extract the data I need and then index it in ES?

I am a bit confused and would really appreciate any advice.

Thanks in advance.


Solution

  • I guess in your NiFi flow you have something like GetTwitter and PostHTTP configured. NiFi is already a sort of ETL tool, so you probably don't need another one. However, since you don't want to index the whole JSON coming out of Twitter, you clearly need another NiFi processor in between to select what you want and transform the raw JSON into a more lightweight document. Here is an example of how to do it for Solr, but I'm not sure the same processor exists for Elasticsearch.

    This article about streaming Twitter data to Elasticsearch using Logstash shows a possible index template that you could use as a starting point for your own (i.e. add the created_at date field if you like).

    Since you don't want to index everything, the way to go is clearly to come up with your own mapping, which you can then put in an index template. Elasticsearch applies a template automatically to every new index whose name matches the template's pattern, so you will be able to create daily/weekly/monthly Twitter indices as you see fit.
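
To make the two pieces above more concrete, here is a minimal Python sketch, not a definitive implementation. It assumes the standard Twitter JSON layout (created_at, text, user, entities.hashtags[].text, entities.urls[].expanded_url) and the Elasticsearch 2.x template syntax, where a string field is left unanalyzed with "index": "not_analyzed". The names slim_tweet, daily_index_name, the "twitter-*" pattern, the "tweet" type and the choice of user fields are hypothetical and only there for illustration. In NiFi the trimming step would be done by a processor in the flow rather than by Python code.

```python
import json
from datetime import datetime

# Joda-style pattern matching Twitter's created_at,
# e.g. "Wed Aug 27 13:08:45 +0000 2008".
TWITTER_DATE_FORMAT = "EEE MMM dd HH:mm:ss Z yyyy"

# Index template body in Elasticsearch 2.x syntax (assumed version).
# "twitter-*" and the "tweet" type name are illustrative choices.
INDEX_TEMPLATE = {
    "template": "twitter-*",
    "mappings": {
        "tweet": {
            "properties": {
                "created_at": {"type": "date", "format": TWITTER_DATE_FORMAT},
                "text": {"type": "string"},
                "hashtags": {"type": "string", "index": "not_analyzed"},
                "urls": {"type": "string", "index": "not_analyzed"},
                "user": {
                    "properties": {
                        "screen_name": {"type": "string", "index": "not_analyzed"},
                        "followers_count": {"type": "integer"},
                    }
                },
            }
        }
    },
}


def slim_tweet(tweet):
    """Return a lightweight document holding only the fields to index."""
    entities = tweet.get("entities", {})
    user = tweet.get("user", {})
    return {
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "user": {
            "screen_name": user.get("screen_name"),
            "followers_count": user.get("followers_count"),
        },
        "hashtags": [h.get("text") for h in entities.get("hashtags", [])],
        "urls": [u.get("expanded_url") for u in entities.get("urls", [])],
    }


def daily_index_name(created_at, prefix="twitter-"):
    """Build a daily index name such as 'twitter-2008.08.27' from created_at."""
    parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return prefix + parsed.strftime("%Y.%m.%d")


if __name__ == "__main__":
    # Print the template body so it can be sent to Elasticsearch.
    print(json.dumps(INDEX_TEMPLATE, indent=2))
```

The printed JSON would be sent once with a PUT to /_template/<some-name>. After that, any slimmed document posted to an index whose name matches twitter-* picks up the mapping when that index is first created. On newer Elasticsearch versions the same idea applies, but the unanalyzed string type is called keyword and the pattern key is index_patterns.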