scala · apache-spark · johnsnowlabs-spark-nlp

How to use an annotator in Spark NLP on a text file


As I am a beginner in Spark NLP, I started doing some hands-on exercises using the functions documented by John Snow Labs.

I am using Scala on Databricks, and I have a large text file from https://www.gutenberg.org/

So first I import the necessary libraries and load the data as follows:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import spark.implicits._ // required for toDF on an RDD

// Keep the file as a distributed RDD of lines (no collect)
val bookRDD = sc.textFile("/FileStore/tables/84_0-5b1ef.txt")
// Drop empty lines and split on non-word characters
val words = bookRDD.filter(x => x.length > 0).flatMap(line => line.split("""\W+"""))
val rddD = words.toDF("text")

How do I use the different annotators available from John Snow Labs, depending on my purpose?

For example, if I want to remove stop words, then I can use

val stopWordsCleaner = new StopWordsCleaner()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setStopWords(Array("this", "is", "and"))
      .setCaseSensitive(false)

But I have no idea how to apply this and find the stop words of my text file. Do I need to use a pre-trained model with the annotator?

I found it very difficult to find a good tutorial about this, so I would be grateful if someone could provide some useful hints.


Solution

  • StopWordsCleaner is the annotator to use to remove stop words.

    Refer: Annotators

    Stop words may differ for your text depending on your context, but generally all NLP engines have a set of stop words that they match and remove.

    In JSL Spark NLP, you may also set your own stop words using setStopWords while using StopWordsCleaner, as in the sketch below.
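
Here is a minimal end-to-end sketch of how the cleaner could be wired into a pipeline. It assumes the rddD DataFrame with a text column built in the question; the column names (document, token, cleanTokens) are just conventions, and the array_except step assumes Spark 2.4+:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

// Turns the raw "text" column into Spark NLP's document type
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Produces the "token" column that StopWordsCleaner expects as input
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// Removes the stop words given in setStopWords; no pre-trained model needed
val stopWordsCleaner = new StopWordsCleaner()
  .setInputCols("token")
  .setOutputCol("cleanTokens")
  .setStopWords(Array("this", "is", "and"))
  .setCaseSensitive(false)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler, tokenizer, stopWordsCleaner))

val cleaned = pipeline.fit(rddD).transform(rddD)

// Tokens with the stop words removed
cleaned.selectExpr("cleanTokens.result").show(truncate = false)

// The stop words actually found in each row: the tokens that were dropped
cleaned.selectExpr("array_except(token.result, cleanTokens.result) as removed")
  .show(truncate = false)

So no pre-trained model is required: StopWordsCleaner works with whatever word list you pass to setStopWords. Recent Spark NLP versions also ship pre-trained stop-word models (e.g. via StopWordsCleaner.pretrained()), but using them is optional.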