python nlp lda

How to handle numbers embedded in text during NLP pre-processing?


I am trying to run the LDA algorithm on a data set of news articles. I understand that numbers must be removed during the pre-processing step, and I have written a simple regex that replaces numbers with empty strings.

df['number_removed'] = df['text'].str.replace(r'\d+', '', regex=True)

However, I would like to retain some numbers since removing them can potentially change the context/topic. For example,

[Desired] 'The fourth industrial revolution also referred to as Industry 40 is starting to change the way goods are produced'

[Wrong] 'The fourth industrial revolution also referred to as Industry is starting to change the way goods are produced'

Note: Punctuation has been removed in the example as part of pre-processing.

So, I was wondering:

  1. Can essential numbers be retained before running LDA?
  2. How can I selectively remove numbers, or otherwise handle the situation above?

Solution

  • A common approach in similar situations is to replace each number with a dummy token, such as <NUMBER>. This preserves the fact that there was a number in the original text without disturbing the syntactic context. The actual value is usually not that important for generalisation.

    If you want to retain specific numbers (as in "Industry 40"), you need to adjust your regex so that it skips those patterns.
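Both suggestions can be combined in one pass. The following is only a minimal sketch, not a definitive implementation: the function name handle_numbers and the KEEP_PREFIXES whitelist are hypothetical, and it assumes that the numbers worth keeping always follow a known word such as "industry". Numbers after a whitelisted word are left alone; every other run of digits is either dropped or replaced with a placeholder token.

```python
import re

# Hypothetical whitelist: words whose following number should be retained.
KEEP_PREFIXES = ("industry",)

# First alternative: a whitelisted word followed by a number (kept as-is).
# Second alternative: any other run of digits (removed or replaced).
_NUMBER_PATTERN = re.compile(
    r"\b(?P<keep>(?:%s)\s+\d+)\b|\d+" % "|".join(map(re.escape, KEEP_PREFIXES)),
    flags=re.IGNORECASE,
)


def handle_numbers(text, placeholder=""):
    """Drop or replace digits, except numbers that follow a whitelisted word."""

    def _substitute(match):
        kept = match.group("keep")
        # Return whitelisted phrases unchanged; swap everything else.
        return kept if kept else placeholder

    replaced = _NUMBER_PATTERN.sub(_substitute, text)
    # Collapse the extra whitespace left behind by removed numbers.
    return " ".join(replaced.split())


sample = "report 7 mentions Industry 40 and 300 factories"
print(handle_numbers(sample))
print(handle_numbers(sample, placeholder="<NUMBER>"))
```

On a DataFrame like the one in the question, this could be applied with df['text'].apply(handle_numbers). A callback is used rather than a lookbehind because Python's re module only supports fixed-width lookbehinds, which would break as soon as the whitelist contains words of different lengths.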