Given:
A simple and small pandas dataframe as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "user_ip": ["u7", "u3", "u1", "u9", "u4", "u8", "u1", "u2", "u5"],
        "raw_sentence": ["First sentence!", np.nan, "I go to school everyday!", "She likes chips!", "I go to school everyday!", "This is 1 sample text!", "She likes chips!", "This is the thrid sentence.", "I go to school everyday!"],
    }
)
user_ip raw_sentence
0 u7 First sentence!
1 u3 NaN
2 u1 I go to school everyday!
3 u9 She likes chips!
4 u4 I go to school everyday! <<< duplicate >>>
5 u8 This is 1 sample text!
6 u1 She likes chips! <<< duplicate >>>
7 u2 This is the thrid sentence.
8 u5 I go to school everyday! <<< duplicate >>>
Goal:
I wonder whether I can avoid calling map, or use some other strategy, for rows whose raw_sentence values are exact duplicates. My intention is to speed up the implementation for a larger dataframe (~100K rows).
[Inefficient] Solution:
Right now I take advantage of .map() with a lambda, which goes through each row and calls the get_lm() function to retrieve the lemmas of the raw input sentence:
import nltk

nltk.download('all', quiet=True, raise_on_error=True)

STOPWORDS = nltk.corpus.stopwords.words('english')
wnl = nltk.stem.WordNetLemmatizer()
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

def get_lm(input_sent: str = "my text!"):
    # tokenize and lowercase, dropping stopwords, single characters and purely numeric tokens
    tks = [w for w in tokenizer.tokenize(input_sent.lower())
           if w not in STOPWORDS and len(w) > 1 and not w.isnumeric()]
    # lemmatize with the WordNet POS when the tag maps to one, otherwise use the default (noun)
    lms = [wnl.lemmatize(w, t[0].lower()) if t[0].lower() in ('a', 's', 'r', 'n', 'v')
           else wnl.lemmatize(w)
           for w, t in nltk.pos_tag(tks)]
    return lms

df["lemma"] = df["raw_sentence"].map(lambda raw: get_lm(input_sent=raw), na_action='ignore')
user_ip raw_sentence lemma
0 u7 First sentence! [first, sentence] <<< 1st occurrence => lemmatization OK! >>>
1 u3 NaN NaN <<< NaN ignored using na_action='ignore' >>>
2 u1 I go to school everyday! [go, school, everyday] <<< 1st occurrence => lemmatization OK! >>>
3 u9 She likes chips! [like, chip] <<< 1st occurrence => lemmatization OK! >>>
4 u4 I go to school everyday! [go, school, everyday] <<< already lemmatized, no need to do it again >>>
5 u8 This is 1 sample text! [sample, text] <<< 1st occurrence => lemmatization OK! >>>
6 u1 She likes chips! [like, chip] <<< already lemmatized, no need to do it again >>>
7 u2 This is the thrid sentence. [thrid, sentence] <<< 1st occurrence => lemmatization OK! >>>
8 u5 I go to school everyday! [go, school, everyday] <<< already lemmatized, no need to do it again >>>
Is there a more efficient approach to this?
Cheers,
Don't reinvent the wheel, use functools.cache:
from functools import cache

@cache
def get_lm(input_sent: str = "my text!"):
    # same body as before; @cache memoizes the result per distinct input_sent,
    # so duplicated sentences are lemmatized only once
    tks = [w for w in tokenizer.tokenize(input_sent.lower())
           if w not in STOPWORDS and len(w) > 1 and not w.isnumeric()]
    lms = [wnl.lemmatize(w, t[0].lower()) if t[0].lower() in ('a', 's', 'r', 'n', 'v')
           else wnl.lemmatize(w) for w, t in nltk.pos_tag(tks)]
    return lms
df["lemma"] = df["raw_sentence"].map(lambda raw: get_lm(input_sent=raw), na_action='ignore')
Output:
user_ip raw_sentence lemma
0 u7 First sentence! [first, sentence]
1 u3 NaN NaN
2 u1 I go to school everyday! [go, school, everyday]
3 u9 She likes chips! [like, chip]
4 u4 I go to school everyday! [go, school, everyday]
5 u8 This is 1 sample text! [sample, text]
6 u1 She likes chips! [like, chip]
7 u2 This is the thrid sentence. [thrid, sentence]
8 u5 I go to school everyday! [go, school, everyday]
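You can verify that the cache is doing the work by inspecting the wrapper's statistics; functools.cache exposes cache_info(), since it is lru_cache(maxsize=None) under the hood. For the sample dataframe above, the five distinct non-NaN sentences are misses and the three duplicates are hits:

print(get_lm.cache_info())
# CacheInfo(hits=3, misses=5, maxsize=None, currsize=5)

If you would rather not keep a process-wide cache, a minimal sketch of an alternative (using the same get_lm and df as above; lemma_by_sentence is just an illustrative name) is to lemmatize only the unique sentences and map the resulting dict back onto the column:

uniq = df["raw_sentence"].dropna().unique()
lemma_by_sentence = {s: get_lm(input_sent=s) for s in uniq}  # one call per distinct sentence
df["lemma"] = df["raw_sentence"].map(lemma_by_sentence)      # NaN has no entry, so it stays NaN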