I am planning to build an AI system that learns from a corpus (a text file) and answers users' questions like a chatbot — in short, a chatbot without any predefined data.
So far I have web-scraped some data, stored it as a text file, and used the TF-IDF (cosine similarity) method to make the system answer questions, but the accuracy is only moderate.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# `sent_tokens` (the corpus sentences) and `LemNormalize` (a lemmatizing
# tokenizer) are defined elsewhere in my script.
def response(user_response):
    # Temporarily append the query so it is vectorized with the corpus
    sent_tokens.append(user_response)
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # Similarity of the query (last row) against every sentence
    vals = cosine_similarity(tfidf[-1], tfidf)
    idx = vals.argsort()[0][-2]  # best match, skipping the query itself
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    sent_tokens.pop()  # remove the query from the corpus again
    if req_tfidf == 0:
        return "can't understand"
    return sent_tokens[idx]
That is the TF-IDF method I used.
Is there any other way to build a system that does this more accurately?
Please find below links to systems that already do what you want:
https://demo.allennlp.org/reading-comprehension
https://towardsdatascience.com/elmo-contextual-language-embedding-335de2268604
These are already-built systems that allow you to do just that.
If you want to build something similar from scratch, there are a number of processing steps that need to be applied to the text.
TF-IDF is a BoW (bag-of-words) algorithm that can help you identify intent, but not the relations between intents. The matrix obtained from the TF-IDF vectorizer, together with a label, just tells the machine: if a similar matrix is obtained for some text, this is the label. That is handy for classification, but not for chatbot responses.
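To see the limitation concretely, here is a minimal sketch (using scikit-learn, with made-up example sentences) showing that TF-IDF ignores word order entirely — two sentences with opposite meanings but identical word counts come out as indistinguishable:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Opposite meanings, identical word counts
docs = ["the dog bit the man", "the man bit the dog"]
vecs = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(vecs[0], vecs[1])[0][0]
print(sim)  # 1.0 -- the BoW representation cannot tell them apart
```

Any approach that only compares bag-of-words vectors will hit this ceiling, which is why context-aware models help.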
To get a response from a chatbot:
- Segment the text into sentences.
- Use various techniques to obtain the context of the text; currently XLNet provides the best results ( https://medium.com/dair-ai/xlnet-outperforms-bert-on-several-nlp-tasks-9ec867bb563b ). This will help you formulate responses to the queries asked via the chatbot.
The above are a few rudimentary steps; an actual AI system will involve a lot more.
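As a rough sketch of the segmentation step, here is a naive regex-based splitter — a stand-in assumption for a proper sentence segmenter such as nltk.sent_tokenize, which you would use in practice:

```python
import re

def segment_sentences(text):
    # Naive splitter: break on ., ! or ? followed by whitespace and a
    # capital letter. A real segmenter handles abbreviations, quotes, etc.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    return [s.strip() for s in parts if s.strip()]

corpus = "TF-IDF finds similar sentences. It ignores word order. Context models help."
sents = segment_sentences(corpus)
print(sents)  # three sentences
```

Once the corpus is segmented, each sentence can be embedded with a contextual model and matched against the query, instead of relying on raw term overlap.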