I am implementing a simple search engine over a source data set of about 12k news articles on different topics. We assume that the search engine only has to respond to:
For instance this query:
is a query that should contain:
The point is that a Phrase Query must appear contiguously in a single piece, with no other words between its words. My problem is splitting these three types of queries using Python string operations or the re library.
I have written this piece of code for extracting Phrase Queries and Not Queries, but I have not managed to extract the And Queries yet.
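import re

query = input()
# Phrase queries: the text between each pair of double quotes.
phrase_query = re.findall(r'"([^"]*)"', query)
# Not queries: each word immediately preceded by "!".
not_query = re.findall(r'!(\w+)', query)
print(phrase_query)
print(not_query)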
For the input of:
the above code returns:
['global warming']
['USA']
Which is great. However, I cannot extract the And Query. How can I extract the And Query (worldwide) into a separate list?
If I understand the problem correctly, anything that is not part of a phrase query or a not query is part of the and query. So we can remove the terms that belong to those queries from the string and then split what is left to get the individual terms.
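import re

data = '"global warming" worldwide !USA'
query = data

# Phrase queries: the text between each pair of double quotes.
phrase_query = re.findall(r'"([^"]*)"', query)
# Not queries: each word immediately preceded by "!".
not_query = re.findall(r'!(\w+)', query)

and_query = data[:]
# Remove every phrase query, quotes included, from the remaining text.
for q in phrase_query:
    complete_text = '"' + q + '"'
    and_query = and_query.replace(complete_text, "")
# Remove every not query, including its leading "!".
for q in not_query:
    complete_text = "!" + q
    and_query = and_query.replace(complete_text, "")
# Whatever is left consists of and-query terms.
and_query = and_query.split()

print(and_query)
print(phrase_query)
print(not_query)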
In the first for loop, I go over all the phrase queries and rebuild each one by adding the quotes before and after it, exactly as it appears in the original query. Then I replace it with an empty string, which removes it. After that, I do the same with all the not queries, but this time I add an exclamation mark in front.
The remaining terms in the search are all and queries, so we can split the string to get those terms individually in a list.
EDIT for a more robust solution (one that handles spaces effectively):
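The same remove-then-split idea can also be written with re.sub instead of the two replace loops. This is only a sketch under the same assumptions as the code above (phrases in double quotes, not terms written as !word with no space in between); the names PHRASE, NOT_TERM and remainder are made up for illustration.

```python
import re

PHRASE = r'"([^"]*)"'   # hypothetical name: text between double quotes
NOT_TERM = r'!(\w+)'    # hypothetical name: a word directly after "!"

query = '"global warming" worldwide !USA'

phrase_query = re.findall(PHRASE, query)
not_query = re.findall(NOT_TERM, query)

# Blank out both kinds of match, then split what is left on whitespace.
remainder = re.sub(PHRASE, " ", query)
remainder = re.sub(NOT_TERM, " ", remainder)
and_query = remainder.split()

print(and_query)     # ['worldwide']
print(phrase_query)  # ['global warming']
print(not_query)     # ['USA']
```

Replacing with a single space rather than an empty string keeps neighbouring and terms from being glued together when a phrase sits directly between them.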
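import re

data = '" global warming " worldwide ! USA'
query = data

# Phrase queries: the text between each pair of double quotes.
phrase_query = re.findall(r'"([^"]*)"', query)
# Not queries: "!" followed by optional whitespace and a word.
# The whitespace is captured too, so the replace below matches the original text.
not_query = re.findall(r'!(\s*\w+)', query)

and_query = data[:]
# Remove every phrase query, quotes included, from the remaining text.
for q in phrase_query:
    complete_text = '"' + q + '"'
    and_query = and_query.replace(complete_text, "")
# Remove every not query, including its leading "!" and any spaces after it.
for q in not_query:
    complete_text = "!" + q
    and_query = and_query.replace(complete_text, "")

# Strip surrounding whitespace from every extracted term.
and_query = [answer.strip() for answer in and_query.split()]
phrase_query = [answer.strip() for answer in phrase_query]
not_query = [answer.strip() for answer in not_query]

print(and_query)
print(phrase_query)
print(not_query)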
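As an alternative to removing matches and splitting the rest, the three kinds of terms can be classified in a single pass with one alternation pattern. This is a rough sketch, not a definitive parser: the function name split_query and the pattern name TOKEN are hypothetical, and it assumes and terms and not terms consist only of \w characters.

```python
import re

# One alternation: a quoted phrase, a "!"-prefixed word, or a plain word.
TOKEN = re.compile(r'"([^"]*)"|!\s*(\w+)|(\w+)')

def split_query(query):
    """Return (and_terms, phrases, not_terms) for a query string."""
    and_terms, phrases, not_terms = [], [], []
    for match in TOKEN.finditer(query):
        phrase, not_term, and_term = match.groups()
        if phrase is not None:
            # Collapse whitespace so '" global  warming "' becomes 'global warming'.
            phrases.append(" ".join(phrase.split()))
        elif not_term is not None:
            not_terms.append(not_term)
        else:
            and_terms.append(and_term)
    return and_terms, phrases, not_terms

print(split_query('" global warming " worldwide ! USA'))
# (['worldwide'], ['global warming'], ['USA'])
```

Because the phrase alternative is tried first, a "!" that appears inside a quoted phrase stays part of the phrase instead of being picked up as a not query.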