python search-engine python-re string-operations

String operations in Python: handling the queries of a simple search engine

I am implementing a simple search engine that searches a source dataset of about 12k written news articles on different topics. We assume that the search engine only needs to respond to three kinds of queries:

Phrase Queries, which come inside double quotation marks
Not Queries, which come after an exclamation mark
And Queries, which come without any specific mark

For instance this query:

"global warming" worldwide !USA

is a query whose results should:

contain the Phrase Query: "global warming"
contain the And Query: worldwide
not contain the Not Query: USA

The point is that the words of a Phrase Query must appear contiguously in a document, with no other words between them. My problem is splitting the input into these three types of queries using Python string operations or the re library.

I have written this piece of code to extract the Phrase Queries and Not Queries, but I have not managed to extract the And Queries yet:

import re

query = input()
phrase_query = re.findall(r'"([^"]*)"', query)
not_query = re.findall(r'!(\w+)', query)
print(phrase_query)
print(not_query)

For the input of:

"global warming" worldwide !USA

the above code returns:

['global warming']
['USA']

Which is great. However, I cannot extract the And Query. How can I extract the And Query (worldwide) into a separate list?

Solution

If I understand the problem correctly, anything that is not part of a phrase query or a not query is part of the and query. So we can simply remove the terms that belong to those two query types from the string and then split what is left to get the individual terms.

import re

data = '"global warming" worldwide !USA'

query = data
phrase_query = re.findall(r'"([^"]*)"', query)
not_query = re.findall(r'!(\w+)', query)

and_query = data[:]

for q in phrase_query:
    complete_text = '"' + q + '"'
    and_query = and_query.replace(complete_text, "")
for q in not_query:
    complete_text = "!" + q
    and_query = and_query.replace(complete_text, "")

and_query = and_query.split()


print(and_query)
print(phrase_query)
print(not_query)

In the first for loop, I iterate over the phrase queries and rebuild each one as it appears in the original query by adding the quotes before and after it. I then replace it with an empty string, which removes it from the text. The second loop does the same for the not queries, this time prepending an exclamation mark.

Then, the remaining terms in the search are all and queries, so we can split them to get those terms individually in a list.
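As a minimal sketch, the same steps can be wrapped in a reusable function. The name `split_query` and the two pattern constants are my own illustrative choices, and this version uses `re.sub` instead of `str.replace` to remove the matched parts, so treat it as one possible variant rather than the only way to do it.

```python
import re

# Hypothetical helper names; the patterns mirror the ones used above,
# except that NOT_TERM also tolerates whitespace after the "!".
PHRASE = r'"([^"]*)"'
NOT_TERM = r'!\s*(\w+)'

def split_query(query):
    """Return (phrase_terms, and_terms, not_terms) for a query string."""
    # 1. Collect the quoted phrases, then blank them out of the query.
    phrases = [p.strip() for p in re.findall(PHRASE, query)]
    rest = re.sub(PHRASE, " ", query)

    # 2. Collect the negated terms from what is left, then blank them out too.
    not_terms = re.findall(NOT_TERM, rest)
    rest = re.sub(NOT_TERM, " ", rest)

    # 3. Whatever remains is a whitespace-separated list of AND terms.
    and_terms = rest.split()
    return phrases, and_terms, not_terms

print(split_query('"global warming" worldwide !USA'))
```

Removing the phrases before looking for not terms also means that an exclamation mark inside a quoted phrase is not mistaken for a not query.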

EDIT: a more robust solution (one that also handles extra spaces):


import re

data = '" global warming " worldwide ! USA'

query = data
phrase_query = re.findall(r'"([^"]*)"', query)
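# Capture any whitespace after "!" together with the word, so that
# "!" + q below reproduces the original text exactly (stripped later).
not_query = re.findall(r'!(\s*\w+)', query)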

and_query = data[:]

for q in phrase_query:
    complete_text = '"' + q + '"'
    and_query = and_query.replace(complete_text, "")
for q in not_query:
    complete_text = "!" + q
    and_query = and_query.replace(complete_text, "")

and_query = [answer.strip() for answer in and_query.split()]
phrase_query = [answer.strip() for answer in phrase_query]
not_query = [answer.strip() for answer in not_query]


print(and_query)
print(phrase_query)
print(not_query)
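If you would rather avoid the remove-then-split step, a single regular expression with three alternatives can classify each token in one pass. This is only a sketch under my own assumptions: `TOKEN` and `tokenize_query` are hypothetical names, and an and term is taken to be any run of characters that are not whitespace, a double quote, or an exclamation mark.

```python
import re

# Alternatives are tried left to right at each position:
#   1. a quoted phrase, 2. a "!"-prefixed word, 3. a plain term.
TOKEN = re.compile(r'"([^"]*)"|!\s*(\w+)|([^\s"!]+)')

def tokenize_query(query):
    """Return (phrase_terms, and_terms, not_terms) in a single pass."""
    phrases, and_terms, not_terms = [], [], []
    for match in TOKEN.finditer(query):
        phrase, not_term, and_term = match.groups()
        if phrase is not None:
            phrases.append(phrase.strip())
        elif not_term is not None:
            not_terms.append(not_term)
        else:
            and_terms.append(and_term)
    return phrases, and_terms, not_terms

print(tokenize_query('climate "sea level" !oil policy ! gas'))
```

Because every token is matched at its own position, a term that appears both as an and term and inside a phrase is never removed by accident, which can happen with plain string replacement.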