
Phrase matching in lists


Assuming I have a list representing a sentence, e.g.:

sent = ['terras', 'ipsius', 'Azar', 'vocatas', 'Ta', 'Xellule', 'et', 'Ginen', 'Chagem', 'in', 'contrata', 'Deyr', 'Issafisaf']

and a list of place names

places = ['Ta Xellule', 'Ginen Chagem', 'Deyr Issafisaf']

how can I end up with:

[('O', 'terras'), ('O', 'ipsius'), ('O', 'Azar'), ('O', 'vocatas'), ('PLACE', 'Ta'), ('PLACE', 'Xellule'), ('O', 'et'), ('PLACE', 'Ginen'), ('PLACE', 'Chagem'), ('O', 'in'), ('O', 'contrata'), ('PLACE', 'Deyr'), ('PLACE', 'Issafisaf')]

A quick note:

For example, Ta should be tagged only when it is immediately followed by Xellule. If it appears in another context in a sentence, it should not be tagged as PLACE. E.g., in Ta Buni mar Ta Xellule... only the second Ta should be tagged.

This is an example of my place list:

 'Ras il Huichile',
 'Ras il Hued',
 'Ta Richardu',
 'Roma',
 'Russilion',
 'La Rukiha',
 'Irrukiha ta il Bayada',
 'Casalis Milleri',
 'Ta Sabat',
 'Casalis Zebug',
 'Ta Zagra',
 'Sagra in  Ras il Hued',
 'Ta Isalme'

and this is an example sentence:

terras ipsius Azar vocatas Ta Xellule et Ginen Chagem in contrata Deyr Issafisaf cum iuribus suis omnibus

Here, in should not be tagged as PLACE, even though it appears in the place name Sagra in Ras il Hued.


Solution

  • ok, I updated my answer based on your edit:

    from functools import reduce
    
    sent = "terras ipsius Azar vocatas Ta Ta Zagra Ta Zagra Xellule et Ginen Chagem in contrata Deyr Issafisaf cum iuribus suis omnibus"
    places = [ 'Ras il Huichile', 'Ras il Hued', 'Ta Richardu', 'Roma', 'Russilion', 'La Rukiha', 'Irrukiha ta il Bayada',
    'Casalis Milleri', 'Ta Sabat', 'Casalis Zebug', 'Ta Zagra', 'Sagra in  Ras il Hued', 'Ta Isalme', 'Ta Xellule', 'Ginen Chagem',
    'Deyr Issafisaf']
    
    places_map = {p:[('PLACE', l) for l in p.split()] for p in places}
    
    def find_places(sent, places):
        if not places:  # base case: no place names left, so tag every word as 'O'
            return [('O', l) for l in sent.split()]
    
        place = places[0]
        remaining_places = places[1:]
    
        sent_splits = sent.split(place)
        return reduce(lambda a,b:a+places_map[place]+b, [find_places(s, remaining_places) for s in sent_splits])
    
    print(find_places(sent, places))
    

    and the output is:

    [('O', 'terras'), ('O', 'ipsius'), ('O', 'Azar'), ('O', 'vocatas'), ('O', 'Ta'), ('PLACE', 'Ta'), ('PLACE', 'Zagra'), ('PLACE', 'Ta'), ('PLACE', 'Zagra'), ('O', 'Xellule'), ('O', 'et'), ('PLACE', 'Ginen'), ('PLACE', 'Chagem'), ('O', 'in'), ('O', 'contrata'), ('PLACE', 'Deyr'), ('PLACE', 'Issafisaf'), ('O', 'cum'), ('O', 'iuribus'), ('O', 'suis'), ('O', 'omnibus')]
    

    So I used a recursive method: find a place in the sentence, convert it to the format you want, repeat this on the remaining parts of the sentence with the remaining places, and finally join the results together.
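
One thing to watch with the string-based approach is that `str.split` matches substrings, so a short place name could also match inside a longer word. As a possible alternative, here is a minimal sketch that works directly on the token list from the question and compares whole tokens only. The function name `tag_places` and the longest-match-first rule are my own assumptions, not something the question specifies.

```python
def tag_places(tokens, places):
    """Tag each token as 'PLACE' or 'O' by matching whole-token phrases.

    Hypothetical helper: the name and the longest-match-first policy
    are illustrative assumptions.
    """
    # Turn each place name into a tuple of tokens. str.split() with no
    # argument also collapses repeated spaces inside a place name.
    phrases = {tuple(p.split()) for p in places}
    phrases.discard(())  # ignore empty place names
    max_len = max((len(p) for p in phrases), default=0)

    tagged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible window first so longer names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            window = tokens[i:i + n]
            if tuple(window) in phrases:
                tagged.extend(('PLACE', t) for t in window)
                i += n
                break
        else:
            # No place name starts at this position.
            tagged.append(('O', tokens[i]))
            i += 1
    return tagged


sent = ['terras', 'ipsius', 'Azar', 'vocatas', 'Ta', 'Xellule', 'et',
        'Ginen', 'Chagem', 'in', 'contrata', 'Deyr', 'Issafisaf']
places = ['Ta Xellule', 'Ginen Chagem', 'Deyr Issafisaf']
print(tag_places(sent, places))
```

Because a token is tagged only when the full phrase matches at that position, a lone Ta or in stays 'O' even if it appears inside some multi-word place name. If you start from a plain string, calling `.split()` on it first gives the token list this sketch expects.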