I have a sentence that has a specific format:
<subject> <action> <object> @ <price> ... // The sentence can continue
I want to extract these values out of the sentence.
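Constraints:
- Subject: Bob or Alice
- Action: bought or sold
- Object: an invalid value such as 4apples should return NULL
- Text may come before the subject, but it is guaranteed to not contain Bob/Alice
- Price: follows the @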
Example:
Hi there, Bob sold apples @2.0 dollars each
Desired Output:
Subject: Bob
Action: sold
Object: apples
Price: 2.0
Currently, I do it the naive way by:
#!/usr/bin/env python3
sentence = "Hi there, alice sold apples @2.0 dollars each"
sentence = sentence.lower()
if 'alice' in sentence or 'bob' in sentence:
    s_list = sentence.split(" ")
    s_idx = -1
    if 'bob' in sentence:
        s_idx = s_list.index('bob')
    elif 'alice' in sentence:
        s_idx = s_list.index('alice')
    if s_idx > -1:
        Subject = s_list[s_idx]
        Action = s_list[s_idx+1]
        Object = s_list[s_idx+2]  # more if/else to validate Object constraints
        Price = s_list[s_idx+3]  # more if/else to extract 2.0 if we get @2.0
        print("Subject: {}, Action: {}, Object: {}, Price: {}".format(Subject, Action, Object, Price))
How can I do this better, possibly using re?
You could use a regex with a named capturing group for each element:
import re
sentence = "Hi there, alice sold apples @2.0 dollars each"
# Raw string so that \s, \d and \. reach the regex engine unchanged
values = re.search(r'(?P<subject>bob|alice)\s+(?P<action>bought|sold)\s+(?P<object>[A-Za-z]{1,7})\s+@\s*(?P<price>\d+(?:\.\d+)?)', sentence)
if values:
    Subject = values['subject']
    Action = values['action']
    Object = values['object']
    Price = values['price']
    print("Subject: {}, Action: {}, Object: {}, Price: {}".format(Subject, Action, Object, Price))
This will output
Subject: alice, Action: sold, Object: apples, Price: 2.0
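Since the one-line pattern is dense, here is a sketch of the same regex written with re.VERBOSE, so each named group can carry its own comment. The name PATTERN and the use of groupdict() are illustrative choices, not part of the code above.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
# In verbose mode, whitespace in the pattern is ignored and '#' starts a comment.
PATTERN = re.compile(r"""
    (?P<subject>bob|alice)      # subject: one of the two allowed names
    \s+
    (?P<action>bought|sold)     # action: one of the two allowed verbs
    \s+
    (?P<object>[A-Za-z]{1,7})   # object: 1 to 7 letters
    \s+@\s*                     # whitespace, '@', optional whitespace
    (?P<price>\d+(?:\.\d+)?)    # price: integer or decimal number
""", re.VERBOSE)

match = PATTERN.search("Hi there, alice sold apples @2.0 dollars each")
# groupdict() returns all named groups as a dict keyed by group name
fields = match.groupdict() if match else None
print(fields)
# {'subject': 'alice', 'action': 'sold', 'object': 'apples', 'price': '2.0'}
```

Note that the price is still a string at this point; convert it with float() if you need a number.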
Note you may want to supply the re.I flag to re.search to allow for bob or Bob (or Sold or sold, etc.) to be matched; in that case you could replace [A-Za-z] in the object capture group with [a-z].
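As a rough sketch of the case-insensitive variant, the pattern can be compiled once with re.I and wrapped in a small function that returns None (Python's closest equivalent to NULL) when the sentence does not match. The function name extract and the conversion of the price to float are assumptions made for illustration.

```python
import re

# Compiled once with re.I, so [a-z] also matches uppercase letters.
PATTERN = re.compile(
    r"(?P<subject>bob|alice)\s+(?P<action>bought|sold)\s+"
    r"(?P<object>[a-z]{1,7})\s+@\s*(?P<price>\d+(?:\.\d+)?)",
    re.I,
)

def extract(sentence):
    """Hypothetical helper: return the four fields as a dict, or None if no match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    fields = match.groupdict()
    fields["price"] = float(fields["price"])  # assumption: a numeric price is wanted
    return fields

print(extract("Hi there, Bob sold apples @2.0 dollars each"))
# {'subject': 'Bob', 'action': 'sold', 'object': 'apples', 'price': 2.0}
print(extract("Hi there, Bob sold 4apples @2.0 dollars each"))
# None
```

The matched text keeps its original casing (Bob, not bob); call .lower() on the values if you need them normalized.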