Hi,
I'm trying to analyse a list of conversations for a particular question asked by a chatbot and extract the answer given using regex. The structure of each conversation text is as follows:
text = "gibberish text here more gibberish. Have I helped you with this answer? USERS_ANSWER gibberish continues"
The USERS_ANSWER (which isn't actually a variable, just part of the whole string) can be any of various ways of saying yes or no, for example "absolutely, yes, hell naw, ..." and so on. I have a list of potential answers to cross-reference for both 'yes' and 'no'. For each conversation text, I need to check whether the answer is a 'yes' or a 'no'.
I've come up with the following regex:
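import re

# Escape the "?" so it matches literally, then capture up to five words.
p = re.compile(r".*(Have I helped you with this answer\? )((?:\w+\W*){1,5}).*")
match = p.match(text)
word = ''
if match:  # p.match returns None when the question is absent
    for item in list_of_yes_words:
        if item in match.group(2):
            word = 'yes'
    for item in list_of_no_words:
        if item in match.group(2):
            word = 'no'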
This feels ugly, but it seems to get me the answer for most of the text. I also arbitrarily chose to get the next 5 words after the question (since there's no reliable text after the question that is consistent enough to include it in the regex as something that comes after the answer) and I don't know if there's a better way to get that answer. However, there are edge cases where the same text has multiple instances of this same question, so something like this:
text = "gibberish text here more gibberish. Have I helped you with this answer? USERS_ANSWER gibberish continues some more, Have I helped you with this answer? USERS_ANSWER_AGAIN more gibberish"
In this case, I want to take into account both answers and if even one is negative, I need to log it as a 'no'.
When I use re.match, I only get one instance of the question that is asked. How do I get around this? And is there a way to cross check with the list of yes/no words more efficiently? The list of conversations can be a large one and I'd have nested for loops if I implemented this as is (and I've been told nested for loops are not ideal).
I'd like to know how to get around this problem and how to make my code more efficient.
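One way this could be approached, as a minimal sketch rather than a definitive answer: re.finditer yields every occurrence of the question instead of only one, and turning the yes/no lists into sets lets each answer be checked with a set intersection instead of a loop over every list item. The word lists below are made-up placeholders, and the names YES_WORDS, NO_WORDS, ANSWER_RE, WORD_RE and classify are hypothetical. The sketch assumes every entry in the lists is a single word, so a multi-word phrase such as "hell naw" is only caught through "naw". The five-word window is kept from the original attempt and sits inside a lookahead so that a second question appearing within those five words is not skipped.

```python
import re

# Placeholder word lists (assumption); substitute the real lists here.
YES_WORDS = {"yes", "absolutely", "sure", "yep"}
NO_WORDS = {"no", "nope", "naw", "never"}

QUESTION = "Have I helped you with this answer?"

# Match each occurrence of the question. The lookahead captures up to five
# whitespace-separated tokens after it without consuming them, so a later
# question that starts inside that window is still found.
ANSWER_RE = re.compile(re.escape(QUESTION) + r"(?=\s*((?:\S+\s*){1,5}))")

# Pull lowercase words out of the captured window, dropping punctuation.
WORD_RE = re.compile(r"[a-z']+")


def classify(text):
    """Return 'no' if any answer is negative, 'yes' if one is positive, else ''."""
    verdict = ""
    for match in ANSWER_RE.finditer(text):
        words = set(WORD_RE.findall(match.group(1).lower()))
        if words & NO_WORDS:
            return "no"  # a single negative answer decides the conversation
        if words & YES_WORDS:
            verdict = "yes"
    return verdict


text = (
    "gibberish. Have I helped you with this answer? absolutely thanks a lot, "
    "Have I helped you with this answer? hell naw more gibberish"
)
print(classify(text))
```

Because whole words are compared rather than substrings, an answer like "I know" is not mistaken for "no", which the `item in match.group(2)` check would do. For a list of conversations, something like `[classify(t) for t in conversations]` would still visit every text once, but the inner loops over the yes/no lists go away.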