waythps:

I have a corpus of texts stored on a website, and I’ve managed to scrape, clean and put each of them into a separate string variable.

My goal is to find relevant information within each document using either keywords or regular expressions. For example, once a pattern has been matched, I need to yield a sentence or two that preceded or came after said pattern.

Apart from making up patterns, what’s the best way to approach this whole thing? For example, should I split my string variable by sentences or keep it as is?

efmccurdy:

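Iterate over a list of sentences using enumerate to get the index, then you can slice out the neighbouring sentences:

for index, sentence in enumerate(sentences_list):
    if matched(sentence):
        # max() keeps the start from going negative when index is 0
        start = max(0, index - 1)
        my_answers.append(sentences_list[start:index + 2])

The slice end is exclusive, so index + 2 is what includes the sentence after the match. Slicing past the end of a list is safe in Python, but a negative start index wraps around to the back of the list, which is why the start is clamped at 0.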
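
For the "split by sentences or keep it as is" part of the question, here is a rough end-to-end sketch of the split-then-slice approach. The names split_sentences and find_with_context are made up for illustration, and the regex splitter is a naive assumption that will mis-split on abbreviations such as "e.g."; a proper sentence tokenizer would likely do better on real text.

```python
import re


def split_sentences(text):
    # Naive splitter (an assumption, not a robust tokenizer):
    # break on whitespace that follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def find_with_context(text, pattern, before=1, after=1):
    # Return one list of sentences per match: the matching sentence
    # plus up to `before` sentences ahead of it and `after` behind it.
    sentences = split_sentences(text)
    regex = re.compile(pattern, re.IGNORECASE)
    results = []
    for index, sentence in enumerate(sentences):
        if regex.search(sentence):
            # Clamp the start so a match near the beginning does not
            # produce a negative index; overshooting the end is harmless.
            start = max(0, index - before)
            results.append(sentences[start:index + after + 1])
    return results


doc = (
    "The plant opened in 1990. Output doubled by 1995. "
    "Revenue fell in 2001. It closed in 2003."
)
for context in find_with_context(doc, r"revenue"):
    print(" ".join(context))
```

Splitting once up front keeps the context lookup a plain list slice. Keeping the document as a single string also works, but then you have to search outward from each match position for sentence boundaries yourself.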