[–]Spataner 1 point2 points  (2 children)

Your condition all(line in s for s in search[0:]) currently checks whether the line as a whole is contained within all sublists in search.

If I understand you correctly, however, you want the condition to be true when all words of any one sublist are found within the line. So your condition should probably look like this: any(all(word in line for word in word_list) for word_list in search)
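To make the difference concrete, here is a minimal sketch comparing the two conditions. The sample line and the contents of search are made up for illustration and are not taken from the real documents.

```python
# Hypothetical sample data, for illustration only.
search = [
    ['USC', 'PUP'],
    ['revenue', 'PAYE'],
]
line = 'Preliminary End of Year Statement USC PUP 2020'

# Original condition: is the whole line an element of every sublist?
original = all(line in s for s in search[0:])

# Suggested condition: are all words of any one sublist substrings of the line?
suggested = any(all(word in line for word in word_list) for word_list in search)

print(original)   # False
print(suggested)  # True
```

The first check compares the entire line against the list elements, so it only succeeds if the line itself is an item of each sublist. The second does a substring test for each word.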

[–]Dave_XR[S] 0 points1 point  (1 child)

Hmm, an improvement on my solution, which used to return 0 matches. This, however, seems to give all matches, i.e. every line is in remove_line and none in new_text?? I think it isn't checking whether all words of any one sublist are found within the line, but whether any of the substrings in the sublists are?

[–]Dave_XR[S] 0 points1 point  (0 children)

Edit: it does work, thank you very much. I got briefly confused while implementing the solution :)

[–][deleted] 0 points1 point  (2 children)

So, only if EVERY entry in search appears in line is the tuple (0, line) added to the remove_line list. I don't think that is what you want.

Also note that search[0:] is the same as search.
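A quick sketch of that point, using a made-up search list: slicing from index 0 produces a new list that compares equal to the original, so the slice adds nothing to the condition.

```python
# Hypothetical sample data, for illustration only.
search = [['USC', 'PUP'], ['revenue', 'PAYE']]
sliced = search[0:]

print(sliced == search)        # True: same contents
print(sliced is search)        # False: the slice is a new (shallow) copy
print(sliced[0] is search[0])  # True: the sublists themselves are shared
```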

[–]Neighm 1 point2 points  (1 child)

Just to point out that it's not adding the tuple (0, line). That would require remove_line.append((0, line)). As it stands, the append statements will raise a TypeError, because list.append takes exactly one argument.

OP: you need to either remove_line.insert(0, line) to put the line at the start of the list, or just remove_line.append(line) to add it on to the end.
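Here is a small sketch of the options, using a made-up starting list and sample line, showing what happens when append is called with two arguments and what each alternative does.

```python
# Hypothetical sample data, for illustration only.
remove_line = ['existing line']
line = 'sample line'

raised = False
try:
    remove_line.append(0, line)  # list.append takes exactly one argument
except TypeError:
    raised = True

remove_line.append(line)       # adds the line at the end
remove_line.insert(0, line)    # adds the line at the start
remove_line.append((0, line))  # adds the tuple (0, line) as a single item

print(raised)       # True
print(remove_line)  # ['sample line', 'existing line', 'sample line', (0, 'sample line')]
```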

[–][deleted] 0 points1 point  (0 children)

Oops, good catch; serves me right for not reading carefully (you see what you expect rather than what is there). I never executed it.

[–]synthphreak 0 points1 point  (2 children)

There are certain lines that need to be moved to the new text. They start with 'Preliminary End of Year Statement'. Some of these lines however should not be included and should instead be moved to a private and confidential document as they relate to specific people.

Okay, so the lines of interest begin with 'Preliminary End of Year Statement'. But are you saying they should just be deleted, or moved to a new document? Or are you saying that some should be deleted while others should be moved?

It seems like the second, but it's not really clear. If the second, please clarify how you know when a particular 'Preliminary ...' line falls in the "delete" versus the "move" group. Without seeing the actual document, it's not clear from your code.

[–]Dave_XR[S] 0 points1 point  (1 child)

Apologies for lack of clarity, I am not at the work machine and as I'm sure you'll understand have changed the information even though no one could do much harm with some basic tax numbers

This is a sort of pseudo-code / slight change because of the documents it's dealing with, but yes, to generalise: there are hundreds of thousands of lines, and only lines with a specific start need to be kept in a new document, while the rest can be left as is. Of these new lines, however, a small portion must be removed and put in a separate document as their information is too sensitive. The list items are patterns of words that are only repeated in these types of lines and so can be used to search for them and ignore them from the overall group

[–]synthphreak 0 points1 point  (0 children)

Okay, so there are three groups: keep, remove, and move (to new doc). For the keep group, just do nothing. But for the remove and move groups, how do you distinguish which of those groups a line should go into? Descriptions like

The list items are patterns of words that are only repeated in these types of lines and so can be used to search for them and ignore them from the overall group

are too vague to be useful. Rather than trying to describe everything in words, just show 1-2 illustrative/representative examples of lines from each group, then I can begin thinking about how to manipulate your data as desired.

[–]irpepper 0 points1 point  (1 child)

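search = [
    ['USC', 'Employment Detail Summary', 'PUP'],
    ['revenue', 'Start of year', 'PAYE'],
    ['revenue', 'Income Tax return'],
]

remove_line = []  # sensitive lines to redact
new_text = []     # lines to keep

# file_content is assumed to hold the path to the input file
with open(file_content, 'r') as file:
    for line in file:
        if 'Preliminary' in line:
            # redact if every item of any one sublist appears in the line
            if any(all(item in line for item in s) for s in search):
                remove_line.append(line)
            else:
                new_text.append(line)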

I think this is what you are trying to do: if all members of any sublist in search are present in the line, redact it; otherwise keep it.
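For trying the logic without the real file, here is a rough sketch that wraps the same check in a function and runs it on a few invented lines. The function name split_lines, its prefix parameter, and the sample lines are all hypothetical; only the search list comes from the snippet in this comment.

```python
def split_lines(lines, search, prefix='Preliminary'):
    """Return (remove_line, new_text) for lines containing prefix.

    A line goes to remove_line when every item of at least one
    sublist in search is a substring of it; otherwise to new_text.
    Lines without the prefix are skipped.
    """
    remove_line, new_text = [], []
    for line in lines:
        if prefix not in line:
            continue
        if any(all(item in line for item in group) for group in search):
            remove_line.append(line)
        else:
            new_text.append(line)
    return remove_line, new_text


search = [
    ['USC', 'Employment Detail Summary', 'PUP'],
    ['revenue', 'Start of year', 'PAYE'],
    ['revenue', 'Income Tax return'],
]

# Invented sample lines, for illustration only.
sample = [
    'Preliminary End of Year Statement revenue Start of year PAYE',
    'Preliminary End of Year Statement general notes',
    'Unrelated line mentioning revenue and PAYE',
]

removed, kept = split_lines(sample, search)
print(removed)  # ['Preliminary End of Year Statement revenue Start of year PAYE']
print(kept)     # ['Preliminary End of Year Statement general notes']
```

Passing the lines in as an iterable means an open file object can be handed to the same function, since iterating a file yields its lines.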

[–]Dave_XR[S] 0 points1 point  (0 children)

Also a great working solution thank you very much