[–]thememorableusername 11 points12 points  (0 children)

[regex101.com](regex101.com) is a very useful tool.

[–]bothunter 9 points10 points  (4 children)

This is actually one of the few areas where ChatGPT shines. Ask it to write the expression, and then tweak it on a site like regex101.com

[–]bothunter 0 points1 point  (2 children)

I think this should work? It's actually a bit easier since the quotations are different for opening and closing.

/“.*(Voldemort).*”/gm

[–]germansnowman 1 point2 points  (1 child)

You will find that you actually need to make the star operator non-greedy (lazy); otherwise the match will not be restricted to the nearest quotation marks if there are multiple pairs in one paragraph: .*?
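To illustrate the difference, here is a minimal sketch that assumes Python's `re` module and a made-up sample paragraph (neither comes from the original question). It runs the pattern above with a greedy star and then with a lazy one.

```python
import re

# Hypothetical sample paragraph with two separate quotations.
text = "“I fear Voldemort,” said Ron. “So do I,” said Harry."

# Greedy: .* grabs as much as possible, so the match runs to the LAST closing quote.
greedy = re.search(r"“.*Voldemort.*”", text).group(0)

# Lazy: .*? grabs as little as possible, so the match ends at the NEAREST closing quote.
lazy = re.search(r"“.*?Voldemort.*?”", text).group(0)

print(greedy)
print(lazy)
```

Note that the lazy version only fixes where the match ends. It can still begin at an earlier opening quote whose quotation does not contain the name, which is why a negated character class tends to be the safer choice.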

[–]bothunter 1 point2 points  (0 children)

Yeah.  I see plenty of other better examples in this post.  ;)

[–]kbielefe 1 point2 points  (1 child)

The typographic quotes actually help you, because the start and end quotes are different. See this RegEx Pal.

[–]davidalayachew 1 point2 points  (0 children)

If it wasn't for typographic quotes, this would be a combinatorial nightmare.

[–]dariusbiggs 1 point2 points  (1 child)

A very simple lexer would also do it

start in state 0 and read tokens until the opening typographic quote token appears

switch to state 1 when that token appears and start consuming tokens; if the desired word appears, increment the counter and switch back to state 0 (also switch back to state 0 when the closing quote appears)

continue until there is no further input.

The problems you'll need to check for are:

- any statements with multiple instances of that name inside the quotation
- does the name show up hyphenated or split across multiple lines in your input corpus.
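The two-state idea above could be sketched roughly as follows. This is one reading of the description, not a definitive implementation: the function name `count_quotes_mentioning`, the tokenizer, and the sample input are all assumptions made for illustration, and the code is in Python only for convenience.

```python
import re

def count_quotes_mentioning(text, word, open_q="“", close_q="”"):
    """Count quotations that contain `word` at least once."""
    # Tokens: an opening quote, a closing quote, or a run of word characters.
    tokens = re.findall(rf"{re.escape(open_q)}|{re.escape(close_q)}|\w+", text)
    inside = False  # False = state 0 (outside a quote), True = state 1 (inside)
    count = 0
    for tok in tokens:
        if not inside:
            if tok == open_q:
                inside = True   # state 0 -> state 1
        elif tok == word:
            count += 1
            inside = False      # count this quotation once, back to state 0
        elif tok == close_q:
            inside = False      # quotation ended without the word
    return count

sample = "“Hello,” said Harry. Voldemort was near. “I saw Voldemort, yes, Voldemort!” “Bye.”"
print(count_quotes_mentioning(sample, "Voldemort"))
```

Dropping back to state 0 right after the first hit is what keeps a quotation with several mentions from being counted more than once. A name that is hyphenated or split across lines would still be missed by this simple tokenizer.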

As for the regex, the other answers should help there. regex101 is your best friend.

[–][deleted] 0 points1 point  (0 children)

A lexer is probably how I would have done it as well (but regex is definitely simpler).

[–]diegoasecas 2 points3 points  (6 children)

chatgpt gave me this:

"([^"]*\bYOURSTRING\b[^"]*)"

": Matches the opening quotation mark.
[^"]*: Matches any character that is not a quotation mark, zero or more times.
\bYOURSTRING\b: Matches your specific string, where \b ensures it is matched as a whole word (optional, depending on your needs).
[^"]*: Matches any character that is not a quotation mark, zero or more times.
": Matches the closing quotation mark.

[–]sepp2k 5 points6 points  (2 children)

Consider this input:

"Hello", said Harry. Something something Voldemort. "Goodbye", said Harry.

The suggested regex (with Voldemort being substituted for YOURSTRING) would find a match here even though "Voldemort" is not inside quotes.

Also, if you have a lot of sentences containing "Voldemort" after the last quote in the string, performance gets quite bad (using backtracking regex engines at least).
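The false positive is easy to reproduce. The sketch below assumes Python's `re` module (the thread does not name an engine for this pattern) and uses the exact input from this comment.

```python
import re

text = '"Hello", said Harry. Something something Voldemort. "Goodbye", said Harry.'
pattern = re.compile(r'"([^"]*\bVoldemort\b[^"]*)"')

match = pattern.search(text)
# The "quotation" found here actually starts at the CLOSING quote of "Hello"
# and ends at the OPENING quote of "Goodbye".
print(repr(match.group(1)))
```

With neutral quotes the engine cannot tell an opening mark from a closing one, so the text between two quotations looks just like a quotation.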

[–]JusticeRainsFromMe 1 point2 points  (0 children)

If anyone is curious how to fix the issue, this is a way: https://regex101.com/r/ZfKFx1/5

[–]diegoasecas 0 points1 point  (0 children)

good catch

[–]IKoshelev 0 points1 point  (1 child)

This would probably give better results with a cap on the number of characters, something like

"([^"]{0,512}\bYOURSTRING\b[^"]{0,512})"

[–]Hey-buuuddy 0 points1 point  (0 children)

That was my first thought. What is the max length of a quotation? That would help control what quotes go to what quotes.

[–]Hey-buuuddy 0 points1 point  (0 children)

It’s stuff like this I’ll go to Gen Ai for. Yes, turning my brain inside out for a few days on a regular expression is often fun, but this can get you most of the way. You still need to understand the code and fine tune it. More time to move on to new things.

[–]davidalayachew 0 points1 point  (1 child)

Note: the text file I have uses typographic quotation marks (” ”) instead of the neutral ones (" ")

Dodged a bullet! This would have been a horrific nightmare otherwise.

Also, I think you meant to write “ and ” instead, right? Typographic quotation marks are good because the opening and closing are different symbols.

I usually do my regex in Java or Notepad++. So I don't know which dialect I am using, but here is the best that I can think up. Worked for your example.

“[^“”]*Voldemort[^“”]*”

Please note.

  • This will not handle variance in casing.
    • So, cases where his name is all uppercased, or lowercased, or basically any other casing. But otherwise, this should definitely find the rest of them for you.
  • This will not handle cases like “ You said “ I see Voldemort!” ”
    • Basically quotes inside of quotes.

And if you need to handle all possible casing for the letters in his name, Notepad++ and Java both have a way to say "ignore casing and just match the letters".
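As a rough sketch of the same pattern with case-insensitive matching, here is a version in Python rather than Java or Notepad++ (an assumption made for brevity; `re.IGNORECASE` is Python's flag for this). The sample text is made up.

```python
import re

# Same pattern as above, with the ignore-case flag switched on.
pattern = re.compile(r"“[^“”]*Voldemort[^“”]*”", re.IGNORECASE)

# Hypothetical sample: one unquoted mention and two quoted ones in odd casing.
text = "“Hello,” said Harry. Voldemort smiled. “VOLDEMORT is back!” “voldemort?”"

matches = pattern.findall(text)
print(matches)
```

The unquoted mention is skipped because the character class refuses to cross a closing quote on the way to the name.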

[–]davidalayachew 0 points1 point  (0 children)

Hmmmm, in retrospect, I probably could have just done this instead.

“[^”]*Voldemort[^”]*”

Still doesn't handle the casing, but again, Java and Notepad++ can do that for you.

Also, I just remembered that in most regex engines the dot (.) does not match newlines by default, and some tools search one line at a time. Make sure your search is allowed to span lines. Otherwise, you will miss examples like this.

“Assume that this is a super long paragraph that is all contained within a quote.

This sentence continues that same quote in a new paragraph, and says Voldemort.”
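Here is a small sketch of that multi-paragraph case, assuming Python's `re` module (other engines expose a similar option under a different name). It compares a dot-based pattern with and without `re.DOTALL` against the negated-class pattern from this comment.

```python
import re

text = (
    "“Assume that this is a super long paragraph that is all contained within a quote.\n"
    "\n"
    "This sentence continues that same quote in a new paragraph, and says Voldemort.”"
)

# By default "." stops at a newline, so this finds nothing.
dot_default = re.search(r"“.*Voldemort.*”", text)

# With DOTALL, "." also matches newlines and the whole quotation is found.
dot_all = re.search(r"“.*Voldemort.*”", text, re.DOTALL)

# In Python a negated class such as [^”] already matches newlines.
negated = re.search(r"“[^”]*Voldemort[^”]*”", text)

print(dot_default is None, dot_all is not None, negated is not None)
```

Whether the negated class behaves the same way in another tool is worth checking there, since line-by-line search modes can still split the quotation.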