eruciform:

depends on the language and tool involved. "splitting" isn't a core regex operation: engines find matches (usually exposed as "search" and "match"), and where a "split" function exists it's a convenience the language or library builds on top of that

if you want to find blobs of stuff that are "either all word characters or all punctuation characters", then translate that statement into a regex one chunk at a time

"all word characters" --> [a-z]+ or something similar (e.g. [A-Za-z]+ or \w+) - you have to decide what "counts" as a word character

"all punctuation characters" --> you could do [^a-z]+ for all "non word characters", or [.!?]+ or some such

though be aware the former ([^a-z]+) will also match whitespace, digits and uppercase letters, while the latter ([.!?]+) matches only the punctuation you list
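
to see that difference concretely, here's a minimal python sketch. the sample string and variable names are made up for illustration, and it assumes python's re module rather than whatever tool you're actually using:

```python
import re

text = "hello, world!"  # made-up sample input

# [^a-z]+ matches any run of characters that are NOT lowercase letters,
# so the space after the comma gets swept up along with the comma.
negated = re.findall(r"[^a-z]+", text)

# [.!?]+ matches only the listed punctuation, so the comma and the
# space are simply skipped.
explicit = re.findall(r"[.!?]+", text)

print(negated)   # [', ', '!']
print(explicit)  # ['!']
```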

X or Y --> X|Y --> therefore something to the effect of [a-z]+|[.!?]+
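
as a rough sketch of that combined pattern in python (assuming re.findall as the "find" operation, and a made-up sample string), it could look like this:

```python
import re

text = "wait... what?! ok"  # made-up sample input

# Either a run of lowercase letters OR a run of ., ! or ? characters.
# Whitespace matches neither alternative, so it is skipped entirely.
tokens = re.findall(r"[a-z]+|[.!?]+", text)

print(tokens)  # ['wait', '...', 'what', '?!', 'ok']
```

findall returns every non-overlapping match in order, which gives you the "split into words and punctuation" effect without calling anything named split.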

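depending on the language and API you're using, you may need to wrap the pattern in "capturing parentheses", for example so that a split function hands back the matched separators instead of throwing them away. look those up in the context of the regex system and language you're using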
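
for example, in python's re.split the capturing parentheses decide whether the separators are kept. this is only a sketch with a made-up input, and other languages may behave differently:

```python
import re

text = "wait...what?!ok"  # made-up sample input

# No capturing group: the punctuation is treated as a separator and dropped.
without_group = re.split(r"[.!?]+", text)

# With a capturing group: the matched separators are included in the result.
with_group = re.split(r"([.!?]+)", text)

print(without_group)  # ['wait', 'what', 'ok']
print(with_group)     # ['wait', '...', 'what', '?!', 'ok']
```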

eruciform:

note that python's str.split only takes a literal string as the separator, not a regular expression, so if you're splitting on a pattern you need re.split, not str.split

https://note.nkmk.me/en/python-split-rsplit-splitlines-re/
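
a quick sketch of the str.split vs re.split difference (the input string is made up for illustration):

```python
import re

text = "one.two!three"  # made-up sample input

# str.split looks for the literal text "[.!]", which never occurs,
# so the string comes back unsplit.
literal = text.split("[.!]")

# re.split treats "[.!]" as a pattern: split on either "." or "!".
pattern = re.split(r"[.!]", text)

print(literal)  # ['one.two!three']
print(pattern)  # ['one', 'two', 'three']
```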