Looking for Regex pattern to return similar results to my current function

Asked 28/3, 2024 at 2:21 Answered 28/3, 2024 at 4:20

I have some Pascal-cased text that I'm trying to split into separate tokens/words. For example, "Hello123AIIsCool" would become ["Hello", "123", "AI", "Is", "Cool"].

Some Conditions

"Words" will always start with an upper-cased letter. E.g., "Hello"
A contiguous sequence of numbers should be left together. E.g., "123" -> ["123"], not ["1", "2", "3"]
A contiguous sequence of upper-cased letters should be kept together, except when the last letter is the start of a new word as defined in the first condition. E.g., "ABCat" -> ["AB", "Cat"], not ["ABC", "at"]
There is no guarantee that each condition will have a match in a string. E.g., "Hello", "HelloAI", "HelloAIIsCool", "Hello123", "123AI", "AIIsCool", and any other combination I haven't provided are potential candidates.

I've tried a couple regex variations. The following two attempts got me pretty close to what I want, but not quite.

Version 0

import re

def extract_v0(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]*"
    num_pattern = r"\d+"
    pattern = f"{word_pattern}|{num_pattern}"
    extracts: list[str] = re.findall(
        pattern=pattern, string=string
    )
    return extracts

string = "Hello123AIIsCool"
extract_v0(string)

['Hello', '123', 'A', 'I', 'Is', 'Cool']

Version 1

import re

def extract_v1(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]+"
    num_pattern = r"\d+"
    upper_pattern = r"[A-Z][^a-z]*"
    pattern = f"{word_pattern}|{num_pattern}|{upper_pattern}"
    extracts: list[str] = re.findall(
        pattern=pattern, string=string
    )
    return extracts

string = "Hello123AIIsCool"
extract_v1(string)

['Hello', '123', 'AII', 'Cool']

Best Option So Far

This uses a combination of regex and looping. It works, but is this the best solution? Or is there some fancy regex that can do it?

import re

def extract_v2(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]+"
    num_pattern = r"\d+"
    upper_pattern = r"[A-Z][A-Z]*"
    groups = []
    for pattern in [word_pattern, num_pattern, upper_pattern]:
        while string.strip():
            group = re.search(pattern=pattern, string=string)
            if group is not None:
                groups.append(group)
                string = string[:group.start()] + " " + string[group.end():]
            else:
                break
    
    ordered = sorted(groups, key=lambda g: g.start())
    return [grp.group() for grp in ordered]

string = "Hello123AIIsCool"
extract_v2(string)

['Hello', '123', 'AI', 'Is', 'Cool']

Bottleneck answered 28/3, 2024 at 2:21 Comment(0)

Based on your Version 1:

import re


def extract_v1(string: str) -> list[str]:
    word_pattern = r"[A-Z][a-z]+"
    num_pattern = r"\d+"
    upper_pattern = r"[A-Z]+(?![a-z])"  # Fixed
    pattern = f"{word_pattern}|{num_pattern}|{upper_pattern}"
    extracts: list[str] = re.findall(
        pattern=pattern, string=string
    )
    return extracts


string = "Hello123AIIsCool"
extract_v1(string)

Result:

['Hello', '123', 'AI', 'Is', 'Cool']

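The fixed upper_pattern matches as many uppercase letters as possible. If that run is immediately followed by a lowercase letter, the negative lookahead forces the match to back off by one letter, leaving the last capital to start the next word.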
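To see the difference in isolation, here is a small sketch that runs the old and the fixed upper_pattern on their own, without the other alternatives. The helper name compare_upper and the extra sample strings are only for illustration and are not part of the answer above.

```python
import re

OLD_UPPER = r"[A-Z][^a-z]*"     # Version 1: anything that is not lowercase, digits included
NEW_UPPER = r"[A-Z]+(?![a-z])"  # fixed: uppercase run that is not followed by a lowercase letter


def compare_upper(text: str) -> tuple[list[str], list[str]]:
    """Return (matches of the old pattern, matches of the fixed pattern)."""
    return re.findall(OLD_UPPER, text), re.findall(NEW_UPPER, text)


print(compare_upper("AIIsCool"))  # (['AII', 'C'], ['AI'])
print(compare_upper("ABCat"))     # (['ABC'], ['AB'])
print(compare_upper("AB12"))      # (['AB12'], ['AB'])
```

On its own the fixed pattern deliberately skips the capital that starts a word ("I" in "Is", "C" in "Cat"); in the full pattern that letter is picked up by word_pattern, which is tried first.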

Sumer answered 28/3, 2024 at 2:52 Comment(3)

I was so close! Thank you. +1 – Bottleneck 28/3, 2024 at 3:12

Assigning to extracts and then returning extracts seems needlessly verbose. As does pattern=pattern and string=string. – Neodarwinism 28/3, 2024 at 3:56

Assigning the return value to a named variable is likely a good habit, since it makes debugging easier, and the additional time cost is typically a few nanoseconds, which is almost negligible. I agree with you on not using those keywords though. – Sumer 28/3, 2024 at 5:23

Use re.sub and split():

import re

def pascal_case_split(identifier):
    return re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', re.sub('([0-9]+)', r' \1', identifier))).split()

a = pascal_case_split("Hello123AIIsCool")
a

['Hello', '123', 'AI', 'Is', 'Cool']
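The nested calls can be hard to read, so here is the same idea unrolled into one substitution per step, returning the intermediate strings as well. This is only a sketch; the function name pascal_case_split_steps and the tuple it returns are made up for illustration.

```python
import re


def pascal_case_split_steps(identifier: str) -> tuple[str, str, str, list[str]]:
    """Same three substitutions as above, applied innermost first."""
    step1 = re.sub(r"([0-9]+)", r" \1", identifier)  # space before each digit run
    step2 = re.sub(r"([A-Z]+)", r" \1", step1)       # space before each uppercase run
    step3 = re.sub(r"([A-Z][a-z]+)", r" \1", step2)  # space before each capitalized word
    return step1, step2, step3, step3.split()        # split() collapses repeated spaces


step1, step2, step3, tokens = pascal_case_split_steps("Hello123AIIsCool")
print(repr(step1))  # 'Hello 123AIIsCool'
print(repr(step2))  # ' Hello 123 AIIs Cool'
print(repr(step3))  # '  Hello 123 AI Is  Cool'
print(tokens)       # ['Hello', '123', 'AI', 'Is', 'Cool']
```

The last substitution is what separates "AI" from "Is": it inserts a space before any capital that is followed by lowercase letters, even inside a run of capitals.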

reference

Agora answered 28/3, 2024 at 2:55 Comment(0)

re.findall should do the trick with much less work on your part, with re.X used to allow spacing out the pattern a bit.

>>> re.findall(
...   r'( [A-Z]{2,} (?! [a-z] ) | \d+ | [A-Z] [a-z]+ )',
...   'Hello123AIIsCool',
...   re.X
... )
['Hello', '123', 'AI', 'Is', 'Cool']

Pattern	Explanation
`[A-Z]{2,} (?! [a-z] )`	Matches two or more capital letters, not followed by a lowercase letter.
`\d+`	One or more digits.
`[A-Z] [a-z]+`	A single uppercase letter followed by one or more lowercase letters.

As noted in comments, the first subpattern does not match a single capital letter. We can amend this by replacing [A-Z]{2,} with [A-Z]+ to match one or more capital letters not followed by a lowercase letter.
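A quick sketch of that amendment, comparing both versions of the pattern on a string that contains a one-letter word. The constant names are only for illustration, and the outer capturing group is dropped here because re.findall returns the whole match when the pattern has no groups.

```python
import re

ORIGINAL = r"[A-Z]{2,} (?! [a-z] ) | \d+ | [A-Z] [a-z]+"
AMENDED = r"[A-Z]+ (?! [a-z] ) | \d+ | [A-Z] [a-z]+"

text = "IsAMarkup"
original_tokens = re.findall(ORIGINAL, text, re.X)
amended_tokens = re.findall(AMENDED, text, re.X)

print(original_tokens)  # ['Is', 'Markup']  (the lone "A" is silently skipped)
print(amended_tokens)   # ['Is', 'A', 'Markup']
```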

Neodarwinism answered 28/3, 2024 at 2:56 Comment(2)

Wow. Not sure how that one got past me. Thanks. – Neodarwinism 28/3, 2024 at 3:31

A key difference between this answer and the others is the [A-Z]{2,}. If I used the example string from Hao Wu's answer, this fails to capture the single "A" in "IsAMarkup". Changing it to {1,} or + makes it work. – Bottleneck 28/3, 2024 at 3:51

You may try this regex:

[A-Z](?:[a-z]+|[A-Z]+(?![a-z]))?|\d+

See the test case

import re

pattern = r"[A-Z](?:[a-z]+|[A-Z]+(?![a-z]))?|\d+"
text = "Hello123AIIsCoolAndHTML5IsAMarkupLanguage"

print(re.findall(pattern, text))
# ['Hello', '123', 'AI', 'Is', 'Cool', 'And', 'HTML', '5', 'Is', 'A', 'Markup', 'Language']

Volgograd answered 28/3, 2024 at 3:1 Comment(3)

Thank you for your answer. Can you explain what the ?: does? And the ?! – Bottleneck 28/3, 2024 at 3:5

(?: ... ) is a non-capturing group; I use it here to group several OR alternatives without creating a back-reference. (?! ... ) is a negative lookahead: the text that follows must NOT match the pattern inside. – Volgograd 28/3, 2024 at 3:10

Linking to a pretty good explanation on non-capture groups. – Bottleneck 28/3, 2024 at 3:44
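The non-capturing group matters in practice with re.findall, which returns the group contents instead of the whole match as soon as the pattern contains a capturing group. A small sketch of the difference, using a shortened sample string chosen for illustration:

```python
import re

text = "Hello123AI"

capturing = r"[A-Z]([a-z]+|[A-Z]+(?![a-z]))?|\d+"        # plain ( ... )
non_capturing = r"[A-Z](?:[a-z]+|[A-Z]+(?![a-z]))?|\d+"  # (?: ... ) as in the answer

with_group = re.findall(capturing, text)
without_group = re.findall(non_capturing, text)

print(with_group)     # ['ello', '', 'I']  (only what the group captured)
print(without_group)  # ['Hello', '123', 'AI']
```

The empty string in the first result comes from the digit match, where the group does not take part at all.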

Seems easier to do backwards:

import re

def extract(string: str) -> list[str]:
    backwards = re.findall(r'[a-z]+[A-Z]|\d+|[A-Z]+', string[::-1])
    return [s[::-1] for s in backwards[::-1]]

string = "Hello123AIIsCool"
print(extract(string))

Output (Attempt This Online!):

['Hello', '123', 'AI', 'Is', 'Cool']
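Why this works: in the reversed string a word becomes a run of lowercase letters followed by exactly one capital, so no lookahead is needed to decide where a run of capitals ends. The sketch below exposes the intermediate values; the function name extract_backwards_steps is hypothetical and only used here.

```python
import re


def extract_backwards_steps(string: str) -> tuple[str, list[str], list[str]]:
    """Return the reversed input, the raw reversed tokens, and the final tokens."""
    reversed_text = string[::-1]
    backwards = re.findall(r"[a-z]+[A-Z]|\d+|[A-Z]+", reversed_text)
    forwards = [s[::-1] for s in backwards[::-1]]  # restore token order, then each token
    return reversed_text, backwards, forwards


reversed_text, backwards, forwards = extract_backwards_steps("Hello123AIIsCool")
print(reversed_text)  # looCsIIA321olleH
print(backwards)      # ['looC', 'sI', 'IA', '321', 'olleH']
print(forwards)       # ['Hello', '123', 'AI', 'Is', 'Cool']
```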

Mama answered 28/3, 2024 at 4:20 Comment(0)
