RegEx to split camelCase or TitleCase (advanced)
B

11

96

I found a brilliant RegEx to extract the parts of a camelCase or TitleCase expression.

 (?<!^)(?=[A-Z])

It works as expected:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value

For example with Java:

String s = "loremIpsum";
String[] words = s.split("(?<!^)(?=[A-Z])");
// words is now {"lorem", "Ipsum"}

My problem is that it does not work in some cases:

  • Case 1: VALUE -> V / A / L / U / E
  • Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext

To my mind, the result should be:

  • Case 1: VALUE
  • Case 2: eclipse / RCP / Ext

In other words, given n uppercase chars:

  • if the n chars are followed by lower case chars, the groups should be: (n-1 chars) / (n-th char + lower chars)
  • if the n chars are at the end, the group should be: (n chars).
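
The behaviour described above can be reproduced with a short sketch. The class and method names (NaiveSplitDemo, split) are made up for illustration; only the regex and the use of String.split come from the question.

```java
import java.util.Arrays;

// Hypothetical demo class; only the regex itself comes from the question.
class NaiveSplitDemo {

    // Splits before every uppercase letter except one at the start of the string.
    static String[] split(String s) {
        return s.split("(?<!^)(?=[A-Z])");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"camelValue", "VALUE", "eclipseRCPExt"}) {
            System.out.println(s + " -> " + Arrays.toString(split(s)));
        }
    }
}
```

Because every uppercase letter is a split point, a run of capitals is broken into single letters, which is the unwanted result in Case 1 and Case 2.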

Any idea on how to improve this regex?

Burack answered 29/9, 2011 at 7:36 Comment(2)
Seems that you probably would need a conditional modifier on the ^ and another conditional case for capital letters in the negative lookbehind. Haven't tested for sure, but I think that'd be your best bet for fixing the problem.Quaver
If anybody is examiningMalo
G
122

The following regex works for all of the above examples:

public static void main(String[] args)
{
    for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
        System.out.println(w);
    }
}   

It works by forcing the negative lookbehind to not only ignore matches at the start of the string, but to also ignore matches where a capital letter is preceded by another capital letter. This handles cases like "VALUE".

The first part of the regex on its own fails on "eclipseRCPExt" by failing to split between "RCP" and "Ext". This is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every capital letter that is followed by a lowercase letter, except at the start of the string.
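
To see what each clause contributes, here is a minimal sketch that runs the first clause alone and then both clauses together. The class, constant and method names are hypothetical; the two patterns are the ones quoted in this answer.

```java
import java.util.Arrays;

// Hypothetical demo class; the two patterns are the clauses from the answer.
class TwoClauseDemo {

    // First clause: split before a capital that is neither at the start
    // of the string nor preceded by another capital.
    static final String FIRST = "(?<!(^|[A-Z]))(?=[A-Z])";

    // Second clause: split before a capital followed by a lowercase letter,
    // except at the start of the string.
    static final String SECOND = "(?<!^)(?=[A-Z][a-z])";

    static String[] firstOnly(String s) {
        return s.split(FIRST);
    }

    static String[] both(String s) {
        return s.split(FIRST + "|" + SECOND);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(firstOnly("eclipseRCPExt")));
        System.out.println(Arrays.toString(both("eclipseRCPExt")));
        System.out.println(Arrays.toString(both("VALUE")));
    }
}
```

With the first clause only, "RCPExt" stays in one piece; adding the second clause introduces the split before "Ext" while still leaving "VALUE" whole.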

Gorrono answered 29/9, 2011 at 7:45 Comment(6)
this one does not work on PHP, while @ridgerunner's does. On PHP it says "lookbehind assertion is not fixed length at offset 13".Purulence
@Igoru: Regex flavours vary. The question is about Java, not PHP, and so is the answer.Gorrono
while the question is tagged as "java" the question is still generic - besides code samples (that could never be generic). So, if there's a simpler version of this regex and that also works cross-language, I thought someone should point that :)Purulence
@Igoru: The "generic regex" is an imaginary concept.Artilleryman
why, @CasimiretHippolyte? Aren't regex a resource that every language can use, at some degree? I have never seen a custom implementation of regex besides this one, as usually in opensource you can share a regex between platforms. There's no such a thing as j-regex, although there are extended regex, basic regex and perl regex. Those are standard flavours, and as such Java is just being weird with their home-made version.Purulence
@igorsantos07: No, built-in regex implementations vary wildly between platforms. Some are trying to be Perl-like, some are trying to be POSIX-like, and some are something in between or completely different.Donitadonjon
T
101

It seems you are making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter:

(?<=[a-z])(?=[A-Z])

Here is how this regex splits your example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCPExt

The only difference from your desired output is with the eclipseRCPExt, which I would argue is correctly split here.

Addendum - Improved version

Note: This answer recently got an upvote and I realized that there is a better way...

By adding a second alternative to the above regex, all of the OP's test cases are correctly split.

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

Here is how the improved regex splits the example data:

  • value -> value
  • camelValue -> camel / Value
  • TitleValue -> Title / Value
  • VALUE -> VALUE
  • eclipseRCPExt -> eclipse / RCP / Ext

Edit:20130824 Added improved version to handle RCPExt -> RCP / Ext case.
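
As a rough Java sketch of both versions: the class name and the toConstantName helper are hypothetical and only illustrate one possible use of the split parts (building an upper-case, underscore-separated name); the two patterns are the ones given in this answer.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class; the patterns are the simple and improved regexes above.
class ConstantNameDemo {

    static final String SIMPLE = "(?<=[a-z])(?=[A-Z])";
    static final String IMPROVED = "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    // Hypothetical helper: joins the parts with underscores and upper-cases them.
    static String toConstantName(String s) {
        return String.join("_", s.split(IMPROVED)).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("eclipseRCPExt".split(SIMPLE)));
        System.out.println(Arrays.toString("eclipseRCPExt".split(IMPROVED)));
        System.out.println(toConstantName("eclipseRCPExt"));
    }
}
```

Both alternatives use fixed-length lookbehinds, which is likely why this pattern is reported to work in flavours that reject variable-length lookbehind.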

Titus answered 29/9, 2011 at 15:27 Comment(4)
Thanks for your input. I need to separate RCP and Ext in this example, because I convert the parts into a constant name (Style guideline: "all uppercase using underscore to separate words.") In this case, I prefer ECLIPSE_RCP_EXT to ECLIPSE_RCPEXT.Burack
Thanks for the help; I have modified your regex to add a couple of options to care for digits in the string: (?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?<=[0-9])(?=[A-Z][a-z])|(?<=[a-zA-Z])(?=[0-9])Conservatory
This is the best answer! Simple and clear. However this answer and the original RegEx by the OP do not work for Javascript & Golang!Dean
not work for meIgorot
F
38

Another solution would be to use a dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase

Frankish answered 29/9, 2011 at 18:56 Comment(2)
This is the easiest solutionLeeann
This answer should've had more upvotes.Stonwin
O
11

I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own that I've tested and seems to do exactly what you're looking for:

((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))

and here's an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
;   (^[a-z]+)                       Match against any lower-case letters at the start of the string.
;   ([A-Z]{1}[a-z]+)                Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
newString := Trim(newString)

Here I'm separating each word with a space, so here are some examples of how the string is transformed:

  • ThisIsATitleCASEString => This Is A Title CASE String
  • andThisOneIsCamelCASE => and This One Is Camel CASE
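
Since the question is about Java, here is a tentative Java port of the same idea. It assumes that Matcher.replaceAll followed by trim is an acceptable stand-in for the RegExReplace and Trim calls above; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical Java port; the original answer uses RegExReplace and Trim.
class SpacedWordsDemo {

    // Same pattern as in the answer, unchanged.
    static final Pattern WORD = Pattern.compile(
            "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))");

    // Appends a space after every matched word, then trims the trailing one.
    static String spaced(String s) {
        return WORD.matcher(s).replaceAll("$1 ").trim();
    }

    public static void main(String[] args) {
        System.out.println(spaced("ThisIsATitleCASEString"));
        System.out.println(spaced("andThisOneIsCamelCASE"));
    }
}
```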

This solution above does what the original post asks for, but I also needed a regex to find camel and pascal strings that included numbers, so I also came up with this variation to include numbers:

((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))

and an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
;   (^[a-z]+)                               Match against any lower-case letters at the start of the string.
;   ([0-9]+)                                Match against one or more consecutive numbers (anywhere in the string, including at the start).
;   ([A-Z]{1}[a-z]+)                        Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)|([0-9])))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
newString := Trim(newString)

And here are some examples of how a string with numbers is transformed with this regex:

  • myVariable123 => my Variable 123
  • my2Variables => my 2 Variables
  • The3rdVariableIsHere => The 3 rdVariable Is Here
  • 12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too
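
The number-aware variant can be sketched in Java in two ways: replacing each match with itself plus a space (as above), or collecting the matches into a list. Both methods and the class name are hypothetical. Note that the two approaches differ on text the pattern never matches, such as the "rd" in The3rdVariableIsHere: replacement leaves it in place, while collecting matches skips it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java port of the number-aware pattern from the answer.
class NumberAwareDemo {

    static final Pattern WORD = Pattern.compile(
            "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))");

    // Keeps unmatched characters and puts a space after each matched word.
    static String spaced(String s) {
        return WORD.matcher(s).replaceAll("$1 ").trim();
    }

    // Collects only the matched words; unmatched characters are dropped.
    static List<String> words(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(spaced("The3rdVariableIsHere"));
        System.out.println(words("The3rdVariableIsHere"));
    }
}
```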
Omari answered 11/3, 2012 at 6:40 Comment(2)
Too many unnecessary capturing groups. You could have written it as: (^[a-z]+|[A-Z][a-z]+|[A-Z]+(?=[A-Z][a-z]|$)) for the first one, and (^[a-z]+|[0-9]+|[A-Z][a-z]+|[A-Z]+(?=[A-Z][a-z]|$|[0-9])) for the second one. The outer most can also be removed, but the syntax to refer to the whole match is not portable between languages ($0 and $& are 2 possibilities).Broad
The same simplified regexp: ([A-Z]?[a-z]+)|([A-Z]+(?=[A-Z][a-z]))Flourish
H
6

To handle more letters than just A-Z:

s.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})");

Either:

  • Split after any lowercase letter that is followed by an uppercase letter.

E.g. parseXML -> parse, XML.

or

  • Split after any letter that is followed by an uppercase letter and then a lowercase letter.

E.g. XMLParser -> XML, Parser.


In more readable form:

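// JUnit 4 imports are assumed here; the original does not say which JUnit version is used.
import static java.util.stream.Collectors.joining;
import static org.junit.Assert.assertEquals;

import java.util.regex.Pattern;
import org.junit.Test;

public class SplitCamelCaseTest {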

    static String BETWEEN_LOWER_AND_UPPER = "(?<=\\p{Ll})(?=\\p{Lu})";
    static String BEFORE_UPPER_AND_LOWER = "(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";

    static Pattern SPLIT_CAMEL_CASE = Pattern.compile(
        BETWEEN_LOWER_AND_UPPER +"|"+ BEFORE_UPPER_AND_LOWER
    );

    public static String splitCamelCase(String s) {
        return SPLIT_CAMEL_CASE.splitAsStream(s)
                        .collect(joining(" "));
    }

    @Test
    public void testSplitCamelCase() {
        assertEquals("Camel Case", splitCamelCase("CamelCase"));
        assertEquals("lorem Ipsum", splitCamelCase("loremIpsum"));
        assertEquals("XML Parser", splitCamelCase("XMLParser"));
        assertEquals("eclipse RCP Ext", splitCamelCase("eclipseRCPExt"));
        assertEquals("VALUE", splitCamelCase("VALUE"));
    }    
}
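
To illustrate why the Unicode property classes matter, here is a small sketch comparing the pattern above with an ASCII-only pattern of the same shape on a word containing non-ASCII letters. The class name, method names and the sample word are assumptions chosen for illustration; the sample is written with Unicode escapes so the source stays pure ASCII.

```java
import java.util.Arrays;

// Hypothetical comparison of an ASCII-only pattern and the Unicode-aware one.
class UnicodeSplitDemo {

    static final String ASCII = "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";
    static final String UNICODE = "(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";

    static String[] asciiSplit(String s) {
        return s.split(ASCII);
    }

    static String[] unicodeSplit(String s) {
        return s.split(UNICODE);
    }

    public static void main(String[] args) {
        // Sample word with o-umlaut, sharp s and capital A-umlaut.
        String sample = "gr\u00F6\u00DFe\u00C4nderung";
        System.out.println(Arrays.toString(asciiSplit(sample)));
        System.out.println(Arrays.toString(unicodeSplit(sample)));
    }
}
```

The ASCII version finds no split point because the capital letter is outside A-Z, whereas the \p{Lu} version splits before it.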
Hourigan answered 11/2, 2013 at 16:9 Comment(0)
P
4

Brief

Both top answers here provide code using lookbehinds, which are not supported by all regex flavours. The regex below will capture both PascalCase and camelCase and can be used in multiple languages.

Note: I do realize this question is regarding Java, however, I also see multiple mentions of this post in other questions tagged for different languages, as well as some comments on this question for the same.

Code


([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)

Results

Sample Input

eclipseRCPExt

SomethingIsWrittenHere

TEXTIsWrittenHERE

VALUE

loremIpsum

Sample Output

eclipse
RCP
Ext

Something
Is
Written
Here

TEXT
Is
Written
HERE

VALUE

lorem
Ipsum

Explanation

  • Match one or more uppercase alpha character [A-Z]+
  • Or match zero or one uppercase alpha character [A-Z]?, followed by one or more lowercase alpha characters [a-z]+
  • Ensure what follows is an uppercase alpha character [A-Z] or a word boundary \b
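
Because this pattern matches the words rather than the gaps between them, it is used with a find loop instead of split. A minimal Java sketch, with a hypothetical class and method name, might look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the one given in this answer.
class MatchWordsDemo {

    static final Pattern WORD = Pattern.compile("([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\\b)");

    // Collects every match of the pattern, in order of appearance.
    static List<String> words(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("eclipseRCPExt"));
        System.out.println(words("TEXTIsWrittenHERE"));
    }
}
```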
Palstave answered 25/9, 2017 at 15:54 Comment(0)
S
4

You can use StringUtils.splitByCharacterTypeCamelCase("loremIpsum") from Apache Commons Lang.

Scrophulariaceous answered 13/3, 2020 at 14:15 Comment(1)
This is the easiest solutionLeeann
S
0

You can use the expression below for Java:

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|(?=[A-Z][a-z])|(?<=\\d)(?=\\D)|(?=\\d)(?<=\\D)
Stilbestrol answered 10/7, 2016 at 23:31 Comment(1)
Hi Maicon, welcome to StackOverflow and thank you for your answer. While this may answer the question, it doesn't provide any explanation for others to learn how it solves the problem. Could you edit your answer to include an explanation of your code? Thank you!Decadence
U
0

Instead of looking for separators that aren't there, you might also consider finding the name components (those are certainly there):

String test = "_eclipse福福RCPExt";

Pattern componentPattern = Pattern.compile("_? (\\p{Upper}?\\p{Lower}+ | (?:\\p{Upper}(?!\\p{Lower}))+ \\p{Digit}*)", Pattern.COMMENTS);

Matcher componentMatcher = componentPattern.matcher(test);
List<String> components = new LinkedList<>();
int endOfLastMatch = 0;
while (componentMatcher.find()) {
    // matches should be consecutive
    if (componentMatcher.start() != endOfLastMatch) {
        // do something horrible if you don't want garbage in between

        // we're lenient though, any Chinese characters are lucky and get through as group
        String startOrInBetween = test.substring(endOfLastMatch, componentMatcher.start());
        components.add(startOrInBetween);
    }
    components.add(componentMatcher.group(1));
    endOfLastMatch = componentMatcher.end();
}

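if (endOfLastMatch != test.length()) {
    // find() has already failed here, so calling componentMatcher.start() would throw;
    // take everything from the end of the last match to the end of the string instead
    String end = test.substring(endOfLastMatch);
    components.add(end);
}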

System.out.println(components);

This outputs [eclipse, 福福, RCP, Ext]. Conversion to an array is of course simple.

Ukase answered 3/6, 2017 at 16:16 Comment(0)
O
0

I can confirm that the regex string ([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b) given by ctwheels, above, works with the Microsoft flavour of regex.

I would also like to suggest the following alternative, based on ctwheels' regex, which handles numeric characters: ([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\b).

This is able to split strings such as:

DrivingB2BTradeIn2019Onwards

to

Driving B2B Trade In 2019 Onwards
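
For Java readers, a rough sketch of the digit-aware variant follows. The answer itself only confirms the original pattern on the Microsoft flavour, so running it through java.util.regex, as well as the class and method names, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the digit-aware variant suggested above.
class DigitWordsDemo {

    static final Pattern WORD = Pattern.compile("([A-Z0-9]+|[A-Z]?[a-z]+)(?=[A-Z0-9]|\\b)");

    // Collects the matched words and joins them with single spaces.
    static String spaced(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        System.out.println(spaced("DrivingB2BTradeIn2019Onwards"));
    }
}
```

Digits are treated like uppercase letters here, so "B2B" stays together and "2019" becomes its own word.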

Onesided answered 2/1, 2019 at 16:31 Comment(0)
E
-1

A JavaScript Solution

/**
 * howToDoThis ===> ["", "how", "To", "Do", "This"]
 * @param word word to be split
 */
export const splitCamelCaseWords = (word: string) => {
    if (typeof word !== 'string') return [];
    return word.replace(/([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)/g, '!$&').split('!');
};
Enid answered 28/7, 2020 at 4:38 Comment(2)
They ask for a JavaScript solution. And why are you giving the same solution twice? If you think that those questions are identical, vote to close one as a duplicate.Noll
I was curious to try this on strings containing numbers and it seems to treat it as part of the previous strings. It doesn't seem to work well on this example: 'DrivingB2BTradeIn2019Onwards' would return ["", "DrivingB2", "B", "TradeIn2019", "Onwards"]Barbarize
