Is there a regex to grab all quotation marks?
I know that in regex there is \s to match any whitespace (spaces, tabs, ...), \d for any digit, etc.

Is there the same shortcut to match all different quotation marks: ' " “ ” ‘ ’ „ ” « »?

And more on Wikipedia ...

I can write my own regex, but I will probably miss some quotation marks from other languages, so I like to have a generic way to match all the quotation marks.

But maybe they are considered different characters, so that it is impossible?

Bloated answered 13/9, 2017 at 7:38 Comment(1)
If copy-paste doesn't work, you can use Unicode escapes to match them. Liebknecht
Q
1

Is there the same shortcut to match all different quotation marks

There is no such short-cut, in Java ... or (AFAIK) in any other dialect of regexes.

I can write my own regex, but I will probably miss some quotation marks from other languages, so I like to have a generic way to match all the quotation marks.

Unfortunately, there is no Unicode character class that consists of all "quotation" characters.

And there is no simple / guaranteed heuristic based on character names either.
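As an illustration of why no existing class is enough, here is a minimal sketch using the closest candidates I know of in Java: the Unicode general categories Pi (initial quote punctuation) and Pf (final quote punctuation). The class and method names are hypothetical. The categories are only an approximation: they leave out marks such as „ (U+201E, which is classified as open punctuation) and may also include a few characters that are not quotation marks.

```java
import java.util.regex.Pattern;

final class QuoteCategories {
    // ASCII quotes plus the Unicode categories Pi (initial quote) and Pf (final quote).
    // Assumption: this is an approximation, not a complete set of quotation marks.
    static final Pattern APPROX_QUOTE = Pattern.compile("[\"'\\p{Pi}\\p{Pf}]");

    // Returns true if the whole string is a single character from the class above.
    static boolean isApproxQuote(String s) {
        return APPROX_QUOTE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isApproxQuote("\u201C")); // LEFT DOUBLE QUOTATION MARK, category Pi
        System.out.println(isApproxQuote("\u00BB")); // RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK, category Pf
        System.out.println(isApproxQuote("\u201E")); // DOUBLE LOW-9 QUOTATION MARK, category Ps
    }
}
```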

Quill answered 13/9, 2017 at 8:14 Comment(0)
C
5

You can use this regex:

['"“”‘’„”«»]

See the regex101 demo.
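In Java, the same character class could be used roughly as follows. This is a sketch, not part of the original answer: the class name `ExplicitQuotes` and the `normalize` helper are made up for illustration, and the non-ASCII marks are written as Unicode escapes so the pattern does not depend on the source file's encoding.

```java
import java.util.regex.Pattern;

final class ExplicitQuotes {
    // The marks from the character class above: " ' “ ” ‘ ’ „ « »
    static final Pattern QUOTES = Pattern.compile(
            "[\"'\u201C\u201D\u2018\u2019\u201E\u00AB\u00BB]");

    // Replaces every listed quotation mark with a plain ASCII double quote.
    static String normalize(String text) {
        return QUOTES.matcher(text).replaceAll("\"");
    }

    public static void main(String[] args) {
        System.out.println(normalize("\u00ABbonjour\u00BB and \u201Ehallo\u201C"));
    }
}
```

Any mark that is not in the list passes through unchanged, which is the limitation the other answers point out.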

Consumptive answered 13/9, 2017 at 7:40 Comment(1)
As noted in other answers, this is not a complete set of all Unicode quotation mark symbols.Quill
S
3

Java's Unicode support is very detailed and even classifies punctuation. However, it has no single category for quotes, and some quotes are typed as neither initial nor final quote punctuation. But you can collect them by name and generate code from the result. Advantage: completeness.

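    public class QuotationMarks {
        public static void main(String[] args) {
            // Scan the Basic Multilingual Plane for characters whose name contains "QUOTATION".
            for (int cp = 32; cp <= 0xFFFF; ++cp) {
                String name = Character.getName(cp); // null for unassigned code points
                if (name != null && name.contains("QUOTATION")) {
                    // Print the escape, the name, and whether the type is initial or final quote.
                    System.out.printf("\\u%04x = %s (%s %s)%n",
                            cp, name,
                            Character.getType(cp) == Character.INITIAL_QUOTE_PUNCTUATION,
                            Character.getType(cp) == Character.FINAL_QUOTE_PUNCTUATION);
                }
            }
        }
    }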

This relies on every code point up to U+FFFF fitting in a single char. Because the loop stops at U+FFFF, it will not work for Asian scripts with characters beyond that range. This results in:

\u0022 = QUOTATION MARK (false false)
\u00ab = LEFT-POINTING DOUBLE ANGLE QUOTATION MARK (true false)
\u00bb = RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK (false true)
\u2018 = LEFT SINGLE QUOTATION MARK (true false)
\u2019 = RIGHT SINGLE QUOTATION MARK (false true)
\u201a = SINGLE LOW-9 QUOTATION MARK (false false)
\u201b = SINGLE HIGH-REVERSED-9 QUOTATION MARK (true false)
\u201c = LEFT DOUBLE QUOTATION MARK (true false)
\u201d = RIGHT DOUBLE QUOTATION MARK (false true)
\u201e = DOUBLE LOW-9 QUOTATION MARK (false false)
\u201f = DOUBLE HIGH-REVERSED-9 QUOTATION MARK (true false)
\u2039 = SINGLE LEFT-POINTING ANGLE QUOTATION MARK (true false)
\u203a = SINGLE RIGHT-POINTING ANGLE QUOTATION MARK (false true)
\u275b = HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT (false false)
\u275c = HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT (false false)
\u275d = HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT (false false)
\u275e = HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT (false false)
\u275f = HEAVY LOW SINGLE COMMA QUOTATION MARK ORNAMENT (false false)
\u2760 = HEAVY LOW DOUBLE COMMA QUOTATION MARK ORNAMENT (false false)
\u276e = HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT (false false)
\u276f = HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT (false false)
\u301d = REVERSED DOUBLE PRIME QUOTATION MARK (false false)
\u301e = DOUBLE PRIME QUOTATION MARK (false false)
\u301f = LOW DOUBLE PRIME QUOTATION MARK (false false)
\uff02 = FULLWIDTH QUOTATION MARK (false false)
Silvana answered 13/9, 2017 at 8:17 Comment(2)
It also fails because there are a number of BMP characters (i.e. at or below U+FFFF) that are quotes in Korean, Chinese, Japanese, etc., but don't contain the word "quotation" in their official names. For example: U+3008 to U+300F, U+FE41 to U+FE44. Read the Wikipedia page: en.wikipedia.org/wiki/Quotation_mark. Quill
@StephenC Again I learned something. I already expected as much, as the list did not contain BMP quotes of Asian/Arabic/Hebrew scripts. Thanks for the link. Silvana
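Putting the answer and the comments together, one possible way to generate a pattern is sketched below. It scans all code points up to `Character.MAX_CODE_POINT` instead of stopping at U+FFFF, then appends the apostrophe and the two ranges mentioned in the comment. The class name `QuoteClassBuilder` is hypothetical, the result depends on the Unicode version of the JDK in use, and the set is still not guaranteed to be complete.

```java
import java.util.regex.Pattern;

final class QuoteClassBuilder {
    // Builds a character class from every code point whose Unicode name
    // contains "QUOTATION", plus a few additions that the name scan misses.
    static Pattern build() {
        StringBuilder cls = new StringBuilder("[");
        for (int cp = 0x20; cp <= Character.MAX_CODE_POINT; cp++) {
            String name = Character.getName(cp); // null for unassigned code points
            if (name != null && name.contains("QUOTATION")) {
                cls.append(String.format("\\x{%X}", cp));
            }
        }
        // Assumption: the apostrophe and the ranges U+3008..U+300F and
        // U+FE41..U+FE44 are added by hand, following the comment above.
        cls.append("'\\x{3008}-\\x{300F}\\x{FE41}-\\x{FE44}]");
        return Pattern.compile(cls.toString());
    }

    public static void main(String[] args) {
        Pattern quotes = build();
        System.out.println(quotes.matcher("\u300Chello\u300D").replaceAll(""));
    }
}
```

Building the pattern once and reusing it avoids repeating the scan on every match.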
C
0

Approach:

If you are not sure you know all the quotation marks, you can write a regex for what you need instead of for the quotation marks. Otherwise, list all the quotation marks you expect in a character class like this: ['"“”‘’„”«»]
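One reading of the first suggestion is to match the wanted text directly, so the quotation marks never have to be enumerated. The sketch below assumes the wanted text is runs of letters and digits; the class name `WordExtractor` and that assumption are mine, not the answer's.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordExtractor {
    // Matches runs of Unicode letters and digits; quotation marks of any
    // kind are skipped without being listed.
    static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Collects every run of letters and digits found in the text.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("\u201EHallo\u201C \u00ABmonde\u00BB"));
    }
}
```

The trade-off is that this also drops other punctuation, so it only fits when the surrounding characters are not needed.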

Christal answered 13/9, 2017 at 7:48 Comment(0)
