Regular expression to match non-ASCII characters?

Asked 29/9, 2008 at 18:37 Answered 20/11, 2023 at 16:38

317

What is the easiest way to match non-ASCII characters in a regex? I would like to match all words individually in an input string, but the language may not be English, so I will need to match things like ü, ö, ß, and ñ. Also, this is in Javascript/jQuery, so any solution will need to apply to that.

Hamster answered 29/9, 2008 at 18:37 Comment(1)

Do you want to match all letters or all characters? For instance do you want to include punctuation, digits, whitespace, and arbitrary signs and symbols? Saying you want to match all words makes it sound like you only want non-English letters and not all non-English characters as your question title states. – Weird 23/2, 2013 at 0:26

333

This should do it:

[^\x00-\x7F]+

It matches any character which is not contained in the ASCII character set (0-127, i.e. 0x0 to 0x7F).

You can do the same thing with Unicode:

[^\u0000-\u007F]+

For unicode you can look at this 2 resources:

Code charts list of Unicode ranges
This tool to create a regex filtered by Unicode block.

Lorenza answered 29/9, 2008 at 18:45 Comment(15)

What is with characters outside that range? There are more than 0x80 letters. – Ani 29/9, 2008 at 18:49

The circumflex negates the characterclass. – Stamey 29/9, 2008 at 19:18

Then it's still wrong in that it matches any "non-english" letters (and maybe other characters, but I don't know the full Unicode tables), but the question author needs to match all letters, even though the question starts with only the non-english characters. You know, because then it's complete. – Ani 29/9, 2008 at 21:24

This doesn't answer this question about matching words with non-english characters... – Zarathustra 17/1, 2010 at 15:59

There is no such thing as “a character whose ASCII code is greater than 128”!!! – Quiberon 9/11, 2010 at 18:13

This was very helpful. I also had to add some special characters that were allowable in my case. My regex ended up like this, if it helps anyone: [^\\x00-\\x80®¢™’‘¾⅛–£€“”°•…³±—α]+ – Total 9/2, 2012 at 22:19

It actually works pretty well in a real life scenario, but if you want to be more precise, this may help: en.wikipedia.org/wiki/… – Leban 5/1, 2013 at 19:27

I ended up defining [\u00BF-\u1FFF\u2C00-\uD7FF\w] as a letter. – Leban 5/1, 2013 at 20:43

@MarkusvonBroady thanks - that is working for all of my requirements tests (mostly European characters) - is there a way you can tell me exactly which characters are included in those ranges? – Pontiac 30/3, 2013 at 1:21

@Pontiac First a small fix: [\u00C0-\u1FFF\u2C00-\uD7FF\w] (without inverted question mark ¿), as for ranges, refer to BMP. 00C0 is À in Latin-1 Supplement, 1FFF is last character of Greek Extended, 2C00 is first letter in Glagolitic, and D7FF is last character in Hangul Jamo Extended-B. So it's everything except: symbols and special chars on 2 first blocks; symbols in middle blocks; surrogates, priv area and special chars in end blocks. – Leban 30/3, 2013 at 7:38

Thanks, didn't knew the unicode stuff. I ended up with the following regex to support non-english names: ^[\u00C0-\u017E a-zA-Z\']+$. – Pulitzer 29/1, 2014 at 9:57

This answer is incorrect. ASCII doesn’t include U+0080 PADDING CHARACTER*. If it would, then ASCII would consist of 129 characters instead of 128. – Linguist 7/4, 2014 at 16:57

You might want to use this tool for converting into most of codes r12a.github.io/apps/conversion – Flickertail 4/12, 2016 at 15:21

@Markus von Broady What is [\u00C0-\u1FFF\u2C00-\uD7FF\w] used for? – Pearl 3/8, 2021 at 4:46

@Pearl I described it in my last comment here, but let me try to visualize: i.imgur.com/GyKV73X.png - again, refer to the BMP. I marked green everything this formula includes. If it includes only a part of a block named in BMP, then I mark other parts red for clarity. Also keep in mind green/red parts probably aren't to scale. – Leban 3/8, 2021 at 8:27

195

var words_in_text = function (text) {
    var regex = /([\u0041-\u005A\u0061-\u007A\u00AA\u00B5\u00BA\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0370-\u0374\u0376\u0377\u037A-\u037D\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u0527\u0531-\u0556\u0559\u0561-\u0587\u05D0-\u05EA\u05F0-\u05F2\u0620-\u064A\u066E\u066F\u0671-\u06D3\u06D5\u06E5\u06E6\u06EE\u06EF\u06FA-\u06FC\u06FF\u0710\u0712-\u072F\u074D-\u07A5\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0815\u081A\u0824\u0828\u0840-\u0858\u08A0\u08A2-\u08AC\u0904-\u0939\u093D\u0950\u0958-\u0961\u0971-\u0977\u0979-\u097F\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD\u09CE\u09DC\u09DD\u09DF-\u09E1\u09F0\u09F1\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A59-\u0A5C\u0A5E\u0A72-\u0A74\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD\u0AD0\u0AE0\u0AE1\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D\u0B5C\u0B5D\u0B5F-\u0B61\u0B71\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BD0\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C33\u0C35-\u0C39\u0C3D\u0C58\u0C59\u0C60\u0C61\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD\u0CDE\u0CE0\u0CE1\u0CF1\u0CF2\u0D05-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D\u0D4E\u0D60\u0D61\u0D7A-\u0D7F\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6\u0E01-\u0E30\u0E32\u0E33\u0E40-\u0E46\u0E81\u0E82\u0E84\u0E87\u0E88\u0E8A\u0E8D\u0E94-\u0E97\u0E99-\u0E9F\u0EA1-\u0EA3\u0EA5\u0EA7\u0EAA\u0EAB\u0EAD-\u0EB0\u0EB2\u0EB3\u0EBD\u0EC0-\u0EC4\u0EC6\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F88-\u0F8C\u1000-\u102A\u103F\u1050-\u1055\u105A-\u105D\u1061\u1065\u1066\u106E-\u1070\u1075-\u1081\u108E\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\u1310\u1312-\u1315\u1318-\u135A\u1380-\u138F\u13A0-\u13F4\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u1700-\u170C\u170E-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176C\u176E-\u1770\u1780-\u17B3\u17D7\u17DC\u1820-\u1877\u1880-\u18A8\u18AA\u18B0-\u18F5\u1900-\u191C\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19C1-\u19C7\u1A00-\u1A16\u1A20-\u1A54\u1AA7\u1B05-\u1B33\u1B45-\u1B4B\u1B83-\u1BA0\u1BAE\u1BAF\u1BBA-\u1BE5\u1C00-\u1C23\u1C4D-\u1C4F\u1C5A-\u1C7D\u1CE9-\u1CEC\u1CEE-\u1CF1\u1CF5\u1CF6\u1D00-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC\u2071\u207F\u2090-\u209C\u2102\u2107\u210A-\u2113\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u212F-\u2139\u213C-\u213F\u2145-\u2149\u214E\u2183\u2184\u2C00-\u2C2E\u2C30-\u2C5E\u2C60-\u2CE4\u2CEB-\u2CEE\u2CF2\u2CF3\u2D00-\u2D25\u2D27\u2D2D\u2D30-\u2D67\u2D6F\u2D80-\u2D96\u2DA0-\u2DA6\u2DA8-\u2DAE\u2DB0-\u2DB6\u2DB8-\u2DBE\u2DC0-\u2DC6\u2DC8-\u2DCE\u2DD0-\u2DD6\u2DD8-\u2DDE\u2E2F\u3005\u3006\u3031-\u3035\u303B\u303C\u3041-\u3096\u309D-\u309F\u30A1-\u30FA\u30FC-\u30FF\u3105-\u312D\u3131-\u318E\u31A0-\u31BA\u31F0-\u31FF\u3400-\u4DB5\u4E00-\u9FCC\uA000-\uA48C\uA4D0-\uA4FD\uA500-\uA60C\uA610-\uA61F\uA62A\uA62B\uA640-\uA66E\uA67F-\uA697\uA6A0-\uA6E5\uA717-\uA71F\uA722-\uA788\uA78B-\uA78E\uA790-\uA793\uA7A0-\uA7AA\uA7F8-\uA801\uA803-\uA805\uA807-\uA80A\uA80C-\uA822\uA840-\uA873\uA882-\uA8B3\uA8F2-\uA8F7\uA8FB\uA90A-\uA925\uA930-\uA946\uA960-\uA97C\uA984-\uA9B2\uA9CF\uAA00-\uAA28\uAA40-\uAA42\uAA44-\uAA4B\uAA60-\uAA76\uAA7A\uAA80-\uAAAF\uAAB1\uAAB5\uAAB6\uAAB9-\uAABD\uAAC0\uAAC2\uAADB-\uAADD\uAAE0-\uAAEA\uAAF2-\uAAF4\uAB01-\uAB06\uAB09-\uAB0E\uAB11-\uAB16\uAB20-\uAB26\uAB28-\uAB2E\uABC0-\uABE2\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uF900-\uFA6D\uFA70-\uFAD9\uFB00-\uFB06\uFB13-\uFB17\uFB1D\uFB1F-\uFB28\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC\uFF21-\uFF3A\uFF41-\uFF5A\uFF66-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC]+)/g;
    return text.match(regex);
};

words_in_text('Düsseldorf, Köln, Москва, 北京市, إسرائيل !@#$');

// returns array ["Düsseldorf", "Köln", "Москва", "北京市", "إسرائيل"]

This regex will match all words in the text of any language...

Schriever answered 27/2, 2014 at 16:53 Comment(12)

Why are you so sure that it matches all words in text? Any source? – Sharolynsharon 27/10, 2015 at 10:21

Hi ! How do you add multilines search with unicode charracter ranges ? – Provence 21/3, 2016 at 17:6

Ok, for those who need to match latin european languages in multiline texts, this might work using latin charset, latien extned A and latin extended B. Note this pattren excludes some punctuations or special chars like ;!<>&#$ /^[\u0061-\u007a \u0020 \r\n \u0027-\u002F \u0030-\u003A \u003F \u0041-\u005A \u0061-\u007A \u00C0-\u00FF \u0100-\u017F \u0180-\u024F]+$/ Feel free to add any other character ranges or character classes. – Provence 21/3, 2016 at 17:48

@Sharolynsharon Here is some source to chek to: fileformat.info/info/unicode/block/index.htm – Shalondashalt 23/8, 2016 at 0:32

From there you can found fileformat.info/info/unicode/category/Lu/list.htm, fileformat.info/info/unicode/category/Ll/list.htm, fileformat.info/info/unicode/category/Lo/index.htm, … – Shalondashalt 23/8, 2016 at 0:37

Add \u0E01-\u0E3A\u0E40-\u0E4F\u0E5A\u05B to the end for Thai. – Roundelay 23/11, 2016 at 10:22

It does not match Hindi, Telugu characters – Citrate 19/12, 2016 at 8:16

It does match cyrillic characters like this "Радио 106.9 FM, Тольятти" – Exigible 18/2, 2017 at 1:14

Matches all Characters in all languages [\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF] – Masquerade 11/4, 2017 at 8:58

OMG....you were here..... :), seriously you saved my life, the whole day I have spent on the same and finally it worked for me. Thanks a ton. – Astral 22/8, 2018 at 14:1

this answer is not working. 'முருகன்' this Tamil text return error – Submediant 3/11, 2020 at 9:55

This regex is missing a few cases, including diacritical marks - check out the answer and comments by Loilo. – Conservationist 21/8, 2021 at 19:56

134

Unicode Property Escapes are among the features of ES2018.

Basic Usage

With Unicode Property Escapes, you can match a letter from any language with the following simple regular expression:

/\p{Letter}/u

Or with the shorthand, even terser:

/\p{L}/u

Matching Words

Regarding the question's concrete use case (matching words), note that you can use Unicode Property Escapes in character classes, making it easy to match letters together with other word-characters like hyphens:

/[\p{L}-]/u

Stitching it all together, you could match words of all^[1] languages with this beautifully short RegEx:

/[\p{L}-]+/ug

Example (shamelessly plugged from the answer above):

'Düsseldorf, Köln, Москва, 北京市, إسرائيل !@#$'.match(/[\p{L}-]+/ug)

// ["Düsseldorf", "Köln", "Москва", "北京市", "إسرائيل"]

^[1] Note that I'm not an expert on languages. You may still want to do your own research regarding other characters that might be parts of words besides letters and hyphens.

Browser Support

This feature is available in all major evergreen browsers.

Transpiling

If support for older browsers is needed, Unicode Property Escapes can be transpiled to ES5 with a tool called regexpu. There's an online demo available here. As you can see in the demo, you can in fact match non-latin letters today with the following (horribly long) ES5 regular expression:

/(?:[A-Za-z\xAA\xB5\xBA\xC0-\xD6\xD8-\xF6\xF8-\u02C1\u02C6-\u02D1\u02E0-\u02E4\u02EC\u02EE\u0370-\u0374\u0376\u0377\u037A-\u037D\u037F\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03F5\u03F7-\u0481\u048A-\u052F\u0531-\u0556\u0559\u0561-\u0587\u05D0-\u05EA\u05F0-\u05F2\u0620-\u064A\u066E\u066F\u0671-\u06D3\u06D5\u06E5\u06E6\u06EE\u06EF\u06FA-\u06FC\u06FF\u0710\u0712-\u072F\u074D-\u07A5\u07B1\u07CA-\u07EA\u07F4\u07F5\u07FA\u0800-\u0815\u081A\u0824\u0828\u0840-\u0858\u0860-\u086A\u08A0-\u08B4\u08B6-\u08BD\u0904-\u0939\u093D\u0950\u0958-\u0961\u0971-\u0980\u0985-\u098C\u098F\u0990\u0993-\u09A8\u09AA-\u09B0\u09B2\u09B6-\u09B9\u09BD\u09CE\u09DC\u09DD\u09DF-\u09E1\u09F0\u09F1\u09FC\u0A05-\u0A0A\u0A0F\u0A10\u0A13-\u0A28\u0A2A-\u0A30\u0A32\u0A33\u0A35\u0A36\u0A38\u0A39\u0A59-\u0A5C\u0A5E\u0A72-\u0A74\u0A85-\u0A8D\u0A8F-\u0A91\u0A93-\u0AA8\u0AAA-\u0AB0\u0AB2\u0AB3\u0AB5-\u0AB9\u0ABD\u0AD0\u0AE0\u0AE1\u0AF9\u0B05-\u0B0C\u0B0F\u0B10\u0B13-\u0B28\u0B2A-\u0B30\u0B32\u0B33\u0B35-\u0B39\u0B3D\u0B5C\u0B5D\u0B5F-\u0B61\u0B71\u0B83\u0B85-\u0B8A\u0B8E-\u0B90\u0B92-\u0B95\u0B99\u0B9A\u0B9C\u0B9E\u0B9F\u0BA3\u0BA4\u0BA8-\u0BAA\u0BAE-\u0BB9\u0BD0\u0C05-\u0C0C\u0C0E-\u0C10\u0C12-\u0C28\u0C2A-\u0C39\u0C3D\u0C58-\u0C5A\u0C60\u0C61\u0C80\u0C85-\u0C8C\u0C8E-\u0C90\u0C92-\u0CA8\u0CAA-\u0CB3\u0CB5-\u0CB9\u0CBD\u0CDE\u0CE0\u0CE1\u0CF1\u0CF2\u0D05-\u0D0C\u0D0E-\u0D10\u0D12-\u0D3A\u0D3D\u0D4E\u0D54-\u0D56\u0D5F-\u0D61\u0D7A-\u0D7F\u0D85-\u0D96\u0D9A-\u0DB1\u0DB3-\u0DBB\u0DBD\u0DC0-\u0DC6\u0E01-\u0E30\u0E32\u0E33\u0E40-\u0E46\u0E81\u0E82\u0E84\u0E87\u0E88\u0E8A\u0E8D\u0E94-\u0E97\u0E99-\u0E9F\u0EA1-\u0EA3\u0EA5\u0EA7\u0EAA\u0EAB\u0EAD-\u0EB0\u0EB2\u0EB3\u0EBD\u0EC0-\u0EC4\u0EC6\u0EDC-\u0EDF\u0F00\u0F40-\u0F47\u0F49-\u0F6C\u0F88-\u0F8C\u1000-\u102A\u103F\u1050-\u1055\u105A-\u105D\u1061\u1065\u1066\u106E-\u1070\u1075-\u1081\u108E\u10A0-\u10C5\u10C7\u10CD\u10D0-\u10FA\u10FC-\u1248\u124A-\u124D\u1250-\u1256\u1258\u125A-\u125D\u1260-\u1288\u128A-\u128D\u1290-\u12B0\u12B2-\u12B5\u12B8-\u12BE\u12C0\u12C2-\u12C5\u12C8-\u12D6\u12D8-\u1310\u1312-\u1315\u1318-\u135A\u1380-\u138F\u13A0-\u13F5\u13F8-\u13FD\u1401-\u166C\u166F-\u167F\u1681-\u169A\u16A0-\u16EA\u16F1-\u16F8\u1700-\u170C\u170E-\u1711\u1720-\u1731\u1740-\u1751\u1760-\u176C\u176E-\u1770\u1780-\u17B3\u17D7\u17DC\u1820-\u1877\u1880-\u1884\u1887-\u18A8\u18AA\u18B0-\u18F5\u1900-\u191E\u1950-\u196D\u1970-\u1974\u1980-\u19AB\u19B0-\u19C9\u1A00-\u1A16\u1A20-\u1A54\u1AA7\u1B05-\u1B33\u1B45-\u1B4B\u1B83-\u1BA0\u1BAE\u1BAF\u1BBA-\u1BE5\u1C00-\u1C23\u1C4D-\u1C4F\u1C5A-\u1C7D\u1C80-\u1C88\u1CE9-\u1CEC\u1CEE-\u1CF1\u1CF5\u1CF6\u1D00-\u1DBF\u1E00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FBC\u1FBE\u1FC2-\u1FC4\u1FC6-\u1FCC\u1FD0-\u1FD3\u1FD6-\u1FDB\u1FE0-\u1FEC\u1FF2-\u1FF4\u1FF6-\u1FFC\u2071\u207F\u2090-\u209C\u2102\u2107\u210A-\u2113\u2115\u2119-\u211D\u2124\u2126\u2128\u212A-\u212D\u212F-\u2139\u213C-\u213F\u2145-\u2149\u214E\u2183\u2184\u2C00-\u2C2E\u2C30-\u2C5E\u2C60-\u2CE4\u2CEB-\u2CEE\u2CF2\u2CF3\u2D00-\u2D25\u2D27\u2D2D\u2D30-\u2D67\u2D6F\u2D80-\u2D96\u2DA0-\u2DA6\u2DA8-\u2DAE\u2DB0-\u2DB6\u2DB8-\u2DBE\u2DC0-\u2DC6\u2DC8-\u2DCE\u2DD0-\u2DD6\u2DD8-\u2DDE\u2E2F\u3005\u3006\u3031-\u3035\u303B\u303C\u3041-\u3096\u309D-\u309F\u30A1-\u30FA\u30FC-\u30FF\u3105-\u312E\u3131-\u318E\u31A0-\u31BA\u31F0-\u31FF\u3400-\u4DB5\u4E00-\u9FEA\uA000-\uA48C\uA4D0-\uA4FD\uA500-\uA60C\uA610-\uA61F\uA62A\uA62B\uA640-\uA66E\uA67F-\uA69D\uA6A0-\uA6E5\uA717-\uA71F\uA722-\uA788\uA78B-\uA7AE\uA7B0-\uA7B7\uA7F7-\uA801\uA803-\uA805\uA807-\uA80A\uA80C-\uA822\uA840-\uA873\uA882-\uA8B3\uA8F2-\uA8F7\uA8FB\uA8FD\uA90A-\uA925\uA930-\uA946\uA960-\uA97C\uA984-\uA9B2\uA9CF\uA9E0-\uA9E4\uA9E6-\uA9EF\uA9FA-\uA9FE\uAA00-\uAA28\uAA40-\uAA42\uAA44-\uAA4B\uAA60-\uAA76\uAA7A\uAA7E-\uAAAF\uAAB1\uAAB5\uAAB6\uAAB9-\uAABD\uAAC0\uAAC2\uAADB-\uAADD\uAAE0-\uAAEA\uAAF2-\uAAF4\uAB01-\uAB06\uAB09-\uAB0E\uAB11-\uAB16\uAB20-\uAB26\uAB28-\uAB2E\uAB30-\uAB5A\uAB5C-\uAB65\uAB70-\uABE2\uAC00-\uD7A3\uD7B0-\uD7C6\uD7CB-\uD7FB\uF900-\uFA6D\uFA70-\uFAD9\uFB00-\uFB06\uFB13-\uFB17\uFB1D\uFB1F-\uFB28\uFB2A-\uFB36\uFB38-\uFB3C\uFB3E\uFB40\uFB41\uFB43\uFB44\uFB46-\uFBB1\uFBD3-\uFD3D\uFD50-\uFD8F\uFD92-\uFDC7\uFDF0-\uFDFB\uFE70-\uFE74\uFE76-\uFEFC\uFF21-\uFF3A\uFF41-\uFF5A\uFF66-\uFFBE\uFFC2-\uFFC7\uFFCA-\uFFCF\uFFD2-\uFFD7\uFFDA-\uFFDC]|\uD800[\uDC00-\uDC0B\uDC0D-\uDC26\uDC28-\uDC3A\uDC3C\uDC3D\uDC3F-\uDC4D\uDC50-\uDC5D\uDC80-\uDCFA\uDE80-\uDE9C\uDEA0-\uDED0\uDF00-\uDF1F\uDF2D-\uDF40\uDF42-\uDF49\uDF50-\uDF75\uDF80-\uDF9D\uDFA0-\uDFC3\uDFC8-\uDFCF]|\uD801[\uDC00-\uDC9D\uDCB0-\uDCD3\uDCD8-\uDCFB\uDD00-\uDD27\uDD30-\uDD63\uDE00-\uDF36\uDF40-\uDF55\uDF60-\uDF67]|\uD802[\uDC00-\uDC05\uDC08\uDC0A-\uDC35\uDC37\uDC38\uDC3C\uDC3F-\uDC55\uDC60-\uDC76\uDC80-\uDC9E\uDCE0-\uDCF2\uDCF4\uDCF5\uDD00-\uDD15\uDD20-\uDD39\uDD80-\uDDB7\uDDBE\uDDBF\uDE00\uDE10-\uDE13\uDE15-\uDE17\uDE19-\uDE33\uDE60-\uDE7C\uDE80-\uDE9C\uDEC0-\uDEC7\uDEC9-\uDEE4\uDF00-\uDF35\uDF40-\uDF55\uDF60-\uDF72\uDF80-\uDF91]|\uD803[\uDC00-\uDC48\uDC80-\uDCB2\uDCC0-\uDCF2]|\uD804[\uDC03-\uDC37\uDC83-\uDCAF\uDCD0-\uDCE8\uDD03-\uDD26\uDD50-\uDD72\uDD76\uDD83-\uDDB2\uDDC1-\uDDC4\uDDDA\uDDDC\uDE00-\uDE11\uDE13-\uDE2B\uDE80-\uDE86\uDE88\uDE8A-\uDE8D\uDE8F-\uDE9D\uDE9F-\uDEA8\uDEB0-\uDEDE\uDF05-\uDF0C\uDF0F\uDF10\uDF13-\uDF28\uDF2A-\uDF30\uDF32\uDF33\uDF35-\uDF39\uDF3D\uDF50\uDF5D-\uDF61]|\uD805[\uDC00-\uDC34\uDC47-\uDC4A\uDC80-\uDCAF\uDCC4\uDCC5\uDCC7\uDD80-\uDDAE\uDDD8-\uDDDB\uDE00-\uDE2F\uDE44\uDE80-\uDEAA\uDF00-\uDF19]|\uD806[\uDCA0-\uDCDF\uDCFF\uDE00\uDE0B-\uDE32\uDE3A\uDE50\uDE5C-\uDE83\uDE86-\uDE89\uDEC0-\uDEF8]|\uD807[\uDC00-\uDC08\uDC0A-\uDC2E\uDC40\uDC72-\uDC8F\uDD00-\uDD06\uDD08\uDD09\uDD0B-\uDD30\uDD46]|\uD808[\uDC00-\uDF99]|\uD809[\uDC80-\uDD43]|[\uD80C\uD81C-\uD820\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD80D[\uDC00-\uDC2E]|\uD811[\uDC00-\uDE46]|\uD81A[\uDC00-\uDE38\uDE40-\uDE5E\uDED0-\uDEED\uDF00-\uDF2F\uDF40-\uDF43\uDF63-\uDF77\uDF7D-\uDF8F]|\uD81B[\uDF00-\uDF44\uDF50\uDF93-\uDF9F\uDFE0\uDFE1]|\uD821[\uDC00-\uDFEC]|\uD822[\uDC00-\uDEF2]|\uD82C[\uDC00-\uDD1E\uDD70-\uDEFB]|\uD82F[\uDC00-\uDC6A\uDC70-\uDC7C\uDC80-\uDC88\uDC90-\uDC99]|\uD835[\uDC00-\uDC54\uDC56-\uDC9C\uDC9E\uDC9F\uDCA2\uDCA5\uDCA6\uDCA9-\uDCAC\uDCAE-\uDCB9\uDCBB\uDCBD-\uDCC3\uDCC5-\uDD05\uDD07-\uDD0A\uDD0D-\uDD14\uDD16-\uDD1C\uDD1E-\uDD39\uDD3B-\uDD3E\uDD40-\uDD44\uDD46\uDD4A-\uDD50\uDD52-\uDEA5\uDEA8-\uDEC0\uDEC2-\uDEDA\uDEDC-\uDEFA\uDEFC-\uDF14\uDF16-\uDF34\uDF36-\uDF4E\uDF50-\uDF6E\uDF70-\uDF88\uDF8A-\uDFA8\uDFAA-\uDFC2\uDFC4-\uDFCB]|\uD83A[\uDC00-\uDCC4\uDD00-\uDD43]|\uD83B[\uDE00-\uDE03\uDE05-\uDE1F\uDE21\uDE22\uDE24\uDE27\uDE29-\uDE32\uDE34-\uDE37\uDE39\uDE3B\uDE42\uDE47\uDE49\uDE4B\uDE4D-\uDE4F\uDE51\uDE52\uDE54\uDE57\uDE59\uDE5B\uDE5D\uDE5F\uDE61\uDE62\uDE64\uDE67-\uDE6A\uDE6C-\uDE72\uDE74-\uDE77\uDE79-\uDE7C\uDE7E\uDE80-\uDE89\uDE8B-\uDE9B\uDEA1-\uDEA3\uDEA5-\uDEA9\uDEAB-\uDEBB]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])/

If you're using Babel, there's also a regexpu-powered plugin for that (Babel v6 plugin, Babel v7 plugin).

Splenetic answered 21/2, 2018 at 9:50 Comment(14)

This is the actual solution in my case. Specifically: [\p{L}\s\d] to accept any name in any language with spaces and numbers. Ref link – Fabricate 13/6, 2018 at 9:13

Names could still include hyphens (quite common here in Germany for example). To be honest, it's probably best to not make any assumptions at all about what characters a valid name includes. – Splenetic 27/6, 2018 at 7:49

Top answer for JS. Still not supported in Edge/Firefox today, so Transpiling stil is relevant. – Dehumidifier 22/8, 2018 at 12:4

Hooray! It's about time this was supported. – Shears 18/10, 2018 at 4:30

I think this is the most useful and true answer. At the end, probably what we want is to match a word in any language using regex – Turkic 26/12, 2018 at 13:26

This is the firefox feature request: bugzilla.mozilla.org/show_bug.cgi?id=1361876 – Luellaluelle 7/8, 2019 at 15:11

Thanks for transpiled version ! – Jimjimdandy 17/1, 2020 at 19:46

@Splenetic could you please what exactly /u means ? – Vaporific 27/5, 2020 at 11:33

@MatzHeri The u flag has been introduced in ES2015 and enables various features related to Unicode (e.g. the property escapes mentioned in my answer, even though those have only been introduced a couple years later). Since the Unicode-related features slightly change the semantics of some regex patterns, this has to be explicitly enabled — by adding that u flag. – Splenetic 27/5, 2020 at 11:42

/[\p{L}-]+/ug does not match e.g. städer. For some reason it is split in two. Why does this happen? – Antithesis 17/8, 2021 at 0:31

@Antithesis This seems to be a problem with regex101. The browsers I've tested this with do match the städer string perfectly well. (Also note how regex101 splits the string into "sta" and "der", which is not correct either.) – Splenetic 18/8, 2021 at 8:44

Actually, that ä is a decomposed unicode character. The diacritic will be matched by \p{M}. Your regex could be improved to handle this. E.g. /[\p{Alpha}\p{M}-]+/ug – Antithesis 18/8, 2021 at 9:12

I recommend adding \p{M} and ' to your regex, which makes it pass my tests. I believe Go and the Python regex module from pypi both support this syntax. – Conservationist 21/8, 2021 at 19:55

I decided on ['0-9\p{L}\p{M}]. You can also add a hypen depending on your use case. – Conservationist 19/2, 2022 at 16:27

The situation with regexes, Unicode, and Javascript sucks. It's ridiculous that programmers should have to rely on external libraries to recognize that "Αλφα" is a word, or even that "é" is a letter.

But so it goes.

This guy has written a good library for handling Unicode in Javascript Regexes:

http://blog.stevenlevithan.com/archives/javascript-regex-and-unicode

The Unicode stuff is a plugin to this regex library:

http://xregexp.com/

Here's a post about the Unicode extension:

http://blog.stevenlevithan.com/archives/xregexp-unicode-plugin

And the extension page itself:

http://xregexp.com/plugins/

Great work but it still bums me out that Javascript is so backwards in this regard.

(He wrote a book for O'Reilly about the topic so it's quite possible that he knows what he's talking about.)

The way he implemented it is by adding tables of characters with certain properties. Then, when you contruct a regex with his library, \p{charclass} gets replaced with [allthecharactersintheclass].

Sheryl answered 17/5, 2009 at 0:8 Comment(2)

Twitter also has a nice text parsing library that covers lots of languages, although (obviously) very hashtag-centric: github.com/twitter/twitter-text-js – Bedstead 16/1, 2014 at 20:48

This answer is now outdated; see Loilo's answer for how to do it in native JavaScript. – Penmanship 28/8, 2021 at 1:8

The answer given by Jeremy Ruten is great, but I think it's not exactly what Paul Wicks was searching for. If I understand correctly Paul asked about expression to match non-english words like können or móc. Jeremy's regex matches only non-english letters, so there's need for small improvement:

([^\x00-\x7F]|\w)+

([^\u0000-\u007F]|\w)+

This [^\x00-\x7F] and this [^\u0000-\u007F] parts allow regullar expression to match non-english letters.

This (|) is logical or and \w is english letter, so ([^\u0000-\u007F]|\w) will match single english or non-english letter.

+ at the end of the expression means it could be repeated, so the whole expression allows all english or non-english letters to match.

Here you can test the first expression with various strings and here is the second.

Kenna answered 2/3, 2016 at 9:47 Comment(1)

The characters from 0 to 31 are control characters and should not be part of the regexp imho. – Inconstant 1/10, 2020 at 10:37

You do the same way as any other character matching, but you use \uXXXX where XXXX is the unicode number of the character.

Look at: http://unicode.org/charts/charindex.html

http://unicode.org/charts/

http://www.decodeunicode.org/

Burgoyne answered 29/9, 2008 at 18:43 Comment(1)

except for the fact that looking up each and every thing that can be called a character while not adding any special symbols, such as spaces, punctuation etc is quite a task. – Gizmo 15/1, 2018 at 8:24

All Unicode-enabled Regex flavours should have a special character class like \w that match any Unicode letter. Take a look at your specific flavour here.

Ani answered 29/9, 2008 at 18:42 Comment(4)

This is correct for most flavors of regex, but not for JavaScript, at least according to regular-expressions.info/javascript.html – Hamster 29/9, 2008 at 18:52

Bad luck then, I guess. At least you can use then use the Unicode charts posted by olle to find your characters ;) – Ani 29/9, 2008 at 18:55

I think \w is dependents on the cultural settings on the client. – Stamey 29/9, 2008 at 19:19

I don't know, but in .NET, you can always specify the culture you want. Apart from that, what is a letter and what not is defined in the Unicode standard and is not dependent on culture. – Ani 29/9, 2008 at 20:56

I had a problem with \p working as expected, so I just used a different strategy like:

([^\t]+)\t

Find anything that is not a tab character until the next tab character... obviously this depends on your search source, but you get the idea. Now I don't have to figure out what unicode characters work and don't work etc.

Engler answered 1/6, 2020 at 3:27 Comment(0)

You can always exclude the printable characters tab and space to tilde. Accents should show up. (powershell). Note there's some weird unicode exceptions where a foreign capital letter has an english small letter İ K, so a case insensitive match will give unexpected results.

'ü' | select-string '[^\t -~]' -CaseSensitive

ü

Peru answered 20/11, 2023 at 16:38 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Basic Usage

Matching Words

Browser Support

Transpiling

Recommended topics

Hot tags