Searching for UUIDs in text with regex

Asked 25/9, 2008 at 22:8 Answered 10/4 at 17:21

382

I'm searching for UUIDs in blocks of text using a regex. Currently I'm relying on the assumption that all UUIDs will follow a pattern of 8-4-4-4-12 hexadecimal digits.

Can anyone think of a use case where this assumption would be invalid and would cause me to miss some UUIDs?

Palter answered 25/9, 2008 at 22:8 Comment(4)

This question from 6 years ago was to help me with a project to find credit cards in a block of text. I've subsequently open sourced the code which is linked from my blog post which explains the nuance that the UUIDs were causing when searching for credit cards guyellisrocks.com/2013/11/… – Palter 17/4, 2014 at 14:15

A search for UUID regular expression pattern matching brought me to this Stack Overflow post, but the accepted answer actually isn't an answer. Additionally, the link you provided in the comment below your question also doesn't have the pattern (unless I'm missing something). Is one of these answers something you ended up using? – Fescennine 3/2, 2016 at 21:19

If you follow the rabbit warren of links starting with the one that I posted you might come across this line in GitHub which has the regex that I finally used. (Understandable that it is difficult to find.) That code and that file might help you: github.com/guyellis/CreditCard/blob/master/Company.CreditCard/… – Palter 4/2, 2016 at 14:20

None of these answers seem to give a single regex for all variants of only valid RFC 4122 UUIDs. But it looks like such an answer was given here: https://mcmap.net/q/80873/-how-to-test-valid-uuid-guid – Reiterant 23/2, 2017 at 0:49

I agree that by definition your regex does not miss any UUID. However it may be useful to note that if you are searching especially for Microsoft's Globally Unique Identifiers (GUIDs), there are five equivalent string representations for a GUID:

"ca761232ed4211cebacd00aa0057b223" 

"CA761232-ED42-11CE-BACD-00AA0057B223" 

"{CA761232-ED42-11CE-BACD-00AA0057B223}" 

"(CA761232-ED42-11CE-BACD-00AA0057B223)" 

"{0xCA761232, 0xED42, 0x11CE, {0xBA, 0xCD, 0x00, 0xAA, 0x00, 0x57, 0xB2, 0x23}}"

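As a rough sketch only, the first four representations above can be covered by a single Python pattern. The name GUID_RE is made up for illustration, the fifth (C initializer) form is deliberately left out, and the pattern does not check that an opening bracket is paired with the matching closing one.

```python
import re

# Hypothetical pattern: optional { or (, then 32 hex digits with optional
# hyphens between the 8-4-4-4-12 groups, then optional } or ).
GUID_RE = re.compile(
    r"[{(]?[0-9a-f]{8}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{4}-?[0-9a-f]{12}[)}]?",
    re.IGNORECASE,
)

samples = [
    "ca761232ed4211cebacd00aa0057b223",
    "CA761232-ED42-11CE-BACD-00AA0057B223",
    "{CA761232-ED42-11CE-BACD-00AA0057B223}",
    "(CA761232-ED42-11CE-BACD-00AA0057B223)",
]

# One boolean per sample: does the whole string match?
matched = [GUID_RE.fullmatch(s) is not None for s in samples]
```

Making the hyphens optional means any run of 32 hex digits matches, so this is looser than the 8-4-4-4-12 pattern from the question.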
Remarkable answered 25/9, 2008 at 22:27 Comment(2)

Under what situations would the first pattern be found? i.e. Is there a .Net function that would strip the hyphens or return the GUID without hyphens? – Palter 25/9, 2008 at 22:32

You can get it with myGuid.ToString("N"). – Remarkable 25/9, 2008 at 22:38

702

The regex for uuid is:

[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}

If you want to require that the full string matches this regex, you will sometimes need to surround the expression with ^...$ (your matcher API may instead have a full-match method), that is:

^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$
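To illustrate the difference between searching and validating, here is a minimal Python sketch. The sample text and variable names are made up; only the pattern comes from this answer.

```python
import re

UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

text = (
    "first 82b1510f-d735-4952-8a6d-0f7d6bfe7960, "
    "then f47ac10b-58cc-4372-a567-0e02b2c3d479."
)

# Searching: pull every UUID out of a larger block of text.
found = UUID_RE.findall(text)

# Validating: fullmatch() requires the whole string to match, like ^...$.
is_uuid = UUID_RE.fullmatch("82b1510f-d735-4952-8a6d-0f7d6bfe7960") is not None
text_is_uuid = UUID_RE.fullmatch(text) is not None
```

In Python, $ also matches just before a trailing newline, so fullmatch() is slightly stricter than wrapping the pattern in ^...$.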

Anciently answered 10/7, 2011 at 11:39 Comment(10)

In some cases you might even want to make that [a-fA-F0-9] or [A-F0-9]. – Lampyrid 23/11, 2011 at 12:53

+1 for pattern, but I'm wondering [0-9a-f] might perform better as more random hex digits will be a number instead of alphabetic character? – Jag 2/4, 2012 at 15:46

@cyber-monk: [0-9a-f] is identical to [a-f0-9] and [0123456789abcdef] in meaning and in speed, since the regex is turned into a state machine anyway, with each hex digit turned into an entry in a state-table. For an entry point into how this works, see en.wikipedia.org/wiki/Nondeterministic_finite_automaton – Hygienics 3/7, 2012 at 12:7

@Hygienics indeed [0-9a-f] ~ [a-f0-9] but [0123456789abcdef] is ~1% slower probably because there's more "string" to get parsed. The setup:

timeit.timeit(stmt="re.match('[0123456789abcdef]{8}-[0123456789abcdef]{4}-[0123456789abcdef]{4}-[0123456789abcdef]{4}-[0123456789abcdef]{12}$','82b1510f-d735-4952-8a6d-0f7d6bfe7960')",setup='import re', number=100000)/timeit.timeit(stmt="re.match('[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$','82b1510f-d735-4952-8a6d-0f7d6bfe7960')",setup='import re', number=100000)

– Culch 23/1, 2013 at 8:49

This solution is not quite correct. It matches IDs that have invalid version and variant characters per RFC4122. @Gajus' solution is more correct in that regard. Also, the RFC allows upper-case characters on input, so adding [A-F] would be appropriate. – Saransk 6/2, 2013 at 18:35

@broofa, I see that you are really set on everyone matching only UUIDs that are consistent with the RFC. However, I think the fact that you have had to point this out so many times is a solid indicator that not all UUIDs will use the RFC version and variant indicators. The UUID definition en.wikipedia.org/wiki/Uuid#Definition states a simple 8-4-4-4-12 pattern and 2^128 possibilities. The RFC represents only a subset of that. So what do you want to match? The subset, or all of them? – Correlative 25/2, 2013 at 22:57

@RichardBronosky - A fair point. I guess it's not really clear from the OP's question whether or not RFC-compliance is an important distinction. (although his concern is more with false negatives so perhaps it's not.) Pick your poison, I suppose. :/ – Saransk 26/2, 2013 at 2:54

You can compress this regex quite a bit: [0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}. – Quaky 14/10, 2016 at 16:18

@AndrewCoad The internal \b's are unnecessary, and if you care about boundaries at the ends of the UUID then the outer \b's should probably be replaced with ^..$ (or \A..\z if you're in Ruby). Depending on language, the /i switch removes the need for specifying both a-z and A-F. In summary: /^[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}$/i. Even this is incorrect though, because it allows invalid UUIDs through. See answer from @Gajus below. – Septime 10/12, 2018 at 6:5

@Dr.Hans-PeterStörr In fact, you can handle both cases with the i flag instead of [a-fA-F0-9]; check this regex demo. – Atrice 26/1 at 14:48

173

@ivelin: UUID can have capitals. So you'll either need to toLowerCase() the string or use:

[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}

Would have just commented this but not enough rep :)

Justly answered 11/10, 2012 at 15:32 Comment(4)

Usually you can handle this by defining the pattern as case insensitive with an i after the pattern, this makes a cleaner pattern: /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i – Glenine 27/2, 2016 at 9:7

@ThomasBindzus That option isn't available in all languages. The original pattern in this answer worked for me in Go. The /.../i version didn't. – Dinar 1/5, 2020 at 23:3

For future readers: /i is not the only way. Go (and not only) supports "(?i)" at the beginning of the pattern, like (?i)[a-f0-9].... , which would also make the whole pattern case insensitive. (?i) makes everything to the right side case-insensitive. Counterpart (?-i). – Larrup 17/12, 2021 at 5:48

@ChrisRedford works on most regex engine check this demo. – Atrice 26/1 at 14:50

148

If you want to check or validate a specific UUID version, here are the corresponding regexes.

Note that the only difference is the version number, which is explained in section 4.1.3 (Version) of RFC 4122.

The version number is the first character of the third group : [VERSION_NUMBER][0-9A-F]{3} :

UUID v1 :

/^[0-9A-F]{8}-[0-9A-F]{4}-[1][0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i

UUID v2 :

/^[0-9A-F]{8}-[0-9A-F]{4}-[2][0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i

UUID v3 :

/^[0-9A-F]{8}-[0-9A-F]{4}-[3][0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i

UUID v4 :

/^[0-9A-F]{8}-[0-9A-F]{4}-[4][0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i

UUID v5 :

/^[0-9A-F]{8}-[0-9A-F]{4}-[5][0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i
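Since the five patterns differ only in one digit, they can be generated rather than written out. A small Python sketch, where the helper name uuid_version_regex is hypothetical:

```python
import re

def uuid_version_regex(version):
    """Build the pattern above for a single version digit (1 to 5)."""
    if version not in range(1, 6):
        raise ValueError("expected a version from 1 to 5")
    # Doubled braces are literal braces inside an f-string.
    return re.compile(
        rf"^[0-9A-F]{{8}}-[0-9A-F]{{4}}-{version}[0-9A-F]{{3}}"
        rf"-[89AB][0-9A-F]{{3}}-[0-9A-F]{{12}}$",
        re.IGNORECASE,
    )

v1 = uuid_version_regex(1)
v4 = uuid_version_regex(4)

v4_sample = "f47ac10b-58cc-4372-a567-0e02b2c3d479"  # third group starts with 4
v1_sample = "123e4567-e89b-12d3-a456-426655440001"  # third group starts with 1
```

Here v4.match(v4_sample) succeeds and v1.match(v4_sample) does not, and the reverse holds for v1_sample.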

Ganesha answered 4/7, 2016 at 19:20 Comment(5)

The patterns do not include lower case letters. It should also contain a-f next to each A-F scope. – Prefabricate 26/6, 2017 at 22:21

The i at the end of the regex marks it as case insensitive. – Garlinda 30/6, 2017 at 3:0

A pattern modifier cannot always be used. For example, in an OpenAPI definition, the pattern is case sensitive. – Itu 25/3, 2020 at 13:15

@StephaneJanicaud In OpenAPI, you should rather use the format modifier by setting it to "uuid" instead of using a regex to test UUIDs: swagger.io/docs/specification/data-models/data-types/#format – Ganesha 27/3, 2020 at 12:3

Thank you @IvanGabriele for the tip. It was just an example; it's the same problem when you want to check any case-insensitive pattern. – Itu 27/3, 2020 at 12:46

143

Version 4 UUIDs have the form xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx where x is any hexadecimal digit and y is one of 8, 9, A, or B. e.g. f47ac10b-58cc-4372-a567-0e02b2c3d479.

source: http://en.wikipedia.org/wiki/Uuid#Definition

Therefore, this is technically more correct:

/[a-f0-9]{8}-[a-f0-9]{4}-4[a-f0-9]{3}-[89aAbB][a-f0-9]{3}-[a-f0-9]{12}/

Irv answered 4/1, 2013 at 22:42 Comment(8)

I don't think you mean a-z. – Correlative 5/2, 2013 at 16:6

Need to accept [A-F], too. Per section 3 of RFC4122: 'The hexadecimal values "a" through "f" are output as lower case characters and are case insensitive on input'. Also (?:8|9|A|B) is probably slightly more readable as [89aAbB] – Saransk 6/2, 2013 at 18:26

Need to copy @broofa's modification; as yours excludes lower-case A or B. – Expectant 18/5, 2013 at 22:26

@elliottcable Depending on your environment, just use i (case-insensitive) flag. – Irv 14/1, 2014 at 23:11

You're rejecting Version 1 to 3 and 5. Why? – Wizened 24/6, 2014 at 13:20

this regex fails for - 123e4567-e89b-12d3-a456-426655440001 since it's valid. – Supertanker 3/6, 2019 at 10:5

@ThangavelLoganathan right this is only for version 4 which iGEL mentioned, but you've got a v1 UUID. I think the only difference between UUIDs are the version numbers in the third group (i.e. 4[a-f0-9]{3}). I got that from Ivan's answer. – Whitesmith 15/9, 2020 at 15:58

@prostýčlověk check this demo works fine for your UUID string. – Atrice 26/1 at 14:54

/^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89AB][0-9a-f]{3}-[0-9a-f]{12}$/i

Gajus' regexp rejects UUID V1-3 and 5, even though they are valid.
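If you are in Python and only need to know which version and variant a candidate string carries, the standard-library uuid module can report them without a regex. This is a sketch; the helper name describe is made up, and note that uuid.UUID is lenient about input (it also accepts braces, a urn:uuid: prefix, and missing hyphens).

```python
import uuid

def describe(text):
    """Return (version, variant) for a UUID string, or None if it does not parse."""
    try:
        parsed = uuid.UUID(text)
    except ValueError:
        return None
    return parsed.version, parsed.variant

v1 = describe("123e4567-e89b-12d3-a456-426655440001")
v4 = describe("f47ac10b-58cc-4372-a567-0e02b2c3d479")
bad = describe("2wtu37k5-q174-4418-2cu2-276e4j82sv19")  # contains non-hex characters
```

The first two calls report versions 1 and 4 with the RFC 4122 variant; the third returns None.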

Wizened answered 24/6, 2014 at 13:19 Comment(4)

But it allows invalid versions (like 8 or A) and invalid variants. – Karb 13/2, 2018 at 10:33

Note that AB in [89AB][0-9a-f] is upper case and the rest of allowed characters are lower case. It has caught me out in Python – Elsey 19/7, 2018 at 13:21

But it rejects UUID version 4; check this demo. – Atrice 26/1 at 14:58

@AmineKOUIS You didn't make the regex case-insensitive (See the /i at the end). So the [89AB] doesn't match. – Wizened 29/1 at 9:39

[\w]{8}(-[\w]{4}){3}-[\w]{12} has worked for me in most cases.

Or if you want to be really specific [\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}.

Conceited answered 22/10, 2010 at 16:45 Comment(6)

It is worth noting that \w, in Java at least, matches _ as well as hexadecimal digits. Replacing the \w with \p{XDigit} may be more appropriate, as that is the POSIX class defined for matching hexadecimal digits. This may break when using other Unicode charsets, though. – Arson 7/3, 2011 at 21:41

@oconnor \w usually means "word characters" It will match much more than hex-digits. Your solution is much better. Or, for compatibility/readability you could use [a-f0-9] – Hoyle 25/9, 2011 at 9:23

Here is a string that looks like a UUID and matches those patterns, but is an invalid UUID: 2wtu37k5-q174-4418-2cu2-276e4j82sv19 – Sama 1/12, 2016 at 19:37

@OleTraveler not true, works like a charm.

import re

def valid_uuid(uuid):
    regex = re.compile('[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}', re.I)
    match = regex.match(uuid)
    return bool(match)

valid_uuid('2wtu37k5-q174-4418-2cu2-276e4j82sv19')

– Pintsize 1/12, 2017 at 9:25

@tom That string (2wt...) is an invalid UUID, but the pattern given in this answer matches that string indicating falsely that it is a valid UUID. It's too bad I don't remember why that UUID is invalid. – Sama 2/12, 2017 at 15:1

@OleTraveler That's interesting. I don't know much about UUIDs in general but my UUIDs were generated by the UUID 4 generator and it matches what wikipedia says. EDIT: I read again what you wrote. I may understand what you mean, this code just counts the number of characters but UUID also consists of version and variant within itself. For me this code is sufficient, but indeed there are cases where invalid UUID will match this pattern. Thanks for contributing to the discussion. – Pintsize 4/12, 2017 at 13:1

In Python re, a character range can span from numeric to uppercase alpha characters. So:

import re
test = "01234ABCDEFGHIJKabcdefghijk01234abcdefghijkABCDEFGHIJK"
re.compile(r'[0-f]+').findall(test) # Bad: matches all uppercase alpha chars
## ['01234ABCDEFGHIJKabcdef', '01234abcdef', 'ABCDEFGHIJK']
re.compile(r'[0-F]+').findall(test) # Partial: does not match lowercase hex chars
## ['01234ABCDEF', '01234', 'ABCDEF']
re.compile(r'[0-F]+', re.I).findall(test) # Good
## ['01234ABCDEF', 'abcdef', '01234abcdef', 'ABCDEF']
re.compile(r'[0-f]+', re.I).findall(test) # Good
## ['01234ABCDEF', 'abcdef', '01234abcdef', 'ABCDEF']
re.compile(r'[0-Fa-f]+').findall(test) # Good (with uppercase-only magic)
## ['01234ABCDEF', 'abcdef', '01234abcdef', 'ABCDEF']
re.compile(r'[0-9a-fA-F]+').findall(test) # Good (with no magic)
## ['01234ABCDEF', 'abcdef', '01234abcdef', 'ABCDEF']

That makes the simplest Python UUID regex:

re_uuid = re.compile("[0-F]{8}-([0-F]{4}-){3}[0-F]{12}", re.I)

I'll leave it as an exercise to the reader to use timeit to compare the performance of these.

Enjoy. Keep it Pythonic™!

NOTE: Those spans will also match :;<=>?@' so, if you suspect that could give you false positives, don't take the shortcut. (Thank you Oliver Aubert for pointing that out in the comments.)
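To make that caveat concrete, here is a short Python check of which characters the 0-F range really covers. The variable names are illustrative, and the bogus string is the one from the comments on this answer.

```python
import re
import string

# Every character covered by the range 0-F, by code point.
in_range = [chr(c) for c in range(ord("0"), ord("F") + 1)]

# The ones that are not hexadecimal digits are the accidental extras.
extras = "".join(ch for ch in in_range if ch not in string.hexdigits)

shortcut = re.compile(r"[0-F]{8}-([0-F]{4}-){3}[0-F]{12}", re.I)
strict = re.compile(r"[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}", re.I)

# Built only from the extras plus hyphens, in the 8-4-4-4-12 layout.
bogus = ":=>;?<;:-<@=:-@=;=-@;@:->==@?>=:?=@;"

shortcut_accepts = shortcut.fullmatch(bogus) is not None
strict_accepts = strict.fullmatch(bogus) is not None
```

The shortcut pattern accepts the bogus string, while the explicit [0-9a-f] class rejects it.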

Correlative answered 5/2, 2013 at 16:21 Comment(2)

[0-F] will indeed match 0-9 and A-F, but also any character whose ASCII code is between 57 (for 9) and 65 (for A), that is to say any of :;<=>?@'. – Anchises 19/10, 2015 at 8:40

So do not use the above-mentioned code unless you want to consider :=>;?<;:-<@=:-@=;=-@;@:->==@?>=:?=@; as a valid UUID :-) – Anchises 19/10, 2015 at 8:48

By definition, a UUID is 32 hexadecimal digits, separated in 5 groups by hyphens, just as you have described. You shouldn't miss any with your regular expression.

http://en.wikipedia.org/wiki/Uuid#Definition

Buchalter answered 25/9, 2008 at 22:14 Comment(1)

Not correct. RFC4122 only allows [1-5] for the version digit, and [89aAbB] for the variant digit. – Saransk 6/2, 2013 at 18:36

If using POSIX regex (grep -E, MySQL, etc.), this may be easier to read & remember:

[[:xdigit:]]{8}(-[[:xdigit:]]{4}){3}-[[:xdigit:]]{12}

Perl & PCRE flavours also support POSIX character classes so that'll work with them. For those, change the (…) to a non-capturing subgroup (?:…).

JavaScript (and other syntaxes that support Unicode properties) can use a similarly legible version:

/\p{Hex_Digit}{8}(?:-\p{Hex_Digit}{4}){3}-\p{Hex_Digit}{12}/u

Italianize answered 3/4, 2020 at 23:57 Comment(1)

imo, this is the best answer as it uses the appropriate character class and repeats the 4+'-' pattern three times with the quantifier {3}. I wish this received a higher priority in the list of a responses. – Cute 29/9, 2023 at 21:14

Here is the working REGEX: https://www.regextester.com/99148

const regex = /[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/;

Acidity answered 13/7, 2020 at 8:34 Comment(0)

So, I think Richard Bronosky actually has the best answer to date, but I think you can do a bit to make it somewhat simpler (or at least terser):

re_uuid = re.compile(r'[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}', re.I)

Discounter answered 15/4, 2013 at 23:9 Comment(2)

Even terser: re_uuid = re.compile(r'[0-9a-f]{8}(?:-[0-9a-f]{4}){4}[0-9a-f]{8}', re.I) – Wenn 12/5, 2014 at 11:1

If you're looking to use capture groups to actually capture data out of a string, using this is NOT a great idea. It looks a little simpler, but complicates some usages. – Adamic 4/12, 2020 at 16:7

Variant for C++:

#include <regex>  // Required include

...

// Source string    
std::wstring srcStr = L"String with GUID: {4d36e96e-e325-11ce-bfc1-08002be10318} any text";

// Regex and match
std::wsmatch match;
std::wregex rx(L"(\\{[A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12}\\})", std::regex_constants::icase);

// Search
std::regex_search(srcStr, match, rx);

// Result
std::wstring strGUID       = match[1];

Pinball answered 16/4, 2014 at 18:23 Comment(0)

For UUID generated on OS X with uuidgen, the regex pattern is

[A-F0-9]{8}-[A-F0-9]{4}-4[A-F0-9]{3}-[89AB][A-F0-9]{3}-[A-F0-9]{12}

Verify with

uuidgen | grep -E "[A-F0-9]{8}-[A-F0-9]{4}-4[A-F0-9]{3}-[89AB][A-F0-9]{3}-[A-F0-9]{12}"

Fictive answered 2/7, 2016 at 17:23 Comment(0)

For bash:

grep -E "[a-f0-9]{8}-[a-f0-9]{4}-4[a-f0-9]{3}-[89aAbB][a-f0-9]{3}-[a-f0-9]{12}"

For example:

$> echo "f2575e6a-9bce-49e7-ae7c-bff6b555bda4" | grep -E "[a-f0-9]{8}-[a-f0-9]{4}-4[a-f0-9]{3}-[89aAbB][a-f0-9]{3}-[a-f0-9]{12}"
f2575e6a-9bce-49e7-ae7c-bff6b555bda4

Baseborn answered 13/11, 2019 at 8:57 Comment(1)

You need to include grep's -i option for case-insensitive matching. – Brassbound 30/6, 2020 at 11:37

$UUID_RE = join '-', map { "[0-9a-f]{$_}" } 8, 4, 4, 4, 12;

BTW, allowing only 4 on one of the positions is only valid for UUIDv4. But v4 is not the only UUID version that exists. I have met v1 in my practice as well.
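For readers who don't use Perl, the join/map line at the top of this answer translates roughly to the following Python. This is a sketch of the same idea, not part of the original answer.

```python
import re

# One hex character class per group length, joined with hyphens.
UUID_RE = re.compile("-".join(f"[0-9a-f]{{{n}}}" for n in (8, 4, 4, 4, 12)))

# The generated pattern text is the familiar 8-4-4-4-12 expression.
pattern_text = UUID_RE.pattern
```

Like the Perl version, this accepts any version digit and is lowercase-only unless you add a case-insensitive flag.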

Reeta answered 17/1, 2016 at 17:4 Comment(0)

I just want to share the shortest regex I know that does the same job as the good answers here.

^[a-f\d]{8}(-[a-f\d]{4}){3}-[a-f\d]{12}$

Use it with the ignore-case flag i for case-insensitive matching:

const pattern = /^[a-f\d]{8}(-[a-f\d]{4}){3}-[a-f\d]{12}$/i // JavaScript

pattern = re.compile(r"^[a-f\d]{8}(-[a-f\d]{4}){3}-[a-f\d]{12}$", re.IGNORECASE) # Python

$pattern = '/^[a-f\d]{8}(-[a-f\d]{4}){3}-[a-f\d]{12}$/i' // php

Almsman answered 24/8, 2023 at 20:20 Comment(2)

Your regex doesn't match certain UUIDs; check this demo. – Atrice 26/1 at 15:3

@Amine KOUIS Sorry, my bad, I forgot to add ^ at the start. Anyway, ^ at the start and $ at the end mean the start and the end of the string. If you remove the last $ in your link, it catches the first one only, because the second is invalid: it has non-hex characters. – Almsman 1/2 at 18:31

I wanted to contribute my regex, which covers all the cases from the OP and captures each part of the UUID in its own group, so you don't need to post-process the string to get each part; the regex already extracts them for you.

([\d\w]{8})-?([\d\w]{4})-?([\d\w]{4})-?([\d\w]{4})-?([\d\w]{12})|[{0x]*([\d\w]{8})[0x, ]{4}([\d\w]{4})[0x, ]{4}([\d\w]{4})[0x, {]{5}([\d\w]{2})[0x, ]{4}([\d\w]{2})[0x, ]{4}([\d\w]{2})[0x, ]{4}([\d\w]{2})[0x, ]{4}([\d\w]{2})[0x, ]{4}([\d\w]{2})[0x, ]{4}([\d\w]{2})[0x, ]{4}([\d\w]{2})
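As a simplified sketch of the grouping idea, restricted to hex digits and to the hyphenated or hyphen-less forms only, named groups in Python could look like this. The group names are illustrative, loosely borrowed from the RFC 4122 field names, and the sample text is made up.

```python
import re

FIELDS_RE = re.compile(
    r"(?P<time_low>[0-9a-f]{8})-?"
    r"(?P<time_mid>[0-9a-f]{4})-?"
    r"(?P<time_hi_and_version>[0-9a-f]{4})-?"
    r"(?P<clock_seq>[0-9a-f]{4})-?"
    r"(?P<node>[0-9a-f]{12})",
    re.IGNORECASE,
)

match = FIELDS_RE.search("id={CA761232-ED42-11CE-BACD-00AA0057B223};")

# groupdict() maps each group name to the text it captured.
parts = match.groupdict() if match else {}
```

Unlike [\d\w], the [0-9a-f] class does not accept letters outside a-f or underscores.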

Sarene answered 15/12, 2020 at 18:55 Comment(0)

Official uuid library uses following regex:

/^(?:[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}|00000000-0000-0000-0000-000000000000)$/i
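A direct Python transcription of that pattern, as a sketch (the name UUID_OR_NIL_RE and the sample strings are mine), shows what the second alternative is for: it lets the all-zero nil UUID through even though its version digit is 0.

```python
import re

UUID_OR_NIL_RE = re.compile(
    r"^(?:[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}"
    r"|00000000-0000-0000-0000-000000000000)$",
    re.IGNORECASE,
)

nil_ok = UUID_OR_NIL_RE.match("00000000-0000-0000-0000-000000000000") is not None
v4_ok = UUID_OR_NIL_RE.match("f47ac10b-58cc-4372-a567-0e02b2c3d479") is not None

# Version digit 6 is outside [1-5], so this pattern rejects it.
v6_ok = UUID_OR_NIL_RE.match("123e4567-e89b-62d3-a456-426655440001") is not None

# Variant digit c is outside [89ab], so this one is rejected too.
bad_variant_ok = UUID_OR_NIL_RE.match("123e4567-e89b-12d3-c456-426655440001") is not None
```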

See reference

Olsson answered 28/2, 2022 at 14:26 Comment(0)

The pattern I wrote in JavaScript for me is:

re = /^[\da-f]{8}-([\da-f]{4}-){3}[\da-f]{12}$/i;

This is smaller than the examples above and should work fine.

Note: Don't use the examples that use \w because they are less strict and might match improper uuid's.
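A quick Python comparison shows the difference. The loose pattern is the \w style from other answers, the strict one is a Python spelling of the pattern above, and the test string is taken from the comments elsewhere on this page.

```python
import re

loose = re.compile(r"^\w{8}(-\w{4}){3}-\w{12}$")
strict = re.compile(r"^[\da-f]{8}-([\da-f]{4}-){3}[\da-f]{12}$", re.IGNORECASE)

# Right shape, but it contains letters outside a-f.
not_a_uuid = "2wtu37k5-q174-4418-2cu2-276e4j82sv19"

# \w also matches underscores.
underscores = "________-____-____-____-____________"

loose_results = [loose.match(s) is not None for s in (not_a_uuid, underscores)]
strict_results = [strict.match(s) is not None for s in (not_a_uuid, underscores)]
```

The loose pattern accepts both strings; the strict one rejects both.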

Empathic answered 10/4 at 17:21 Comment(0)

-1

Here is a brief regex to match a valid UUID that works in most regex engines: /[\w]{8}(-[\w]{4}){3}-[\w]{12}/i

Don't forget to use it with the ignore-case flag i for case-insensitive matching:

JavaScript:

const pattern = /[\w]{8}(-[\w]{4}){3}-[\w]{12}/i

Python:

pattern = re.compile(r"[\w]{8}(-[\w]{4}){3}-[\w]{12}", re.IGNORECASE)

PHP:

$pattern = '/[\w]{8}(-[\w]{4}){3}-[\w]{12}/i'

Regex DEMO.

Atrice answered 27/1 at 14:55 Comment(0)
