I am looking for a regex that allows me to validate JSON.
I am very new to regexes, and I know enough to understand that parsing with regex is bad, but can it be used to validate?
Some modern regex implementations allow for recursive regular expressions, which can verify a complete JSON serialized structure. The json.org specification makes it quite straightforward.
$pcre_regex = '/
(?(DEFINE)
(?<ws> [\t\n\r ]* )
(?<number> -? (?: 0|[1-9]\d*) (?: \.\d+)? (?: [Ee] [+-]? \d++)? )
(?<boolean> true | false | null )
(?<string> " (?: [^\\\\"\x00-\x1f] | \\\\ ["\\\\bfnrt\/] | \\\\ u [0-9A-Fa-f]{4} )* " )
(?<pair> (?&ws) (?&string) (?&ws) : (?&value) )
(?<array> \[ (?: (?&value) (?: , (?&value) )* )? (?&ws) \] )
(?<object> \{ (?: (?&pair) (?: , (?&pair) )* )? (?&ws) \} )
(?<value> (?&ws) (?: (?&number) | (?&boolean) | (?&string) | (?&array) | (?&object) ) (?&ws) )
)
\A (?&value) \Z
/sx';
The example above uses the Perl 5.10/PCRE2 subroutine call syntax to simplify the expression and improve readability. It works quite well in PHP with the PCRE functions. It should work almost unmodified in Perl (provided one replaces the 4-backslash sequences '\\\\' with 2-backslash sequences '\\' in the <string> subroutine), and it can be adapted for other languages (e.g. Ruby, or those for which PCRE bindings are available).
This regex passes all tests from the JSON.org test suite (see link at the end of the page) as well as those from Nicolas Seriot's JSON Parser test suite.1
A simpler approach is the minimal consistency check specified in RFC 4627, section 6. However, it is only intended as a security test and a basic precaution against invalid input:
var jsonCode = /* untrusted input */;
var jsonObject = !(/[^,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t]/.test(
jsonCode.replace(/"(\\.|[^"\\])*"/g, '')))
&& eval('(' + jsonCode + ')');
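As a hedged sketch of the same idea in current JavaScript: the character screen from RFC 4627 is kept as a cheap pre-filter, but eval is swapped for JSON.parse (an assumption about the target environment; any ES5+ engine provides it). The names looksLikeJson and parseUntrusted are made up for illustration.

```javascript
// Pre-filter from RFC 4627, section 6: strip string literals, then make sure
// only characters that can appear in JSON tokens remain.
function looksLikeJson(text) {
  const withoutStrings = text.replace(/"(\\.|[^"\\])*"/g, '');
  return !/[^,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t]/.test(withoutStrings);
}

// Hypothetical wrapper: returns the decoded value, or undefined if the text
// is rejected by the pre-filter or by the real parser.
function parseUntrusted(text) {
  if (!looksLikeJson(text)) return undefined;
  try {
    return JSON.parse(text); // swapped in for eval()
  } catch (e) {
    return undefined;
  }
}
```

Note that the pre-filter alone accepts input such as [1,,2]; only the parser rejects it, which is why the check is a precaution rather than a validation.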
1 With the exception of two cases whose input is very large, causing the regex to time out. More generally, this approach is bound to fail on inputs large enough to hit the resource limits of the matching engine (either in time or space).
\d is dangerous. In many regexp implementations \d matches the Unicode definition of a digit, which is not just [0-9] but also includes digits from other scripts. – Reminisce
[...] false matches while the top-level JSON value must be either an array or an object. It also has many issues with the character set allowed in strings or in whitespace. – Reminisce
[...] json_decode standpoint, where the three literal tokens, strings or numbers are also accepted. And obviously I did not care about the string validity; that would require at least the /u flag and some further constraints in [^"\\\\]*. As for \d, that depends on the locale and PCRE version, obviously. – Spire
[...] (?{..}?) can build an actual JSON parse tree, not just validate it. – Spire
[...] C# version of this? – Sealy
[...] true and false or a "plain string" instead of an object/array as outer shell. Moreover, it's a bit more JSOL than JSON, as it allows unescaped linebreaks/tabs. – Spire
[...] fail25.json, fail27.json, but I've fixed them. – Dwell
[...] {"libelle":"Cin\u00e9ma Gaumont Amiens"}. regex101.com/r/kkMbN4/1 – Toothwort
[...] \\\\ u [0-9a-f]+ over. For regex-only context, it's just 2 backslashes however. – Spire
[...] trim() to the pattern or it will be error unknown modifier... preg_match(trim($pcre_regex), 'json string here'); – Halmahera
["FABRICATION",[], This input will cause a catastrophic backtracking error. Snippet: regex101.com/r/Jj0bRX/1 There is a problem with the array part. – Trovillion
[...] <string> subroutine (or making it possessive) fixes it. – Graduate
Yes, it's a common misconception that regular expressions can match only regular languages. In fact, the PCRE functions can match much more than regular languages; they can even match some non-context-free languages! Wikipedia's article on RegExps has a special section about it.
JSON can be recognized using PCRE in several ways! @mario showed one great solution using named subpatterns and back-references. Then he noted that there should be a solution using recursive patterns (?R). Here is an example of such a regexp written in PHP:
$regexString = '"([^"\\\\]*|\\\\["\\\\bfnrt\/]|\\\\u[0-9a-f]{4})*"';
$regexNumber = '-?(?=[1-9]|0(?!\d))\d+(\.\d+)?([eE][+-]?\d+)?';
$regexBoolean= 'true|false|null'; // these are actually copied from Mario's answer
$regex = '/\A('.$regexString.'|'.$regexNumber.'|'.$regexBoolean.'|'; //string, number, boolean
$regex.= '\[(?:(?1)(?:,(?1))*)?\s*\]|'; //arrays
$regex.= '\{(?:\s*'.$regexString.'\s*:(?1)(?:,\s*'.$regexString.'\s*:(?1))*)?\s*\}'; //objects
$regex.= ')\Z/is';
I'm using (?1) instead of (?R) because the latter references the entire pattern, but we have \A and \Z sequences that should not be used inside subpatterns. (?1) refers to the regexp marked by the outermost parentheses (this is why the outermost ( ) does not start with ?:). So, the RegExp becomes 268 characters long :)
/\A("([^"\\]*|\\["\\bfnrt\/]|\\u[0-9a-f]{4})*"|-?(?=[1-9]|0(?!\d))\d+(\.\d+)?([eE][+-]?\d+)?|true|false|null|\[(?:(?1)(?:,(?1))*)?\s*\]|\{(?:\s*"([^"\\]*|\\["\\bfnrt\/]|\\u[0-9a-f]{4})*"\s*:(?1)(?:,\s*"([^"\\]*|\\["\\bfnrt\/]|\\u[0-9a-f]{4})*"\s*:(?1))*)?\s*\})\Z/is
Anyway, this should be treated as a "technology demonstration", not as a practical solution. In PHP I'll validate the JSON string by calling the json_decode() function (just like @Epcylon noted). If I'm going to use that JSON once it's validated, then this is the best method.
\d is dangerous. In many regexp implementations \d matches the Unicode definition of a digit, which is not just [0-9] but also includes digits from other scripts. – Reminisce
\d does not match Unicode numbers in PHP's implementation of PCRE. For example, the ٩ symbol (0x669 arabic-indic digit nine) will be matched using the pattern #\p{Nd}#u but not #\d#u – Handmaid
[...] /u flag. JSON is encoded in UTF-8. For a proper regexp you should use that flag. – Reminisce
[...] u modifier, please look again at the patterns in my previous comment :) Strings, numbers and booleans ARE correctly matched at the top level. You can paste the long regexp here quanetic.com/Regex and try yourself – Handmaid
Because of the recursive nature of JSON (nested {...}-s), regex is not suited to validate it. Sure, some regex flavours can recursively match patterns* (and can therefore match JSON), but the resulting patterns are horrible to look at, and should never ever be used in production code IMO!
* Beware though, many regex implementations do not support recursive patterns. Of the popular programming languages, these support recursive patterns: Perl, PHP and Ruby 1.9.2. .NET offers balancing groups instead, which can match nested constructs without true recursion.
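Where the engine has no recursion at all (JavaScript's built-in RegExp is one example), the nesting can live in ordinary code while regexes only recognise the flat tokens. The following is a minimal sketch of that split, not a replacement for a real parser: the token patterns are adapted from the <ws>, <number>, <boolean> and <string> subpatterns quoted in this thread, and every function name is hypothetical.

```javascript
// Flat tokens only. The sticky flag (y) makes each pattern match exactly at lastIndex.
const WS = /[\t\n\r ]*/y;
const NUMBER = /-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[Ee][+-]?[0-9]+)?/y;
const LITERAL = /true|false|null/y;
const STRING = /"(?:[^\\"\x00-\x1f]|\\["\\bfnrt\/]|\\u[0-9A-Fa-f]{4})*"/y;

// Try to match `re` at `pos`; return the position after the match, or -1.
function eat(re, text, pos) {
  re.lastIndex = pos;
  return re.test(text) ? re.lastIndex : -1;
}

// WS can match the empty string, so this never fails for a valid position.
function skipWs(text, pos) {
  return eat(WS, text, pos);
}

// value: ws (string | number | literal | array | object) ws
function value(text, pos) {
  pos = skipWs(text, pos);
  let end;
  if (text[pos] === '[') end = list(text, pos + 1, ']', value);
  else if (text[pos] === '{') end = list(text, pos + 1, '}', pair);
  else {
    end = eat(STRING, text, pos);
    if (end < 0) end = eat(NUMBER, text, pos);
    if (end < 0) end = eat(LITERAL, text, pos);
  }
  return end < 0 ? -1 : skipWs(text, end);
}

// pair: ws string ws ':' value
function pair(text, pos) {
  pos = eat(STRING, text, skipWs(text, pos));
  if (pos < 0) return -1;
  pos = skipWs(text, pos);
  return text[pos] === ':' ? value(text, pos + 1) : -1;
}

// Shared body of arrays and objects: (item (',' item)*)? ws close
function list(text, pos, close, item) {
  const afterWs = skipWs(text, pos);
  if (text[afterWs] === close) return afterWs + 1; // empty [] or {}
  for (;;) {
    pos = item(text, pos); // items consume their own surrounding whitespace
    if (pos < 0) return -1;
    if (text[pos] === close) return pos + 1;
    if (text[pos] !== ',') return -1;
    pos += 1;
  }
}

// The whole input must be consumed by a single value.
function isValidJson(text) {
  return value(text, 0) === text.length;
}
```

For example, isValidJson('[1, 2,]') is false because value() finds nothing after the trailing comma, while nested input such as '[[[["deep"]]]]' is handled by the mutual recursion between value() and list().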
Looking at the documentation for JSON, it seems that the regex can simply be three parts if the goal is just to check for fitness:
[First] The string starts and ends with either [] or {}
[{\[]{1}...[}\]]{1}
AND EITHER
[Second] The character is an allowed JSON control character (just one)
...[,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t]...
OR
[Third] The set of characters contained in a ""
...".*?"...
All together:
[{\[]{1}([,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t]|".*?")+[}\]]{1}
If the JSON string contains newline characters, then you should use the singleline switch on your regex flavor so that . matches newline. Please note that this will not fail on all bad JSON, but it will fail if the basic JSON structure is invalid, which is a straightforward way to do a basic sanity validation before passing it to a parser.
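To make the "sanity check first, parser second" idea concrete, here is a rough JavaScript sketch. It assumes the combined pattern is anchored with ^ and $ (the pattern above is written without anchors) and uses the s flag as the singleline switch; the function names are invented for illustration.

```javascript
// Combined pattern from above, anchored (an assumption) and with the
// s flag so that . also matches newline characters.
const SHAPE_RE = /^[{\[]{1}([,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t]|".*?")+[}\]]{1}$/s;

// Cheap shape check: says nothing about nesting or token order.
function passesShapeCheck(text) {
  return SHAPE_RE.test(text);
}

// The real decision is left to the parser.
function isParsableJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (e) {
    return false;
  }
}

const balanced = '{"a": [1, 2], "b": "x"}';
const unbalanced = '{{"a": 1}'; // one opening brace too many

console.log(passesShapeCheck(balanced), isParsableJson(balanced));
console.log(passesShapeCheck(unbalanced), isParsableJson(unbalanced));
```

The unbalanced sample passes the shape check and is only caught by the parser, so the regex is best seen as an early filter. In this anchored form it also rejects the empty object {}, because the middle group requires at least one character.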
[{\[]{1}([,:{}\[\]0-9.\-+A-zr-u \n\r\t]|".*:?")+[}\]]{1} – Waggish
[...] {{"parentRelationField": "Project_Name__c", "employeeIdField": "Employee_Name__c"} - did you find a way to prevent it matching when the open and close braces are not matching in count? – Coniine
[...] {}. – Enyedy
I tried @mario's answer, but it didn't work for me: I downloaded the test suite from JSON.org (archive) and 4 tests failed (fail1.json, fail18.json, fail25.json, fail27.json).
I've investigated the errors and found out that fail1.json is actually correct (according to the manual's note and RFC 7159, a valid string is also valid JSON). File fail18.json was not a real failure either, because it contains correct, deeply nested JSON:
[[[[[[[[[[[[[[[[[[[["Too deep"]]]]]]]]]]]]]]]]]]]]
So two files are left, fail25.json and fail27.json:
[" tab character in string "]
and
["line
break"]
Both contain invalid characters. So I've updated the pattern like this (string subpattern updated):
$pcreRegex = '/
(?(DEFINE)
(?<number> -? (?= [1-9]|0(?!\d) ) \d+ (\.\d+)? ([eE] [+-]? \d+)? )
(?<boolean> true | false | null )
(?<string> " ([^"\n\r\t\\\\]* | \\\\ ["\\\\bfnrt\/] | \\\\ u [0-9a-f]{4} )* " )
(?<array> \[ (?: (?&json) (?: , (?&json) )* )? \s* \] )
(?<pair> \s* (?&string) \s* : (?&json) )
(?<object> \{ (?: (?&pair) (?: , (?&pair) )* )? \s* \} )
(?<json> \s* (?: (?&number) | (?&boolean) | (?&string) | (?&array) | (?&object) ) \s* )
)
\A (?&json) \Z
/six';
So now all legal tests from json.org pass.
[...] "\/" as a valid json string but it is a valid json string value. Can you fix this? For example, an escaped url such as "https:\/\/websit.com" will not be matched by your string group. – Trovillion
I created a Ruby implementation of Mario's solution, which does work:
# encoding: utf-8

module Constants
  # Inside a Ruby regex literal a single literal backslash is written \\
  # (the \\\\ form is only needed when the pattern sits in a PHP string).
  JSON_VALIDATOR_RE = /(
      # define subtypes and build up the json syntax, BNF-grammar-style
      # The {0} is a hack to simply define them as named groups here but not match on them yet
      # I added some atomic grouping to prevent catastrophic backtracking on invalid inputs
      (?<number>  -?(?=[1-9]|0(?!\d))\d+(\.\d+)?([eE][+-]?\d+)?){0}
      (?<boolean> true | false | null ){0}
      (?<string>  " (?>[^"\\]* | \\ ["\\bfnrt\/] | \\ u [0-9a-f]{4} )* " ){0}
      (?<array>   \[ (?> \g<json> (?: , \g<json> )* )? \s* \] ){0}
      (?<pair>    \s* \g<string> \s* : \g<json> ){0}
      (?<object>  \{ (?> \g<pair> (?: , \g<pair> )* )? \s* \} ){0}
      (?<json>    \s* (?> \g<number> | \g<boolean> | \g<string> | \g<array> | \g<object> ) \s* ){0}
    )
    \A \g<json> \Z
    /uix
end

########## inline test running
if __FILE__==$PROGRAM_NAME

  # support
  class String
    # strip the smallest common leading indentation from every line
    def unindent
      gsub(/^#{scan(/^(?!\n)\s*/).min_by{|l|l.length}}/u, "")
    end
  end

  require 'test/unit' unless defined? Test::Unit
  class JsonValidationTest < Test::Unit::TestCase
    include Constants

    def setup
    end

    def test_json_validator_simple_string
      assert_not_nil %s[ {"somedata": 5 }].match(JSON_VALIDATOR_RE)
    end

    def test_json_validator_deep_string
      long_json = <<-JSON.unindent
      {
          "glossary": {
              "title": "example glossary",
              "GlossDiv": {
                  "id": 1918723,
                  "boolean": true,
                  "title": "S",
                  "GlossList": {
                      "GlossEntry": {
                          "ID": "SGML",
                          "SortAs": "SGML",
                          "GlossTerm": "Standard Generalized Markup Language",
                          "Acronym": "SGML",
                          "Abbrev": "ISO 8879:1986",
                          "GlossDef": {
                              "para": "A meta-markup language, used to create markup languages such as DocBook.",
                              "GlossSeeAlso": ["GML", "XML"]
                          },
                          "GlossSee": "markup"
                      }
                  }
              }
          }
      }
      JSON

      assert_not_nil long_json.match(JSON_VALIDATOR_RE)
    end
  end
end
For "strings and numbers", I think that the partial regular expression for numbers:
-?(?:0|[1-9]\d*)(?:\.\d+)(?:[eE][+-]\d+)?
should be instead:
-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?\d+)?
since the decimal part of the number and the sign of the exponent are both optional, and it is also probably safer to escape the - symbol in [+-], since it has a special meaning between brackets.
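A quick way to see the difference between the two number patterns is to run both over a few samples. This is only an illustrative sketch in JavaScript: the ^ and $ anchors are added here so each sample must match in full, and the constant names are made up.

```javascript
// First pattern from above: fraction is mandatory, exponent needs a sign.
const STRICT_FRACTION = /^-?(?:0|[1-9]\d*)(?:\.\d+)(?:[eE][+-]\d+)?$/;
// Corrected pattern: fraction and exponent sign are both optional.
const CORRECTED = /^-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+\-]?\d+)?$/;

const samples = ['42', '-0.5', '1e10', '2.5E-3', '01', '1.', '.5', '+1'];

// Collect, per sample, whether each pattern accepts it.
const results = samples.map((s) => ({
  sample: s,
  strictFraction: STRICT_FRACTION.test(s),
  corrected: CORRECTED.test(s),
}));

console.table(results);
```

Plain integers such as 42 and unsigned exponents such as 1e10 are accepted only by the corrected pattern, while 01, 1., .5 and +1 are rejected by both, as the JSON grammar requires.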
\d is dangerous. In many regexp implementations \d matches the Unicode definition of a digit, which is not just [0-9] but also includes digits from other scripts. – Reminisce
A trailing comma in a JSON array caused my Perl 5.16 to hang, possibly because it kept backtracking. I had to add a backtrack-terminating directive:
(?<json> \s* (?: (?&number) | (?&boolean) | (?&string) | (?&array) | (?&object) )(*PRUNE) \s* )
^^^^^^^^
This way, once it identifies a construct that is not 'optional' (* or ?), it shouldn't try backtracking over it to try to identify it as something else.
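One plausible source of runaway backtracking in patterns like these is the string subpattern, where a starred character class sits inside a starred group, ([^"\\]* | ...)*, so the same run of characters can be split up in many ways. A comment earlier in this thread suggests dropping that inner star or making it possessive. JavaScript, used here purely for illustration, has neither possessive quantifiers nor (*PRUNE), so this sketch takes the first route: each loop iteration consumes exactly one character or one escape. The constant names are hypothetical.

```javascript
// Nested quantifier, as in the string subpatterns quoted in this thread.
// Kept for comparison only; do not run it on long unterminated input.
const AMBIGUOUS_STRING = /^"([^"\\]*|\\["\\bfnrt\/]|\\u[0-9a-fA-F]{4})*"$/;

// One character or one escape per iteration, so there is only one way
// to split any string body across loop iterations.
const LINEAR_STRING = /^"(?:[^"\\]|\\["\\bfnrt\/]|\\u[0-9a-fA-F]{4})*"$/;

const escapedUrl = '"https:\\/\\/example.invalid"';  // JSON string with \/ escapes
const unterminated = '"' + 'a'.repeat(5000);          // no closing quote

console.log(AMBIGUOUS_STRING.test(escapedUrl)); // both forms agree on valid input
console.log(LINEAR_STRING.test(escapedUrl));
console.log(LINEAR_STRING.test(unterminated));  // fails after a single linear pass
```

Both forms accept the same strings; the difference only shows up in how much work the engine does before giving up on input that cannot match.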
This one validates key (string) : value (string, integer, [{key:value},{key:value}], {key:value}):
^\{(\s|\n\s)*(("\w*"):(\s)*("\w*"|\d*|(\{(\s|\n\s)*(("\w*"):(\s)*("\w*(,\w+)*"|\d{1,}|\[(\s|\n\s)*(\{(\s|\n\s)*(("\w*"):(\s)*(("\w*"|\d{1,}))((,(\s|\n\s)*"\w*"):(\s)*("\w*"|\d{1,}))*(\s|\n)*\})){1}(\s|\n\s)*(,(\s|\n\s)*\{(\s|\n\s)*(("\w*"):(\s)*(("\w*"|\d{1,}))((,(\s|\n\s)*"\w*"):(\s)*("\w*"|\d{1,}))*(\s|\n)*\})?)*(\s|\n\s)*\]))((,(\s|\n\s)*"\w*"):(\s)*("\w*(,\w+)*"|\d{1,}|\[(\s|\n\s)*(\{(\s|\n\s)*(("\w*"):(\s)*(("\w*"|\d{1,}))((,(\s|\n\s)*"\w*"):(\s)*("\w*"|\d{1,}))*(\s|\n)*\})){1}(\s|\n\s)*(,(\s|\n\s)*\{(\s|\n\s)*(("\w*"):(\s)*(("\w*"|\d{1,}))((,(\s|\n\s)*"\w*"):("\w*"|\d{1,}))*(\s|\n)*\})?)*(\s|\n\s)*\]))*(\s|\n\s)*\}){1}))((,(\s|\n\s)*"\w*"):(\s)*("\w*"|\d*|(\{(\s|\n\s)*(("\w*"):(\s)*("\w*(,\w+)*"|\d{1,}|\[(\s|\n\s)*(\{(\s|\n\s)*(("\w*"):(\s)*(("\w*"|\d{1,}))((,(\s|\n\s)*"\w*"):(\s)*("\w*"|\d{1,}))*(\s|\n)*\})){1}(\s|\n\s)*(,(\s|\n\s)*\{(\s|\n\s)*(("\w*"):(\s)*(("\w*"|\d{1,}))((,(\s|\n\s)*"\w*"):(\s)*("\w*"|\d{1,}))*(\s|\n)*\})?)*(\s|\n\s)*\]))((,(\s|\n\s)*"\w*"):(\s)*("\w*(,\w+)*"|\d{1,}|\[(\s|\n\s)*(\{(\s|\n\s)*(("\w*"):(\s)*(("\w*"|\d{1,}))((,(\s|\n\s)*"\w*"):(\s)*("\w*"|\d{1,}))*(\s|\n)*\})){1}(\s|\n\s)*(,(\s|\n\s)*\{(\s|\n\s)*(("\w*"):(\s)*(("\w*"|\d{1,}))((,(\s|\n\s)*"\w*"):("\w*"|\d{1,}))*(\s|\n)*\})?)*(\s|\n\s)*\]))*(\s|\n\s)*\}){1}))*(\s|\n)*\}$
{
    "key":"string",
    "key": 56,
    "key":{
        "attr":"integer",
        "attr": 12
    },
    "key":{
        "key":[
            {
                "attr": 4,
                "attr": "string"
            }
        ]
    }
}
As was written above, if the language you use comes with a JSON library, use it to try decoding the string and catch the exception/error if it fails! If the language does not (I just had such a case with FreeMarker), the following regex could at least provide some very basic validation (it's written for PHP/PCRE to be testable/usable for more users). It's not as foolproof as the accepted solution, but also not that scary =):
~^\{\s*\".*\}$|^\[\n?\{\s*\".*\}\n?\]$~s
short explanation:
// we have two possibilities in case the string is JSON
// 1. the string passed is "just" a JSON object, e.g. {"item": [], "anotheritem": "content"}
// this can be matched by the following regex which makes sure there is at least a {" at the
// beginning of the string and a } at the end of the string, whatever is inbetween is not checked!
^\{\s*\".*\}$
// OR (character "|" in the regex pattern)
// 2. the string passed is a JSON array, e.g. [{"item": "value"}, {"item": "value"}]
// which would be matched by the second part of the pattern above
^\[\n?\{\s*\".*\}\n?\]$
// the s modifier is used to make "." also match newline characters (can happen in prettyfied JSON)
If I missed something that would break this unintentionally, I'm grateful for comments!
Here is my regexp for validating a JSON string:
^\"([^\"\\]*|\\(["\\\/bfnrt]{1}|u[a-f0-9]{4}))*\"$
It was written using the original syntax diagram.
I realize that this is from over 6 years ago. However, I think there is a solution that nobody here has mentioned that is way easier than regexing:
function isAJSON(string) {
    try {
        JSON.parse(string);
    } catch (e) {
        // JSON.parse reports malformed input with a SyntaxError
        if (e instanceof SyntaxError) return false;
        throw e; // anything else is unexpected, so don't hide it
    }
    return true;
}
:) – Hydrated
[...] (?R) version. – Spire
[...] json_decode, which despite the simplicity of JSON had around a dozen exploitabilities. Old PHP versions are still awfully widespread, so I'm using it as security addon. – Spire