Perhaps something like this?
(\d+)[^,.\d\n]+?(?=arrest|custody)|(?<=arrest|custody)[^,.\d\n]+?(\d+)
Regex101
Keep in mind that this will not match numbers written out as words (e.g., five people were arrested), so you would have to add that yourself if desired.
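As a rough sketch, here is how the pattern might be used from Python. One assumption to flag: Python's built-in re module only accepts fixed-width lookbehinds, so (?<=arrest|custody) is split into two separate lookbehinds here. That is an adaptation for this engine, not the pattern as written above, and the name find_counts is made up for the example.

```python
import re

# Adaptation for Python: (?<=arrest|custody) mixes a 6-character and a
# 7-character alternative, which re rejects, so each word gets its own
# fixed-width lookbehind. The rest of the pattern is unchanged.
PATTERN = re.compile(
    r"(\d+)[^,.\d\n]+?(?=arrest|custody)"
    r"|(?:(?<=arrest)|(?<=custody))[^,.\d\n]+?(\d+)"
)

def find_counts(text):
    """Return every number found near 'arrest' or 'custody' as an int."""
    # Only one of the two capture groups takes part in any given match.
    return [int(m.group(1) or m.group(2)) for m in PATTERN.finditer(text)]

print(find_counts("Police said 12 people were arrested."))
print(find_counts("Officers arrested 3 men."))
print(find_counts("Five people were arrested."))  # number words are not matched
```

Engines that allow alternatives of different lengths inside a lookbehind (PCRE, for example) can use the original pattern unchanged.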
Breaking down the pattern
(\d+)[^,.\d\n]+?(?=arrest|custody)
First alternative, used when the number comes before the watched terms.
(\d+)
the number to capture, with + meaning one or more digits
[^,.\d\n]+?
matches any character except a comma (,), period (.), digit (\d), or newline (\n), with +? meaning one or more times (lazy). These exclusions prevent false positives across sentences: the number and the term must be in the same sentence.
(?=arrest|custody)
positive lookahead checking that either word follows
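To see the same-sentence restriction at work, the first alternative can be tried on its own. This is only an illustrative sketch in Python; the sample sentences and the name BEFORE are invented for the example.

```python
import re

# First alternative only: number, a same-sentence gap, then the watched term.
BEFORE = re.compile(r"(\d+)[^,.\d\n]+?(?=arrest|custody)")

same_sentence = "Officers said 4 men were arrested overnight."
split_by_period = "The crowd numbered 300. Several arrests followed."

# The gap " men were " contains no comma, period, digit, or newline.
print(BEFORE.findall(same_sentence))    # ['4']

# The period right after 300 cannot be part of the gap, so nothing matches.
print(BEFORE.findall(split_by_period))  # []
```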
(?<=arrest|custody)[^,.\d\n]+?(\d+)
Second alternative, used when the number comes after the watched terms.
(?<=arrest|custody)
positive lookbehind checking that either word comes before the number
[^,.\d\n]+?
matches any character except a comma (,), period (.), digit (\d), or newline (\n), with +? meaning one or more times (lazy). These exclusions prevent false positives across sentences: the number and the term must be in the same sentence.
(\d+)
the number to capture, with + meaning one or more digits
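The second alternative needs a little care depending on the engine. The sketch below assumes Python, whose re module rejects a lookbehind with alternatives of different lengths, so the lookbehind is written as two fixed-width ones. The names AFTER and compiles are hypothetical, chosen just for this example.

```python
import re

def compiles(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# 'arrest' is 6 characters and 'custody' is 7, so re refuses this lookbehind.
print(compiles(r"(?<=arrest|custody)[^,.\d\n]+?(\d+)"))  # False

# Equivalent form with one fixed-width lookbehind per word.
AFTER = re.compile(r"(?:(?<=arrest)|(?<=custody))[^,.\d\n]+?(\d+)")

print(AFTER.findall("Police arrested 7 suspects."))         # ['7']
print(AFTER.findall("In custody are 2 men."))               # ['2']
print(AFTER.findall("One arrest was made. Later 9 left."))  # []
```

The last call returns nothing because the period between "made" and "Later" cannot be part of the gap, which is the same-sentence restriction described above.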
Miscellaneous Notes
If you want to match textual representations of numbers as well, you would add them as alternatives inside each (\d+) capturing group.
If you have any additional terms to watch for other than arrest or custody, you would add those terms to both lookaround groups (the lookahead and the lookbehind).
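Both notes could be combined along the following lines. This is a minimal sketch under several assumptions: it targets Python (so each term gets its own fixed-width lookbehind), the word list stops at ten, "detain" is just an example of an extra term, and build_pattern, find_counts, and NUMBER_WORDS are hypothetical names.

```python
import re

# Illustrative word list only; extend it as far as your data requires.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def build_pattern(terms):
    """Build the two-alternative pattern for the given watched terms."""
    # Digits or a whole number word, in place of the plain (\d+) group.
    number = r"(\d+|\b(?:" + "|".join(NUMBER_WORDS) + r")\b)"
    gap = r"[^,.\d\n]+?"
    ahead = "(?=" + "|".join(re.escape(t) for t in terms) + ")"
    # One fixed-width lookbehind per term keeps Python's re module happy.
    behind = "(?:" + "|".join("(?<=" + re.escape(t) + ")" for t in terms) + ")"
    return re.compile(number + gap + ahead + "|" + behind + gap + number,
                      re.IGNORECASE)

def find_counts(text, terms=("arrest", "custody", "detain")):
    """Return the numbers found near any watched term, as ints."""
    results = []
    for m in build_pattern(terms).finditer(text):
        token = (m.group(1) or m.group(2)).lower()
        results.append(int(token) if token.isdigit() else NUMBER_WORDS[token])
    return results

print(find_counts("Five people were arrested."))
print(find_counts("Police detained 14 protesters."))
```

Building the pattern from a list keeps the lookahead and the lookbehind in sync, so a new term only has to be added in one place.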