Regular expression to extract number before/after word

I have 10000 descriptions and I want to use regular expressions to extract the number associated with the word "arrested".

For example:

"police arrests 4 people"
"7 people were arrested". 

The numbers range from 1-99.

I have tried the following code:

gen arrest= regexm(description, "(^[1-9][0-9]$)[ ]*(arrests|arrested)")

I cannot simply extract just the number, because the descriptions also mention numbers that have nothing to do with arrests.

Unutterable answered 14/11, 2018 at 1:13
A
4

You can use this regex:

(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))

It splits the search into two alternatives, depending on whether the number comes before or after 'arrests|arrested'.

Each alternative is wrapped in a non-capturing group. Inside it, the capturing group ([1-9]?[0-9]) matches an optional digit from 1-9 followed by a digit from 0-9, i.e. a number from 0 to 99. This is followed by 0 to 20 letters or spaces (the words in between) and then 'arrests' or 'arrested'. The second alternative handles the opposite situation, where the number comes after the word.

This will match if the number is within 20 characters of 'arrests|arrested' and only letters and spaces lie in between.
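
To see which side of the alternation fires, the two halves can also be run one after the other, each with a single capturing group. Below is a minimal Stata sketch, assuming Stata 14 or later (for ustrregexm() and ustrregexs()) and a string variable named description as in the question. The variable names arrest_str and arrest are made up for illustration.

```stata
clear
input str80 description
"police arrests 4 people"
"7 people were arrested"
end

* Alternative 1: the number comes before the word (capture group 1)
generate arrest_str = ustrregexs(1) if ustrregexm(description, "([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested)")

* Alternative 2: the number comes after the word; only tried where alternative 1 found nothing
replace arrest_str = ustrregexs(1) if arrest_str == "" & ustrregexm(description, "(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9])")

* Convert the captured text to a numeric variable
generate arrest = real(arrest_str)
list description arrest
```

Running the halves separately means group 1 always holds the number, so there is no need to check which group of the combined pattern was filled.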

Aureole answered 14/11, 2018 at 1:58
B
2

Perhaps something like this?

(\d+)[^,.\d\n]+?(?=arrest|custody)|(?<=arrest|custody)[^,.\d\n]+?(\d+)

Regex101

Keep in mind that this will not match numbers written out as words (e.g., five people were arrested), so you would have to add those to the pattern if desired.


Breaking down the pattern

  • (\d+)[^,.\d\n]+?(?=arrest|custody) First alternative: the number comes before the watched terms
    • (\d+) the number to capture; + means one or more digits
    • [^,.\d\n]+? matches any character except a comma, a period, a digit \d, or a newline \n, one or more times (lazy). Excluding these characters prevents false positives from other sentences (the number and the term must be in the same sentence)
    • (?=arrest|custody) positive lookahead checking that either word follows
  • (?<=arrest|custody)[^,.\d\n]+?(\d+) Second alternative: the number comes after the watched terms
    • (?<=arrest|custody) positive lookbehind checking that either word precedes the number
    • [^,.\d\n]+? matches any character except a comma, a period, a digit \d, or a newline \n, one or more times (lazy). Excluding these characters prevents false positives from other sentences (the number and the term must be in the same sentence)
    • (\d+) the number to capture; + means one or more digits

Miscellaneous Notes

If you want to add textual representations of your numbers, then you would incorporate that into the (\d+) capturing group.

If you have any additional terms to watch for other than arrest or custody, you would add those terms to both lookaround groups.
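
Since the question is about Stata, here is a hedged sketch of how the two lookaround alternatives might be applied there. It assumes Stata 14 or later, whose ustrregexm() is built on the ICU regex engine; support for the lookahead and the bounded lookbehind is a property of that engine. The variable names descr, num_str and num and the sample sentences are invented for illustration.

```stata
clear
input str60 descr
"police arrests 4 people"
"7 people were arrested"
"2 suspects taken into custody"
"3 shops burned. Police made arrests"
end

* First alternative: number before the term, checked with a lookahead
generate num_str = ustrregexs(1) if ustrregexm(descr, "(\d+)[^,.\d\n]+?(?=arrest|custody)")

* Second alternative: number after the term, checked with a lookbehind
replace num_str = ustrregexs(1) if num_str == "" & ustrregexm(descr, "(?<=arrest|custody)[^,.\d\n]+?(\d+)")

* Convert the captured text to a numeric variable
generate num = real(num_str)
list descr num
```

The last sample sentence is there to exercise the sentence guard: the period between "3" and "arrests" stops the first alternative, and no digit follows "arrests", so num should stay missing for that row.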

Birthstone answered 14/11, 2018 at 1:38
G
2

The following works for me (solution based on @PoulBak's idea). Note that var2 holds the whole matched phrase, since ustrregexs(0) returns the entire match:

clear

input strL var1
"This is 1 long string saying that police arrests 4 people"
"3 news outlets today reported that 7 people were arrested"
"several witnesses saw 5 people arrested and other 3 killed"
end

generate var2 = ustrregexs(0) if ustrregexm(var1, "(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))")

list

   +-------------------------------------------------------------------------------------+
   |                                                       var1                     var2 |
   |-------------------------------------------------------------------------------------|
1. |  This is 1 long string saying that police arrests 4 people                arrests 4 |
2. |  3 news outlets today reported that 7 people were arrested   7 people were arrested |
3. | several witnesses saw 5 people arrested and other 3 killed        5 people arrested |
   +-------------------------------------------------------------------------------------+
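
To reduce var2 to the number alone, one option is a second pass that pulls the digits out of the matched phrase. This is only a sketch built on the example data above, not part of the original answer; the variable name arrests is an assumption.

```stata
clear
input strL var1
"This is 1 long string saying that police arrests 4 people"
"3 news outlets today reported that 7 people were arrested"
"several witnesses saw 5 people arrested and other 3 killed"
end

* Pass 1: keep the phrase that links a number to arrests/arrested
generate var2 = ustrregexs(0) if ustrregexm(var1, "(?:([1-9]?[0-9])[a-zA-Z ]{0,20}(?:arrests|arrested))|(?:(?:arrests|arrested)[a-zA-Z ]{0,20}([1-9]?[0-9]))")

* Pass 2: the matched phrase contains exactly one number, so extract its digits
generate arrests = real(ustrregexs(0)) if ustrregexm(var2, "[0-9]+")

list var2 arrests
```

The second pass is safe here because the pattern allows only letters and spaces between the number and the word, so each matched phrase can contain a single run of digits.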
Garber answered 14/11, 2018 at 10:10
