Extracting numbers from sentences
M

4

5

I need to extract some numbers from a text. Text is

x <- "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295. naturae;"

The numbers to be extracted are 325 and 232. These are inside brackets and at the end of a sentence. All other numbers should be excluded. I tried strsplit(x, "[A-Za-z]+"), but it does not give me what I need.

Melbourne answered 24/8, 2014 at 18:20 Comment(1)
I'm curious about the downvote here? - Aftertime
A
5

Here's a stringi approach

x <- "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295. naturae; Claudii libidini, qui tum erat summo ne imperio, dederetur"

library(stringi)
stri_extract_all_regex(x, "(?<=[\\[(])\\d+(?=[\\])][.?!])")

## [[1]]
## [1] "325" "232"
Aftertime answered 24/8, 2014 at 18:32 Comment(2)
Curious, doesn't qdap have a function for getting text between brackets? I thought I saw you use it a few times before. - Bridegroom
Yeah, it has bracketXtract, but this regex is less general (it forces digits between the brackets) and thus more accurate. And I'm becoming a big fan of the stringi package, with its fast, consistent results. - Aftertime
K
4

Another one:

r <- gregexpr("[[(]\\d+[])](?=\\.)", x, perl = TRUE)
(m <- regmatches(x, r)[[1]])
# [1] "(325)" "[232]"

as.integer(gsub("\\D", "", m))
# [1] 325 232
Klehm answered 24/8, 2014 at 18:34 Comment(0)
D
3

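Here is a solution using strsplit. Note that it picks the values by position (the 3rd and 4th elements of the split), so it only works for this exact string.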

> x <- 'Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295. naturae;'
> strsplit(x, '[^0-9]+')[[1]][3:4]
## [1] "325" "232"

Or using base R to extract these values.

> regmatches(x, gregexpr('[[(]\\K\\d+(?=[])](?!,))', x, perl=TRUE))[[1]]
## [1] "325" "232"
Deutoplasm answered 24/8, 2014 at 20:27 Comment(0)
M
0

With Python's re module:

import re

string = "Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? Sequitur disserendi ratio cognitioque 295. naturae;"

print(string)

# Digits preceded by "[" or "(" and followed by "]" or ")" and then a period
pattern = re.compile(r'(?<=[\[(])\d+(?=[\])]\.)')

result = pattern.findall(string)

print(result)  # ['325', '232']
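
As a rough sketch building on the pattern above, the same idea could be wrapped in a small helper that returns integers and also accepts "?" and "!" as sentence enders, the way the stringi answer does. The helper name extract_bracketed_numbers is hypothetical and only for illustration, and treating "?" and "!" as valid enders is an assumption that goes beyond what the question asks for.

```python
import re

# Digits preceded by "[" or "(" and followed by "]" or ")" plus a sentence ender.
# Accepting "?" and "!" in addition to "." is an assumption borrowed from the
# stringi answer; narrow the last character class to r"\." if only periods count.
PATTERN = re.compile(r'(?<=[\[(])\d+(?=[\])][.?!])')


def extract_bracketed_numbers(text):
    """Return bracketed numbers that sit at the end of a sentence, as integers."""
    return [int(match) for match in PATTERN.findall(text)]


string = ("Lorem ipsum dolor sit amet[245], consectetur adipiscing (325). "
          "Deinde prima illa, quae in congressu[232]. solemus: Quid tu, inquit, huc? "
          "Sequitur disserendi ratio cognitioque 295. naturae;")

print(extract_bracketed_numbers(string))  # [325, 232]
```

Because the lookbehind and lookahead do not consume the brackets, findall returns only the digits, so no second clean-up step is needed before converting to int.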
Malone answered 30/10, 2014 at 17:45 Comment(0)
