Find and replace numeric sequence in R
S

4

12

I have a dataframe with a sequence of numbers similar to below:

data <- c(1,1,1,0,0,1,1,2,2,2,0,0,0,2,1,1,0,1,0,2)

What I need is something to locate every run of one, two or three consecutive zeros where the preceding and following numbers are identical, i.e. both 1 or both 2 (for example 1,0,1 or 2,0,0,2 but NOT 2,0,1).

Then I need to fill the zeros only with the surrounding value.

I have managed to locate and count consecutive zeros

consec <- (!data) * unlist(lapply(rle(data)$lengths, seq_len))

then I have found the row where these consecutive zeros begin with:

consec <- as.matrix(consec)
first_na <- which(consec==1,arr.ind=TRUE)

But I'm stumped by the replacement step.

I would really appreciate your help with this!

Carl

Scandium answered 25/2, 2013 at 12:33 Comment(0)
G
2

Since there seems to be a lot of interest in the answer to this question, I thought I would write up an alternative regular expressions method for posterity.

Using the 'gregexpr' function, you can search out patterns and use the resulting location matches and match lengths to call out which values to change in the original vector. The advantage of using regular expressions is that we can be explicit about exactly which patterns we want to match, and as a result, we won't have any exclusion cases to worry about.

Note: The following example works as written, because we are assuming single-digit values. We could easily adapt it for other patterns, but we can take a small shortcut with single characters. If we wanted to do this with possible multiple-digit values, we would want to add a separation character as part of the first concatenation ('paste') function.


The Code

str.values <- paste(data, collapse="") # String representation of vector
str.matches <- gregexpr("1[0]{1,3}1", str.values) # Pattern 101/1001/10001
data[eval(parse(text=paste("c(",paste(str.matches[[1]] + 1, str.matches[[1]] - 2 + attr(str.matches[[1]], "match.length"), sep=":", collapse=","), ")")))] <- 1 # Replace zeros with ones
str.matches <- gregexpr("2[0]{1,3}2", str.values) # Pattern 202/2002/20002
data[eval(parse(text=paste("c(",paste(str.matches[[1]] + 1, str.matches[[1]] - 2 + attr(str.matches[[1]], "match.length"), sep=":", collapse=","), ")")))] <- 2 # Replace zeros with twos

Step 1: Make a single string of all the data values.

str.values <- paste(data, collapse="")
# "11100112220002110102"

This collapses down the data into one long string, so we can use a regular expression on it.

Step 2: Apply a regular expression to find the locations and lengths of any matches within the string.

str.matches <- gregexpr("1[0]{1,3}1", str.values)
# [[1]]
# [1]  3 16
# attr(,"match.length")
# [1] 4 3
# attr(,"useBytes")
# [1] TRUE

In this case, we're using a regular expression to look for the first pattern: one to three zeros ([0]{1,3}) with ones on either side (1[0]{1,3}1). We match the entire pattern, including the surrounding ones, so that we don't have to check separately that the values on both ends agree. We'll subtract those ends off in the next step.

Step 3: Write ones into all the matching locations in the original vector.

data[eval(parse(text=paste("c(",paste(str.matches[[1]] + 1, str.matches[[1]] - 2 + attr(str.matches[[1]], "match.length"), sep=":", collapse=","), ")")))] <- 1
# 1 1 1 1 1 1 1 2 2 2 0 0 0 2 1 1 1 1 0 2

We're doing a few steps all at once here. First, we are creating a list of number sequences from the numbers that matched in the regular expression. In this case, there are two matches, which start at indexes 3 and 16 and are 4 and 3 items long, respectively. This means our zeros are located at indexes (3+1):(3-2+4), or 4:5 and at (16+1):(16-2+3), or 17:17. We concatenate ('paste') these sequences using the 'collapse' option again, in case there are multiple matches. Then, we use a second concatenation to put the sequences inside of a combine (c()) function. Using the 'eval' and 'parse' functions, we turn this text into code and pass it as index values to the [data] array. We write all ones into those locations.
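The same index arithmetic can be written without building and evaluating code as text. The following is a minimal sketch rather than the original method: it computes the start and end of each zero run directly and expands them with Map() and seq(). The names m, starts, ends and idx are introduced here for illustration. It also guards against the no-match case, where gregexpr() returns -1; without that guard, the one-liner above appears to produce the index 0:-4, which would overwrite everything except the first four elements.

```r
data <- c(1,1,1,0,0,1,1,2,2,2,0,0,0,2,1,1,0,1,0,2)
str.values <- paste(data, collapse = "")

m <- gregexpr("1[0]{1,3}1", str.values)[[1]]

if (m[1] != -1) {  # gregexpr() returns -1 when nothing matches
  starts <- as.integer(m) + 1                            # first zero of each match
  ends   <- as.integer(m) + attr(m, "match.length") - 2  # last zero of each match
  idx    <- unlist(Map(seq, starts, ends))               # here: 4, 5, 17
  data[idx] <- 1
}

data
```

The same block can be repeated with the pattern "2[0]{1,3}2" and an assigned value of 2.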

Step x: Repeat for each pattern. In this case, we need to do a second search and find one to three zeros with twos on either side and then run the same statement as Step 3, but assigning twos, instead of ones.

str.matches <- gregexpr("2[0]{1,3}2", str.values)
# [[1]]
# [1] 10
# attr(,"match.length")
# [1] 5
# attr(,"useBytes")
# [1] TRUE

data[eval(parse(text=paste("c(",paste(str.matches[[1]] + 1, str.matches[[1]] - 2 + attr(str.matches[[1]], "match.length"), sep=":", collapse=","), ")")))] <- 2
# 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 0 2
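If there are more than two surrounding values, repeating the search once per value gets tedious. One possible variation, sketched here under the same single-digit assumption, uses a backreference so that a single pattern covers every non-zero digit. The function name fill_zero_runs and its max_run argument are hypothetical and not part of the original answer.

```r
fill_zero_runs <- function(x, max_run = 3) {
  s <- paste(x, collapse = "")
  # ([1-9]) captures the left neighbour; \\1 requires the same digit on the right
  pattern <- sprintf("([1-9])0{1,%d}\\1", max_run)
  m <- gregexpr(pattern, s, perl = TRUE)[[1]]
  if (m[1] == -1) return(x)                 # nothing to fill
  starts <- as.integer(m)
  ends   <- starts + attr(m, "match.length") - 1
  for (k in seq_along(starts)) {
    # Overwrite the whole match with its first element (the surrounding value)
    x[starts[k]:ends[k]] <- x[starts[k]]
  }
  x
}

data <- c(1,1,1,0,0,1,1,2,2,2,0,0,0,2,1,1,0,1,0,2)
fill_zero_runs(data)
```

One caveat: gregexpr() matches do not overlap, so in a vector like 1,0,1,0,1 the middle 1 is consumed by the first match and only the first zero is filled. The step-by-step version above has the same limitation.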

Update: I realized the original problem said to match one to three zeros in a row, rather than the "two or more" that I had written into the original code. I have updated the regular expressions and the explanation; the rest of the code remains the same.

Gerhart answered 25/2, 2013 at 17:22 Comment(1)
So, I actually went for this one in the end. I loved the ability to have control over the patterns, but I appreciated all the suggestions. I will keep note of these different methods for different circumstances though. Really appreciate it. - Scandium
S
14

Here is a loopless solution using rle() and inverse.rle().

data <- c(1,1,1,0,0,1,1,2,2,2,0,0,0,2,1,1,0,1,0,2)

local({
  r <- rle(data)
  x <- r$values
  x0 <- which(x==0) # index positions of zeroes
  xt <- x[x0-1]==x[x0+1] # zeroes surrounded by same value
  r$values[x0[xt]] <- x[x0[xt]-1] # substitute with surrounding value
  inverse.rle(r)
})

[1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 0 2

PS. I use local() as a simple mechanism to not clobber the workspace with loads of new temporary objects. You could create a function instead of using local - I just find I use local a lot nowadays for this type of task.


PPS. You will have to modify this code to exclude leading or trailing zeroes in your original data.
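As a rough sketch of that modification (an illustration, not the code from this answer), the version below skips zero runs at either end of the vector and, to match the question, only fills runs of at most three zeros. The function name fill_runs_rle and the max_run argument are made up for this example.

```r
fill_runs_rle <- function(x, max_run = 3) {
  r <- rle(x)
  n <- length(r$values)
  if (n < 3) return(x)                          # no interior run to fill
  i <- which(r$values == 0 & r$lengths <= max_run)
  i <- i[i > 1 & i < n]                         # drop leading/trailing zero runs
  ok <- r$values[i - 1] == r$values[i + 1]      # neighbours on both sides agree
  r$values[i[ok]] <- r$values[i[ok] - 1]        # take the surrounding value
  inverse.rle(r)
}

data <- c(1,1,1,0,0,1,1,2,2,2,0,0,0,2,1,1,0,1,0,2)
fill_runs_rle(data)

# Leading/trailing zeros and a run of four zeros are left alone
fill_runs_rle(c(0,1,0,0,0,0,1,0,1,0))
```

Wrapping the logic in a function serves the same purpose as local() here: the temporary objects stay out of the workspace.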

Singularity answered 25/2, 2013 at 13:29 Comment(1)
This is exactly the way the 'rle' function should be used, and I'm glad you wrote it up so clearly. The 'local' function is a good tip, too. I do approximately the same thing by wrapping lots of my code in functions (also good for debugging), and I think it's a good thing for people to learn in general. Good work, Andrie. - Gerhart
W
1

There may be a solution without a for loop, but you can try this:

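tmp <- rle(data)
val <- tmp$values
# Visit interior runs only; seq_len() avoids a backwards 2:1 range on short input
for (i in seq_len(max(length(val) - 2, 0)) + 1) {
  # A zero run whose neighbouring runs share a value takes that value
  if (val[i] == 0 && val[i - 1] == val[i + 1]) val[i] <- val[i - 1]
}
tmp$values <- val
inverse.rle(tmp)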

Which gives:

[1] 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 0 2
Wira answered 25/2, 2013 at 12:44 Comment(7)
I think you can "tighten up" this by doing rle(as.logical(data)), which will fill your tmp with lengths of 'zero' and 'not-zero', after which you can replace every run of zeroes with something like val[i-1]*(val[i-1]==val[i+1]). (In case I screwed that up, the intent is to replace the zeroes with val[i-1] but only when the equality check is TRUE) -- tho' this would have to be rather carefully :-( un-rle-ed. - Schrock
@CarlWitthoft Hmm, if you use rle(as.logical(data)) you can't use your rle$values to test for value equality anymore? - Wira
Nevvamind -- Andrie's answer does what I was thinking of even more compactly (and reliably). - Schrock
A loop was my first thought but I just couldn't get there! Thanks very much for this, it worked a treat! - Scandium
Maybe you should accept @Andrie's answer instead of mine? It was much clearer and largely upvoted... - Wira
It doesn't really matter. It's up to the OP which answer to choose, and we have some good answers on here for other people to find later. - Gerhart
@Dinre, while the OP's choice is final, it's also the community's choice to steer/educate the OP as to what's the best answer (with the help of voting and, if that doesn't help much, with comments). Of course after that the OP can choose what he wants as well. Someone who googles and lands on this page is not guaranteed to find the answer from Andrie (unless one spends quite some time on SO). - Graphy
A
0

For those looking into this in 2020: I did the sequence replacement using just gsub.

str.values <- paste(YOUR$COLUMN, collapse="") 
str.values2 <- gsub("ORIGINAL PATTERN","PATTERN TO REPLACE", str.values)
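To make the placeholders concrete, here is one way this could look for the data in the question. It is only a sketch under the single-digit assumption: it spells out each literal pattern (101, 1001, 10001, 202, ...) with strrep(), replaces it with a run of the same length, and then splits the string back into a numeric vector. The variable names s, from, to and filled are illustrative.

```r
data <- c(1,1,1,0,0,1,1,2,2,2,0,0,0,2,1,1,0,1,0,2)
s <- paste(data, collapse = "")

for (v in c("1", "2")) {
  for (n in 1:3) {
    from <- paste0(v, strrep("0", n), v)  # e.g. "1001"
    to   <- strrep(v, n + 2)              # e.g. "1111", same length as 'from'
    s <- gsub(from, to, s, fixed = TRUE)
  }
}

# Convert the string back into a numeric vector
filled <- as.numeric(strsplit(s, "")[[1]])
filled
```

Because every replacement has the same length as the text it replaces, the positions in the string still line up with the positions in the original vector.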
Arevalo answered 25/5, 2020 at 14:11 Comment(0)
