Remove/collapse consecutive duplicate values in sequence
Asked Answered
I

5

24

I have the following dataframe:

a a a b c c d e a a b b b e e d d

The required result should be

a b c d e a b e d 

It means no two consecutive rows should have same value. How it can be done without using loop.

As my data set is quite huge, looping is taking lot of time to execute.

The dataframe structure is like the following

a 1 
a 2
a 3
b 2
c 4
c 1
d 3
e 9
a 4
a 8
b 10
b 199
e 2
e 5
d 4
d 10

Result:

a 1 
b 2
c 4
d 3
e 9
a 4
b 10
e 2
d 4

Its should delete the entire row.

Irritating answered 15/12, 2014 at 11:9 Comment(0)
S
28

One easy way is to use rle:

Here's your sample data:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items

rle returns a list with two values: the run length ("lengths"), and the value that is repeated for that run ("values").

rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

Update: For a data.frame

If you are working with a data.frame, try something like the following:

## Sample data
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e", 
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 
         4, 8, 10, 199, 2, 5, 4, 10)
)

## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1]  1  4  5  7  8  9 11 13 15
mydf[Y, ]
#    V1 V2
# 1   a  1
# 4   b  2
# 5   c  4
# 7   d  3
# 8   e  9
# 9   a  4
# 11  b 10
# 13  e  2
# 15  d  4

Update 2

The "data.table" package has a function rleid that lets you do this quite easily. Using mydf from above, try:

library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
#    rleid V2
# 1:     1  1
# 2:     2  2
# 3:     3  4
# 4:     4  3
# 5:     5  9
# 6:     6  4
# 7:     7 10
# 8:     8  2
# 9:     9  4
Soilure answered 15/12, 2014 at 11:11 Comment(2)
How can i use this in the data.frame? If i would use this in list, then again i have to map this unique values with the old data.frame but then its not possible to map as the length is less. What my purpose is to remove the entire row of the dataframe whenever i would get the same value in consecutive rows for a particular column.Irritating
Nice answer! For you first data.frame solution, I found I needed X <- rle(as.numeric(mydf$V1)), as V1 is a factor. Other remark: I found that in some cases cumsum(X$lengths) would do the job, depending on which duplicate rows you wish to keep (top to bottom v. bottom to top), do you confirm?Yhvh
S
13
library(dplyr)
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
x[x!=lag(x, default=1)]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

EDIT: For data.frame

  mydf <- data.frame(
    V1 = c("a", "a", "a", "b", "c", "c", "d", "e", 
         "a", "a", "b", "b", "e", "e", "d", "d"),
    V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 
         4, 8, 10, 199, 2, 5, 4, 10),
   stringsAsFactors=FALSE)

dplyr solution is one liner:

mydf %>% filter(V1!= lag(V1, default="1"))
#  V1 V2
#1  a  1
#2  b  2
#3  c  4
#4  d  3
#5  e  9
#6  a  4
#7  b 10
#8  e  2
#9  d  4

post scriptum

lead(x,1) suggested by @Carl Witthoft iterates in reverse order.

leadit<-function(x) x!=lead(x, default="what")
rows <- leadit(mydf[ ,1])
mydf[rows, ]

#   V1  V2
#3   a   3
#4   b   2
#6   c   1
#7   d   3
#8   e   9
#10  a   8
#12  b 199
#14  e   5
#16  d  10
Stu answered 15/12, 2014 at 11:22 Comment(1)
Could it be that you're using dplyr::lag here? I tried your code with a clean session and it doenst work with stats::lagFarfetched
S
6

With base R, I like funny algorithmics:

x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")

x[x!=c(x[-1], FALSE)]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
Seventeen answered 15/12, 2014 at 11:26 Comment(3)
Similarly could use indexing instead of tail, something like x[x != c(x[-1], FALSE)]Neighborly
I struggled to adapt this to factors inside a data.frame. Within a dataframe, the rhs, FALSE constrains the vector to be of type integer, so on the lhs x must be so constrained for the comparison to be possible, with as.integer(). Correct me if I'm wrong!Yhvh
sorry but it's very unclear ... you can ask a question on SO if you are meeting an issue on such a topic, with your input, output and what you have done so far.Seventeen
T
3

Much as I like,... errr, love rle , here's a shootoff:

EDIT: Can't figure out exactly what's up with dplyr so I used dplyr::lead . I'm on OSX, R3.1.2, and latest dplyr from CRAN.

xlet<-sample(letters,1e5,rep=T)
rleit<-function(x) rle(x)$values
lagit<-function(x) x[x!=lead(x, default=1)]
tailit<-function(x) x[x!=c(tail(x,-1), tail(x,1))]



  microbenchmark(rleit(xlet),lagit(xlet),tailit(xlet),times=20)
Unit: milliseconds
         expr      min       lq   median       uq      max neval
  rleit(xlet) 27.43996 30.02569 30.20385 30.92817 37.10657    20
  lagit(xlet) 12.44794 15.00687 15.14051 15.80254 46.66940    20
 tailit(xlet) 12.48968 14.66588 14.78383 15.32276 55.59840    20
Thorfinn answered 15/12, 2014 at 11:27 Comment(7)
Did lagit work for you? It returns nothing to me. Maybe this is why its the fastest....Neighborly
I think you need to re run this after loading dplyr, as I suspect lagit isnt doing anything currently and thus very fastNeighborly
Oops, yep-- I didn't check the outputs. But what's up? I see "lag" listed under "leadlag" but there's no actual function dplyr::lagThorfinn
The tailit function in your benchmark is missing the last "d" in the vector.. perhaps update to x[x!=c(x[-1], FALSE)] as in the answerFarfetched
Seems like lead is little slower than lag. Why is that so? github.com/hadley/dplyr/blob/master/R/lead-lag.RStu
@DavidArenburg Upon reflection, lagit returned zero because I had a wide range of values so there weren't any repeats. If you rerun my case with xlet<-sample(letters[1:7],1e5,rep=T) you'll get some hits.Thorfinn
Is there any LAG function equivalent in R ?? (#20637429) I have tried this but did n't work.Irritating
D
0

Tidyverse solution:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x)
x |> 
 mutate(id = consecutive_id(x)) |> 
 distinct(x, id)

In addition, if there is another column y associated with the consecutive values column, this solution allows some flexibility:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x, y = runif(length(x)))
x |> 
    group_by(id = consecutive_id(x)) |> 
    slice_min(y)

We can choose between the different slice functions, like slice_max, slice_min, slice_head, and slice_tail.

This Stack Overflow thread appeared in the second edition of R4DS, in the Numbers chapter of the book.

Dilatation answered 22/1, 2023 at 22:4 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.