Remove/collapse consecutive duplicate values in sequence

Asked 15/12, 2014 at 11:9 Answered 22/1, 2023 at 22:4

I have the following dataframe:

a a a b c c d e a a b b b e e d d

The required result should be

a b c d e a b e d

It means no two consecutive rows should have same value. How it can be done without using loop.

As my data set is quite huge, looping is taking lot of time to execute.

The dataframe structure is like the following

a 1 
a 2
a 3
b 2
c 4
c 1
d 3
e 9
a 4
a 8
b 10
b 199
e 2
e 5
d 4
d 10

Result:

a 1 
b 2
c 4
d 3
e 9
a 4
b 10
e 2
d 4

Its should delete the entire row.

Irritating answered 15/12, 2014 at 11:9 Comment(0)

One easy way is to use rle:

Here's your sample data:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items

rle returns a list with two values: the run length ("lengths"), and the value that is repeated for that run ("values").

rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

Update: For a `data.frame`

If you are working with a data.frame, try something like the following:

## Sample data
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e", 
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 
         4, 8, 10, 199, 2, 5, 4, 10)
)

## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1]  1  4  5  7  8  9 11 13 15
mydf[Y, ]
#    V1 V2
# 1   a  1
# 4   b  2
# 5   c  4
# 7   d  3
# 8   e  9
# 9   a  4
# 11  b 10
# 13  e  2
# 15  d  4

Update 2

The "data.table" package has a function rleid that lets you do this quite easily. Using mydf from above, try:

library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
#    rleid V2
# 1:     1  1
# 2:     2  2
# 3:     3  4
# 4:     4  3
# 5:     5  9
# 6:     6  4
# 7:     7 10
# 8:     8  2
# 9:     9  4

Soilure answered 15/12, 2014 at 11:11 Comment(2)

How can i use this in the data.frame? If i would use this in list, then again i have to map this unique values with the old data.frame but then its not possible to map as the length is less. What my purpose is to remove the entire row of the dataframe whenever i would get the same value in consecutive rows for a particular column. – Irritating 15/12, 2014 at 12:7

Nice answer! For you first data.frame solution, I found I needed X <- rle(as.numeric(mydf$V1)), as V1 is a factor. Other remark: I found that in some cases cumsum(X$lengths) would do the job, depending on which duplicate rows you wish to keep (top to bottom v. bottom to top), do you confirm? – Yhvh 21/6, 2016 at 8:20

library(dplyr)
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
x[x!=lag(x, default=1)]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

EDIT: For data.frame

  mydf <- data.frame(
    V1 = c("a", "a", "a", "b", "c", "c", "d", "e", 
         "a", "a", "b", "b", "e", "e", "d", "d"),
    V2 = c(1, 2, 3, 2, 4, 1, 3, 9, 
         4, 8, 10, 199, 2, 5, 4, 10),
   stringsAsFactors=FALSE)

dplyr solution is one liner:

mydf %>% filter(V1!= lag(V1, default="1"))
#  V1 V2
#1  a  1
#2  b  2
#3  c  4
#4  d  3
#5  e  9
#6  a  4
#7  b 10
#8  e  2
#9  d  4

post scriptum

lead(x,1) suggested by @Carl Witthoft iterates in reverse order.

leadit<-function(x) x!=lead(x, default="what")
rows <- leadit(mydf[ ,1])
mydf[rows, ]

#   V1  V2
#3   a   3
#4   b   2
#6   c   1
#7   d   3
#8   e   9
#10  a   8
#12  b 199
#14  e   5
#16  d  10

Stu answered 15/12, 2014 at 11:22 Comment(1)

Could it be that you're using dplyr::lag here? I tried your code with a clean session and it doenst work with stats::lag – Farfetched 15/12, 2014 at 11:31

With base R, I like funny algorithmics:

x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")

x[x!=c(x[-1], FALSE)]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"

Seventeen answered 15/12, 2014 at 11:26 Comment(3)

Similarly could use indexing instead of tail, something like x[x != c(x[-1], FALSE)] – Neighborly 15/12, 2014 at 11:39

I struggled to adapt this to factors inside a data.frame. Within a dataframe, the rhs, FALSE constrains the vector to be of type integer, so on the lhs x must be so constrained for the comparison to be possible, with as.integer(). Correct me if I'm wrong! – Yhvh 21/6, 2016 at 8:32

sorry but it's very unclear ... you can ask a question on SO if you are meeting an issue on such a topic, with your input, output and what you have done so far. – Seventeen 21/6, 2016 at 8:35

Much as I like,... errr, love rle , here's a shootoff:

EDIT: Can't figure out exactly what's up with dplyr so I used dplyr::lead . I'm on OSX, R3.1.2, and latest dplyr from CRAN.

xlet<-sample(letters,1e5,rep=T)
rleit<-function(x) rle(x)$values
lagit<-function(x) x[x!=lead(x, default=1)]
tailit<-function(x) x[x!=c(tail(x,-1), tail(x,1))]



  microbenchmark(rleit(xlet),lagit(xlet),tailit(xlet),times=20)
Unit: milliseconds
         expr      min       lq   median       uq      max neval
  rleit(xlet) 27.43996 30.02569 30.20385 30.92817 37.10657    20
  lagit(xlet) 12.44794 15.00687 15.14051 15.80254 46.66940    20
 tailit(xlet) 12.48968 14.66588 14.78383 15.32276 55.59840    20

Thorfinn answered 15/12, 2014 at 11:27 Comment(7)

Did lagit work for you? It returns nothing to me. Maybe this is why its the fastest.... – Neighborly 15/12, 2014 at 11:32

I think you need to re run this after loading dplyr, as I suspect lagit isnt doing anything currently and thus very fast – Neighborly 15/12, 2014 at 11:40

Oops, yep-- I didn't check the outputs. But what's up? I see "lag" listed under "leadlag" but there's no actual function dplyr::lag – Thorfinn 15/12, 2014 at 11:42

The tailit function in your benchmark is missing the last "d" in the vector.. perhaps update to x[x!=c(x[-1], FALSE)] as in the answer – Farfetched 15/12, 2014 at 11:54

Seems like lead is little slower than lag. Why is that so? github.com/hadley/dplyr/blob/master/R/lead-lag.R – Stu 15/12, 2014 at 13:43

@DavidArenburg Upon reflection, lagit returned zero because I had a wide range of values so there weren't any repeats. If you rerun my case with xlet<-sample(letters[1:7],1e5,rep=T) you'll get some hits. – Thorfinn 15/12, 2014 at 14:41

Is there any LAG function equivalent in R ?? (#20637429) I have tried this but did n't work. – Irritating 15/12, 2014 at 20:38

Tidyverse solution:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x)
x |> 
 mutate(id = consecutive_id(x)) |> 
 distinct(x, id)

In addition, if there is another column y associated with the consecutive values column, this solution allows some flexibility:

x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x, y = runif(length(x)))
x |> 
    group_by(id = consecutive_id(x)) |> 
    slice_min(y)

We can choose between the different slice functions, like slice_max, slice_min, slice_head, and slice_tail.

This Stack Overflow thread appeared in the second edition of R4DS, in the Numbers chapter of the book.

Dilatation answered 22/1, 2023 at 22:4 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Update: For a data.frame

Update 2

Recommended topics

Hot tags

Update: For a `data.frame`