How to prevent rows with duplicated indices / keys to be appended to a data.frame?

Asked 20/3, 2016 at 10:50 Answered 1/4, 2016 at 23:46

r dataframe data.table uniqueidentifier unique-key

I have data in which the combination of two variables ("ManufactererId" and "ProductId") constitute unique keys / identifiers. The data looks like this:

my.data <- data.frame(ManufactererId = c(1, 1, 2, 2),
                      ProductId = c(1, 2, 1, 7),
                      Price = c(12.99, 149.00, 0.99, 3.99))
my.data
#   ManufactererId ProductId  Price
# 1              1         1  12.99
# 2              1         2 149.00
# 3              2         1   0.99
# 4              2         7   3.99

I want to ensure that I cannot accidentally add another row with a pair of ManufactererId - ProductId equal to what is already present in the table (like the unique constraint on a database table).

That is, if I try to add a row with ManufactererId = 2 and ProductId = 7 to my data frame:

my.data <- rbind(my.data, data.frame(ManufactererId = 2, ProductId = 7, Price = 120.00))

...it should fail with an error. How can this be achieved?

Or should I use a different data type?

Wilda answered 20/3, 2016 at 10:50 Comment(0)

1) zoo Whether this is convenient or not depends on what operations you want to do but zoo objects have unique indexes. We can construct a text index by pasting together the two Id columns.

library(zoo)
z <- with(my.data, zoo(Price, paste(ManufactererId, ProductId)))

z <- c(z, zoo(90, "1 1")) # Error, not appended
z <- c(z, zoo(90, "1 3")) # OK

Note that the data part of a zoo object can be a vector as shown above or a matrix in case you have more than just Price in the data.

2) SQLite This could be done with any of a number of databases but we will use SQLite here. First we create a table with a unique index in an SQLite database and then we insert rows.

library(RSQLite)

con <- dbConnect(SQLite())
dbWriteTable(con, "my", my.data, row.names = FALSE)
dbGetQuery(con, "create unique index ix on my(ManufactererId, ProductId)")

dbGetQuery(con, sprintf("insert into my values(%d, %d, %d)", 1, 1, 99)) # error
dbGetQuery(con, sprintf("insert into my values(%d, %d, %d)", 1, 13, 90)) # OK

Blessington answered 20/3, 2016 at 11:30 Comment(1)

Added second approach. – Blessington 20/3, 2016 at 12:13

You can do something like this: Where keys is your unique-key

append_save <- function(DF, to_be_appended, keys=c("ManufactererId", "ProductId")){
  if(ncol(DF) != ncol(to_be_appended) || !all(names(DF) %in% names(to_be_appended))){
    stop("must have the same columns")
  }
  if(nrow(merge(DF, to_be_appended, by=keys))==0){
    rbind(DF, to_be_appended)
  } else {
    stop("Trying to append douplicated indices")
  }
}

Test it:

to_be_appended = data.frame(ManufactererId=2,ProductId=17,Price=3.99)
append_save(my.data, to_be_appended) # works
to_be_appended_err = data.frame(ManufactererId=2,ProductId=7,Price=3.99)
append_save(my.data, to_be_appended_err) # error

If you append data only based on key-columns you could use data.table as follows:

append_save <- function(DF, to_be_appended, keys=c("ManufactererId", "ProductId")){
  if(!all(keys %in% names(to_be_appended))){
    stop("key-columns must be present")
  }
  if(nrow(data.table::merge(DF, to_be_appended, on=keys))==0){
    data.table::setDF(data.table::rbindlist(list(DF, to_be_appended), fill = TRUE))[]
  } else {
    stop("Trying to append douplicated indices")
  }
}

Gahl answered 20/3, 2016 at 11:34 Comment(0)

One way to do this in base R is to use an environment as a dictionary or hash-map like object. my.dict <- new.env()

First, write some helper functions

make_key <- function(ManufactererId, ProductId)
  paste(ManufactererId, ProductId)

set_value <- function(key, value, dict){
         ## checking here assures desired behavior 
         if(any(key %in% names(dict)))
            stop("This key has been used")
         assign(key, value,  envir=dict)
}

then, you can generate keys like

keys <- make_key(my.data[[1]], my.data[[2]])

to set values, you need to be a bit more careful

# don't just do this as the first element is used by assign
# set_value(keys, my.data[[3]], dict=my.dict)

mapply(set_value, keys, my.data[[3]], MoreArgs = list(dict=my.dict))
ls.str(my.dict) # better than str for environments
# 1 1 :  num 13
# 1 2 :  num 149
# 2 1 :  num 0.99
# 2 7 :  num 3.99

set_value("1 1", 4, my.dict)
# Error in set_value("1 1", 4, my.dict) : This key has been used

Dainedainty answered 1/4, 2016 at 23:26 Comment(0)

Simple way to rbind new data excluding duplicates:

library(data.table)
my.data = data.table(ManufactererId = c(1, 1, 2, 2),
                     ProductId = c(1, 2, 1, 7),
                     Price = c(12.99, 149.00, 0.99, 3.99),
                     key = c("ManufactererId","ProductId"))
x = my.data # my data will be called 'x'
y = data.table(ManufactererId = 2, ProductId = 7, Price = 120.00)
rbind(x, y[!x, on=key(x)])
#   ManufactererId ProductId  Price
#1:              1         1  12.99
#2:              1         2 149.00
#3:              2         1   0.99
#4:              2         7   3.99

Although you don't need to set a key but just provide character vector to on argument directly. I think it's worth to use key which simply reflects our business expectations in regards to data structure.

If you want to raise an error on such case you can use the following:

unique.rbind = function(x, y, by=key(x)) {
    if (nrow(x[y, nomatch=0L, on=by])) stop("duplicates in 'y'")
    rbind(x, y)
}
unique.rbind(x, y)
# Error in unique.rbind(x, y) : duplicates in 'y'

None of y rows gets inserted in case of error.

Woermer answered 1/4, 2016 at 23:46 Comment(0)

Recommended topics

Hot tags