In R, getting a "Can't bind data because some arguments have the same name" error using dplyr::select

Asked 28/2, 2019 at 18:29 Answered 27/8, 2020 at 14:43

library(dplyr)

# Use read.table() to create data frames from the unzipped files below
x.train <- read.table("UCI HAR Dataset/train/X_train.txt")
subject.train <- read.table("UCI HAR Dataset/train/subject_train.txt")

y.train <- read.table("UCI HAR Dataset/train/y_train.txt")
x.test <- read.table("UCI HAR Dataset/test/X_test.txt")
subject.test <- read.table("UCI HAR Dataset/test/subject_test.txt")

y.test <- read.table("UCI HAR Dataset/test/y_test.txt")
features <- read.table("UCI HAR Dataset/features.txt")
activity.labels <- read.table("UCI HAR Dataset/activity_labels.txt")   



colnames(x.test) <- features[,2]
dataset_test <- cbind(subject.test,y.test,x.test)
colnames(dataset_test)[1] <- "subject"
colnames(dataset_test)[2] <- "activity"

test <- select(features, V2)

dataset_test <- select(dataset_test,subject,activity)

[1] Error: Can't bind data because some arguments have the same name

features is a two-column data frame whose second column contains the names for x.test
subject.test is a single column data frame
y.test is a single column data frame
x.test is a wide data frame

After naming and binding these data frames I tried to use dplyr::select to select certain columns. However, I get an error when assigning the result to dataset_test:

"Error: Can't bind data because some arguments have the same name"

However, test does not return an error and properly filters. Why is there the difference in behaviour?

The data I am using can be downloaded online. The file names correspond to the variable names, except that the files use "_" where the variables use ".".

dput

> dput(head(x.test[,1:5],2))
structure(list(V1 = c(0.25717778, 0.28602671), V2 = c(-0.02328523, 
-0.013163359), V3 = c(-0.014653762, -0.11908252), V4 = c(-0.938404, 
-0.97541469), V5 = c(-0.92009078, -0.9674579)), row.names = 1:2, class = "data.frame")

> dput(head(subject.test,2))
structure(list(V1 = c(2L, 2L)), row.names = 1:2, class = "data.frame")

> dput(head(y.test,2))
structure(list(V1 = c(5L, 5L)), row.names = 1:2, class = "data.frame")

> dput(head(features,2))
structure(list(V1 = 1:2, V2 = c("tBodyAcc-mean()-X", "tBodyAcc-mean()-Y"
)), row.names = 1:2, class = "data.frame")

Commander answered 28/2, 2019 at 18:29 Comment(9)

You should edit to add sample data for reproducibility. Use dput to provide some sample data. More details here – Irrelevance 28/2, 2019 at 18:31

What does names(dataset_test) return? – Myosin 28/2, 2019 at 18:39

It returns: [1] "subject" "tBodyAcc-mean()-X" "tBodyAcc-mean()-Y" "tBodyAcc-mean()-Z" [5] "tBodyAcc-std()-X" "tBodyAcc-std()-Y" "tBodyAcc-std()-Z" "tBodyAcc-mad()-X" [9] "tBodyAcc-mad()-Y" ..... – Commander 28/2, 2019 at 19:6

This is what I expected it would return – Commander 28/2, 2019 at 19:6

Can you share a sample of the data using dput – Myosin 28/2, 2019 at 20:2

Sure, so dput each variable? – Commander 28/2, 2019 at 21:12

posted dput as requested :) – Commander 28/2, 2019 at 21:47

I have seen the same error (with another dataset), after upgrading R lately. The data is unchanged. – Charil 26/3, 2019 at 8:40

I resolved the issue but don't exactly remember the issue. If I recall, somehow the prior binding and merging led to duplicate column names. See if there are any duplicate column names. – Commander 10/6, 2019 at 14:9

I had exactly the same problem and I think I'm looking at the same dataset as you. It's motion sensor data from a smart phone, isn't it?

The problem is exactly what the error message says! That dang set has duplicate column names. Here's how I explored it. I couldn't use your dput commands, so I couldn't try out your data. I'm showing my code and results. I suggest you substitute your variable, dataset_test, where I have samsungData.

Here's the error. If you just select the dataset, but don't indicate the columns, the error message identifies the duplicates.

select(samsungData)

That gave me this error, which is just what your own dplyr error was trying to tell you.

Error: Columns "fBodyAcc-bandsEnergy()-1,8", "fBodyAcc-bandsEnergy()-9,16", "fBodyAcc-bandsEnergy()-17,24", "fBodyAcc-bandsEnergy()-25,32", "fBodyAcc-bandsEnergy()-33,40", ... must have a unique name

Then I wanted to see where that first column was duplicated. (I don't think I'll ever work well with regular expressions, but this one made me mad and I wanted to find it.)

has_dupe_col <- grep("fBodyAcc\\-bandsEnergy\\(\\)\\-1,8", names(samsungData))
names(samsungData)[has_dupe_col]

Results:

[1] "fBodyAcc-bandsEnergy()-1,8" "fBodyAcc-bandsEnergy()-1,8" "fBodyAcc-bandsEnergy()-1,8"

That showed me that the same column name appears in three positions. That won't play nicely in dplyr.
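If escaping the regular expression is the painful part, an exact comparison avoids it entirely. The following is only a sketch on a made-up data frame (toy and its values are hypothetical stand-ins for samsungData); it uses base R's which() and grep() with fixed = TRUE.

```r
# Toy data frame standing in for samsungData (hypothetical values).
# check.names = FALSE stops data.frame() from altering the names.
toy <- data.frame(1:2, 3:4, 5:6, 7:8, check.names = FALSE)
names(toy) <- c("subject", "fBodyAcc-bandsEnergy()-1,8",
                "tBodyAcc-mean()-X", "fBodyAcc-bandsEnergy()-1,8")

# Exact comparison: no regular expression, so nothing needs escaping.
dupe_pos <- which(names(toy) == "fBodyAcc-bandsEnergy()-1,8")
dupe_pos

# fixed = TRUE makes grep() treat the pattern as a literal string;
# note that it still matches substrings, not whole names.
dupe_pos_grep <- grep("bandsEnergy()-1,8", names(toy), fixed = TRUE)
dupe_pos_grep
```

Both calls return the column positions that carry that name, which is usually what you want to know before deciding what to do with them.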

Then I wanted to see a frequency table for all the column names and call out the duplicates.

names_freq <- as.data.frame(table(names(samsungData)))
names_freq[names_freq$Freq > 1, ]

A bunch of them appear three times! Here are just a few.

                                Var1 Freq
9        fBodyAcc-bandsEnergy()-1,16    3
10       fBodyAcc-bandsEnergy()-1,24    3
11        fBodyAcc-bandsEnergy()-1,8    3

Conclusion:

The tool (dplyr) isn't broken; the data is defective. If you want to use dplyr to select from this dataset, you're going to have to locate those duplicate column names and do something about them. Maybe you change the column names (assigning new names with names() or colnames() will do it without grief; note that dplyr's mutate creates or modifies columns rather than renaming them). On the other hand, maybe they're supposed to be duplicated and they're there because they're a time series or some iteration of experimental observations. Maybe then what you need to do is merge those columns into one and provide another dimension (variable) to distinguish them.
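If renaming turns out to be the right call, one possible approach is base R's make.unique(), which appends a sequence number to each repeated name. This is a sketch on a made-up data frame (toy is hypothetical), and the "_" separator is an arbitrary choice, not something the dataset prescribes.

```r
# Toy data frame with a duplicated column name (hypothetical values).
toy <- data.frame(1:2, 3:4, 5:6, check.names = FALSE)
names(toy) <- c("subject", "x", "x")

# make.unique() leaves the first occurrence alone and suffixes the repeats.
names(toy) <- make.unique(names(toy), sep = "_")
names(toy)
# "subject" "x" "x_1"

# 0 means no duplicated names remain.
anyDuplicated(names(toy))
```

For the data in the question, the same idea would presumably be applied when the names are first assigned, for example colnames(x.test) <- make.unique(as.character(features[, 2])). Whether suffixed names are meaningful for your analysis is still something you have to decide from the data.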

That's the analysis part of data analysis. You'll have to dig into the data to see what the right answer is. Either that, or the question you're trying to answer need not even include those duplicate columns, in which case you throw them away and sleep peacefully.

Welcome to data science! At best, it's just 10% cool math and machine learning. 90% is putting on gloves and a mask and wiping up crap like this in your data.

Rattat answered 5/6, 2019 at 14:1 Comment(0)

I recently ran into this same problem with a different data set. My tidyverse solution to identifying duplicate column names in the dataframe (df) was:

tibble::enframe(names(df)) %>% count(value) %>% filter(n > 1)

Siftings answered 25/6, 2019 at 19:44 Comment(0)

This error is often caused by a data frame having columns with identical names, so that should be the first thing to check. I was trying to check my own data frame with dplyr select helper functions (starts_with, contains, etc.), but even those won't work. You may need to export to a CSV and check in Excel or some other program, or use base functions to check for duplicate column names.
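As a quick base R check that needs no export, anyDuplicated() reports whether any name repeats. The vector below is a made-up example (toy_names is hypothetical); in practice you would pass names(your_data_frame).

```r
# Hypothetical column names; "x" appears twice.
toy_names <- c("subject", "activity", "x", "y", "x")

# Returns the index of the first repeated element, or 0 if there is none.
first_dupe <- anyDuplicated(toy_names)
first_dupe
# 5

# Convenient as a yes/no test before calling select().
has_dupes <- first_dupe > 0
has_dupes
```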

Tectonics answered 10/4, 2019 at 16:43 Comment(1)

Links to external resources are encouraged, but please add context around the link so your fellow users will have some idea what it is and why it’s there. Always quote the most relevant part of an important link, in case the target site is unreachable or goes permanently offline. – Burck 10/4, 2019 at 17:6

Another possibility to find duplicate column names using Base R would be using duplicated:

colnames(df)[which(duplicated(colnames(df)))]
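One detail worth knowing: duplicated() flags only the second and later occurrences, so a name that appears three times is listed twice and the first occurrence is not flagged. The sketch below uses a made-up character vector (nm is hypothetical) in place of colnames(df) to show that, plus a way to get every position involved.

```r
# Hypothetical column names standing in for colnames(df).
nm <- c("a", "b", "a", "c", "a", "b")

# Repeats only: the first "a" and the first "b" are not flagged.
repeats <- nm[duplicated(nm)]
repeats
# "a" "a" "b"

# Each duplicated name once.
dupe_names <- unique(repeats)
dupe_names
# "a" "b"

# Every position that shares its name with another column,
# found by scanning from both ends.
dupe_pos <- which(duplicated(nm) | duplicated(nm, fromLast = TRUE))
dupe_pos
# 1 2 3 5 6
```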

Phlox answered 27/8, 2020 at 14:43 Comment(0)
