Keep rows with certain values within a data frame and delete all others [R]

Asked 14/9, 2013 at 17:57 Answered 14/9, 2013 at 18:51

I am using R

set.seed(1)
Data <- data.frame(id = seq(1, 10), 
               Diag1 = sample(c("A123", "B123", "C123"), 10, replace = TRUE), 
               Diag2 = sample(c("D123", "E123", "F123"), 10, replace = TRUE), 
               Diag3 = sample(c("G123", "H123", "I123"), 10, replace = TRUE), 
               Diag4 = sample(c("A123", "B123", "C123"), 10, replace = TRUE), 
               Diag5 = sample(c("J123", "K123", "L123"), 10, replace = TRUE), 
               Diag6 = sample(c("M123", "N123", "O123"), 10, replace = TRUE), 
               Diag7 = sample(c("P123", "Q123", "R123"), 10, replace = TRUE))
Data

I've got a data frame like this. In reality it has 34 variables and 1.5 Mio observations. It is a data frame with patient data. (ID & diagnoses (ICD10) A123 and B123 stand for certain diagnoses. I want to extract all the patients with these diagnoses. In fact i am looking for 6 diagnoses within 100s of different ICD10 diagnoses. Every of those diagnoses i look for can be appear in any column but they are mutually exclusive. In the end I will have a data frame of about 4000 observations instead of 1.5 Mio.

My goal is to get a data frame where I just keep the rows which contain A123 or B123. A123 and B123 cannot be in the same row. But they can appear in every column.

I manage to do that for one single variable when i do this:

DataA123 <- Data[Data$Diag1 == "A123", ]

But i want to do it for every variable and for A123 and B123 (there are actually 6 factors like this) together.

Is this possible?

Oscine answered 14/9, 2013 at 17:57 Comment(14)

What language or framework are you using? – Harte 14/9, 2013 at 17:59

I use R. Sorry i am new to this forum and to this stuff. I searched the forums and wasn't able to find an answer to this. – Oscine 14/9, 2013 at 18:6

Welcome to the SO. Do you mean you need the data fame where all variables have A123 or all variables have B123?Can you please provide the expected output? – Gainsborough 14/9, 2013 at 18:10

Thank you! It is a data frame with patient data. (ID & diagnoses (ICD10) A123 and B123 stand for a certain diagnoses. I want to extract all the patients with these diagnoses. In fact i am looking for 6 diagnoses within 100s of different ICD10 diagnoses. Every of those diagnoses i look for can be appear in any column but they are mutually exclusive. In the end I will have a data frame of about 4000 observations instead of 1.5 Mio. – Oscine 14/9, 2013 at 18:22

So you are looking for the patients which has Diag1=A123 or Diag2=123 or diag3=A123 or Diag4=A123 or Diag5=A123 or Diag6=123 (Assuming mutually exclusive as you said), right? – Gainsborough 14/9, 2013 at 18:32

@ Metrics Exactly or B123 instead of A123. – Oscine 14/9, 2013 at 18:36

Got that. See the answers. – Gainsborough 14/9, 2013 at 18:39

@Roccer, it is helpful to use set.seed when posting these types of questions that involve random data. – Mercedes 14/9, 2013 at 18:43

@ Anando Mahto, You are right. Thank you. I just updated that – Oscine 14/9, 2013 at 18:52

@Roccer, great. Now that we are all working with the same small sample dataset, it would also be nice if you can share your desired output from that sample data. – Mercedes 14/9, 2013 at 18:56

@ Ananda Mahto. I tried your solution. I changed it like this as I am looking for 6 diagnoses. Would that work as well? My output is a data.frame full of NAs Data2[rowSums(cbind(rowSums(Data2 == "G820"), rowSums(Data2 == "G821"), rowSums(Data2 == "G822"), rowSums(Data2 == "G823"), rowSums(Data2 == "G824"), rowSums(Data2 == "G825")) != 0) == 1, ] – Oscine 14/9, 2013 at 19:1

@Roccer, Just glancing at your code, it should work. By the way, if you keep putting a space between the @ and a user name, they won't get any notification that you've sent them a message. – Mercedes 14/9, 2013 at 19:10

@Roccer, to troubleshoot, you might want to break the code apart and see what it is doing at each step. – Mercedes 14/9, 2013 at 19:16

@Anada Mahto Thank you for the hint. Wow you guys are fast. I just realized my example is not good as in one row there can be just one of the diagnoses which is not the case in my example. – Oscine 14/9, 2013 at 19:20

How about this?

Select all rows with A123 and/or B123:

Data[apply(Data,1,function(x) {any(c("A123", "B123") %in% x)}),]

Select all rows with either A123 or B123:

Data[apply(Data,1,function(x) {Reduce(xor, c("A123", "B123") %in% x)}),]

Ahvenanmaa answered 14/9, 2013 at 18:51 Comment(2)

+1 for these alternatives. Both options match somewhat in my mind with what the OP might be looking for. Of course, only they know what that is! :p – Mercedes 14/9, 2013 at 19:55

@AnandaMahto, thank you for your help as well! Next questions will be better structured :) – Oscine 14/9, 2013 at 20:33

If I understand your question correctly, you might be able to use something like:

Data[rowSums(cbind(rowSums(Data == "A123"), 
                   rowSums(Data == "B123")) != 0) == 1, ]

(But I'm not sure how efficient it would be for your actual data, especially because you have to make several intermediate large matrices).

The basic idea is as follows:

rowSums(Data == "A123") tells us how many times "A123" appears in each row.
rowSums(Data == "B123") tells us how many times "B123" appears in each row.
cbind puts the two of them together as a two column matrix.
Since "A123" and "B123" can't be in the same row, we calculate rowSums again to find out how many rows have only one of those present (even if it is present more than once).
From there, it's basic subsetting.

Here's an example:

set.seed(1)
Data <- data.frame(id = seq(1, 10), 
               Diag1 = sample(c("A123", "B123", "C123"), 10, replace = TRUE), 
               Diag2 = sample(c("D123", "E123", "F123"), 10, replace = TRUE), 
               Diag3 = sample(c("G123", "H123", "I123"), 10, replace = TRUE), 
               Diag4 = sample(c("A123", "B123", "C123"), 10, replace = TRUE), 
               Diag5 = sample(c("J123", "K123", "L123"), 10, replace = TRUE), 
               Diag6 = sample(c("M123", "N123", "O123"), 10, replace = TRUE), 
               Diag7 = sample(c("P123", "Q123", "R123"), 10, replace = TRUE))
Data
#    id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
# 1   1  A123  D123  I123  B123  L123  N123  R123
# 2   2  B123  D123  G123  B123  K123  O123  P123
# 3   3  B123  F123  H123  B123  L123  N123  Q123
# 4   4  C123  E123  G123  A123  K123  M123  P123
# 5   5  A123  F123  G123  C123  K123  M123  Q123
# 6   6  C123  E123  H123  C123  L123  M123  P123
# 7   7  C123  F123  G123  C123  J123  M123  Q123
# 8   8  B123  F123  H123  A123  K123  N123  R123
# 9   9  B123  E123  I123  C123  L123  N123  P123
# 10 10  A123  F123  H123  B123  L123  N123  R123

Data[rowSums(cbind(rowSums(Data == "A123"), 
                   rowSums(Data == "B123")) != 0) == 1, ]
#   id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
# 2  2  B123  D123  G123  B123  K123  O123  P123
# 3  3  B123  F123  H123  B123  L123  N123  Q123
# 4  4  C123  E123  G123  A123  K123  M123  P123
# 5  5  A123  F123  G123  C123  K123  M123  Q123
# 9  9  B123  E123  I123  C123  L123  N123  P123

Note that from the source 10-row data.frame

rows 1, 8, and 10 were dropped because they contained both "A123" and "B123".
rows 6 and 7 were dropped because they contained neither "A123" nor "B123".

Mercedes answered 14/9, 2013 at 18:26 Comment(0)

set.seed(1)

  ll<-as.list(names(Data)[-1])

For A123:

Map(function(x) Data[Data[x][[1]]=="A123",],ll)


  [[1]]
   id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
1   1  A123  D123  I123  B123  L123  N123  R123
5   5  A123  F123  G123  C123  K123  M123  Q123
10 10  A123  F123  H123  B123  L123  N123  R123

[[2]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[3]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[4]]
  id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
4  4  C123  E123  G123  A123  K123  M123  P123
8  8  B123  F123  H123  A123  K123  N123  R123

[[5]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[6]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[7]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

For B123:

Map(function(x) Data[Data[x][[1]]=="B123",],ll)



 [[1]]
  id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
2  2  B123  D123  G123  B123  K123  O123  P123
3  3  B123  F123  H123  B123  L123  N123  Q123
8  8  B123  F123  H123  A123  K123  N123  R123
9  9  B123  E123  I123  C123  L123  N123  P123

[[2]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[3]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[4]]
   id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
1   1  A123  D123  I123  B123  L123  N123  R123
2   2  B123  D123  G123  B123  K123  O123  P123
3   3  B123  F123  H123  B123  L123  N123  Q123
10 10  A123  F123  H123  B123  L123  N123  R123

[[5]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[6]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[7]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

For A123 or B123:

 Map(function(x) Data[Data[x][[1]]=="A123"|Data[x][[1]]=="B123",],ll)



 [[1]]
   id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
1   1  A123  D123  I123  B123  L123  N123  R123
2   2  B123  D123  G123  B123  K123  O123  P123
3   3  B123  F123  H123  B123  L123  N123  Q123
5   5  A123  F123  G123  C123  K123  M123  Q123
8   8  B123  F123  H123  A123  K123  N123  R123
9   9  B123  E123  I123  C123  L123  N123  P123
10 10  A123  F123  H123  B123  L123  N123  R123

[[2]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[3]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[4]]
   id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
1   1  A123  D123  I123  B123  L123  N123  R123
2   2  B123  D123  G123  B123  K123  O123  P123
3   3  B123  F123  H123  B123  L123  N123  Q123
4   4  C123  E123  G123  A123  K123  M123  P123
8   8  B123  F123  H123  A123  K123  N123  R123
10 10  A123  F123  H123  B123  L123  N123  R123

[[5]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[6]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[7]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

Gainsborough answered 14/9, 2013 at 18:38 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags