Counting number of instances of a condition per row R [duplicate]

I have a large file with the first column being IDs, and the remaining 1304 columns being genotypes like below.

rsID    sample1    sample2    sample3...sample1304
abcd    aa         bb         nc        nc
efgh    nc         nc         nc        nc 
ijkl    aa         ab         aa        nc 

I would like to count the number of "nc" values per row and output the result of that to another column so that I get the following:

rsID    sample1    sample2    sample3...sample1304    no_calls
abcd    aa         bb         nc        nc            2
efgh    nc         nc         nc        nc            4
ijkl    aa         ab         aa        nc            1

The table function counts frequencies per column, not per row. If I transpose the data to use it with table, the file would need to look like this:

abcd         aa[sample1]
abcd         bb[sample2]
abcd         nc[sample3] ...
abcd         nc[sample1304]
efgh         nc[sample1]
efgh         nc[sample2]
efgh         nc[sample3] ...
efgh         nc[sample1304]

With this format, I would get the following, which is what I want:

ID    nc   aa   ab   bb
abcd  2    1    0    1
efgh  4    0    0    0

Does anybody know a simple way to get frequencies by row? I am trying this right now, but it is taking quite some time to run:

rsids$Number_of_no_calls <- apply(rsids, 1, function(x) sum(x=="NC"))
Mar answered 16/9, 2015 at 20:58 Comment(2)
R is case sensitive. The data shows "nc", but the apply call compares against "NC". - Lippert
rowSums is probably the right function. - Mavis

You can use rowSums.

df$no_calls <- rowSums(df == "nc")
df
#  rsID sample1 sample2 sample3 sample1304 no_calls
#1 abcd      aa      bb      nc         nc        2
#2 efgh      nc      nc      nc         nc        4
#3 ijkl      aa      ab      aa         nc        1

Or, as pointed out by MrFlick, to exclude the first column from the row sums, you can slightly modify the approach:

df$no_calls <- rowSums(df[-1] == "nc")
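The same idea extends to the full per-row frequency table the question describes (nc, aa, ab, bb for each rsID). The following is only a sketch on toy data: the data frame, the `genotypes` vector, and the `geno` and `counts` names are assumptions made for illustration, and the real file would have 1304 sample columns.

```r
# Toy data mirroring the question (assumed; only four sample columns)
df <- data.frame(
  rsID       = c("abcd", "efgh", "ijkl"),
  sample1    = c("aa", "nc", "aa"),
  sample2    = c("bb", "nc", "ab"),
  sample3    = c("nc", "nc", "aa"),
  sample1304 = c("nc", "nc", "nc"),
  stringsAsFactors = FALSE
)

# Genotype codes to count (assumed set; adjust to the codes in your file)
genotypes <- c("nc", "aa", "ab", "bb")

# Work on a character matrix of the sample columns only
geno <- as.matrix(df[-1])

# One rowSums() call per genotype; sapply() binds the results into a
# matrix with one row per rsID and one column per genotype
counts <- sapply(genotypes, function(g) rowSums(geno == g))
rownames(counts) <- df$rsID
counts
```

Each column of `counts` is one vectorised comparison over the whole matrix, so no per-row function call is needed.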

Regarding the row names: they are not counted by rowSums, and a simple test demonstrates it:

rownames(df)[1] <- "nc"  # name first row "nc"
rowSums(df == "nc")      # compute the row sums
#nc  2  3             
# 2  4  1        # still the same in first row
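Two things can silently change the counts: letter case (the question's apply call compares against "NC" while the sample data shows "nc") and missing values, since a single NA in a row makes that row's sum NA. A minimal sketch that guards against both, assuming a toy data frame with mixed case and one NA (the data here are made up for illustration):

```r
# Toy data (assumed): mixed-case no-calls and one missing genotype
df <- data.frame(
  rsID    = c("abcd", "efgh", "ijkl"),
  sample1 = c("aa", "NC", "aa"),
  sample2 = c("bb", "nc", NA),
  sample3 = c("nc", "nc", "aa"),
  stringsAsFactors = FALSE
)

# Character matrix of the sample columns only (drops the rsID column)
geno <- as.matrix(df[-1])

# tolower() makes the comparison case-insensitive;
# na.rm = TRUE skips missing cells instead of returning NA for the row
df$no_calls <- rowSums(tolower(geno) == "nc", na.rm = TRUE)
df$no_calls
```

Whether a missing cell should count as a no-call is a decision about your data; here it is simply skipped.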
Mavis answered 16/9, 2015 at 21:3 Comment(4)
Maybe df$no_calls <- rowSums(df[,-1] == "nc") to ignore any "nc" values in the first column. - Metastasize
@MrFlick, good point if there can be any in that column. - Mavis
@doc Will the original code you posted count "nc" values in the first column if the first column is read in as row names? - Mar
@nchimato, no it won't. - Mavis
