How to find lines containing any string from another file?
I have two CSV files: File A, with multiple columns, and File B, with one column. E.g.:

File A:

chr1 100000 100022 A C GeneX
chr2 200000 200033 X GeneY
chr3 300000 300055 G A GeneZ

File B:

GeneY
GeneZ

I would want my output to be:

chr2 200000 200033 X GeneY
chr3 300000 300055 G A GeneZ

I have tried using grep (which crashes) and other tools.
I am certain there must be a very simple answer to this that I just can't see!

Kook asked 20/1, 2015 at 2:49
Which platform are you on if grep crashes? How big are the files that you're working with? You said that you got an 'out of memory' error when you tried grep -f FileB FileA. Your best bet in that case is probably to split FileB into sections small enough to be processed without grep crashing. The obvious disadvantage of this is that you will end up with rows in the result set that are out of order compared with the original FileA. If two words from FileB can appear in a single line, then you could also end up with repeats. – Reagan
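The splitting workaround suggested in that comment can be sketched as follows. The chunk size and filenames are illustrative, and -F is an added assumption (the gene names are literal strings, not regexes):

```shell
# Sample data from the question; substitute your real files.
printf 'chr1 100000 100022 A C GeneX\nchr2 200000 200033 X GeneY\nchr3 300000 300055 G A GeneZ\n' > FileA
printf 'GeneY\nGeneZ\n' > FileB

# Split FileB into chunks (1000 lines per chunk is arbitrary),
# run grep once per chunk, and collect the results.
split -l 1000 FileB FileB.part.
for part in FileB.part.*; do
    grep -F -f "$part" FileA
done > matched.txt
rm FileB.part.*

cat matched.txt
```

As the comment notes, output order follows FileB chunk order rather than FileA, and a line matching patterns in two different chunks would be printed twice.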
Does sed work any better? What about Perl? If neither sed nor grep nor Perl works, then you may be able to find a better way to encode the information and write your own processing. But that's something of a last resort, depending on a lot of factors not yet described in the question. – Reagan
Thanks. I haven't been able to get sed to work. – Kook
Bad luck. Please identify the platform you're working on, and the sizes of the two files (line count and size in bytes for both files would be useful). – Reagan
I've been trying to use Unix in a bash terminal. File A is just 1 column of 1500 lines. File B is 1.2 MB, with 5800 lines. – Kook
Which version of Unix? Those are tiny files! I was assuming you meant millions of records in the list of names, and gigabytes of data in the main file. OK; so maybe they aren't tiny, but they are not, by any stretch of the imagination, big. Maybe you need to get GNU grep installed? It will be quicker and simpler than most of the alternatives. (I just tried doing grep -f FileA with a file containing 1500 generated lines such as GZX6274256PQA (a seven-digit random number sandwiched between two constant strings) and it started up without a problem on my Mac, using BSD grep rather than GNU.) – Reagan
Yes, they are not that big, which is why I am struggling. I'm on Darwin Kernel Version 13.4.0. – Kook
So that's Mac OS X Mavericks 10.9.5, I guess. I was able to run grep -f FileA with a similar file (new set of random numbers, different sandwiching letters) without problems. My machine has 16 GiB of main memory; I don't know if you're memory-constrained. The memory pressure on my machine is non-existent (11 GiB used, so 5 GiB available); see Activity Monitor / Memory tab. Have you rebooted since you ran into trouble? (I hate suggesting that, but it can help surprisingly/depressingly often.) – Reagan
See this post: Fastest way to find lines of a file from another larger file in Bash – Nodule
Use grep -f:

grep -f FileB FileA
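With the question's sample data, a quick check of this approach. The -F (fixed strings) and -w (whole words) flags are optional refinements, not part of the original answer; -w prevents GeneY from also matching a hypothetical GeneYY:

```shell
# Sample data from the question.
printf 'chr1 100000 100022 A C GeneX\nchr2 200000 200033 X GeneY\nchr3 300000 300055 G A GeneZ\n' > FileA
printf 'GeneY\nGeneZ\n' > FileB

# -F: treat each pattern in FileB as a fixed string, not a regex.
# -w: match whole words only.
grep -F -w -f FileB FileA
# chr2 200000 200033 X GeneY
# chr3 300000 300055 G A GeneZ
```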
Fregger answered 20/1, 2015 at 2:51
Thanks Amit. Unfortunately I get an out-of-memory error message with grep when I try to run this on large datasets. – Kook
@ChrisDias – You can try after setting the locale with export LC_ALL=C. Source: stackoverflow.com/a/11777835 – Fregger
Here is how to do it with awk. This first command loads every line of FileA into an array, then, for each gene name in FileB, prints every stored line that matches it:

awk 'FNR==NR {a[$0];next} {for (i in a) if (i~$1) print i}' FileA FileB
chr2 200000 200033 X GeneY
chr3 300000 300055 G A GeneZ

Or like this, testing whether the last field of each FileA line is one of the names stored from FileB (a single hash lookup per line, rather than a scan):

awk 'FNR==NR {a[$0];next} ($NF in a)' FileB FileA
chr2 200000 200033 X GeneY
chr3 300000 300055 G A GeneZ
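For reference, a commented expansion of that second one-liner; same logic, assuming the gene name is always the last field of FileA:

```shell
# Sample data from the question.
printf 'GeneY\nGeneZ\n' > FileB
printf 'chr1 100000 100022 A C GeneX\nchr2 200000 200033 X GeneY\nchr3 300000 300055 G A GeneZ\n' > FileA

awk '
    FNR == NR { a[$0]; next }  # first file (FileB): remember each gene name as an array key
    $NF in a                   # second file (FileA): pattern with no action prints the line
                               # when its last field is one of the remembered keys
' FileB FileA
```

Because FNR resets at the start of each input file while NR does not, FNR == NR is true only while the first file (FileB) is being read.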
Rockaway answered 20/1, 2015 at 7:39
Thanks Jotne. I tried this, but got an empty output. I have the files as both .csv and tab-delimited .txt; neither worked. – Kook
@ChrisDias It does work fine with your data above, so your real data must differ in format. – Rockaway