What's the fastest way to scan a very large file in Java?
Imagine I have a very large text file. Performance really matters.

All I want to do is scan it to look for a certain string. Maybe I'll want to count how many occurrences there are, but that really isn't the point.

The point is: what's the fastest way?

I don't care about maintainability; it needs to be fast.

Fast is key.

Euphemism answered 3/2, 2011 at 12:29 Comment(3)
So does it need to be fast?Conation
@Joel: I'm not really sure if that's what he meant.Bespeak
More importantly: does it need to be fast once or do you need to search the same source multiple times (for different Strings obviously)?Bespeak
16

For a one-off search, use a Scanner, as suggested here:

A simple technique that could well be considerably faster than indexOf() is to use a Scanner, with the method findWithinHorizon(). If you use a constructor that takes a File object, Scanner will internally create a FileChannel to read the file. And for pattern matching it will end up using a Boyer-Moore algorithm for efficient string searching.
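
A minimal sketch of that approach (the file name huge.txt and the search string ERROR are just placeholders): Pattern.quote() keeps the needle from being treated as a regex, and a horizon of 0 means "search to the end of the input".

    import java.io.File;
    import java.io.IOException;
    import java.util.Scanner;
    import java.util.regex.Pattern;

    public class ScanWithScanner {
        public static void main(String[] args) throws IOException {
            File file = new File("huge.txt");   // placeholder path
            String needle = "ERROR";            // placeholder search string

            int count = 0;
            try (Scanner scanner = new Scanner(file)) {
                // Quote the needle so it is matched literally, not as a regex;
                // a horizon of 0 means the search is not bounded.
                while (scanner.findWithinHorizon(Pattern.quote(needle), 0) != null) {
                    count++;
                }
            }
            System.out.println("Found " + count + " occurrence(s)");
        }
    }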

Conation answered 3/2, 2011 at 13:35 Comment(2)
Nice shortcut to get everything I suggested without implementing it manually!Gow
note: this may end up loading the whole file into memoryPya
4

First of all, use nio (FileChannel) rather than the java.io classes. Second, use an efficient string search algorithm like Boyer-Moore.

If you need to search through the same file multiple times for different strings, you'll want to construct some kind of index, so take a look at Lucene.
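
Putting those two suggestions together, here is a rough sketch that memory-maps the file with FileChannel and runs a Boyer-Moore-Horspool scan (a simplified Boyer-Moore variant) over the raw bytes. The file name and search string are placeholders, the needle is assumed to be plain ASCII, and a single mapping is capped at 2 GB, so a larger file would have to be mapped in chunks.

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.Arrays;

    public class MappedSearch {

        // Boyer-Moore-Horspool over a memory-mapped region.
        static int countOccurrences(MappedByteBuffer buf, byte[] needle) {
            // Bad-character table: how far we may shift on a mismatch.
            int[] shift = new int[256];
            Arrays.fill(shift, needle.length);
            for (int i = 0; i < needle.length - 1; i++) {
                shift[needle[i] & 0xFF] = needle.length - 1 - i;
            }

            int count = 0;
            int pos = 0;
            while (pos + needle.length <= buf.limit()) {
                int j = needle.length - 1;
                while (j >= 0 && buf.get(pos + j) == needle[j]) {
                    j--;
                }
                if (j < 0) {
                    count++;
                    pos += needle.length;   // skip past the match (overlaps not counted)
                } else {
                    pos += shift[buf.get(pos + needle.length - 1) & 0xFF];
                }
            }
            return count;
        }

        public static void main(String[] args) throws Exception {
            byte[] needle = "ERROR".getBytes(StandardCharsets.US_ASCII);  // placeholder
            try (FileChannel channel = FileChannel.open(Path.of("huge.txt"),
                                                        StandardOpenOption.READ)) {
                // Note: a single map() call is limited to Integer.MAX_VALUE bytes.
                MappedByteBuffer buf =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                System.out.println(countOccurrences(buf, needle));
            }
        }
    }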

Gow answered 3/2, 2011 at 12:44 Comment(3)
Why nio instead of io? The nio classes are for scaling, not necessarily for speed.Thenna
@jtahlborn: You're mistaking one aspect (scalable networking via selectors) for the whole. The nio classes can also speed up file operations a lot by avoiding copy operations. For example (and relevant to this question), a MappedByteBuffer can directly use the disk page data as provided by the OS, whereas a BufferedInputStream has to copy it because it's built on top of the InputStream interface.Gow
In order to work with the data, you will still need to copy it into the Java heap, so for operations that are a one-time read through the file, I doubt this will make any significant difference.Thenna
1

Load the whole file into memory and then look at using a string-searching algorithm such as Knuth-Morris-Pratt.

Edit:
A quick Google search turns up this string-searching library, which seems to implement a few different string-search algorithms. Note that I've never used it, so I can't vouch for it.
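
For illustration only, a compact Knuth-Morris-Pratt sketch that reads the whole file into a String (so it obviously has to fit in memory); the file name and search string are placeholders.

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class KmpSearch {

        // fail[i] = length of the longest proper prefix of needle[0..i]
        // that is also a suffix of it.
        static int[] failureTable(String needle) {
            int[] fail = new int[needle.length()];
            int k = 0;
            for (int i = 1; i < needle.length(); i++) {
                while (k > 0 && needle.charAt(i) != needle.charAt(k)) {
                    k = fail[k - 1];
                }
                if (needle.charAt(i) == needle.charAt(k)) {
                    k++;
                }
                fail[i] = k;
            }
            return fail;
        }

        static int count(String haystack, String needle) {
            int[] fail = failureTable(needle);
            int count = 0;
            int k = 0;
            for (int i = 0; i < haystack.length(); i++) {
                while (k > 0 && haystack.charAt(i) != needle.charAt(k)) {
                    k = fail[k - 1];
                }
                if (haystack.charAt(i) == needle.charAt(k)) {
                    k++;
                }
                if (k == needle.length()) {
                    count++;
                    k = fail[k - 1];   // keep going; overlapping matches are counted
                }
            }
            return count;
        }

        public static void main(String[] args) throws Exception {
            // Loads the entire file into memory, exactly what this answer proposes.
            String text = Files.readString(Path.of("huge.txt"), StandardCharsets.UTF_8);
            System.out.println(count(text, "ERROR"));
        }
    }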

Care answered 3/2, 2011 at 12:43 Comment(1)
Yes, but to load it into memory you have to read it off disk first. Unless you need to do more than one search (maybe; the OP doesn't specify), you should just parse the stream.Arieariel
0

Whatever the specifics may be, memory-mapped IO is usually the answer.

Edit: depending on your requirements, you could try importing the file into an SQL database and then leveraging the performance improvements through JDBC.

Edit2: this thread at JavaRanch has some other ideas, involving FileChannel. I think it might be exactly what you are searching for.

Jubbulpore answered 3/2, 2011 at 12:31 Comment(1)
How the hell could JDBC possibly help in any way? What "performance improvements" are you talking about?Gow
0

I'd say the fastest you can get will be to use a BufferedInputStream on top of a FileInputStream... or use custom buffers if you want to avoid the BufferedInputStream instantiation.

This explains it better than I can: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
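
For reference, a minimal sketch of that approach: a BufferedReader with an enlarged buffer, counting matches line by line with indexOf(). The file name and search string are placeholders, the platform default charset is used, and it assumes the target string never spans a line break.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class BufferedScan {
        public static void main(String[] args) throws IOException {
            String needle = "ERROR";   // placeholder search string
            long count = 0;
            // 1 MiB buffer instead of the 8 KiB default to cut down on read() calls.
            try (BufferedReader reader =
                     new BufferedReader(new FileReader("huge.txt"), 1 << 20)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    int from = 0;
                    while ((from = line.indexOf(needle, from)) != -1) {
                        count++;
                        from += needle.length();
                    }
                }
            }
            System.out.println(count);
        }
    }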

Disaccord answered 3/2, 2011 at 12:38 Comment(0)
0

Use the right tool: a full-text search library

My suggestion is to build an in-memory index (or a file-based index with caching enabled) and then perform the search on it. As @Michael Borgwardt suggested, Lucene is the best library out there.

Bumptious answered 3/2, 2011 at 12:55 Comment(0)
0

I don't know if this is a stupid suggestion, but isn't grep a pretty efficient file-searching tool? Maybe you can call it using Runtime.getRuntime().exec(..)
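
A small sketch of that idea, using ProcessBuilder instead of calling Runtime.exec directly; it assumes grep is on the PATH, the file name and pattern are placeholders, and note that grep -c counts matching lines rather than total occurrences.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class GrepCount {
        public static void main(String[] args) throws IOException, InterruptedException {
            // -c makes grep print only the number of matching lines.
            Process process = new ProcessBuilder("grep", "-c", "ERROR", "huge.txt")
                    .redirectErrorStream(true)
                    .start();
            try (BufferedReader out = new BufferedReader(
                    new InputStreamReader(process.getInputStream()))) {
                System.out.println(out.readLine());
            }
            process.waitFor();
        }
    }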

Kamat answered 3/2, 2011 at 13:09 Comment(0)
0

It depends on whether you need to do more than one search per file. If you need to do just one search, read the file in from disk and parse it using the tools suggested by Michael Borgwardt. If you need to do more than one search, you should probably build an index of the file with a tool like Lucene: read the file in, tokenise it, and stick the tokens in an index. If the index is small enough, keep it in RAM (Lucene gives you the option of a RAM- or disk-backed index); if not, keep it on disk. And if it is too large for RAM and you are very, very, very concerned about speed, store your index on a solid-state/flash drive.
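
A rough sketch of that workflow against a recent Lucene release (package and class names shift between versions, so treat this as an outline rather than copy-paste code). It indexes one document per line of a placeholder file into an in-memory ByteBuffersDirectory; swap in FSDirectory.open(...) for a disk-backed index, and note that Files.readAllLines itself loads the whole file into memory, so a truly huge file should be streamed instead.

    import java.nio.file.Files;
    import java.nio.file.Path;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneLineIndex {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();   // RAM index; use FSDirectory for disk

            // Index each line of the file as its own document.
            try (IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                for (String line : Files.readAllLines(Path.of("huge.txt"))) {
                    Document doc = new Document();
                    doc.add(new TextField("contents", line, Field.Store.NO));
                    writer.addDocument(doc);
                }
            }

            // Repeated searches against the index are now cheap.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // StandardAnalyzer lowercases tokens, so query with a lowercase term.
                TopDocs hits = searcher.search(new TermQuery(new Term("contents", "error")), 10);
                System.out.println("Matching lines: " + hits.totalHits);
            }
        }
    }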

Arieariel answered 3/2, 2011 at 13:25 Comment(0)
