Java: fastest way to read through a text file with 2 million lines
Currently I am using a Scanner over a FileReader and looping with hasNextLine(). I don't think this method is very efficient. Is there any other way to read the file with the same functionality?

public void Read(String file) {
    Scanner sc = null;

    try {
        sc = new Scanner(new FileReader(file));

        while (sc.hasNextLine()) {
            String text = sc.nextLine();
            String[] file_Array = text.split(" ", 3);

            if (file_Array[0].equalsIgnoreCase("case")) {
                //do something
            } else if (file_Array[0].equalsIgnoreCase("object")) {
                //do something
            } else if (file_Array[0].equalsIgnoreCase("classes")) {
                //do something
            } else if (file_Array[0].equalsIgnoreCase("function")) {
                //do something
            } else if (file_Array[0].equalsIgnoreCase("ignore")) {
                //do something
            } else if (file_Array[0].equalsIgnoreCase("display")) {
                //do something
            }
        }

    } catch (FileNotFoundException e) {
        System.out.println("Input file " + file + " not found");
        System.exit(1);
    } finally {
        if (sc != null) {
            sc.close();
        }
    }
}
Vocalist answered 21/10, 2013 at 4:4 Comment(1)
This link has some good solutions. – Furnivall
You will find that BufferedReader.readLine() is as fast as you need: you can read millions of lines a second with it. It is more probable that your string splitting and handling is causing whatever performance problems you are encountering.
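
A minimal sketch of that approach applied to the keyword dispatch in the question (the file name is a placeholder; the first token is taken with indexOf/substring rather than split(), as the comments below suggest):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LineDispatch {
    public static void main(String[] args) throws IOException {
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                // take the first token without the regex cost of split()
                int sp = line.indexOf(' ');
                String keyword = (sp == -1 ? line : line.substring(0, sp));
                if (keyword.equalsIgnoreCase("case")) {
                    // do something
                } else if (keyword.equalsIgnoreCase("object")) {
                    // do something
                } // ... remaining keywords as in the question
            }
        }
    }
}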

Justiciary answered 21/10, 2013 at 4:43 Comment(3)
I didn't do a timed check, but when I use BufferedReader the reading part is about 20% faster compared to Scanner. – Vocalist
In my case, the splitting was the most dominant factor in the file read. Simple use of indexOf/lastIndexOf and substring helped cut those costs to a bare minimum. – Replenish
For me also the cost got reduced by around 50% once I replaced split() with a substring()/indexOf() pair. – Micromho
I made a gist comparing different methods:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Function;

public class Main {

    public static void main(String[] args) {

        String path = "resources/testfile.txt";
        measureTime("BufferedReader.readLine() into LinkedList", Main::bufferReaderToLinkedList, path);
        measureTime("BufferedReader.readLine() into ArrayList", Main::bufferReaderToArrayList, path);
        measureTime("Files.readAllLines()", Main::readAllLines, path);
        measureTime("Scanner.nextLine() into ArrayList", Main::scannerArrayList, path);
        measureTime("Scanner.nextLine() into LinkedList", Main::scannerLinkedList, path);
        measureTime("RandomAccessFile.readLine() into ArrayList", Main::randomAccessFileArrayList, path);
        measureTime("RandomAccessFile.readLine() into LinkedList", Main::randomAccessFileLinkedList, path);
        System.out.println("-----------------------------------------------------------");
    }

    private static void measureTime(String name, Function<String, List<String>> fn, String path) {
        System.out.println("-----------------------------------------------------------");
        System.out.println("run: " + name);
        long startTime = System.nanoTime();
        List<String> l = fn.apply(path);
        long estimatedTime = System.nanoTime() - startTime;
        System.out.println("lines: " + l.size());
        System.out.println("estimatedTime: " + estimatedTime / 1_000_000_000.);
    }

    private static List<String> bufferReaderToLinkedList(String path) {
        return bufferReaderToList(path, new LinkedList<>());
    }

    private static List<String> bufferReaderToArrayList(String path) {
        return bufferReaderToList(path, new ArrayList<>());
    }

    private static List<String> bufferReaderToList(String path, List<String> list) {
        try {
            final BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8));
            String line;
            while ((line = in.readLine()) != null) {
                list.add(line);
            }
            in.close();
        } catch (final IOException e) {
            e.printStackTrace();
        }
        return list;
    }

    private static List<String> readAllLines(String path) {
        try {
            return Files.readAllLines(Paths.get(path));
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    private static List<String> randomAccessFileLinkedList(String path) {
        return randomAccessFile(path, new LinkedList<>());
    }

    private static List<String> randomAccessFileArrayList(String path) {
        return randomAccessFile(path, new ArrayList<>());
    }

    private static List<String> randomAccessFile(String path, List<String> list) {
        try {
            RandomAccessFile file = new RandomAccessFile(path, "r");
            String str;
            while ((str = file.readLine()) != null) {
                list.add(str);
            }
            file.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return list;
    }

    private static List<String> scannerLinkedList(String path) {
        return scanner(path, new LinkedList<>());
    }

    private static List<String> scannerArrayList(String path) {
        return scanner(path, new ArrayList<>());
    }

    private static List<String> scanner(String path, List<String> list) {
        try {
            Scanner scanner = new Scanner(new File(path));
            while (scanner.hasNextLine()) {
                list.add(scanner.nextLine());
            }
            scanner.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        return list;
    }


}

run: BufferedReader.readLine() into LinkedList, lines: 1000000, estimatedTime: 0.105118655

run: BufferedReader.readLine() into ArrayList, lines: 1000000, estimatedTime: 0.072696934

run: Files.readAllLines(), lines: 1000000, estimatedTime: 0.087753316

run: Scanner.nextLine() into ArrayList, lines: 1000000, estimatedTime: 0.743121734

run: Scanner.nextLine() into LinkedList, lines: 1000000, estimatedTime: 0.867049885

run: RandomAccessFile.readLine() into ArrayList, lines: 1000000, estimatedTime: 11.413323046

run: RandomAccessFile.readLine() into LinkedList, lines: 1000000, estimatedTime: 11.423862897

BufferedReader is the fastest, Files.readAllLines() is also acceptable, Scanner is slow due to its regex handling, and RandomAccessFile is unacceptably slow.

Gadolinium answered 4/11, 2018 at 19:42 Comment(2)
Hey @YAMM, in your gist, the System.out label ("... into ArrayList") is actually using a LinkedList instead of an ArrayList. That means buffered reading into an ArrayList is the fastest. – Wert
Thanks! I fixed it! I would also suggest (almost) always using ArrayList, since its overall performance is just better. – Gadolinium
Scanner can't be as fast as BufferedReader, because it uses regular expressions to parse the input, which makes it slower. BufferedReader instead reads the file a block at a time.

BufferedReader bf = new BufferedReader(new FileReader("FileName"));

You can then use readLine() to read from bf.
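
For example, a minimal read loop over bf (a sketch; remember to close the reader when done):

String line;
while ((line = bf.readLine()) != null) {
    // handle the line here
}
bf.close();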

Hope it serves your purpose.

Tirzah answered 8/6, 2015 at 14:16 Comment(1)
I think you meant "Scanner can't be as fast as BufferedReader". – Ables
You can use FileChannel and ByteBuffer from Java NIO. From what I have observed, the ByteBuffer size is the most critical factor in reading data faster. The code below reads the contents of the file.

public static void main(String[] args) throws Exception {
    FileInputStream fileInputStream = new FileInputStream(new File("sample4.txt"));
    FileChannel fileChannel = fileInputStream.getChannel();
    ByteBuffer byteBuffer = ByteBuffer.allocate(1024);

    // a single read() only fills the buffer once, so loop until the file is drained
    while (fileChannel.read(byteBuffer) > 0) {
        byteBuffer.flip();
        while (byteBuffer.hasRemaining()) {
            System.out.print((char) byteBuffer.get());
        }
        byteBuffer.clear();
    }

    fileChannel.close();
}

You can check for '\n' here to detect line breaks. Thanks.


You can even read into several buffers at once with a scattering read to read files faster, i.e.

fileChannel.read(buffers);

where

      ByteBuffer b1 = ByteBuffer.allocate(B1);
      ByteBuffer b2 = ByteBuffer.allocate(B2);
      ByteBuffer b3 = ByteBuffer.allocate(B3);

      ByteBuffer[] buffers = {b1, b2, b3};

This saves the user process from making several system calls (which can be expensive) and allows the kernel to optimize handling of the data, because it has information about the total transfer. If multiple CPUs are available, it may even be possible to fill and drain several buffers simultaneously.

From this book.
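
Putting those pieces together, a minimal scattering-read sketch (the file name and buffer sizes are placeholders; as the comments below note, converting raw bytes to chars properly calls for a CharsetDecoder):

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ScatterRead {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("sample4.txt");
             FileChannel channel = in.getChannel()) {
            ByteBuffer b1 = ByteBuffer.allocate(1024);
            ByteBuffer b2 = ByteBuffer.allocate(1024);
            ByteBuffer b3 = ByteBuffer.allocate(1024);
            ByteBuffer[] buffers = {b1, b2, b3};

            // one read() call fills the buffers in order: b1, then b2, then b3
            long bytesRead = channel.read(buffers);
            System.out.println("read " + bytesRead + " bytes into " + buffers.length + " buffers");
        }
    }
}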

Walloping answered 21/10, 2013 at 4:54 Comment(8)
A direct byte buffer is of no benefit if the data is being read into the Java side of the JVM. Its benefit comes if you're just copying the data between two channels without looking at it in the Java code. – Justiciary
@EJP I know. I deleted that line here, and then your comment came in. :-) – Walloping
@Walloping, I would like to try using FileChannel. Could you give me an example based on my code above? – Vocalist
It can't read in parallel from a single disk unless it has multiple heads. There is nothing here that actually reads lines at all, so it really isn't an answer to the question at all. – Justiciary
Not only am I reading the file, but I am searching for the words I want, using a delimiter. Does this method work? if (file_Array[0].equalsIgnoreCase("case")) { //do something } – Vocalist
@user2822351 you can do this. – Walloping
Your edited code doesn't convert byte to char correctly. The correct technique is to use a CharsetDecoder. – Justiciary
@Walloping Why? It's your answer. You're the one who's recommending NIO, so you're the one who is expected to know how to use it. The CharsetDecoder hint should be enough if you do. Apparently you don't. My answer is to use BufferedReader. – Justiciary
Use BufferedReader for high-performance file access. But the default buffer size of 8192 bytes is often too small; for huge files you can increase the buffer size by orders of magnitude to boost your reading performance. For example:

String thisLine;
BufferedReader br = new BufferedReader(new FileReader("file.dat"), 1000 * 8192);
while ((thisLine = br.readLine()) != null) {
    System.out.println(thisLine);
}
Brazier answered 22/6, 2017 at 14:46 Comment(1)
But it won't have much effect. 8192 is surprisingly adequate. – Justiciary
Just updating this thread: we now have Java 8 to do this job:

List<String> lines = Files.readAllLines(Paths.get(file_path));
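
One caveat: readAllLines() holds every line in memory at once. If you only need to stream over the lines, Files.lines() (also Java 8) reads them lazily; a minimal sketch, assuming a plain text file:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class StreamLines {
    public static void main(String[] args) throws IOException {
        // the stream holds an open file handle, so use try-with-resources
        try (Stream<String> lines = Files.lines(Paths.get("file.txt"))) {
            lines.forEach(System.out::println);
        }
    }
}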
Selfdevotion answered 13/2, 2019 at 15:16 Comment(0)
I am wondering why no one has mentioned MappedByteBuffer. I believe it's the most efficient way to read large files of up to 2 GB (the limit of a single mapping).

Almost all projects require us to work with files. But what if the file is excessively large? Once the heap fills up, the JVM throws an OutOfMemoryError. Java offers the MappedByteBuffer class (Java NIO), which facilitates working with sizable files.

MappedByteBuffer establishes a virtual-memory mapping of the file. The contents of the file are mapped into virtual memory rather than loaded onto the heap, and the JVM can read and write that memory without issuing OS read/write system calls for each access. Additionally, we can map a subset of a file rather than the entire file.

We obtain a MappedByteBuffer by mapping the file through its FileChannel. A FileChannel enables file manipulation, writing, and reading; it is accessible via FileOutputStream (for writing), FileInputStream (for reading only), and RandomAccessFile.

To map a file, FileChannel provides the map() method. It takes three arguments:

  1. The map mode (PRIVATE, READ_ONLY, or READ_WRITE)

  2. The position (the offset into the file where the mapping starts)

  3. The size (the number of bytes to map)

Once the MappedByteBuffer is obtained, the get() and put() methods can be used to read and write data, respectively.

The file is located in the /resources directory, so we can resolve its path with the following function:

Path getFileURIFromResources(String fileName) throws Exception {
    ClassLoader classLoader = getClass().getClassLoader();
    return Paths.get(classLoader.getResource(fileName).toURI());
}

This is how we read from the MappedByteBuffer:

CharBuffer charBuffer = null;
Path pathToRead = getFileURIFromResources("fileToRead.txt");

try (FileChannel fileChannel = (FileChannel) Files.newByteChannel(
  pathToRead, EnumSet.of(StandardOpenOption.READ))) {
 
    MappedByteBuffer mappedByteBuffer = fileChannel
      .map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());

    if (mappedByteBuffer != null) {
        charBuffer = Charset.forName("UTF-8").decode(mappedByteBuffer);
    }
}

This is how we write:

CharBuffer charBuffer = CharBuffer
  .wrap("This will be written to the file");
Path pathToWrite = getFileURIFromResources("fileToWriteTo.txt");

try (FileChannel fileChannel = (FileChannel) Files
  .newByteChannel(pathToWrite, EnumSet.of(
    StandardOpenOption.READ, 
    StandardOpenOption.WRITE, 
    StandardOpenOption.TRUNCATE_EXISTING))) {
    
    MappedByteBuffer mappedByteBuffer = fileChannel
      .map(FileChannel.MapMode.READ_WRITE, 0, charBuffer.length());
    
    if (mappedByteBuffer != null) {
        mappedByteBuffer.put(
          Charset.forName("utf-8").encode(charBuffer));
    }
} 
Mockery answered 6/1 at 20:44 Comment(0)
You must investigate which part of the program is taking the time.

As per EJP's answer, you should use BufferedReader.

If the string processing really is taking the time, then you should consider using threads: one thread reads from the file and queues the lines, while other string-processing threads dequeue and process them. You will need to investigate how many threads to use; the thread count should be related to the number of CPU cores, so that the full CPU is used.
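
A bare-bones sketch of that producer/consumer split, using a BlockingQueue (the file name, queue capacity, and sentinel object are made up for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ParallelLineProcessor {
    private static final String POISON = new String("EOF"); // sentinel object, compared by identity

    public static void main(String[] args) throws Exception {
        int workers = Runtime.getRuntime().availableProcessors();
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        Thread[] consumers = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            consumers[i] = new Thread(() -> {
                try {
                    String line;
                    while ((line = queue.take()) != POISON) { // identity check against the sentinel
                        // process the line here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumers[i].start();
        }

        // producer: one thread reads the file and queues lines
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                queue.put(line);
            }
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON); // one pill per consumer so every worker shuts down
        }
        for (Thread t : consumers) {
            t.join();
        }
    }
}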

Stercoricolous answered 21/10, 2013 at 5:7 Comment(6)
If string processing is taking time, then multiple threads doing the same thing will decrease the time, right? Like parallel processing. – Stercoricolous
This is usable only when processing one line does not depend on processing another line. – Stercoricolous
If string processing is the bottleneck, putting it into a separate thread will only move the bottleneck, not eliminate it. – Justiciary
The bottleneck can be eliminated if processing is done in multiple threads in parallel. – Stercoricolous
Concurrency isn't always the solution. The actual problem was either the performance of Scanner, String.split(), or equalsIgnoreCase (as it has to deep-compare the strings). – Eurus
No, the bottleneck can be distributed if you process in multiple threads. You can't eliminate processing by distributing it. – Justiciary
You can read the file in chunks if there are millions of records; that avoids potential memory issues. You need to keep the last pointer to calculate the offset into the file.

int pageOffset = lastOffset + counter;          // lastOffset and counter are kept by the caller
int skipRecords = (pageOffset - 1) * batchSize; // batchSize = lines per chunk

try (FileReader reader = new FileReader(filePath);
     BufferedReader bufferedReader = new BufferedReader(reader)) {

    bufferedReader.lines()
            .skip(skipRecords)
            .limit(batchSize)   // stop after one chunk instead of reading to the end
            .forEach(cline -> {
                // process (e.g. print) the line
                System.out.println(cline);
            });
}
Freeland answered 25/5, 2022 at 16:37 Comment(0)
If you wish to read all lines together, then you should have a look at the Files API of Java 7. It's really simple to use.

But a better approach would be to process this file in batches. Have a reader which reads chunks of lines from the file and a writer which does the required processing or persists the data. Batching will ensure that it still works even if the line count grows to a billion in the future. You can also have a batch job that uses multithreading to increase the overall throughput. I would recommend that you have a look at Spring Batch.
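
For illustration, a plain-Java sketch of that reader/processor batching shape (the chunk size, file name, and process() step are placeholders; Spring Batch adds restartability and threading on top of the same idea):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ChunkedReader {
    public static void main(String[] args) throws IOException {
        int chunkSize = 1_000;
        List<String> chunk = new ArrayList<>(chunkSize);
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkSize) {
                    process(chunk); // the "writer" step
                    chunk.clear();
                }
            }
            if (!chunk.isEmpty()) {
                process(chunk);     // flush the final partial chunk
            }
        }
    }

    private static void process(List<String> lines) {
        // persist or transform the chunk here
    }
}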

Autum answered 21/10, 2013 at 5:11 Comment(1)
How exactly will a 'batch' help when he is reading and processing a line at a time? – Justiciary
