Fastest way to write huge data to a text file in Java

I have to write a huge amount of data to a text [CSV] file. I used a BufferedWriter to write the data, and it took around 40 seconds to write 174 MB. Is this the fastest speed Java can offer?

bufferedWriter = new BufferedWriter ( new FileWriter ( "fileName.csv" ) );

Note: These 40 seconds include the time spent iterating over and fetching the records from the result set as well. :) The 174 MB is for 400,000 rows in the result set.

Featherston answered 30/6, 2009 at 6:57 Comment(1)
You wouldn't happen to have anti-virus active on the machine where you run this code?Genethlialogy

You might try removing the BufferedWriter and just using the FileWriter directly. On a modern system there's a good chance you're just writing to the drive's cache memory anyway.

It takes me in the range of 4-5 seconds to write 175MB (4 million strings) -- this is on a dual-core 2.4GHz Dell running Windows XP with an 80GB, 7200-RPM Hitachi disk.

Can you isolate how much of the time is record retrieval and how much is file writing?

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

public class FileWritingPerfTest {
    

private static final int ITERATIONS = 5;
private static final double MEG = (Math.pow(1024, 2));
private static final int RECORD_COUNT = 4000000;
private static final String RECORD = "Help I am trapped in a fortune cookie factory\n";
private static final int RECSIZE = RECORD.getBytes().length;

public static void main(String[] args) throws Exception {
    List<String> records = new ArrayList<String>(RECORD_COUNT);
    int size = 0;
    for (int i = 0; i < RECORD_COUNT; i++) {
        records.add(RECORD);
        size += RECSIZE;
    }
    System.out.println(records.size() + " 'records'");
    System.out.println(size / MEG + " MB");
    
    for (int i = 0; i < ITERATIONS; i++) {
        System.out.println("\nIteration " + i);
        
        writeRaw(records);
        writeBuffered(records, 8192);
        writeBuffered(records, (int) MEG);
        writeBuffered(records, 4 * (int) MEG);
    }
}

private static void writeRaw(List<String> records) throws IOException {
    File file = File.createTempFile("foo", ".txt");
    try {
        FileWriter writer = new FileWriter(file);
        System.out.print("Writing raw... ");
        write(records, writer);
    } finally {
        // comment this out if you want to inspect the files afterward
        file.delete();
    }
}

private static void writeBuffered(List<String> records, int bufSize) throws IOException {
    File file = File.createTempFile("foo", ".txt");
    try {
        FileWriter writer = new FileWriter(file);
        BufferedWriter bufferedWriter = new BufferedWriter(writer, bufSize);
    
        System.out.print("Writing buffered (buffer size: " + bufSize + ")... ");
        write(records, bufferedWriter);
    } finally {
        // comment this out if you want to inspect the files afterward
        file.delete();
    }
}

private static void write(List<String> records, Writer writer) throws IOException {
    long start = System.currentTimeMillis();
    for (String record: records) {
        writer.write(record);
    }
    // writer.flush(); // close() should take care of this
    writer.close(); 
    long end = System.currentTimeMillis();
    System.out.println((end - start) / 1000f + " seconds");
}
}
Testator answered 30/6, 2009 at 8:34 Comment(13)
@rozario Each write call should only produce about 175 MB and then delete itself. If not, you'll end up with 175 MB x 4 different write calls x 5 iterations = 3.5 GB of data. You might check the return value from file.delete() and, if it's false, throw an exception.Testator
Notice that writer.flush() is not necessary in this case because writer.close() flushes the buffer implicitly. BTW: best practice is to use try-with-resources instead of explicitly calling close().Feckless
FWIW, this was written for Java 5, which at least wasn't documented to flush on close, and which didn't have try-with-resources. It could probably use updating (see the try-with-resources sketch after these comments).Testator
Works perfectly for me!Maloney
I also have the same issue, but in my case I have to create a zip file containing 100 CSVs, each CSV being 7-8 MB in size. Which class can be used for fast creation/downloading of the CSVs?Cerebral
@InduKaur Here's a good primer on writing CSV in Java, it's not that hard to do by hand. baeldung.com/java-csv If you want a library, I haven't tried this one myself but it claims to be very fast: github.com/osiegmar/FastCSVTestator
@DavidMoles I tried the suggested GitHub code, but it still takes 1 second to write a 6 MB CSV file. Is that a reasonable time, or should it be less?Cerebral
@InduKaur It's really hard to say without knowing exactly what your code is doing and what the data looks like. I suggest posting a separate question with a minimal, reproducible example.Testator
OK, I will do that.Cerebral
@DavidMoles I have updated my question here; could you take a look and help? #62351039Cerebral
I just looked up the Java 1.1 documentation of Writer.close() and it says “Close the stream, flushing it first.” So calling flush() before close() was never needed. By the way, one of the reasons why BufferedWriter might be useless is that FileWriter, a specialization of OutputStreamWriter, has to have its own buffering anyway when it does the conversion from char sequences to byte sequences in the target encoding. Having more buffers at the front end doesn’t help when the charset encoder has to flush its smaller byte buffer at a higher rate anyway.Dominquedominquez
@Dominquedominquez You're right, close() is documented to flush, IDK how 2009-me missed that. Re: buffers, you might be right, but the OutputStreamWriter docs do recommend wrapping it in a BufferedWriter "for top efficiency". (And seem to have done so since 1.1.)Testator
Indeed, but the actual implications of additional buffering, and how to decide whether to use it or not, have never been addressed well in the documentation or tutorials (as far as I know). Note that the NIO API does not even have a Buffered… counterpart for the channel types at all.Dominquedominquez
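As a reference for the try-with-resources point raised in the comments, here is a minimal sketch of the buffered variant from the answer above rewritten that way. It is only a sketch: it reuses the imports and style of the FileWritingPerfTest class, and the method name is made up for illustration.

private static void writeBufferedTryWithResources(List<String> records, int bufSize) throws IOException {
    File file = File.createTempFile("foo", ".txt");
    try (BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(file), bufSize)) {
        for (String record : records) {
            bufferedWriter.write(record);
        }
        // no explicit flush() or close(): try-with-resources closes (and therefore flushes)
        // the writer, even if write() throws
    } finally {
        // comment this out if you want to inspect the file afterward
        file.delete();
    }
}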

Try memory-mapped files (it takes 300 ms to write 174 MB on my machine, Core 2 Duo, 2.5 GB RAM):

byte[] buffer = "Help I am trapped in a fortune cookie factory\n".getBytes();
int number_of_lines = 400000;

FileChannel rwChannel = new RandomAccessFile("textfile.txt", "rw").getChannel();
ByteBuffer wrBuf = rwChannel.map(FileChannel.MapMode.READ_WRITE, 0, buffer.length * number_of_lines);
for (int i = 0; i < number_of_lines; i++)
{
    wrBuf.put(buffer);
}
rwChannel.close();
Theran answered 17/2, 2011 at 6:40 Comment(5)
What is aMessage.length() meant to represent when you are instantiating the ByteBuffer?Ogden
Just FYI, running this on a MacBook Pro (late 2013), 2.6 GHz Core i7, with an Apple 1 TB SSD, takes about 140 ms for 185 MB (lines = 4 million).Orgasm
@JerylCook Memory mapping is useful when you know the exact size. Here, we are reserving buffer.length * number_of_lines bytes beforehand.Theran
Thank you! Can I use it for a file over 2 GB (see the sketch after these comments)? MappedByteBuffer map(MapMode var1, long var2, long var4): if (var4 > 2147483647L) { throw new IllegalArgumentException("Size exceeds Integer.MAX_VALUE")Smtih
What a magic method: 105 ms on a Dell Core i5 (1.6/2.3 GHz)Disinfect
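On the over-2-GB question: a single call to FileChannel.map cannot map more than Integer.MAX_VALUE bytes, but you can map the file in consecutive windows and fill each one in turn. A rough sketch under that assumption follows; the file name, record count, and window size are made up for illustration, and the output file is several GB.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ChunkedMappedWrite {
    public static void main(String[] args) throws Exception {
        byte[] record = "Help I am trapped in a fortune cookie factory\n".getBytes();
        long totalRecords = 60_000_000L;          // roughly 2.7 GB of output
        long windowSize = 512L * 1024 * 1024;     // map at most 512 MB at a time

        try (RandomAccessFile raf = new RandomAccessFile("bigfile.txt", "rw");
             FileChannel channel = raf.getChannel()) {
            long written = 0;
            long recordsLeft = totalRecords;
            while (recordsLeft > 0) {
                // map a window holding a whole number of records, capped at windowSize
                long recordsInWindow = Math.min(recordsLeft, windowSize / record.length);
                long mapSize = recordsInWindow * record.length;
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, written, mapSize);
                for (long i = 0; i < recordsInWindow; i++) {
                    buf.put(record);
                }
                written += mapSize;
                recordsLeft -= recordsInWindow;
            }
        }
    }
}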

Only for the sake of statistics:

The machine is an old Dell with a new SSD:

CPU: Intel Pentium D 2.8 GHz

SSD: Patriot Inferno 120 GB SSD

4000000 'records'
175.47607421875 MB

Iteration 0
Writing raw... 3.547 seconds
Writing buffered (buffer size: 8192)... 2.625 seconds
Writing buffered (buffer size: 1048576)... 2.203 seconds
Writing buffered (buffer size: 4194304)... 2.312 seconds

Iteration 1
Writing raw... 2.922 seconds
Writing buffered (buffer size: 8192)... 2.406 seconds
Writing buffered (buffer size: 1048576)... 2.015 seconds
Writing buffered (buffer size: 4194304)... 2.282 seconds

Iteration 2
Writing raw... 2.828 seconds
Writing buffered (buffer size: 8192)... 2.109 seconds
Writing buffered (buffer size: 1048576)... 2.078 seconds
Writing buffered (buffer size: 4194304)... 2.015 seconds

Iteration 3
Writing raw... 3.187 seconds
Writing buffered (buffer size: 8192)... 2.109 seconds
Writing buffered (buffer size: 1048576)... 2.094 seconds
Writing buffered (buffer size: 4194304)... 2.031 seconds

Iteration 4
Writing raw... 3.093 seconds
Writing buffered (buffer size: 8192)... 2.141 seconds
Writing buffered (buffer size: 1048576)... 2.063 seconds
Writing buffered (buffer size: 4194304)... 2.016 seconds

As we can see, the raw method is slower than the buffered one.

Sorcery answered 23/5, 2011 at 16:12 Comment(1)
However, the buffered method becomes slower as the size of the text gets bigger.Disinfect

Your transfer speed is likely not to be limited by Java. Instead I would suspect (in no particular order)

  1. the speed of transfer from the database
  2. the speed of transfer to the disk

If you read the complete dataset and then write it out to disk, that will take longer, since the JVM will have to allocate memory, and the DB read and disk write will happen sequentially. Instead, I would write to the buffered writer for every read that you make from the DB, so the operation will be closer to a concurrent one (I don't know if you're doing that or not); see the sketch below.
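A minimal sketch of that per-row pattern, writing each record as soon as it is fetched rather than collecting the whole result set first. The JDBC URL, credentials, query, and column names are placeholders, not from the question.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamResultSetToCsv {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:yourdb://host/db", "user", "pass");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name, amount FROM big_table");
             BufferedWriter out = new BufferedWriter(new FileWriter("fileName.csv"))) {
            while (rs.next()) {
                // each row goes straight to the (buffered) writer, so DB fetch and disk I/O overlap
                out.write(rs.getLong("id") + "," + rs.getString("name") + "," + rs.getBigDecimal("amount"));
                out.newLine();
            }
        }
    }
}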

Menke answered 30/6, 2009 at 8:27 Comment(0)

For these bulky reads from the DB you may want to tune your Statement's fetch size, as sketched below. It might save a lot of round trips to the DB.

http://download.oracle.com/javase/1.5.0/docs/api/java/sql/Statement.html#setFetchSize%28int%29
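A short sketch of what that tuning looks like, assuming an already-open java.sql.Connection named connection; the fetch size of 1000 and the query are arbitrary examples.

// connection is an already-open java.sql.Connection (assumed here)
try (Statement stmt = connection.createStatement()) {
    // ask the driver to fetch rows from the server in batches of ~1000
    // instead of one round trip per row (exact behaviour is driver-dependent)
    stmt.setFetchSize(1000);
    try (ResultSet rs = stmt.executeQuery("SELECT * FROM big_table")) {
        while (rs.next()) {
            // write the row to the file here
        }
    }
}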

Kaddish answered 30/8, 2010 at 12:55 Comment(0)

package all.is.well;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import junit.framework.TestCase;

/**
 * @author Naresh Bhabat
 * 
The following implementation helps to deal with extra-large files in Java.
This program is tested for dealing with a 2GB input file.
There are some points where extra logic can be added in the future.


Please note: if we want to deal with a binary input file, then instead of reading lines, we need to read bytes from the read file object.



It uses a random access file, which is almost like a streaming API.


 * ****************************************
Notes regarding executor framework and its readings.
Please note :ExecutorService executor = Executors.newFixedThreadPool(10);

 *         For 10 threads:    total time required for reading and writing the text: 349.317 seconds
 * 
 *         For 100 threads:   total time required for reading and writing the text: 464.042 seconds
 * 
 *         For 1000 threads:  total time required for reading and writing the text: 466.538 seconds
 *         For 10000 threads: total time required for reading and writing the text: 479.701 seconds
 *
 * 
 */
public class DealWithHugeRecordsinFile extends TestCase {

	static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
	static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
	static volatile RandomAccessFile fileToWrite;
	static volatile RandomAccessFile file;
	static volatile String fileContentsIter;
	static volatile int position = 0;

	public static void main(String[] args) throws IOException, InterruptedException {
		long currentTimeMillis = System.currentTimeMillis();

		try {
			fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw");//for random write,independent of thread obstacles 
			file = new RandomAccessFile(FILEPATH, "r");//for random read,independent of thread obstacles 
			seriouslyReadProcessAndWriteAsynch();

		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
		Thread currentThread = Thread.currentThread();
		System.out.println(currentThread.getName());
		long currentTimeMillis2 = System.currentTimeMillis();
		double time_seconds = (currentTimeMillis2 - currentTimeMillis) / 1000.0;
		System.out.println("Total time required for reading the text in seconds " + time_seconds);

	}

	/**
	 * @throws IOException
	 * Something  asynchronously serious
	 */
	public static void seriouslyReadProcessAndWriteAsynch() throws IOException {
		ExecutorService executor = Executors.newFixedThreadPool(10);//pls see for explanation in comments section of the class
		while (true) {
			String readLine = file.readLine();
			if (readLine == null) {
				break;
			}
			Runnable genuineWorker = new Runnable() {
				@Override
				public void run() {
					// do hard processing here in this thread; I have consumed
					// some time and eaten an exception in the write method.
					writeToFile(FILEPATH_WRITE, readLine);
					// System.out.println(" :" +
					// Thread.currentThread().getName());

				}
			};
			executor.execute(genuineWorker);
		}
		executor.shutdown();
		try {
			// wait for all submitted write tasks to finish instead of busy-spinning on isTerminated()
			executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
		} catch (InterruptedException e) {
			Thread.currentThread().interrupt();
		}
		System.out.println("Finished all threads");
		file.close();
		fileToWrite.close();
	}

	/**
	 * @param filePath
	 * @param data
	 * @param position
	 */
	private static void writeToFile(String filePath, String data) {
		try {
			// fileToWrite.seek(position);
			data = "\n" + data;
			if (!data.contains("Randomization")) {
				return;
			}
			System.out.println("Let us do something time consuming to make this thread busy"+(position++) + "   :" + data);
			System.out.println("Lets consume through this loop");
			int i=1000;
			while(i>0){
			
				i--;
			}
			fileToWrite.write(data.getBytes());
			throw new Exception();
		} catch (Exception exception) {
			System.out.println("exception was thrown but still we are able to proceed further"
					+ " \n This can be used for marking failure of the records");
			//exception.printStackTrace();

		}

	}
}
Witmer answered 9/10, 2016 at 5:55 Comment(4)
Please add some text explaining why this answer is better than other answers. Having comments in the code is not sufficient.Pollute
The reason this could be better: it is a real-time scenario and it is a working example. Among its other benefits, it does the reading, processing, and writing asynchronously. It uses an efficient Java API (i.e. RandomAccessFile), which is thread-safe, so multiple threads can read and write on it simultaneously. It does not cause memory overhead at runtime, and it also does not crash the system. It is a multipurpose solution for dealing with record-processing failures, which can be tracked in the respective thread. Please let me know if I can help more.Witmer
Thank you, that is the information that your post needed. Perhaps consider adding it to the post body :)Pollute
If with 10 threads it takes 349.317 seconds to write 2 GB of data, then it may qualify as the slowest way to write huge data (unless you mean milliseconds).Theran

For those who want to improve the time for retrieving the records and dumping them into the file (i.e. no processing on the records): instead of putting them into an ArrayList, append those records to a StringBuffer. Apply the toString() function to get a single String and write it to the file in one go.

For me, the retrieval time was reduced from 22 seconds to 17 seconds.
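A minimal sketch of that approach, assuming the fetched records are already available as a list of strings named rows (a name made up for this example). The answer uses StringBuffer; the StringBuilder shown here behaves the same way without the synchronization overhead.

StringBuilder sb = new StringBuilder();
for (String row : rows) {              // 'rows' stands in for the fetched records
    sb.append(row).append('\n');
}
try (FileWriter writer = new FileWriter("fileName.csv")) {
    writer.write(sb.toString());       // one write call for the whole payload
}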

Pattani answered 19/5, 2020 at 7:12 Comment(3)
That was just an example to create some fake "records" — I would assume that in the real world the records are coming from somewhere else (a database in the OP's case). But yes, if you need to read all the content into memory first, a StringBuffer would probably be faster. A raw String array (String[]) would also probably be faster.Testator
Using a StringBuffer will waste a lot of resources. Most standard Java writers use a StreamEncoder internally, and it has its own buffer of 8192 bytes. Even if you create one String of all the data, it goes in as chunks and is encoded from chars to byte[]. The best solution would be to implement your own Writer which directly uses the write(byte[]) method of FileOutputStream, which uses the underlying native writeBytes method.Empiricism
Like @DavidMoles said, the source format of the data is also very important in this scenario. If the data is already available as bytes, write directly to the FileOutputStream (a rough sketch follows below).Empiricism
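Along those lines, a rough sketch of writing pre-encoded bytes through a BufferedOutputStream on top of a FileOutputStream, so no per-write char-to-byte encoding happens. This uses a plain BufferedOutputStream rather than a custom Writer, and the class name, file name, and buffer size are illustrative.

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class RawByteCsvWrite {
    public static void main(String[] args) throws IOException {
        // encode the record once; afterwards only byte[] writes happen
        byte[] record = "Help I am trapped in a fortune cookie factory\n".getBytes(StandardCharsets.UTF_8);
        try (BufferedOutputStream out =
                 new BufferedOutputStream(new FileOutputStream("fileName.csv"), 1 << 20)) {
            for (int i = 0; i < 400_000; i++) {
                out.write(record);
            }
        }
    }
}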
