First, as @xtreme-biker commented, performance greatly depends on your hardware. Specifically, my first advice would be to check whether you are running on a virtual machine or on a native host. In my case, with a CentOS VM on an i7 with an SSD drive, I can read 123,000 docs per second, but exactly the same code running on the Windows host on the same drive reads up to 387,000 docs per second.
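For reference, those numbers come from a plain full scan. A minimal sketch of the kind of timing loop behind them, assuming collection is a MongoCollection&lt;Document&gt;:

// Minimal throughput measurement: full scan, count documents, report docs/second.
LongAdder count = new LongAdder();
long start = System.nanoTime();
collection.find().forEach((Block<Document>) document -> count.increment());
double seconds = (System.nanoTime() - start) / 1e9;
System.out.printf("%d docs in %.1f s (%.0f docs/s)%n", count.sum(), seconds, count.sum() / seconds);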
Next, let's assume that you really need to read the full collection, that is, you must perform a full scan. And let's assume that you cannot change the configuration of your MongoDB server, only optimize your code.
Then everything comes down to what
collection.find().forEach((Block<Document>) document -> count.increment());
actually does.
A quick unrolling of MongoCollection.find() shows that it actually does this:
// Resolve where and how to read.
ReadPreference readPref = ReadPreference.primary();
ReadConcern concern = ReadConcern.DEFAULT;
MongoNamespace ns = new MongoNamespace(databaseName, collectionName);
// The codec that turns raw BSON into Document instances.
Decoder<Document> codec = new DocumentCodec();
FindOperation<Document> fop = new FindOperation<Document>(ns, codec);
ReadWriteBinding readBinding = new ClusterBinding(getCluster(), readPref, concern);
// Executing the find returns a batch cursor over the result set.
QueryBatchCursor<Document> cursor = (QueryBatchCursor<Document>) fop.execute(readBinding);
AtomicInteger count = new AtomicInteger(0);
try (MongoBatchCursorAdapter<Document> cursorAdapter = new MongoBatchCursorAdapter<Document>(cursor)) {
    while (cursorAdapter.hasNext()) {
        Document doc = cursorAdapter.next();
        count.incrementAndGet();
    }
}
Here FindOperation.execute() is rather fast (under 10 ms) and most of the time is spent inside the while loop, specifically inside the private method QueryBatchCursor.getMore().
getMore() calls DefaultServerConnection.command(), and its time is consumed basically by two operations: 1) fetching string data from the server and 2) converting string data into BsonDocument.
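To get a rough feel for how much of that is pure decoding, here is a hedged sketch that re-decodes one document from its raw BSON bytes in a loop, using the standard org.bson API (the sample document and iteration count are arbitrary):

// Isolate decoding cost: decode the same raw BSON bytes repeatedly.
Document sample = collection.find().first();
RawBsonDocument raw = RawBsonDocument.parse(sample.toJson());
DocumentCodec documentCodec = new DocumentCodec();
long start = System.nanoTime();
for (int i = 0; i < 1_000_000; i++) {
    documentCodec.decode(new BsonBinaryReader(raw.getByteBuffer().asNIO()), DecoderContext.builder().build());
}
System.out.printf("decode: %.0f ns/doc%n", (System.nanoTime() - start) / 1_000_000.0);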
It turns out that Mongo is quite smart with regard to how many network round trips it makes to fetch a large result set. It will first fetch 100 results with a firstBatch command and then fetch larger batches with nextBatch, the batch size depending on the collection size, up to a limit.
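The one public knob you do have here is the batch size: batchSize(int) is a regular FindIterable method. A sketch, reusing collection and count from above (the value 10000 is purely illustrative, and whether it helps depends on your document size):

// Hint the server to return larger batches, trading memory for fewer round trips.
collection.find()
          .batchSize(10000)
          .forEach((Block<Document>) document -> count.increment());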
So, under the hood, something like this happens to fetch the first batch.
ReadPreference readPref = ReadPreference.primary();
ReadConcern concern = ReadConcern.DEFAULT;
MongoNamespace ns = new MongoNamespace(databaseName, collectionName);
FieldNameValidator noOpValidator = new NoOpFieldNameValidator();
DocumentCodec payloadDecoder = new DocumentCodec();
// CommandResultCodecProvider is package-private, so it has to be instantiated reflectively.
Constructor<CodecProvider> providerConstructor = (Constructor<CodecProvider>) Class
        .forName("com.mongodb.operation.CommandResultCodecProvider")
        .getDeclaredConstructor(Decoder.class, List.class);
providerConstructor.setAccessible(true);
CodecProvider firstBatchProvider = providerConstructor.newInstance(payloadDecoder, Collections.singletonList("firstBatch"));
CodecProvider nextBatchProvider = providerConstructor.newInstance(payloadDecoder, Collections.singletonList("nextBatch"));
Codec<BsonDocument> firstBatchCodec = fromProviders(Collections.singletonList(firstBatchProvider)).get(BsonDocument.class);
Codec<BsonDocument> nextBatchCodec = fromProviders(Collections.singletonList(nextBatchProvider)).get(BsonDocument.class);
ReadWriteBinding readBinding = new ClusterBinding(getCluster(), readPref, concern);
// Issue the find command directly on a connection and pull the cursor out of the reply.
BsonDocument find = new BsonDocument("find", new BsonString(collectionName));
Connection conn = readBinding.getReadConnectionSource().getConnection();
BsonDocument results = conn.command(databaseName, find, noOpValidator, readPref, firstBatchCodec,
        readBinding.getReadConnectionSource().getSessionContext(), true, null, null);
BsonDocument cursor = results.getDocument("cursor");
long cursorId = cursor.getInt64("id").longValue();
BsonArray firstBatch = cursor.getArray("firstBatch");
Then the cursorId is used to fetch each next batch.
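Each of those fetches looks roughly like this. A sketch reusing conn, readPref, noOpValidator, nextBatchCodec and readBinding from the snippet above; the driver's actual call sites differ slightly:

// Fetch the next batch for an open cursor via the getMore command.
BsonDocument getMore = new BsonDocument("getMore", new BsonInt64(cursorId))
        .append("collection", new BsonString(collectionName));
BsonDocument nextResults = conn.command(databaseName, getMore, noOpValidator, readPref, nextBatchCodec,
        readBinding.getReadConnectionSource().getSessionContext(), true, null, null);
BsonArray nextBatch = nextResults.getDocument("cursor").getArray("nextBatch");
cursorId = nextResults.getDocument("cursor").getInt64("id").longValue(); // 0 once the cursor is exhausted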
In my opinion, the "problem" with the implementation of the driver is that the String-to-JSON decoder is injected, but the JsonReader (on which the decode() method relies) is not. This holds all the way down to com.mongodb.internal.connection.InternalStreamConnection, where you are already near the socket communication.
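To make the asymmetry concrete: the decoder half is exposed through the public codec registry, so you could plug in your own Document codec, but there is no equivalent hook for the reader. A sketch (MyDocumentCodec is a hypothetical custom Codec&lt;Document&gt;, not a driver class):

// You can inject your own decoder...
CodecRegistry registry = CodecRegistries.fromRegistries(
        CodecRegistries.fromCodecs(new MyDocumentCodec()),   // hypothetical custom codec
        MongoClient.getDefaultCodecRegistry());
MongoCollection<Document> tuned = collection.withCodecRegistry(registry);
// ...but the BsonReader handed to decode() is created deep inside the driver and cannot be swapped.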
Therefore, I think there is hardly anything you could do to improve on MongoCollection.find() unless you go as deep as InternalStreamConnection.sendAndReceiveAsync().
You can't reduce the number of round trips and you can't change the way the response is converted into BsonDocument, not without bypassing the driver and writing your own client, which I doubt is a good idea.
P.S. If you want to try some of the code above, you'll need the getCluster() method, which requires a dirty hack into mongo-java-driver.
private Cluster getCluster() {
    Field cluster, delegate;
    Cluster mongoCluster = null;
    try {
        // MongoClient wraps a delegate that holds the Cluster; both fields are private.
        delegate = mongoClient.getClass().getDeclaredField("delegate");
        delegate.setAccessible(true);
        Object clientDelegate = delegate.get(mongoClient);
        cluster = clientDelegate.getClass().getDeclaredField("cluster");
        cluster.setAccessible(true);
        mongoCluster = (Cluster) cluster.get(clientDelegate);
    } catch (NoSuchFieldException | SecurityException | IllegalArgumentException | IllegalAccessException e) {
        System.err.println(e.getClass().getName() + " " + e.getMessage());
    }
    return mongoCluster;
}