I have a Lambda using the AWS Java SDK that is triggered via SQS. The SQS message contains at least one S3 bucket/key combination. The lambda reads the specified file, which is full of Tomcat log events, sorts them by session ID, and then writes each session's events to another, common S3 bucket.
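The grouping itself is just a stream collect. For context, it looks roughly like this (Packet::getSessionId is an illustrative accessor name; my real one may differ):

protected Map<String, List<Packet>> groupBySession(List<Packet> packets) {
    // Group the parsed log events by their session ID (illustrative sketch).
    return packets.stream()
            .collect(Collectors.groupingBy(Packet::getSessionId));
}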
Reads from the source buckets never hit any service limits so there is no problem there.
Writes to the single destination bucket result in thousands of 503 Slow Down exceptions. To get around this I am writing to 20,000+ highly random partitions (key prefixes) in the destination bucket, but I still get these errors.
My most recent test showed that during any particular second the lambdas were only writing about 1,800 files TOTAL to the bucket, and those writes were spread evenly over many partitions.
Since the AWS documentation states that S3 supports 3,500 write requests per second, per partition, there should be no reason to get a 503 Slow Down exception. Because these writes come from up to 1,000 concurrently running lambdas, I have no good way to implement any kind of exponential backoff strategy, and I really shouldn't need one: it doesn't make sense that S3 would be issuing these exceptions, and the application relies on maximum concurrent processing.
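For completeness, the closest thing to backoff I could apply is the SDK's per-client retry policy when each client is built. My actual client construction is not shown here; this is just a sketch of that knob in the v1 Java SDK, not what I am running today:

import com.amazonaws.ClientConfiguration;
import com.amazonaws.retry.PredefinedRetryPolicies;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

private AmazonS3 buildS3Client() {
    // Sketch only: raise the SDK's built-in retry ceiling so throttled requests
    // (including 503 Slow Down) are retried with the SDK's exponential backoff.
    ClientConfiguration config = new ClientConfiguration()
            .withRetryPolicy(PredefinedRetryPolicies.getDefaultRetryPolicyWithCustomMaxRetries(10));
    return AmazonS3ClientBuilder.standard()
            .withClientConfiguration(config)
            .build();
}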
So, long story short: 200+ concurrent lambdas, writing ~1,800 small files per second through 200+ distinct AmazonS3 clients and across ~30,000 partitions, are still throwing thousands of S3 Slow Down exceptions. How can I get rid of them?
I have tried several different partition counts, from 5,000 up to 32,000, to try to distribute the writes. No change.
Partition examples:

80e49ec0903f1e8fa1a8/
a5468f4a8a184cd13696/
cf4d516ca85abafb7b26/
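My actual prefix generator lives in Utils and is not shown here; the keys simply start with 20 random hex characters, so the shape is roughly this (illustrative sketch, not my exact code):

private static final SecureRandom RANDOM = new SecureRandom();

protected String randomPartition() {
    // Build a 20-hex-character prefix like the examples above (sketch only).
    byte[] bytes = new byte[10];
    RANDOM.nextBytes(bytes);
    StringBuilder sb = new StringBuilder(21);
    for (byte b : bytes) {
        sb.append(String.format("%02x", b & 0xff));
    }
    return sb.append('/').toString();
}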
protected PutObjectResult saveFile(AmazonS3 s3, String bucket, String partition, List<Packet> packets) {
    //noinspection deprecation
    System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true");

    String fileName = Utils.getIntermediateName(partition);

    // Concatenate the prepared payload of every packet into a single byte array.
    byte[] content = packets.stream()
            .map(Packet::getMessage)
            .map(Utils::preparePacket)
            .filter(Objects::nonNull)
            .collect(ByteArrayOutputStream::new, (baos, bytes) -> {
                try {
                    baos.write(bytes);
                } catch (IOException e1) {
                    log("Could not write to a stream: " + e1.toString());
                }
            }, (baos1, baos2) -> baos1.write(baos2.toByteArray(), 0, baos2.size()))
            .toByteArray();

    // Upload the combined content as one small object under the given partition.
    PutObjectResult result = null;
    try (InputStream stream = new ByteArrayInputStream(content)) {
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(content.length);

        PutObjectRequest request = new PutObjectRequest(bucket, fileName, stream, metadata);
        request.setCannedAcl(CannedAccessControlList.BucketOwnerFullControl);

        result = s3.putObject(request);
    } catch (Exception e) {
        log("In saveFile: " + e.toString());
    }
    return result;
}
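For reference, saveFile is called roughly like this, one call per session, each under its own random partition (groupBySession, randomPartition, allPackets, and destinationBucket are illustrative names from the sketches above, not my exact code):

Map<String, List<Packet>> bySession = groupBySession(allPackets);
for (List<Packet> sessionPackets : bySession.values()) {
    String partition = randomPartition(); // e.g. "80e49ec0903f1e8fa1a8/"
    saveFile(s3, destinationBucket, partition, sessionPackets);
}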
Expected results are that I get no 503 Slow Down exceptions, as I am way, way under the documented limit of 3,500 writes per second per partition.
Actual results are that I get thousands of 503 Slow Down exceptions.