The code in this answer is an attempt to optimize documents like the OP's example document, i.e. documents containing copies of exactly identical objects, in the case at hand completely identical, fully embedded fonts. It does not merge merely nearly identical objects, e.g. multiple subsets of the same font into one single union subset.
In the course of comments to the questions it became clear that the duplicate fonts in the OP's PDF indeed were identical full copies of a source font file. To merge such duplicate objects, one has to collect the complex objects (arrays, dictionaries, streams) of a document, compare them with each other, and then merge duplicates.
As actual pairwise comparison of all complex objects of a document can take too much time in case of large documents, the following code calculates a hash of these objects and only compares objects with identical hash.
To merge duplicates, the code selects one of the duplicates and replaces all references to any of the other duplicates with a reference to the chosen one, removing the other duplicates from the document object pool. To do this more effectively, the code initially not only collects all complex objects but also all references to each of them.
The optimization code
This is the method to call to optimize a PDDocument
:
public void optimize(PDDocument pdDocument) throws IOException {
Map<COSBase, Collection<Reference>> complexObjects = findComplexObjects(pdDocument);
for (int pass = 0; ; pass++) {
int merges = mergeDuplicates(complexObjects);
if (merges <= 0) {
System.out.printf("Pass %d - No merged objects\n\n", pass);
break;
}
System.out.printf("Pass %d - Merged objects: %d\n\n", pass, merges);
}
}
(OptimizeAfterMerge method under test)
The optimization takes multiple passes as the equality of some objects can only be recognized after duplicates they reference have been merged.
The following helper methods and classes collect the complex objects of a PDF and the references to each of them:
Map<COSBase, Collection<Reference>> findComplexObjects(PDDocument pdDocument) {
COSDictionary catalogDictionary = pdDocument.getDocumentCatalog().getCOSObject();
Map<COSBase, Collection<Reference>> incomingReferences = new HashMap<>();
incomingReferences.put(catalogDictionary, new ArrayList<>());
Set<COSBase> lastPass = Collections.<COSBase>singleton(catalogDictionary);
Set<COSBase> thisPass = new HashSet<>();
while(!lastPass.isEmpty()) {
for (COSBase object : lastPass) {
if (object instanceof COSArray) {
COSArray array = (COSArray) object;
for (int i = 0; i < array.size(); i++) {
addTarget(new ArrayReference(array, i), incomingReferences, thisPass);
}
} else if (object instanceof COSDictionary) {
COSDictionary dictionary = (COSDictionary) object;
for (COSName key : dictionary.keySet()) {
addTarget(new DictionaryReference(dictionary, key), incomingReferences, thisPass);
}
}
}
lastPass = thisPass;
thisPass = new HashSet<>();
}
return incomingReferences;
}
void addTarget(Reference reference, Map<COSBase, Collection<Reference>> incomingReferences, Set<COSBase> thisPass) {
COSBase object = reference.getTo();
if (object instanceof COSArray || object instanceof COSDictionary) {
Collection<Reference> incoming = incomingReferences.get(object);
if (incoming == null) {
incoming = new ArrayList<>();
incomingReferences.put(object, incoming);
thisPass.add(object);
}
incoming.add(reference);
}
}
(OptimizeAfterMerge helper methods findComplexObjects
and addTarget
)
interface Reference {
public COSBase getFrom();
public COSBase getTo();
public void setTo(COSBase to);
}
static class ArrayReference implements Reference {
public ArrayReference(COSArray array, int index) {
this.from = array;
this.index = index;
}
@Override
public COSBase getFrom() {
return from;
}
@Override
public COSBase getTo() {
return resolve(from.get(index));
}
@Override
public void setTo(COSBase to) {
from.set(index, to);
}
final COSArray from;
final int index;
}
static class DictionaryReference implements Reference {
public DictionaryReference(COSDictionary dictionary, COSName key) {
this.from = dictionary;
this.key = key;
}
@Override
public COSBase getFrom() {
return from;
}
@Override
public COSBase getTo() {
return resolve(from.getDictionaryObject(key));
}
@Override
public void setTo(COSBase to) {
from.setItem(key, to);
}
final COSDictionary from;
final COSName key;
}
(OptimizeAfterMerge helper interface Reference
with implementations ArrayReference
and DictionaryReference
)
And the following helper methods and classes finally identify and merge duplicates:
int mergeDuplicates(Map<COSBase, Collection<Reference>> complexObjects) throws IOException {
List<HashOfCOSBase> hashes = new ArrayList<>(complexObjects.size());
for (COSBase object : complexObjects.keySet()) {
hashes.add(new HashOfCOSBase(object));
}
Collections.sort(hashes);
int removedDuplicates = 0;
if (!hashes.isEmpty()) {
int runStart = 0;
int runHash = hashes.get(0).hash;
for (int i = 1; i < hashes.size(); i++) {
int hash = hashes.get(i).hash;
if (hash != runHash) {
int runSize = i - runStart;
if (runSize != 1) {
System.out.printf("Equal hash %d for %d elements.\n", runHash, runSize);
removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, i));
}
runHash = hash;
runStart = i;
}
}
int runSize = hashes.size() - runStart;
if (runSize != 1) {
System.out.printf("Equal hash %d for %d elements.\n", runHash, runSize);
removedDuplicates += mergeRun(complexObjects, hashes.subList(runStart, hashes.size()));
}
}
return removedDuplicates;
}
int mergeRun(Map<COSBase, Collection<Reference>> complexObjects, List<HashOfCOSBase> run) {
int removedDuplicates = 0;
List<List<COSBase>> duplicateSets = new ArrayList<>();
for (HashOfCOSBase entry : run) {
COSBase element = entry.object;
for (List<COSBase> duplicateSet : duplicateSets) {
if (equals(element, duplicateSet.get(0))) {
duplicateSet.add(element);
element = null;
break;
}
}
if (element != null) {
List<COSBase> duplicateSet = new ArrayList<>();
duplicateSet.add(element);
duplicateSets.add(duplicateSet);
}
}
System.out.printf("Identified %d set(s) of identical objects in run.\n", duplicateSets.size());
for (List<COSBase> duplicateSet : duplicateSets) {
if (duplicateSet.size() > 1) {
COSBase surviver = duplicateSet.remove(0);
Collection<Reference> surviverReferences = complexObjects.get(surviver);
for (COSBase object : duplicateSet) {
Collection<Reference> references = complexObjects.get(object);
for (Reference reference : references) {
reference.setTo(surviver);
surviverReferences.add(reference);
}
complexObjects.remove(object);
removedDuplicates++;
}
surviver.setDirect(false);
}
}
return removedDuplicates;
}
boolean equals(COSBase a, COSBase b) {
if (a instanceof COSArray) {
if (b instanceof COSArray) {
COSArray aArray = (COSArray) a;
COSArray bArray = (COSArray) b;
if (aArray.size() == bArray.size()) {
for (int i=0; i < aArray.size(); i++) {
if (!resolve(aArray.get(i)).equals(resolve(bArray.get(i))))
return false;
}
return true;
}
}
} else if (a instanceof COSDictionary) {
if (b instanceof COSDictionary) {
COSDictionary aDict = (COSDictionary) a;
COSDictionary bDict = (COSDictionary) b;
Set<COSName> keys = aDict.keySet();
if (keys.equals(bDict.keySet())) {
for (COSName key : keys) {
if (!resolve(aDict.getItem(key)).equals(bDict.getItem(key)))
return false;
}
// In case of COSStreams we strictly speaking should
// also compare the stream contents here. But apparently
// their hashes coincide well enough for the original
// hashing equality, so let's just assume...
return true;
}
}
}
return false;
}
static COSBase resolve(COSBase object) {
while (object instanceof COSObject)
object = ((COSObject)object).getObject();
return object;
}
(OptimizeAfterMerge helper methods mergeDuplicates
, mergeRun
, equals
, and resolve
)
static class HashOfCOSBase implements Comparable<HashOfCOSBase> {
public HashOfCOSBase(COSBase object) throws IOException {
this.object = object;
this.hash = calculateHash(object);
}
int calculateHash(COSBase object) throws IOException {
if (object instanceof COSArray) {
int result = 1;
for (COSBase member : (COSArray)object)
result = 31 * result + member.hashCode();
return result;
} else if (object instanceof COSDictionary) {
int result = 3;
for (Map.Entry<COSName, COSBase> entry : ((COSDictionary)object).entrySet())
result += entry.hashCode();
if (object instanceof COSStream) {
try ( InputStream data = ((COSStream)object).createRawInputStream() ) {
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] buffer = new byte[8192];
int bytesRead = 0;
while((bytesRead = data.read(buffer)) >= 0)
md.update(buffer, 0, bytesRead);
result = 31 * result + Arrays.hashCode(md.digest());
} catch (NoSuchAlgorithmException e) {
throw new IOException(e);
}
}
return result;
} else {
throw new IllegalArgumentException(String.format("Unknown complex COSBase type %s", object.getClass().getName()));
}
}
final COSBase object;
final int hash;
@Override
public int compareTo(HashOfCOSBase o) {
int result = Integer.compare(hash, o.hash);
if (result == 0)
result = Integer.compare(hashCode(), o.hashCode());
return result;
}
}
(OptimizeAfterMerge helper class HashOfCOSBase
)
Applying the code to the OP's example document
The OP's example document is about 6.5 MB in size. Applying the above code like this
PDDocument pdDocument = PDDocument.load(SOURCE);
optimize(pdDocument);
pdDocument.save(RESULT);
results in a PDF less than 700 KB in size, and it appears to be complete.
(If something's missing, please tell, I'll try and fix that.)
Words of warning
On one hand this optimizer will not recognize all identical duplicates. In particular in case of circular references duplicate circles of objects won't be recognized because the code only recognizes duplicates if their contents are identical which usually does not happen in duplicate object circles.
On the other hand this optimizer might already be overly eager in some cases because some duplicates might be needed as separate objects for PDF viewers to accept each instance as an individual entity.
Furthermore, this program touches all kinds of objects in the file, even those defining the inner structures of the PDF, but it does not attempt to update any PDFBox classes managing this structure (PDDocument
, PDDocumentCatalog
, PDAcroForm
, ...). To not have any pending changes screw up the whole document, therefore, please only apply this program to freshly loaded, unmodified PDDocument
instances and save it soon after without further ado.