base64 decoded file is not equal to the original unencoded file

I have a normal PDF file, A.pdf. A third party encodes the file in Base64 and sends it to me through a web service as one long string (I have no control over the third party).

My problem is that when I decode the string with Java's org.apache.commons.codec.binary.Base64 and write the output to a file called B.pdf, I expect B.pdf to be identical to A.pdf, but B.pdf turns out slightly different from A.pdf. As a result, B.pdf is not recognized as a valid PDF by Acrobat.

Does Base64 have different encoding or charset variants? Can I detect how the string I received was encoded, so that B.pdf equals A.pdf?

EDIT: this is the file I want to decode; after decoding it should open as a PDF:

my encoded file


This is the header of each file, opened in Notepad++:

**A.pdf**
        %PDF-1.4
        %±²³´
        %Created by Wnv/EP PDF Tools v6.1
        1 0 obj
        <<
        /PageMode /UseNone
        /ViewerPreferences 2 0 R
        /Type /Catalog

  **B.pdf**
        %PDF-1.4
        %±²³´
        %Created by Wnv/EP PDF Tools v6.1
        1 0! bj
        <<
        /PageMode /UseNone
        /ViewerPreferences 2 0 R
        /]
        pe /Catalog

This is how I decode the string:

    private static void decodeStringToFile(String encodedInputStr,
            String outputFileName) throws IOException {
        BufferedReader in = null;
        BufferedOutputStream out = null;
        try {
            in = new BufferedReader(new StringReader(encodedInputStr));
            out = new BufferedOutputStream(new FileOutputStream(outputFileName));
            decodeStream(in, out);
            out.flush();
        } finally {
            if (in != null)
                in.close();
            if (out != null)
                out.close();
        }
    }

    private static void decodeStream(BufferedReader in, OutputStream out)
            throws IOException {
        while (true) {
            String s = in.readLine();
            if (s == null)
                break;
            //System.out.println(s);
            byte[] buf = Base64.decodeBase64(s);
            out.write(buf);
        }
    }
Straddle answered 24/1, 2012 at 17:13 Comment(4)
I've seen similar results in the past when using Strings. You might try using the raw byte[]s instead and see if it makes a difference. - Wootan
You need to show the block of code that's doing the Base64 encoding as well. - Transverse
I only get a string from the third party. Should I convert the string to bytes with String.getBytes(charset)? How do I know which charset to use? - Straddle
I don't have the encoding code. As I said, it comes from a third party that is not available to me (it's not even in Java). - Straddle
  1. You are breaking your decoding by working line by line. Base64 decoders simply ignore whitespace, so an encoder is free to break lines anywhere. A 4-character Base64 group, and therefore a byte of the original content, may well be split across two text lines. You should concatenate all the lines and decode the file in one go.

  2. Prefer byte[] over String when supplying content to the Base64 class methods. A String implies a character-set encoding, which may not do what you want.
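
As a rough illustration of point 1, here is a minimal sketch that strips all whitespace and decodes the whole payload in a single call. It assumes Java 8 or later and uses the JDK's java.util.Base64 rather than the commons-codec class from the question; the class and method names (WholeStringBase64Decoder, decodeAll, decodeToFile) are made up for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Base64;

final class WholeStringBase64Decoder {

    // Decodes the complete Base64 text in one call instead of line by line.
    static byte[] decodeAll(String encoded) {
        // Drop every whitespace character (CR, LF, spaces, tabs) so that a
        // line break can no longer split a 4-character Base64 group.
        String compact = encoded.replaceAll("\\s+", "");
        return Base64.getDecoder().decode(compact);
    }

    // Hypothetical replacement for the question's decodeStringToFile.
    static void decodeToFile(String encoded, String outputFileName) throws IOException {
        Files.write(Paths.get(outputFileName), decodeAll(encoded));
    }

    public static void main(String[] args) {
        // Stand-in data; a real caller would pass the string from the web service.
        byte[] original = new byte[200];
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) (i * 7);
        }
        String flat = Base64.getEncoder().encodeToString(original);

        // Wrap at 74 characters per line, as in the data discussed here.
        StringBuilder wrapped = new StringBuilder();
        for (int i = 0; i < flat.length(); i += 74) {
            wrapped.append(flat, i, Math.min(flat.length(), i + 74)).append("\r\n");
        }

        byte[] decoded = decodeAll(wrapped.toString());
        System.out.println("round trip identical: " + Arrays.equals(original, decoded));
    }
}
```

Because the line breaks are removed before decoding, the line length chosen by the sender should no longer matter.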

Distorted answered 24/1, 2012 at 17:23 Comment(5)
Concatenating all the lines into one string did not work, and using String.getBytes() did not work either. In fact, these two approaches gave a worse result than the original one (B was very different from A). - Straddle
@dov.amir: Perhaps you should post a sample of the Base64 data sent by the server, so that we can see what is going on. Regardless of that, decoding Base64 content line by line is still broken in the general case, unless the lines are split in multiples of 4 characters. - Distorted
@dov.amir: 1. You definitely have an issue with the newlines, since there are exactly 74 Base64 characters in each line. This is probably the reason for the mangling you see in the PDF header. 2. Are you certain about the Base64 stream? The file that you uploaded contains exactly 118,602 characters, which is not a multiple of 4 as it should be. If your link is really supposed to contain an entire file, then the problem seems to be somewhere before the Base64 decoding. What is the size of the source PDF file? - Distorted
Solved it! My decoding worked after I got the third party to encode the text in lines of length 72 instead of 74. - Straddle
@dov.amir: That would make the line length a multiple of 4, which works around the issue in your code. IMO, though, it would be better to fix your own code: be liberal in what you accept, and conservative in what you send. After all, 74 is a rather typical line length for Base64 encoders... - Distorted
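
The multiple-of-4 point from the comments can be reproduced with a small experiment. The sketch below is an assumption-laden demo, not the asker's code: it uses the JDK's java.util.Base64 (commons-codec may treat leftover characters slightly differently), synthetic data instead of the real PDF, and hypothetical names (LineLengthDemo, wrap, decodeLineByLine). It decodes each line separately, the way the question does, once with 72-character lines and once with 74-character lines.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Base64;

final class LineLengthDemo {

    // Splits a flat Base64 string into lines of the given width.
    static String wrap(String flat, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < flat.length(); i += width) {
            sb.append(flat, i, Math.min(flat.length(), i + width)).append('\n');
        }
        return sb.toString();
    }

    // Mimics the question's approach: every line is decoded on its own.
    static byte[] decodeLineByLine(String wrapped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String line : wrapped.split("\n")) {
            byte[] part = Base64.getDecoder().decode(line);
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 222 bytes encode to 296 Base64 characters with no '=' padding.
        byte[] original = new byte[222];
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) (i * 31);
        }
        String flat = Base64.getEncoder().encodeToString(original);

        for (int width : new int[] {72, 74}) {
            byte[] decoded = decodeLineByLine(wrap(flat, width));
            System.out.println("width " + width + ": identical = "
                    + Arrays.equals(original, decoded));
        }
    }
}
```

With 72-character lines every line holds whole 4-character groups, so per-line decoding should reproduce the input. With 74-character lines each line ends two characters into a group, so the following line starts misaligned and the output diverges after the first line, which is consistent with the B.pdf header being correct at first and mangled shortly afterwards.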
