How efficient is the encoding/decoding algorithm of BASE64 class in Java?
I am about to use an algorithm to encode a variable length but very long String field retrieved from an XML file, then that encoded data should be persisted in the database.

Later, when I receive a second file, I need to fetch the encoded data from the database (previously stored), decode it, and compare it against the new data to check for duplicates.

I tried the org.apache.commons.codec.binary.Base64 class, which has 2 methods:

  1. encodeBase64(byte[] binaryData)
  2. decodeBase64(String base64String)

which work perfectly fine and solve my problem. But it converts a 55-char string to just a 6-char String.

So I wonder if there is any case where this algorithm encodes 2 Strings which are very large and have only 1 char mismatch (for example) into the same encoded byte array.

I do not know much about the Base64 class, but if anyone can help me out it will be really helpful.

If you can suggest any other algorithm which shortens a large String to a fixed length and solves my purpose, I will be happy to use it.

Thanks in advance.

Preventive answered 15/6, 2011 at 9:34 Comment(6)
Any correct implementation of base64 will make a String larger not smaller. Are you trying to compact the String?Weinberg
yes actually I need some algo to compress a long String into smaller one and store in db and later inflate it to get the original String. I saw the ZipOutputStream class implementation over the internet, but I didn't try it out.Preventive
I think you're looking for a hash function such as MD5 (which converts all input into a 128-bit output). Base64 encoding will generally result in output which is four-thirds the size of the input - so it doesn't result in fixed-length output at all.Lhary
Okay I see. So what algorithm do u suggest as a solution to my problem ? Can you send me a link to any examplePreventive
Are you simply trying to determine whether an input string has already been stored in the database? If so, selecting an appropriate hash function and storing the hash value may be sufficient. Different hash functions have different properties (performance, likelihood of collision, etc) and you would need to do some research to select one that is appropriate to your needs.Gravois
I've just noticed that you say the Base64 class is encoding a 55-char input into a 6-char output. If that class really is doing base64 encoding, that implies it's encoded a 4-char input (without == padding on the end). It might help if you posted a short sample of how you're using the class, because I suspect it might not be doing what you think it's doing (have you tried decoding the 6-char output as well?).Lhary
G
13

Not very efficient.

Also, using sun.misc classes gives a non-portable application.

Check out the following performance comparisons from MiGBase64:

(image: encode/decode throughput comparison chart of Base64 implementations, from MiGBase64)


So I wonder if there is any case where these algorithm encodes 2 Strings which are very large and have only 1 char mismatch (for example) into same encoded byte arrays.

Base64 isn't a hashing algorithm; it's an encoding, and must therefore be bi-directional. Collisions can't be allowed, by necessity - otherwise decoding would be non-deterministic. Base64 is designed to represent arbitrary binary data in an ASCII string, so it generally makes the data longer: every 3 input bytes become 4 output characters. The Base64 representation of a Unicode string will also vary depending on the character encoding (UTF-8, UTF-16) used to produce the bytes. For example:

Base64( UTF8( "test" ) ) => "dGVzdA=="
Base64( UTF16( "test" ) ) => "/v8AdABlAHMAdA=="
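The same pair can be reproduced with the java.util.Base64 class that ships with Java 8+ (the question predates it; treat this as an alternative to the Apache Commons class - the class name here is just illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64CharsetDemo {
    public static void main(String[] args) {
        String s = "test";
        // Same string, two byte encodings, two different Base64 outputs
        String utf8  = Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
        String utf16 = Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_16));
        System.out.println(utf8);   // dGVzdA==
        System.out.println(utf16);  // /v8AdABlAHMAdA== (UTF-16 with a BOM)
    }
}
```

Note that the UTF-16 form is longer partly because Java's UTF-16 charset prepends a byte-order mark (FE FF).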

Solution 1

Use lossless compression

GZip( UTF8( "test" ) )

Here you are converting the string to a byte array and using lossless compression to reduce the number of bytes you have to store. You can vary the character encoding and compression algorithm to reduce the number of bytes depending on the Strings you will be storing (i.e. if they're mostly ASCII then UTF-8 will probably be best).

Pros: no collisions, ability to recover original string
Cons: Bytes required to store value is variable; bytes required to store value is larger
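A minimal sketch of this approach using the JDK's GZIPOutputStream/GZIPInputStream (the class and method names here are illustrative; requires Java 9+ for InputStream.readAllBytes):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipDemo {
    // Compress a string to bytes suitable for a BLOB/BINARY column
    static byte[] compress(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Decompress the stored bytes back to the original string
    static String decompress(byte[] bytes) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String original = "a very long XML field... a very long XML field...";
        byte[] stored = compress(original);                       // persist these bytes
        System.out.println(decompress(stored).equals(original));  // true
    }
}
```

Store the resulting byte[] directly in a binary column; converting it back to a String before storage is what causes the corruption problems discussed in the comments below.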

Solution 2

Use a hashing algorithm

SHA256( UTF8( "test" ) )

Here you are converting the string to a fixed-length set of bytes with a hashing function. Hashing is uni-directional and by its nature collisions are possible. However, based on the profile and number of Strings that you expect to process, you can select a hash function to minimise the likelihood of collisions.

Pros: Bytes required to store value is fixed; bytes required to store value is small
Cons: Collisions possible, no ability to recover original string
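A sketch of the hashing approach with the JDK's MessageDigest (class name illustrative; SHA-256 always yields 32 bytes, which could be stored in a BINARY(32) or hex CHAR(64) column):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256Demo {
    // Hash a string of any length down to a fixed 32-byte digest
    static byte[] sha256(String s) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("SHA-256")
                            .digest(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        byte[] digest = sha256("test");
        System.out.println(digest.length);  // always 32, regardless of input length
        // Hex form for storage in a CHAR(64) column
        System.out.println(String.format("%064x", new BigInteger(1, digest)));
    }
}
```

To check a new file for a duplicate, hash its field and compare digests; only on a digest match (if you want certainty) fetch and compare the full values.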

Gravois answered 15/6, 2011 at 10:1 Comment(3)
@Gravois can you tell me a way to effectively compress and decompress Strings ?Preventive
@Gravois I am happy with the answer ....accepted, though I used java.util.zip.Deflater and Inflater classes in order to compress/decompress the string effectively. Now comes another issue, the output string generated after compressing contains characters which the eclipse console can't display even if I tried to compress it in UTF8 format, i have to check out whether my database will support the compressed string output or not. Any way thanks a lot for the answer.Preventive
@Subhadip Compression with java.util.zip.Deflater will produce an array of bytes - not a string. You can store this in a SQL BLOB or BINARY column.Gravois
L
1

I just saw your comment - it seems you're actually looking for compression rather than hashing as I initially thought. Though in that case, you won't be able to get fixed length output for arbitrary input (think about it, an infinite number of inputs cannot map bijectively to a finite number of outputs), so I hope that wasn't a strong requirement.

In any case, the performance of your chosen compression algorithm will depend on the characteristics of the input text. In the absence of further information, DEFLATE compression (as used by the Zip input streams, IIRC) is a good general-purpose algorithm to start with, and at least use as a basis for comparison. For ease of implementation, though, you can use the Deflater class built into the JDK, which uses ZLib compression.
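A round-trip with the raw Deflater/Inflater API looks roughly like this (a sketch; class and buffer names are illustrative, and unlike the GZIP streams this produces a bare zlib stream with no header for other tools):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateDemo {
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();  // no more input; flush everything
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] inflate(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws DataFormatException {
        String s = "a long repetitive string, a long repetitive string";
        byte[] packed = deflate(s.getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(inflate(packed), StandardCharsets.UTF_8).equals(s)); // true
    }
}
```

The compressed output is arbitrary binary data, not text - persist it as bytes rather than forcing it through a String.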

If your input strings have particular patterns, then different compression algorithms may be more or less efficient. In one respect it doesn't matter which one you use, if you don't intend the compressed data to be read by any other processes - so long as you can compress and decompress yourself, it'll be transparent to your clients.

Lhary answered 15/6, 2011 at 10:2 Comment(3)
Firstly thanks for your suggestion, I understand what you said .Base64 here solves the purpose for me, but the thing that I am concerned about is can I rely that the Base64 algo will be capable enough to encode and decode effectively for all larger strings. Will there be any case in which the Base64 algorithm generates same output for large strings which are different only by one char or so ?Preventive
Does base64 solve your purpose? I thought you wanted to make the strings smaller - base64 will result in larger output. And it also doesn't place any cap on the output size - if input is 3000 chars, the encoded output will be 4000 chars - so this may fail your requirements for "encoding efficiently for all larger strings". In answer to your last question though, there won't be any collisions; base64 is fully bidirectional.Lhary
yes I am sorry that I initially thought that Base64 class compresses the output. But later found out that the java.util.zip.Deflater and java.util.zip.Inflater class is what I need but again got stuck in the output of the compressed String it's not in Unicode format. I tried refactoring it to UTF8 but the string literal displayed on my eclipse console is not in UTF8 format though. I will have to see if that data can be persisted onto my Oracle DB. Anyway thanks for your help :) cheers.Preventive

© 2022 - 2024 — McMap. All rights reserved.