Java: Universal BaseN encoder/decoder working with large data sizes
Asked Answered
P

3

5

I'm looking for a decent BaseN encoder (with custom charset) in Java, that is not limited by input data size (array of bytes).

Something like this:

https://github.com/mklemm/base-n-codec-java

But for "unlimited" data length without any unnecessary memory/performance penalty and "BigInteger abuse magic". Simply something that works as standard BASE64 encoders, but universally for any base/charset. Any solution, or idea how to achieve that is welcomed.

Maybe, if someone has experiences with apache BaseNCodec:

https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/BaseNCodec.html

It looked promising, however it's an Abstract class, and available implemetations look harder to make, than start from scratch.


I need it for a binary data to custom character set encoder (where the number of characters in the set is mutable, "ABCDE" = Base5, "ABCDE-+*/." = Base10, ...).
Update: The "Base N Codec" from GitHub (above) seems to be buggy, so I used the following code at the end:

https://dzone.com/articles/base-x-encoding

Polymorphism answered 9/11, 2016 at 12:44 Comment(8)
If your base is not a power of two, how do you define an encoding that doesn't treat the data as a potentially (very) large integer, and thus need BigInteger or equivalent?Barros
@Barros Yes, but it should be done "virtually", sequence by sequence, with some sort of overflow to another sequence, if necessary (not directly using BigInteger). Otherwise it will be "targeted at relatively short byte sequences (up to ~1000 bytes)" (as states in github.com/mklemm/base-n-codec-java).Polymorphism
Near impossible if by "universal" you mean to decode/encode existing standards, like Base85, Base64 etc. (different padding strategies and grouping method). If you mean like Integer.toString(i, radix) then whats the use case for this?Fetch
@Fetch Usecase: I need it for a binary data to custom character set encoder (where the number of characters in the set is mutable, "ABCDE" = Base5, "ABCDE-+*/." = Base10, ...).Polymorphism
Perhaps my question was not precise enough. The decisive point is, do you absolutely need to encode the binary data with as little wasted bits as possible (treating the entire stream as a number), or is some wastage not a problem (e.g. the Base85 encoding encodes 4 bytes in 5 symbols, which yields only 2^32 of the 85^5 possibilities used). If you just seek to encode for transport with a limited symbol set, accepting some wasted bits, it simplifies the problem to just picking a good number of symbols to encode a reasonably small number of bytes.Fetch
Then the problem can be reduced to handle groups of symbols, handling a number of bytes that can be handled within the value range that is covered with long integer math, removing the need for BigDecimal. Also this would facilitate relatively simple implementation in streaming fashion in a relatively simple state machine. Of course it will be incompatible with defined standards.Fetch
@Fetch The encoding should be as effective as possible.Polymorphism
Then Glen Bests answer is the answer. EnjoyFetch
O
4

General answer: No. Special case: Yes, for bases a power of 2.

Why? Because thoughts in the Q are in "strong competition" (actually probably "contradiction").

  1. As input, you want to support an unlimited integer in some base N (think of it as a BigIntegerBaseN). As output, you wat to support an unlimited integer in some base M (think of it as a BigIntegerBaseM).
  2. You want to carry out base conversion - which is mathematically defined as a series of (multiplications & additions) and divisions. See http://www.cut-the-knot.org/recurrence/conversion.shtml and https://math.stackexchange.com/questions/48968/how-to-change-from-base-n-to-m.
  3. You want to find a way of calculating such results without doing multiplications and divisions on BigIntegers (in any base of implementation).

Can you determine results of multiplication and division operations without carrying out multiplication and division calculations? NO. It's a contradiction. When you get the results, by definition, you've carried out the calculation.

So it's not a question of can you avoid the calcuations, but a question of how to streamline them.

  • If N and/or M are in bases a power of 2, then multiplication/division can be calculated by simple bit-shifting = same calculation with major stream-lining. That can be done by avoiding BigInteger calcs.
  • Otherwise, you can cache certain repeated calculations, storing interim results in an array or HashMap, then you get the same calculations with streamlining. But BigInteger calcs are still required (but redundant repetitions are avoided).

Hope that helps your approach. :)

Octavie answered 15/11, 2016 at 13:21 Comment(0)
E
4

A base N encoding is quite efficient if N is a power of 2, as then conversion can happen between fixed size groups of digits and a fixed size of bytes.

Base64: 26 - 6 bits per digit, hence 4 digits = 24 bits = 3 bytes.

Otherwise school multiplication must happen over the entire length, resulting in much "BigInteger" calculation.

A bit faster instead of for instance repeatedly multiplying/dividing by the base N, is having an array of powers of N.

For encoding of a byte array to digits you could use N0, N1, N2, N3, ... as byte arrays of lesser or equal lengths, and do repeated subtractions.

As byte is signed, short might be more suited. Say if the highest byte of the number is 98 and the lessequal N-power is 12 then circa 7 is that digit.

For decoding of digits to a byte array the same powers might be used.

Have fun.

Earthworm answered 9/11, 2016 at 22:43 Comment(2)
If I understand well the second option, it would not be very resource friendly for large data sizes.Polymorphism
Yes: all powers of N upto the number. Could be done lazy and static.For encoding/decoding once quite an overhead. On the other hand the space need is less than N-log(n).(n/256) bytes (n being the number), and you already need (n/256) bytes for the number. Fine for large N.Earthworm
O
4

General answer: No. Special case: Yes, for bases a power of 2.

Why? Because thoughts in the Q are in "strong competition" (actually probably "contradiction").

  1. As input, you want to support an unlimited integer in some base N (think of it as a BigIntegerBaseN). As output, you wat to support an unlimited integer in some base M (think of it as a BigIntegerBaseM).
  2. You want to carry out base conversion - which is mathematically defined as a series of (multiplications & additions) and divisions. See http://www.cut-the-knot.org/recurrence/conversion.shtml and https://math.stackexchange.com/questions/48968/how-to-change-from-base-n-to-m.
  3. You want to find a way of calculating such results without doing multiplications and divisions on BigIntegers (in any base of implementation).

Can you determine results of multiplication and division operations without carrying out multiplication and division calculations? NO. It's a contradiction. When you get the results, by definition, you've carried out the calculation.

So it's not a question of can you avoid the calcuations, but a question of how to streamline them.

  • If N and/or M are in bases a power of 2, then multiplication/division can be calculated by simple bit-shifting = same calculation with major stream-lining. That can be done by avoiding BigInteger calcs.
  • Otherwise, you can cache certain repeated calculations, storing interim results in an array or HashMap, then you get the same calculations with streamlining. But BigInteger calcs are still required (but redundant repetitions are avoided).

Hope that helps your approach. :)

Octavie answered 15/11, 2016 at 13:21 Comment(0)
H
1

You mention two very different approaches. The BaseN algorithm used in Github implementation is using the mathematical notation of converting an integer between bases. This is equivalent to saying that 10 is the same as 12 in base-8 arithmetic or 1010 in base-2 arithmetic. The algorithm interprets the byte stream as a large number and converts to the assigned base.

Base64 is a very different approach, and you can see an example in Wikipedia Base64 page. The algorithm basically splits the input stream into an array of 6 bits to each element. 2^6 = 64, thus the name Base64. It has a table with the 64 different characters and displays each element in the array (6-bit) to the corresponding conversion table.

I think that you need to select one of the two approaches, since they are very different and not compatible with each other. As for the implementation details, if opting for the second method, this would easier to implement I think, since you basically split into fixed-size parts the stream and encode it according to your own table.

The first method can get quite complicated, since arbitrary arithmetic operations rely on quite complex constructs. You can have a look at exist software, again @ Wikipedia' s list of arbitrary-precision arithmetic software.

Realistically, I think at some point you will find it hard to get characters for your conversions (as the base goes up or the number of bits goes up), unless you will be using the whole Unicode alphabet :).

Hope I helped a bit

Haematoxylin answered 19/11, 2016 at 9:54 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.