Why does '(int)(char)(byte)-2' produce 65534 in Java?

I encountered this question in a technical test for a job. Given the following code example:

public class Manager {
    public static void main (String args[]) {
        System.out.println((int) (char) (byte) -2);
    }
}

It gives the output as 65534.

This behavior shows for negative values only; 0 and positive numbers yield the same value that was passed to System.out.println. The byte cast here is insignificant; I have tried without it.

So my question is: what exactly is going on here?

Decrypt answered 8/7, 2014 at 15:35 Comment(9)
The fact that the byte cast doesn't change the result doesn't mean it's not doing anything...Mallina
The char cast is doing everything here; I don't have a clue what the byte cast is up to... can you tell me what it is doing here?Decrypt
Try System.out.println((int)(char)(byte)-130) and see if it's "just" 65536-130. Then read @Chris K answer and work it out! :)Mallina
Oh, and rerun it without the byte cast!Mallina
@Mallina Here the (byte) indeed changes the result, so it's a different situation.Cruzeiro
@Cruzeiro That's exactly my point. The byte cast isn't insignificant as the OP stated. He was "lucky" that in his case the cast had no effect on the result, but as you can see from various answers something is indeed going on there...Mallina
@Mallina Ok, so in the general case, the (byte) is needed, but if we are already in the range -128..127, we don't need it, so then it is insignificant.Cruzeiro
@Cruzeiro The (byte) cast does nothing, except constrain the number to the byte range. The overall effect does not rely on the byte cast.Sedentary
Why do trivial questions tend to make the hit list and garner so many upvotes?Bel
F
132

There are some prerequisites that we need to agree upon before you can understand what is happening here. Once you understand the following points, the rest is simple deduction:

  1. All primitive types within the JVM are represented as a sequence of bits. The int type is represented by 32 bits, the char and short types by 16 bits and the byte type is represented by 8 bits.

  2. All integral types in the JVM are signed, except char, which is unsigned. For a signed type, the highest bit represents the sign: 0 means a non-negative number (positive or zero) and 1 means a negative number. Negative values are stored in two's complement notation, so their bit patterns run in the opposite direction to the counting order of positive numbers. For example, positive byte values are represented in bits as follows:

    00 00 00 00 => (byte) 0
    00 00 00 01 => (byte) 1
    00 00 00 10 => (byte) 2
    ...
    01 11 11 11 => (byte) Byte.MAX_VALUE
    

    while the bit order for negative numbers is inverted:

    11 11 11 11 => (byte) -1
    11 11 11 10 => (byte) -2
    11 11 11 01 => (byte) -3
    ...
    10 00 00 00 => (byte) Byte.MIN_VALUE
    

    This notation also explains why the negative range holds one more number than the positive range: the patterns with a leading 0 must also cover the number 0. Remember, all this is only a matter of interpreting a bit pattern. Negative numbers could be represented differently, but two's complement is handy because it allows for some rather fast transformations, as a small example later on shows.

    As mentioned, this does not apply to the char type. The char type holds a 16-bit Unicode (UTF-16) value with a non-negative "numeric range" of 0 to 65535.

  3. When converting between the int, byte, short and char types, the JVM needs to either add or truncate bits.

    If the target type is represented by more bits than the source type, the JVM simply fills the additional slots with the value of the highest bit of the given value (the sign bit):

    |     short   |     byte    |
    |             | 00 00 00 01 | => (byte) 1
    | 00 00 00 00 | 00 00 00 01 | => (short) 1
    

    Thanks to the inverted notation, this strategy also works for negative numbers:

    |     short   |     byte    |
    |             | 11 11 11 11 | => (byte) -1
    | 11 11 11 11 | 11 11 11 11 | => (short) -1
    

    This way, the value's sign is retained. Without going into the details of how a JVM implements this, note that this model allows a cast to be performed by a cheap shift operation, which is obviously advantageous.

    An exception to this rule is widening a char, which is, as said before, unsigned. A conversion from a char always fills the additional bits with 0, because there is no sign and thus no need to extend one. A conversion of a char to an int is therefore performed as:

    |            int            |    char     |     byte    |
    |                           | 11 11 11 11 | 11 11 11 11 | => (char) \uFFFF
    | 00 00 00 00 | 00 00 00 00 | 11 11 11 11 | 11 11 11 11 | => (int) 65535
    

    When the original type has more bits than the target type, the additional bits are merely cut off. As long as the original value would have fit into the target value, this works fine, as for example for the following conversion of a short to a byte:

    |     short   |     byte    |
    | 00 00 00 00 | 00 00 00 01 | => (short) 1
    |             | 00 00 00 01 | => (byte) 1
    | 11 11 11 11 | 11 11 11 11 | => (short) -1
    |             | 11 11 11 11 | => (byte) -1
    

    However, if the value is too big or too small, this no longer works:

    |     short   |     byte    |
    | 00 00 00 01 | 00 00 00 01 | => (short) 257
    |             | 00 00 00 01 | => (byte) 1
    | 11 11 11 11 | 00 00 00 00 | => (short) -256
    |             | 00 00 00 00 | => (byte) 0
    

    This is why narrowing casts sometimes lead to strange results. You might wonder why narrowing is implemented this way. You could argue that it would be more intuitive if the JVM checked a number's range and instead cast an out-of-range number to the biggest representable value of the same sign. However, this would require branching, which is a costly operation. This matters in particular because two's complement notation allows for cheap arithmetic operations.
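The widening and narrowing rules above can be checked directly in Java. The following is only a minimal sketch: the class name `ConversionDemo` and its helper methods are made up for illustration and are not part of the original answer.

```java
public class ConversionDemo {
    // Hypothetical helper: renders the low `width` bits of a value,
    // zero-padded on the left, so the bit pattern is easy to compare.
    static String bits(int value, int width) {
        int mask = (int) ((1L << width) - 1);
        String s = Integer.toBinaryString(value & mask);
        return String.format("%" + width + "s", s).replace(' ', '0');
    }

    // Widening a signed type copies the sign bit into the new bits.
    static short widenByteToShort(byte b) {
        return b;
    }

    // Widening a char fills the new bits with 0, because char is unsigned.
    static int widenCharToInt(char c) {
        return c;
    }

    // Narrowing simply cuts off the high bits.
    static byte narrowShortToByte(short s) {
        return (byte) s;
    }

    public static void main(String[] args) {
        System.out.println(bits(widenByteToShort((byte) -1), 16)); // 1111111111111111
        System.out.println(widenCharToInt((char) 0xFFFF));         // 65535
        System.out.println(narrowShortToByte((short) 257));        // 1
        System.out.println(narrowShortToByte((short) 0xFF00));     // 0
    }
}
```

The last two calls show the lossy case: only the low 8 bits survive, whatever the original value was.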

With all this information, we can see what happens with the number -2 in your example:

|           int           |    char     |     byte    |
| 11 11 11 11 11 11 11 11 | 11 11 11 11 | 11 11 11 10 | => (int) -2
|                         |             | 11 11 11 10 | => (byte) -2
|                         | 11 11 11 11 | 11 11 11 10 | => (char) \uFFFE
| 00 00 00 00 00 00 00 00 | 11 11 11 11 | 11 11 11 10 | => (int) 65534

As you can see, the byte cast is redundant for this value, as the cast to char would cut off the same bits.
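To reproduce the table, the chain can be run one cast at a time while printing the bits. This is a rough sketch; `CastTrace`, `bits`, `viaByte` and `direct` are illustrative names, not anything from the question. It also runs the -130 value suggested in the comments on the question, where the byte cast does change the outcome.

```java
public class CastTrace {
    // Hypothetical helper: low `width` bits of a value as a zero-padded binary string.
    static String bits(int value, int width) {
        int mask = (int) ((1L << width) - 1);
        String s = Integer.toBinaryString(value & mask);
        return String.format("%" + width + "s", s).replace(' ', '0');
    }

    // The chain from the question: int -> byte -> char -> int.
    static int viaByte(int x) {
        return (int) (char) (byte) x;
    }

    // The same chain without the byte cast.
    static int direct(int x) {
        return (int) (char) x;
    }

    public static void main(String[] args) {
        int start = -2;
        byte asByte = (byte) start;  // keeps the low 8 bits
        char asChar = (char) asByte; // sign-extends, then keeps the low 16 bits
        int result = asChar;         // zero-extends to 32 bits

        System.out.println(bits(start, 32) + " (int) " + start);
        System.out.println(bits(asByte, 8) + " (byte) " + asByte);
        System.out.println(bits(asChar, 16) + " (char) 0x" + Integer.toHexString(asChar));
        System.out.println(bits(result, 32) + " (int) " + result);

        // Outside the byte range the two chains disagree.
        System.out.println(viaByte(-130) + " vs " + direct(-130)); // 126 vs 65406
    }
}
```

For -2 both chains give 65534, which is why dropping the byte cast appeared to make no difference in the question.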

All this is also specified by the JVMS, if you prefer a more formal definition of all these rules.

One final remark: A type's bit size does not necessarily represent the number of bits that the JVM reserves for representing this type in memory. As a matter of fact, the JVM does not distinguish between the boolean, byte, short, char and int types. All of them are represented by the same JVM type, and the virtual machine merely emulates these casts. On a method's operand stack (i.e. any variable within a method), all values of the named types consume 32 bits. This is however not true for arrays and object fields, which any JVM implementer can handle at will.

Flop answered 8/7, 2014 at 17:0 Comment(10)
You might use a link to two’s complement (also on SO). The biggest advantage is IMO that you can perform subtraction by addition (a - b = a + (-b)). Addition works exactly the same way as on unsigned integers.Tameshatamez
Should you not have written (char) 65534 or (char) 0xFFFE instead of (char) 0x65534 in the last table?Dendritic
@FrankPI: I meant to write unicode notation, thanks for the hint. I also added the link. In general,simply edit my post if you can think of an improvement.Flop
This line may have a mistake: 00 00 00 00 | => (byte) -1Sup
A great summary of how casting works. People forget in these days of cheap memory what the sizes of types really mean.Twaddle
@Michael Shopsin Good that you mention it. I added a final paragraph to clarify this. As a matter of fact, the memory consumption of a type is not strictly determined by the bits that are required to represent it.Flop
Out of curiosity, is there any good reason to have the "smaller" data types actually trim off bits since they're all 32-bit anyway? Or is it just for legacy/standardization purposes?Sing
A user that inserts an explicit down-casting expects this truncation. The JVM is merely virtual, a user expects it to do what it was specified to do. You should not worry about the implementation. I only mentioned the internal layout to avoid that people take this answer as a suggestion to improve performance by choosing smaller types.Flop
The explanation here of the byte-to-char conversion omitted the part about going to int in between. And the term "inverted" isn't specific, despite the Wikipedia link. And char isn't always interpreted as a 16-bit Unicode value. It's sometimes interpreted as half of a Unicode value or as a 16-bit unsigned integer.Unstep
Chars in Java are defined as UTF16 values in the specification. (docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.1) Please be more specific about your claims.Flop
G
35

There are two important things to note here:

  1. a char is unsigned, and cannot be negative
  2. casting a byte to a char first involves a hidden cast to an int as per the Java Language Spec.

Thus casting -2 to an int gives us 11111111111111111111111111111110. Notice how the two's complement value has been sign extended with a one; that only happens for negative values. When we then narrow it to a char, the int is truncated to

1111111111111110

Finally, when 1111111111111110 is cast to an int, it is extended with zeros rather than ones, because the value is now considered to be positive (chars can only be positive). This widening leaves the value unchanged, just as the sign extension in the first step did, but here the value being preserved is the positive one. That binary value, printed in decimal, is 65534.
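The difference between the two kinds of extension can be seen by widening the same 16 bits once as a short and once as a char. This is just a sketch under that framing; `ExtensionDemo` and its two methods are hypothetical names chosen for the example.

```java
public class ExtensionDemo {
    // Interprets the low 16 bits as a signed short, then widens:
    // the top bit is copied into the upper 16 bits (sign extension).
    static int widenAsShort(int bits16) {
        return (short) bits16;
    }

    // Interprets the low 16 bits as an unsigned char, then widens:
    // the upper 16 bits are filled with zeros (zero extension).
    static int widenAsChar(int bits16) {
        return (char) bits16;
    }

    public static void main(String[] args) {
        System.out.println(widenAsShort(0xFFFE)); // -2
        System.out.println(widenAsChar(0xFFFE));  // 65534

        // When the top bit is 0, both extensions fill with zeros and agree.
        System.out.println(widenAsShort(0x0002)); // 2
        System.out.println(widenAsChar(0x0002));  // 2
    }
}
```

This is also why the question only saw the effect for negative inputs: with a leading 0 bit, sign extension and zero extension produce the same result.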

Guth answered 8/7, 2014 at 15:44 Comment(5)
Why does casting an 8-bit byte to a 16-bit char produce a 16-bit two's complement of -2, which ends up as the int 65534? Is this all related to two's complement? I mean, how is the filling with 1s done in the char cast?Mallina
Thank you @Narmer, an excellent point. I have updated the answer with a reference to the Java Language Spec that explains how the casting of byte to char occurs. It goes via an int.Guth
Yep, yours is the most informative and explanatory answer, it should be the answer for this question.Mallina
Sign extension happens for all numbers in this case. It just so happens that when you have a positive number, the sign bit is 0. There's no special rule for negative numbers.Epigrammatist
@indiv, I have tweaked the answer to make the bit extension of zero and one clearer.Guth
P
30

A char has a value between 0 and 65535, so when you cast a negative number to char, the result is the same as adding that number to 65536: 65536 + (-2) = 65534. If you printed it as a char, it would try to display whatever Unicode character is represented by 65534, but when you cast it to int, you actually get 65534. If you started with a number that was 65536 or above, you'd see similarly "confusing" results in which a big number (e.g. 65538) would end up small (2).
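Put differently, for an int input the round trip through char behaves like a remainder modulo 65536 that is never negative. A small sketch of that equivalence, assuming Java 8 or later for `Math.floorMod`; the class and method names are made up for the example.

```java
public class CharWrapDemo {
    // Narrow an int to char and widen it back to int.
    static int throughChar(int x) {
        return (int) (char) x;
    }

    // The same result expressed as arithmetic: a non-negative remainder modulo 65536.
    static int modulo(int x) {
        return Math.floorMod(x, 65536);
    }

    public static void main(String[] args) {
        int[] samples = {-2, 0, 2, 65535, 65536, 65538};
        for (int x : samples) {
            System.out.println(x + " -> " + throughChar(x) + " / " + modulo(x));
        }
    }
}
```

Note that `Math.floorMod` is used instead of the `%` operator because `%` would return -2 for a negative left operand.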

Patrica answered 8/7, 2014 at 15:39 Comment(2)
Isn't the range of a char 0-65535?Sukin
You're right -- changed that. The subtraction is from the total range, which is 65536, but that means that the high end is 65535.Patrica
M
6

I think the simplest way to explain this would just be to break it down into the order of operations you are performing:

Instance | #          int            |     char    | #   byte    |    result   |
Source   | 11 11 11 11 | 11 11 11 11 | 11 11 11 11 | 11 11 11 10 | -2          |
byte     |(11 11 11 11)|(11 11 11 11)|(11 11 11 11)| 11 11 11 10 | -2          |
int      | 11 11 11 11 | 11 11 11 11 | 11 11 11 11 | 11 11 11 10 | -2          |
char     |(00 00 00 00)|(00 00 00 00)| 11 11 11 11 | 11 11 11 10 | 65534       |
int      | 00 00 00 00 | 00 00 00 00 | 11 11 11 11 | 11 11 11 10 | 65534       |

  1. You start with a 32-bit signed value.
  2. You then convert it to an 8-bit signed value.
  3. When you convert that to a 16-bit unsigned value, the compiler sneaks in a quick conversion to a 32-bit signed value,
  4. and then converts it to 16 bits without maintaining the sign.
  5. When the final conversion to 32 bits occurs, there is no sign, so zero bits are added to maintain the value.

So, yes, when you look at it this way, the byte cast is significant (academically speaking), though its effect on the result is insignificant (such is the joy of programming: a significant action can have an insignificant effect). What you see is the effect of narrowing and widening while maintaining the sign, except that the conversion to char narrows and the widening that follows does not extend the sign.

(Please note, I used a # to denote the sign bit, and as noted, there is no sign bit for char, as it is an unsigned value.)

I used parens to represent what is actually happening internally. The data types are actually truncated to their logical blocks, but if viewed as an int, their values would be what the parens symbolize.

Signed values always widen with the value of the sign bit. Unsigned values always widen with the bit off.

So, the trick (or gotcha) here is that the expansion from byte to int maintains the signed value when widened. The value is then narrowed the moment it touches the char, which turns off the sign bit.

If the conversion to int did not occur, the value would have been 254. But, it does, so it isn't.
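The 254 mentioned above is what you get when the byte is widened without sign extension. In Java this is usually written as a mask with 0xFF, or with `Byte.toUnsignedInt` (available since Java 8). The sketch below contrasts the two paths; `UnsignedByteDemo` and its method names are illustrative only.

```java
public class UnsignedByteDemo {
    // The path from the question: byte -> (int, sign-extended) -> char -> int.
    static int signExtended(byte b) {
        return (int) (char) b;
    }

    // Widening without sign extension: keep only the low 8 bits.
    static int zeroExtended(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) -2;
        System.out.println(signExtended(b));        // 65534
        System.out.println(zeroExtended(b));        // 254
        System.out.println(Byte.toUnsignedInt(b));  // 254
    }
}
```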

Microcircuit answered 9/7, 2014 at 20:44 Comment(0)