In Java, is it more efficient to use byte or short instead of int, and float instead of double?

I've noticed I've always used int and double no matter how small or big the number needs to be. So in Java, is it more efficient to use byte or short instead of int, and float instead of double?

So assume I have a program with plenty of ints and doubles. Would it be worth going through and changing my ints to bytes or shorts if I knew the number would fit?

I know Java doesn't have unsigned types, but is there anything extra I could do if I knew the number would be positive only?

By efficient I mostly mean processing. I'd assume the garbage collector would be a lot faster if all the variables were half the size, and that calculations would probably be somewhat faster too. (I guess since I'm working on Android I need to worry somewhat about RAM too.)

(I'd assume the garbage collector only deals with objects and not primitives, but still deletes all the primitives in abandoned objects, right?)

I tried it with a small android app I have but didn't really notice a difference at all. (Though I didn't "scientifically" measure anything.)

Am I wrong in assuming it should be faster and more efficient? I'd hate to go through and change everything in a massive program to find out I wasted my time.

Would it be worth doing from the beginning when I start a new project? (I mean, I think every little bit would help, but then again, if so, why doesn't it seem like anyone does it?)

Hibernicism answered 25/1, 2013 at 20:22 Comment(0)

Am I wrong in assuming it should be faster and more efficient? I'd hate to go through and change everything in a massive program to find out I wasted my time.

Short answer

Yes, you are wrong. In most cases, it makes little difference in terms of space used.

It is not worth trying to optimize this ... unless you have clear evidence that optimization is needed. And if you do need to optimize memory usage of object fields in particular, you will probably need to take other (more effective) measures.

Longer answer

The Java Virtual Machine models stacks and object fields using offsets that are (in effect) multiples of a 32-bit primitive cell size. So when you declare a local variable or object field as (say) a byte, the variable / field will be stored in a 32-bit cell, just like an int.

There are two exceptions to this:

  • long and double values require 2 primitive 32-bit cells
  • arrays of primitive types are represented in packed form, so that (for example) an array of bytes holds 4 bytes per 32-bit word.

So it might be worth optimizing use of long and double ... and large arrays of primitives. But in general no.
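A quick way to see the array packing is to compare the approximate heap cost per element of large primitive arrays. This is only a sketch: Runtime-based memory measurement is coarse and depends on the JVM and GC in use, and the class and helper names are made up for illustration.

```java
import java.util.function.IntFunction;

public class PackedArrays {

    private static long used() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Approximate heap bytes per element of a freshly allocated array. */
    static long costPerElement(int n, IntFunction<Object> alloc) {
        System.gc();                       // try to clear out previous garbage
        long before = used();
        Object array = alloc.apply(n);     // hold a reference so it stays live
        long cost = (used() - before) / n;
        if (array == null) throw new AssertionError(); // keep 'array' in use
        return cost;
    }

    public static void main(String[] args) {
        int n = 10_000_000;
        // byte[] is packed (~1 byte per element); int[] uses ~4 bytes per element
        System.out.println("byte[] ~" + costPerElement(n, byte[]::new) + " byte(s)/element");
        System.out.println("int[]  ~" + costPerElement(n, int[]::new) + " byte(s)/element");
    }
}
```

On a typical HotSpot JVM the byte[] figure comes out near 1 and the int[] figure near 4, whereas two objects each holding a single byte field or a single int field would usually report the same size.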

In theory, a JIT might be able to optimize this, but in practice I've never heard of a JIT that does. One impediment is that the JIT typically cannot run until after instances of the class being compiled have been created. If the JIT optimized the memory layout, you could have two (or more) "flavors" of object of the same class ... and that would present huge difficulties.


Revisitation

Looking at the benchmark results in @meriton's answer, it appears that using short and byte instead of int incurs a performance penalty for multiplication. Indeed, if you consider the operations in isolation, the penalty is significant. (You shouldn't consider them in isolation ... but that's another topic.)

I think the explanation is that the JIT is probably doing the multiplications using 32-bit multiply instructions in each case. But in the byte and short cases, it executes extra instructions to convert the intermediate 32-bit value to a byte or short in each loop iteration. (In theory, that conversion could be done once at the end of the loop ... but I doubt that the optimizer would be able to figure that out.)

Anyway, this does point to another problem with switching to short and byte as an optimization: it could make performance worse ... in an algorithm that is arithmetic- and compute-intensive.
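The extra conversion is visible in plain source code, too. Java defines integer arithmetic on int, so a short accumulator has to be explicitly narrowed on every store, which is exactly the per-iteration work described above. The class and method names here are invented for the sketch:

```java
public class NarrowingCost {

    // int accumulator: one plain add per iteration
    static int sumInt(int n) {
        int acc = 0;
        for (int i = 0; i < n; i++) {
            acc += i;
        }
        return acc;
    }

    // short accumulator: the addition still happens in 32 bits, and the
    // (short) cast compiles to an extra narrowing step on every iteration
    static short sumShort(int n) {
        short acc = 0;
        for (int i = 0; i < n; i++) {
            acc = (short) (acc + i);
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(sumInt(100));   // 4950
        System.out.println(sumShort(100)); // 4950, still fits in a short
    }
}
```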


Secondary questions

I know Java doesn't have unsigned types, but is there anything extra I could do if I knew the number would be positive only?

No. Not in terms of performance, anyway. (There are some methods in Integer, Long, etc. for dealing with int, long, etc. as unsigned, but these don't give any performance advantage. That is not their purpose.)
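For the record, the unsigned helpers (added to Integer and Long in Java 8) just reinterpret the same 32 or 64 bits; they don't give you a cheaper type:

```java
public class UnsignedDemo {
    public static void main(String[] args) {
        int x = -1; // bit pattern 0xFFFFFFFF

        // Reinterpret the same 32 bits as an unsigned value
        System.out.println(Integer.toUnsignedLong(x));   // 4294967295
        System.out.println(Integer.toUnsignedString(x)); // "4294967295"

        // Unsigned comparison and division on ordinary ints
        System.out.println(Integer.compareUnsigned(x, 1) > 0); // true: 0xFFFFFFFF > 1 unsigned
        System.out.println(Integer.divideUnsigned(x, 2));      // 2147483647
    }
}
```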

(I'd assume the garbage collector only deals with objects and not primitives, but still deletes all the primitives in abandoned objects, right?)

Correct. A field of an object is part of the object. It goes away when the object is garbage collected. Likewise the cells of an array go away when the array is collected. When the field or cell type is a primitive type, then the value is stored in the field / cell ... which is part of the object / array ... and that has been deleted.

Styracaceous answered 26/1, 2013 at 0:8 Comment(8)
+1 don't optimize unless you have clear evidence of a performance problemPaynter
Erm, why does the JVM have to wait for JIT compilation to pack the memory layout of a class? Since the types of fields are written to the class file, couldn't the JVM pick a memory layout at class load time, then resolve field names as byte rather than word offsets?File
@File - I'm pretty sure that object layouts are determined at class load time, and they don't change after that. See the "fine-print" part of my answer. If actual memory layouts changed when the code was JITed, it would be really difficult for the JVM to deal with. (When I said the JIT might optimize layout, that is hypothetical and impractical ... which could explain why I've never heard of a JIT actually doing it.)Styracaceous
I know. I was just trying to point out that even though the memory layouts are hard to change once objects are created, a JVM might still optimize memory layout before that, i.e. at class load time. Put differently, that the JVM spec describes the behaviour of a JVM with word offsets doesn't necessarily imply that a JVM is required to be implemented that way - though most probably are.File
@File - The JVM spec is talking about "virtual machine word offsets" within local frames / objects. How these are mapped to physical machine offsets is NOT specified. Indeed, it cannot specify it ... since there may be hardware-specific field alignment requirements.Styracaceous
I noticed shorts[i] = (short)(bytes[i] & 0xFF) is about 10% faster than shorts[i] = bytes[i]. I changed to int[] because of your advice, but still ints[i] = bytes[i] & 0xFF is about 12% faster than ints[i] = bytes[i]. Any idea why? Does it have to do with sign extension, which ought to be a single instruction (MOVSX r32, r/m8) on x86?Marquand
My guess (and it is just a guess) is that it is to do with sign extension. You can maybe confirm this by getting the JIT compiler to dump the native code. (Another possible explanation is a benchmarking flaw.)Styracaceous
The packing of instance fields (and even array elements) is implementation specific and the runtime has enough type information to make that decision. In contrast, local variables and arithmetic instructions do not carry the information whether they are using byte, char, short, or int (or even boolean when you use a different source language than Java) on the bytecode level. The fact that integer multiplication is done in int is even in the source language and requires a typecast to store the result into a short, even if both inputs were byte. This is what the JIT compiler seesClasp

That depends on the implementation of the JVM, as well as the underlying hardware. Most modern hardware will not fetch single bytes from memory (or even from the first-level cache), i.e. using the smaller primitive types generally does not reduce memory bandwidth consumption. Likewise, modern CPUs have a word size of 64 bits. They can perform operations on fewer bits, but that works by discarding the extra bits, which isn't faster either.

The only benefit is that smaller primitive types can result in a more compact memory layout, most notably when using arrays. This saves memory, which can improve locality of reference (thus reducing the number of cache misses) and reduce garbage collection overhead.

Generally speaking however, using the smaller primitive types is not faster.

To demonstrate that, behold the following benchmark:

public class Benchmark {

    public static void benchmark(String label, Code code) {
        print(25, label);
        
        try {
            for (int iterations = 1; ; iterations *= 2) { // detect reasonable iteration count and warm up the code under test
                System.gc(); // clean up previous runs, so we don't benchmark their cleanup
                long previouslyUsedMemory = usedMemory();
                long start = System.nanoTime();
                code.execute(iterations);
                long duration = System.nanoTime() - start;
                long memoryUsed = usedMemory() - previouslyUsedMemory;
                
                if (iterations > 1E8 || duration > 1E9) { 
                    print(25, new BigDecimal(duration * 1000 / iterations).movePointLeft(3) + " ns / iteration");
                    print(30, new BigDecimal(memoryUsed * 1000 / iterations).movePointLeft(3) + " bytes / iteration\n");
                    return;
                }
            }
        } catch (Throwable e) {
            throw new RuntimeException(e);
        }
    }
    
    private static void print(int desiredLength, String message) {
        System.out.print(" ".repeat(Math.max(1, desiredLength - message.length())) + message);
    }
    
    private static long usedMemory() {
        return Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    }

    @FunctionalInterface
    interface Code {
        /**
         * Executes the code under test.
         * 
         * @param iterations
         *            number of iterations to perform
         * @return any value that requires the entire code to be executed (to
         *         prevent dead code elimination by the just in time compiler)
         * @throws Throwable
         *             if the test could not complete successfully
         */
        Object execute(int iterations);
    }

    public static void main(String[] args) {
        benchmark("long[] traversal", (iterations) -> {
            long[] array = new long[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = i;
            }
            return array;
        });
        benchmark("int[] traversal", (iterations) -> {
            int[] array = new int[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = i;
            }
            return array;
        });
        benchmark("short[] traversal", (iterations) -> {
            short[] array = new short[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = (short) i;
            }
            return array;
        });
        benchmark("byte[] traversal", (iterations) -> {
            byte[] array = new byte[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = (byte) i;
            }
            return array;
        });
        
        benchmark("long fields", (iterations) -> {
            class C {
                long a = 1;
                long b = 2;
            }
            
            C[] array = new C[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = new C();
            }
            return array;
        });
        benchmark("int fields", (iterations) -> {
            class C {
                int a = 1;
                int b = 2;
            }
            
            C[] array = new C[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = new C();
            }
            return array;
        });
        benchmark("short fields", (iterations) -> {
            class C {
                short a = 1;
                short b = 2;
            }
            
            C[] array = new C[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = new C();
            }
            return array;
        });
        benchmark("byte fields", (iterations) -> {
            class C {
                byte a = 1;
                byte b = 2;
            }
            
            C[] array = new C[iterations];
            for (int i = 0; i < iterations; i++) {
                array[i] = new C();
            }
            return array;
        });

        benchmark("long multiplication", (iterations) -> {
            long result = 1;
            for (int i = 0; i < iterations; i++) {
                result *= 3;
            }
            return result;
        });
        benchmark("int multiplication", (iterations) -> {
            int result = 1;
            for (int i = 0; i < iterations; i++) {
                result *= 3;
            }
            return result;
        });
        benchmark("short multiplication", (iterations) -> {
            short result = 1;
            for (int i = 0; i < iterations; i++) {
                result *= 3;
            }
            return result;
        });
        benchmark("byte multiplication", (iterations) -> {
            byte result = 1;
            for (int i = 0; i < iterations; i++) {
                result *= 3;
            }
            return result;
        });
    }
}

Run with OpenJDK 14 on my Intel Core i7 CPU @ 3.5 GHz, this prints:

     long[] traversal     3.206 ns / iteration      8.007 bytes / iteration
      int[] traversal     1.557 ns / iteration      4.007 bytes / iteration
    short[] traversal     0.881 ns / iteration      2.007 bytes / iteration
     byte[] traversal     0.584 ns / iteration      1.007 bytes / iteration
          long fields    25.485 ns / iteration     36.359 bytes / iteration
           int fields    23.126 ns / iteration     28.304 bytes / iteration
         short fields    21.717 ns / iteration     20.296 bytes / iteration
          byte fields    21.767 ns / iteration     20.273 bytes / iteration
  long multiplication     0.538 ns / iteration      0.000 bytes / iteration
   int multiplication     0.526 ns / iteration      0.000 bytes / iteration
 short multiplication     0.786 ns / iteration      0.000 bytes / iteration
  byte multiplication     0.784 ns / iteration      0.000 bytes / iteration

As you can see, the only significant speed savings occur when traversing large arrays; using smaller object fields yields negligible benefit, and computations are actually slightly slower on the small datatypes.

Overall, the performance differences are quite minor. Optimizing algorithms is far more important than the choice of primitive type.

File answered 26/1, 2013 at 0:50 Comment(2)
Rather than saying "most notably when using arrays", I think it might be simpler to say that short and byte are more efficient when stored in arrays that are large enough to matter (the bigger the array, the bigger the efficiency difference; a byte[2] might be more or less efficient than an int[2], but not by enough to matter either way), but that individual values are more efficiently stored as int.Fortunia
What I checked: those benchmarks always used an int ('3') as the factor or assignment operand (the loop variable, then casted). What I did was to use typed factors / assignment operands depending on the lvalue type: int mult 76.481 ns, int mult (typed) 72.581 ns, short mult 87.908 ns, short mult (typed) 90.772 ns, byte mult 87.859 ns, byte mult (typed) 89.524 ns, int[] trav 88.905 ns, int[] trav (typed) 89.126 ns, short[] trav 10.563 ns, short[] trav (typed) 10.039 ns, byte[] trav 8.356 ns, byte[] trav (typed) 8.338 ns. I suppose there is a lot of unnecessary casting. Those tests were run on an Android tablet.Bales

Using byte instead of int can increase performance if you are using them in huge quantities. Here is an experiment:

import java.lang.management.*;

public class SpeedTest {

    /** Get CPU time in nanoseconds. */
    public static long getCpuTime() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        return bean.isCurrentThreadCpuTimeSupported()
                ? bean.getCurrentThreadCpuTime() : 0L;
    }

    public static void main(String[] args) {
        long durationTotal = 0;
        int numberOfTests = 0;

        for (int j = 1; j < 51; j++) {
            long beforeTask = getCpuTime();
            // MEASURES THIS AREA------------------------------------------
            long x = 20000000; // 20 million
            for (long i = 0; i < x; i++) {
                TestClass s = new TestClass(); // allocation under test
            }
            // MEASURES THIS AREA------------------------------------------
            long duration = getCpuTime() - beforeTask;
            System.out.println("TEST " + j + ": duration = " + duration + " ns = "
                    + (duration / 1000000) + " ms");
            durationTotal += duration;
            numberOfTests++;
        }
        double average = (double) durationTotal / numberOfTests;
        System.out.println("-----------------------------------");
        System.out.println("Average Duration = " + average + " ns = "
                + (long) (average / 1000000) + " ms (Approximately)");
    }
}

This class tests the speed of creating a new TestClass. Each test does it 20 million times, and there are 50 tests.

Here is the TestClass:

public class TestClass {
    int a1 = 5;
    int a2 = 5;
    int a3 = 5;
    int a4 = 5;
    int a5 = 5;
    int a6 = 5;
    int a7 = 5;
    int a8 = 5;
    int a9 = 5;
    int a10 = 5;
    int a11 = 5;
    int a12 = 5;
    int a13 = 5;
    int a14 = 5;
}

I've run the SpeedTest class and in the end got this:

 Average Duration = 8.9625E8 ns = 896 ms (Approximately)

Now I'm changing the ints into bytes in the TestClass and running it again. Here is the result:

 Average Duration = 6.94375E8 ns = 694 ms (Approximately)

I believe this experiment shows that if you are instantiating a huge number of variables, using byte instead of int can increase efficiency.

Overpay answered 12/5, 2014 at 13:41 Comment(2)
Note that this benchmark is only measuring costs associated with allocation and construction, and only the case of a class with lots of individual fields. If arithmetic / update operations were performed on the fields, @meriton's results suggest that byte could be >>slower<< than int.Styracaceous
True, I should have worded it better to clarify it.Overpay

byte is generally considered to be 8 bits. short is generally considered to be 16 bits.

In a "pure" environment (which Java isn't, since the implementation of bytes, shorts, longs, and other fun things is generally hidden from you), byte makes better use of space.

However, your computer is probably not 8-bit, and it is probably not 16-bit. This means that to operate on 8 or 16 bits in particular, it needs to resort to "trickery" that wastes time in order to pretend it can access those types natively.

At this point, it depends on how the hardware is implemented. However, from what I've been taught, the best speed is achieved by storing things in chunks that are comfortable for your CPU to use. A 64-bit processor likes dealing with 64-bit elements, and anything less than that often requires "engineering magic" to pretend that it likes dealing with them.

Blimp answered 26/1, 2013 at 0:17 Comment(5)
I'm not sure what you mean by "engineering magic"... most/all modern processors have fast instructions to load a byte and sign-extend it, to store one from a full-width register, and to do byte-width or short-width arithmetic in a portion of a full-width register. If you were right, it would make sense, where feasible, to replace all ints with longs on a 64-bit processor.Breechcloth
I can imagine that being true. I just remember that in the Motorola 68k simulator we used, most operations could work with 16 bit values while not with 32 bit nor 64 bit. I was thinking that this meant that systems had a preferred value size that it can fetch optimally. Although I can imagine that modern 64bit processors can fetch 8bit, 16 bit, 32bit, and 64bit with equal ease, in this case it is a nonissue. Thanks for pointing that out.Blimp
"... is generally considered to be..." - Actually, it is clearly, unambiguously >>specified<< to be those sizes. In Java. And the context of this question is Java.Styracaceous
A large number of processors even use the same number of cycles to manipulate and access data that isn't word sized, so it's not really worth worrying about unless you measure on a particular JVM and platform.Chronicle
I was trying to speak in all generality. That said, I'm not actually sure about Java's standard with regard to byte size, but at this point I'm pretty convinced that if any heretic decides on non-8-bit bytes, Java won't want to touch them with a ten-foot pole. However, some processors require multibyte alignment, and if the Java platform supports them, it will need to do things more slowly to accommodate these smaller types, or magically represent them with larger representations than you requested. That's why I always prefer int over other types, since it always uses the system's favorite size.Blimp

One of the reasons short/byte/char are less performant is the lack of direct support for these data types. By direct support, I mean that the JVM specification does not define any instructions for these data types. Instructions like store, load, add, etc. have versions for the int data type, but they do not have versions for short/byte/char. E.g. consider the Java code below:

void spin() {
    int i;
    for (i = 0; i < 100; i++) {
        ; // Loop body is empty
    }
}

This gets compiled into bytecode as follows:

0 iconst_0 // Push int constant 0
1 istore_1 // Store into local variable 1 (i=0)
2 goto 8 // First time through don't increment
5 iinc 1 1 // Increment local variable 1 by 1 (i++)
8 iload_1 // Push local variable 1 (i)
9 bipush 100 // Push int constant 100
11 if_icmplt 5 // Compare and loop if less than (i < 100)
14 return // Return void when done

Now, consider changing int to short as below.

void sspin() {
    short i;
    for (i = 0; i < 100; i++) {
        ; // Loop body is empty
    }
}

The corresponding bytecode changes as follows:

0 iconst_0
1 istore_1
2 goto 10
5 iload_1 // The short is treated as though an int
6 iconst_1
7 iadd
8 i2s // Truncate int to short
9 istore_1
10 iload_1
11 bipush 100
13 if_icmplt 5
16 return

As you can observe, to manipulate the short data type the JVM still uses the int versions of the instructions, explicitly converting int to short where required. This is why performance is reduced.
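The same rule shows up at the source level: since the addition is performed in int, javac refuses to store the result into a short without an explicit cast, and that cast is what becomes the i2s instruction in the listing above. A minimal illustration:

```java
public class ShortArithmetic {
    public static void main(String[] args) {
        short a = 100;
        short b = 23;

        // short c = a + b;        // does not compile: a + b has type int
        short c = (short) (a + b); // the cast compiles to the i2s instruction

        System.out.println(c);     // 123
    }
}
```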

The reason cited for not providing direct support is as follows:

The Java Virtual Machine provides the most direct support for data of type int. This is partly in anticipation of efficient implementations of the Java Virtual Machine's operand stacks and local variable arrays. It is also motivated by the frequency of int data in typical programs. Other integral types have less direct support. There are no byte, char, or short versions of the store, load, or add instructions, for instance.

Quoted from JVM specification present here (Page 58).

Venepuncture answered 19/9, 2018 at 11:51 Comment(6)
These are disassembled bytecodes; i.e. JVM virtual instructions. They are not optimized by the javac compiler, and you cannot draw any reliable inferences from them on how the program will perform in real life. The JIT compiler compiles these bytecodes to actual native machine instructions, and does some pretty serious optimization in the process. If you want to analyze the performance of the code, you need to examine the native code instructions. (And it is complicated because you need to take into account the timing behavior of a multi-stage x86_64 pipeline.)Styracaceous
I believe the Java specifications are for the javac implementors to implement, so I do not think there are any more optimisations done at that level. Anyway, I could be completely wrong. Please share a reference link to support your statement.Venepuncture
Well here is one fact to support my statement. You won't find any (credible) timing figures that tell you how many clock cycles each JVM bytecode instruction takes. Certainly not published by Oracle or other JVM suppliers. Also, read stackoverflow.com/questions/1397009Styracaceous
I did find an old (2008) paper where someone tried to develop a platform independent model for predicting performance of bytecode sequences. They claim that their predictions were off by 25% compared to RDTSC measurements .... on a Pentium. And they were running the JVM with JIT compilation disabled! Reference: sciencedirect.com/science/article/pii/S1571066108004581Styracaceous
I am just confused here. Isn't my answer supporting the facts you stated under revisitation section?Venepuncture
No it isn't. Your answer is making assertions based on the bytecodes. As my comments say, the bytecodes don't allow you to infer performance so your assertions are not based on a logically sound foundation. Now, if you dumped the native code and analyzed them and saw extra native instructions to do short <-> long conversion, that would be supporting evidence. But not this. For all we know, that i2s bytecode instruction could be optimized away by the JIT compilerStyracaceous

I would say that the accepted answer is somewhat wrong in saying "it makes little difference in terms of space used". Here is an example showing that in some cases the difference is substantial:

Baseline usage 4.90MB, java: 11.0.12
Mem usage - bytes : +202.60 MB
Mem usage - shorts: +283.02 MB
Mem usage - ints  : +363.02 MB
Mem usage - bytes : +203.02 MB
Mem usage - shorts: +283.02 MB
Mem usage - ints  : +363.02 MB
Mem usage - bytes : +203.02 MB
Mem usage - shorts: +283.02 MB
Mem usage - ints  : +363.02 MB

The code to verify:

static class Bytes {
    public byte f1;
    public byte f2;
    public byte f3;
    public byte f4;
}

static class Shorts {
    public short f1;
    public short f2;
    public short f3;
    public short f4;
}

static class Ints {
    public int f1;
    public int f2;
    public int f3;
    public int f4;
}

@Test
public void memUsageTest() throws Exception {
    int countOfItems = 10 * 1024 * 1024;
    float MB = 1024*1024;
    Runtime rt = Runtime.getRuntime();

    System.gc();
    Thread.sleep(1000);
    long baseLineUsage = rt.totalMemory() - rt.freeMemory();

    trace("Baseline usage %.2fMB, java: %s", (baseLineUsage / MB), System.getProperty("java.version"));

    for( int j = 0; j < 3; j++ ) {
        Bytes[] bytes = new Bytes[countOfItems];
        for( int i = 0; i < bytes.length; i++ ) {
            bytes[i] = new Bytes();
        }
        System.gc();
        Thread.sleep(1000);
        trace("Mem usage - bytes : +%.2f MB", (rt.totalMemory() - rt.freeMemory() - baseLineUsage) / MB);
        bytes = null;

        Shorts[] shorts = new Shorts[countOfItems];
        for( int i = 0; i < shorts.length; i++ ) {
            shorts[i] = new Shorts();
        }
        System.gc();
        Thread.sleep(1000);
        trace("Mem usage - shorts: +%.2f MB", (rt.totalMemory() - rt.freeMemory() - baseLineUsage) / MB);
        shorts = null;

        Ints[] ints = new Ints[countOfItems];
        for( int i = 0; i < ints.length; i++ ) {
            ints[i] = new Ints();
        }
        System.gc();
        Thread.sleep(1000);
        trace("Mem usage - ints  : +%.2f MB", (rt.totalMemory() - rt.freeMemory() - baseLineUsage) / MB);
        ints = null;
    }
}

private static void trace(String message, Object... args) {
    String line = String.format(java.util.Locale.US, message, args);
    System.out.println(line);
}
Godinez answered 15/8, 2022 at 13:28 Comment(0)

The difference is hardly noticeable! It's more a question of design, appropriateness, uniformity, habit, etc. Sometimes it's just a matter of taste. When all you care about is that your program gets up and running, and substituting a float for an int would not harm correctness, I see no advantage in choosing one or the other unless you can demonstrate that the choice of type alters performance. Tuning performance based on types that differ by 2 or 3 bytes is really the last thing you should care about; Donald Knuth once said: "Premature optimization is the root of all evil" (not sure it was him; edit if you have the answer).

Klystron answered 25/1, 2013 at 21:0 Comment(3)
Nit: A float cannot represent all integers an int can; nor can an int represent any non-integer value that float can. That is, while all int values are a subset of long values, an int is not a subset of a float and a float is not a subset of an int.Resigned
I expect the answerer meant to write substituting a float for a double, if so answerer should edit the answer. If not answerer should hang head in shame and go back to basics for reasons outlined by @pst and for many other reasons.Rope
@HighPerformanceMark No I put int and float because that's what I was thinking. My answer is not specific to Java although I was thinking C... It's meant to be general. Mean comment you got there.Klystron

© 2022 - 2024 — McMap. All rights reserved.