On 32-bit CPUs, is an 'integer' type more efficient than a 'short' type? (tagged: cpu-architecture)

Accepted answer
Score: 19

If you have a large array of numbers, then go with the smallest size that works. It will be more efficient to work with an array of 16-bit shorts than 32-bit ints, since you get twice the cache density. The cost of any sign extension the CPU has to do to work with 16-bit values in 32-bit registers is negligible compared to the cost of a cache miss.

If you are simply using member variables in classes mixed with other data types, then it is less clear cut, as the padding requirements will likely remove any space-saving benefit of the 16-bit values.

Score: 15

Yes, you should definitely use a 32-bit integer on a 32-bit CPU; otherwise it may end up masking off the unused bits (i.e., it will always do the maths in 32 bits, then convert the answer to 16 bits).

It won't do two 16-bit operations at once for you, but if you write the code yourself and you're sure it won't overflow, you can do it yourself.
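A rough sketch of doing that by hand (the SWAR trick: SIMD within a register; function names are hypothetical). It is only correct when the low 16-bit lane cannot carry into the high one:

```c
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word. */
static uint32_t pack2(uint16_t hi, uint16_t lo) {
    return ((uint32_t)hi << 16) | lo;
}

/* One 32-bit add performs both 16-bit adds at once -- but only
 * correctly if the low lane does not overflow into the high lane. */
static uint32_t add2(uint32_t a, uint32_t b) {
    return a + b;
}
```

For example, `add2(pack2(1, 2), pack2(3, 4))` equals `pack2(4, 6)`. If the low lanes could sum past 65535, you would need to mask out the cross-lane carry yourself.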

Edit: I should add that it also depends somewhat on your definition of "efficient". While it will be able to do 32-bit operations more quickly, you will of course use twice as much memory.

If these are being used for intermediate calculations in an inner loop somewhere, then use 32-bit. If, however, you're reading this from disk, or even if you just have to pay for a cache miss, it may still work out better to use 16-bit integers. As with all optimizations, there's only one way to know: profile it.

Score: 8

If you're using "many" integer values, the bottleneck in your processing is liable to be bandwidth to memory. 16-bit integers pack more tightly into the data cache, and would therefore be a performance win.

If you are number crunching on a very large amount of data, you should read What Every Programmer Should Know About Memory by Ulrich Drepper. Concentrate on chapter 6, about maximizing the efficiency of the data cache.

Score: 5

A 32-bit CPU is a CPU that usually operates on 32-bit values internally, yet that does not mean it is any slower when performing the same operation on an 8- or 16-bit value. x86, for example, which is still backward compatible up to the 8086, can operate on fractions of a register. That means even if a register is 32 bits wide, it can operate on only the first 16 or the first 8 bits of that register, and there will be no slowdown at all. This concept has even been adopted by x86_64, where the registers are 64 bits wide, yet they can still operate on only the first 32, 16, or 8 bits.

Also, x86 CPUs always load a whole cache line from memory if it is not already in cache, and a cache line is bigger than 4 bytes anyway (for 32-bit CPUs, more like 8 or 16 bytes), so loading 2 bytes from memory is just as fast as loading 4 bytes. When processing many values from memory, 16-bit values may actually be much faster than 32-bit values, since there are fewer memory transfers. If a cache line is 8 bytes, there are four 16-bit values per cache line, yet only two 32-bit values; thus, when using 16-bit ints you have one memory access every four values, while using 32-bit ints you have one every two values, resulting in twice as many transfers for processing a large int array.
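The transfer-count arithmetic can be sketched as a small helper (a 64-byte line, typical for modern CPUs, is assumed here; the 2x ratio holds for any line size):

```c
#include <stdint.h>
#include <stddef.h>

/* Cache lines (hence memory transfers) needed to stream an array of
 * n elements of the given element size once, assuming a 64-byte line. */
size_t lines_needed(size_t n, size_t elem_size) {
    const size_t line = 64;
    return (n * elem_size + line - 1) / line;   /* round up */
}
```

For a million elements, `lines_needed(1000000, sizeof(int32_t))` is exactly twice `lines_needed(1000000, sizeof(int16_t))`: the int array costs twice the memory traffic.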

Other CPUs, like PPC for example, cannot process only a fraction of a register; they always process the full register. Yet these CPUs usually have special load operations that allow them to, e.g., load a 16-bit value from memory, expand it to 32 bits, and write it to a register. Later on, they have a special store operation that takes the value from the register and stores only the last 16 bits back to memory. Both operations need only one CPU cycle, just like a 32-bit load/store would, so there is no speed difference either. And since PPC can only perform arithmetic operations on registers (unlike x86, which can also operate on memory directly), this load/store procedure takes place anyway, whether you use 32-bit ints or 16-bit ints.

The only disadvantage, if you chain multiple operations on a 32-bit CPU that can only operate on full registers, is that the 32-bit result of the last operation may have to be "cut back" to 16 bits before the next operation is performed; otherwise the result may not be correct. Such a cut back costs only a single CPU cycle, though (a simple AND operation), and compilers are very good at figuring out when such a cut back is really necessary and when leaving it out won't have any influence on the final result. So such a cut back is not performed after every instruction; it is performed only when really unavoidable. Some CPUs offer various "enhanced" instructions which make such a cut back unnecessary, and I've seen plenty of code in my life where I had expected such a cut back, yet looking at the generated assembly code, the compiler found a way to avoid it entirely.
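In C, that cut back is exactly what the conversion back to a 16-bit type expresses; a minimal sketch (hypothetical function, arbitrary arithmetic):

```c
#include <stdint.h>

/* Chained arithmetic happens at full register width; the intermediate
 * must be cut back to 16 bits before it is used as a 16-bit value
 * again. The final cast is that cut back: effectively `wide & 0xFFFF`,
 * one AND instruction at most -- and often optimized away entirely. */
uint16_t chain(uint16_t a, uint16_t b) {
    uint32_t wide = (uint32_t)a * b + 7;   /* full-width intermediate */
    return (uint16_t)wide;                 /* the "cut back" */
}
```

For inputs where the product exceeds 16 bits (say `chain(300, 300)`), the result wraps modulo 65536, which is the behavior the mask preserves.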

So if you expect a general rule here, I'll have to disappoint you. Neither can one say for sure that 16-bit operations are as fast as 32-bit operations, nor can anyone say for sure that 32-bit operations will always be faster. It also depends on what exactly your code is doing with those numbers and how it is doing it. I've seen benchmarks where 32-bit operations were faster on certain 32-bit CPUs than the same code with 16-bit operations; however, I have also seen the opposite be true. Even switching from one compiler to another, or upgrading your compiler version, may turn everything around again. I can only say the following: whoever claims that working with shorts is significantly slower than working with ints should please provide sample source code for that claim and name the CPU and compiler used for testing, since I have never experienced anything like that within about the past 10 years. There may be some situations where working with ints is maybe 1-5% faster, yet anything below 10% is not "significant", and the question is: is it worth wasting twice the memory in some cases only because it may buy you 2% performance? I don't think so.

Score: 3

It depends. If you are CPU bound, 32-bit operations on a 32-bit CPU will be faster than 16-bit. If you are memory bound (specifically, if you have too many L2 cache misses), then use the smallest data type you can squeeze your values into.

You can find out which one you are by using a profiler that measures both CPU cycles and L2 misses, like Intel's VTune. You run your app twice with the same load, and it merges the two runs into one view of the hotspots in your app; you can then see, for each line of code, how many cycles were spent on that line. If at an expensive line of code you see 0 cache misses, you are CPU bound. If you see tons of misses, you are memory bound.

Score: 3

Don't listen to the advice, try it.

This is probably going to depend heavily on the hardware and compiler that you're using. A quick test should make short work of this question. It probably takes less time to write the test than it took to write the question here.

Score: 1

If you are operating on a large dataset, the biggest concern is memory footprint. A good model in this case is to assume that the CPU is infinitely fast and spend your time worrying about how much data has to be moved to/from memory. In fact, CPUs are now so fast that it is sometimes more efficient to encode (e.g., compress) the data. That way, the CPU does (potentially much) more work (decoding/coding), but the memory bandwidth is substantially reduced.

Thus, if your dataset is large, you are probably better off using 16-bit integers. If your list is sorted, you might design a coding scheme that involves differential or run-length encoding, which will reduce memory bandwidth even more.
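A minimal sketch of the differential-encoding idea for a sorted 32-bit list (function names hypothetical): store the first value at full width, then 16-bit gaps, assuming consecutive values never differ by more than 65535. A real scheme would need an escape code for larger gaps.

```c
#include <stdint.h>
#include <stddef.h>

/* Encode a sorted 32-bit array as a base value plus 16-bit deltas,
 * roughly halving the memory traffic for the bulk of the data. */
void delta_encode(const uint32_t *in, size_t n,
                  uint32_t *base, uint16_t *deltas) {
    *base = in[0];
    for (size_t i = 1; i < n; i++)
        deltas[i - 1] = (uint16_t)(in[i] - in[i - 1]);
}

/* Reconstruct the original values by a running sum of the deltas. */
void delta_decode(uint32_t base, const uint16_t *deltas, size_t n,
                  uint32_t *out) {
    out[0] = base;
    for (size_t i = 1; i < n; i++)
        out[i] = out[i - 1] + deltas[i - 1];
}
```

Decoding is a dependent running sum, so it trades a little CPU work for the bandwidth saving, which is exactly the trade the answer describes.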

Score: 1

When you say 32-bit, I'll assume you mean x86. 16-bit arithmetic is quite slow: the operand-size prefix makes decoding really slow. So don't make your temp variables short int or int16_t.

However, x86 can efficiently load 16- and 8-bit integers into 32- or 64-bit registers (movzx / movsx: zero and sign extension). So feel free to use short int for arrays and struct fields, but make sure you use int or long for your temp variables.
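In code, that advice looks like this (a sketch; on x86 the compiler emits a movsx for each load, while the accumulator stays at native width):

```c
#include <stdint.h>
#include <stddef.h>

/* 16-bit storage, native-width temporary: the array stays compact
 * in cache, while the loop arithmetic runs at full register width. */
int32_t sum16(const int16_t *a, size_t n) {
    int32_t total = 0;               /* int temp, not short */
    for (size_t i = 0; i < n; i++)
        total += a[i];               /* sign-extending load, 32-bit add */
    return total;
}
```

The widening also means partial sums cannot overflow 16 bits mid-loop, which a short accumulator would.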

However, if I am adding together two short integers, would the CPU package both values in a single pass in parallel (thus spanning the 4-byte bandwidth of the bus)?

That is nonsense. Load/store instructions interact with L1 cache, and the limiting factor is the number of ops; width is irrelevant. E.g., on Core 2: 1 load and 1 store per cycle, regardless of width. L1 cache has a 128- or 256-bit path to L2 cache.

If loads are your bottleneck, one wide load which you split up with shifts or masks after loading can help. Or use SIMD to process data in parallel without unpacking after loading.
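A sketch of the SIMD approach using GCC/Clang vector extensions (a portable stand-in for hand-written SSE or NEON intrinsics; the typedef name is made up):

```c
#include <stdint.h>

/* Eight 16-bit lanes in one 128-bit vector. One vector add processes
 * all eight shorts at once, with no unpacking to 32 bits -- on x86
 * this compiles to a single paddw-style instruction. */
typedef int16_t v8i16 __attribute__((vector_size(16)));

v8i16 add8(v8i16 a, v8i16 b) {
    return a + b;
}
```

Here the narrow type is an advantage: a 128-bit register holds eight shorts but only four ints, so shorts double the arithmetic throughput per instruction.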
