🎉 Celebrating 25 Years of GameDev.net! 🎉

Not many can claim 25 years on the Internet! Join us in celebrating this milestone. Learn more about our history, and thank you for being a part of our community!

3 quick ways to calculate the square root in c++

gustavo rincones · 2019-11-28T12:22:52

Hi, this is my first forum post, and its topic is quick ways to calculate the square root in C++ for floating-point types. Functions like these are useful for saving CPU time, especially when they are called over and over. I will show you 3 similar functions and point out the advantages and disadvantages of each. The last of the three was written by me. If the grammar seems a bit off, it is because I do not speak English and am using a translation tool; my native language is Spanish.

Well, let's start. The first method is very famous: it was used in the video game Quake III Arena, and you can find a reference on Wikipedia (https://en.wikipedia.org/wiki/Fast_inverse_square_root). The function is optimized for computation time.

    float sqrt1(const float &n)
    {
        static union {int i; float f;} u;
        u.i = 0x5F375A86 - (*(int*)&n >> 1);
        return (int(3) - n * u.f * u.f) * n * u.f * 0.5f;
    }

Advantages:
* When the root of 0 is calculated, the function returns 0.
* Its accuracy is good enough for games.
* It is very fast.
* The reciprocal of the root can be calculated by removing the second "n" from the return statement, because 1 / sqrt(n) * n = sqrt(n).

Disadvantages:
* Accuracy decreases when the number whose root is being calculated is very large.

The second method is not as famous as the first, but it does the same job: it calculates the root.

    float sqrt2(const float& n)
    {
        union {int i; float f;} u;
        u.i = 0x1FB5AD00 + (*(int*)&n >> 1);
        u.f = n / u.f + u.f;
        return n / u.f + u.f * 0.25f;
    }

Advantages:
* Its accuracy is high enough for applications other than games.

Disadvantages:
* Its computation time is much longer.
* The square root of 0 is a number very close to 0, but never exactly 0.
* Division is the bottleneck in this function, because division is more expensive than the other arithmetic operations of the arithmetic logic unit (ALU).
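
The two functions above read a float through an `int*` and use a union for type punning. Both are undefined behaviour in C++ (the cast breaks strict aliasing, and reading the inactive member of a union is not allowed in C++, although C allows it), even though they usually work in practice; the `static` union in `sqrt1` also makes it a data race when called from several threads. A minimal sketch of the same two methods without those issues, assuming a 32-bit IEEE 754 `float` and a finite, non-negative input. `float_bits` and `bits_float` are hypothetical helper names, and the argument is taken by value because passing a `float` by `const&` gains nothing. For reference, the original Quake III Arena code used the constant 0x5F3759DF and computed only 1/sqrt(x); 0x5F375A86 is the refined constant later published by Chris Lomont.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a float's bits as a 32-bit unsigned integer and back.
// std::memcpy is the portable way to do this before C++20 (std::bit_cast);
// compilers turn it into a plain register move.
inline std::uint32_t float_bits(float f) {
    std::uint32_t i;
    std::memcpy(&i, &f, sizeof i);
    return i;
}

inline float bits_float(std::uint32_t i) {
    float f;
    std::memcpy(&f, &i, sizeof f);
    return f;
}

// Method 1: initial guess y ~ 1/sqrt(n) from the magic constant, then the
// post's arithmetic: 0.5 * y * (3 - n*y*y) is one Newton step for 1/sqrt(n),
// and the extra factor n turns it into sqrt(n).
// Assumed precondition: n is finite and non-negative.
inline float sqrt1(float n) {
    assert(n >= 0.0f);
    const float y = bits_float(0x5F375A86u - (float_bits(n) >> 1));
    return (3.0f - n * y * y) * n * y * 0.5f;
}

// Method 2: initial guess x0 ~ sqrt(n), then two Newton steps.
// u = n/x0 + x0 stores 2*x1 rather than x1, so the return expression
// n/(2*x1) + (2*x1)/4 == (x1 + n/x1)/2 is the second step.
// Assumed precondition: n is finite and non-negative.
inline float sqrt2(float n) {
    assert(n >= 0.0f);
    float u = bits_float(0x1FB5AD00u + (float_bits(n) >> 1)); // x0
    u = n / u + u;                                             // 2 * x1
    return n / u + u * 0.25f;                                  // x2
}
```

The arithmetic is unchanged from the post; only the way the bits are reinterpreted differs, so the results should match the original functions on compilers where the original happened to work.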
The third method takes certain characteristics from the two previous methods.

    float sqrt3(const float& n)
    {
        static union {int i; float f;} u;
        u.i = 0x2035AD0C + (*(int*)&n >> 1);
        return n / u.f + u.f * 0.25f;
    }

Advantages:
* Its accuracy is better than that of the first method.
* Its speed is equal to or better than that of the first method.

Disadvantages:
* The square root of 0 is a number very close to 0, but never exactly 0.

The 3 previous methods have something in common: they are based on the Newton-Raphson method applied to the function f(x) = x^2 - s, whose positive root is the square root of s.

Well, thank you for reading my post.
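
The third function can be written the same way without type punning. This sketch assumes a 32-bit IEEE 754 `float` and a finite, non-negative input; `float_bits`, `bits_float` and `max_relative_error` are hypothetical helper names. It also spells out why one division is enough: 0x2035AD0C is 0x1FB5AD00 + 0x0080000C, and adding 0x00800000 raises the exponent field by one, so the initial guess is (almost exactly) twice the one in `sqrt2`. The `max_relative_error` helper is one way to check the accuracy claims above on your own machine by comparing against `std::sqrt` over a range of inputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Well-defined bit casts (replacement for the union / int* punning).
inline std::uint32_t float_bits(float f) {
    std::uint32_t i;
    std::memcpy(&i, &f, sizeof i);
    return i;
}

inline float bits_float(std::uint32_t i) {
    float f;
    std::memcpy(&f, &i, sizeof f);
    return f;
}

// Method 3: u ~ 2*x0, where x0 is the method-2 guess. The return expression
// n/(2*x0) + (2*x0)/4 == (x0 + n/x0)/2 is one Newton step with one division.
// Assumed precondition: n is finite and non-negative.
inline float sqrt3(float n) {
    assert(n >= 0.0f);
    const float u = bits_float(0x2035AD0Cu + (float_bits(n) >> 1));
    return n / u + u * 0.25f;
}

// Largest relative error of `approx` against std::sqrt over `steps`
// geometrically spaced samples in [lo, hi]. Requires 0 < lo < hi, steps >= 2.
template <typename F>
double max_relative_error(F approx, float lo, float hi, int steps) {
    assert(lo > 0.0f && hi > lo && steps >= 2);
    const double ratio = std::pow(double(hi) / lo, 1.0 / (steps - 1));
    double worst = 0.0;
    double x = lo;
    for (int k = 0; k < steps; ++k, x *= ratio) {
        const float n = static_cast<float>(x);
        const double exact = std::sqrt(double(n));
        const double err = std::fabs(double(approx(n)) - exact) / exact;
        if (err > worst) worst = err;
    }
    return worst;
}
```

For example, `max_relative_error(sqrt3, 1e-6f, 1e6f, 100000)` reports the worst relative error among those samples, and the same call works for the other functions. Speed has to be measured separately, since it depends heavily on the CPU and the compiler flags.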

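Written out with s = n, the updates performed by the three functions look as follows. This is an interpretation of the code in the post rather than something the post states: `sqrt2` and `sqrt3` iterate on f(x) = x^2 - s as described, while `sqrt1`, like the Quake III code, iterates on g(y) = 1/y^2 - s, whose positive root is 1/sqrt(s), and multiplies by s at the end.

```latex
% Newton-Raphson on f(x) = x^2 - s, with f'(x) = 2x (sqrt2 and sqrt3)
x_{k+1} = x_k - \frac{x_k^2 - s}{2x_k} = \frac{1}{2}\left(x_k + \frac{s}{x_k}\right)

% sqrt2: x_0 from 0x1FB5AD00, then u = s/x_0 + x_0 = 2x_1, so the return value is
\frac{s}{u} + \frac{u}{4} = \frac{s}{2x_1} + \frac{x_1}{2} = x_2

% sqrt3: 0x2035AD0C gives u \approx 2x_0 (exponent one higher), so the return value is
\frac{s}{u} + \frac{u}{4} = \frac{s}{2x_0} + \frac{x_0}{2} = x_1

% sqrt1: Newton-Raphson on g(y) = y^{-2} - s, with g'(y) = -2y^{-3}
y_{k+1} = y_k - \frac{y_k^{-2} - s}{-2y_k^{-3}} = \frac{y_k}{2}\left(3 - s\,y_k^2\right),
\qquad \sqrt{s} = s \cdot \frac{1}{\sqrt{s}} \approx s\,y_1
```
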
Math and Physics Programming Algorithm C++ Optimization

Started by gustavo rincones October 22, 2019 01:17 AM

37 comments, last by l0calh05t 4 years, 7 months ago

l0calh05t

1,829

November 26, 2019 09:59 AM

Furthermore, there may be a good reason that the compiler's heuristics do not select vdpps: https://unix4lyfe.org/vdpps-is-slow/

DerTroll

252

November 26, 2019 02:39 PM

l0calh05t said:
Furthermore, there may be a good reason that the compiler's heuristics do not select vdpps: https://unix4lyfe.org/vdpps-is-slow/

I recall that I benchmarked the dot product intrinsics against a handwritten version. If I remember correctly, the intrinsics were faster. Maybe I'll redo that later and post the results ;)

However, at least if you are messing around with intrinsics, I think one should avoid calculating single dot products whenever possible. I always try to calculate multiple dot products at once to take maximal advantage of vectorization.

Greetings
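
Computing several dot products at once is mostly a matter of data layout. A minimal, portable sketch, assuming a structure-of-arrays layout and a batch size of 8 (both arbitrary choices; `kBatch`, `Vec4Batch` and `dot_batch` are hypothetical names). Each loop iteration is an independent dot product over contiguous loads, so an optimizing compiler can usually map iterations onto SIMD lanes, one dot product per lane, without the across-lane work that a single-vector dot product such as `dpps`/`vdpps` performs. Whether it actually vectorizes depends on the compiler and flags, so check the generated assembly.

```cpp
#include <cstddef>

constexpr std::size_t kBatch = 8;  // batch size: an arbitrary choice for this sketch

// Structure of arrays: component arrays for kBatch 4-component vectors.
struct Vec4Batch {
    float x[kBatch];
    float y[kBatch];
    float z[kBatch];
    float w[kBatch];
};

// out[i] = dot(a_i, b_i) for every vector i in the batch.
// No horizontal adds: lane i of a SIMD register only ever touches vector i.
inline void dot_batch(const Vec4Batch& a, const Vec4Batch& b, float* out) {
    for (std::size_t i = 0; i < kBatch; ++i) {
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i]
               + a.z[i] * b.z[i] + a.w[i] * b.w[i];
    }
}
```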

bzt

November 26, 2019 03:19 PM

"float x[4] when passed as a function parameter in C or C++ is always actually a pointer by definition/specification (independent of the ABI)!"
"On both Windows and Linux ABIs a struct containing four floats (whether as scalars or as an array) will NOT be passed in registers."

You are wrong about this. Please read https://www.uclibc.org/docs/psABI-x86_64.pdf Section 3.2.3 Parameter Passing:

SSE: The class consists of types that fit into a vector register.
MEMORY: This class consists of types that will be passed and returned in memory via the stack.

Arguments of types float, double, _Decimal32, _Decimal64 and __m64 are in class SSE.
The classification of aggregate (structures and arrays) and union types works as follows:
1. If the size of an object is larger than four eightbytes, or it contains unaligned fields, it has class MEMORY.
3. If the class is SSE, the next available vector register is used, the registers are taken in the order from %xmm0 to %xmm7.

A footnote on page 18 clarifies that even double[4] arrays can be passed in registers on modern processors: This in turn ensures that for processors that do support the __m256 type, if the size of an object is four eightbytes and the first eightbyte is SSE and all other eightbytes are SSEUP, it can be passed in a register.

Cheers,
bzt
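
To make the by-value case concrete, a sketch assuming the x86-64 System V ABI from the quoted document (`Vec4` and `dot` are illustrative names). A struct of four floats is 16 bytes, i.e. two eightbytes that each contain only floats, so both are classified SSE. `a` then travels in %xmm0 and %xmm1 (two floats per register), `b` in %xmm2 and %xmm3, and the float result comes back in %xmm0. Register assignment cannot be checked from portable C++, so compare the compiler output (for example on Compiler Explorer) to confirm.

```cpp
// Four floats = 16 bytes = two eightbytes. Under the x86-64 System V ABI
// each eightbyte holds only floats and is classified SSE, so a Vec4
// argument goes in two vector registers (x,y in one, z,w in the next).
struct Vec4 {
    float x, y, z, w;
};

static_assert(sizeof(Vec4) == 16, "expected four tightly packed floats");

// Taken by value: a struct parameter is not adjusted to a pointer the way a
// float[4] parameter is, so the calling convention decides how it is passed.
// Under System V: a -> %xmm0/%xmm1, b -> %xmm2/%xmm3, result in %xmm0.
float dot(Vec4 a, Vec4 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}
```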

l0calh05t

1,829

November 26, 2019 08:05 PM

The SysV ABI is entirely irrelevant to the first point, as that is part of the C (and C++) language definition! To quote section 8.3.5 of the C++ Standard (the C standard has a similar clause):

After determining the type of each parameter, any parameter of type “array of T” or “function returning T” is adjusted to be “pointer to T” or “pointer to function returning T,” respectively.

You can even see it in the compiler output of the float dot(float a[4], float b[4]) function on both GCC (Linux) and MSVC (Windows):

https://godbolt.org/z/wdvB87
https://godbolt.org/z/kU-UuY

W.r.t. the second point I will concede that I was wrong about Linux (to which the SysV ABI applies), but not about Windows (to which the SysV ABI does not apply). See the output in the above links. Everything is passed via memory. Even with the "new" __vectorcall ABI, the first struct is passed in xmm0-xmm3 as individual floats, but not as a vector and the second one remains in memory. Note also that neither compiler aligns the struct to a 16-byte boundary (and doing so would break ABI compatibility).
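
The adjustment quoted from the standard can be checked at compile time. In this sketch (`dot_array` is an illustrative name) the function is declared with `float a[4]` parameters, yet its type is `float(float*, float*)`: the 4 is ignored and the caller's arrays are never copied. Passing a struct or `std::array<float, 4>` by value avoids the adjustment, and then the ABI decides whether the argument travels in registers or in memory.

```cpp
#include <type_traits>

// Written with array parameters, but both are adjusted to float*.
float dot_array(float a[4], float b[4]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// The function type contains pointers, not arrays.
static_assert(std::is_same<decltype(dot_array), float(float*, float*)>::value,
              "array parameters are adjusted to pointers");
```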

frob

46,235

November 26, 2019 09:31 PM

However, at least if you are messing around with intrinsics, I think one should avoid calculating single dot products whenever possible. I always try to calculate multiple dot products at once to take maximal advantage of vectorization.

Reiterating from the comment several pages ago, that seems to be the common issue in this entire discussion.

Micro-optimizations do have their place. There was an era where square root times were a significant bottleneck in some code, particularly in graphics and physics code. However, these days the bottlenecks are generally elsewhere.

On modern hardware the biggest bottlenecks tend to be caches and keeping the CPU and GPU fed. I haven't noticed square roots as a blip in a profiler for nearly two decades. There is so much asynchronous processing and out-of-order processing internally to the chips that the individual operations aren't blocking.

Better data batching and broad algorithmic changes will give several orders of magnitude more improvement than a micro-optimization that tunes a single instruction for one specific chip, one compiler, and one ABI.

l0calh05t

1,829

November 27, 2019 11:47 AM

W.r.t. the second point I will concede that I was wrong about Linux (to which the SysV ABI applies), but not about Windows (to which the SysV ABI does not apply)

Correction, I was only partially wrong. The vec4 structs are passed in xmm* registers... as individual floats, not as a vector!

Bregma

9,461

November 28, 2019 11:52 AM

The SysV ABI is entirely irrelevant to the first point, as that is part of the C (and C++) language definition!

Another part of the C language definition is that the generated code has to behave "as if" it were following the original code. One of the things that means in practice is that the compiler does a pointer propagation analysis pass and if it detects that the pointer is not written to (and that includes an indexed offset to the pointer) then it's treated as a dereferenced const rvalue instead. In other words, the compiler can emit code to pass the float[4] in an f128 register instead of as a pointer on the stack, and still be conforming to the language standard. The ABI explicitly allows this as well, by specifying which registers can be clobbered and which must be saved on context switches.

Stephen M. Webb
Professional Free Software Developer

l0calh05t

1,829

November 28, 2019 12:22 PM

Bregma said:
The SysV ABI is entirely irrelevant to the first point, as that is part of the C (and C++) language definition!
Another part of the C language definition is that the generated code has to behave "as if" it were following the original code. One of the things that means in practice is that the compiler does a pointer propagation analysis pass and if it detects that the pointer is not written to (and that includes an indexed offset to the pointer) then it's treated as a dereferenced const rvalue instead. In other words, the compiler can emit code to pass the float[4] in an f128 register instead of as a pointer on the stack, and still be conforming to the language standard. The ABI explicitly allows this as well, by specifying which registers can be clobbered and which must be saved on context switches.

While the "as if" rule allows this within a compilation unit (or when using LTO/LTCG) - which I already said a few posts ago - this cannot be done across normal linking boundaries due to nasty little things like const_cast. Also pointer aliasing will make this impossible in many, many cases.
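
Both obstacles fit in a few lines. In this sketch (`bump_first` and `read_write_read` are hypothetical names), `bump_first` receives a `const float a[4]` yet writes through it after a `const_cast`, which is allowed because the caller's array is not itself const; `read_write_read` stores through a second `float*` that may point into the same array, so the value behind `a` must be read again. A caller compiled separately cannot prove that the callee does neither, so it has to hand over a pointer to real memory; within one translation unit, or with LTO/LTCG, the optimizer can see the bodies and may do better.

```cpp
// The parameter is adjusted to const float*. Casting away const and writing
// is well-defined here because the object pointed to is not const (it would
// be undefined behaviour if the caller's array were declared const).
void bump_first(const float a[4]) {
    const_cast<float*>(a)[0] += 1.0f;
}

// `out` may alias `a`. After the store a[0] may have changed, so the
// compiler has to load it again instead of reusing `before`.
float read_write_read(const float* a, float* out) {
    const float before = a[0];
    out[0] = 10.0f;
    return before + a[0];
}
```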

This topic is closed to new replies.