Fastest way to determine if an integer's square root is an integer

Asked 17/11, 2008 at 13:43 Answered 17/11, 2008 at 13:43

Solved java math optimization perfect-square

1611

I'm looking for the fastest way to determine if a long value is a perfect square (i.e. its square root is another integer):

I've done it the easy way, by using the built-in Math.sqrt() function, but I'm wondering if there is a way to do it faster by restricting yourself to integer-only domain.
Maintaining a lookup table is impractical (since there are about 2^31.5 integers whose square is less than 2⁶³).

Here is the very simple and straightforward way I'm doing it now:

public final static boolean isPerfectSquare(long n)
{
  if (n < 0)
    return false;

  long tst = (long)(Math.sqrt(n) + 0.5);
  return tst*tst == n;
}

_{Note: I'm using this function in many Project Euler problems. So no one else will ever have to maintain this code. And this kind of micro-optimization could actually make a difference, since part of the challenge is to do every algorithm in less than a minute, and this function will need to be called millions of times in some problems.}

I've tried the different solutions to the problem:

After exhaustive testing, I found that adding 0.5 to the result of Math.sqrt() is not necessary, at least not on my machine.
The fast inverse square root was faster, but it gave incorrect results for n >= 410881. However, as suggested by BobbyShaftoe, we can use the FISR hack for n < 410881.
Newton's method was a good bit slower than Math.sqrt(). This is probably because Math.sqrt() uses something similar to Newton's Method, but implemented in the hardware so it's much faster than in Java. Also, Newton's Method still required use of doubles.
A modified Newton's method, which used a few tricks so that only integer math was involved, required some hacks to avoid overflow (I want this function to work with all positive 64-bit signed integers), and it was still slower than Math.sqrt().
Binary chop was even slower. This makes sense because the binary chop will on average require 16 passes to find the square root of a 64-bit number.
According to John's tests, using or statements is faster in C++ than using a switch, but in Java and C# there appears to be no difference between or and switch.
I also tried making a lookup table (as a private static array of 64 boolean values). Then instead of either switch or or statement, I would just say if(lookup[(int)(n&0x3F)]) { test } else return false;. To my surprise, this was (just slightly) slower. This is because array bounds are checked in Java.

Brunildabruning answered 17/11, 2008 at 13:43 Comment(32)

Since Integer and Long son't really have a specific length specified (in most C-ish langauges, which is what your code looks like), better to say that, for a 32-bit integer, there are 2**16 perfect squares. – Cnossus 17/11, 2008 at 14:23

This is Java code, where int==32 bits and long==64 bits, and both are signed. – Brunildabruning 17/11, 2008 at 14:35

Which is faster: "long tst = (long)Math.sqrt(n); return tst*tst == n;" (what you have) or "double tst = Math.sqrt(n); return tst ==(double)Math.round(tst);" ? – Schauer 18/11, 2008 at 16:52

Which JVM did you use for the testing? In my experience algorithm performance is dependant on the JVM. – Governance 21/11, 2008 at 14:23

@Shreevasta: I've done some testing on large values (greater than 2^53), and your method gives some false positives. The first one encountered is for n=9007199326062755, which is not a perfect square but is returned as one. – Brunildabruning 2/12, 2008 at 20:3

There is a bug in your code: sqrt = (long)Math.sqrt(n); return sqrt*sqrt == n; Math.sqrt(n) returns a floating point representation, e.g. Math.sqrt(9) might return 2.99999999999 if you are unlucky, and when you cast it with (long) you could easily end up with a too low number. – Commonable 1/1, 2009 at 23:48

You can use bit tricks to check the last 6 bits: if( (x&2) || ((x & 7) == 5) || ((x & 11) == 8) || ((x & 31) == 20) || ((x & 47) == 24)) return false; – Deandre 7/1, 2009 at 23:48

Please don't call it the "John Carmack hack." He didn't come up with it. – Traduce 11/3, 2009 at 6:27

@martinus: for values below 2^53, the double representation is exact so there will be no roundoff error. i've also tested on every perfect square greater than 2^53, as well as +/- 1 from each perfect square, and roundoff error never results in an incorrect answer. – Brunildabruning 11/3, 2009 at 12:41

I believe that modern JVM's may be able to skip array index checks at a given spot if it can be concluded that they will always be valid there. What JVM was the testing done with? – Enrollee 25/5, 2009 at 19:26

The "John Carmack hack" should be feasible for a larger range, by doing the extra iteration commented out in the Quake source if needed (i,e number large enough). – Enrollee 26/5, 2009 at 11:39

@Thorbjørn Ravn Andersen: i was using j2se 6.0 windows jvm, downloaded from sun's site. i also tried uncommenting the extra iteration, and IIRC it actually got less accurate somehow. – Brunildabruning 26/5, 2009 at 14:34

Make your quickfail quicker and make // Quickfail if( n < 0 || ((n&2) != 0) || ((n & 7) == 5) || ((n & 11) == 8) ) return false; if( n == 0 ) return true; the other way around: // Quickfail if( n == 0 ) return true; if( n < 0 || ((n&2) != 0) || ((n & 7) == 5) || ((n & 11) == 8) ) return false; – Gaskins 25/9, 2009 at 9:11

@dstibbe: that would only be faster when the input is 0. for 75% of other inputs (not even counting negative numbers) it will be faster the way its written, and for the other 25% there will be no difference. – Brunildabruning 25/9, 2009 at 13:8

@mamama -- Perhaps, but it's attributed to him. Henry Ford didn't invent the car, the Wright Bros. didn't invent the airplane, and and Galleleo wasn't the first to figure out the Earth revolved around the sun... the world is made up of stolen inventions (and love). – Achates 6/5, 2010 at 10:25

@Kip, the microbenchmark for the algorithm could be off due to the relatively large table. When the entire table is in the cache (tight loops) there won't be cache misses. The boolean[] should be replaced by long constants (or array) and it will gain some more. – Orthogonal 6/6, 2011 at 19:33

You might get a tiny speed increase in the 'quickfail' by using something like ((1<<(n&15))|65004) != 0, instead of having three separate checks. – Maleficence 1/11, 2011 at 12:6

Why are you adding the 0.5 ? – Anglonorman 26/5, 2013 at 0:50

@KorayTugay to round the number. My concern was that Math.sqrt might return a value that is slightly off due to roundoff error. Say, if math.sqrt(100) returned 9.999999999999999 instead of 10. I am not sure if there are any cases where this actually happens, however. – Brunildabruning 28/5, 2013 at 19:42

@Brunildabruning check out my answer, I got a significant performance gain on your posted answer. – Reflux 10/6, 2013 at 23:10

@Brunildabruning I edited my post to have a third algorithm that sometimes does better than my previous answer. – Reflux 13/6, 2013 at 14:57

Surprised not to see, on such a popular question, anyone point out that you can be accurate at higher n using so-called 'Carmack hack' by performing another iteration(s). N.B. The result isn't Carmack's, but Newton/Raphson's, I suppose the 'magic number' hack would be a fairer attribution to Carmack. – Kisor 1/5, 2014 at 23:14

Not sure if this is what you want, but this takes about 1 millisecond. (javascript, not Java.) return math.sqrt(x).split(".").length > 1 – Royden 10/2, 2015 at 4:40

I'm late to the party, but I believe you could squeeze even more performance out of your final code snippet by using techniques from here: #15621583 – Pedroza 7/7, 2016 at 15:21

@user3932000 I did it because it may result in a slight compiler optimization. Here is some better discussion: #5548163 – Brunildabruning 12/10, 2016 at 14:8

I have never encountered a PE problem that could not be computed in under a minute in ruby. So the argument that performance matters to this end is a little unbelievable. – Foreclosure 24/1, 2017 at 20:42

I am curious how this wisdom holds up in the face of innovations like SSE SQRTPS? felixcloutier.com/x86/SQRTPS.html, – Troposphere 28/2, 2017 at 16:12

What is an example of a problem where this makes a difference? Rather than calculating individual square roots for a long list it is much better to walk up the list of perfect squares and filter the list that way. – Forb 13/12, 2017 at 2:12

For very large numbers, there is a fast randomized algorithm: petr-mitrichev.blogspot.com/2017/12/a-quadratic-week.html – Thoraco 2/1, 2018 at 23:53

@RobertFraser Galileo Galilei wasn't the first to have his name mutilated. – Wellheeled 25/3, 2018 at 17:32

@RobertFraser: Yes, and those are great injustices, about which you can do nothing about. This is never a way to justify your bad behavior. – Erdda 22/4, 2018 at 17:3

@RobertFraser I've never heard anybody claim Galileo invented the heliocentric model. It's always attributed to Copernicus. Is that who you meant? – Napper 11/10, 2019 at 16:20

825

I figured out a method that works ~35% faster than your 6bits+Carmack+sqrt code, at least with my CPU (x86) and programming language (C/C++). Your results may vary, especially because I don't know how the Java factor will play out.

My approach is threefold:

First, filter out obvious answers. This includes negative numbers and looking at the last 4 bits. (I found looking at the last six didn't help.) I also answer yes for 0. (In reading the code below, note that my input is int64 x.)
```
if( x < 0 || (x&2) || ((x & 7) == 5) || ((x & 11) == 8) )
    return false;
if( x == 0 )
    return true;
```
Next, check if it's a square modulo 255 = 3 * 5 * 17. Because that's a product of three distinct primes, only about 1/8 of the residues mod 255 are squares. However, in my experience, calling the modulo operator (%) costs more than the benefit one gets, so I use bit tricks involving 255 = 2^8-1 to compute the residue. (For better or worse, I am not using the trick of reading individual bytes out of a word, only bitwise-and and shifts.)
```
int64 y = x;
y = (y & 4294967295LL) + (y >> 32); 
y = (y & 65535) + (y >> 16);
y = (y & 255) + ((y >> 8) & 255) + (y >> 16);
// At this point, y is between 0 and 511.  More code can reduce it farther.
```
To actually check if the residue is a square, I look up the answer in a precomputed table.
```
if( bad255[y] )
    return false;
// However, I just use a table of size 512
```
Finally, try to compute the square root using a method similar to Hensel's lemma. (I don't think it's applicable directly, but it works with some modifications.) Before doing that, I divide out all powers of 2 with a binary search:
```
if((x & 4294967295LL) == 0)
    x >>= 32;
if((x & 65535) == 0)
    x >>= 16;
if((x & 255) == 0)
    x >>= 8;
if((x & 15) == 0)
    x >>= 4;
if((x & 3) == 0)
    x >>= 2;
```
At this point, for our number to be a square, it must be 1 mod 8.
```
if((x & 7) != 1)
    return false;
```
The basic structure of Hensel's lemma is the following. (Note: untested code; if it doesn't work, try t=2 or 8.)
```
int64 t = 4, r = 1;
t <<= 1; r += ((x - r * r) & t) >> 1;
t <<= 1; r += ((x - r * r) & t) >> 1;
t <<= 1; r += ((x - r * r) & t) >> 1;
// Repeat until t is 2^33 or so.  Use a loop if you want.
```
The idea is that at each iteration, you add one bit onto r, the "current" square root of x; each square root is accurate modulo a larger and larger power of 2, namely t/2. At the end, r and t/2-r will be square roots of x modulo t/2. (Note that if r is a square root of x, then so is -r. This is true even modulo numbers, but beware, modulo some numbers, things can have even more than 2 square roots; notably, this includes powers of 2.) Because our actual square root is less than 2^32, at that point we can actually just check if r or t/2-r are real square roots. In my actual code, I use the following modified loop:
```
int64 r, t, z;
r = start[(x >> 3) & 1023];
do {
    z = x - r * r;
    if( z == 0 )
        return true;
    if( z < 0 )
        return false;
    t = z & (-z);
    r += (z & t) >> 1;
    if( r > (t >> 1) )
        r = t - r;
} while( t <= (1LL << 33) );
```
The speedup here is obtained in three ways: precomputed start value (equivalent to ~10 iterations of the loop), earlier exit of the loop, and skipping some t values. For the last part, I look at z = r - x * x, and set t to be the largest power of 2 dividing z with a bit trick. This allows me to skip t values that wouldn't have affected the value of r anyway. The precomputed start value in my case picks out the "smallest positive" square root modulo 8192.

Even if this code doesn't work faster for you, I hope you enjoy some of the ideas it contains. Complete, tested code follows, including the precomputed tables.

typedef signed long long int int64;

int start[1024] =
{1,3,1769,5,1937,1741,7,1451,479,157,9,91,945,659,1817,11,
1983,707,1321,1211,1071,13,1479,405,415,1501,1609,741,15,339,1703,203,
129,1411,873,1669,17,1715,1145,1835,351,1251,887,1573,975,19,1127,395,
1855,1981,425,453,1105,653,327,21,287,93,713,1691,1935,301,551,587,
257,1277,23,763,1903,1075,1799,1877,223,1437,1783,859,1201,621,25,779,
1727,573,471,1979,815,1293,825,363,159,1315,183,27,241,941,601,971,
385,131,919,901,273,435,647,1493,95,29,1417,805,719,1261,1177,1163,
1599,835,1367,315,1361,1933,1977,747,31,1373,1079,1637,1679,1581,1753,1355,
513,1539,1815,1531,1647,205,505,1109,33,1379,521,1627,1457,1901,1767,1547,
1471,1853,1833,1349,559,1523,967,1131,97,35,1975,795,497,1875,1191,1739,
641,1149,1385,133,529,845,1657,725,161,1309,375,37,463,1555,615,1931,
1343,445,937,1083,1617,883,185,1515,225,1443,1225,869,1423,1235,39,1973,
769,259,489,1797,1391,1485,1287,341,289,99,1271,1701,1713,915,537,1781,
1215,963,41,581,303,243,1337,1899,353,1245,329,1563,753,595,1113,1589,
897,1667,407,635,785,1971,135,43,417,1507,1929,731,207,275,1689,1397,
1087,1725,855,1851,1873,397,1607,1813,481,163,567,101,1167,45,1831,1205,
1025,1021,1303,1029,1135,1331,1017,427,545,1181,1033,933,1969,365,1255,1013,
959,317,1751,187,47,1037,455,1429,609,1571,1463,1765,1009,685,679,821,
1153,387,1897,1403,1041,691,1927,811,673,227,137,1499,49,1005,103,629,
831,1091,1449,1477,1967,1677,697,1045,737,1117,1737,667,911,1325,473,437,
1281,1795,1001,261,879,51,775,1195,801,1635,759,165,1871,1645,1049,245,
703,1597,553,955,209,1779,1849,661,865,291,841,997,1265,1965,1625,53,
1409,893,105,1925,1297,589,377,1579,929,1053,1655,1829,305,1811,1895,139,
575,189,343,709,1711,1139,1095,277,993,1699,55,1435,655,1491,1319,331,
1537,515,791,507,623,1229,1529,1963,1057,355,1545,603,1615,1171,743,523,
447,1219,1239,1723,465,499,57,107,1121,989,951,229,1521,851,167,715,
1665,1923,1687,1157,1553,1869,1415,1749,1185,1763,649,1061,561,531,409,907,
319,1469,1961,59,1455,141,1209,491,1249,419,1847,1893,399,211,985,1099,
1793,765,1513,1275,367,1587,263,1365,1313,925,247,1371,1359,109,1561,1291,
191,61,1065,1605,721,781,1735,875,1377,1827,1353,539,1777,429,1959,1483,
1921,643,617,389,1809,947,889,981,1441,483,1143,293,817,749,1383,1675,
63,1347,169,827,1199,1421,583,1259,1505,861,457,1125,143,1069,807,1867,
2047,2045,279,2043,111,307,2041,597,1569,1891,2039,1957,1103,1389,231,2037,
65,1341,727,837,977,2035,569,1643,1633,547,439,1307,2033,1709,345,1845,
1919,637,1175,379,2031,333,903,213,1697,797,1161,475,1073,2029,921,1653,
193,67,1623,1595,943,1395,1721,2027,1761,1955,1335,357,113,1747,1497,1461,
1791,771,2025,1285,145,973,249,171,1825,611,265,1189,847,1427,2023,1269,
321,1475,1577,69,1233,755,1223,1685,1889,733,1865,2021,1807,1107,1447,1077,
1663,1917,1129,1147,1775,1613,1401,555,1953,2019,631,1243,1329,787,871,885,
449,1213,681,1733,687,115,71,1301,2017,675,969,411,369,467,295,693,
1535,509,233,517,401,1843,1543,939,2015,669,1527,421,591,147,281,501,
577,195,215,699,1489,525,1081,917,1951,2013,73,1253,1551,173,857,309,
1407,899,663,1915,1519,1203,391,1323,1887,739,1673,2011,1585,493,1433,117,
705,1603,1111,965,431,1165,1863,533,1823,605,823,1179,625,813,2009,75,
1279,1789,1559,251,657,563,761,1707,1759,1949,777,347,335,1133,1511,267,
833,1085,2007,1467,1745,1805,711,149,1695,803,1719,485,1295,1453,935,459,
1151,381,1641,1413,1263,77,1913,2005,1631,541,119,1317,1841,1773,359,651,
961,323,1193,197,175,1651,441,235,1567,1885,1481,1947,881,2003,217,843,
1023,1027,745,1019,913,717,1031,1621,1503,867,1015,1115,79,1683,793,1035,
1089,1731,297,1861,2001,1011,1593,619,1439,477,585,283,1039,1363,1369,1227,
895,1661,151,645,1007,1357,121,1237,1375,1821,1911,549,1999,1043,1945,1419,
1217,957,599,571,81,371,1351,1003,1311,931,311,1381,1137,723,1575,1611,
767,253,1047,1787,1169,1997,1273,853,1247,413,1289,1883,177,403,999,1803,
1345,451,1495,1093,1839,269,199,1387,1183,1757,1207,1051,783,83,423,1995,
639,1155,1943,123,751,1459,1671,469,1119,995,393,219,1743,237,153,1909,
1473,1859,1705,1339,337,909,953,1771,1055,349,1993,613,1393,557,729,1717,
511,1533,1257,1541,1425,819,519,85,991,1693,503,1445,433,877,1305,1525,
1601,829,809,325,1583,1549,1991,1941,927,1059,1097,1819,527,1197,1881,1333,
383,125,361,891,495,179,633,299,863,285,1399,987,1487,1517,1639,1141,
1729,579,87,1989,593,1907,839,1557,799,1629,201,155,1649,1837,1063,949,
255,1283,535,773,1681,461,1785,683,735,1123,1801,677,689,1939,487,757,
1857,1987,983,443,1327,1267,313,1173,671,221,695,1509,271,1619,89,565,
127,1405,1431,1659,239,1101,1159,1067,607,1565,905,1755,1231,1299,665,373,
1985,701,1879,1221,849,627,1465,789,543,1187,1591,923,1905,979,1241,181};

bool bad255[512] =
{0,0,1,1,0,1,1,1,1,0,1,1,1,1,1,0,0,1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,
 1,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,1,1,
 0,1,0,1,1,0,0,1,1,1,1,1,0,1,1,1,1,0,1,1,0,0,1,1,1,1,1,1,1,1,0,1,
 1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,1,0,0,1,1,1,1,1,1,
 1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,0,0,1,1,1,1,1,0,1,1,0,1,1,1,1,1,
 1,1,1,1,1,1,0,1,1,0,1,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1,
 1,1,1,0,0,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,
 1,0,1,1,1,0,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,
 0,0,1,1,0,1,1,1,1,0,1,1,1,1,1,0,0,1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,
 1,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,0,1,0,1,1,1,0,1,1,1,1,0,1,1,1,
 0,1,0,1,1,0,0,1,1,1,1,1,0,1,1,1,1,0,1,1,0,0,1,1,1,1,1,1,1,1,0,1,
 1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,1,0,0,1,1,1,1,1,1,
 1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,0,0,1,1,1,1,1,0,1,1,0,1,1,1,1,1,
 1,1,1,1,1,1,0,1,1,0,1,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1,
 1,1,1,0,0,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,
 1,0,1,1,1,0,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,
 0,0};

inline bool square( int64 x ) {
    // Quickfail
    if( x < 0 || (x&2) || ((x & 7) == 5) || ((x & 11) == 8) )
        return false;
    if( x == 0 )
        return true;

    // Check mod 255 = 3 * 5 * 17, for fun
    int64 y = x;
    y = (y & 4294967295LL) + (y >> 32);
    y = (y & 65535) + (y >> 16);
    y = (y & 255) + ((y >> 8) & 255) + (y >> 16);
    if( bad255[y] )
        return false;

    // Divide out powers of 4 using binary search
    if((x & 4294967295LL) == 0)
        x >>= 32;
    if((x & 65535) == 0)
        x >>= 16;
    if((x & 255) == 0)
        x >>= 8;
    if((x & 15) == 0)
        x >>= 4;
    if((x & 3) == 0)
        x >>= 2;

    if((x & 7) != 1)
        return false;

    // Compute sqrt using something like Hensel's lemma
    int64 r, t, z;
    r = start[(x >> 3) & 1023];
    do {
        z = x - r * r;
        if( z == 0 )
            return true;
        if( z < 0 )
            return false;
        t = z & (-z);
        r += (z & t) >> 1;
        if( r > (t >> 1) )
            r = t - r;
    } while( t <= (1LL << 33) );
    
    return false;
}

Deandre answered 17/11, 2008 at 13:43 Comment(17)

Wow! I'll try to convert this to Java and do a comparison, as well as an accuracy check on the results. I'll let you know what I find. – Brunildabruning 12/1, 2009 at 21:47

Testing on all values is impossible, but a test on suspect values (+/- 1 from very large perfect squares) has proven to be accurate. On a run of the first billion integers, this only took 34% of the time required by the original algorithm. – Brunildabruning 16/1, 2009 at 20:13

Awesome! I'm glad it worked out for you. One thing that's different in C(++) and Java is that Java checks bounds on array lookups, so I thought you might take a hit there. – Deandre 16/1, 2009 at 20:50

Seems that this only returns true/false, but it will be difficult to return the square root itself? – Hun 27/3, 2009 at 23:44

Wow, this is beautiful. I'd seen Hensel lifting before (computing roots of polynomials modulo a prime) but I hadn't even realised the lemma could be carefully lowered all the way for computing square roots of numbers; this is... uplifting :) – Schauer 25/5, 2009 at 2:36

@Dimitre Novatchev: No, it would not be too difficult. If the number is a perfect square, then its square root is r times some power of 2, which can be determined when dividing out by powers of 4. – Deandre 3/12, 2009 at 23:50

@A. Rex: HotSpot is able to eliminate bounds checks in certain circumstances https://mcmap.net/q/46059/-bounds-checking-in-java so the hit is probably being avoided. – Personable 11/1, 2011 at 15:8

@A. Rex: I might be mistaken, but isn't 9 a perfect square? And doesn't if( x < 0 || (x&2) || ((x & 7) == 5) || ((x & 11) == 8) ) return false; filter it out as not being one? And similar for all other uneven perfect squares? – Black 17/9, 2012 at 8:20

@nightcracker It doesn't. 9 < 0 => false, 9&2 => 0, 9&7 == 5 => false, 9&11 == 8 => false. – Rhoden 2/12, 2012 at 7:56

Maartinus posted a 2x faster solution (and much shorter) down below, a bit later, that doesn't seem to be getting much love. – Angio 17/3, 2014 at 3:46

@A.Rex AFAIK, the final test t <= (1LL << 33) is useless as you either get a square root or overshoot anyway. I dropped it and it works (but maybe there's a gotcha with numbers I haven't tried?). – Valero 13/6, 2014 at 19:21

It seems like a lot of the speed advantage in the different solutions is gained by filtering out the obvious squares. Did anyone benchmark the situation of filtering out via Maartinus' solution and then just using the sqrt function as that's a built-in function? – Fi 7/10, 2014 at 17:26

LL in y & 4294967295LL not needed. – Sharrisharron 21/5, 2018 at 8:30

“Next, check if it's a square modulo 255 = 3 * 5 * 17. Because that's a product of three distinct primes, only about 1/8 of the residues mod 255 are squares.” Actually, the proportion of squares modulo 255 is 54/255 ≃ 0.212, much more than 1/8 = 0.125 (because the proportion of squares modulo each of the prime factors are slightly more than 1/2). But note that you are not limited to primes. The proportion of squares modulo 256 is 44/256 ≃ 0.172, and computing modulo 256 is much more convenient. – Yardage 29/9, 2019 at 15:31

That’s essentially one of the things maaartinus’ answer does, except that it uses 64 instead of 256 (which gives a more compact precomputed table with a better cache behavior, while still being pretty efficient at filtering out non-squares, as the proportion of squares modulo 64 is 12/64 ≃ 0.188). – Yardage 29/9, 2019 at 15:50

It might be worth mentioning that x86 has an assembly instruction to get the number of trailing zeroes which is supported as a trailing zeroes function in most languages, so the binary search isn't needed. But applying Hensel's lemma is a good idea. – Marlette 8/2, 2022 at 14:21

For the Hensel's lemma lifting part of the solution, it ought to be possible to roughly double the number of correct bits per iteration instead of just getting an additional bit per iteration. See gist.github.com/mdickinson/… for some C code along these lines. – Tuddor 3/3, 2022 at 19:15

458

I'm pretty late to the party, but I hope to provide a better answer; shorter and (assuming my benchmark is correct) also much faster.

long goodMask; // 0xC840C04048404040 computed below
{
    for (int i=0; i<64; ++i) goodMask |= Long.MIN_VALUE >>> (i*i);
}

public boolean isSquare(long x) {
    // This tests if the 6 least significant bits are right.
    // Moving the to be tested bit to the highest position saves us masking.
    if (goodMask << x >= 0) return false;
    final int numberOfTrailingZeros = Long.numberOfTrailingZeros(x);
    // Each square ends with an even number of zeros.
    if ((numberOfTrailingZeros & 1) != 0) return false;
    x >>= numberOfTrailingZeros;
    // Now x is either 0 or odd.
    // In binary each odd square ends with 001.
    // Postpone the sign test until now; handle zero in the branch.
    if ((x&7) != 1 | x <= 0) return x == 0;
    // Do it in the classical way.
    // The correctness is not trivial as the conversion from long to double is lossy!
    final long tst = (long) Math.sqrt(x);
    return tst * tst == x;
}

The first test catches most non-squares quickly. It uses a 64-item table packed in a long, so there's no array access cost (indirection and bounds checks). For a uniformly random long, there's a 81.25% probability of ending here.

The second test catches all numbers having an odd number of twos in their factorization. The method Long.numberOfTrailingZeros is very fast as it gets JIT-ed into a single i86 instruction.

After dropping the trailing zeros, the third test handles numbers ending with 011, 101, or 111 in binary, which are no perfect squares. It also cares about negative numbers and also handles 0.

The final test falls back to double arithmetic. As double has only 53 bits mantissa, the conversion from long to double includes rounding for big values. Nonetheless, the test is correct (unless the proof is wrong).

Trying to incorporate the mod255 idea wasn't successful.

Valero answered 17/11, 2008 at 13:43 Comment(21)

That implicit masking of the shift value is a bit ... evil. Do you have any idea why that's in the Java spec? – Sake 11/7, 2014 at 1:4

A question about your code in particular: why do you need to check that an odd number ends 001? Isn't that handled by the goodMask test? – Sake 11/7, 2014 at 1:8

@Sake I guess there are two reasons: 1. Shifting by more makes no sense. 2. It's like the HW works and anyone using bitwise operations is interested in performance, so doing anything else would be wrong. - The goodMask test does it, but it does it before the right shift. So you'd have to repeat it, but this way it's simpler and AFAIK a tiny bit faster and equally good. – Valero 11/7, 2014 at 14:55

I missed the fact that you were shifting off the zeroes there. I guess the trailing zero count is too expensive to do first? Which architectures use only low-order bits for shifts? – Sake 11/7, 2014 at 15:11

@Sake For the benchmark it's important to give answer ASAP, and the trailing zero count itself gives no answer; it's just a preparatory step. i86/amd64 do it. No idea about the small CPUs in mobiles, but at worst, Java has to generate an AND instruction for them, which is surely simpler than the other way round. – Valero 11/7, 2014 at 15:24

Please take a look at my answer. If you could plug it into your benchmark system, I'd be very grateful. – Sake 13/7, 2014 at 10:29

Can you post the full benchmark results, not just those 5? – Reflux 13/3, 2015 at 23:42

The third test if ((x&7) != 1 | x <= 0) return x == 0; could be if ((x&7) != 1 | x < 0) return x == 0; (the case x==0 is already caught by (x&7) != 1). I'm not sure if this speeds up the computation, but it makes it more obvious that you can remove the second test if your input is positive: if ((x&7) != 1) return x == 0; I know there are no unsigned longs in java, but it might be helpful for people porting this code to other languages (like I'm doing right now). – Dilate 4/8, 2016 at 16:30

@Dilate I guess, you're right. However, IIRC the JIT made out of it something very different from what I was hoping for, so it may improve or worsen things. The problen with JIT is that the compiled code depends on data it was fed with, so aggregating the two cases didn't take place as I used no (hardly any?) negative inputs in my benchmark. – Valero 4/8, 2016 at 16:57

@Dilate A probably better test: if ((x & (7 | Integer.MIN_VALUE)) != 1) return x == 0;. – Valero 4/8, 2016 at 16:59

@maaartinus: I agree :) and then of course put (7 | Long.MIN_VALUE) in a constant (or assume the compiler is clever enough to do that for you) – Dilate 4/8, 2016 at 17:2

"As double has only 56 bits mantissa" --> I would say it has more likely has a 53 bit one. Also – Sharrisharron 23/3, 2017 at 14:51

@Dilate It must be clever as it's a compile time constant according to the JLS. – Valero 30/7, 2017 at 22:36

@Valero Dropbox says: "Sorry, that file doesn’t live here anymore. It might have been moved or made private." Would you post it here instead? Stackoverflow supports posts up to 40 KiB.Doesn't the file fit in? – Zucchetto 1/2, 2018 at 10:49

@OlivierGrégoire Sorry for that. I'll do it in two weeks, when I'm on the right machine again (it's offline, sorry!). The file surely fits. All my dropbox links broke, that's sad. – Valero 1/2, 2018 at 11:5

@OlivierGrégoire I've fixed the link instead. The file has 22k and would fit, but I guess, the current link will for for a few years. I prefer not to post it here, as it's pretty long and boring, but feel free to edit, if you think otherwise. – Valero 15/2, 2018 at 20:21

@OlivierGrégoire I'm just using it for projecteuler.net/problem=621 and that sad thing is that using nothing but the floating point test is equally good. :( – Valero 25/2, 2018 at 23:17

How can I use this for very large numbers (BigIntegers).Then what will be the Mask value?How to calculate it? – Sept 26/4, 2020 at 9:25

@NuwanHarshakumaraPiyarathna Both the goodMask and numberOfTrailingZeros can be used in the same way (same value of goodMask). – Valero 26/4, 2020 at 22:54

@Valero How does the goodMask test catch 81% of all longs? Isn't ``goodMask << x` equal to 0 for all x > 64? – Kampmeier 26/7, 2020 at 23:6

@Kampmeier No, because the Java Language Specification says "If the promoted type of the left-hand operand is long, then only the six lowest-order bits of the right-hand operand are used as the shift distance. It is as if the right-hand operand were subjected to a bitwise logical AND operator & (§15.22.1) with the mask value 0x3f (0b111111). The shift distance actually used is therefore always in the range 0 to 63, inclusive." docs.oracle.com/javase/specs/jls/se11/jls11.pdf – Embarkment 8/8, 2021 at 15:47

138

You'll have to do some benchmarking. The best algorithm will depend on the distribution of your inputs.

Your algorithm may be nearly optimal, but you might want to do a quick check to rule out some possibilities before calling your square root routine. For example, look at the last digit of your number in hex by doing a bit-wise "and." Perfect squares can only end in 0, 1, 4, or 9 in base 16, So for 75% of your inputs (assuming they are uniformly distributed) you can avoid a call to the square root in exchange for some very fast bit twiddling.

Kip benchmarked the following code implementing the hex trick. When testing numbers 1 through 100,000,000, this code ran twice as fast as the original.

public final static boolean isPerfectSquare(long n)
{
    if (n < 0)
        return false;

    switch((int)(n & 0xF))
    {
    case 0: case 1: case 4: case 9:
        long tst = (long)Math.sqrt(n);
        return tst*tst == n;

    default:
        return false;
    }
}

When I tested the analogous code in C++, it actually ran slower than the original. However, when I eliminated the switch statement, the hex trick once again make the code twice as fast.

int isPerfectSquare(int n)
{
    int h = n & 0xF;  // h is the last hex "digit"
    if (h > 9)
        return 0;
    // Use lazy evaluation to jump out of the if statement as soon as possible
    if (h != 2 && h != 3 && h != 5 && h != 6 && h != 7 && h != 8)
    {
        int t = (int) floor( sqrt((double) n) + 0.5 );
        return t*t == n;
    }
    return 0;
}

Eliminating the switch statement had little effect on the C# code.

Artisan answered 17/11, 2008 at 13:43 Comment(7)

Nice point about the trailing bits. I would try to combine that test with some of the other remarks here. – Unintelligible 17/11, 2008 at 14:56

I benchmarked it for the first 100 million integers.. this approximately halves the time required. – Brunildabruning 17/11, 2008 at 16:9

You could extend this beyond the last digit as well. For example, there are only 44 different values the least significant byte can have. – Galilean 8/2, 2010 at 4:41

FYI, I tested this method in Python. It gave about a 3/8 speedup over the vanilla t = math.floor(math.sqrt(n) + 0.5); return t*t == n (5 seconds vs. 8 seconds to compute for the first 100,000 integers, 100 times). – Flavia 17/10, 2013 at 14:43

@Flavia There's no need to add 0.5, see my solution for a link to the proof. – Valero 13/6, 2014 at 18:12

you said that if-else is faster than switch but this https://mcmap.net/q/23262/-is-39-switch-39-faster-than-39-if-39 question tells the different story. – Speaks 3/1, 2017 at 9:26

@JerryGoyal It depends on the compiler and the values of the cases. In a perfect compiler, a switch is always at least as fast as if-else. But compilers are not perfect, so it is best to try it out, like John did. – Doubt 10/11, 2017 at 16:56

I was thinking about the horrible times I've spent in Numerical Analysis course.

And then I remember, there was this function circling around the 'net from the Quake Source code:

float Q_rsqrt( float number )
{
  long i;
  float x2, y;
  const float threehalfs = 1.5F;

  x2 = number * 0.5F;
  y  = number;
  i  = * ( long * ) &y;  // evil floating point bit level hacking
  i  = 0x5f3759df - ( i >> 1 ); // wtf?
  y  = * ( float * ) &i;
  y  = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
  // y  = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed

  #ifndef Q3_VM
  #ifdef __linux__
    assert( !isnan(y) ); // bk010122 - FPE?
  #endif
  #endif
  return y;
}

Which basically calculates a square root, using Newton's approximation function (cant remember the exact name).

It should be usable and might even be faster, it's from one of the phenomenal id software's game!

It's written in C++ but it should not be too hard to reuse the same technique in Java once you get the idea:

I originally found it at: https://web.archive.org/web/20110708173806/https://www.codemaestro.com/reviews/9

Newton's method explained at wikipedia: http://en.wikipedia.org/wiki/Newton%27s_method

You can follow the link for more explanation of how it works, but if you don't care much, then this is roughly what I remember from reading the blog and from taking the Numerical Analysis course:

the * (long*) &y is basically a fast convert-to-long function so integer operations can be applied on the raw bytes.
the 0x5f3759df - (i >> 1); line is a pre-calculated seed value for the approximation function.
the * (float*) &i converts the value back to floating point.
the y = y * ( threehalfs - ( x2 * y * y ) ) line bascially iterates the value over the function again.

The approximation function gives more precise values the more you iterate the function over the result. In Quake's case, one iteration is "good enough", but if it wasn't for you... then you could add as much iteration as you need.

This should be faster because it reduces the number of division operations done in naive square rooting down to a simple divide by 2 (actually a * 0.5F multiply operation) and replace it with a few fixed number of multiplication operations instead.

Bootstrap answered 17/11, 2008 at 13:43 Comment(6)

It should be noted that this returns 1/sqrt(number), not sqrt(number). I've done some testing, and this fails starting at n=410881: the John Carmack magic formula returns 642.00104, when the actual square root is 641. – Brunildabruning 17/11, 2008 at 19:56

You could look at Chris Lomonts paper on fast inverse square roots: lomont.org/Math/Papers/2003/InvSqrt.pdf It uses the same technique as here, but with a different magic number. The paper explains why the magic number was chosen. – Bathhouse 2/12, 2008 at 10:24

Also, beyond3d.com/content/articles/8 and beyond3d.com/content/articles/15 shed some light as to the origins of this method. It's often attributed to John Carmack, but it seems the original code was (possibly) written by Gary Tarolli, Greg Walsh and probably others. – Bathhouse 2/12, 2008 at 10:26

Also you can't typepun floats and ints in Java. – Welby 27/8, 2013 at 14:29

@Welby who says? FloatToIntBits and IntToFloatBits have been around since java 1.0.2. – Cycling 27/5, 2015 at 17:50

This does not answer the question which is about exact square number computation. – Indignant 2/8, 2022 at 22:13

I'm not sure if it would be faster, or even accurate, but you could use John Carmack's Magical Square Root, algorithm to solve the square root faster. You could probably easily test this for all possible 32 bit integers, and validate that you actually got correct results, as it's only an appoximation. However, now that I think about it, using doubles is approximating also, so I'm not sure how that would come into play.

Bessiebessy answered 17/11, 2008 at 13:43 Comment(10)

I believe Carmack's trick is fairly pointless these days. The built-in sqrt instruction is a lot faster than it used to be, so you may be better off just performing a regular square root and testing if the result is an int. As always, benchmark it. – Winthorpe 17/11, 2008 at 15:22

doubles can always represent integers in their range exactly – Denbighshire 17/11, 2008 at 18:45

This breaks starting at n=410881, the John Carmack magic formula returns 642.00104, when the actual square root is 641. – Brunildabruning 17/11, 2008 at 19:55

I recently used Carmack's trick in a Java game and it was very effective, giving a speedup of about 40%, so it is still useful, at least in Java. – Regnant 7/1, 2010 at 11:57

@Regnant - +40% in actual framerate or just in the sqrt function? I'm actually quite surprised about this, since most of the games I've played (and the one I've made in XNA) have been GPU-bound instead of CPU-bound anyway (though I know this can differ between use's machine, blah blah blah) – Achates 6/5, 2010 at 10:27

@Robert Fraser Yes +40% in the overall frame rate. The game had a particle physics system which took up nearly all available CPU cycles, dominated by the square root function and the round-to-nearest-integer function (which I had also optimised using a similar bit twiddling hack.) – Regnant 6/5, 2010 at 10:48

@Denbighshire but they can't represent longs above 2^53. – Welby 27/8, 2013 at 14:28

@Welby Right, but it works when done right, see my solution below. – Valero 19/5, 2014 at 21:7

The link has been fixed by pointing it to the Wayback Machine's snapshot from around the time this was posted :) – Testudo 2/7, 2019 at 0:38

Would Carmack's trick have value in an architecture other than x86? Such as MIPS? – Rear 13/12, 2022 at 16:2

If you do a binary chop to try to find the "right" square root, you can fairly easily detect if the value you've got is close enough to tell:

(n+1)^2 = n^2 + 2n + 1
(n-1)^2 = n^2 - 2n + 1

So having calculated n^2, the options are:

n^2 = target: done, return true
n^2 + 2n + 1 > target > n^2 : you're close, but it's not perfect: return false
n^2 - 2n + 1 < target < n^2 : ditto
target < n^2 - 2n + 1 : binary chop on a lower n
target > n^2 + 2n + 1 : binary chop on a higher n

(Sorry, this uses n as your current guess, and target for the parameter. Apologise for the confusion!)

I don't know whether this will be faster or not, but it's worth a try.

EDIT: The binary chop doesn't have to take in the whole range of integers, either (2^x)^2 = 2^(2x), so once you've found the top set bit in your target (which can be done with a bit-twiddling trick; I forget exactly how) you can quickly get a range of potential answers. Mind you, a naive binary chop is still only going to take up to 31 or 32 iterations.

Nsf answered 17/11, 2008 at 13:43 Comment(3)

My money is on this kind of approach. Avoid calling sqrt() since it is calculating a full square root, and you only need the first few digits. – Unintelligible 17/11, 2008 at 14:44

On the other hand, if the floating point is being done in a dedicated FP unit, it may be using all kinds of fun tricks. I wouldn't like to bet on it without a benchmark :) (I may try it tonight though in C#, just to see...) – Nsf 17/11, 2008 at 14:49

Hardware sqrts are actually pretty fast these days. – Moschatel 29/11, 2008 at 3:58

I ran my own analysis of several of the algorithms in this thread and came up with some new results. You can see those old results in the edit history of this answer, but they're not accurate, as I made a mistake, and wasted time analyzing several algorithms which aren't close. However, pulling lessons from several different answers, I now have two algorithms that crush the "winner" of this thread. Here's the core thing I do differently than everyone else:

// This is faster because a number is divisible by 2^4 or more only 6% of the time
// and more than that a vanishingly small percentage.
while((x & 0x3) == 0) x >>= 2;
// This is effectively the same as the switch-case statement used in the original
// answer. 
if((x & 0x7) != 1) return false;

However, this simple line, which most of the time adds one or two very fast instructions, greatly simplifies the switch-case statement into one if statement. However, it can add to the runtime if many of the tested numbers have significant power-of-two factors.

The algorithms below are as follows:

Internet - Kip's posted answer
Durron - My modified answer using the one-pass answer as a base
DurronTwo - My modified answer using the two-pass answer (by @JohnnyHeggheim), with some other slight modifications.

Here is a sample runtime if the numbers are generated using Math.abs(java.util.Random.nextLong())

 0% Scenario{vm=java, trial=0, benchmark=Internet} 39673.40 ns; ?=378.78 ns @ 3 trials
33% Scenario{vm=java, trial=0, benchmark=Durron} 37785.75 ns; ?=478.86 ns @ 10 trials
67% Scenario{vm=java, trial=0, benchmark=DurronTwo} 35978.10 ns; ?=734.10 ns @ 10 trials

benchmark   us linear runtime
 Internet 39.7 ==============================
   Durron 37.8 ============================
DurronTwo 36.0 ===========================

vm: java
trial: 0

And here is a sample runtime if it's run on the first million longs only:

 0% Scenario{vm=java, trial=0, benchmark=Internet} 2933380.84 ns; ?=56939.84 ns @ 10 trials
33% Scenario{vm=java, trial=0, benchmark=Durron} 2243266.81 ns; ?=50537.62 ns @ 10 trials
67% Scenario{vm=java, trial=0, benchmark=DurronTwo} 3159227.68 ns; ?=10766.22 ns @ 3 trials

benchmark   ms linear runtime
 Internet 2.93 ===========================
   Durron 2.24 =====================
DurronTwo 3.16 ==============================

vm: java
trial: 0

As you can see, DurronTwo does better for large inputs, because it gets to use the magic trick very very often, but gets clobbered compared to the first algorithm and Math.sqrt because the numbers are so much smaller. Meanwhile, the simpler Durron is a huge winner because it never has to divide by 4 many many times in the first million numbers.

Here's Durron:

public final static boolean isPerfectSquareDurron(long n) {
    if(n < 0) return false;
    if(n == 0) return true;

    long x = n;
    // This is faster because a number is divisible by 16 only 6% of the time
    // and more than that a vanishingly small percentage.
    while((x & 0x3) == 0) x >>= 2;
    // This is effectively the same as the switch-case statement used in the original
    // answer. 
    if((x & 0x7) == 1) {

        long sqrt;
        if(x < 410881L)
        {
            int i;
            float x2, y;

            x2 = x * 0.5F;
            y  = x;
            i  = Float.floatToRawIntBits(y);
            i  = 0x5f3759df - ( i >> 1 );
            y  = Float.intBitsToFloat(i);
            y  = y * ( 1.5F - ( x2 * y * y ) );

            sqrt = (long)(1.0F/y);
        } else {
            sqrt = (long) Math.sqrt(x);
        }
        return sqrt*sqrt == x;
    }
    return false;
}

And DurronTwo

public final static boolean isPerfectSquareDurronTwo(long n) {
    if(n < 0) return false;
    // Needed to prevent infinite loop
    if(n == 0) return true;

    long x = n;
    while((x & 0x3) == 0) x >>= 2;
    if((x & 0x7) == 1) {
        long sqrt;
        if (x < 41529141369L) {
            int i;
            float x2, y;

            x2 = x * 0.5F;
            y = x;
            i = Float.floatToRawIntBits(y);
            //using the magic number from 
            //http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf
            //since it more accurate
            i = 0x5f375a86 - (i >> 1);
            y = Float.intBitsToFloat(i);
            y = y * (1.5F - (x2 * y * y));
            y = y * (1.5F - (x2 * y * y)); //Newton iteration, more accurate
            sqrt = (long) ((1.0F/y) + 0.2);
        } else {
            //Carmack hack gives incorrect answer for n >= 41529141369.
            sqrt = (long) Math.sqrt(x);
        }
        return sqrt*sqrt == x;
    }
    return false;
}

And my benchmark harness: (Requires Google caliper 0.1-rc5)

public class SquareRootBenchmark {
    public static class Benchmark1 extends SimpleBenchmark {
        private static final int ARRAY_SIZE = 10000;
        long[] trials = new long[ARRAY_SIZE];

        @Override
        protected void setUp() throws Exception {
            Random r = new Random();
            for (int i = 0; i < ARRAY_SIZE; i++) {
                trials[i] = Math.abs(r.nextLong());
            }
        }


        public int timeInternet(int reps) {
            int trues = 0;
            for(int i = 0; i < reps; i++) {
                for(int j = 0; j < ARRAY_SIZE; j++) {
                    if(SquareRootAlgs.isPerfectSquareInternet(trials[j])) trues++;
                }
            }

            return trues;   
        }

        public int timeDurron(int reps) {
            int trues = 0;
            for(int i = 0; i < reps; i++) {
                for(int j = 0; j < ARRAY_SIZE; j++) {
                    if(SquareRootAlgs.isPerfectSquareDurron(trials[j])) trues++;
                }
            }

            return trues;   
        }

        public int timeDurronTwo(int reps) {
            int trues = 0;
            for(int i = 0; i < reps; i++) {
                for(int j = 0; j < ARRAY_SIZE; j++) {
                    if(SquareRootAlgs.isPerfectSquareDurronTwo(trials[j])) trues++;
                }
            }

            return trues;   
        }
    }

    public static void main(String... args) {
        Runner.main(Benchmark1.class, args);
    }
}

UPDATE: I've made a new algorithm that is faster in some scenarios, slower in others, I've gotten different benchmarks based on different inputs. If we calculate modulo 0xFFFFFF = 3 x 3 x 5 x 7 x 13 x 17 x 241, we can eliminate 97.82% of numbers that cannot be squares. This can be (sort of) done in one line, with 5 bitwise operations:

if (!goodLookupSquares[(int) ((n & 0xFFFFFFl) + ((n >> 24) & 0xFFFFFFl) + (n >> 48))]) return false;

The resulting index is either 1) the residue, 2) the residue + 0xFFFFFF, or 3) the residue + 0x1FFFFFE. Of course, we need to have a lookup table for residues modulo 0xFFFFFF, which is about a 3mb file (in this case stored as ascii text decimal numbers, not optimal but clearly improvable with a ByteBuffer and so forth. But since that is precalculation it doesn't matter so much. You can find the file here (or generate it yourself):

public final static boolean isPerfectSquareDurronThree(long n) {
    if(n < 0) return false;
    if(n == 0) return true;

    long x = n;
    while((x & 0x3) == 0) x >>= 2;
    if((x & 0x7) == 1) {
        if (!goodLookupSquares[(int) ((n & 0xFFFFFFl) + ((n >> 24) & 0xFFFFFFl) + (n >> 48))]) return false;
        long sqrt;
        if(x < 410881L)
        {
            int i;
            float x2, y;

            x2 = x * 0.5F;
            y  = x;
            i  = Float.floatToRawIntBits(y);
            i  = 0x5f3759df - ( i >> 1 );
            y  = Float.intBitsToFloat(i);
            y  = y * ( 1.5F - ( x2 * y * y ) );

            sqrt = (long)(1.0F/y);
        } else {
            sqrt = (long) Math.sqrt(x);
        }
        return sqrt*sqrt == x;
    }
    return false;
}

I load it into a boolean array like this:

private static boolean[] goodLookupSquares = null;

public static void initGoodLookupSquares() throws Exception {
    Scanner s = new Scanner(new File("24residues_squares.txt"));

    goodLookupSquares = new boolean[0x1FFFFFE];

    while(s.hasNextLine()) {
        int residue = Integer.valueOf(s.nextLine());
        goodLookupSquares[residue] = true;
        goodLookupSquares[residue + 0xFFFFFF] = true;
        goodLookupSquares[residue + 0x1FFFFFE] = true;
    }

    s.close();
}

Example runtime. It beat Durron (version one) in every trial I ran.

 0% Scenario{vm=java, trial=0, benchmark=Internet} 40665.77 ns; ?=566.71 ns @ 10 trials
33% Scenario{vm=java, trial=0, benchmark=Durron} 38397.60 ns; ?=784.30 ns @ 10 trials
67% Scenario{vm=java, trial=0, benchmark=DurronThree} 36171.46 ns; ?=693.02 ns @ 10 trials

  benchmark   us linear runtime
   Internet 40.7 ==============================
     Durron 38.4 ============================
DurronThree 36.2 ==========================

vm: java
trial: 0

Reflux answered 17/11, 2008 at 13:43 Comment(8)

A giant lookup table doesn't seem like a good idea. A cache miss is slower (~100 to 150 cycles) than the x86 hardware sqrt instruction (~20 cycles). Throughput-wise, you can sustain a lot of outstanding cache-misses, but you're still evicting other useful data. A huge lookup table would only be worth it if it was a LOT faster than any other option, and this function was the major factor in the performance of your whole program. – Pilsner 21/8, 2015 at 2:5

@Peter Cordes, you should only have cache misses a few times initially, then it will all be in cache, right? – Loony 30/10, 2018 at 19:9

@SwissFrank: is perfect-square checking the only thing your program does? A lookup table can look good in a microbenchmark that calls it repeatedly in a tight loop, but in a real program that has other data in its working set, it's not good. – Pilsner 30/10, 2018 at 22:13

A bitmap of 0x1FFFFFE bits takes 4 mega-bytes if stored as a packed bitmap. An L3 cache hit on a modern Intel desktop has > 40 cycles of latency, and worse on a big Xeon; longer than hardware sqrt+mul latency. If stored as a byte-map with 1 byte per value, it's about 32 MB; bigger than the L3 cache of anything but a many-core Xeon where all core share one huge cache. So if your inputs data has a uniform random distribution over a large enough range of inputs, you'll get lots of L2 cache misses even in a tight loop. (private per-core L2 on Intel is only 256k, with ~12 cycle latency.) – Pilsner 30/10, 2018 at 22:18

@Peter Cordes, the original poster seems to say that that this is the core part of his program. As a recreational mathematician he may not be running any other processes while this is going, so it could be expected to have the cache much to itself. However if I understand your numbers, if you can't get it in L2--and you can't--then sqrt will be faster than going to L3. Off subject but do you have a convenient reference for access times on various CPUs? – Loony 31/10, 2018 at 9:36

@SwissFrank: Oh, if all you're doing is root checking, then there's potential to this with a bitmap to get L3 hits. I was looking at latency, but many misses can be in flight at once, so throughput is potentially good. OTOH, SIMD sqrtps throughput or even sqrtpd (double-precision) is not too bad on Skylake, but is not much better than latency on old CPUs. Anyway 7-cpu.com/cpu/Haswell.html has some nice experimental numbers, and pages for other CPUs. Agner Fog's microarch guide pdf has some cache latency numbers for Intel and AMD uarches: agner.org/optimize – Pilsner 31/10, 2018 at 9:47

Using x86 SIMD from Java is a problem, and by the time you add in the cost of int->fp and fp->int conversion, it's plausible that a bitmap could be better. You do need double precision to avoid rounding some integer outside the +-2^24 range (so a 32-bit integer can be outside that), and sqrtpd is slower than sqrtps as well as only processing half as many elements per instruction (per SIMD vector). – Pilsner 31/10, 2018 at 9:52

> "by the time you add in the cost of int->fp and fp->int conversion, it's plausible that a bitmap could be better" Just checked your profile, many bows! One idea that comes to mind is running your loop in int and double side by side, using the double where you need it and int where you need it. I totally haven't checked whether that even fits the wider discussion, but is just one thing I've done. BTW I do synthesizer software and am keen to try SIMD for one aspect of it... – Loony 31/10, 2018 at 14:45

It should be much faster to use Newton's method to calculate the Integer Square Root, then square this number and check, as you do in your current solution. Newton's method is the basis for the Carmack solution mentioned in some other answers. You should be able to get a faster answer since you're only interested in the integer part of the root, allowing you to stop the approximation algorithm sooner.

Another optimization that you can try: If the Digital Root of a number doesn't end in 1, 4, 7, or 9 the number is not a perfect square. This can be used as a quick way to eliminate 60% of your inputs before applying the slower square root algorithm.

Map answered 17/11, 2008 at 13:43 Comment(3)

The digital root is strictly computationally equivalent to modulo, so should be considered along with other modulo methods here, such as mod 16 and mod 255. – Abib 8/1, 2010 at 16:48

Are you sure the digital root is equivalent to modulo? It seems to be something entirely different as explained by the link. Notice the list is 1,4,7,9 not 1,4,5,9. – Rosio 4/12, 2011 at 4:44

Digital root in the decimal system is equivalent to using modulo 9 (well dr(n)=1+((n-1) mod 9); so a slight shift as well). The numbers 0,1,4,5,9 are for modulo 16, and 0, 1, 4, 7 are for modulo 9 - which correspond to 1, 4, 7, 9 for digital root. – Scram 19/1, 2018 at 12:31

I want this function to work with all positive 64-bit signed integers

Math.sqrt() works with doubles as input parameters, so you won't get accurate results for integers bigger than 2^53.

Maundy answered 17/11, 2008 at 13:43 Comment(4)

I've actually tested the answer on all perfect squares larger than 2^53, as well as all numbers from 5 below each perfect square to 5 above each perfect square, and I get the correct result. (the roundoff error is corrected when I round the sqrt answer to a long, then square that value and compare) – Brunildabruning 6/1, 2009 at 14:33

@Kip: I guess I've proven that it works. – Valero 8/9, 2013 at 19:12

The results aren't perfectly accurate, but more accurate than you might think. If we assume at least 15 accurate digits after the conversion to double and after the square root, then that's plenty, because we need no more than 11: 10 digits for the 32-bit square root and less than 1 for a decimal place, because the +0.5 rounds to nearest. – Stammer 22/4, 2014 at 0:54

Math.sqrt() isn't totally accurate, but it doesn't have to. In the very first post, tst is an integer close to sqrt (N). If N is not a square, then tst * tst != N, no matter what the value of tst is. If N is a perfect square, then sqrt (N) < 2^32, and as long as sqrt (N) is calculated with an error < 0.5, we are fine. – Halfcock 1/5, 2014 at 23:27

An integer problem deserves an integer solution. Thus

Do binary search on the (non-negative) integers to find the greatest integer t such that t**2 <= n. Then test whether r**2 = n exactly. This takes time O(log n).

If you don't know how to binary search the positive integers because the set is unbounded, it's easy. You starting by computing your increasing function f (above f(t) = t**2 - n) on powers of two. When you see it turn positive, you've found an upper bound. Then you can do standard binary search.

Toluidine answered 17/11, 2008 at 13:43 Comment(2)

Actually the time would be at least O((log n)^2) because multiplication is not constant-time but in fact has a lower bound of O(log n), which becomes apparent when working with large multi-precision numbers. But the scope of this wiki seems to be 64-bits, so maybe it's nbd. – Adal 14/7, 2015 at 2:5

Multiplication of native word sized integers is O(1) on many modern CPUs. – Ectophyte 1/6, 2021 at 11:52

Just for the record, another approach is to use the prime decomposition. If every factor of the decomposition is even, then the number is a perfect square. So what you want is to see if a number can be decomposed as a product of squares of prime numbers. Of course, you don't need to obtain such a decomposition, just to see if it exists.

First build a table of squares of prime numbers which are lower than 2^32. This is far smaller than a table of all integers up to this limit.

A solution would then be like this:

boolean isPerfectSquare(long number)
{
    if (number < 0) return false;
    if (number < 2) return true;

    for (int i = 0; ; i++)
    {
        long square = squareTable[i];
        if (square > number) return false;
        while (number % square == 0)
        {
            number /= square;
        }
        if (number == 1) return true;
    }
}

I guess it's a bit cryptic. What it does is checking in every step that the square of a prime number divide the input number. If it does then it divides the number by the square as long as it is possible, to remove this square from the prime decomposition. If by this process, we came to 1, then the input number was a decomposition of square of prime numbers. If the square becomes larger than the number itself, then there is no way this square, or any larger squares, can divide it, so the number can not be a decomposition of squares of prime numbers.

Given nowadays' sqrt done in hardware and the need to compute prime numbers here, I guess this solution is way slower. But it should give better results than solution with sqrt which won't work over 2^54, as says mrzl in his answer.

Sardanapalus answered 17/11, 2008 at 13:43 Comment(4)

integer division is slower than FP sqrt on current hardware. This idea has no chance. >.< Even in 2008, Core2's sqrtsd throughput is one per 6-58c. Its idiv is one per 12-36cycles. (latencies similar to throughputs: neither unit is pipelined). – Pilsner 21/8, 2015 at 2:22

sqrt doesn't need to be perfectly accurate. That's why you check by integer-squaring the result and doing an integer-compare to decide if the input integer had an exact integer sqrt. – Pilsner 21/8, 2015 at 2:23

This may not be the winner but I like it because it's a different approach. – Lichenin 19/6, 2021 at 22:7

Your solution to check if a number is square is to... factor it?? Do you know how hard factoring can be? – Indignant 2/8, 2022 at 22:15

The following simplification of maaartinus's solution appears to shave a few percentage points off the runtime, but I'm not good enough at benchmarking to produce a benchmark I can trust:

long goodMask; // 0xC840C04048404040 computed below
{
    for (int i=0; i<64; ++i) goodMask |= Long.MIN_VALUE >>> (i*i);
}

public boolean isSquare(long x) {
    // This tests if the 6 least significant bits are right.
    // Moving the to be tested bit to the highest position saves us masking.
    if (goodMask << x >= 0) return false;
    // Remove an even number of trailing zeros, leaving at most one.
    x >>= (Long.numberOfTrailingZeros(x) & (-2);
    // Repeat the test on the 6 least significant remaining bits.
    if (goodMask << x >= 0 | x <= 0) return x == 0;
    // Do it in the classical way.
    // The correctness is not trivial as the conversion from long to double is lossy!
    final long tst = (long) Math.sqrt(x);
    return tst * tst == x;
}

It would be worth checking how omitting the first test,

if (goodMask << x >= 0) return false;

would affect performance.

Sake answered 17/11, 2008 at 13:43 Comment(1)

The results are here. Removing the first test is bad as it solves most cases pretty cheaply. The source is in my answer (updated). – Valero 16/7, 2014 at 11:45

It's been pointed out that the last d digits of a perfect square can only take on certain values. The last d digits (in base b) of a number n is the same as the remainder when n is divided by b^d, ie. in C notation n % pow(b, d).

This can be generalized to any modulus m, ie. n % m can be used to rule out some percentage of numbers from being perfect squares. The modulus you are currently using is 64, which allows 12, ie. 19% of remainders, as possible squares. With a little coding I found the modulus 110880, which allows only 2016, ie. 1.8% of remainders as possible squares. So depending on the cost of a modulus operation (ie. division) and a table lookup versus a square root on your machine, using this modulus might be faster.

By the way if Java has a way to store a packed array of bits for the lookup table, don't use it. 110880 32-bit words is not much RAM these days and fetching a machine word is going to be faster than fetching a single bit.

Bathsheba answered 17/11, 2008 at 13:43 Comment(4)

Nice. Did you work this out algebraically or by trial and error? I can see why it's so effective - lots of collisions between perfect squares, e.g. 333^2 % 110880 == 3^2, 334^2 % 110880 == 26^2, 338^2 % 110880 == 58^2... – Regnant 7/1, 2010 at 12:28

IIRC it was brute force, but note that 110880 = 2^5 * 3^2 * 5 * 7 * 11, which gives 6*3*2*2*2 - 1 = 143 proper divisors. – Bathsheba 19/1, 2010 at 1:18

I found that because of the limitations of the lookup, 44352 works better, with a 2.6% pass rate. At least in my implementation. – Rosio 4/12, 2011 at 5:57

Integer division (idiv) is equal or worse in cost to FP sqrt (sqrtsd) on current x86 hardware. Also, completely disagree with avoiding bitfields. Cache hit rate will be tons better with a bitfield, and testing a bit in a bitfield is only one or two more simple instructions than testing a whole byte. (For tiny tables that fit in cache even as non-bitfields, a byte array would be best, not 32bit ints. x86 has single-byte access with equal speed to 32bit dword.) – Pilsner 21/8, 2015 at 2:13

For performance, you very often have to do some compromsies. Others have expressed various methods, however, you noted Carmack's hack was faster up to certain values of N. Then, you should check the "n" and if it is less than that number N, use Carmack's hack, else use some other method described in the answers here.

Scraperboard answered 17/11, 2008 at 13:43 Comment(1)

I've incorporated your suggestion into the solution too. Also, nice handle. :) – Brunildabruning 5/12, 2008 at 19:56

This is the fastest Java implementation I could come up with, using a combination of techniques suggested by others in this thread.

Mod-256 test
Inexact mod-3465 test (avoids integer division at the cost of some false positives)
Floating-point square root, round and compare with input value

I also experimented with these modifications but they did not help performance:

Additional mod-255 test
Dividing the input value by powers of 4
Fast Inverse Square Root (to work for high values of N it needs 3 iterations, enough to make it slower than the hardware square root function.)

public class SquareTester {

    public static boolean isPerfectSquare(long n) {
        if (n < 0) {
            return false;
        } else {
            switch ((byte) n) {
            case -128: case -127: case -124: case -119: case -112:
            case -111: case -103: case  -95: case  -92: case  -87:
            case  -79: case  -71: case  -64: case  -63: case  -60:
            case  -55: case  -47: case  -39: case  -31: case  -28:
            case  -23: case  -15: case   -7: case    0: case    1:
            case    4: case    9: case   16: case   17: case   25:
            case   33: case   36: case   41: case   49: case   57:
            case   64: case   65: case   68: case   73: case   81:
            case   89: case   97: case  100: case  105: case  113:
            case  121:
                long i = (n * INV3465) >>> 52;
                if (! good3465[(int) i]) {
                    return false;
                } else {
                    long r = round(Math.sqrt(n));
                    return r*r == n; 
                }
            default:
                return false;
            }
        }
    }

    private static int round(double x) {
        return (int) Double.doubleToRawLongBits(x + (double) (1L << 52));
    }

    /** 3465<sup>-1</sup> modulo 2<sup>64</sup> */
    private static final long INV3465 = 0x8ffed161732e78b9L;

    private static final boolean[] good3465 =
        new boolean[0x1000];

    static {
        for (int r = 0; r < 3465; ++ r) {
            int i = (int) ((r * r * INV3465) >>> 52);
            good3465[i] = good3465[i+1] = true;
        }
    }

}

Regnant answered 17/11, 2008 at 13:43 Comment(0)

Project Euler is mentioned in the tags and many of the problems in it require checking numbers >> 2^64. Most of the optimizations mentioned above don't work easily when you are working with an 80 byte buffer.

I used java BigInteger and a slightly modified version of Newton's method, one that works better with integers. The problem was that exact squares n^2 converged to (n-1) instead of n because n^2-1 = (n-1)(n+1) and the final error was just one step below the final divisor and the algorithm terminated. It was easy to fix by adding one to the original argument before computing the error. (Add two for cube roots, etc.)

One nice attribute of this algorithm is that you can immediately tell if the number is a perfect square - the final error (not correction) in Newton's method will be zero. A simple modification also lets you quickly calculate floor(sqrt(x)) instead of the closest integer. This is handy with several Euler problems.

Stain answered 17/11, 2008 at 13:43 Comment(1)

I was thinking the same thing about these algorithms not translating well to multi-precision buffers. So thought I'd stick this here... I actually found a probabilistic squaredness test with better asymptotic complexity for huge numbers..... where number theory applications not uncommonly find themselves. Not familiar with Project Euler though... looks interesting. – Adal 14/7, 2015 at 2:27

You should get rid of the 2-power part of N right from the start.

2nd Edit The magical expression for m below should be

m = N - (N & (N-1));

and not as written

End of 2nd edit

m = N & (N-1); // the lawest bit of N
N /= m;
byte = N & 0x0F;
if ((m % 2) || (byte !=1 && byte !=9))
  return false;

1st Edit:

Minor improvement:

m = N & (N-1); // the lawest bit of N
N /= m;
if ((m % 2) || (N & 0x07 != 1))
  return false;

End of 1st edit

Now continue as usual. This way, by the time you get to the floating point part, you already got rid of all the numbers whose 2-power part is odd (about half), and then you only consider 1/8 of whats left. I.e. you run the floating point part on 6% of the numbers.

Wrightson answered 17/11, 2008 at 13:43 Comment(0)

Considering for general bit length (though I have used specific type here), I tried to design simplistic algo as below. Simple and obvious check for 0,1,2 or <0 is required initially. Following is simple in sense that it doesn't try to use any existing maths functions. Most of the operator can be replaced with bit-wise operators. I haven't tested with any bench mark data though. I'm neither expert at maths or computer algorithm design in particular, I would love to see you pointing out problem. I know there is lots of improvement chances there.

int main()
{
    unsigned int c1=0 ,c2 = 0;  
    unsigned int x = 0;  
    unsigned int p = 0;  
    int k1 = 0;  
    scanf("%d",&p);  
    if(p % 2 == 0) {  
        x = p/2; 
    }  
    else {  
        x = (p/2) +1;  
    }  
    while(x) 
    {
        if((x*x) > p) {  
            c1 = x;  
            x = x/2; 
        }else {  
            c2 = x;  
            break;  
        }  
    }  
    if((p%2) != 0)  
        c2++;

    while(c2 < c1) 
    {  
        if((c2 * c2 ) == p) {  
            k1 = 1;  
            break;  
        }  
        c2++; 
    }  
    if(k1)  
        printf("\n Perfect square for %d", c2);  
    else  
        printf("\n Not perfect but nearest to :%d :", c2);  
    return 0;  
}

Muezzin answered 17/11, 2008 at 13:43 Comment(1)

@Kip: Some problem with my browser. – Muezzin 18/12, 2010 at 15:6

I like the idea to use an almost correct method on some of the input. Here is a version with a higher "offset". The code seems to work and passes my simple test case.

Just replace your:

if(n < 410881L){...}

code with this one:

if (n < 11043908100L) {
    //John Carmack hack, converted to Java.
    // See: http://www.codemaestro.com/reviews/9
    int i;
    float x2, y;

    x2 = n * 0.5F;
    y = n;
    i = Float.floatToRawIntBits(y);
    //using the magic number from 
    //http://www.lomont.org/Math/Papers/2003/InvSqrt.pdf
    //since it more accurate
    i = 0x5f375a86 - (i >> 1);
    y = Float.intBitsToFloat(i);
    y = y * (1.5F - (x2 * y * y));
    y = y * (1.5F - (x2 * y * y)); //Newton iteration, more accurate

    sqrt = Math.round(1.0F / y);
} else {
    //Carmack hack gives incorrect answer for n >= 11043908100.
    sqrt = (long) Math.sqrt(n);
}

Rutherfordium answered 17/11, 2008 at 13:43 Comment(0)

The sqrt call is not perfectly accurate, as has been mentioned, but it's interesting and instructive that it doesn't blow away the other answers in terms of speed. After all, the sequence of assembly language instructions for a sqrt is tiny. Intel has a hardware instruction, which isn't used by Java I believe because it doesn't conform to IEEE.

So why is it slow? Because Java is actually calling a C routine through JNI, and it's actually slower to do so than to call a Java subroutine, which itself is slower than doing it inline. This is very annoying, and Java should have come up with a better solution, ie building in floating point library calls if necessary. Oh well.

In C++, I suspect all the complex alternatives would lose on speed, but I haven't checked them all. What I did, and what Java people will find usefull, is a simple hack, an extension of the special case testing suggested by A. Rex. Use a single long value as a bit array, which isn't bounds checked. That way, you have 64 bit boolean lookup.

typedef unsigned long long UVLONG
UVLONG pp1,pp2;

void init2() {
  for (int i = 0; i < 64; i++) {
    for (int j = 0; j < 64; j++)
      if (isPerfectSquare(i * 64 + j)) {
    pp1 |= (1 << j);
    pp2 |= (1 << i);
    break;
      }
   }
   cout << "pp1=" << pp1 << "," << pp2 << "\n";  
}


inline bool isPerfectSquare5(UVLONG x) {
  return pp1 & (1 << (x & 0x3F)) ? isPerfectSquare(x) : false;
}

The routine isPerfectSquare5 runs in about 1/3 the time on my core2 duo machine. I suspect that further tweaks along the same lines could reduce the time further on average, but every time you check, you are trading off more testing for more eliminating, so you can't go too much farther on that road.

Certainly, rather than having a separate test for negative, you could check the high 6 bits the same way.

Note that all I'm doing is eliminating possible squares, but when I have a potential case I have to call the original, inlined isPerfectSquare.

The init2 routine is called once to initialize the static values of pp1 and pp2. Note that in my implementation in C++, I'm using unsigned long long, so since you're signed, you'd have to use the >>> operator.

There is no intrinsic need to bounds check the array, but Java's optimizer has to figure this stuff out pretty quickly, so I don't blame them for that.

Corymb answered 17/11, 2008 at 13:43 Comment(2)

I'd bet you're wrong twice. 1. Intel sqrt conforms to IEEE. The only non-conforming instructions are the goniometrical instructions for lange arguments. 2. Java uses intrinsics for Math.sqrt, no JNI. – Valero 8/9, 2013 at 19:18

Didn't you forget to use pp2? I understand that pp1 is used for testing the six least significant bits, but I don't believe that testing the next six bits makes any sense. – Valero 20/2, 2018 at 16:34

This a rework from decimal to binary of the old Marchant calculator algorithm (sorry, I don't have a reference), in Ruby, adapted specifically for this question:

def isexactsqrt(v)
    value = v.abs
    residue = value
    root = 0
    onebit = 1
    onebit <<= 8 while (onebit < residue)
    onebit >>= 2 while (onebit > residue)
    while (onebit > 0)
        x = root + onebit
        if (residue >= x) then
            residue -= x
            root = x + onebit
        end
        root >>= 1
        onebit >>= 2
    end
    return (residue == 0)
end

Here's a workup of something similar (there may be coding style/smells or clunky O/O - it's the algorithm that counts, and C++ is not my home language). In this case, we're looking for residue == 0:

#include <iostream>  

using namespace std;  
typedef unsigned long long int llint;

class ISqrt {           // Integer Square Root
    llint value;        // Integer whose square root is required
    llint root;         // Result: floor(sqrt(value))
    llint residue;      // Result: value-root*root
    llint onebit, x;    // Working bit, working value

public:

    ISqrt(llint v = 2) {    // Constructor
        Root(v);            // Take the root 
    };

    llint Root(llint r) {   // Resets and calculates new square root
        value = r;          // Store input
        residue = value;    // Initialise for subtracting down
        root = 0;           // Clear root accumulator
        
        onebit = 1;                 // Calculate start value of counter
        onebit <<= (8*sizeof(llint)-2);         // Set up counter bit as greatest odd power of 2 
        while (onebit > residue) {onebit >>= 2; };  // Shift down until just < value
        
        while (onebit > 0) {
            x = root ^ onebit;          // Will check root+1bit (root bit corresponding to onebit is always zero)
            if (residue >= x) {         // Room to subtract?
                residue -= x;           // Yes - deduct from residue
                root = x + onebit;      // and step root
            };
            root >>= 1;
            onebit >>= 2;
        };
        return root;                    
    };
    llint Residue() {           // Returns residue from last calculation
        return residue;                 
    };
};

int main() {
    llint big, i, q, r, v, delta;
    big = 0; big = (big-1);         // Kludge for "big number"
    ISqrt b;                            // Make q sqrt generator
    for ( i = big; i > 0 ; i /= 7 ) {   // for several numbers
        q = b.Root(i);                  // Get the square root
        r = b.Residue();                // Get the residue
        v = q*q+r;                      // Recalc original value
        delta = v-i;                    // And diff, hopefully 0
        cout << i << ": " << q << " ++ " << r << " V: " << v << " Delta: " << delta << "\n";
    };
    return 0;
};

Meit answered 17/11, 2008 at 13:43 Comment(6)

The number of iterations looks O(ln n), where n is the bit-length of v, so I doubt this will save much for larger v. Floating point sqrt is slow, maybe 100-200 cycles, but integer math isn't free either. A dozen iterations with 15 cycles each, and it'd be a wash. Still, +1 for being interesting. – Marine 2/1, 2009 at 0:24

Actually, I believe the additions and subtractions can be done by XOR. – Meit 2/1, 2009 at 8:12

That was a daft comment - only the addition can be done by an XOR; the subtraction is arithmetic. – Meit 2/1, 2009 at 9:13

Is there really any substantive difference between the run time of XOR and addition anyway? – Marine 2/1, 2009 at 22:59

@Tadmas: probably not enough to break the "optimise later" rule. (:-) – Meit 4/1, 2009 at 0:17

@Marine No. While XOR is much simpler and faster in hardware, the cycle of any CPU is more than long enough for the addition. So you can't save anything unless you build your own circuit. +++ The slowest operation is actually the branching (pipeline work lost due to branch mispredictions) and this may be optimized away (a good compiler does it already). – Valero 20/2, 2018 at 16:27

I checked all of the possible results when the last n bits of a square is observed. By successively examining more bits, up to 5/6th of inputs can be eliminated. I actually designed this to implement Fermat's Factorization algorithm, and it is very fast there.

public static boolean isSquare(final long val) {
   if ((val & 2) == 2 || (val & 7) == 5) {
     return false;
   }
   if ((val & 11) == 8 || (val & 31) == 20) {
     return false;
   }

   if ((val & 47) == 32 || (val & 127) == 80) {
     return false;
   }

   if ((val & 191) == 128 || (val & 511) == 320) {
     return false;
   }

   // if((val & a == b) || (val & c == d){
   //   return false;
   // }

   if (!modSq[(int) (val % modSq.length)]) {
        return false;
   }

   final long root = (long) Math.sqrt(val);
   return root * root == val;
}

The last bit of pseudocode can be used to extend the tests to eliminate more values. The tests above are for k = 0, 1, 2, 3

a is of the form (3 << 2k) - 1

b is of the form (2 << 2k)

c is of the form (2 << 2k + 2) - 1

d is of the form (2 << 2k - 1) * 10

It first tests whether it has a square residual with moduli of power of two, then it tests based on a final modulus, then it uses the Math.sqrt to do a final test. I came up with the idea from the top post, and attempted to extend upon it. I appreciate any comments or suggestions.

Update: Using the test by a modulus, (modSq) and a modulus base of 44352, my test runs in 96% of the time of the one in the OP's update for numbers up to 1,000,000,000.

Rosio answered 17/11, 2008 at 13:43 Comment(0)

Here is a divide and conquer solution.

If the square root of a natural number (number) is a natural number (solution), you can easily determine a range for solution based on the number of digits of number:

number has 1 digit: solution in range = 1 - 4
number has 2 digits: solution in range = 3 - 10
number has 3 digits: solution in range = 10 - 40
number has 4 digits: solution in range = 30 - 100
number has 5 digits: solution in range = 100 - 400

Notice the repetition?

You can use this range in a binary search approach to see if there is a solution for which:

number == solution * solution

Here is the code

Here is my class SquareRootChecker

public class SquareRootChecker {

    private long number;
    private long initialLow;
    private long initialHigh;

    public SquareRootChecker(long number) {
        this.number = number;

        initialLow = 1;
        initialHigh = 4;
        if (Long.toString(number).length() % 2 == 0) {
            initialLow = 3;
            initialHigh = 10;
        }
        for (long i = 0; i < Long.toString(number).length() / 2; i++) {
            initialLow *= 10;
            initialHigh *= 10;
        }
        if (Long.toString(number).length() % 2 == 0) {
            initialLow /= 10;
            initialHigh /=10;
        }
    }

    public boolean checkSquareRoot() {
        return findSquareRoot(initialLow, initialHigh, number);
    }

    private boolean findSquareRoot(long low, long high, long number) {
        long check = low + (high - low) / 2;
        if (high >= low) {
            if (number == check * check) {
                return true;
            }
            else if (number < check * check) {
                high = check - 1;
                return findSquareRoot(low, high, number);
            }
            else  {
                low = check + 1;
                return findSquareRoot(low, high, number);
            }
        }
        return false;
    }

}

And here is an example on how to use it.

long number =  1234567;
long square = number * number;
SquareRootChecker squareRootChecker = new SquareRootChecker(square);
System.out.println(square + ": " + squareRootChecker.checkSquareRoot()); //Prints "1524155677489: true"

long notSquare = square + 1;
squareRootChecker = new SquareRootChecker(notSquare);
System.out.println(notSquare + ": " + squareRootChecker.checkSquareRoot()); //Prints "1524155677490: false"

Harleyharli answered 17/11, 2008 at 13:43 Comment(1)

I love the concept, but I would like to politely point out a major flaw: numbers are in base 2 binary. Converting base 2 to base 10 via toString is an incredibly expensive operation compared to bitwise operators. Thus, to satisfy the objective of the question--performance--you must use bitwise operators instead of base 10 strings. Again, I really like your concept. Notwithstanding, your implementation (as it stands now) is by far the slowest out of all the possible solutions posted for the question. – Shanonshanta 23/1, 2020 at 21:45

This question got me wondering, so I did some simple coding and I'm presenting it here because I think it's interesting, relevant, but I don't know how useful. There's a simple algorithm

a_n+1 = (a_n + x/a_n)/2

for calculating square roots, but it's meant to be used for decimals. I wondered what would happen if I just coded the same algorithm using integer maths. Would it even converge on the right answer? I didn't know, so I wrote a program...

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <math.h>

_Bool isperfectsquare(uint64_t x, uint64_t *isqrtx) {
  // NOTE: isqrtx approximate for non-squares. (benchmarked at 162ns 3GHz i5)
  uint32_t i;
  uint64_t ai;
  ai = 1 + ((x & 0xffff000000000000) >> 32) + ((x & 0xffff00000000) >> 24) + ((x & 0xffff0000) >> 16);
  ai = (ai + x/ai)/2;
  ai = (ai + x/ai)/2;
  ai = (ai + x/ai)/2;
  ai = (ai + x/ai)/2;
  ai = (ai + x/ai)/2;
  ai = (ai + x/ai)/2;
  ai = (ai + x/ai)/2;
  ai = (ai + x/ai)/2;
  ai = (ai + x/ai)/2;
  ai = (ai + x/ai)/2;
  ai = (ai + x/ai)/2;
  ai = (ai + x/ai)/2;
  ai = ai & 0xffffffff;
  if (isqrtx != NULL) isqrtx[0] = ai;
  return ai*ai == x;
}

void main() {

  uint64_t x, isqrtx;
  uint64_t i;
  for (i=1; i<0x100000000; i++) {
    if (!isperfectsquare(i*i, &isqrtx)) {
      printf("Failed at %li", i);
      exit(1);
    }
  }
  printf("All OK.\n");
}

So, it turns out that 12 iterations of the formula is enough to give correct results for all 64 bit unsigned longs that are perfect squares, and of course, non-squares will return false.

simon@simon-Inspiron-N5040:~$ time ./isqrt.bin 
All OK.

real    11m37.096s
user    11m35.053s
sys 0m0.272s

So 697s/2^32 is approx 162ns. As it is, the function will have the same runtime for all inputs. Some of the measures detailed elsewhere in the discussion could speed it up for non-squares by checking the last four bits etc. Hope someone finds this interesting as I did.

Jan 2024 EDIT:

Since it isn't feasible for mere mortals to exhaustively test that all input values give correct output values to a 64 bit integer square root function, it is prudent to use a simple algorithm that's easy to understand so one can see from the code that it must work. I made a better and faster version of the above code. This code returns integer square roots that can be trusted to be correct.

static inline uint32_t isqrtu64(uint64_t x) {
  // Author: Simon Goater 11 Jan 2024
  // Performs binary search on sequence of square integers to return floor(sqrt(x)).
  // No division, lzcnt (A.K.A. clz) instruction, fp calculations, 128+bit integers, or look-up table required.
  // Approx. 34 ns (depenedent loop) Ivy Bridge (2012) Xeon 3.2GHz.
  // Approx. 21 ns (indepenedent loop) Ivy Bridge (2012) Xeon 3.2GHz.
  uint64_t res = 0, c = 0x80000000, d = c;
  for (int i=0; i<32; i++) {
    if (x >= c*c) { 
      res += d;
      d >>= 1;
      c += d;
    } else {
      d >>= 1;
      c -= d;
    }
  }
  return res;
}

_Bool isperfectsquare(uint64_t x, uint64_t *isqrtx) {
  uint64_t rt;
  if (isqrtx != NULL) {
    rt = isqrtu64(x);
    *isqrtx = rt;
  } else {
    uint64_t nibble = x & 0xf;
    if ((nibble == 0) || (nibble == 1) 
    || (nibble == 4) || (nibble == 9)) {
      rt = isqrtu64(x);
    } else {
      return false;
    }
  }
  return (rt*rt) == x;
}

Hellenhellene answered 17/11, 2008 at 13:43 Comment(2)

Yep, this is a fairly common approach. (It's what Java's BigInteger uses under the hood, for example.) See en.wikipedia.org/wiki/… for more. For 64-bit integers, it's possible to reduce the number of divisions to just 2 (at the expense of a lookup table). See https://mcmap.net/q/46060/-is-there-a-non-looping-unsigned-32-bit-integer-square-root-function-c – Tuddor 13/11, 2022 at 9:25

You appear to be something of an authority on this topic. Thanks for the link. I think the comprehensive answer you posted on the link deserves more credit and attention. – Hellenhellene 13/11, 2022 at 10:26

Square Root of a number, given that the number is a perfect square.

The complexity is log(n)

/**
 * Calculate square root if the given number is a perfect square.
 * 
 * Approach: Sum of n odd numbers is equals to the square root of n*n, given 
 * that n is a perfect square.
 *
 * @param number
 * @return squareRoot
 */

public static int calculateSquareRoot(int number) {

    int sum=1;
    int count =1;
    int squareRoot=1;
    while(sum<number) {
        count+=2;
        sum+=count;
        squareRoot++;
    }
    return squareRoot;
}

Vivisection answered 17/11, 2008 at 13:43 Comment(1)

I don't think that complexity is correct. The complexity should be the number of squares less than or equal to n, or approximately O(sqrt(n)). – Memento 15/12, 2020 at 15:40

Newton's Method with integer arithmetic

If you wish to avoid non-integer operations you could use the method below. It basically uses Newton's Method modified for integer arithmetic.

/**
 * Test if the given number is a perfect square.
 * @param n Must be greater than 0 and less
 *    than Long.MAX_VALUE.
 * @return <code>true</code> if n is a perfect
 *    square, or <code>false</code> otherwise.
 */
public static boolean isSquare(long n)
{
    long x1 = n;
    long x2 = 1L;

    while (x1 > x2)
    {
        x1 = (x1 + x2) / 2L;
        x2 = n / x1;
    }

    return x1 == x2 && n % x1 == 0L;
}

This implementation can not compete with solutions that use Math.sqrt. However, its performance can be improved by using the filtering mechanisms described in some of the other posts.

Incoordination answered 17/11, 2008 at 13:43 Comment(0)

Here is the simplest and most concise way, although I do not know how it compares in terms of CPU cycles. This works great if you only wish to know if the root is a whole number. If you really care if it is an integer, you can also figure that out. Here is a simple (and pure) function:

private static final MathContext precision = new MathContext(20);

private static final Function<Long, Boolean> isRootWhole = (n) -> {
    long digit = n % 10;
    if (digit == 2 || digit == 3 || digit == 7 || digit == 8) {
        return false;
    }
    return new BigDecimal(n).sqrt(precision).scale() == 0;
};

If you do not need micro-optimization, this answer is better in terms of simplicity and maintainability. If you will be calculating negative numbers, you will need to handle that accordingly, and send the absolute value into the function. I have included a minor optimization because no perfect squares have a tens digit of 2, 3, 7, or 8 due to quadratic residues mod 10.

On my CPU, a run of this algorithm on 0 - 10,000,000 took an average of 1000 - 1100 nanoseconds per calculation.

If you are performing a lesser number of calculations, the earlier calculations take a bit longer.

I had a negative comment that my previous edit did not work for large numbers. The OP mentioned Longs, and the largest perfect square that is a Long is 9223372030926249001, so this method works for all Longs.

Dray answered 17/11, 2008 at 13:43 Comment(4)

I think you mean (long) Math.pow(roundedRoot, 2) – Kalpak 4/12, 2017 at 12:52

I updated this soltution/suggestion to be the simplest way to figure it out. I wonder how it would compare in terms of benchmark against some of the custom solutions. – Dray 5/12, 2017 at 11:3

Your method returns true for 10_000_000_000_000_000L and 10_000_000_000_000_001L which proves that it't wrong. – Incoordination 13/10, 2018 at 21:44

@Incoordination I have changed my answer to deal with your accurate assessment of my previous attempt. If you downvoted my comment, please reconsider. Thanks! – Dray 15/2, 2021 at 18:32

Calculating square roots by Newton's method is horrendously fast ... provided that the starting value is reasonable. However there is no reasonable starting value, and in practice we end with bisection and log(2^64) behaviour.
To be really fast we need a fast way to get at a reasonable starting value, and that means we need to descend into machine language. If a processor provides an instruction like POPCNT in the Pentium, that counts the leading zeroes we can use that to have a starting value with half the significant bits. With care we can find a a fixed number of Newton steps that will always suffice. (Thus foregoing the need to loop and have very fast execution.)

A second solution is going via the floating point facility, which may have a fast sqrt calculation (like the i87 coprocessor.) Even an excursion via exp() and log() may be faster than Newton degenerated into a binary search. There is a tricky aspect to this, a processor dependant analysis of what and if refinement afterwards is necessary.

A third solution solves a slightly different problem, but is well worth mentionning because the situation is described in the question. If you want to calculate a great many square roots for numbers that differ slightly, you can use Newton iteration, if you never reinitialise the starting value, but just leave it where the previous calculation left off. I've used this with success in at least one Euler problem.

Brakesman answered 17/11, 2008 at 13:43 Comment(2)

Getting a good estimate is not too hard. You can use the number of digits of the number to estimate a lower and upper bound for the solution. See also my answer where I propose a divide and conquer solution. – Harleyharli 27/12, 2018 at 17:58

What is the difference between POPCNT and counting the number of digits? Except that you can do POPCNT in one nanosecond. – Brakesman 29/12, 2018 at 16:27

It ought to be possible to pack the 'cannot be a perfect square if the last X digits are N' much more efficiently than that! I'll use java 32 bit ints, and produce enough data to check the last 16 bits of the number - that's 2048 hexadecimal int values.

...

Ok. Either I have run into some number theory that is a little beyond me, or there is a bug in my code. In any case, here is the code:

public static void main(String[] args) {
    final int BITS = 16;

    BitSet foo = new BitSet();

    for(int i = 0; i< (1<<BITS); i++) {
        int sq = (i*i);
        sq = sq & ((1<<BITS)-1);
        foo.set(sq);
    }

    System.out.println("int[] mayBeASquare = {");

    for(int i = 0; i< 1<<(BITS-5); i++) {
        int kk = 0;
        for(int j = 0; j<32; j++) {
            if(foo.get((i << 5) | j)) {
                kk |= 1<<j;
            }
        }
        System.out.print("0x" + Integer.toHexString(kk) + ", ");
        if(i%8 == 7) System.out.println();
    }
    System.out.println("};");
}

and here are the results:

(ed: elided for poor performance in prettify.js; view revision history to see.)

Campbellbannerman answered 17/11, 2008 at 13:43 Comment(0)

If speed is a concern, why not partition off the most commonly used set of inputs and their values to a lookup table and then do whatever optimized magic algorithm you have come up with for the exceptional cases?

Rother answered 17/11, 2008 at 13:43 Comment(1)

The problem is that there is no "commonly used set of inputs"--usually I'm iterating through a list, so I won't use the same inputs twice. – Brunildabruning 6/1, 2009 at 14:40

May be the best algorithm for the problem is a fast integer square root algorithm https://mcmap.net/q/46061/-looking-for-an-efficient-integer-square-root-algorithm-for-arm-thumb2

There @Kde claims that three iterations of the Newton method would be sufficient for the accuracy of ±1 for 32 bit integers. Certainly, more iterations are needed for 64-bit integers, may be 6 or 7.

Controller answered 17/11, 2008 at 13:43 Comment(0)

"I'm looking for the fastest way to determine if a long value is a perfect square (i.e. its square root is another integer)."

The answers are impressive, but I failed to see a simple check :

check whether the first number on the right of the long it a member of the set (0,1,4,5,6,9) . If it is not, then it cannot possibly be a 'perfect square' .

eg.

4567 - cannot be a perfect square.

Gaskins answered 17/11, 2008 at 13:43 Comment(3)

actually this has been suggested, only in different bases. Checking the last base-10 digit requires taking n%10, which is a division (and thus expensive). Besides, this would only eliminate 40% of possible values. In base-16, you can find the last hex-digit with n&0xf, which is a very fast bit-wise operation. In base 16, the last digit of a perfect square has to be 0, 1, 4, or 9, which means that 75% of numbers are eliminated by that check. – Brunildabruning 24/9, 2009 at 13:42

This is the line that checks that the last hex digit is either 0, 1, 4, or 9, though it's using some optimized bit tricks to do it: if( n < 0 || ((n&2) != 0) || ((n & 7) == 5) || ((n & 11) == 8) ) – Brunildabruning 24/9, 2009 at 13:44

Perhaps the following will be faster? Change: if( n < 0 || ((n&2) != 0) || ((n & 7) == 5) || ((n & 11) == 8) ) return false; if( n == 0 ) return true; into: if( n == 0 ) return true; if( n < 0 || !( ( n&7 == 1 ) || n==4 ) ) return false; – Gaskins 25/9, 2009 at 9:47

Regarding the Carmac method, it seems like it would be quite easy just to iterate once more, which should double the number of digits of accuracy. It is, after all, an extremely truncated iterative method -- Newton's, with a very good first guess.

Regarding your current best, I see two micro-optimizations:

move the check vs. 0 after the check using mod255
rearrange the dividing out powers of four to skip all the checks for the usual (75%) case.

I.e:

// Divide out powers of 4 using binary search

if((n & 0x3L) == 0) {
  n >>=2;

  if((n & 0xffffffffL) == 0)
    n >>= 32;
  if((n & 0xffffL) == 0)
      n >>= 16;
  if((n & 0xffL) == 0)
      n >>= 8;
  if((n & 0xfL) == 0)
      n >>= 4;
  if((n & 0x3L) == 0)
      n >>= 2;
}

Even better might be a simple

while ((n & 0x03L) == 0) n >>= 2;

Obviously, it would be interesting to know how many numbers get culled at each checkpoint -- I rather doubt the checks are truly independent, which makes things tricky.

Chromogenic answered 17/11, 2008 at 13:43 Comment(0)

If you want speed, given that your integers are of finite size, I suspect that the quickest way would involve (a) partitioning the parameters by size (e.g. into categories by largest bit set), then checking the value against an array of perfect squares within that range.

Verniavernice answered 17/11, 2008 at 13:43 Comment(3)

There are 2^32 perfect squares in the range of a long. This table would be huge. Also, the advantage of computing the value over a memory access could be huge. – Unintelligible 17/11, 2008 at 14:54

Oh no there aren't, there are 2^16. 2^32 is 2^16 squared. There are 2^16. – Verniavernice 18/11, 2008 at 12:4

yes, but the range of a long is 64 bits, not 32 bits. sqrt(2^64)=2^32. (i'm ignoring the sign bit to make the math a little easier... there are actually (long)(2^31.5)=3037000499 perfect squares) – Brunildabruning 2/12, 2008 at 19:14

-1

Not sure if this is the fastest way, but this is something I stumbled upon (long time ago in high-school) when I was bored and playing with my calculator during math class. At that time, I was really amazed this was working...

public static boolean isIntRoot(int number) {
    return isIntRootHelper(number, 1);
}

private static boolean isIntRootHelper(int number, int index) {
    if (number == index) {
        return true;
    }
    if (number < index) {
        return false;
    }
    else {
        return isIntRootHelper(number - 2 * index, index + 1);
    }
}

Harleyharli answered 17/11, 2008 at 13:43 Comment(2)

Oops, this is an O(N^.5) algorithm, so it is really bad regards speed, and lasts forever for 63 bit numbers that may be input. I changed my upvode to a downvote. What was I thinking when I upvoted it. At least the idea behind it is correct, but I didn't test it. – Brakesman 27/12, 2018 at 11:57

the idea is based on square numbers being of form 1 + 3 + 5 + ... unfortunately this is very slow, with sqrt(N) iterations, as opposed to a few machine cycles using FP sqrt. The recursion doesn't help either. – Indignant 2/8, 2022 at 22:28

-5

static boolean isPerfectSquare (int input) {
  return Math.sqrt(input) == (int) Math.sqrt(input);
}

This will return if the integer value of the square root of input is equal to the double value. This means it it is an integer, it will return true. Otherwise, it will return false.

Pekan answered 17/11, 2008 at 13:43 Comment(1)

The OP was asking about long rather than int, and unfortunately for long this approach doesn't work (which is why it doesn't appear in the older answers). The first failure is with input = 4503599761588224, which is not a square, but whose square root calculated with IEEE 754 binary64 floating-point arithmetic is an integer, 67108865.0. – Tuddor 25/2, 2022 at 8:18

-5

Don't know about fastest, but the simplest is to take the square root in the normal fashion, multiply the result by itself, and see if it matches your original value.

Since we're talking integers here, the fasted would probably involve a collection where you can just make a lookup.

Hypognathous answered 17/11, 2008 at 13:43 Comment(3)

wouldn't it be faster and cheaper to "take the square root in the normal fashion" and check if its an int? – Panoptic 17/11, 2008 at 14:4

I suppose it depends on whether a mod op or a mult op is faster on your system. – Hypognathous 17/11, 2008 at 15:57

Check my suggestion above. It takes a very simple approach. But you can also use the sqrt function and mod it by 1. I'll add that as another proposed solution. – Dray 5/12, 2017 at 10:50

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Newton's Method with integer arithmetic

Recommended topics

Hot tags