Avg of float inconsistency

The select returns right around 23,000 rows
The except returns between 60 and 200 rows (and not the same rows each time)
The except should return 0 rows, since it is select a except select a

PK: [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]

[tf] is a float, and I understand that float is not exact
But I naively thought avg(float) would be repeatable
Avg(float) does not appear to be repeatable

What is the solution?
TF is between 0 and 1 and I only need about 5 significant digits
I just need avg(TF) to be the same number from run to run
Decimal(9,8) gives me enough precision, and if I cast to decimal(9,8) the except properly returns 0
I can change [TF] to decimal(9,8), but it will be a bit of work and a lot of regression testing, as some of the tests that use [tf] take over a day to run
Is changing [TF] to decimal(9,8) the best solution?

  SELECT [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]
       , avg([FTSindexWordOnce].[tf]) AS [avgTFraw]
    FROM [docSVenum1] 
    JOIN [docFieldLock] 
           ON [docFieldLock].[sID] = [docSVenum1].[sID] 
          AND [docFieldLock].[fieldID] = [docSVenum1].[enumID] 
          AND [docFieldLock].[lockID] IN (4, 5) /* secLvl docAdm */ 
    JOIN [FTSindexWordOnce] 
           ON [FTSindexWordOnce].[sID] = [docSVenum1].[sID]
GROUP BY [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]

except 

  SELECT [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]
       , avg([FTSindexWordOnce].[tf]) AS [avgTFraw]
    FROM [docSVenum1] 
    JOIN [docFieldLock] 
           ON [docFieldLock].[sID] = [docSVenum1].[sID] 
          AND [docFieldLock].[fieldID] = [docSVenum1].[enumID] 
          AND [docFieldLock].[lockID] IN (4, 5) /* secLvl docAdm */ 
    JOIN [FTSindexWordOnce] 
           ON [FTSindexWordOnce].[sID] = [docSVenum1].[sID]
GROUP BY [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID] 

order by [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]

In this case tf is the term frequency of tf-idf
tf normalization is subjective and does not require much precision
Avg(tf) needs to be consistent from select to select or the results are not consistent
In a single select with joins I need a consistent avg(tf)
Going with decimal and a low precision for tf gave consistent results
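
For reference, this is roughly what the cast looks like in the aggregate of the query above (there is nothing special about decimal(9,8), it is simply enough precision for tf):

  SELECT [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]
       , avg(CAST([FTSindexWordOnce].[tf] AS decimal(9,8))) AS [avgTFraw]
    FROM [docSVenum1]
    JOIN [docFieldLock]
           ON [docFieldLock].[sID] = [docSVenum1].[sID]
          AND [docFieldLock].[fieldID] = [docSVenum1].[enumID]
          AND [docFieldLock].[lockID] IN (4, 5) /* secLvl docAdm */
    JOIN [FTSindexWordOnce]
           ON [FTSindexWordOnce].[sID] = [docSVenum1].[sID]
GROUP BY [docSVenum1].[enumID], [docSVenum1].[valueID], [FTSindexWordOnce].[wordID]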

Lunnete answered 24/10, 2015 at 20:8 Comment(7)
What exactly is the problem with having seemingly indeterministic variations that are very, very close to 0? – Hypophosphate
@Frisbee For some people your problem is unclear (based on the close flag). – Throe
@Hypophosphate I just need avg(TF) to be the same number from run to run. – Lunnete
Just cast your float to the appropriate decimal in the query: AVG(CAST([FTSindexWordOnce].[tf] AS decimal(9,8))) – Klusek
@VladimirBaranov Yes, it works, but over the long run would it be more efficient to convert the column? – Lunnete
@Frisbee, you need to measure performance; it is hard to guess. Decimal with low precision (up to 9 digits) uses 5 bytes; higher precisions use more; float uses 8 bytes. But if you don't need high precision, maybe the 4-byte real is enough. Still, correctness should come first and performance second. When you use float or real your algorithms should never compare for equality; always compare floating point values as abs(x - y) < epsilon (see the sketch after these comments). – Klusek
@VladimirBaranov Going down on the input precision actually helped the output consistency. I updated the question. And thanks for the input. – Lunnete
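
A small illustration of the tolerance comparison from the last comment (the epsilon of 1e-9 is only illustrative; pick one that matches the precision you actually need):

DECLARE @x FLOAT = 0
DECLARE @i INT = 0
WHILE (@i < 10)
BEGIN
    SET @x = @x + CONVERT(FLOAT, 0.1)  -- ten additions of float 0.1 do not sum to exactly 1.0
    SET @i = @i + 1
END
SELECT CASE WHEN @x = 1.0             THEN 'exactly equal'
            WHEN ABS(@x - 1.0) < 1e-9 THEN 'equal within tolerance'
            ELSE 'different' END
-- equal within tolerance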

This is very similar to: SELECT SUM(...) is non-deterministic when adding the column-values of datatype float.

The problem is that with an inexact datatype (FLOAT/REAL) the order of arithmetic operations on floating point values matters. Demo from Connect:

DECLARE @fl FLOAT = 100000000000000000000
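-- at 1e20 one unit in the last place of a float is 16384, so each +5000 below is less than half a unit and rounds away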
DECLARE @i SMALLINT = 0
WHILE (@i < 100)
BEGIN
    SET @fl = @fl + CONVERT(float, 5000)
    SET @i = @i + 1
END
SET @fl = @fl - 100000000000000000000
SELECT CONVERT(NVARCHAR(40), @fl, 2)
-- 0.000000000000000e+000
GO

DECLARE @fl FLOAT = 0
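-- starting from 0 the 100 additions are exact (sum 500000); precision is lost only when 1e20 is added and then subtracted below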
DECLARE @i SMALLINT = 0
WHILE (@i < 100)
BEGIN
    SET @fl = @fl + CONVERT(float, 5000)
    SET @i = @i + 1
END
SET @fl = @fl + 100000000000000000000
SET @fl = @fl - 100000000000000000000
SELECT @fl
-- 507904

LiveDemo

Possible solutions:

  • CAST all arguments to an exact datatype like DECIMAL/NUMERIC
  • ALTER the table and change FLOAT to DECIMAL
  • try to force the query optimizer to calculate the sum in the same order

Forcing a serial plan with OPTION (MAXDOP 1) may reduce run-to-run variation when a stable result matters, but it does not guarantee one: the order of evaluation still depends on the execution plan that is chosen; see the sketch below.
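
For example, a minimal sketch of the second and third options against the question's table (decimal(9,8) comes from the question; the NOT NULL and the particular query the hint is attached to are assumptions, adjust to your schema):

-- change the column to an exact type once (everything that reads [tf] needs retesting)
ALTER TABLE [FTSindexWordOnce]
    ALTER COLUMN [tf] decimal(9,8) NOT NULL   -- assumes [tf] is currently NOT NULL

-- discourage plan variation on one query; this may reduce, but does not guarantee, run-to-run differences
SELECT [wordID], AVG([tf]) AS [avgTF]
  FROM [FTSindexWordOnce]
 GROUP BY [wordID]
OPTION (MAXDOP 1)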


It looks like the initial link is dead: WebArchive

Throe answered 24/10, 2015 at 20:31 Comment(10)
The order of evaluation of the inputs to the average will depend on a lot more than serial vs parallel, e.g. the physical access paths used, the join orders used, the join algorithms used, stream aggregate vs hash aggregate. – Prefer
@MartinSmith In the Connect answer it rather means the order between executions of the same query (the same execution plan). Of course, when statistics change and the execution plan is recompiled, it can change too. The only reliable way is to use an appropriate datatype (DECIMAL). – Throe
Thanks, this is used to find near duplicates using cosine tf-idf. That avg([FTSindexWordOnce].[tf]) is called literally billions of times in a 24-hour run. I assume altering the table to decimal is the most efficient? – Lunnete
@Frisbee If I were you I would alter the table, unless of course you have some other reason to use FLOAT. – Throe
I only did it because I thought float would be faster. But it has to be repeatable. Let me test it out before I give it the check. – Lunnete
@Frisbee Take your time and share the performance results, float vs decimal :) – Throe
@lad2025 I am getting no measurable difference in individual selects. In the run that makes billions of calls there are too many other variables. I even thought about hacking in some integer math, since if I scale it up that works at the TF level. But in the final calculation I have to take a square root, so I lost most or all of what I gained and just ended up with some messy code. And thanks. – Lunnete
@Frisbee Is TF an acronym for Term Frequency? I see now, cosine similarity measure. Are you writing a search engine? :) – Throe
@lad2025 Yes. It is embedded in the application but not sold as a search engine product. Using something like Lucene (an outstanding product, I am not putting it down) I just did not have the granularity or speed I needed. If you look me up on datascience.stackexchange.com you will get a feel for what I am doing. – Lunnete
@Frisbee Interesting website to broaden horizons :) – Throe
