How to filter out rows with NaN values in Hive?
Asked Answered
D

3

8

I am running a sum function of one hive table in Hue, and get a return value of NaN.

Here is my code:

select sum(v1) from hivedb.tb1;

I don't know why it is giving me a NaN result. I checked if any of my v1 values are null:

select * from hivedb.tb1 where v1 is null;

, and it turns out no record has null value. The table has 100 million rows, so I can not do a manual check for each record.

  1. Does anybody know why I am getting a NaN result?
  2. And if it is because I have some abnormal value in some rows, how can I find them?

Any help is appreciated. Thank you in advance!

UPDATE 1 I manually screened the first 1000 rows, and luckily spotted some abnormal values of NaN in tb1. It is resulted from some rounding error from the previous steps. So my question 1 is probably answered. Please feel free to comment on it, if you think there could be other reasons.

I still don't know how to use an efficient way to spot the rows with NaN values. So I am still looking forward to any answers to my question #2. Please feel free to share. I appreciate your help.

UPDATE 2 The problem is solved with help in the accepted answer below, in the discussion section. There are multiple ways to deal with it.

  1. Use a condition selection of v1+1 >v1. It will select rows with non NaN values.
  2. Use a condition selection of cast(v1 as String) ='NaN'. It will select rows with NaN values.
Ditch answered 20/4, 2017 at 17:33 Comment(3)
Oh, come on, Null values are ignored when computing SQL aggregates. On the other hand, the sum of 100 million values might overflow the capacity of an Integer or Float or Decimal(p,s). What is the data type of v1, by the way? And did you try sum(cast(v1 as Double))?Technique
@Samson Scharfrichter You are right. I just don't know how to select the abnormal rows. Thank you for the remainder though. Back to your question, v1 is double.Ditch
@SamsonScharfrichter Thank you. I kinda figure it out. By chance, I manually screen thousand of rows, and luckily I spot some NaN values in tb1. It is generated by some rounding error issue. So my first question is kinda answered, which is NaN values existing in tb1. But I still don't know how to deal with my question #2. Please feel free to share if you have any insights.Ditch
T
7

Hive relies on Java (plus SQL-specific semantics for Null and friends), and Java honors the IEEE standard for number semantics. Which means that... NaN is tricky.

Quoting that post...

(Float.NaN == Float.NaN) always returns false.
In fact, if you look at the JDK implementation of Float.isNaN(), a number is not-a-number if it is not equal to itself (which makes sense because a number should be equal to itself).
The same holds for Double.NaN

So, there is no point in showing you how to use the (undocumented) Hive function called reflect2, which allows you to invoke raw Java methods on Hive columns, i.e.

where v1 is not null and not reflect2(v1, "isNaN")

...because -- in theory -- you can simply state:

where v1 is not null and v1=v1

Disclaimer -- I have seen cases where the Hive optimizer makes aggressive "optimizations" and produces wrong results.
In other words, if the simple v1=v1 clause does not filter out the NaN values as expected, then look into reflect2...

Edit -- indeed, the optimizer appears to ignore the v1=v1 clause in some versions of Hive (see comments) so a more devious formula is necessary:

  • v1 +1.0 > v1 should work... except when rounding errors make either abs(v1) << 1 or abs(v1) >> 1
  • other "numeric" tricks will fail similarly in edge cases, especially when v1 =0.0

In the end, the most robust approach appears to try cast(v1 as String) <>'NaN' (because all possible NaN values are displayed as "NaN" even if they are not strictly "equal" in the arithmetical sense).


Side note about reflect2 -- you can see that it is indeed not mentioned in the official Hive doc, while reflect is mentioned (and even has a specific Wiki entry). But it has been implemented as early as Hive V0.11 cf. Hive-4025

Edit -- Java "reflection" is now disabled by default for ODBC / JDBC / Hue connections (see comments), and cannot be re-enabled when using security plug-ins such as ranger or Sentry. So its usage is restricted to the (deprecated) hive CLI.

Technique answered 23/4, 2017 at 20:55 Comment(6)
Hi, @Samson Scharfrichter: Thank you for the kind note. Unfortunately, neither works. I tried both v1=v1 and reflect 2. For v1=v1 test, there is no error, but it is generating wrong result. Here is the error message for reflect 2: Error while compiling statement: FAILED: SemanticException UDF reflect2 is not allowedDitch
Cf. community.hortonworks.com/questions/25828/… about reflect and reflect2 being "black-listed" via JDBC. Can you try out the legacy hive fat client, on an Edge Node?Technique
Did you try something more convoluted than v1=v1, for example (v1 +1.0 > v1 or abs(v1) *1.1 > abs(v1))? (the +1 trick will not work consistently if v1 is not in the 10^-38, 10^+38 range; and the *1.1 trick will not work for zero, but together they might do the job -- in a very inefficient way)Technique
Or maybe something straightforward: cast(v1 as String) <>'NaN' (or whatever you see displayed when browsing a NaN value in Hue)Technique
Hi, @Samson Scharfrichter: Smart thinking. I tried the v1 + 1 thing, and it works. My initial goal is to select rows with NaN values. So in addition to what you've provided, I add another step with join and conditioned selection to reach my goal.Ditch
Hi, @Samson Scharfrichter: I also tried the cast(as String) advice, and it works too. And this way is more direct to reach my goal. Thank you again for your help. Lots of good advice.Ditch
Q
1

You can handle NaN as

SELECT SUM(CAST(IF(v1 ='NaN', 0, v1)) as Double) FROM hivedb.tb1 
Quant answered 21/4, 2017 at 3:52 Comment(2)
I tried your suggestion, and get this AnalysisException: operands of type DOUBLE and STRING are not comparable: x = 'NaN'Ditch
Nice try, except you got the cast( ) in the wrong direction, and NaN should be treated as Null rather than zero, for a number of conceptual reasons. So you should have done some follow-up with sthg like sum(case when cast(v1 as String) ='NaN' then Null else v1 end)Technique
H
0

Not sure if this applies in many cases, but in Hive 3 I'm getting:

select float('NaN') = float('NaN')

returns True

So in theory:

select * from hivedb.tb1 where v1 <> float('NaN');

should accomplish this

Hapte answered 15/1, 2021 at 16:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.