What's the recommended implementation for hashing OLE Variants?
Asked Answered
C

3

7

OLE Variants, as used by older versions of Visual Basic and pervasively in COM Automation, can store lots of different types: basic types like integers and floats, more complicated types like strings and arrays, and all the way up to IDispatch implementations and pointers in the form of ByRef variants.

Variants are also weakly typed: they convert the value to another type without warning depending on which operator you apply and what the current types are of the values passed to the operator. For example, comparing two variants, one containing the integer 1 and another containing the string "1", for equality will return True.

So assuming that I'm working with variants at the underlying data level (e.g. VARIANT in C++ or TVarData in Delphi - i.e. the big union of different possible values), how should I hash variants consistently so that they obey the right rules?

Rules:

  • Variants that hash unequally should compare as unequal, both in sorting and direct equality
  • Variants that compare as equal for both sorting and direct equality should hash as equal

It's OK if I have to use different sorting and direct comparison rules in order to make the hashing fit.

The way I'm currently working is I'm normalizing the variants to strings (if they fit), and treating them as strings, otherwise I'm working with the variant data as if it was an opaque blob, and hashing and comparing its raw bytes. That has some limitations, of course: numbers 1..10 sort as [1, 10, 2, ... 9] etc. This is mildly annoying, but it is consistent and it is very little work. However, I do wonder if there is an accepted practice for this problem.

Conducive answered 19/3, 2010 at 18:12 Comment(12)
VARIANT is actually a struct, that has two pieces of data - value and type. Your comparison and conversion claim seems to take in consideration only the value and does not look at the type filed of that struct. The right approach is to always consider the type filed as well.Mezzotint
@Franci, I think you missed the point. Two variants can compare equal even when their types differ. If the variants compare equal, then Barry wishes their hashes to be equal, too. Variant(1) = Variant('1') ==> hash(Variant(1)) = hash(Variant('1')).Beauvoir
Barry, I don't think your first rule is right. It ignores the possibility of hash collisions, where the hashes are equal but the values are not similar at all.Beauvoir
You're right @Rob, I meant the inverse - I've inverted it. Though I think it actually follows from the second rule.Conducive
No, I didn't miss Barry's point. But his conversion and comparison claims in paragraph 2 of the question are mistaken. Two VARIANTs can't compare equal, if their types differ. There has to be a conversion before the comparison (which in the COM world is explicit, but in the VB and Delphi can be implicit). VARIANT(VT_INT, 1) is not equal to VARIANT(VT_BSTR, L"1").Mezzotint
If Barry wants semantical equality, i.e. any representation of the number 1 in any VARIANT type to be considered the same (including strings, dates and objects (what's the IDispatch or OLEDATE representation of 1, anyway?)) and to have the same hash, he should take into consideration the type when calculating the hash, or should "normalize" all representations to the same data type (which he seems to be doing currently). If he chooses the "normalization" approach, the VARIANT part is absolutely irrelevant to the question, as the same problem applies to any other value storage he chooses.Mezzotint
If y'all were more inclined to post answers rather than comments, there could be some proper (+ and -) voting and potentially acceptance going on here...Conducive
Just out of curiosity, what are you trying to accomplish? I wonder if there is a better way.Immuno
@Immuno - all I need is comparing and hashing that's consistent enough to use in data structures like hash tables, trees, etc.Conducive
But what are you trying to do at a higher level? The notion of an associative collection where the VARIANT is the key seems nonsensical to me. Why would you need to do a lookup on a VARIANT?Immuno
@Immuno - I'm trying to write a generic container for a language runtime library. It's not up to me to question why the user would make the key a variant.Conducive
I don't think there's a good way to do this with VARIANTs. You could convert them all to strings for hashing purposes, but as others have stated there are issues with this (also, not all VARIANT types can be converted to strings). Maybe you should look at the .NET CLR to see how Object.GetHashCode() is implemented.Immuno
H
0

Hash codes of VARIANTS that are equal should be equal.

Without knowing the equality and coercion rules that are used for testing equality, it is hard to come up with a proper implementation.

Hak answered 21/3, 2010 at 17:9 Comment(2)
I'm very familiar with how hash codes work in .NET and Java (I've written compilers in both targeting both CLR and JVM), but the problem is that Variants as used in VB and Delphi are not type-safe in the same way as polymorphic objects stored in a location of type Object in .NET or Java, or the way values are type-safe in Ruby, Python or Javascript. That is, 1 == "1", or 1.Equals("1") == true, for Variant values of 1 and "1". I guess the answer to my question is "it depends" - depending on the language semantics.Conducive
I'm marking this the answer as it's quite true, in order to write the hash function that is guaranteed to match the equality function, the equality function must be known and well defined.Conducive
K
2

There's a built in tension in your question between the use of a hash function and the stated requirements, which are to validated against the input of the hash. I'd suggest we keep in mind a few properties of hashes in general: information is lost during the hashing process, and hash collisions are to be expected. It is possible to construct a perfect hash without collisions, but it would be problematic (or impossible?) to construct a perfect hash function if the domain of the function is any possible OLE Variant. On the other hand, if we're not talking about a perfect hash, then your first rule is violated.

I don't know the larger context of what you're trying to accomplish, but I must push back on one of your assumptions: is a hash function really what you want? Your requirements could be met in a fairly straightforward way if you develop a system that encodes, not hashes, all of the possible OLE Variant attributes so that they can be recalled later and compared against other Variant images.

Your baseline implementation of converting the Variant to a string representation is moving in this direction. As you are no doubt aware, a Variant can contain pointers, double pointers, and arrays, so you'll have to develop consistent string representation of these data types. I question whether this approach could really be classified as a hash. Aren't you just persisting data attributes?

Kalin answered 22/3, 2010 at 0:35 Comment(1)
I'm writing a generic collection class for a runtime library. The generic parameter might be a variant. Perfect hashing isn't relevant. (Actually, perfect hashing can be counterproductive in small hash tables by increasing the cost of a failed hash lookup.)Conducive
E
0

So in summary, to make stuff comparable you first stream to a common format, string or blob.

How do you handle e.g. localisation, e.g. formating of reals? A real compared to a string containing the same real created in another locale will fail. Or a real written to string with a different precision setting.

It sounds to me the definition of equal() is the problem, not the hashing. If "equal" values can be serialized to string (or blob) differently, hashing will fail.

Ezekiel answered 20/3, 2010 at 15:33 Comment(1)
That's a good point. There are two potential answers: (a) use invariant settings so that hash codes will be reliable across multiple instances and locales etc., or (b) don't care so long as the results are consistent within any given run (though settings may change and break things in edge cases). Given all that's been said in the comments to my question - I wish more of those comments were actual answers - I may review my approach and handle types individually, and not try to preserve Delphi-like equality semantics when considering comparers etc. needed for algorithms.Conducive
H
0

Hash codes of VARIANTS that are equal should be equal.

Without knowing the equality and coercion rules that are used for testing equality, it is hard to come up with a proper implementation.

Hak answered 21/3, 2010 at 17:9 Comment(2)
I'm very familiar with how hash codes work in .NET and Java (I've written compilers in both targeting both CLR and JVM), but the problem is that Variants as used in VB and Delphi are not type-safe in the same way as polymorphic objects stored in a location of type Object in .NET or Java, or the way values are type-safe in Ruby, Python or Javascript. That is, 1 == "1", or 1.Equals("1") == true, for Variant values of 1 and "1". I guess the answer to my question is "it depends" - depending on the language semantics.Conducive
I'm marking this the answer as it's quite true, in order to write the hash function that is guaranteed to match the equality function, the equality function must be known and well defined.Conducive

© 2022 - 2024 — McMap. All rights reserved.