How to store 32-bit floats using ruby-msgpack gem?

I am working on a data system that needs to store large amounts of simple, extensible data (alongside some specialist indexing we are developing in-house, which is not part of this question). I expect there to be billions of records stored, so efficient serialisation is a key part of the system. The serialisation needs to be fast, space-efficient, and supported on multiple platforms and in multiple languages (because packing and unpacking this data will be a client component responsibility, not part of the storage system).

The data type is effectively a hash with optional key/value pairs. Keys will be small integers (interpreted at application layer). Values can be a variety of simple data types - String, Integer, Float.

As a technology choice, we have picked MessagePack, and I am writing code to perform data serialisation via Ruby's msgpack-ruby gem.

I don't need the precision of Ruby's 64-bit Float. None of the numbers being stored has meaningful precision even to limits of 32-bit. So I want to use MessagePack support for 32-bit floating point values. This definitely exists. However, the default behaviour in Ruby on any 64-bit system is to serialise Float to 64 bits:

MessagePack.pack(10.3)
 => "\xCB@$\x99\x99\x99\x99\x99\x9A"

Looking at MessagePack code, it seems there is a method MessagePack::Packer#write_float32, and this does what I expect:

MessagePack::DefaultFactory.packer.write_float32(10.3).to_s
 => "\xCAA$\xCC\xCD"

. . . but I cannot find a way to set up either the default packer or create a new one, that will use this method when serialising a larger structure.

As a test of my comprehension, I tried this:

class Float
  def to_msgpack_ext
    packer.write_float32(self)
  end

  def self.from_msgpack_ext s
    unpacker.read(s)
  end
end

MessagePack::DefaultFactory.register_type(0, Float )

MessagePack.pack(10.3)
 => "\xCB@$\x99\x99\x99\x99\x99\x9A"

No difference at all . . . clearly I am missing or misunderstanding something about the object model used in MessagePack. Is what I want to do possible, and what do I need to do?

Haik answered 5/9, 2018 at 10:21 Comment(0)

I know it would be nice to use MessagePack.pack, but the Ruby shim is very thin. It barely gives you an entry point into the C (or Java) library. And as AnoE pointed out, I think you can only customize to_msgpack_ext and self.from_msgpack_ext for registered types, not built-in types.

The other problem with your attempt is that you don't have access to packer and unpacker from those methods. You would just have to use Array#pack and String#unpack, I think, even if you could figure out a way to get the library to call your methods. To get a handle to packer, you have to override a different method:

class Float
  private
  def to_msgpack_with_packer(packer)
    packer.write_float32 self
    packer
  end
end

And then call it appropriately (see this code as to why):

10.3.to_msgpack(MessagePack::Packer.new).to_s # => "\xCAA$\xCC\xCD"

However, this falls apart when you call #to_msgpack on a Hash containing a float; it just reverts to its internal methods to pack hash keys and values. This is why I said above that the Ruby shim just gives you an entry point: the core extensions are only used for the initial call.

I think the best, simplest solution is to write a little serialization function that iterates through the hash in Ruby, using the MessagePack::Packer API to do what you want when it sees a float, etc. Zero C-hacking, zero monkey-patching, zero confusion when someone tries to read your code in six months.

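# Recursively packs obj, writing every Float as a 32-bit float.
def pack_float32(obj, packer=MessagePack::Packer.new)
  case obj
  when Hash
    packer.write_map_header(obj.size)
    obj.each_pair do |key, value|
      # Pack the key first, then the value, into the same packer.
      pack_float32(value, pack_float32(key, packer))
    end
  when Enumerable
    # Hash is matched above, so this branch covers Array and similar collections.
    packer.write_array_header(obj.size)
    obj.each do |value|
      pack_float32(value, packer)
    end
  when Float
    packer.write_float32(obj)
  else
    # Everything else (Integer, String, nil, ...) uses the default encoding.
    packer.write(obj)
  end

  packer
end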

pack_float32(1=>[10.3]).to_s # => "\x81\x01\x91\xCAA$\xCC\xCD"

Obviously this is not strenuously tested, and it may not handle all the edge cases, but hopefully it's enough to get you started.
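To check that the output above really is a one-entry map holding a one-element array with a 32-bit float, the bytes can be picked apart by hand. This is only a sketch using Ruby's standard library (no msgpack gem); the bit masks follow the fixmap, fixarray and float 32 layouts of the MessagePack format, and the variable names are made up for illustration.

```ruby
# The bytes printed by pack_float32(1=>[10.3]).to_s above, as a binary string.
bytes = "\x81\x01\x91\xCAA$\xCC\xCD".b

# First four bytes: fixmap header, key, fixarray header, float marker.
map_header, key, array_header, float_marker = bytes.unpack("C4")

map_size   = map_header & 0x0F    # fixmap is 1000xxxx, low nibble = entry count
array_size = array_header & 0x0F  # fixarray is 1001xxxx, low nibble = length

# Remaining four bytes: big-endian IEEE 754 single-precision payload.
value = bytes.byteslice(4, 4).unpack1("g")

puts map_size                      # 1
puts key                           # 1
puts array_size                    # 1
puts float_marker == 0xCA          # true (0xCA marks float 32, 0xCB float 64)
puts((value - 10.3).abs < 1e-6)    # true
```

So the whole record takes 8 bytes, of which the float accounts for 5.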

One other note: You don't have to worry about unpacking. msgpack-ruby appears to correctly unpack a 32-bit float to a 64-bit Float without any fiddling on our part.
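What you do get back is the nearest 32-bit value widened to a Ruby Float, not the original number. A small standard-library sketch of that round trip, assuming Array#pack's "g" directive (big-endian single precision) as a stand-in for the 4-byte float 32 payload:

```ruby
original = 10.3

# "g" packs a big-endian single-precision float, the same 4 bytes that
# follow the 0xCA marker in a MessagePack float 32.
stored   = [original].pack("g")
restored = stored.unpack1("g")

puts restored.class                      # Float
puts restored == original                # false
puts((restored - original).abs < 1e-6)   # true
```

If your application compares floats for equality after a round trip, it needs to allow for this difference.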

Adagietto answered 9/9, 2018 at 10:40 Comment(0)

Overriding Float

As of right now (version 1.2.4 of msgpack-ruby) this is not possible in the exact fashion you tried: the msgpack_packer_write_value function first checks all hard-coded data types, and handles them with its default implementation. Only if the current object does not fit any of those types are the extensions handled.

In other words: you cannot override the default pack formats with MessagePack::DefaultFactory#register_type; calling it for a built-in type such as Float is simply a no-op.

Using extensions

Furthermore, the extension mechanism is not what you are looking for anyway. With it, MessagePack would emit a marker byte meaning "this is an extension", followed by the extension ID (the value "0" in your example), followed by what is already encoded as float32; alternatively, you would need to handle the binary encoding/decoding yourself.
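To make the overhead concrete, here is a rough standard-library sketch (no msgpack gem involved) that builds the three candidate encodings of 10.3 by hand. It assumes the extension payload would be the bare 4-byte float, which fits the format's fixext 4 layout (marker 0xD6, one type byte, four data bytes); the type ID 0 is taken from the question, and the variable names are only illustrative.

```ruby
value = 10.3

# Native encodings: one marker byte plus a big-endian IEEE 754 payload.
float64 = [0xCB].pack("C") + [value].pack("G")  # float 64
float32 = [0xCA].pack("C") + [value].pack("g")  # float 32

# Extension encoding: fixext 4 marker, type ID, then the 4-byte payload.
ext_type    = 0
ext_wrapped = [0xD6, ext_type].pack("CC") + [value].pack("g")

puts float64.bytesize      # 9
puts float32.bytesize      # 5
puts ext_wrapped.bytesize  # 6
```

Even in the best case an extension type costs one byte more per value than a plain float32, and every reader would have to know about the extension.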

Creating your own Float class

You could, in principle, create your own FloatX class or whatever, but this is just a very bad move:

  • Float has no new method which you could monkeypatch, and I know of no way to tell ruby to create a FloatX instance when you write 10.3 in your code. So you would have to do manual object creation throughout your code, probably with severe impact on performance.
  • You would end up with the extension mechanism anyways, infeasible as shown above.

Overriding the behaviour of msgpack_packer_write_value

You would need to override the msgpack_packer_write_value implementation in packer.c. Unfortunately, you cannot do that from Ruby, since there is no equivalent Ruby method defined for it, so the usual monkeypatching cannot be used.

Also, the method is called from plenty of other methods inside the packer.c implementation, for example in the respective methods responsible for writing arrays or hashes. Those of course would not call a ruby method of the same name either, as they're living in their binary world completely.

Finally, while the use of a factory mechanism seems to imply that you can somehow create different implementations of packers, I see no evidence that this is actually true: reading the C code of the gem, there seems to be no provision for anything of that kind. The factory seems to be there to handle the Ruby<->C interactions of the gem.

What now

If I were in your shoes, I would clone that Gem and modify msgpack_packer_write_value in packer.c to behave as you wish. Check the case T_FLOAT and go on from there. The code seems pretty straightforward - it soon proceeds to the following method in packer.h:

static inline void msgpack_packer_write_float_value(msgpack_packer_t* pk, VALUE v)
{
    msgpack_packer_write_double(pk, rb_num2dbl(v));
}

...which is of course the real culprit here.

Approaching that from the other direction (the write_float32 you already found), the comparable code is:

msgpack_packer_write_float(pk, (float)rb_num2dbl(numeric));

So if you replace that line in msgpack_packer_write_float_value appropriately, you will be done. Should be doable even if you're not that much into C.

Afterwards, you give your Gem an individual release tag, build it yourself and specify it in your Gemfile or however you manage your gems.

Froemming answered 7/9, 2018 at 12:29 Comment(0)
