Fastest serialization/deserialization of Scala case classes

If I've got a nested object graph of case classes, similar to the example below, and I want to store collections of them in a Redis list, what libraries or tools should I look at that will give the fastest overall round trip to Redis?

This will include:

  • Time to serialize the item
  • network cost of transferring the serialized data
  • network cost of retrieving stored serialized data
  • time to deserialize back into case classes

    case class Person(name: String, age: Int, children: List[Person]) {}
    
Irra answered 10/8, 2013 at 8:19 Comment(1)
upickle is fast and easy to use, see my answer for more details. The accepted answer is dated. My answer has a link with updated Scala serialization benchmarking.Tussis

UPDATE (2018): scala/pickling is no longer actively maintained. Hordes of other libraries have arisen as alternatives. They take similar approaches but tend to focus on specific serialization formats, e.g., JSON, binary, protobuf.

Your use case is exactly the targeted use case for scala/pickling (https://github.com/scala/pickling). Disclaimer: I'm an author.

Scala/pickling was designed to be a faster, more typesafe, and more open alternative to automatic frameworks like Java serialization or Kryo. It was built in particular for distributed applications, so serialization/deserialization time and serialized data size take a front seat. It takes a different approach to serialization altogether: it generates pickling (serialization) code inline at the use site at compile time, so it's very fast.

The latest benchmarks are in our OOPSLA paper. For the binary pickle format (you can also choose others, like JSON), scala/pickling is consistently faster than Java serialization and Kryo, and produces binary representations that are on par with or smaller than Kryo's, meaning less latency when passing your pickled data over the network.

For more info, there's a project page: http://lampwww.epfl.ch/~hmiller/pickling

There's also a ScalaDays 2013 talk from June on Parleys.

We'll also be presenting some new developments at Strange Loop 2013, in particular around sending closures over the network, in case that is also a pain point for your use case.

As of the time of this writing, scala/pickling is in pre-release, with our first stable release planned for August 21st.

Hoon answered 10/8, 2013 at 12:11 Comment(8)
That looks quite useful, but the github page is a little short. I'll definitely have a look at those looks and thanks for the reply.Irra
*links. Btw, is JSON the best format for performance, or was that just the example on github?Irra
Yep, we're working on documentation. There will be more for usage instructions in the next week or two along with the release. Our binary format is our most performance-tuned format. Although JSON should be pretty fast too (we haven't published comprehensive benchmarks for JSON yet).Hoon
Unfortunately pickling doesn't support Enumeration types.Punchdrunk
The project doesn't seem to be maintained any more, or is on the brink of becoming abandonware.Damnable
Using pickling causes severe problems with my Scala IDE - it freezes for a long time during the "update occurrence annotations" step. I think I would be better off with something less inconvenient.Sabadilla
What do you use nowadays, in 2020? Only knowing that there are alternatives, but not which ones are good, doesn't really help.Fjeld
@Fjeld you can look at my answer. I use the MsgPack spec and achieve great binary compression results.Aubin

Update:

Be careful when using the JDK's serialization methods. Performance is not great, and one small change to your class can make previously serialized data impossible to deserialize.


I've used scala/pickling, but it takes a global lock while serializing/deserializing.

So instead of using it, I wrote my own serialization/deserialization code like this:

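import java.io._

object Serializer {

  // Serializes any java.io.Serializable value (case classes qualify)
  // using the JDK's built-in object streams.
  def serialize[T <: Serializable](obj: T): Array[Byte] = {
    val byteOut = new ByteArrayOutputStream()
    val objOut = new ObjectOutputStream(byteOut)
    try objOut.writeObject(obj)
    finally objOut.close() // flushes, and closes byteOut as well
    byteOut.toByteArray
  }

  // The cast to T is unchecked because of erasure, so the caller must
  // ask for the type that was actually written.
  def deserialize[T <: Serializable](bytes: Array[Byte]): T = {
    val objIn = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try objIn.readObject().asInstanceOf[T]
    finally objIn.close() // closes the wrapped byte stream too
  }
}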

Here is an example of using it:

case class Example(a: String, b: String)

val obj = Example("a", "b")
val bytes = Serializer.serialize(obj)
val obj2 = Serializer.deserialize[Example](bytes)
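
As a rough illustration of the warning at the top of this answer, here is a minimal sketch that round-trips the nested `Person` from the question and pins the class's serial version UID. The `@SerialVersionUID(1L)` annotation and the `JdkRoundTrip` helper are my own additions for illustration, not part of the code above. Without a pinned UID the JVM derives one from the class structure, so even a small change to the class makes old bytes fail to load. With it pinned, compatible changes such as adding a field at least get past the UID check.

```scala
import java.io._

// Assumption: pinning the UID is one possible guard against accidental
// incompatibility; it does not make every class change safe.
@SerialVersionUID(1L)
case class Person(name: String, age: Int, children: List[Person])

// Hypothetical helper, named for this sketch only.
object JdkRoundTrip {

  def toBytes(value: Serializable): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    bytes.toByteArray
  }

  def fromBytes[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val family = Person("Ann", 40, List(Person("Bob", 10, Nil)))
    val bytes = toBytes(family)
    val copy = fromBytes[Person](bytes)
    // The payload size is worth watching: JDK serialization embeds class metadata.
    println(s"round trip ok: ${copy == family}, payload: ${bytes.length} bytes")
  }
}
```

Printing the payload size next to the equality check makes it easy to compare against a binary format later.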
Musset answered 28/12, 2015 at 10:55 Comment(0)

According to the uPickle benchmarks: "uPickle runs 30-50% faster than Circe for reads/writes, and ~200% faster than play-json" for serializing case classes.

It's easy to use. Here's how to serialize a case class to a JSON string:

case class City(name: String, funActivity: String, latitude: Double)
val bengaluru = City("Bengaluru", "South Indian food", 12.97)
implicit val cityRW = upickle.default.macroRW[City]
upickle.default.write(bengaluru) // "{\"name\":\"Bengaluru\",\"funActivity\":\"South Indian food\",\"latitude\":12.97}"

You can also serialize to binary or other formats.
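
To make the binary option concrete, here is a small sketch of a full round trip in both formats, assuming the `upickle` library is on the classpath. It reuses the `City` class from above but moves the `ReadWriter` into the companion object, which is a choice I made for this sketch. The `UpickleRoundTrip` object name is hypothetical. uPickle's binary format is MessagePack.

```scala
import upickle.default.{ReadWriter, macroRW, read, readBinary, write, writeBinary}

case class City(name: String, funActivity: String, latitude: Double)

object City {
  // Putting the ReadWriter in the companion makes it available everywhere
  // without an extra import.
  implicit val rw: ReadWriter[City] = macroRW
}

// Hypothetical driver object, named for this sketch only.
object UpickleRoundTrip {
  def main(args: Array[String]): Unit = {
    val cities = List(City("Bengaluru", "South Indian food", 12.97))

    val json: String = write(cities)                // JSON text
    val binary: Array[Byte] = writeBinary(cities)   // MessagePack bytes

    println(s"JSON round trip ok: ${read[List[City]](json) == cities}")
    println(s"binary round trip ok: ${readBinary[List[City]](binary) == cities}")
    println(s"JSON: ${json.length} chars, binary: ${binary.length} bytes")
  }
}
```

The byte array from `writeBinary` is what you would push onto a Redis list, and comparing the two sizes gives a first idea of the network cost of each format.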

Tussis answered 23/12, 2020 at 3:30 Comment(0)

The accepted answer from 2013 proposes a library that is no longer maintained. There are many similar questions on Stack Overflow, but I couldn't find a good answer that meets the following criteria:

  • serialization/deserialization should be fast
  • high-performance data exchange over the wire, where you only encode as much metadata as you need
  • supports schema evolution, so that changing the serialized object (e.g., a case class) doesn't break deserialization of previously written data

I recommend against using low-level JDK serialization (ObjectOutputStream and ObjectInputStream over byte-array streams). Supporting schema evolution becomes a pain, and it's difficult to make it work with external services (e.g., Thrift), since you have no control over whether the data being sent back was written with the same type of streams.

Some people use JSON, with libraries like json4s, but it is not well suited to message transfer in distributed computing. It marshals data as a JSON string, so it is both slower and less storage efficient: every character in the string takes at least 8 bits.

I highly recommend using the MessagePack binary serialization format, and I would recommend reading the spec to understand the encoding specifics. It has implementations in many different languages. Here's a generic example I wrote for a Scala case class that you can copy-paste into your code.

import java.nio.ByteBuffer
import java.util.concurrent.TimeUnit

import org.msgpack.core.MessagePack

case class Data(message: String, number: Long, timeUnit: TimeUnit, price: Long)

object Data extends App {

  def serialize(data: Data): ByteBuffer = {
    val packer = MessagePack.newDefaultBufferPacker
    packer
      .packString(data.message)
      .packLong(data.number)
      .packString(data.timeUnit.toString)
      .packLong(data.price)
    packer.close()
    ByteBuffer.wrap(packer.toByteArray)
  }

  def deserialize(data: ByteBuffer): Data = {
    val unpacker = MessagePack.newDefaultUnpacker(convertDataToByteArray(data))
    val newdata = Data.apply(
      message = unpacker.unpackString(),
      number = unpacker.unpackLong(),
      timeUnit = TimeUnit.valueOf(unpacker.unpackString()),
      price = unpacker.unpackLong()
    )
    unpacker.close()
    newdata
  }

  def convertDataToByteArray(data: ByteBuffer): Array[Byte] = {
    val buffer = Array.ofDim[Byte](data.remaining())
    data.duplicate().get(buffer)
    buffer
  }

  println(deserialize(serialize(Data("Hello world!", 1L, TimeUnit.DAYS, 3L))))
}

It will print:

Data(Hello world!,1,DAYS,3)
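
The example above packs fields by position, so reader and writer must agree on the exact field order. As a sketch of how the schema-evolution criterion might be approached with the same library, the variant below writes a MessagePack map keyed by field name and skips keys it does not recognise. The `Reading` class, its field names, and the default value are hypothetical and chosen only for illustration. The trade-off is a larger payload, since the keys are sent with every record.

```scala
import org.msgpack.core.MessagePack

// Hypothetical record; `unit` plays the role of a field added in a later version.
case class Reading(sensor: String, value: Long, unit: String = "unknown")

object Reading {

  def serialize(r: Reading): Array[Byte] = {
    val packer = MessagePack.newDefaultBufferPacker()
    packer.packMapHeader(3) // number of key/value pairs that follow
    packer.packString("sensor").packString(r.sensor)
    packer.packString("value").packLong(r.value)
    packer.packString("unit").packString(r.unit)
    packer.close()
    packer.toByteArray
  }

  def deserialize(bytes: Array[Byte]): Reading = {
    val unpacker = MessagePack.newDefaultUnpacker(bytes)
    var result = Reading("", 0L) // fields missing from the payload keep these defaults
    val fields = unpacker.unpackMapHeader()
    for (_ <- 0 until fields) {
      unpacker.unpackString() match {
        case "sensor" => result = result.copy(sensor = unpacker.unpackString())
        case "value"  => result = result.copy(value = unpacker.unpackLong())
        case "unit"   => result = result.copy(unit = unpacker.unpackString())
        case _        => unpacker.skipValue() // key from a newer writer: ignore it
      }
    }
    unpacker.close()
    result
  }
}
```

A payload written before `unit` existed still decodes, with `unit` falling back to its default, and a payload with extra keys decodes with those keys ignored.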
Aubin answered 8/10, 2021 at 2:38 Comment(0)
