Avro: multiple records of the same type in a single schema

I would like to use the same record type multiple times in an Avro schema. Consider this schema definition:

{
    "type": "record",
    "name": "OrderBook",
    "namespace": "my.types",
    "doc": "Test order update",
    "fields": [
        {
            "name": "bids",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "OrderBookVolume",
                    "namespace": "my.types",
                    "fields": [
                        {
                            "name": "price",
                            "type": "double"
                        },
                        {
                            "name": "volume",
                            "type": "double"
                        }
                    ]
                }
            }
        },
        {
            "name": "asks",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "OrderBookVolume",
                    "namespace": "my.types",
                    "fields": [
                        {
                            "name": "price",
                            "type": "double"
                        },
                        {
                            "name": "volume",
                            "type": "double"
                        }
                    ]
                }
            }
        }
    ]
}

This is not a valid Avro schema, and the Avro schema parser fails with:

org.apache.avro.SchemaParseException: Can't redefine: my.types.OrderBookVolume

I can fix this by making the types unique, moving OrderBookVolume into two different namespaces:

{
    "type": "record",
    "name": "OrderBook",
    "namespace": "my.types",
    "doc": "Test order update",
    "fields": [
        {
            "name": "bids",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "OrderBookVolume",
                    "namespace": "my.types.bid",
                    "fields": [
                        {
                            "name": "price",
                            "type": "double"
                        },
                        {
                            "name": "volume",
                            "type": "double"
                        }
                    ]
                }
            }
        },
        {
            "name": "asks",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "OrderBookVolume",
                    "namespace": "my.types.ask",
                    "fields": [
                        {
                            "name": "price",
                            "type": "double"
                        },
                        {
                            "name": "volume",
                            "type": "double"
                        }
                    ]
                }
            }
        }
    ]
}

This is not a workable solution, because Avro code generation would then produce two different classes. That is very annoying if I also want to use the type for other things, not just for serialization and deserialization.

This problem is related to Avro Spark issue #73, which added differentiation of nested records with the same name by prefixing the namespace with the outer record names. Their use case may be purely storage related, so it may work for them, but it does not work for us.

Does anybody know a better solution? Is this a hard limitation of Avro?

Matriarch answered 4/1, 2018 at 17:31 Comment(0)

It's not well documented, but Avro allows you to reference a previously defined named type by its fullname (namespace plus name). In your case, the following schema would result in only one class being generated, referenced by each array. It also DRYs up the schema nicely.

{
    "type": "record",
    "name": "OrderBook",
    "namespace": "my.types",
    "doc": "Test order update",
    "fields": [
        {
            "name": "bids",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "OrderBookVolume",
                    "namespace": "my.types.bid",
                    "fields": [
                        {
                            "name": "price",
                            "type": "double"
                        },
                        {
                            "name": "volume",
                            "type": "double"
                        }
                    ]
                }
            }
        },
        {
            "name": "asks",
            "type": {
                "type": "array",
                "items": "my.types.bid.OrderBookVolume"
            }
        }
    ]
}
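
The rewrite above (keep the first definition, refer to it by fullname afterwards) can also be applied mechanically. What follows is a minimal sketch in plain Python using only the standard library, not the Avro library; the helper names fullname and dedupe are made up for illustration. It assumes repeated definitions are written identically and only handles records, enums, fixed types, arrays, maps and unions.

```python
import copy
import json

NAMED_TYPES = {"record", "enum", "fixed"}


def fullname(name, namespace):
    """Join a name with its enclosing namespace unless it is already qualified."""
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def dedupe(schema, namespace=None, seen=None):
    """Replace repeated, identical named-type definitions with fullname strings.

    Walks depth-first, left to right: the first occurrence keeps its full
    definition, later occurrences become references. The input is not modified.
    """
    if seen is None:
        seen = {}
    if isinstance(schema, list):  # union
        return [dedupe(branch, namespace, seen) for branch in schema]
    if not isinstance(schema, dict):  # primitive name or existing reference
        return schema
    kind = schema.get("type")
    if kind in NAMED_TYPES:
        full = fullname(schema["name"], schema.get("namespace", namespace))
        if full in seen:
            # Simplification: definitions must match literally to be merged.
            if seen[full] != schema:
                raise ValueError(f"conflicting definitions of {full}")
            return full
        seen[full] = copy.deepcopy(schema)
        if kind != "record":
            return schema
        inner = full.rpartition(".")[0] or None  # namespace inherited by children
        out = dict(schema)
        out["fields"] = [
            dict(field, type=dedupe(field["type"], inner, seen))
            for field in schema["fields"]
        ]
        return out
    if kind == "array":
        return dict(schema, items=dedupe(schema["items"], namespace, seen))
    if kind == "map":
        return dict(schema, values=dedupe(schema["values"], namespace, seen))
    return schema


# The schema from the question, with OrderBookVolume spelled out twice.
volume = {
    "type": "record",
    "name": "OrderBookVolume",
    "namespace": "my.types",
    "fields": [
        {"name": "price", "type": "double"},
        {"name": "volume", "type": "double"},
    ],
}
order_book = {
    "type": "record",
    "name": "OrderBook",
    "namespace": "my.types",
    "fields": [
        {"name": "bids", "type": {"type": "array", "items": volume}},
        {"name": "asks", "type": {"type": "array", "items": copy.deepcopy(volume)}},
    ],
}

fixed = dedupe(order_book)
print(json.dumps(fixed["fields"][1]["type"]))
# prints {"type": "array", "items": "my.types.OrderBookVolume"}
```

The result still has to be handed to the real Avro parser; the sketch only removes the duplicate definitions and does not validate anything else about the schema.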
Oden answered 6/1, 2018 at 20:13 Comment(1)
Nice. But what about references between types inside two different avsc files? - Boeschen

As stated in the spec:

A schema or protocol may not contain multiple definitions of a fullname.
Further, a name must be defined before it is used ("before" in the
depth-first, left-to-right traversal of the JSON parse tree, where the
types attribute of a protocol is always deemed to come "before" the
messages attribute.)

For example:

{
    "type": "record",
    "namespace": "my.types",
    "name": "OrderBook",
    "fields": [
        {
            "name": "bids",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "OrderBookVolume",
                    "fields": [
                        {"name": "price", "type": "double"},
                        {"name": "volume", "type": "double"}
                    ]
                }
            }
        },
        {
            "name": "asks",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "my.types.OrderBookVolume"
                }
            }
        }
    ]
}

The first occurrence is the full schema for OrderBookVolume. Afterwards, you can just refer to the fullname: my.types.OrderBookVolume.

It's also worth noting that you don't need to specify a namespace for each record. A nested record inherits the namespace of its enclosing record; specifying one explicitly overrides the inherited namespace.
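
To make the two rules from the spec quote and the namespace inheritance concrete, here is a rough sketch of a name checker in plain Python (standard library only). It is not Avro's parser and does not reproduce its behaviour in every case; the names qualify and check_names are invented for this example, and the error text merely imitates the message from the question.

```python
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
NAMED_TYPES = {"record", "error", "enum", "fixed"}


def qualify(name, namespace):
    """Resolve a name against the enclosing namespace unless it contains a dot."""
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def check_names(schema, namespace=None, defined=None):
    """Walk the schema depth-first, left to right, and return the defined fullnames.

    Raises ValueError if a fullname is defined twice or used before its definition.
    """
    if defined is None:
        defined = set()
    if isinstance(schema, str):
        if schema not in PRIMITIVES:
            full = qualify(schema, namespace)
            if full not in defined:
                raise ValueError(f"Undefined name: {full}")
    elif isinstance(schema, list):  # union
        for branch in schema:
            check_names(branch, namespace, defined)
    elif isinstance(schema, dict):
        kind = schema["type"]
        if kind in NAMED_TYPES:
            full = qualify(schema["name"], schema.get("namespace", namespace))
            if full in defined:
                raise ValueError(f"Can't redefine: {full}")
            defined.add(full)
            inner = full.rpartition(".")[0] or None  # inherited by nested types
            for field in schema.get("fields", []):
                check_names(field["type"], inner, defined)
        elif kind == "array":
            check_names(schema["items"], namespace, defined)
        elif kind == "map":
            check_names(schema["values"], namespace, defined)
        else:
            check_names(kind, namespace, defined)
    return defined


# OrderBookVolume has no namespace of its own, so it inherits "my.types".
volume = {
    "type": "record",
    "name": "OrderBookVolume",
    "fields": [
        {"name": "price", "type": "double"},
        {"name": "volume", "type": "double"},
    ],
}


def order_book(asks_items):
    """Build an OrderBook schema whose "asks" array uses the given item type."""
    return {
        "type": "record",
        "name": "OrderBook",
        "namespace": "my.types",
        "fields": [
            {"name": "bids", "type": {"type": "array", "items": volume}},
            {"name": "asks", "type": {"type": "array", "items": asks_items}},
        ],
    }


valid = order_book("OrderBookVolume")  # reference by (short) name
duplicated = order_book(volume)        # second full definition

print(sorted(check_names(valid)))
try:
    check_names(duplicated)
except ValueError as err:
    print(err)
```

In this sketch the short reference "OrderBookVolume" resolves to my.types.OrderBookVolume because it appears inside a record whose namespace is my.types, while spelling the record out a second time trips the redefinition check.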

Thoraco answered 29/3, 2019 at 14:36 Comment(4)
I don't think the syntax for array items is correct - see the accepted answer by John Hunter. - Prospect
@Prospect I'm not sure what you mean. Could you be more specific? - Thoraco
Sure. To use a previously defined type as the type of an array's items, you should specify only the name, as a string. I couldn't get the Confluent Schema Registry to recognise the syntax in your example, but it worked when I swapped it for the following, with "items" given as a plain string: { "name": "asks", "type": { "type": "array", "items": "OrderBookVolume" } }. Thanks for your reply though. - Prospect
It doesn't work unless you add a namespace to the first definition and reference it by that same name, as Hunter did. - Macaulay
