How to share Avro schema definitions across teams
Kafka Schema Registry provides a nice way to serialize and deserialize data from Kafka using a common data contract. However, the data contract (the .avsc file) is the glue between the producer and consumer(s).

Once the producer creates the .avsc file, it can be checked into version control on the producer's side. Depending on the language, classes can also be auto-generated from it.
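
For illustration, such an .avsc contract might look like the following; this is a made-up example, and the record and field names are placeholders, not from our actual system:

  {
    "type": "record",
    "name": "User",
    "namespace": "com.company.events",
    "fields": [
      {"name": "id", "type": "long"},
      {"name": "email", "type": "string"},
      {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
    ]
  }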

However,

  1. What would be the best mechanism for consumers to pull down the schema definition for reference? Is there anything like SwaggerHub or the typical API documentation portals, but for Avro?
  2. If we use the Confluent Platform, Control Center provides a GUI to view the schema associated with a topic, but it also allows the user to edit it. How would this work between the producer and consumer teams? What would prevent a consumer, or anyone else, from editing the schema directly on the Confluent Platform?
  3. Is this something we need to custom-build using the REST Proxy?
Sullage answered 5/9, 2019 at 2:49 Comment(0)
You're talking about two different ways to work with Avro schemas:

  • Having the schema registry store the schemas for you.
  • Generating an .avsc file and making that available to downstream consumers.

In the first method, your producer would have an .avsc file that is used to serialize the messages and send them to Kafka. If you're using the schema registry, though, you don't need to worry about consumers needing the actual Avro definition, since the whole Avro schema is available from the schema registry via the schema id carried with each message. You don't have the actual generated classes, true, but you can still "walk" the entire message and extract your data from it.
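
As an illustration, here is a minimal sketch (Java, assuming the Confluent Avro deserializer is on the classpath) of a consumer that reads messages as GenericRecord, so no generated classes are needed. The broker and registry URLs, topic, group id, and field name are all placeholders:

  import java.time.Duration;
  import java.util.Collections;
  import java.util.Properties;

  import org.apache.avro.generic.GenericRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.clients.consumer.KafkaConsumer;

  public class GenericAvroConsumer {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("bootstrap.servers", "kafka.company.com:9092");            // placeholder
          props.put("group.id", "schema-demo");                                 // placeholder
          props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
          props.put("value.deserializer",
                    "io.confluent.kafka.serializers.KafkaAvroDeserializer");
          props.put("schema.registry.url", "http://schema-registry.company.com:8081");

          try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
              consumer.subscribe(Collections.singletonList("your_topic"));
              ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(5));
              for (ConsumerRecord<String, GenericRecord> record : records) {
                  GenericRecord value = record.value();
                  // The deserializer fetched the writer's schema by id, so the record
                  // can be "walked" field by field without generated classes.
                  System.out.println(value.getSchema().getFullName()
                          + " -> " + value.get("some_field"));                  // placeholder field
              }
          }
      }
  }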

In the second method, without using a schema registry, the producer uses an .avsc file to serialize the data sent to Kafka as a byte array, and that file is then made available to consumer/downstream applications, usually through source control. Of course, this means your producer and consumers have to be in sync whenever you make schema changes, or else your consumers won't be able to read the fields the producer has added or modified.
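
For this second approach, a minimal consumer-side sketch (Java, plain Avro binary assumed, i.e. no Confluent wire format) could parse the .avsc file pulled from source control and decode the raw bytes with a GenericDatumReader. The file name is a placeholder:

  import java.io.File;
  import java.io.IOException;

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.io.BinaryDecoder;
  import org.apache.avro.io.DecoderFactory;

  public class SharedSchemaDecoder {
      private final Schema schema;
      private final GenericDatumReader<GenericRecord> reader;

      public SharedSchemaDecoder(File avscFile) throws IOException {
          // The .avsc file is the shared contract, checked out from source control.
          this.schema = new Schema.Parser().parse(avscFile);
          this.reader = new GenericDatumReader<>(schema);
      }

      public GenericRecord decode(byte[] payload) throws IOException {
          // Decodes plain Avro binary; producer and consumer must agree on the schema version.
          BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
          return reader.read(null, decoder);
      }
  }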

So, if you're using the schema registry, properly configured Kafka consumers will automatically pull the schema that each message requires, and you can then extract the data you need. Separately, you can also get the latest schema for any topic with something like this:

  curl -X GET "http://schema-registry.company.com:8081/subjects/your_topic-value/versions/latest/schema"
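
The same lookup can also be done programmatically. Here's a minimal sketch, assuming the Confluent schema-registry client library is on the classpath; the registry URL and subject name are placeholders:

  import org.apache.avro.Schema;

  import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
  import io.confluent.kafka.schemaregistry.client.SchemaMetadata;

  public class LatestSchemaFetcher {
      public static void main(String[] args) throws Exception {
          CachedSchemaRegistryClient client =
              new CachedSchemaRegistryClient("http://schema-registry.company.com:8081", 100);

          // The same call the curl command above makes, but from code.
          SchemaMetadata latest = client.getLatestSchemaMetadata("your_topic-value");

          // Parse the schema string into an Avro Schema object for inspection.
          Schema schema = new Schema.Parser().parse(latest.getSchema());
          System.out.println(schema.toString(true));
      }
  }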

If, however, you are not using the schema registry, the only way to get the full schema is to have access to the .avsc file used to serialize the message, usually through source control, as mentioned above. You can also then share the auto-generated classes, if available, to deserialize your messages into classes directly.

For more information on how to interact with Schema Registry, here's a link to the documentation: https://docs.confluent.io/current/schema-registry/schema_registry_tutorial.html#using-curl-to-interact-with-schema-registry

And some reading on general schema compatibility and how it's handled/configured in Schema Registry - https://docs.confluent.io/current/schema-registry/avro.html

Lobeline answered 5/9, 2019 at 4:41 Comment(1)
How do we parse the schema? Or should we already know the schema beforehand for the first approach (the registry)? – Divergence
It is a bit of an old question, but the answer might be helpful for someone else.

  1. I recommend using a schema registry for that, e.g. Confluent Schema Registry or Apicurio Registry. Thanks to that you have a single source of truth about which schemas are available and what their change history is, and you don't need to synchronize it between applications. If you go with the code-first, schema-last approach you described, the schema can be pushed to the registry via auto-registration or by a deployment pipeline step in CI.
  2. I'm not familiar with Confluent Control Center, but in the project I'm working on (Nussknacker) we use the AKHQ tool, which has a read-only mode (I bet Confluent Control Center has the same option). During Nussknacker deployments we usually keep the schemas in a separate git repository; adding a new schema version goes through a review process, and after that the schema is added to the schema registry by a pipeline step in CI. IMO this is a better option than auto-registration or adding schemas via the GUI, because both the producer and consumer sides get to discuss the API design. In this approach, AKHQ (or another visualization tool) is only there to visualize the schema registry state. Our pipeline step used the REST API, as mjuarez described; a minimal sketch of such a registration step is shown after this list.
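
Here is a minimal sketch of such a CI registration step, assuming a reasonably recent Confluent schema-registry client (5.5 or later, where register() takes a ParsedSchema); the registry URL, subject, and schema path are placeholders. The registry will reject the new version if it violates the configured compatibility rules:

  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Paths;

  import io.confluent.kafka.schemaregistry.avro.AvroSchema;
  import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;

  public class RegisterSchemaStep {
      public static void main(String[] args) throws Exception {
          // The .avsc file that passed code review, checked out by the CI job.
          String avsc = new String(
                  Files.readAllBytes(Paths.get("schemas/user.avsc")), StandardCharsets.UTF_8);

          CachedSchemaRegistryClient client =
              new CachedSchemaRegistryClient("http://schema-registry.company.com:8081", 100);

          // Registers a new version under the subject and prints the assigned schema id.
          int id = client.register("your_topic-value", new AvroSchema(avsc));
          System.out.println("Registered schema id: " + id);
      }
  }
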
Telugu answered 17/2, 2022 at 9:54 Comment(0)
