This is somewhat of a shot in the dark in case anyone savvy with the Java implementation of Apache Avro is reading this.
My high-level objective is to have some way to transmit some series of avro data over the network (let's just say HTTP for example, but the particular protocol is not that important for this purpose). In my context I have a HttpServletResponse I need to write this data to somehow.
I initially attempted to write the data as what amounted to a virtual version of an avro container file (suppose that "response" is of type HttpServletResponse):
response.setContentType("application/octet-stream");
response.setHeader("Content-transfer-encoding", "binary");
ServletOutputStream outStream = response.getOutputStream();
BufferedOutputStream bos = new BufferedOutputStream(outStream);
Schema someSchema = Schema.parse(".....some valid avro schema....");
GenericRecord someRecord = new GenericData.Record(someSchema);
someRecord.put("somefield", someData);
...
GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(someSchema);
DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<GenericRecord>(datumWriter);
fileWriter.create(someSchema, bos);
fileWriter.append(someRecord);
fileWriter.close();
bos.flush();
This was all fine and dandy, except that it turns out Avro doesn't really provide a way to read a container file apart from an actual file: the DataFileReader only has two constructors:
public DataFileReader(File file, DatumReader<D> reader);
and
public DataFileReader(SeekableInput sin, DatumReader<D> reader);
where SeekableInput is some avro-specific customized form whose creation also ends up reading from a file. Now given that, unless there is some way to somehow coerce an InputStream into a File (https://mcmap.net/q/412148/-create-a-java-file-object-or-equivalent-using-a-byte-array-in-memory-without-a-physical-file suggests that there is not, and I have tried looking around the Java documentation as well), this approach won't work if the reader on the other end of the OutputStream receives that avro container file (I'm not sure why they allowed one to output avro binary container files to an arbitrary OutputStream without providing a way to read them from the corresponding InputStream on the other end, but that's beside the point). It seems that the implementation of the container file reader requires the "seekable" functionality that a concrete File provides.
Okay, so it doesn't look like that approach will do what I want. How about creating a JSON response that mimics the avro container file?
public static Schema WRAPPER_SCHEMA = Schema.parse(
"{\"type\": \"record\", " +
"\"name\": \"AvroContainer\", " +
"\"doc\": \"a JSON avro container file\", " +
"\"namespace\": \"org.bar.foo\", " +
"\"fields\": [" +
"{\"name\": \"schema\", \"type\": \"string\", \"doc\": \"schema representing the included data\"}, " +
"{\"name\": \"data\", \"type\": \"bytes\", \"doc\": \"packet of data represented by the schema\"}]}"
);
I'm not sure if this is the best way to approach this given the above constraints, but it looks like this might do the trick. I'll put the schema (of "Schema someSchema" from above, for instance) as a String inside the "schema" field, and then put in the avro-binary-serialized form of a record fitting that schema (ie. "GenericRecord someRecord") inside the "data" field.
I actually wanted to know about a specific detail of that which is described below, but I thought it would be worthwhile to give a bigger context as well, so that if there is a better high-level approach I could be taking (this approach works but just doesn't feel optimal) please do let me know.
My question is, assuming I go with this JSON-based approach, how do I write the avro binary representation of my Record into the "data" field of the AvroContainer schema? For example, I got up to here:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
GenericDatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(someSchema);
Encoder e = new BinaryEncoder(baos);
datumWriter.write(resultsRecord, e);
e.flush();
GenericRecord someRecord = new GenericData.Record(someSchema);
someRecord.put("schema", someSchema.toString());
someRecord.put("data", ByteBuffer.wrap(baos.toByteArray()));
datumWriter = new GenericDatumWriter<GenericRecord>(WRAPPER_SCHEMA);
JsonGenerator jsonGenerator = new JsonFactory().createJsonGenerator(baos, JsonEncoding.UTF8);
e = new JsonEncoder(WRAPPER_SCHEMA, jsonGenerator);
datumWriter.write(someRecord, e);
e.flush();
PrintWriter printWriter = response.getWriter(); // recall that response is the HttpServletResponse
response.setContentType("text/plain");
response.setCharacterEncoding("UTF-8");
printWriter.print(baos.toString("UTF-8"));
I initially tried omitting the ByteBuffer.wrap clause, but then then the line
datumWriter.write(someRecord, e);
threw an exception that I couldn't cast a byte array into ByteBuffer. Fair enough, it looks like when the Encoder class (of which JsonEncoder is a subclass) is called to write an avro Bytes object, it requires a ByteBuffer to be given as an argument. Thus, I tried encapsulating the byte[] with java.nio.ByteBuffer.wrap, but when the data was printed out, it was printed as a straight series of bytes, without being passed through the avro hexadecimal representation:
"data": {"bytes": ".....some gibberish other than the expected format...}
That doesn't seem right. According to the avro documentation, the example bytes object they give says that I need to put in a json object, an example of which looks like "\u00FF", and what I have put in there is clearly not of that format. What I now want to know is the following:
- What is an example of an avro bytes format? Does it look something like "\uDEADBEEFDEADBEEF..."?
- How do I coerce my binary avro data (as output by the BinaryEncoder into a byte[] array) into a format that I can stick into the GenericRecord object and have it print correctly in JSON? For example, I want an Object DATA for which I can call on some GenericRecord "someRecord.put("data", DATA);" with my avro serialized data inside?
- How would I then read that data back into a byte array on the other (consumer) end, when it is given the text JSON representation and wants to recreate the GenericRecord as represented by the AvroContainer-format JSON?
- (reiterating the question from before) Is there a better way I could be doing all this?