Where is an Avro schema stored when I create a Hive table with the 'STORED AS AVRO' clause?

There are at least two different ways of creating a Hive table backed by Avro data:

  1. Creating a table based on an Avro schema (in this example, stored in hdfs):

    CREATE TABLE users_from_avro_schema
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/avro/schema/user.avsc');

  2. Creating a table by specifying hive columns explicitly with STORED AS AVRO clause:

    CREATE TABLE users_stored_as_avro( id INT, name STRING ) STORED AS AVRO;

Am I correct that in the first case the metadata of the users_from_avro_schema table is not stored in the Hive Metastore, but inferred by the SerDe class reading the Avro schema file? Or maybe the table metadata is stored in the Metastore when the table is created, but then what is the policy for synchronising the Hive metadata with the Avro schema? I mean both cases:

  1. updating the table metadata (adding/removing columns) and
  2. updating the Avro schema by changing the avro.schema.url property.

In the second case, when I call DESCRIBE FORMATTED users_stored_as_avro, there is no avro.schema.* property defined, so I don't know which Avro schema is used to read/write the data. Is it generated dynamically based on the table's metadata stored in the Metastore?

This fragment of the Programming Hive book discusses inferring information about columns from the SerDe class, but on the other hand HIVE-4703 removes the 'from deserializer' info from column comments. How can I check, then, what the source of the column types for a given table is (the Metastore or the Avro schema)?

Counterfoil answered 30/5, 2017 at 7:45 Comment(1)
Just an additional question: to support schema evolution, EXTERNAL tables should be used, right? – Disaffirm

I decided to publish a complementary answer to those given by @DuduMarkovitz.

To make the code examples more concise, let's note that the STORED AS AVRO clause is equivalent to these three lines:

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
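
You can verify this equivalence yourself: DESCRIBE FORMATTED prints the SerDe and input/output formats that Hive registered for a table. A trimmed run against the users_stored_as_avro table from the question (the exact layout varies by Hive version):

hive> DESCRIBE FORMATTED users_stored_as_avro;
...
SerDe Library:       org.apache.hadoop.hive.serde2.avro.AvroSerDe
InputFormat:         org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
OutputFormat:        org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat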

Let's take a look, then, at what happens when we create a table that references an Avro schema stored in HDFS. Here is the schema:

{
  "namespace": "io.sqooba",
  "name": "user",
  "type": "record",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"}
  ]
}
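
Assuming the schema above is saved locally as user.avsc, we first upload it to HDFS; one way is the dfs command available from inside the Hive CLI (the target directory is simply the one used throughout this example):

hive> dfs -put user.avsc /user/tulinski/;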

We create our table with the following command:

CREATE TABLE users_from_avro_schema
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/tulinski/user.avsc');

Hive has inferred the schema properly, which we can see by calling:

hive> DESCRIBE users_from_avro_schema;
OK
id                      int
name                    string

Hive Metastore shows us the same (I use @DuduMarkovitz's query):

+------------------------+-------------+-------------+-----------+
| tbl_name               | column_name | integer_idx | type_name |
+------------------------+-------------+-------------+-----------+
| users_from_avro_schema | id          |           0 | int       |
| users_from_avro_schema | name        |           1 | string    |
+------------------------+-------------+-------------+-----------+

So far, so good, everything works as we expect. But let's see what happens when we update the avro.schema.url property to point to the next version of our schema (user_v2.avsc), which is as follows:

{
  "namespace": "io.sqooba",
  "name": "user",
  "type": "record",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default":null}
  ]
}

We simply added another field, called email.
Now we update the table property to point to the new Avro schema in HDFS:

ALTER TABLE users_from_avro_schema SET TBLPROPERTIES('avro.schema.url'='hdfs:///user/tulinski/user_v2.avsc');
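
We can read the property back to double-check that the change took effect (standard HiveQL; the output is just the property value):

hive> SHOW TBLPROPERTIES users_from_avro_schema('avro.schema.url');
OK
hdfs:///user/tulinski/user_v2.avsc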

Has the table metadata been changed?

hive> DESCRIBE users_from_avro_schema;
OK
id                      int
name                    string
email                   string

Yeah, cool! But do you expect the Hive Metastore to contain this additional column now?
Unfortunately, nothing changed in the Metastore:

+------------------------+-------------+-------------+-----------+
| tbl_name               | column_name | integer_idx | type_name |
+------------------------+-------------+-------------+-----------+
| users_from_avro_schema | id          |           0 | int       |
| users_from_avro_schema | name        |           1 | string    |
+------------------------+-------------+-------------+-----------+

I suspect that Hive uses the following strategy for inferring the schema: it first tries to obtain it from the SerDe class specified for the table, and only when the SerDe cannot provide the schema does Hive look into the Metastore.
Let's check that by removing the avro.schema.url property:

hive> ALTER TABLE users_from_avro_schema UNSET TBLPROPERTIES ('avro.schema.url');
OK
Time taken: 0.33 seconds
hive> DESCRIBE users_from_avro_schema;
OK
id                      int
name                    string
Time taken: 0.363 seconds, Fetched: 2 row(s)

DESCRIBE now shows us the data stored in the Metastore. Let's modify it by adding a column:

ALTER TABLE users_from_avro_schema ADD COLUMNS (phone string);

This, of course, changes the Hive Metastore:

+------------------------+-------------+-------------+-----------+
| tbl_name               | column_name | integer_idx | type_name |
+------------------------+-------------+-------------+-----------+
| users_from_avro_schema | id          |           0 | int       |
| users_from_avro_schema | name        |           1 | string    |
| users_from_avro_schema | phone       |           2 | string    |
+------------------------+-------------+-------------+-----------+

But when we set avro.schema.url back to user_v2.avsc, what is in the Hive Metastore no longer matters:

hive> ALTER TABLE users_from_avro_schema SET TBLPROPERTIES('avro.schema.url'='hdfs:///user/tulinski/user_v2.avsc');
OK
Time taken: 0.268 seconds
hive> DESCRIBE users_from_avro_schema;
OK
id                      int
name                    string
email                   string

The Avro schema takes precedence over the Metastore.

The above example shows that we should avoid mixing Hive schema changes with Avro schema evolution, because otherwise we can easily end up with an inconsistency between the Hive Metastore and the actual schema used while reading and writing the data. The first inconsistency occurs as soon as we change the Avro schema definition by updating the avro.schema.url property, but we can live with that if we are aware of Hive's strategy for inferring the schema. I haven't checked Hive's source code to confirm whether my suspicions about this logic are correct, but the example above convinces me of what happens underneath.
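
If you do need to evolve the schema, a safer pattern (a sketch; user_v3.avsc is a hypothetical next version) is to make every change through a new schema file and only touch avro.schema.url, never ALTER TABLE ... ADD/CHANGE COLUMNS:

hive> dfs -put user_v3.avsc /user/tulinski/;
hive> ALTER TABLE users_from_avro_schema SET TBLPROPERTIES ('avro.schema.url'='hdfs:///user/tulinski/user_v3.avsc');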

I extended my answer to show that even when there is a conflict between the Avro schema and the Hive Metastore, data that complies with the Avro schema can still be read. Please have a look again at my example above. Our table definition points to an Avro schema having three fields:

id    int
name  string
email string

whereas in Hive Metastore there are the following columns:

id    int
name  string
phone string

Note the mismatch: email vs phone.
Let's create an Avro file containing a single user record that complies with the user_v2.avsc schema. This is its JSON representation:

{
  "id": 123,
  "name": "Tomek",
  "email": {"string": "tomek@tomek"}
}

To create the Avro file we call:

java -jar avro-tools-1.8.2.jar fromjson --schema-file user_v2.avsc user_tomek_v2.json > user_tomek_v2.avro
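
To make that record queryable, we copy the file into the table's HDFS directory. The warehouse path below is an assumption; check the Location field of DESCRIBE FORMATTED for the actual directory of your table:

hive> dfs -put user_tomek_v2.avro /user/hive/warehouse/users_from_avro_schema/;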

We are able to query our table despite the fact that the Hive Metastore doesn't contain an email column and contains a phone column instead:

hive> set hive.cli.print.header=true;
hive> select * from users_from_avro_schema;
OK
users_from_avro_schema.id   users_from_avro_schema.name users_from_avro_schema.email
123 Tomek   tomek@tomek
Counterfoil answered 31/5, 2017 at 10:53 Comment(0)

The following refers to the use case where no schema file is involved.

The schema is stored in 2 places:

  1. The Metastore
  2. As part of the data files

All the information for the DESC/SHOW commands is taken from the Metastore.
Every DDL change impacts only the Metastore.

When you query the data, the matching between the 2 schemas is done by the column names.
If there is a mismatch in the column types, you'll get an error.

Demo

create table mytable 
stored as avro 
as 
select  1               as myint
       ,'Hello'         as mystring
       ,current_date    as mydate
;

select * from mytable
;

+-------+----------+------------+
| myint | mystring |   mydate   |
+-------+----------+------------+
|     1 | Hello    | 2017-05-30 |
+-------+----------+------------+

Metastore

select      c.column_name
           ,c.integer_idx
           ,c.type_name

from                metastore.DBS        as d
            join    metastore.TBLS       as t on t.db_id = d.db_id
            join    metastore.SDS        as s on s.sd_id = t.sd_id
            join    metastore.COLUMNS_V2 as c on c.cd_id = s.cd_id

where       d.name     = 'local_db'
        and t.tbl_name = 'mytable'

order by    integer_idx

+-------------+-------------+-----------+
| column_name | integer_idx | type_name |
+-------------+-------------+-----------+
| myint       |           0 | int       |
| mystring    |           1 | string    |
| mydate      |           2 | date      |
+-------------+-------------+-----------+

avro-tools

bash-4.1$ avro-tools getschema 000000_0 

{
  "type" : "record",
  "name" : "mytable",
  "namespace" : "local_db",
  "fields" : [ {
    "name" : "myint",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "mystring",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "mydate",
    "type" : [ "null", {
      "type" : "int",
      "logicalType" : "date"
    } ],
    "default" : null
  } ]
}

alter table mytable change myint dummy1 int;

select * from mytable;

+--------+----------+------------+
| dummy1 | mystring |   mydate   |
+--------+----------+------------+
| (null) | Hello    | 2017-05-30 |
+--------+----------+------------+

alter table mytable add columns (myint int);

select * from mytable;

+--------+----------+------------+-------+
| dummy1 | mystring |   mydate   | myint |
+--------+----------+------------+-------+
| (null) | Hello    | 2017-05-30 |     1 |
+--------+----------+------------+-------+

Metastore

+-------------+-------------+-----------+
| column_name | integer_idx | type_name |
+-------------+-------------+-----------+
| dummy1      |           0 | int       |
| mystring    |           1 | string    |
| mydate      |           2 | date      |
| myint       |           3 | int       |
+-------------+-------------+-----------+

avro-tools
(same schema as the original one)

bash-4.1$ avro-tools getschema 000000_0 

{
  "type" : "record",
  "name" : "mytable",
  "namespace" : "local_db",
  "fields" : [ {
    "name" : "myint",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "mystring",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "mydate",
    "type" : [ "null", {
      "type" : "int",
      "logicalType" : "date"
    } ],
    "default" : null
  } ]
}

Any work against the table is done based on the metadata stored in the Metastore.
When the table is queried, additional metadata is used: the metadata stored in the data file.
The query result structure is constructed from the Metastore (see in my example that 4 columns are returned after the table was altered).
The data returned depends on both schemas: a field with a specific name in the file schema is mapped to the column with the same name in the Metastore schema.
If the names match but the datatypes don't, an error will arise.
A field from the data file that does not have a corresponding column name in the Metastore will not be presented.
A column in the Metastore without a corresponding field in the data file schema will hold NULL values.
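
For example, the type-mismatch case can be reproduced with a sketch like the following (the exact error text depends on the Hive version). Changing mystring's Metastore type to int should make the next read fail, because the file schema still says string:

alter table mytable change mystring mystring int;

select * from mytable;
-- fails at read time: the Avro data holds a string where the Metastore now expects an int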

Evelyn answered 30/5, 2017 at 11:5 Comment(7)
Thank you @DuduMarkovitz for your answer and the example you gave. I've also made some tests with schema evolution and I'm going to share the results. – Counterfoil
Referring to your answer @DuduMarkovitz, do I understand correctly that in your example the Avro schema (corresponding to the Hive table definition) is not stored in any single location, but is generated from the Hive metadata each time Hive stores data in HDFS (in order to embed it in the output Avro file(s))? What about reading, then? I assume that Hive needs a single Avro schema for reading purposes, so the only way I see is that the schema is generated from the current table definition taken from the Metastore (the same way as for writing). Am I right? – Counterfoil
I still see a difference between creating a table with a Hive schema (STORED AS AVRO) and the case when the Hive schema is inferred from an Avro schema file passed via the avro.schema.* table properties (which can then be updated). I'll show what I mean with a more specific example soon. – Counterfoil
I cannot agree with your last two statements: "A field from the data file that does not have a corresponding column name in the Metastore will not be presented. A column in the Metastore without a corresponding field in the data file schema will hold NULL values." Please have a look at my updated example. – Counterfoil
My answer, as can be seen from the examples given, refers only to the 2nd use case, where there is no schema file. – Heymann
OK @Dudu, but someone could read those statements as general rules (I mean that the Hive Metastore always takes part in schema recognition, which is not true). Maybe it is a good idea to clarify that in your answer. – Counterfoil
Clarified. I didn't have the time to present the 1st use case, but I see you applied my methodology to answer it yourself. – Heymann

This may be a little outdated now, considering that table formats like Iceberg and Delta take care of schema changes and evolution (which was the main selling point of Avro to begin with).

Combining Tomek's and David's answers, is it safe to conclude that for Hive tables created using "stored as avro", field metadata is inferred/referenced/extracted from 3 different sources:

  1. the schema embedded within the data, written when the Avro file is created (see Tomek's answer for how to create one from JSON)
  2. the Hive Metastore, updated using ADD/REMOVE COLUMNS
  3. the .avsc schema URL, updated using the avro.schema.url property (this one is optional)

Here is my interpretation (I will update my answer as I learn more) of the logic followed in the Hive code for the different operations; a short checklist for inspecting all three sources follows the list:

  1. select: the schema within the data takes precedence, then the Hive Metastore; an error is raised only if a name matches but the datatype does not
  2. describe/show: the .avsc schema URL takes precedence, then the Hive Metastore
  3. insert: obviously the schema within the data, but there is some validation done against the .avsc schema URL
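
A hedged checklist combining the commands already shown in this thread for working out where a given table's column types come from:

DESCRIBE FORMATTED mytable;   -- SerDe library, table Location, and Table Parameters (including avro.schema.url, if set)
SHOW TBLPROPERTIES mytable;   -- all table properties, including any avro.schema.* entries
-- plus: avro-tools getschema <data file>, for the schema embedded in the data itself
-- plus: the COLUMNS_V2 query against the Metastore from David's answer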
Grazynagreabe answered 14/4, 2023 at 14:53 Comment(0)
