Invalidate metadata / refresh Impala from Spark code
I'm working on an NRT (near-real-time) solution that requires me to frequently update the metadata on an Impala table.

Currently this invalidation is done after my spark code has run. I would like to speed things up by doing this refresh/invalidate directly from my Spark code.

What would be the most efficient approach?

  • Oozie is just too slow (30 sec overhead? no thanks)
  • An SSH action to an (edge) node seems like a valid solution but feels "hackish"
  • I don't see a way to do this from the HiveContext in Spark either.
Tomcat answered 6/7, 2016 at 9:29 Comment(1)
About Spark HiveContext: it enables a job to interact with the Hive Metastore, in client/server mode. But it is completely unaware of what other jobs are doing against the Metastore at the same time -- i.e. other Spark jobs, Pig jobs, Impala queries, Hive CLI queries, HiveServer2 queries, Hue browsing sessions... – Middling
The REFRESH and INVALIDATE METADATA commands are specific to Impala.
You must be connected to an Impala daemon to be able to run them -- they trigger a refresh of the Impala-specific metadata cache (in your case you probably just need a REFRESH of the list of files in each partition, not a wholesale INVALIDATE that rebuilds the list of all partitions and all their files from scratch).

You could use the Spark SqlContext to connect to Impala via JDBC and read data -- but not run arbitrary commands. Damn. So you are back to the basics:

  • download the latest Cloudera JDBC driver for Impala
  • install it on the server where you run your Spark job
  • add all its JARs to your spark.driver.extraClassPath and spark.executor.extraClassPath properties
  • develop some Scala code to open a JDBC session against an Impala daemon and run arbitrary commands (such as REFRESH somedb.sometable) -- the hard way

Hopefully Google will find some examples of JDBC/Scala code such as this one
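In case that link goes stale, here is a minimal sketch of the JDBC approach described above. The driver class name follows the Cloudera Impala JDBC 4.1 driver convention; the host, port, and any Kerberos options in the URL are placeholders you must adapt to your cluster:

```scala
import java.sql.DriverManager

// Hypothetical helper: opens a JDBC session against an Impala daemon
// and issues a REFRESH for one table. The driver class name below is
// the one shipped with the Cloudera JDBC 4.1 driver; check the docs
// of the driver version you actually installed.
object ImpalaRefresh {
  // Build the statement; back-ticks guard identifiers that clash
  // with reserved words.
  def refreshStatement(db: String, table: String): String =
    s"REFRESH `$db`.`$table`"

  def refresh(jdbcUrl: String, db: String, table: String): Unit = {
    Class.forName("com.cloudera.impala.jdbc41.Driver")
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      val stmt = conn.createStatement()
      try stmt.execute(refreshStatement(db, table))
      finally stmt.close()
    } finally conn.close()
  }
}
```

Called from your Spark driver code, e.g. ImpalaRefresh.refresh("jdbc:impala://impala-daemon-host:21050", "somedb", "sometable") -- with a Kerberized cluster you would append the AuthMech/KrbRealm options documented by the driver to the URL.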

Middling answered 6/7, 2016 at 11:15 Comment(0)
Seems this has been fixed by Impala 3.3.0 (cf. Section "Metadata Performance Improvements" here):

Automatic invalidation of metadata

With automatic metadata management enabled, you no longer have to issue INVALIDATE / REFRESH in a number of conditions.

In Impala 3.3, the following additional event in Hive Metastore can trigger automatic INVALIDATE / REFRESH of Metadata:

  • INSERT into tables and partitions from Impala or from Spark on the same or multiple cluster configuration
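Event-based automatic invalidation is driven by Hive Metastore notification events, and has to be switched on explicitly. A minimal sketch, assuming Impala 3.3+ and access to the catalogd startup options (verify the flag name and default against your version's documentation):

```
# catalogd startup flag: poll HMS notification events every 2 seconds;
# a value of 0 disables event-based metadata updates
--hms_event_polling_interval_s=2
```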
Pisistratus answered 16/10, 2019 at 10:14 Comment(0)
All the above steps are not required; you can shell out to impala-shell from your code and run the INVALIDATE METADATA query against the Impala table:

import os
impala_node_ip_address = "XX.XX.XX.XX"
impala_query = 'impala-shell -i "%s" -k -q "invalidate metadata DBNAME.TableName"' % impala_node_ip_address
os.system(impala_query)
Vasilikivasilis answered 16/4, 2019 at 9:42 Comment(1)
As stated in the question, I would like to do it from my code, not as an external script. There is/was no other option than going the JDBC route. – Tomcat

© 2022 - 2024 — McMap. All rights reserved.