Invalidate metadata / refresh Impala from Spark code
I'm working on an NRT (near-real-time) solution that requires me to frequently update the metadata on an Impala table.

Currently this invalidation is done after my spark code has run. I would like to speed things up by doing this refresh/invalidate directly from my Spark code.

What would be the most efficient approach?

  • Oozie is just too slow (30 sec overhead? no thanks)
  • An SSH action to an (edge) node seems like a valid solution but feels "hackish"
  • I don't see a way to do this from the HiveContext in Spark either.
Tomcat answered 6/7, 2016 at 9:29 Comment(1)
About Spark HiveContext: it enables a job to interact with the Hive Metastore, in client/server mode. But it is completely unaware of what other jobs are doing against the Metastore at the same time -- i.e. other Spark jobs, Pig jobs, Impala queries, Hive CLI queries, HiveServer2 queries, Hue browsing sessions... – Middling
The REFRESH and INVALIDATE METADATA commands are specific to Impala.
You must be connected to an Impala daemon to be able to run them -- they trigger a refresh of the Impala-specific metadata cache (in your case you probably just need a REFRESH of the list of files in each partition, not a wholesale INVALIDATE that rebuilds the list of all partitions and all their files from scratch).

You could use the Spark SqlContext to connect to Impala via JDBC and read data -- but not run arbitrary commands. Damn. So you are back to the basics:

  • download the latest Cloudera JDBC driver for Impala
  • install it on the server where you run your Spark job
  • add all its JARs to your spark.driver.extraClassPath and spark.executor.extraClassPath properties
  • develop some Scala code to open a JDBC session against an Impala daemon and run arbitrary commands (such as REFRESH somedb.sometable) -- the hard way

Hopefully Google will find some examples of JDBC/Scala code such as this one
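In case that link goes stale, here is a minimal sketch of the JDBC approach described above. The driver class name follows the Cloudera Impala JDBC 4.1 driver convention; the host, port, and any Kerberos options in the URL are placeholders you must adapt to your cluster:

```scala
import java.sql.DriverManager

// Hypothetical helper: opens a JDBC session against an Impala daemon
// and issues a REFRESH for one table. The driver class name below is
// the one shipped with the Cloudera JDBC 4.1 driver; check the docs
// of the driver version you actually installed.
object ImpalaRefresh {
  // Build the statement; back-ticks guard identifiers that clash
  // with reserved words.
  def refreshStatement(db: String, table: String): String =
    s"REFRESH `$db`.`$table`"

  def refresh(jdbcUrl: String, db: String, table: String): Unit = {
    Class.forName("com.cloudera.impala.jdbc41.Driver")
    val conn = DriverManager.getConnection(jdbcUrl)
    try {
      val stmt = conn.createStatement()
      try stmt.execute(refreshStatement(db, table))
      finally stmt.close()
    } finally conn.close()
  }
}
```

Called from your Spark driver code, e.g. ImpalaRefresh.refresh("jdbc:impala://impala-daemon-host:21050", "somedb", "sometable") -- with a Kerberized cluster you would append the AuthMech/KrbRealm options documented by the driver to the URL.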

Middling answered 6/7, 2016 at 11:15 Comment(0)
Seems this has been fixed by Impala 3.3.0 (cf. Section "Metadata Performance Improvements" here):

Automatic invalidation of metadata

With automatic metadata management enabled, you no longer have to issue INVALIDATE / REFRESH in a number of conditions.

In Impala 3.3, the following additional event in Hive Metastore can trigger automatic INVALIDATE / REFRESH of Metadata:

  • INSERT into tables and partitions from Impala or from Spark on the same or multiple cluster configuration
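Event-based automatic invalidation is driven by Hive Metastore notification events, and has to be switched on explicitly. A minimal sketch, assuming Impala 3.3+ and access to the catalogd startup options (verify the flag name and default against your version's documentation):

```
# catalogd startup flag: poll HMS notification events every 2 seconds;
# a value of 0 disables event-based metadata updates
--hms_event_polling_interval_s=2
```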
Pisistratus answered 16/10, 2019 at 10:14 Comment(0)
All the above steps are not required; you can shell out to impala-shell from your code and run the INVALIDATE METADATA query against the Impala table:

import os
impala_node_ip_address = "XX.XX.XX.XX"
impala_query = 'impala-shell -i "%s" -k -q "invalidate metadata DBNAME.TableName"' % impala_node_ip_address
os.system(impala_query)
Vasilikivasilis answered 16/4, 2019 at 9:42 Comment(1)
As stated in the question, I would like to do it from my code, not as an external script. There is/was no other option than going the JDBC route. – Tomcat

© 2022 - 2024 — McMap. All rights reserved.