I'm working on an NRT (near-real-time) solution that requires me to frequently update the metadata of an Impala table.
Currently this invalidation is done after my Spark code has run. I would like to speed things up by issuing the refresh/invalidate directly from my Spark code.
What would be the most efficient approach?
- Oozie is just too slow (30 sec overhead? no thanks)
- An SSH action to an (edge) node seems like a valid solution but feels "hackish"
- I don't see a way to do this from the HiveContext in Spark either.
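One workaround worth considering: since Impala speaks the HiveServer2 protocol, the Spark driver can open a plain JDBC connection to an impalad and issue the `REFRESH` itself once the write finishes. This is only a minimal sketch, assuming an unsecured cluster; the host `impalad-host`, port `21050`, and table name are placeholders to replace with your own:

    import java.sql.DriverManager

    // Hypothetical impalad host and HS2 port -- point this at an impalad
    // (or a load balancer in front of them) in your cluster. `auth=noSasl`
    // assumes an unsecured cluster; adjust for Kerberos/LDAP as needed.
    val impalaJdbcUrl = "jdbc:hive2://impalad-host:21050/;auth=noSasl"

    def refreshImpalaTable(table: String): Unit = {
      // The standard Hive JDBC driver works against Impala's HS2 endpoint.
      Class.forName("org.apache.hive.jdbc.HiveDriver")
      val conn = DriverManager.getConnection(impalaJdbcUrl)
      try {
        val stmt = conn.createStatement()
        try {
          // REFRESH is cheaper than INVALIDATE METADATA: it reloads file and
          // block metadata for an existing table rather than discarding the
          // whole catalog entry.
          stmt.execute(s"REFRESH $table")
        } finally {
          stmt.close()
        }
      } finally {
        conn.close()
      }
    }

    // Run on the driver, after the Spark job has finished writing:
    refreshImpalaTable("mydb.my_table")

This keeps everything inside the Spark application (no Oozie action, no SSH to an edge node), at the cost of shipping the Hive JDBC driver jar with the job.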
`HiveContext`: it enables a job to interact with the Hive Metastore, in client/server mode. But it is completely unaware of what other jobs are doing against the Metastore at the same time -- i.e. other Spark jobs, Pig jobs, Impala queries, Hive CLI queries, HiveServer2 queries, Hue browsing sessions... – Middling
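To illustrate the comment's point (a sketch only; the table and partition names are hypothetical): a `HiveContext` DDL statement does land in the shared Metastore, so Hive-side clients see it, but Impala's catalog daemon keeps its own cached copy of that metadata and stays stale until a `REFRESH`/`INVALIDATE METADATA` is issued against an impalad:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("metastore-demo"))
    val hiveContext = new HiveContext(sc)

    // This DDL goes through the Hive Metastore, so Hive and other
    // HiveContext users see the new partition immediately...
    hiveContext.sql(
      "ALTER TABLE mydb.my_table ADD IF NOT EXISTS PARTITION (dt='2016-01-01')")

    // ...but Impala's catalogd caches metadata independently, so an Impala
    // query against mydb.my_table still reads stale metadata until someone
    // runs REFRESH mydb.my_table on an impalad -- which is why the JDBC
    // call above is needed in the first place.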