How to get the value of the location for a Hive table using a Spark object?

7

17

I am interested in retrieving the location value of a Hive table given a Spark object (SparkSession). One way to obtain this value is by parsing the output of the following SQL query:

describe formatted <table name>

I was wondering if there is another way to obtain the location value without having to parse the output. An API would be great, in case the output of the above command changes between Hive versions. If an external dependency is needed, which would it be? Is there some sample Spark code that can obtain the location value?

Typebar answered 6/1, 2019 at 10:27 Comment(0)
3

First approach

You can use input_file_name with a DataFrame.

It will give you the absolute file path of a part file.

import org.apache.spark.sql.functions.input_file_name

spark.read.table("zen.intent_master").select(input_file_name()).take(1)

And then extract the table path from it.
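For example, a minimal sketch of the extraction step (the table name and the trailing-slash handling are assumptions, and it only works if the table actually contains at least one file):

import org.apache.spark.sql.functions.input_file_name

// Absolute path of one part file, e.g.
// hdfs://nn:8020/warehouse/zen.db/intent_master/part-00000-xxxx.parquet
val partFile = spark.read.table("zen.intent_master")
  .select(input_file_name())
  .head()
  .getString(0)

// Drop the file name to get the table directory (enough for unpartitioned tables;
// for partitioned tables you would also have to strip the partition directories).
val tableLocation = partFile.substring(0, partFile.lastIndexOf("/"))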

Second approach

It's more of a hack, you could say.

// Declared inside org.apache.spark.sql.hive so that Spark-internal classes such as
// HiveExternalCatalog (and its client) are accessible.
package org.apache.spark.sql.hive

import java.net.URI

import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.internal.{SessionState, SharedState}
import org.apache.spark.sql.SparkSession

class TableDetail {
  def getTableLocation(table: String, spark: SparkSession): URI = {
    val sessionState: SessionState = spark.sessionState
    val sharedState: SharedState = spark.sharedState
    val catalog: SessionCatalog = sessionState.catalog
    val sqlParser: ParserInterface = sessionState.sqlParser
    val client = sharedState.externalCatalog match {
      case catalog: HiveExternalCatalog => catalog.client
      case _: InMemoryCatalog => throw new IllegalArgumentException("In Memory catalog doesn't " +
        "support hive client API")
    }

    val idtfr = sqlParser.parseTableIdentifier(table)

    require(catalog.tableExists(idtfr), s"Table $idtfr does not exist")
    val rawTable = client.getTable(idtfr.database.getOrElse("default"), idtfr.table)
    rawTable.location
  }
}
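A minimal usage sketch, assuming the class above is on the classpath and the table exists:

import java.net.URI
import org.apache.spark.sql.hive.TableDetail

val location: URI = new TableDetail().getTableLocation("zen.intent_master", spark)
println(location)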
Graehme answered 6/1, 2019 at 11:50 Comment(6)
What if the hive table doesn't have any files? How can I get the location value for it?Typebar
@codeshark I have updated the answer with a second approach; hope this will work in your case.Graehme
What is "input_file_name" ?Ary
It is a Spark function. You can use it with import org.apache.spark.sql.functions._Graehme
You can look for documentation for more detail. spark.apache.org/docs/2.0.0/api/scala/…Graehme
@Graehme is it possible to fetch sizes for each of these files fetched using input_file_nameBosporus
13

Here is how to do it in PySpark:

(spark.sql("desc formatted mydb.myschema")
    .filter("col_name == 'Location'")
    .collect()[0].data_type)
Ophiology answered 4/2, 2020 at 21:40 Comment(0)
7

Here is the correct answer:

import org.apache.spark.sql.catalyst.TableIdentifier

lazy val tblMetadata = spark.sessionState.catalog.getTableMetadata(new TableIdentifier(tableName,Some(schema)))
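The location can then be read from the returned CatalogTable, for example (a small sketch; it assumes the table actually has a location set):

val location: java.net.URI = tblMetadata.location
// or, if you prefer the Option-wrapped value:
val locationOpt: Option[java.net.URI] = tblMetadata.storage.locationUri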
Ary answered 7/5, 2019 at 2:22 Comment(2)
@GuilhermedeLazari here it is spark._jsparkSession.sessionState().catalog().getTableMetadata(spark.sparkContext._jvm.org.apache.spark.sql.catalyst.TableIdentifier('table', spark.sparkContext._jvm.scala.Some('database'))).storage().locationUri().get()Unstick
I prefer this over submitting a new spark job (describe table).Foodstuff
7

You can also call the .toDF method on the output of desc formatted <table> and then filter the resulting DataFrame.

DataFrame API:

scala> :paste
spark.sql("desc formatted data_db.part_table")
.toDF // the result has 3 columns: col_name, data_type, comment
.filter('col_name === "Location") // keep only the Location row
.collect()(0)(1)
.toString

Result:

String = hdfs://nn:8020/location/part_table

(or)

RDD API:

scala> :paste
spark.sql("desc formatted data_db.part_table")
.collect()
.filter(r => r(0).equals("Location")) // keep only the Location row
.map(r => r(1)) // take only the location value
.mkString // convert to a string
.split("8020")(1) // adjust the split to your namenode port, etc.

Result:

String = /location/part_table
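If you would rather not hard-code the namenode port, a small sketch of an alternative that parses the same desc formatted output with java.net.URI (the table name is the one from above):

import java.net.URI

val location = spark.sql("desc formatted data_db.part_table")
  .filter('col_name === "Location")
  .head()
  .getString(1) // hdfs://nn:8020/location/part_table

val pathOnly = new URI(location).getPath // /location/part_table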
Cardinalate answered 30/10, 2019 at 18:26 Comment(1)
Thank you for the paste mode :)Selfpropelled
1

Use ExternalCatalog:

scala> spark
res15: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4eba6e1f

scala> val metastore = spark.sharedState.externalCatalog
metastore: org.apache.spark.sql.catalyst.catalog.ExternalCatalog = org.apache.spark.sql.hive.HiveExternalCatalog@24b05292

scala> val location = metastore.getTable("meta_data", "mock").location
location: java.net.URI = hdfs://10.1.5.9:4007/usr/hive/warehouse/meta_data.db/mock
Quackery answered 14/6, 2020 at 12:56 Comment(0)
1

Using the Spark catalog API should provide the needed information: getDatabase.

spark.catalog.getDatabase("database")

Output:

Database(name='database', catalog='hive_metastore', description='', locationUri='dbfs:/user/hive/warehouse/database.db')

The location of the table is locationUri/table_name

If it is configured using a cloud storage:

Database(name='database', catalog='hive_metastore', description='', locationUri='dbfs:/mnt/mount_point_of_the_cloud_storage/database/')

The location of the table in this case would be locationUri/table_name

This will work only if the tables were created inside the database's location. It should be much faster than running any kind of describe query, and it is safer than querying the metastore directly (which I strongly recommend against).
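A minimal sketch of that composition in Scala (the database and table names are placeholders):

val dbLocation = spark.catalog.getDatabase("database").locationUri
val tableLocation = s"$dbLocation/table_name" // only valid for tables created under the database location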

Otherwise, go with running the SQL describe schema database.

Vacuity answered 10/10, 2023 at 9:12 Comment(0)
0

Use this as a re-usable function in your Scala project:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

def getHiveTablePath(tableName: String, spark: SparkSession): String = {
  val sql: String = String.format("desc formatted %s", tableName)
  val result: DataFrame = spark.sql(sql).filter(col("col_name") === "Location")
  result.show(false) // just for debug purposes
  val info: String = result.collect().mkString(",")
  val path: String = info.split(',')(1)
  path
}

The caller would be:

    println(getHiveTablePath("src", spark)) // you can prefix the schema if you have one

Result (I executed locally, so file:/ appears below; on HDFS it would be hdfs://):

+--------+------------------------------------+-------+
|col_name|data_type                           |comment|
+--------+------------------------------------+-------+
|Location|file:/Users/hive/spark-warehouse/src|       |
+--------+------------------------------------+-------+

file:/Users/hive/spark-warehouse/src
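If you prefer not to split on commas (which breaks if the path ever contains one), a variant sketch that reads the value column directly; it assumes the desc formatted output keeps the col_name/data_type/comment columns shown above:

val path: String = spark.sql(String.format("desc formatted %s", tableName))
  .filter(col("col_name") === "Location")
  .head()
  .getAs[String]("data_type")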

Centralization answered 1/5, 2020 at 18:12 Comment(0)
