Hive UDF for selecting all except some columns
Asked Answered
S

1

11

The common query building pattern in HiveQL (and SQL in general) is to either select all columns (SELECT *) or an explicitly-specified set of columns (SELECT A, B, C). SQL has no built-in mechanism for selecting all but a specified set of columns.

There are various mechanisms for excluding some columns as outlined in this SO question but none apply naturally to HiveQL. (For example, the idea to create a temporary table with SELECT * then ALTER TABLE DROP some of its columns would wreak havoc in a big data environment.)

Ignoring the ideological discussion about whether it is a good idea to select all but some columns, this question is about the possible ways to extend Hive with this capability.

Prior to Hive 0.13.0 SELECT could take regular-expression-based columns, e.g., property_.* inside a backtick-quoted string. @invoketheshell's answer below refers to this capability but it comes at a cost, which is that, when this capability is on, Hive cannot accept columns with non-standard characters in them, e.g., $foo or x/y. That's why the Hive developers turned this behavior off by default in 0.13.0. I am looking for a generic solution that works for any column name.

A generic table-generating UDF (UDTF) could certainly do this because it can manipulate the schema. Since we are not going to generate new rows, is there a way to solve this problem using a simple row-based UDF?

This seems like a common problem with many posts around the Web showing how to solve it for various databases yet I haven't been able to find a solution for Hive. Is there code somewhere that does this?

Snore answered 28/7, 2015 at 3:30 Comment(2)
This unresolved bug may cause problems, since it means you couldn't do a UDF e.g. all_except(*, excluded_column1, excluded_column2) issues.apache.org/jira/browse/HIVE-1459Rienzi
Thanks @mattinbits. I voted for the issue and encourage others to do so as well.Snore
N
9

You can choose every column except those listed in a regex based specification. This is query columns by exclusion. See below:

A SELECT statement can take regex-based column specification in Hive releases prior to 0.13.0, or in 0.13.0 and later releases if the configuration property hive.support.quoted.identifiers is set to none.

That being said you could create a new table or view using the following, and all the columns except the columns specified will be returned:

hive.support.quoted.identifiers=none;    

drop table if       exists database.table_name;
create table if not exists database.table_name as
    select `(column_to_remove_1|...|column_to_remove_N)?+.+`
    from database.some_table
    where 
    --...
;

This will create a table that has all the columns from some_table except the columns named column_to_remove_1, ... , to column_to_remove_N. You can also choose to create a view instead.

Ninefold answered 28/7, 2015 at 11:34 Comment(2)
You make a good point but the behavior you describe is not generic as it disallows columns with non-standard characters in them, which is why it is disabled by default in 0.13.0 and later versions. I have updated the question to clarify that I seek a behavior that does not depend on this setting.Snore
One suggestion: in the version of Hive sql I run at work, we have to add the keyword set in front of hive.support.quoted.identifiers=none; .Sendal

© 2022 - 2024 — McMap. All rights reserved.