Query Parquet data through Vertica (Vertica Hadoop Integration)

So I have a Hadoop cluster with three nodes, with Vertica co-located on the cluster. There are Parquet files (partitioned by Hive) on HDFS. My goal is to query those files using Vertica.

Right now what I have done is use the HDFS Connector: basically, create an external table in Vertica, then link it to HDFS:

CREATE EXTERNAL TABLE tableName (columns)
AS COPY FROM 'hdfs://hostname/...../data' PARQUET;

Since the data size is big, this method does not achieve good performance.

I have done some research; see Vertica Hadoop Integration.

I have tried HCatalog, but there's a configuration error on my Hadoop side, so that's not working.

My use case is to keep the data format on HDFS (Parquet) unchanged while querying it with Vertica. Any ideas on how to do that?

EDIT: The only reason Vertica got slow performance is that it can't use Parquet's partitions. With a newer Vertica version (8+), it can utilize Hive's partition metadata, so no HCatalog is needed.

Mika answered 5/5, 2017 at 15:11 Comment(0)

Terminology note: you're not using the HDFS Connector, which is good, as it's deprecated as of 8.0.1. You're using the direct interface described in Reading Hadoop Native File Formats, with libhdfs++ (the hdfs scheme) rather than WebHDFS (the webhdfs scheme). That's all good so far. (You can also use the HCatalog Connector, but you need to do some additional configuration, and it will not be faster than an external table.)

Your Hadoop cluster has only 3 nodes and Vertica is co-located on them, so you should be getting the benefits of node locality automatically -- Vertica will use the nodes that have the data locally when planning queries.

You can improve query performance by partitioning and sorting the data so Vertica can use predicate pushdown, and also by compressing the Parquet files. You said you don't want to change the data so maybe these suggestions don't work for you; they're not specific to Vertica so they might be worth considering anyway. (If you're using other tools to interact with your Parquet data, they'll benefit from these changes too.) The documentation of these techniques was improved in 8.0.x (link is to 8.1 but this was in 8.0.x too).
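If you do end up rewriting the data, a partitioned, sorted, compressed layout can be produced from Hive itself. This is only a sketch under assumptions: the table and column names (events_parquet, events_raw, id, name, created, region) are illustrative, not from the question.

```sql
-- Hive sketch (illustrative names): write Parquet partitioned by the columns
-- you filter on, compressed, and sorted within each file.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE events_parquet (id INT, name STRING)
PARTITIONED BY (created DATE, region STRING)
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression'='SNAPPY');

-- Dynamic-partition insert; SORT BY orders rows within each output file,
-- which helps min/max-based predicate pushdown on id.
INSERT OVERWRITE TABLE events_parquet PARTITION (created, region)
SELECT id, name, created, region
FROM events_raw
SORT BY id;
```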

Additional partitioning support was added in 8.0.1. It looks like you're using at least 8.0; I can't tell if you're using 8.0.1. If you are, you can create the external table to only pay attention to the partitions you care about with something like:

CREATE EXTERNAL TABLE t (id int, name varchar(50), 
                        created date, region varchar(50))
AS COPY FROM 'hdfs:///path/*/*/*' 
PARQUET(hive_partition_cols='created,region');
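Once the partition columns are declared this way, a query that filters on them lets Vertica skip whole partition directories instead of scanning every file. The literal values below are illustrative:

```sql
-- Only files under .../created=2017-01-15/region=east/ need to be read;
-- other partitions are pruned from the plan.
SELECT id, name
FROM t
WHERE created = '2017-01-15' AND region = 'east';
```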
Sanfred answered 5/5, 2017 at 15:50 Comment(4)
Hello Monica! You saved my life! Yes, my Parquet data is partitioned by Impala (month, day). But my Vertica version is 7.2.3. Do I have to upgrade Vertica so that it can use the query you wrote? – Mika
@Mika the hive_partition_cols argument for PARQUET (and ORC) was added in 8.0.1. (Sorry, I thought you might already be using 8.0 because I didn't remember the hdfs URI scheme being supported earlier, but I guess I was wrong about that.) – Sanfred
Yes, the hdfs URI is supported but hive_partition_cols is not. I'm upgrading my Vertica. Thank you so much! – Mika
@Mika BTW, 8.1 is out now, so if you're upgrading anyway you might want to consider jumping to the latest. – Sanfred

© 2022 - 2024 — McMap. All rights reserved.