How do I use Spark ORC indexes?
What is the option to enable ORC indexing from Spark?

    df.write()
      .option("mode", "DROPMALFORMED")
      .option("compression", "snappy")
      .mode("overwrite")
      .format("orc")
      .option("index", "user_id")
      .save(...);

I made up .option("index", "user_id"); what would I actually have to put there to index the column "user_id" in ORC?

Dyane answered 29/10, 2017 at 21:9

Have you tried .partitionBy("user_id")?

    df.write()
      .option("mode", "DROPMALFORMED")
      .option("compression", "snappy")
      .mode("overwrite")
      .format("orc")
      .partitionBy("user_id")
      .save(...)
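
For what it's worth, partitionBy gives you Hive-style directory partitioning rather than an in-file index: Spark writes one subdirectory per distinct user_id value, and a filter on that column prunes whole directories at read time. Here is a minimal sketch of that behavior; the output path and the sample data are made up for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Illustrative data; any DataFrame with a user_id column works.
    df = spark.createDataFrame(
        [(1, "click"), (1, "view"), (2, "click")],
        ["user_id", "event"],
    )

    (df.write
       .mode("overwrite")
       .format("orc")
       .partitionBy("user_id")
       .save("/tmp/events"))  # hypothetical path

    # Resulting layout: one directory per distinct user_id, not an index:
    #   /tmp/events/user_id=1/part-....orc
    #   /tmp/events/user_id=2/part-....orc

    # A filter on the partition column prunes whole directories at read time:
    spark.read.orc("/tmp/events").where("user_id = 1").show()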
Pacifist answered 8/11, 2017 at 18:8
I think partitionBy will create a new file per user rather than create an index. But you're the only one who answered, so I'm giving you the bounty. - Dyane
@Dyane I am researching this. Will let you know soon. - Intermingle
@Achyuth, have you found any approach to create an index in an ORC file? I have found nothing so far. It seems to me the only way to leverage an index in an ORC file is through Hive. Please correct me if that's wrong. Thanks! - Faun
@JamesGan That's true. The file is already indexed on many things: statistics, column offsets, and so on. The best way to get good performance is to create ORC files that are around 200 to 300 MB each. - Intermingle
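
A minimal sketch of that sizing advice, assuming you already know roughly how much compressed output you are writing; the ~256 MB target, the size estimate, and the output path are all illustrative, since Spark of that era had no built-in target-file-size knob:

    # df is an existing DataFrame, as in the question above.
    target_file_bytes = 256 * 1024 * 1024      # aim for ~256 MB per file
    estimated_output_bytes = 10 * 1024 ** 3    # assume ~10 GB of output

    num_files = max(1, estimated_output_bytes // target_file_bytes)

    (df.repartition(int(num_files))  # one output file per partition
       .write
       .mode("overwrite")
       .format("orc")
       .save("/tmp/events_sized"))   # hypothetical path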

According to the original blog post on bringing ORC support into Apache Spark, there is a configuration knob to turn on in your Spark context: it pushes filters down into the ORC reader, which is what lets ORC's built-in indexes be used.

# enable filter pushdown so Spark can use ORC's built-in indexes
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

Reference: https://databricks.com/blog/2015/07/16/joint-blog-post-bringing-orc-support-into-apache-spark.html
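
ORC files already store min/max statistics per stripe and per row group, so the "index" largely comes for free when the file is written; the setting above just lets Spark use it. A rough end-to-end sketch on a Spark 2.x+ session, where df, the path, and the filter value are assumptions for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    # Push filters down into the ORC reader so stripe-level
    # min/max statistics can be used to skip data.
    spark.conf.set("spark.sql.orc.filterPushdown", "true")

    # Sorting within partitions clusters user_id values, which keeps
    # each stripe's min/max range narrow and therefore useful.
    (df.sortWithinPartitions("user_id")
       .write
       .mode("overwrite")
       .format("orc")
       .save("/tmp/events_indexed"))  # hypothetical path

    # This filter can now be evaluated against stripe statistics
    # instead of scanning every row.
    spark.read.orc("/tmp/events_indexed").filter(col("user_id") == 42).show()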

Wadesworth answered 23/2, 2020 at 9:15
