Could anyone please explain what c000 means in c000.snappy.parquet or c000.snappy.orc?
I have searched through all the documentation I could find and still can't work out why there is a prefix, or what c000 means, in the file naming convention below:

file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet

Argolis answered 8/3, 2018 at 4:57 Comment(1)
Maybe Combiner? It would be useful to see any code that outputs information. (comment by Hug)

Apply the "Talk is cheap, show me the code" approach here. Not everything is documented, and sometimes the only way to find out is to read the source.

Consider part-1-2_3-4.parquet, where each number marks one field of the name:

  1. Split/partition number.

  2. Random UUID, to prevent collisions between different (appending) write jobs.

  3. Unique job/task ID (not always included).

  4. The "c" stands for count. This is the file counter: the number of files already written for this specific partition. It exists because the number of records written to a single file can be capped, so one partition can end up spread over several files. The counter starts at 0.

I found it based on this code and this code.
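To make the four fields concrete, here is a minimal Python sketch that splits a file name like the one in the question into those parts. It is not Spark's own code: the regular expression, the `PartFileName` type, and the helper names are hypothetical, and the pattern is inferred only from the example name in the question and the `part-1-2_3-4` layout described above (including the assumption that the counter is zero-padded to three digits).

```python
import re
from typing import NamedTuple, Optional


class PartFileName(NamedTuple):
    """Hypothetical container for the fields of a part file name."""
    split: int                  # field 1: split/partition number
    write_uuid: str             # field 2: random UUID of the write job
    optional_id: Optional[str]  # field 3: extra ID, absent in many names
    file_counter: int           # field 4: the number after "c"
    extension: str              # e.g. ".snappy.parquet"


# Assumed layout: part-<split>-<uuid>[_<id>]<sep>c<counter><extension>,
# where <sep> is "-" (as in the question) or "." (an assumption).
_PART_RE = re.compile(
    r"^part-(?P<split>\d+)"
    r"-(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
    r"(?:_(?P<opt>\d+))?"
    r"[-.]c(?P<counter>\d+)"
    r"(?P<ext>\..+)$"
)


def parse_part_file_name(name: str) -> PartFileName:
    """Split a part file name into its fields, or raise ValueError."""
    m = _PART_RE.match(name)
    if m is None:
        raise ValueError(f"not a recognised part file name: {name!r}")
    return PartFileName(
        split=int(m["split"]),
        write_uuid=m["uuid"],
        optional_id=m["opt"],
        file_counter=int(m["counter"]),
        extension=m["ext"],
    )


def counter_suffix(file_counter: int) -> str:
    """Format a file counter the way it appears in the name (c000, c001, ...)."""
    return f"c{file_counter:03d}"


parsed = parse_part_file_name(
    "part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet"
)
print(parsed)
```

For the name in the question this yields split 0, the UUID `445036f9-7a40-4333-8405-8451faa44319`, no optional ID, file counter 0, and the extension `.snappy.parquet`. A second file written for the same partition would, under the same assumptions, carry the suffix `counter_suffix(1)`, that is `c001`.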

Musselman answered 8/3, 2018 at 8:18 Comment(0)
