BigQuery export splits into multiple files with some empty files
I am trying to use the BigQuery export functionality to push data out to GCS in JSON format. At the end of the process, in order to validate the count of exported records in the GCS files, I create an external table with schema auto-detection, just to count the records in the exported files.

This works for single exported files. But for tables larger than 1 GB, I use a wildcard in the destination URI so the export is split into multiple files. This results in multiple files being created, some of which are empty.

The empty files cause an error when querying the external table: "400 Schema has no fields".

Please suggest ideas on how to:

  1. Ensure that empty files are not created when the export is split into multiple files
  2. Ignore empty files when creating the external table
  3. Count the records in GCS some other way after the export operation
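
For illustration, here is a minimal SQL sketch of this flow, with placeholder bucket, dataset, and table names (a sketch assuming the EXPORT DATA statement; the export itself can equally run through the client library or the UI):

EXPORT DATA OPTIONS(
    uri='gs://mybucket/extract-here/*.json'
    , format='JSON'
    , overwrite=true
) AS (
    SELECT * FROM mydataset.mytable
);

-- Validation step: an external table over the exported files, with no
-- column list so that schema auto-detection is used, queried only for
-- the record count.
CREATE OR REPLACE EXTERNAL TABLE mydataset.export_check
OPTIONS(
    format='NEWLINE_DELIMITED_JSON'
    , uris=['gs://mybucket/extract-here/*.json']
);

SELECT COUNT(*) FROM mydataset.export_check;

The wildcard in the URI is what allows (and, for tables over 1 GB, requires) the split into multiple files.
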
Zoonosis answered 15/7, 2020 at 11:45 Comment(4)
It is strange that BQ creates empty files. In my case, it works every time I try with a table larger than 5 GB; all the files created contain at least 35 MB. How are you doing the export? Have you tried from the UI? Could it be that you already had some files in that bucket that are getting mixed up with the new ones?Confectioner
I am using the Python client library to do the export. I have tried the export from the UI as well, and the behaviour is the same: it creates multiple empty files along with the split files (each less than 1 GB).Zoonosis
Looks like it's existing behaviour for BQ exports to produce some empty files. I have found a workaround for my problem: setting the schema explicitly for the external table instead of using schema auto-detection. It then works even if empty files are present (see the sketch after these comments).Zoonosis
@Zoonosis Can you share your thoughts and workaround as an answer, so it can help the rest of the community?Highsmith
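
For reference, the explicit-schema workaround described in the comments looks roughly like this (column names and types are illustrative placeholders): with a declared schema, auto-detection never runs, so an empty file no longer produces "Schema has no fields".

-- Declaring the schema explicitly instead of relying on auto-detection
-- (columns here stand in for the real exported schema).
CREATE OR REPLACE EXTERNAL TABLE mydataset.export_check (
    id INT64
    , name STRING
)
OPTIONS(
    format='NEWLINE_DELIMITED_JSON'
    , uris=['gs://mybucket/extract-here/*.json']
);

SELECT COUNT(*) FROM mydataset.export_check;
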
I had the same problem, but I found a workaround: a TEMP TABLE seems to do the trick.

(EDIT: reading the docs, I noticed that EXPORT DATA has always been described for BigQuery tables, not for custom SELECTs. Since I never experienced empty files when exporting real tables, I gave temp tables the same chance.)

Imagine we have the following query:

EXPORT DATA OPTIONS(
    uri='gs://mybucket/extract-here/*.csv.gz'
    , format='CSV'
    , compression='GZIP'
    , overwrite=true
    , header=true
    , field_delimiter=","
) AS (
    WITH mytable AS (
        SELECT col FROM UNNEST([1,2,3,4,5,6,7,8]) AS col
    )
    SELECT * FROM mytable
);

You can rewrite it as follows:

BEGIN
    -- Step 1: materialize the query into a session-scoped temp table.
    CREATE TEMP TABLE _SESSION.tmpExportTable AS (
        WITH mytable AS (
            SELECT col FROM UNNEST([1,2,3,4,5,6,7,8]) AS col
        )
        SELECT * FROM mytable
    );
    -- Step 2: export from the temp table as if it were a regular table.
    EXPORT DATA OPTIONS(
        uri='gs://mybucket/extract-here/*.csv.gz'
        , format='CSV'
        , compression='GZIP'
        , overwrite=true
        , header=true
        , field_delimiter=","
    ) AS
    SELECT * FROM _SESSION.tmpExportTable;
END;
Overstate answered 3/3, 2021 at 12:1 Comment(2)
Brilliant solution!Archimandrite
This should be the accepted solution. I just ran into this issue with a multi-file export, and this solution is the only one that works.Masters
