Dataset was not found in location US

I'm testing a pipeline on a small set of data, and suddenly it breaks during one of the test runs with this message: Not found: Dataset thijs-dev:nlthijs_ba was not found in location US

  • Never have I run, deployed or used any US resource.
  • Region, zone and deployment are all EU based.
  • The pipeline is deployed to the EU, and the UI shows that it runs in the EU.
  • When I check the logs I see resource.labels.region 'europe-west4'.
  • I run 40+ pipelines all with the same deploy script and never with any US setting.
  • All data locations for all tables in BQ are all in EU.
  • When using DirectRunner the log shows: INFO:root:Using location 'EU' from table
  • I use table references without a project name in my queries, because we have test/acceptance etc. in different projects. This works fine. When I add the project, the error remains.

My run script
python pipeline/main.py --project thijs-dev --region europe-west4 --runner DataflowRunner --temp_location gs://thijs/dataflow/tmp --staging_location gs://thijs/dataflow/staging --job_name thijspipe --save_main_session --setup_file pipeline/setup.py --autoscaling_algorithm=THROUGHPUT_BASED --max_num_workers=7

My failing step
thijs = (p | 'ReadTable thijs' >> beam.io.Read(beam.io.BigQuerySource(query=queries.load_code_table(), use_standard_sql=True)))

An example of what my query looks like

   #standardSQL
   select
     original.c1,
     original.c2,
     original.c3
   from `thijs.tablename` original
   inner join (
     select c1, max(c2) as col2
     from `thijs.tablename`
     group by c2
   ) timejoin
     on timejoin.c5 = original.c5 and timejoin.c2 = original.c2

My question is: what is going wrong exactly, where is this US coming from?

The error
RuntimeError: apitools.base.py.exceptions.HttpNotFoundError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/thijs-dev/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Sun, 16 Feb 2020 09:40:10 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '404', 'content-length': '338', '-content-encoding': 'gzip'}>, content <{ "error": { "code": 404, "message": "Not found: Dataset thijs-dev:`nlthijs_ba was not found in location US", "errors": [ { "message": "Not found: Dataset thijs-dev:`nlthijs_ba was not found in location US", "domain": "global", "reason": "notFound" } ], "status": "NOT_FOUND" } } > [while running 'Transform Details Thijs']

[update]

Here you can see that I forced standard SQL by using #standardSQL as the first line of my queries. But somewhere some API is forcing legacy SQL, and I don't know what or where.

RuntimeError: apitools.base.py.exceptions.HttpBadRequestError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/thijs-dev/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Sun, 16 Feb 2020 20:59:12 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '400', 'content-length': '354', '-content-encoding': 'gzip'}>, content <{ "error": { "code": 400, "message": "Query text specifies use_legacy_sql:false, while API options specify:true", "errors": [ { "message": "Query text specifies use_legacy_sql:false, while API options specify:true", "domain": "global", "reason": "invalid" } ], "status": "INVALID_ARGUMENT" } } > [while running 'pipeline']

Python SDK 2.16.0 & 2.19.0
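The 400 error above says the query text and the API options disagree about the SQL dialect. One possible workaround, sketched below, is to keep only the API-level flag (e.g. use_standard_sql=True on the source) and strip the in-query decorator before handing the query to the pipeline. The helper name strip_dialect_prefix is hypothetical, not part of Beam or BigQuery:

```python
def strip_dialect_prefix(query: str) -> str:
    """Remove a leading #standardSQL / #legacySQL decorator from a query.

    The in-query decorator and the API-level dialect option can conflict
    (as in the "use_legacy_sql:false ... API options specify:true" error),
    so this keeps only the query body and leaves the dialect choice to
    the API flag.
    """
    lines = query.lstrip().splitlines()
    if lines and lines[0].strip().lower() in ("#standardsql", "#legacysql"):
        lines = lines[1:]
    return "\n".join(lines).lstrip()


query = """#standardSQL
select c1 from `thijs-dev.nlthijs_ba.tablename`"""
print(strip_dialect_prefix(query))
# the decorator line is gone; only the select remains
```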

Fatuity answered 16/2, 2020 at 10:22 Comment(9)
Your request could have gotten rerouted if the local server was down or busy.Gualtiero
Wow interesting! Do you have a link to more information about this process?Fatuity
Sorry, I don't have one. It's just stuff I noticed over the years. Maybe just google a few key search words including the server you're trying to access if you want more details.Gualtiero
What version of Python Beam are you using?Palazzo
The version I use is 2.16.0, I will add this to the question. I will update to 2.19.0 and try again.Fatuity
version 2.19.0 did not helpFatuity
In which location is the dataset that you query?Tearoom
While reading error message, I see that your dataset name has an extra quotation at the beginning. Shouldn't it be thijs-dev:nlthijs_ba rather than thijs-dev:`nlthijs_ba?Tearoom
The dataset is in the EU; strangely enough, the quotation does not come from my code.Fatuity

When you work with Google BigQuery, it is always recommended to use the complete name of the BigQuery resource. What do I mean? Using the same example from your code:

select
  original.c1,
  original.c2,
  original.c3
from `thijs.tablename` original
inner join (
  select c1, max(c2) as col2 from `thijs.tablename` group by c2
) timejoin on timejoin.c5 = original.c5 and timejoin.c2 = original.c2

Instead of putting only thijs.tablename, you should use `thijs-dev.nlthijs_ba.table_name`.

In that way, you make sure that you are calling the right resource on BigQuery.
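As a minimal sketch of that naming convention (the helper name qualify is hypothetical, not part of any BigQuery client library):

```python
def qualify(project: str, dataset: str, table: str) -> str:
    # Standard SQL expects the full project.dataset.table path in backticks,
    # which removes any dependence on the client's default project.
    return f"`{project}.{dataset}.{table}`"


print(qualify("thijs-dev", "nlthijs_ba", "tablename"))
# `thijs-dev.nlthijs_ba.tablename`
```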

Another thing you should check first is whether you can run a simple query against that table from Python:

select c1, max(c2) as col2 from `thijs-dev.nlthijs_ba.tablename` group by c2

If this query fails, fix that first.
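To catch unqualified references before they hit the API, you could scan the query text for backticked names that lack the project part. This is a rough heuristic sketch (the function unqualified_tables is made up for illustration), not a full SQL parser:

```python
import re


def unqualified_tables(sql: str) -> list:
    """Return backticked table references that are missing the project part.

    A fully qualified reference has two dots (project.dataset.table);
    anything with fewer parts relies on the client's default project,
    which is where location surprises can creep in.
    """
    refs = re.findall(r"`([^`]+)`", sql)
    return [r for r in refs if r.count(".") < 2]


print(unqualified_tables("select c1 from `thijs.tablename`"))
# ['thijs.tablename']
```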

Grebe answered 7/8, 2021 at 1:57 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.