Dataset was not found in location US

I'm testing a pipeline on a small set of data, and suddenly it breaks during one of the test runs with this message: Not found: Dataset thijs-dev:nlthijs_ba was not found in location US

  • Never have I run, deployed or used any US resource.
  • Region, zone and deployment are all EU based.
  • The pipeline is deployed to the EU, and the UI shows that it runs in the EU.
  • When I check the logs I see resource.labels.region 'europe-west4'.
  • I run 40+ pipelines all with the same deploy script and never with any US setting.
  • All data locations for all tables in BQ are all in EU.
  • When using DirectRunner the log shows: INFO:root:Using location 'EU' from table
  • I use table references without a project name in my queries, because we have test/acceptance etc. in different projects. This works fine. When I add the project, the error remains.

My run script
python pipeline/main.py --project thijs-dev --region europe-west4 --runner DataflowRunner --temp_location gs://thijs/dataflow/tmp --staging_location gs://thijs/dataflow/staging --job_name thijspipe --save_main_session --setup_file pipeline/setup.py --autoscaling_algorithm=THROUGHPUT_BASED --max_num_workers=7

My failing step
thijs = (p | 'ReadTable thijs' >> beam.io.Read(beam.io.BigQuerySource(query=queries.load_code_table(), use_standard_sql=True)))

An example of what my query looks like

   #standardSQL
   select
     original.c1,
     original.c2,
     original.c3
   from `thijs.tablename` original
   inner join (
     select c1, max(c2) as col2
     from `thijs.tablename`
     group by c2
   ) timejoin
     on timejoin.c5 = original.c5 and timejoin.c2 = original.c2

My question is: what is going wrong exactly, where is this US coming from?

The error
RuntimeError: apitools.base.py.exceptions.HttpNotFoundError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/thijs-dev/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Sun, 16 Feb 2020 09:40:10 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '404', 'content-length': '338', '-content-encoding': 'gzip'}>, content <{ "error": { "code": 404, "message": "Not found: Dataset thijs-dev:`nlthijs_ba was not found in location US", "errors": [ { "message": "Not found: Dataset thijs-dev:`nlthijs_ba was not found in location US", "domain": "global", "reason": "notFound" } ], "status": "NOT_FOUND" } } > [while running 'Transform Details Thijs']

[update]

Here you can see that I forced standard SQL by using #standardSQL as the first line of my queries. But somewhere some API is forcing legacy SQL, and I don't know what or where.

RuntimeError: apitools.base.py.exceptions.HttpBadRequestError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/thijs-dev/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Sun, 16 Feb 2020 20:59:12 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '400', 'content-length': '354', '-content-encoding': 'gzip'}>, content <{ "error": { "code": 400, "message": "Query text specifies use_legacy_sql:false, while API options specify:true", "errors": [ { "message": "Query text specifies use_legacy_sql:false, while API options specify:true", "domain": "global", "reason": "invalid" } ], "status": "INVALID_ARGUMENT" } } > [while running 'pipeline']

Python SDK 2.16.0 & 2.19.0
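The 400 error above says the query text and the API options disagree about the SQL dialect. One possible workaround, sketched below, is to keep only the API-level flag (e.g. use_standard_sql=True on the source) and strip the in-query decorator before handing the query to the pipeline. The helper name strip_dialect_prefix is hypothetical, not part of Beam or BigQuery:

```python
def strip_dialect_prefix(query: str) -> str:
    """Remove a leading #standardSQL / #legacySQL decorator from a query.

    The in-query decorator and the API-level dialect option can conflict
    (as in the "use_legacy_sql:false ... API options specify:true" error),
    so this keeps only the query body and leaves the dialect choice to
    the API flag.
    """
    lines = query.lstrip().splitlines()
    if lines and lines[0].strip().lower() in ("#standardsql", "#legacysql"):
        lines = lines[1:]
    return "\n".join(lines).lstrip()


query = """#standardSQL
select c1 from `thijs-dev.nlthijs_ba.tablename`"""
print(strip_dialect_prefix(query))
# the decorator line is gone; only the select remains
```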

Fatuity answered 16/2, 2020 at 10:22 Comment(9)
Your request could have gotten rerouted if the local server was down or busy.Gualtiero
Wow interesting! Do you have a link to more information about this process?Fatuity
Sorry, I don't have one. It's just stuff I noticed over the years. Maybe just google a few key search words including the server you're trying to access if you want more details.Gualtiero
What version of Python Beam are you using?Palazzo
The version I use is 2.16.0, I will add this to the question. I will update to 2.19.0 and try again.Fatuity
version 2.19.0 did not helpFatuity
In which location is the dataset that you query?Tearoom
While reading error message, I see that your dataset name has an extra quotation at the beginning. Shouldn't it be thijs-dev:nlthijs_ba rather than thijs-dev:`nlthijs_ba?Tearoom
The dataset is in the EU; strangely enough, the quotation does not come from my code.Fatuity

When you work with Google BigQuery, it is always recommended to use the complete name of the BigQuery resource. What do I mean? Using the same example from your code:

select
  original.c1,
  original.c2,
  original.c3
from `thijs.tablename` original
inner join (
  select c1, max(c2) as col2 from `thijs.tablename` group by c2
) timejoin on timejoin.c5 = original.c5 and timejoin.c2 = original.c2

Instead of putting only thijs.tablename, you should use `thijs-dev.nlthijs_ba.table_name`.

In that way, you make sure that you are calling the right resource on BigQuery.
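As a minimal sketch of that naming convention (the helper name qualify is hypothetical, not part of any BigQuery client library):

```python
def qualify(project: str, dataset: str, table: str) -> str:
    # Standard SQL expects the full project.dataset.table path in backticks,
    # which removes any dependence on the client's default project.
    return f"`{project}.{dataset}.{table}`"


print(qualify("thijs-dev", "nlthijs_ba", "tablename"))
# `thijs-dev.nlthijs_ba.tablename`
```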

Another thing you should check first is whether you can run a simple query against that table from Python:

select c1, max(c2) as col2 from `thijs-dev.nlthijs_ba.tablename` group by c2

If this query fails, fix that first.
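To catch unqualified references before they hit the API, you could scan the query text for backticked names that lack the project part. This is a rough heuristic sketch (the function unqualified_tables is made up for illustration), not a full SQL parser:

```python
import re


def unqualified_tables(sql: str) -> list:
    """Return backticked table references that are missing the project part.

    A fully qualified reference has two dots (project.dataset.table);
    anything with fewer parts relies on the client's default project,
    which is where location surprises can creep in.
    """
    refs = re.findall(r"`([^`]+)`", sql)
    return [r for r in refs if r.count(".") < 2]


print(unqualified_tables("select c1 from `thijs.tablename`"))
# ['thijs.tablename']
```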

Grebe answered 7/8, 2021 at 1:57 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.