I am trying to scan BigTable data where some rows are 'dirty' - but the scan fails depending on how it is constrained, raising (serialization?) InvalidChunk exceptions. The code is as follows:
from google.cloud import bigtable
from google.cloud import happybase
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
connection = happybase.Connection(instance=instance)
table = connection.table(table_name)
for key, row in table.scan(limit=5000):  # BOOM!
    pass
Leaving out some columns, limiting the scan to fewer rows, or specifying start and stop keys allows the scan to succeed. I cannot tell from the stack trace which values are problematic - it varies across columns - the scan just fails. This makes it hard to clean the data at the source.
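For reference, these narrowed variants of the same scan do complete (a sketch only - the column name cf1:col_a and the key range below are placeholders, not my real schema):

# only some columns
for key, row in table.scan(columns=['cf1:col_a'], limit=5000):
    pass

# fewer rows
for key, row in table.scan(limit=500):
    pass

# an explicit key range
for key, row in table.scan(row_start=b'key00000', row_stop=b'key05000'):
    pass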
When I step through with the Python debugger, I can see that the chunk (which is of type google.bigtable.v2.bigtable_pb2.CellChunk) has no value (it is NULL / undefined):
ipdb> pp chunk.value
b''
ipdb> chunk.value_size
0
I can confirm this in the HBase shell using the row key (which I got from self._row.row_key).
So the question becomes: how can a BigTable scan filter out columns that have an undefined / empty / null value?
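Something like the sketch below is what I am after, using the native client's row filters instead of happybase (reusing instance and table_name from my snippet above). I am guessing that ValueRangeFilter with start_value=b'\x00' would exclude empty cell values server-side - please correct me if there is a better filter for this:

from google.cloud.bigtable import row_filters

# assumption: any non-empty value sorts >= b'\x00', so this should drop b'' cells
non_empty = row_filters.ValueRangeFilter(start_value=b'\x00')

bt_table = instance.table(table_name)   # native table, same instance as above
partial_rows = bt_table.read_rows(limit=5000, filter_=non_empty)
partial_rows.consume_all()              # or does this still hit InvalidChunk?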
I get the same problem from both Google Cloud APIs that return generators which internally stream data as chunks over gRPC (a minimal repro of the second path is sketched after the list):
- google.cloud.happybase.table.Table#scan()
- google.cloud.bigtable.table.Table#read_rows().consume_all()
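For completeness, this is roughly how I call the second path (again using the instance and table_name from the snippet at the top); it raises the same InvalidChunk for me:

bt_table = instance.table(table_name)
partial_rows = bt_table.read_rows(limit=5000)
partial_rows.consume_all()   # BOOM! same InvalidChunk
rows = partial_rows.rows     # never reached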
The abbreviated stack trace is as follows:
---------------------------------------------------------------------------
InvalidChunk Traceback (most recent call last)
<ipython-input-48-922c8127f43b> in <module>()
1 row_gen = table.scan(limit=n)
2 rows = []
----> 3 for kvp in row_gen:
4 pass
.../site-packages/google/cloud/happybase/table.py in scan(self, row_start, row_stop, row_prefix, columns, timestamp, include_timestamp, limit, **kwargs)
391 while True:
392 try:
--> 393 partial_rows_data.consume_next()
394 for row_key in sorted(rows_dict):
395 curr_row_data = rows_dict.pop(row_key)
.../site-packages/google/cloud/bigtable/row_data.py in consume_next(self)
273 for chunk in response.chunks:
274
--> 275 self._validate_chunk(chunk)
276
277 if chunk.reset_row:
.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk(self, chunk)
388 self._validate_chunk_new_row(chunk)
389 if self.state == self.ROW_IN_PROGRESS:
--> 390 self._validate_chunk_row_in_progress(chunk)
391 if self.state == self.CELL_IN_PROGRESS:
392 self._validate_chunk_cell_in_progress(chunk)
.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk_row_in_progress(self, chunk)
368 self._validate_chunk_status(chunk)
369 if not chunk.HasField('commit_row') and not chunk.reset_row:
--> 370 _raise_if(not chunk.timestamp_micros or not chunk.value)
371 _raise_if(chunk.row_key and
372 chunk.row_key != self._row.row_key)
.../site-packages/google/cloud/bigtable/row_data.py in _raise_if(predicate, *args)
439 """Helper for validation methods."""
440 if predicate:
--> 441 raise InvalidChunk(*args)
InvalidChunk:
Can you show me how to scan BigTable from Python, ignoring / logging dirty rows that raise InvalidChunk? (try ... except won't work around the generator, which lives inside the Google Cloud API's row_data PartialRowsData class.)
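This is the shape of workaround I have in mind - a wrapper generator that pulls from the scan and logs/skips on InvalidChunk - but I do not know whether the underlying stream is still usable after the exception (scan_skipping_dirty is my own hypothetical helper, not part of any API):

import logging

from google.cloud.bigtable.row_data import InvalidChunk

def scan_skipping_dirty(table, **scan_kwargs):
    """Yield (key, row) pairs, logging and skipping whatever raises InvalidChunk."""
    row_gen = table.scan(**scan_kwargs)
    while True:
        try:
            yield next(row_gen)
        except StopIteration:
            return
        except InvalidChunk:
            logging.exception("dirty row - skipping")
            # presumably the stream state is now broken; do I have to
            # restart the scan from the last good row key here?

# intended usage:
for key, row in scan_skipping_dirty(table, limit=5000):
    pass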
Also, can you show me code to chunk-stream a table scan in BigTable? HappyBase's batch_size & scan_batching don't seem to be supported.
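In case it clarifies what I mean by chunk streaming, here is the kind of manual paging I would write myself (a sketch; page_size and the b'\x00' key-increment trick are my own choices, not library features):

def scan_in_chunks(table, page_size=1000, **scan_kwargs):
    """Scan page_size rows at a time, resuming from just past the last key seen."""
    last_key = None
    while True:
        kwargs = dict(scan_kwargs, limit=page_size)
        if last_key is not None:
            # smallest possible key strictly greater than last_key
            kwargs['row_start'] = last_key + b'\x00'
        count = 0
        for key, row in table.scan(**kwargs):
            count += 1
            last_key = key
            yield key, row
        if count < page_size:
            return  # short page means we reached the end of the table

# intended usage:
for key, row in scan_in_chunks(table, page_size=1000):
    pass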