GCP Dataflow Apache Beam code logic not working as expected

I am trying to implement change data capture (CDC) in Apache Beam, deployed on Google Cloud Dataflow.

I have unloaded the master data and the new data, which is expected to arrive daily. The join is not working as expected; something is missing.

master_data = (
    p | 'Read base from BigQuery ' >> beam.io.Read(beam.io.BigQuerySource(query=master_data, use_standard_sql=True))
      | 'Map id in master' >> beam.Map(lambda master: (
          master['id'], master)))
new_data = (
    p | 'Read Delta from BigQuery ' >> beam.io.Read(beam.io.BigQuerySource(query=new_data, use_standard_sql=True))
      | 'Map id in new' >> beam.Map(lambda new: (new['id'], new)))

joined_dicts = (
    {'master_data' :master_data, 'new_data' : new_data }
    | beam.CoGroupByKey()
    | beam.FlatMap(join_lists)
    | 'mergeddicts' >> beam.Map(lambda masterdict, newdict: newdict.update(masterdict))
) 

def join_lists(k,v):
    itertools.product(v['master_data'], v['new_data'])

Observations (on sample data):

Data from the master

1, 'A',3232

2, 'B',234

New Data:

1,'A' ,44

4,'D',45

Expected result in the master table after running the code:

1, 'A',44

2, 'B',234

4,'D',45

However, what I am getting in the master table is:

1,'A' ,44

4,'D',45

Am I missing a step? Can anyone please help me correct my mistake?
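One likely reason rows go missing in the code as posted: CoGroupByKey gives every key a dict with one list per input, and a key that exists on only one side gets an empty list for the other side. `itertools.product` over an empty list yields no pairs, so such keys vanish at that step. The sketch below uses plain Python, no Beam; the `grouped` dict is a hypothetical stand-in for the grouped output, and the column names `name` and `value` are assumptions, since only `id` appears in the code above.

```python
import itertools

# Hypothetical stand-in for what CoGroupByKey would emit for the sample data.
# Column names 'name' and 'value' are assumed; only 'id' appears in the question.
grouped = {
    1: {'master_data': [{'id': 1, 'name': 'A', 'value': 3232}],
        'new_data': [{'id': 1, 'name': 'A', 'value': 44}]},
    2: {'master_data': [{'id': 2, 'name': 'B', 'value': 234}],
        'new_data': []},
    4: {'master_data': [],
        'new_data': [{'id': 4, 'name': 'D', 'value': 45}]},
}

# Cross product of the two sides per key, as join_lists attempts to do.
pairs_per_key = {
    key: list(itertools.product(groups['master_data'], groups['new_data']))
    for key, groups in grouped.items()
}

# Keys that still have at least one (master, new) pair after the product.
surviving_keys = [key for key, pairs in pairs_per_key.items() if pairs]
print(surviving_keys)
```

Only the key present on both sides keeps a pair, so a cross product behaves like an inner join rather than the full outer merge that CDC needs.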

from apache_beam.options.pipeline_options import PipelineOptions
import apache_beam as beam

def join_lists(e):
    (k,v)=e
    return (k, v['new_data']) if v['new_data'] != v['master_data'] else (k, None)

with beam.Pipeline(options=PipelineOptions()) as p:
    master_data = (
        p | 'Read base from BigQuery ' >> beam.Create([('A', [3232]),('B', [234])])
    )
    new_data = (
        p | 'Read Delta from BigQuery ' >> beam.Create([('A',[44]),('D',[45])])
    )
    joined_dicts = (
        {'master_data' :master_data, 'new_data' : new_data }
        | beam.CoGroupByKey()
        | 'mergeddicts' >> beam.Map(join_lists)
    )
    result = p.run()
    result.wait_until_finish()

print("Pipeline finished.")
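To get the expected result from the question (new rows win, untouched master rows are kept), the per-key merge could look something like the sketch below. It is plain Python so it can be run without Beam. `merge_cdc` is a hypothetical helper name, the column names `name` and `value` are assumptions, and the sketch assumes at most one row per `id` on each side (if there are several, it simply takes the last one).

```python
def merge_cdc(element):
    """Yield the row to keep for one key: the new row if any, else the master row."""
    key, groups = element
    master_rows = list(groups['master_data'])
    new_rows = list(groups['new_data'])
    if new_rows:
        # Key was updated or newly inserted: the delta row wins.
        yield new_rows[-1]
    elif master_rows:
        # Key is absent from the delta: keep the existing master row.
        yield master_rows[-1]


# Hypothetical stand-in for the (key, {tag: rows}) pairs CoGroupByKey emits.
grouped = [
    (1, {'master_data': [{'id': 1, 'name': 'A', 'value': 3232}],
         'new_data': [{'id': 1, 'name': 'A', 'value': 44}]}),
    (2, {'master_data': [{'id': 2, 'name': 'B', 'value': 234}],
         'new_data': []}),
    (4, {'master_data': [],
         'new_data': [{'id': 4, 'name': 'D', 'value': 45}]}),
]

merged = [row for element in grouped for row in merge_cdc(element)]
for row in merged:
    print(row)
```

In the pipeline, a function of this shape would be applied with `beam.FlatMap(merge_cdc)` directly after `beam.CoGroupByKey()`, replacing the cross product and the `dict.update` step.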
