Accessing data in blob object from download_as_string in Python
I am trying to access and modify data in a newline JSON file pulled from Google Cloud Storage in Google Cloud Functions. The results always show up as numbers despite that not being the data in the JSON.

I see that download_as_string() on a blob object returns bytes (https://googleapis.github.io/google-cloud-python/latest/_modules/google/cloud/storage/blob.html#Blob.download_as_string), but in every reference I have found, everyone is able to access their data just fine.

I am doing this in Cloud Functions but I think my question would apply in any GCP tool.

The example below should simply load the newline JSON data, add it to a list, select the first two dictionary entries, convert them back to newline JSON, and write the result to a JSON file on GCS. Samples, code, and the bad output are listed below.

Sample newline JSON input

{"Website": "Google", "URL": "Google.com", "ID": 1}
{"Website": "Bing", "URL": "Bing.com", "ID": 2}
{"Website": "Yahoo", "URL": "Yahoo.com", "ID": 3}
{"Website": "Yandex", "URL": "Yandex.com", "ID": 4}

Code in Cloud Function

import requests
import json
import csv
from datetime import datetime, timedelta
import sys
from collections import OrderedDict
import os
import random

from google.cloud import bigquery
from google.cloud import storage

def importData(request, execution):
    # Read the data from Google Cloud Storage
    read_storage_client = storage.Client()

    # Set buckets and filenames
    bucket_name = "sample_bucket"
    filename = 'sample_json_output.json'

    # get bucket with name
    bucket = read_storage_client.get_bucket('sample_bucket')
    # get bucket data as blob
    blob = bucket.get_blob('sample_json.json')
    # download as string
    json_data = blob.download_as_string()

    # create list 
    website_list = []
    for u,y in enumerate(json_data):
        website_list.append(y)

    # select first two
    website_list = website_list[0:2]

    # Create new-line JSON
    results_ready = '\n'.join(json.dumps(item) for item in website_list)

    # Write the data to Google Cloud Storage
    write_storage_client = storage.Client()

    write_storage_client.get_bucket(bucket_name) \
        .blob(filename) \
        .upload_from_string(results_ready)

Current output in sample_json_output.json file

123
34

Expected output

{"Website": "Google", "URL": "Google.com", "ID": 1}
{"Website": "Bing", "URL": "Bing.com", "ID": 2}

Update 6/6: If I write a file straight from the bytes returned by download_as_string(), it writes the JSON file perfectly, but I need to access the contents before writing.

import requests
import json
import csv
from datetime import datetime, timedelta
import sys
from collections import OrderedDict
import os
import random

from google.cloud import bigquery
from google.cloud import storage

def importData(request, execution):

    # Read the data from Google Cloud Storage
    read_storage_client = storage.Client()

    # Set buckets and filenames
    bucket_name = "sample_bucket"
    filename = 'sample_json_output.json'

    # get bucket with name
    bucket = read_storage_client.get_bucket('sample_bucket')

    # get bucket data as blob
    blob = bucket.get_blob('sample_json.json')

    # convert to string
    json_data = blob.download_as_string()


    # Write the data to Google Cloud Storage
    write_storage_client = storage.Client()

    write_storage_client.get_bucket(bucket_name) \
        .blob(filename) \
        .upload_from_string(json_data)

Update 6/6 Output

{"Website": "Google", "URL": "Google.com", "ID": 1}
{"Website": "Bing", "URL": "Bing.com", "ID": 2}
{"Website": "Yahoo", "URL": "Yahoo.com", "ID": 3}
{"Website": "Yandex", "URL": "Yandex.com", "ID": 4}
Callaway answered 5/6, 2019 at 15:0 Comment(4)
Your problem is that each line is a JSON dictionary object. You need to break the input into lines and then treat each line as an object. – Abstention
Hey John - I thought I was doing that by iterating and adding each dictionary line into a list. Am I misunderstanding? – Callaway
The problem occurs when you download the newline JSON file, before you iterate each dictionary line into a list. When you call download_as_string() on a file holding a single JSON object it works, but with a newline JSON file of separate JSON objects it seems unable to read the file. I also tried download_to_file() and tried to read the result with the ndjson library, but it still reads as numbers. – Bearer
Hey Corinne - See the update in my post. If I load the newline JSON and write it right back out, it works just fine, so it appears to read the file successfully. – Callaway
B
5

I was able to get the result you wanted using a method similar to yours, shown in the code below, together with the ndjson library for newline JSON.

import requests
import json
import ndjson
import csv
from datetime import datetime, timedelta
import sys
from collections import OrderedDict
import os
import random

from google.cloud import bigquery
from google.cloud import storage

def importData(request, execution):

    # Read the data from Google Cloud Storage
    read_storage_client = storage.Client()

    # Set buckets and filenames
    bucket_name = "bucket-name"
    filename = "sample_json_output.json"

    # get bucket with name
    bucket = read_storage_client.get_bucket(bucket_name)

    # get bucket data as blob
    blob = bucket.get_blob("sample_json.json")

    # convert to string
    json_data_string = blob.download_as_string()

    json_data = ndjson.loads(json_data_string)

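    # keep the first two records (avoids shadowing the built-in `list`)
    first_two = list(json_data)[0:2]

    # serialize with json.dumps so each line is valid JSON;
    # str(item) would write Python repr with single quotes instead
    result = ""
    for item in first_two:
        result = result + json.dumps(item) + "\n"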


    # Write the data to Google Cloud Storage
    write_storage_client = storage.Client()

    write_storage_client.get_bucket(bucket_name) \
        .blob(filename) \
        .upload_from_string(result)
Bearer answered 10/6, 2019 at 7:34 Comment(1)
This works! I am able to manipulate the data and write it back out. I will read more about the ndjson package to understand its functionality. – Callaway
G
2

When you read the blob into json_data you get a bytes object, and when you iterate over it, you get the numeric value of each byte rather than lines or characters. Below is an example that creates a list of dicts from the bytes object:

json_data                                                                                                                                                                                                 
b'{"Website": "Google", "URL": "Google.com", "ID": 1}\n{"Website": "Bing", "URL": "Bing.com", "ID": 2}\n{"Website": "Yahoo", "URL": "Yahoo.com", "ID": 3}\n{"Website": "Yandex", "URL": "Yandex.com", "ID": 4}\n'

type(json_data)                                                                                                                                                                                           
bytes

website_list = [json.loads(row.decode('utf-8')) for row in json_data.split(b'\n') if row]                                                                                                                 

website_list                                                                                                                                                                                              
[{'Website': 'Google', 'URL': 'Google.com', 'ID': 1},
 {'Website': 'Bing', 'URL': 'Bing.com', 'ID': 2},
 {'Website': 'Yahoo', 'URL': 'Yahoo.com', 'ID': 3},
 {'Website': 'Yandex', 'URL': 'Yandex.com', 'ID': 4}]
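
To connect this to the 123 and 34 in the question's output file, here is a minimal self-contained sketch. The bytes literal is a shortened copy of the sample data, and the variable names are only illustrative; in the Cloud Function the bytes would come from blob.download_as_string().

```python
import json

# Stand-in for the bytes returned by blob.download_as_string()
json_data = (
    b'{"Website": "Google", "URL": "Google.com", "ID": 1}\n'
    b'{"Website": "Bing", "URL": "Bing.com", "ID": 2}\n'
)

# Iterating over bytes yields integers (one per byte), not lines
first_two_items = list(json_data)[0:2]
print(first_two_items)  # [123, 34], the byte values of '{' and '"'

# Split into lines first, skip blank lines, then parse each line as JSON
rows = [row for row in json_data.splitlines() if row.strip()]
website_list = [json.loads(row.decode("utf-8")) for row in rows]

# Re-serialize the first two records as newline-delimited JSON
results_ready = "\n".join(json.dumps(item) for item in website_list[0:2])
print(results_ready)
```

The resulting results_ready string is what would then be passed to upload_from_string().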
Ghislainegholston answered 7/6, 2019 at 17:7 Comment(1)
I receive the following error on the json.loads line: --- decode raise JSONDecodeError("Extra data", s, end) json.decoder.JSONDecodeError: Extra data: line 1 column 52 (char 51) --- This looks to be in-line with the error. I'll explore: #21059435 – Callaway
L
1

You get a regular string (str) instead of bytes if you replace

json_data = blob.download_as_string()

by

json_data = blob.download_as_text()
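
Once the data is a str, the rest can be done with the standard json module. The following is only a sketch: first_n_records is a hypothetical helper name, and the string literal stands in for what download_as_text() would return for the sample file, so it runs without access to a bucket.

```python
import json


def first_n_records(ndjson_text, n):
    """Parse newline-delimited JSON text and return the first n records,
    re-serialized as newline-delimited JSON. (Hypothetical helper.)"""
    records = [
        json.loads(line)
        for line in ndjson_text.split("\n")
        if line.strip()  # skip blank lines, e.g. after a trailing newline
    ]
    return "\n".join(json.dumps(record) for record in records[:n])


# Stand-in for: json_data = blob.download_as_text()
json_data = (
    '{"Website": "Google", "URL": "Google.com", "ID": 1}\n'
    '{"Website": "Bing", "URL": "Bing.com", "ID": 2}\n'
    '{"Website": "Yahoo", "URL": "Yahoo.com", "ID": 3}\n'
)

results_ready = first_n_records(json_data, 2)
print(results_ready)
```

results_ready can then be written back with upload_from_string(), as in the question.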
Lau answered 20/7, 2022 at 17:21 Comment(0)
