Is there a way to grep through text documents stored in Google Cloud Storage?
Question

Is there a way to grep through the text documents stored in Google Cloud Storage?

Background

I am storing over 10 thousand documents (txt files) on a VM, and they are using up space. Before it reaches the limit, I want to move the documents to an alternative location. Currently, I am considering moving them to Google Cloud Storage on GCP.

Issues

I sometimes need to grep the documents for specific keywords, so I was wondering whether there is any way to grep through documents uploaded to Google Cloud Storage. I checked the gsutil docs; ls, cp, mv, and rm are supported, but I don't see grep.

Attorneyatlaw answered 5/3, 2019 at 2:10 Comment(0)

Unfortunately, there is no grep-like command in gsutil.

The closest command is gsutil cat. I suggest creating a small VM and grepping there; running it in the cloud will be faster and cheaper:

gsutil cat gs://bucket/** | grep "what you want to grep"
Befoul answered 5/3, 2019 at 2:18 Comment(1)
Thank you for your reply. I tried gsutil cat, and it works when I don't have many files on Google Cloud Storage. However, when considering scalability, gsutil cat is definitely not the best option. Let me check the performance of grep on a small VM as suggested. Thank you again! Attorneyatlaw

@howie's answer is good. I just want to mention that Google Cloud Storage is a product intended to store files; it does not care about their contents. Also, it is designed to be massively scalable, and the operation you are asking for is computationally expensive, so it is very unlikely to be supported natively in the future.

In your case, I would consider creating an index of the text files and triggering an update to it every time a new file is uploaded to GCS.
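A minimal sketch of such an index in pure Python (names are illustrative; in practice the update step would run in a Cloud Function triggered on each object upload):

```python
import re
from collections import defaultdict

# Inverted index: keyword -> set of file names containing it.
index = defaultdict(set)

def add_to_index(file_name, text):
    # Tokenize on word characters and record each word once per file.
    for word in set(re.findall(r"\w+", text.lower())):
        index[word].add(file_name)

def lookup(keyword):
    # Return the names of all indexed files that contain the keyword.
    return sorted(index[keyword.lower()])

add_to_index("a.txt", "error: disk full")
add_to_index("b.txt", "all systems nominal")
print(lookup("error"))  # ['a.txt']
```

Lookups then cost a dictionary access instead of a full scan over every object.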

Brice answered 5/3, 2019 at 9:36 Comment(1)
Thanks for your suggestion. I finally went with gcsfuse. Attorneyatlaw

I found the answer to this issue: gcsfuse solved the problem.

Mount the Google Cloud Storage bucket to a specific directory, and you can grep from there.

https://cloud.google.com/storage/docs/gcs-fuse
https://github.com/GoogleCloudPlatform/gcsfuse
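Once the bucket is mounted (e.g. `gcsfuse my-bucket /mnt/gcs`, where the bucket name and mount point are placeholders), searching is plain filesystem work. A small Python equivalent of `grep -rl` over the mount point:

```python
import os
import re

def grep_dir(root, pattern):
    """Return sorted paths under root whose text content matches pattern."""
    regex = re.compile(pattern)
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Ignore undecodable bytes so binary files don't raise errors.
            with open(path, encoding="utf-8", errors="ignore") as f:
                if regex.search(f.read()):
                    matches.append(path)
    return sorted(matches)
```

Note that every file read through the mount still downloads the object from GCS, so for very large buckets the indexing approach suggested above will be cheaper.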

Attorneyatlaw answered 27/3, 2019 at 13:25 Comment(0)

I've written a Linux native binary, [mrgrep](https://github.com/romange/gaia/releases/tag/v0.1.0) (for Ubuntu 18.04), that does exactly this. It reads directly from GCS and, as a bonus, handles compressed files and is multi-threaded.

Roi answered 3/8, 2019 at 22:19 Comment(0)

I have another suggestion: you might want to consider using Google Cloud Dataflow to process the documents. You can just move them, but more importantly, you can transform them with Dataflow along the way.

Harem answered 10/3, 2019 at 1:36 Comment(1)
Thanks for your suggestion. I finally went with gcsfuse. Attorneyatlaw

You can try this Python script in Cloud Shell, invoked like:

python script_file_name bucket_name pattern directory_if_any

from google.cloud import storage
import re
import sys

client = storage.Client()
BUCKET_NAME = sys.argv[1]
PATTERN = sys.argv[2]
# Optional third argument: restrict the search to a directory (prefix).
PREFIX = sys.argv[3] if len(sys.argv) > 3 else ""

def search(string, pattern):
    return re.search(pattern, string)

def walk(bucket_name, prefix=""):
    # Yield every object under the prefix, skipping "directory" placeholders.
    bucket = client.bucket(bucket_name)
    for blob in bucket.list_blobs(prefix=prefix):
        if not blob.name.endswith("/"):
            yield blob

for blob in walk(BUCKET_NAME, prefix=PREFIX):
    # Download each object's text and print its name if the pattern matches.
    text = blob.download_as_text()
    if search(text, PATTERN):
        print(blob.name)
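The matching logic of the script above can be exercised locally, without GCS credentials, by factoring it over (name, text) pairs (the helper name is illustrative):

```python
import re

def grep_texts(named_texts, pattern):
    """Return the names whose text matches the regex pattern."""
    regex = re.compile(pattern)
    return [name for name, text in named_texts if regex.search(text)]

docs = [("a.txt", "warning: low disk"), ("b.txt", "all good")]
print(grep_texts(docs, r"disk"))  # ['a.txt']
```

In the real script, the pairs would come from `(blob.name, blob.download_as_text())`.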
Fennec answered 11/5, 2023 at 11:36 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.