How to clear all data from AWS CloudSearch?
M

8

38

I have an AWS CloudSearch instance that I am still developing.

At times, such as when I modify the format of a field, I find myself wanting to wipe out all of the data and regenerate it.

Is there any way to clear out all of the data using the console, or do I have to go about it by programmatic means?

If I do have to use programmatic means (i.e. generate and POST a bunch of "delete" SDF files), is there any good way to query for all documents in a CloudSearch instance?

I guess I could just delete and re-create the instance, but that takes a while, and loses all of the indexes/rank expressions/text options/etc.

Minx answered 9/7, 2013 at 20:11 Comment(1)
Create a script to create your search domain with all parameters – Turf
L
33

Using aws and jq from the command line (tested with bash on mac):

export CS_DOMAIN=https://yoursearchdomain.yourregion.cloudsearch.amazonaws.com

# Get ids of all existing documents, reformat as
# [{ type: "delete", id: "ID" }, ...] using jq
aws cloudsearchdomain search \
  --endpoint-url=$CS_DOMAIN \
  --size=10000 \
  --query-parser=structured \
  --search-query="matchall" \
  | jq '[.hits.hit[] | {type: "delete", id: .id}]' \
  > delete-all.json

# Delete the documents
aws cloudsearchdomain upload-documents \
  --endpoint-url=$CS_DOMAIN \
  --content-type='application/json' \
  --documents=delete-all.json

For more info on jq see Reshaping JSON with jq

Update Feb 22, 2017

Added --size to get the maximum number of documents (10,000) at a time. You may need to repeat this script multiple times. Also, --search-query can take something more specific, if you want to be selective about the documents getting deleted.
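
The "repeat this script multiple times" step can also be automated. The following is a minimal Python sketch of that loop, not part of the original script. The names `build_delete_batch` and `delete_all`, and the `search` and `upload` callables, are hypothetical: the caller is assumed to supply functions that run the search and upload the batch (for example by wrapping the two `aws cloudsearchdomain` commands above).

```python
import json


def build_delete_batch(search_response):
    """Mirror the jq filter above: one delete operation per search hit."""
    return [{"type": "delete", "id": hit["id"]}
            for hit in search_response["hits"]["hit"]]


def delete_all(search, upload, max_rounds=1000):
    """Repeat search-then-delete until a search returns no hits.

    `search` (no arguments, returns the parsed search response) and
    `upload` (takes the JSON batch as a string) are hypothetical callables
    supplied by the caller. Returns the number of delete operations sent.
    """
    sent = 0
    for _ in range(max_rounds):  # guard against looping forever
        batch = build_delete_batch(search())
        if not batch:
            break
        upload(json.dumps(batch))
        sent += len(batch)
    return sent
```

Because deletes can take a moment to show up in search results, a round may resend deletes for documents that are already being removed; that is harmless, and `max_rounds` only caps how long the loop keeps trying.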

Lucifer answered 24/10, 2016 at 17:32 Comment(2)
Here's a bash script that uses the answer from @kevin-tonon, and adds looping as well as a status output that tells you how many documents there are: gist.github.com/jthomerson/ca06245d316d485252579a7d42630095 – Politico
Remember to use --output json when launching the aws command to prevent output in different formats (I had mine configured in yaml) – Cotten
M
16

Best answer I've been able to find was somewhat buried in the AWS docs. To wit:

Amazon CloudSearch currently does not provide a mechanism for deleting all of the documents in a domain. However, you can clone the domain configuration to start over with an empty domain. For more information, see Cloning an Existing Domain's Indexing Options.

Source: http://docs.aws.amazon.com/cloudsearch/latest/developerguide/Troubleshooting.html#ts.cleardomain

Minx answered 16/7, 2013 at 20:47 Comment(5)
sigh; hopefully they add a method for doing that. I do like clicking buttons that do bad things on accident! – Cornered
Agreed this functionality should be easier. Also, your link doesn’t seem to contain any information about cloning a domain anymore… – Balder
But I found the info at awsdocs.s3.amazonaws.com/cloudsearch/2011-02-01/… – unfortunately the process hangs indefinitely for me, aside from already being nonideal since I'd have to track down all references to the endpoint to update. – Balder
And it appears to be gone entirely now. So it seems this answer is no longer the answer. – Balder
It's gone... =/ – Barbuto
L
5

On my side, I used a local nodejs script like this:

var AWS = require('aws-sdk');
var fs = require('fs');

AWS.config.update({
    accessKeyId: '<your AccessKey>',
    secretAccessKey: '<your secretAccessKey>',
    region: '<your region>',
    endpoint: '<your CloudSearch endpoint>'
});

// Match documents whose facet field holds any of the listed values
var params = {
    query: "(or <your facet field>:'<one facet value>' <your facet field>:'<another facet value>')",
    queryParser: 'structured'
};

// The endpoint comes from AWS.config above; the search params go to search(), not the constructor
var cloudsearchdomain = new AWS.CloudSearchDomain();
cloudsearchdomain.search(params, function(err, data) {
    if (err) {
        console.log("Failed");
        console.log(err);
        return;
    }

    // Turn every hit into a delete operation
    var result = [];
    for (var i = 0; i < data.hits.hit.length; i++) {
        result.push({"type": "delete", "id": data.hits.hit[i].id});
    }

    fs.writeFile("delete.json", JSON.stringify(result), function(err) {
        if (err) { return console.log(err); }
        console.log("The file was saved!");
    });
});

You have to know all the values of at least one facet to be able to request all IDs. In my code, I just put 2 (in the (or ....) section), but you can have more.

Once it is done, you have one delete.json file to be used with the AWS CLI using this command:

aws cloudsearchdomain upload-documents --documents delete.json --content-type application/json --endpoint-url <your CloudSearch endpoint>

... that did the job for me!

Lobworm answered 5/8, 2015 at 13:8 Comment(2)
The query can be a little more efficient if you have mandatory fields (application_name in my case). Then it will be query:"(not application_name:'')" – Lobworm
Thanks, this worked great. I had to add the size param to the search to get all my documents – Tanagra
M
3

I've been doing the following, using the Python adapter, boto, to empty CloudSearch. Not pretty, but it gets the job done. The hard part is keeping the amount you fetch within CloudSearch's 5 MB limit.

    count = CloudSearchAdaptor.Instance().get_total_documents()
    while count > 0:
        # "lolzcat|-lolzcat" matches every document: it either contains the term or it does not
        results = CloudSearchAdaptor.Instance().search("lolzcat|-lolzcat", 'simple', 1000)
        for doc in results.docs:
            CloudSearchAdaptor.Instance().delete(doc['id'])

        CloudSearchAdaptor.Instance().commit()
        # add a delay here if CloudSearch takes too long to propagate the deletes
        count = CloudSearchAdaptor.Instance().get_total_documents()
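
One way to stay within the 5 MB figure mentioned above is to split the delete operations into several batches by encoded size. This is a rough sketch rather than anything boto provides: `chunk_delete_batches` and `MAX_BATCH_BYTES` are hypothetical names, and the limit value is simply taken from the 5 MB mentioned in this answer.

```python
import json

# Assumed limit, taken from the 5 MB figure mentioned above
MAX_BATCH_BYTES = 5 * 1024 * 1024


def chunk_delete_batches(doc_ids, max_bytes=MAX_BATCH_BYTES):
    """Yield JSON strings, each a delete batch whose UTF-8 size stays within max_bytes.

    A single operation larger than max_bytes is still emitted as its own batch.
    """
    current, size = [], 2  # 2 bytes for the enclosing "[" and "]"
    for doc_id in doc_ids:
        op = json.dumps({"type": "delete", "id": doc_id})
        op_size = len(op.encode("utf-8")) + 1  # +1 for the separating comma
        if current and size + op_size > max_bytes:
            yield "[" + ",".join(current) + "]"
            current, size = [], 2
        current.append(op)
        size += op_size
    if current:
        yield "[" + ",".join(current) + "]"
```

Each yielded string can then be uploaded as one document batch. The size count slightly overestimates (one extra comma per batch), which errs on the safe side.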

Cloudsearch adapter class looks something like the following:

from boto.cloudsearch2.layer2 import Layer2
from singleton import Singleton

@Singleton
class CloudSearchAdaptor:

    def __init__(self):
        layer2 = Layer2(
            aws_access_key_id='AWS_ACCESS_KEY_ID',
            aws_secret_access_key='AWS_SECRET_ACCESS_KEY',
            region='AWS_REGION'
        )
        self.domain = layer2.lookup('AWS_DOMAIN')
        self.doc_service = self.domain.get_document_service()
        self.search_service = self.domain.get_search_service()

    @staticmethod
    def delete(id):
        # Queue a delete for this document; it is sent on commit()
        instance = CloudSearchAdaptor.Instance()
        try:
            response = instance.doc_service.delete(id)
        except Exception as e:
            print('Error deleting from CloudSearch')

    @staticmethod
    def search(query, parser='structured', size=1000):
        instance = CloudSearchAdaptor.Instance()
        try:
            results = instance.search_service.search(q=query, parser=parser, size=size)
            return results
        except Exception as e:
            print('Error searching CloudSearch')

    @staticmethod
    def get_total_documents():
        # size=0 returns only the hit count, not the documents themselves
        instance = CloudSearchAdaptor.Instance()
        try:
            results = instance.search_service.search(q='matchall', parser='structured', size=0)
            return results.hits
        except Exception as e:
            print('Error getting total documents from CloudSearch')

    @staticmethod
    def commit():
        # Send the queued operations, then clear the local batch
        try:
            response = CloudSearchAdaptor.Instance().doc_service.commit()
            CloudSearchAdaptor.Instance().doc_service.clear_sdf()
        except Exception as e:
            print('Error committing to CloudSearch')
Mathildamathilde answered 11/12, 2014 at 23:42 Comment(0)
A
2

In PHP, I managed to create a script for cleaning all records using the AWS PHP SDK:

clean.php - http://pastebin.com/Lkyk1D1i config.php - http://pastebin.com/kFkZhxCc

You'll need to configure your keys in config.php and your endpoints in clean.php, then download the AWS PHP SDK, and you're good to go!

Note that it will only clean 10,000 documents at most per run, as Amazon limits how many documents a search can return.

Arsyvarsy answered 1/6, 2015 at 10:17 Comment(0)
F
2

You can manually upload a document batch directly to AWS CloudSearch: Dashboard > Upload Document. If you can enumerate all the document IDs you want to delete, you can write a script to generate the document batch, or create it by hand.

The document batch format should look like this:

sample.json

[
    {
        "type": "delete",
        "id": "1"
    },
    {
        "type": "delete",
        "id": "2"
    }
]
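
If the IDs are already known, a batch in this format can be generated with a few lines of Python. This is only an illustrative sketch; `render_delete_batch` is a hypothetical helper name, and it assumes the IDs should be sent as strings, as in the sample above.

```python
import json


def render_delete_batch(doc_ids):
    """Return a JSON document batch with one delete operation per ID."""
    batch = [{"type": "delete", "id": str(doc_id)} for doc_id in doc_ids]
    return json.dumps(batch, indent=4)
```

Calling `render_delete_batch(["1", "2"])` produces the same structure as sample.json; the returned text can be saved to a file and uploaded from the dashboard.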

How to enumerate all document IDs: run a test search

  • Search: id:* (or any field you are sure is present in every document)
  • Query Parser: Lucene
Faddish answered 12/7, 2018 at 3:45 Comment(0)
C
1

I've managed to create a PowerShell script for it. Check my website here: http://www.mpustelak.com/2017/01/aws-cloudsearch-clear-domain-using-powershell/

Script:

$searchUrl = '[SEARCH_URL]'
$documentUrl = '[DOCUMENT_URL]'
$parser = 'Lucene'
$querySize = 500

function Get-DomainHits()
{
    (Search-CSDDocuments -ServiceUrl $searchUrl -Query "*:*" -QueryParser $parser -Size $querySize).Hits;
}

function Get-TotalDocuments()
{
    (Get-DomainHits).Found
}

function Delete-Documents()
{
    # Build the JSON batch by hand; the id must be quoted to be a valid JSON string
    (Get-DomainHits).Hit | ForEach-Object -begin { $batch = '[' } -process { $batch += '{"type":"delete","id":"' + $_.id + '"},'} -end { $batch = $batch.Remove($batch.Length - 1, 1); $batch += ']' }

    Try
    {
        Invoke-WebRequest -Uri $documentUrl -Method POST -Body $batch -ContentType 'application/json'
    }
    Catch
    {
        $_.Exception
        $_.Exception.Message
    }
}

$total = Get-TotalDocuments
while($total -ne 0)
{
    Delete-Documents

    $total = Get-TotalDocuments

    Write-Host 'Documents left:'$total
    # Sleep for 1 second to give CS time to delete documents
    sleep 1
}
Carpentry answered 4/1, 2017 at 20:48 Comment(0)
N
0

Java version below to clear all data within a cloud search domain:

private static final AmazonCloudSearchDomain cloudSearch = Region
        .getRegion(Regions.fromName(CommonConfiguration.REGION_NAME))
        .createClient(AmazonCloudSearchDomainClient.class, null, null)
        .withEndpoint(CommonConfiguration.SEARCH_DOMAIN_DOCUMENT_ENDPOINT);

public static void main(String[] args) {

    // retrieve documents from cloud search; the default page size is 10,
    // so request the maximum of 10,000 hits per call
    SearchRequest searchRequest = new SearchRequest().withQuery("matchall")
            .withQueryParser(QueryParser.Structured).withSize(10000L);
    Hits hits = cloudSearch.search(searchRequest).getHits();

    if (hits.getFound() != 0) {
        StringBuffer sb = new StringBuffer();
        sb.append("[");

        int i = 1;
        // construct JSON to delete all
        for (Hit hit : hits.getHit()) {
            sb.append("{\"type\": \"delete\",  \"id\": \"").append(hit.getId()).append("\"}");
            if (i < hits.getHit().size()) {
                sb.append(",");
            }
            i++;
        }

        sb.append("]");

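        // send to cloud search; the content length must be the byte count, not the character count
        // (needs java.io.ByteArrayInputStream and java.nio.charset.StandardCharsets)
        byte[] payload = sb.toString().getBytes(StandardCharsets.UTF_8);
        InputStream documents = new ByteArrayInputStream(payload);
        UploadDocumentsRequest uploadDocumentsRequest = new UploadDocumentsRequest()
                .withContentType("application/json").withDocuments(documents).withContentLength((long) payload.length);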
        cloudSearch.uploadDocuments(uploadDocumentsRequest);
    }
}
Novena answered 30/6, 2016 at 8:19 Comment(0)
