What's the best method to EXTRACT product names given a list of SKU numbers from a website?

I have a problem.

I have a list of SKU numbers (hundreds of them), and I'm trying to match each one with the title of the product it belongs to. I have thought of a few ways to accomplish this, but I feel like I'm missing something... I'm hoping someone here has a quick and efficient idea to help me get this done.

The products come from Aidan Gray.

Attempt #1 (Batch Program Method) - FAIL:

After searching for a SKU on the Aidan Gray website, the site returns a URL that looks like this:

http://www.aidangrayhome.com/catalogsearch/result/?q=SKUNUMBER

... with "SKUNUMBER" obviously being a SKU.
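
As a minimal sketch of how one search URL per SKU could be built from that pattern: the function name `searchUrl` and the variable `BASE` are made up for illustration, and percent-encoding the SKU is an assumption that only matters if a SKU ever contains spaces or characters such as `&`.

```javascript
// Hypothetical helper, not taken from the site or the question:
// builds the catalog search URL for a single SKU.
var BASE = 'http://www.aidangrayhome.com/catalogsearch/result/?q=';

function searchUrl(sku) {
    // Strip stray whitespace left over from the SKU list, then
    // percent-encode so the value cannot break the query string.
    var clean = String(sku).replace(/^\s+|\s+$/g, '');
    return BASE + encodeURIComponent(clean);
}
```

Calling `searchUrl` once per line of the SKU list gives the full set of search pages to visit.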

The first result on the page is almost always the product.

To click the first result from the address bar, the following can be entered (provided the browser allows running JavaScript from the address bar):

javascript:{document.getElementsByClassName("product-image")[0].click();}

I wanted to create a .bat file through Command Prompt and execute the following command:

firefox http://www.aidangrayhome.com/catalogsearch/result/?q=SKUNUMBER javascript:{document.getElementsByClassName("product-image")[0].click();}

... but Firefox doesn't seem to allow these two commands to execute in the same tab.

If that had worked, I was going to paste the resulting links into http://tools.buzzstream.com/meta-tag-extractor to get the page titles, export the data to CSV, and copy over what I needed.

Unfortunately, I am unable to open both the webpage and the JavaScript in the same tab through a batch program.

Attempt #2 (I'm Feeling Lucky Method):

I was going to use Google's &btnI URL suffix to automatically redirect to the first result.

http://www.google.com/search?btnI&q=site:aidangrayhome.com+SKUNUMBER

After opening all the links in tabs, I was going to use a Firefox add-on called "Send Tab URLs" to copy the names of the tabs (which contain the product names) to the clipboard.

The problem is that most of the results were simply not lucky enough...

If anybody has an idea or tip to get this accomplished, I'd be very grateful.

Bewitch answered 25/3, 2015 at 22:37 Comment(2)
Look into Greasemonkey and Tampermonkey, which can run custom JS on any site. Then you can launch the browser from the command line as expected, and the code will run if the URL matches a userscript pattern. You can also look into headless browsers like PhantomJS. - Responsiveness
You should provide some valid SKU numbers. - Swede

I recommend using JScript for this. It's easy to include as hybrid code in a batch script, its structure and syntax are familiar to anyone comfortable with JavaScript, and you can use it to fetch web pages via XMLHttpRequest (a.k.a. Ajax by the less-informed) and build a DOM object from the .responseText using an htmlfile COM object.

Anyway, challenge: accepted. Save this with a .bat extension. It'll look for a text file containing SKUs, one per line, and fetch and scrape the search page for each, writing info from the first anchor element with a .className of "product-image" to a CSV file.

@if (@CodeSection == @Batch) @then

@echo off
setlocal

set "skufile=sku.txt"
set "outfile=output.csv"
set "URL=http://www.aidangrayhome.com/catalogsearch/result/?q="

rem // invoke JScript portion
cscript /nologo /e:jscript "%~f0" "%skufile%" "%outfile%" "%URL%"

echo Done.

rem // end main runtime
goto :EOF

@end // end batch / begin JScript chimera

var fso = WSH.CreateObject('scripting.filesystemobject'),
    skufile = fso.OpenTextFile(WSH.Arguments(0), 1),
    skus = skufile.ReadAll().split(/\r?\n/),
    outfile = fso.CreateTextFile(WSH.Arguments(1), true),
    URL = WSH.Arguments(2);

skufile.Close();

String.prototype.trim = function() { return this.replace(/^\s+|\s+$/g, ''); }

// returns a DOM root object
function fetch(url) {
    var XHR = WSH.CreateObject("Microsoft.XMLHTTP"),
        DOM = WSH.CreateObject('htmlfile');

    WSH.StdErr.Write('fetching ' + url);

    XHR.open("GET",url,true);
    XHR.setRequestHeader('User-Agent','XMLHTTP/1.0');
    XHR.send('');
    while (XHR.readyState!=4) {WSH.Sleep(25)};
    DOM.write(XHR.responseText);
    return DOM;
}

function out(what) {
    WSH.StdErr.Write(new Array(79).join(String.fromCharCode(8)));
    WSH.Echo(what);
    outfile.WriteLine(what);
}

WSH.Echo('Writing to ' + WSH.Arguments(1) + '...')
out('sku,product,URL');

for (var i=0; i<skus.length; i++) {
    if (!skus[i]) continue;

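    // encode the SKU so special characters cannot break the query string
    var DOM = fetch(URL + encodeURIComponent(skus[i])),
        anchors = DOM.getElementsByTagName('a');

    for (var j=0; j<anchors.length; j++) {
        if (/\bproduct-image\b/i.test(anchors[j].className)) {
            // double any embedded quotes so the CSV row stays well-formed
            var title = anchors[j].title.trim().replace(/"/g, '""');
            out(skus[i]+',"' + title + '","' + anchors[j].href + '"');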
            break;
        }
    }
}

outfile.Close();

Too bad the htmlfile COM object doesn't support getElementsByClassName. :/ But this seems to work well enough in my testing.
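
If you want something closer to getElementsByClassName, here is a rough sketch of a stand-in. The name `firstByClass` is made up for illustration; it is written in plain old-style JavaScript so it should also run under JScript, though I have only reasoned about that rather than tested it against the htmlfile object. Unlike the `\b` regex in the script, it compares whole class tokens, so a class such as "product-image-small" would not count as a match.

```javascript
// Illustrative helper (not part of the script above): returns the first
// element with the given tag name whose className contains `cls` as a
// whole, space-separated token, or null if there is none.
function firstByClass(root, tag, cls) {
    var nodes = root.getElementsByTagName(tag);
    for (var i = 0; i < nodes.length; i++) {
        // className may be missing or empty; treat it as an empty string
        var tokens = String(nodes[i].className || '').split(/\s+/);
        for (var j = 0; j < tokens.length; j++) {
            if (tokens[j] === cls) return nodes[i];
        }
    }
    return null;
}
```

With that in place, the inner loop could be replaced by something like `var a = firstByClass(DOM, 'a', 'product-image');` followed by a null check before writing the row.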

Spiderwort answered 26/3, 2015 at 14:19 Comment(0)
