Using querySelectorAll on an mshtml.HTMLDocumentClass object in PowerShell causes a crash

I'm trying to do some web-scraping via PowerShell, as I've recently discovered it is possible to do so without too much trouble.

A good starting point is to just fetch the HTML, use Get-Member, and see what I can do from there, like so:

$html = Invoke-WebRequest "https://www.google.com"
$html.ParsedHtml | Get-Member

The methods available to me for fetching specific elements appear to be the following:

getElementById()
getElementsByName()
getElementsByTagName()

For example, I can get the first IMG tag in the document like so:

$html.ParsedHtml.getElementsByTagName("img")[0]

However, after some more research into whether I could use CSS selectors or XPath, I discovered that there are unlisted methods available, since ParsedHtml is just the standard HTML Document object:

querySelector()
querySelectorAll()

So instead of doing:

$html.ParsedHtml.getElementsByTagName("img")[0]

I can do:

$html.ParsedHtml.querySelector("img")

So I was expecting to be able to do:

$html.ParsedHtml.querySelectorAll("img")

...in order to get all of the IMG elements. All the documentation I've found and googling I've done supports this. However, in all my testing this function crashes the calling process and reports a heap corruption exception code in the Event Log (0xc0000374).

I'm using PowerShell 5 on Windows 10 x64. I've tried it in a Win10 x64 VM that is a clean build and fully patched. I've also tried it on Win7 x64 upgraded to PowerShell 5. I haven't tried it on anything prior to PowerShell 5, as all our systems here are upgraded, but I probably will once I have time to spin up a new vanilla VM for testing.

Has anyone run into this issue before? All my research so far is a dead end. Are there alternatives to querySelectorAll? I need to scrape pages that will have predictable sets of tags inside unpredictable layouts and potentially no IDs or classes assigned to the tags, so I want to be able to use selectors that allow structure/nesting/wildcards.

P.S. I've also tried using the InternetExplorer.Application COM object in PowerShell. The result is the same, except that Internet Explorer crashes instead of PowerShell. This was actually my original approach; here's the code:

# create browser object
$ie = New-Object -ComObject InternetExplorer.Application

# make browser visible for debugging, otherwise this isn't necessary for function
$ie.Visible = $true

# browse to page
$ie.Navigate("https://www.google.com")
# wait till browser is not busy
Do { Start-Sleep -m 100 } Until (!$ie.Busy)

# this works
$ie.document.getElementsByTagName("img")[0]

# this works as well
$ie.document.querySelector("img")

# blow it up
$ie.document.querySelectorAll("img")

# we wanna quit the process, but since we blew it up we don't really make it here
$ie.Quit()

Hope I'm not breaking any rules and this post makes sense and is relevant, thanks.

UPDATE

I tested earlier PowerShell versions. v2-v4 crash using the InternetExplorer.Application COM method. v3-v4 crash using the Invoke-WebRequest method; v2 doesn't support Invoke-WebRequest.

Pleura answered 12/5, 2016 at 20:12 Comment(0)

I ran into this problem, too, and posted about it on reddit. I believe the problem happens when PowerShell tries to enumerate the HTML DOM NodeList object returned by querySelectorAll(). The same kind of object is returned by childNodes, which PowerShell can enumerate, so I'm guessing there's some glue code written for .ParsedHtml.childNodes but not for .ParsedHtml.querySelectorAll(). The crash can also be triggered by IntelliSense trying to get tab-completion help for the object.

I found a way around it, though! Just access the native DOM members .item() and .length directly and emit the node objects into a PowerShell array. The following code pulls the newest page of posts from /r/PowerShell, gets the post list anchors via querySelectorAll(), then manually enumerates them into a PowerShell-native array using the native DOM members.

$Result = Invoke-WebRequest -Uri "https://www.reddit.com/r/PowerShell/new/"

$NodeList = $Result.ParsedHtml.querySelectorAll("#siteTable div div p.title a")

$PsNodeList = @()
for ($i = 0; $i -lt $NodeList.Length; $i++) { 
    $PsNodeList += $NodeList.item($i)
}

$PsNodeList | ForEach-Object {
    $_.InnerHtml
}

Edit: .Length seems to work capitalized or lower-case. I would have expected the DOM to be case-sensitive, so either something is going on to help translate or I'm misunderstanding something. Also, the CSS selector is grabbing the source links (self.PowerShell mostly), but that is my CSS selector logic error, not a problem with querySelectorAll(). Note that the results of querySelectorAll() are not live, so modifying them won't modify the original DOM. I haven't tried modifying them or using their methods yet, but clearly we can at the very least read .InnerHtml.

Edit 2: Here is a more-generalized wrapper function:

function Get-FixedQuerySelectorAll {
    param (
        $HtmlWro,
        $CssSelector
    )
    # After assignment, $NodeList will crash powershell if enumerated in any way including Intellisense-completion while coding!
    $NodeList = $HtmlWro.ParsedHtml.querySelectorAll($CssSelector)

    for ($i = 0; $i -lt $NodeList.length; $i++) {
        Write-Output $NodeList.item($i)
    }
}

$HtmlWro is an HTML Web Response Object, the output of Invoke-WebRequest. I originally tried to pass .ParsedHtml, but then it would crash on assignment. Doing it this way emits the nodes to the pipeline, so the caller gets them back as a PowerShell array.
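A minimal usage sketch for the wrapper, assuming Get-FixedQuerySelectorAll is already defined in the session and that this runs in Windows PowerShell, where Invoke-WebRequest populates .ParsedHtml. The variable names $Response and $Anchors are only illustrative, and the selector is the same one used earlier, so it has the same limitations.

```powershell
# Assumes Get-FixedQuerySelectorAll (defined above) is loaded in the current session.
$Response = Invoke-WebRequest -Uri "https://www.reddit.com/r/PowerShell/new/"

# Wrap the call in @() so the result is an array even when zero or one node matches.
$Anchors = @(Get-FixedQuerySelectorAll -HtmlWro $Response -CssSelector "#siteTable div div p.title a")

# Each element is an individual DOM node, which is safe to enumerate.
$Anchors | ForEach-Object {
    $_.InnerHtml
}
```

Passing the whole response object rather than .ParsedHtml keeps the call in line with the crash-on-assignment note above.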

Lattimer answered 6/6, 2016 at 17:44 Comment(4)
Thanks for your response, it is certainly insightful. I was able to follow your suggestion and I'm able to access the $NodeList elements after they are populated into the $PsNodeList array. However, I noticed this only works when using Invoke-WebRequest. With New-Object -ComObject InternetExplorer.Application, it throws Exception from HRESULT: 0x80020101 :( I'm trying to make an interactive scraper, so I would prefer to use the IE ComObject if possible. I'll keep researching. For now, it's at least nice to know there's a workaround for results from Invoke-WebRequest. - Pleura
Hmm. I couldn't get the OP's IE "works" code to work until I used 32-bit PowerShell. But my best efforts couldn't make it return the result of .item(). I did get an attack of the really-clevers and did something cool, but have failed to get the result back into PowerShell so far. I said "screw it, we have the DOM, let's insert some JavaScript." So this PowerShell code injects a <script> element into the DOM, and you can go into the dev console, type NinjaQuerySelectorAll("a"); and get a result in the console. Too few characters left for the code here. - Lattimer
Ugh, not enough room for code and it won't let me reply again. OK, here's a gist of the code: gist.github.com/midnightfreddie/… - Lattimer
While $NodeList.item($i) works, $NodeList.item(0) throws Exception from HRESULT: 0x80020101. To access a node by literal index, use a string: $NodeList.item('0'). - Antiparticle

@midnightfreddie's solution worked fine for me before, but now it throws Exception from HRESULT: 0x80020101 when calling $NodeList.item($i).

I found the following workaround:

function Invoke-QuerySelectorAll($node, [string] $selector)
{
    $nodeList = $node.querySelectorAll($selector)
    $nodeListType = $nodeList.GetType()
    $result = @()
    for ($i = 0; $i -lt $nodeList.length; $i++)
    {
        $result += $nodeListType.InvokeMember("item", [System.Reflection.BindingFlags]::InvokeMethod, $null, $nodeList, $i)
    }
    return $result
}

This one works for New-Object -ComObject InternetExplorer.Application as well.
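As a rough sketch of how that might look with the IE COM object from the question, assuming Invoke-QuerySelectorAll is already defined in the session. The ReadyState check and the try/finally cleanup are my additions rather than part of the original code, and $images is just an illustrative name.

```powershell
# Assumes Invoke-QuerySelectorAll (defined above) is loaded in the current session.
$ie = New-Object -ComObject InternetExplorer.Application
try {
    $ie.Navigate("https://www.google.com")

    # Wait until the browser is idle and the document has finished loading
    # (ReadyState 4 is READYSTATE_COMPLETE).
    do { Start-Sleep -Milliseconds 100 } until ((-not $ie.Busy) -and ($ie.ReadyState -eq 4))

    # Go through the wrapper; never enumerate the raw NodeList directly.
    $images = Invoke-QuerySelectorAll $ie.Document "img"
    $images | ForEach-Object { $_.src }
}
finally {
    # Close the browser even if the query throws.
    $ie.Quit()
}
```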

Antiparticle answered 6/12, 2016 at 18:30 Comment(0)