I'm trying to do some web scraping with PowerShell, as I've recently discovered it's possible to do so without too much trouble.
A good starting point is to just fetch the HTML, use Get-Member, and see what I can do from there, like so:
$html = Invoke-WebRequest "https://www.google.com"
$html.ParsedHtml | Get-Member
The methods available to me for fetching specific elements appear to be the following:
getElementById()
getElementsByName()
getElementsByTagName()
For example, I can get the first IMG tag in the document like so:
$html.ParsedHtml.getElementsByTagName("img")[0]
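The other two behave similarly; for example (the "main" id below is hypothetical, while "q" is the name of Google's search box):
$html.ParsedHtml.getElementById("main")    # element with a given id (hypothetical)
$html.ParsedHtml.getElementsByName("q")    # elements with a given name attribute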
However, after doing some more research into whether I could use CSS selectors or XPath, I discovered there are additional methods that Get-Member doesn't list, since this is just the standard HTML Document object:
querySelector()
querySelectorAll()
So instead of doing:
$html.ParsedHtml.getElementsByTagName("img")[0]
I can do:
$html.ParsedHtml.querySelector("img")
So I was expecting to be able to do:
$html.ParsedHtml.querySelectorAll("img")
...in order to get all of the IMG elements. All the documentation I've found and all the googling I've done support this. However, in all my testing this call crashes the calling process and logs a heap corruption exception (code 0xc0000374) in the Event Log.
I'm using PowerShell 5 on Windows 10 x64. I've tried it in a Win10 x64 VM that is a clean build, fully patched. I've also tried it on Win7 x64 upgraded to PowerShell 5. I haven't tried anything prior to PowerShell 5, as all our systems here are upgraded, but I probably will once I have time to spin up a new vanilla VM for testing.
Has anyone run into this issue before? All my research so far has hit a dead end. Are there alternatives to querySelectorAll? I need to scrape pages that will have predictable sets of tags inside unpredictable layouts, potentially with no IDs or classes assigned to the tags, so I want to be able to use selectors that support structure, nesting, and wildcards.
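The closest fallback I can think of is filtering the output of the methods that do work, though it gets unwieldy for anything deeply nested. For example, a rough sketch emulating the descendant selector "div img" (the DIV parent check is just an illustration):
# emulate the CSS selector "div img" without querySelectorAll
$html = Invoke-WebRequest "https://www.google.com"
$html.ParsedHtml.getElementsByTagName("img") |
    Where-Object { $_.parentNode.tagName -eq "DIV" }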
P.S. I've also tried using the InternetExplorer.Application COM object in PowerShell; the result is the same, except that instead of PowerShell crashing, Internet Explorer crashes. This was actually my original approach; here's the code:
# create browser object
$ie = New-Object -ComObject InternetExplorer.Application
# make the browser visible for debugging; not necessary for functionality
$ie.Visible = $true
# browse to page
$ie.Navigate("https://www.google.com")
# wait until the browser is not busy and the document has fully loaded
Do { Start-Sleep -Milliseconds 100 } Until (-not $ie.Busy -and $ie.ReadyState -eq 4)
# this works
$ie.document.getElementsByTagName("img")[0]
# this works as well
$ie.document.querySelector("img")
# blow it up
$ie.document.querySelectorAll("img")
# we wanna quit the process, but since we blew it up we don't really make it here
$ie.Quit()
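(Side note: if Quit() is ever reached, it's also good hygiene to release the COM reference afterwards; a minimal sketch:)
[System.Runtime.InteropServices.Marshal]::ReleaseComObject($ie) | Out-Null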
Hope I'm not breaking any rules and this post makes sense and is relevant, thanks.
UPDATE
I tested earlier PowerShell versions: v2-v4 crash using the InternetExplorer.Application COM method; v3-v4 crash using the Invoke-WebRequest method (v2 doesn't support it).
A suggested workaround does let me iterate the $NodeList elements after they are populated into the $PsNodeList array. However, I noticed this only works if using Invoke-WebRequest. If utilizing New-Object -ComObject InternetExplorer.Application, it throws me Exception from HRESULT: 0x80020101 :( I'm trying to make an interactive scraper, so I would prefer to use the IE ComObject if possible. I'll keep researching. For now, it's at least nice to know there's a workaround for results from Invoke-WebRequest. – Pleura
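For reference, the workaround code itself isn't shown here; judging by the variable names above, it presumably looks something like this sketch, which assumes the crash occurs when PowerShell enumerates the returned collection and sidesteps that with the collection's item() method:
$html = Invoke-WebRequest "https://www.google.com"
$NodeList = $html.ParsedHtml.querySelectorAll("img")
# copy each node into a native PowerShell array via item(),
# avoiding direct enumeration/indexing of the COM collection
$PsNodeList = @()
for ($i = 0; $i -lt $NodeList.Length; $i++) {
    $PsNodeList += $NodeList.item($i)
}
$PsNodeList.Count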