How to crawl a website that has SAML authentication using ManifoldCF or Nutch?

I am trying to crawl a website, specifically a Google Site protected by SAML authentication, using ManifoldCF, and to index the crawled data into Apache Solr. When I crawl the URL, I get a 302 redirect to the login page, and the job then reports RESPONSECODENOTINDEXABLE.

I am not sure whether I have authenticated correctly. ManifoldCF offers HTTP basic authentication, NTLM authentication, and session-based access credentials. I used the session-based method, which looks more like form-based authentication than SAML authentication.

Has anybody crawled a SAML-protected website using ManifoldCF? If not with ManifoldCF, has anyone been able to accomplish this with Apache Nutch? I am afraid it also provides only HTTP basic, Digest, and NTLM authentication.

Any insight would be helpful. I can provide more information about the issue if anyone thinks this can be accomplished easily. Basically, when I crawl https://sites.google.com/a/my-sub-domain.com, it redirects to the SSO login page and the crawler stops there, reporting the 302 response. It's an intranet-based website.

Congregation answered 8/8, 2016 at 14:7 Comment(0)
B
1

There is no support in Nutch for SSO authentication using SAML. You need to handle it by writing a custom plugin. We have extended the protocol-selenium plugin to handle SAML flows.

Brickyard answered 6/7, 2018 at 17:30 Comment(0)
A
0

Not sure whether this helps, but try it out. In Nutch, you can provide credentials for logging in to a page through the httpclient-auth.xml file in the conf directory. There you can specify your host name along with the credentials.

<auth-configuration>
   <credentials username="admin" password="admin123">
      <authscope host="hostname" realm="login"/>
      <default/>
   </credentials>
</auth-configuration>

Similarly you can add any number of credentials to this configuration.

To crawl an HTTPS site, change protocol-http to protocol-httpclient in the plugin.includes property in nutch-site.xml.

Apply answered 2/6, 2017 at 12:31 Comment(0)
B
0

We have modified the logic in the Nutch protocol-selenium plugin to handle SSO flows. You need to wait for the redirect to the SSO page, then handle the SSO login using Selenium, and then wait again for the redirect back to the original page.
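The three steps above (wait for the SSO redirect, log in, wait for the redirect back) can be sketched roughly as follows. This is not the code from our plugin, just a minimal Python illustration that works with any Selenium-style driver object. The function names, the `idp_host` parameter, and above all the form selectors (`input[name='username']` and so on) are assumptions; every IdP login page is different, so inspect yours and adjust.

```python
import time
from urllib.parse import urlparse


def wait_for_host(driver, expected_host, timeout=30.0, poll=0.5):
    """Poll driver.current_url until its host equals expected_host."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if urlparse(driver.current_url).hostname == expected_host:
            return True
        time.sleep(poll)
    return False


def login_via_sso(driver, start_url, idp_host, username, password, timeout=30.0):
    """Open start_url, complete the IdP login form, wait to land back on the site.

    Returns True if the browser ends up on the host of start_url.
    """
    target_host = urlparse(start_url).hostname
    driver.get(start_url)

    # Step 1: wait for the redirect to the SSO (IdP) login page.
    if not wait_for_host(driver, idp_host, timeout):
        # No redirect seen; a session may already exist for this site.
        return urlparse(driver.current_url).hostname == target_host

    # Step 2: fill in and submit the login form.
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
    # The selectors below are hypothetical and depend on your IdP's page.
    driver.find_element("css selector", "input[name='username']").send_keys(username)
    driver.find_element("css selector", "input[name='password']").send_keys(password)
    driver.find_element("css selector", "button[type='submit']").click()

    # Step 3: wait for the redirect back to the original site.
    return wait_for_host(driver, target_host, timeout)
```

With a real browser you would pass something like `selenium.webdriver.Firefox()` as `driver`. Matching on the host name keeps the waits independent of the exact redirect URLs, which usually carry changing query parameters.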

If two-factor authentication is required, things become more complex. In that case you can configure Google Authenticator as the second factor (if your IdP allows it) and generate the TOTP code from its shared secret.
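For reference, a TOTP code of the kind Google Authenticator shows can be computed with the standard library alone. The sketch below follows RFC 6238 with HMAC-SHA1 and a 30-second step, which are the usual defaults; whether your IdP uses those defaults is an assumption, and the function name `totp` is just for illustration.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32, for_time=None, digits=6, step=30):
    """Return the RFC 6238 TOTP code for a base32-encoded shared secret."""
    # Authenticator secrets are base32, often shown with spaces and no padding.
    cleaned = secret_b32.replace(" ", "").upper()
    cleaned += "=" * (-len(cleaned) % 8)
    key = base64.b32decode(cleaned)

    now = time.time() if for_time is None else for_time
    counter = int(now // step)

    # HOTP (RFC 4226): HMAC-SHA1 over the 8-byte big-endian counter.
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()

    # Dynamic truncation: the low nibble of the last byte picks a 4-byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF

    return str(code % 10 ** digits).zfill(digits)
```

The crawler would call `totp(secret)` right before typing the code into the second-factor form, since each code is only valid for a short window. Keep the shared secret out of source control; anyone holding it can generate valid codes.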

There is no standard way to crawl files that sit behind authentication. You can configure the driver to always download files and then work with the downloaded file.
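One way to do that with Firefox is sketched below: a set of profile preferences that make the browser save files without prompting, plus a small helper that waits until the download has finished. The helper names, the list of temporary suffixes, and the assumption that the download directory starts out empty are mine; the preference names are standard Firefox ones, but check them against your browser version.

```python
import os
import time

# Suffixes browsers commonly use for in-progress downloads (assumed list).
TEMP_SUFFIXES = (".part", ".crdownload", ".tmp")


def firefox_download_prefs(download_dir, mime_types):
    """Firefox preferences that save the given MIME types without a dialog."""
    return {
        "browser.download.folderList": 2,  # 2 = use a custom directory
        "browser.download.dir": download_dir,
        "browser.helperApps.neverAsk.saveToDisk": ",".join(mime_types),
        "pdfjs.disabled": True,  # download PDFs instead of rendering them
    }


def wait_for_download(download_dir, timeout=60.0, poll=0.5):
    """Return the path of the newest finished file, or None on timeout.

    Assumes download_dir is empty before the download starts.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = os.listdir(download_dir)
        # Finished once files exist and none is still a temporary file.
        if names and not any(n.endswith(TEMP_SUFFIXES) for n in names):
            newest = max(
                names,
                key=lambda n: os.path.getmtime(os.path.join(download_dir, n)),
            )
            return os.path.join(download_dir, newest)
        time.sleep(poll)
    return None
```

In Selenium, each entry of the dictionary can be applied to a `FirefoxOptions` object with `options.set_preference(name, value)` before the driver is created.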

You can also handle the auth flow with another HTTP client. If you need the content of dynamic pages (after all JavaScript and Ajax requests have completed), Selenium is the best choice, and if you are already using it, you can move the auth part into Selenium as well.
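If you go the plain HTTP client route, the part that trips up form-based crawlers is usually the last hop: with the SAML HTTP-POST binding, the IdP answers the login with an HTML page containing an auto-submitting form that carries `SAMLResponse` (and often `RelayState`) back to the service provider. A minimal sketch of pulling that form out of the page with the standard library follows. The function and class names are illustrative, and it assumes the first form on the page is the SAML one.

```python
from html.parser import HTMLParser


class _FirstFormParser(HTMLParser):
    """Collect the action URL and input fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attr.get("action")
        elif tag == "input" and self._in_form and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms


def extract_saml_post(html):
    """Return (action_url, fields) for a SAML POST form, or None if absent."""
    parser = _FirstFormParser()
    parser.feed(html)
    if parser.action is None or "SAMLResponse" not in parser.fields:
        return None
    return parser.action, parser.fields
```

Your HTTP client would then POST `fields` to `action_url` using the same cookie jar it used for the login; the session cookie set by that response is what later page requests need to present.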

Brickyard answered 21/1, 2019 at 18:11 Comment(0)
