Indexing and Querying URLs in Solr

Asked 13/1, 2011 at 18:59 Answered 21/10, 2016 at 0:3

I have a database of URLs that I would like to search. Because URLs are not always written the same way (they may or may not include www, for example), I am looking for the correct way to index and query them. I've tried a few things and I think I'm close, but I'm not sure why it doesn't work.

Here is my custom field type:

<fieldType name="customUrlType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

For example:

When indexed, http://www.twitter.com/AndersonCooper will produce the following words at different positions: http, www, twitter, com, andersoncooper

If I search for just twitter.com/andersoncooper, I would like the query to match the record that was indexed, which is why I also use the WordDelimiterFilter (WDF) to split the search query. However, the search query ends up looking like this:

myfield:("twitter com andersoncooper") when I really want it to match all records that have all of the following separate words: twitter com andersoncooper

Is there a different query filter or tokenizer I should be using?

Sociable answered 13/1, 2011 at 18:59 Comment(2)

Did you ever end up sorting this out? – Tattletale 13/9, 2011 at 6:59

Did you figure out what needs to be done here? – Heterolysis 28/3, 2014 at 16:10

If I understand this statement from your question:

myfield:("twitter com andersoncooper") when I really want it to match all records that have all of the following separate words: twitter com andersoncooper

you are trying to write a query that would match both:

http://www.twitter.com/AndersonCooper

and

http://www.andersoncooper.com/socialmedia/twitter

(both links contain all of the tokens), but not match either

http://www.facebook.com/AndersonCooper

http://www.twitter.com/AliceCooper

If that is correct, your existing configuration should work just fine. Assuming that you are using the standard query parser and querying via curl or some other URL-based mechanism, the query parameter needs to look like this:

&q=myField:andersoncooper AND myField:twitter AND myField:com

One of the gotchas that may have been tripping you up is that the default query operator (between terms in a query) is "OR", which is why the ANDs must be explicitly specified above. Alternatively, to save some space, you can change the default query operator to "AND" like this:

&q.op=AND&q=myField:(andersoncooper twitter com)
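
If you would rather not pass these parameters on every request, the same defaults can probably be set on the request handler instead. The fragment below is only a sketch: the handler name /select is an assumption about your solrconfig.xml, and myField is the field name used in the queries above (the question writes it as myfield; field names are case-sensitive, so use whichever spelling your schema defines).

```xml
<!-- solrconfig.xml: hypothetical handler defaults; adjust the name to your setup -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- require every term to match unless the query says otherwise -->
    <str name="q.op">AND</str>
    <!-- field searched when the query does not name one -->
    <str name="df">myField</str>
  </lst>
</requestHandler>
```

With defaults like these, a request such as &q=andersoncooper twitter com should be treated the same as the explicit AND query above. A request can still override either value by passing its own q.op or df parameter.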

Undershoot answered 21/10, 2016 at 0:3 Comment(0)

This should be the simplest solution:

<field name="iconUrl" type="string" indexed="true" stored="true" />

But for your requirement you will need to make it multivalued and index each URL three ways: (1) unchanged, (2) without http, and (3) without www.

Or make the URL searchable via wildcards at the front (which is slower, I guess).
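
As a rough sketch of the multivalued approach, the field definition above would gain a multiValued attribute:

```xml
<!-- schema.xml: the same iconUrl field, now accepting several values per document -->
<field name="iconUrl" type="string" indexed="true" stored="true" multiValued="true" />
```

Each document would then carry all three variants. In the update message below, the id field and its value are made up for illustration, and the third variant is assumed to drop both the scheme and www:

```xml
<add>
  <doc>
    <!-- hypothetical unique key; use whatever your schema defines -->
    <field name="id">1</field>
    <!-- 1. unchanged -->
    <field name="iconUrl">http://www.twitter.com/AndersonCooper</field>
    <!-- 2. without http -->
    <field name="iconUrl">www.twitter.com/AndersonCooper</field>
    <!-- 3. without www (assumed here to drop the scheme as well) -->
    <field name="iconUrl">twitter.com/AndersonCooper</field>
  </doc>
</add>
```

Note that a string field is matched exactly, including case, so a query for twitter.com/andersoncooper would still miss twitter.com/AndersonCooper unless the values are lowercased before indexing and querying.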

Hazelwood answered 16/1, 2011 at 22:53 Comment(2)

Yeah, string is from StrField; it won't be analyzed, but it can be stored and indexed, so it's appropriate for URLs, I guess. – Salic 1/9, 2015 at 10:44

This won't work for the OP's queries which specify only parts of the url – Undershoot 21/10, 2016 at 0:2


You can try the keyword tokenizer.

From the book Solr 1.4 Enterprise Search Server, published by Packt:

KeywordTokenizerFactory: This doesn't actually do any tokenization or anything at all for that matter! It returns the original text as one term. There are cases where you have a field that always gets one word, but you need to do some basic analysis like lowercasing. However, it is more likely that due to sorting or faceting requirements you will require an indexed field with no more than one term. Certainly a document's identifier field, if supplied and not a number, would use this.
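
A field type along the lines of that description might look like the following. This is only a sketch: the type name urlKeywordType is made up here, and the lowercase filter stands in for the "basic analysis" the quote mentions.

```xml
<!-- schema.xml: hypothetical field type that keeps the whole URL as one lowercased term -->
<fieldType name="urlKeywordType" class="solr.TextField">
  <analyzer>
    <!-- emits the entire input as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- makes matching case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the whole URL becomes a single term, a query for twitter.com/andersoncooper would not match http://www.twitter.com/AndersonCooper on its own; it would need a leading wildcard or some prefix stripping on top of this.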

Zorina answered 14/1, 2011 at 14:7 Comment(0)
