I have a database of URLs that I would like to search. Because URLs are not always written the same (may or may not have www), I am looking for the correct way to Index and Query urls. I've tried a few things, and I think I'm close but not sure why it doesn't work:
Here is my custom field type:
<fieldType name="customUrlType" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
For example:
http://www.twitter.com/AndersonCooper when indexed, will have the following words in different positions: http,www,twitter,com,andersoncooper
If I search for simply twitter.com/andersoncooper, I would like this query to match the record that was indexed, which is why I also use the WDF to split the search query, however the search query ends up being like so:
myfield:("twitter com andersoncooper") when really want it to match all records that have all of the following separate words: twitter com andersoncooper
Is there a different query filter or tokenizer I should be using?