How can I use XPath to perform a case-insensitive search and support non-English characters?

I am performing a search in an XML file, using the following code:

$result = $xml->xpath("//StopPoint[contains(StopName, '$query')]");

Where $query is the search query, and StopName is the name of a bus stop. The problem is, it's case sensitive.

Beyond that, I would also like to be able to search with non-English characters such as ÆØÅæøå, so that Norwegian names are returned.

How is this possible?

News answered 9/3, 2009 at 12:12 Comment(1)
For those looking for a solution to this problem, here is an article that discusses an alternative approach: codingexplained.com/coding/php/… – Birkenhead
12

In XPath 1.0 (which is, I believe, the best you can get with PHP SimpleXML), you'd have to use the translate() function to produce all-lowercase output from mixed-case input.

For convenience, I would wrap it in a function like this:

function findStopPointByName($xml, $query) {
  $upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅ"; // add any characters...
  $lower = "abcdefghijklmnopqrstuvwxyzæøå"; // ...that are missing

  $arg_stopname = "translate(StopName, '$upper', '$lower')";
  $arg_query    = "translate('$query', '$upper', '$lower')";

  return $xml->xpath("//StopPoint[contains($arg_stopname, $arg_query)]");
}

As a sanitizing measure I would either completely forbid or escape single quotes in $query, because they will break your XPath string if they are ignored.
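
A minimal sketch of one way to avoid maintaining the two alphabets by hand, offered as an assumption rather than as part of the answer above: since translate() only has to fold the characters that actually occur in the query, the alphabets can be derived from the query itself with PHP's mb_strtolower() and mb_strtoupper(). The helper name and the sample XML are hypothetical, and the mbstring extension is assumed to be available. Queries that contain quotes, or characters whose upper-case form has a different length (such as ß), are simply rejected here.

```php
<?php
// Hypothetical sample data; the element names follow the question.
$stops = simplexml_load_string(
    '<Stops>'
    . '<StopPoint><StopName>Østerås</StopName></StopPoint>'
    . '<StopPoint><StopName>Sørenga</StopName></StopPoint>'
    . '</Stops>'
);

// Hypothetical helper: case-insensitive "contains" search in XPath 1.0.
function caseInsensitiveContains(SimpleXMLElement $xml, string $query): array
{
    $lower = mb_strtolower($query, 'UTF-8');
    $upper = mb_strtoupper($lower, 'UTF-8');

    // Reject quotes (they would break the XPath string literal) and any
    // case mapping that is not one character to one character.
    if (strpbrk($query, "'\"") !== false
        || mb_strlen($upper, 'UTF-8') !== mb_strlen($lower, 'UTF-8')) {
        return [];
    }

    // Fold only the characters that occur in the query, then compare
    // against the lower-cased query.
    $expr = "//StopPoint[contains(translate(StopName, '$upper', '$lower'), '$lower')]";

    return $xml->xpath($expr) ?: [];
}

foreach (caseInsensitiveContains($stops, 'ØSTER') as $stop) {
    echo $stop->StopName, "\n";
}
```

The trade-off is that the file must be UTF-8 and that unusual case mappings (for example letters with several lower-case forms) may still be missed, so the fixed-alphabet version remains the more predictable choice when the character set is known in advance.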

Methaemoglobin answered 9/3, 2009 at 12:57 Comment(0)
10

In XPath 2.0 you can use the lower-case() function, which is Unicode-aware, so it will handle non-ASCII characters fine.

contains(lower-case(StopName), lower-case('$query'))

To use XPath 2.0 you need an XSLT 2.0 processor, for example Saxon, which you can access from PHP via JavaBridge.

Imperialism answered 9/3, 2009 at 12:28 Comment(3)
This gives me following errors: - xmlXPathCompOpEval: function lower-case not found - Unregistered function – News
You're probably using XPath 1.0, this function is only available in XPath 2.0 – Imperialism
I solved it with using translate, to convert all characters to lower-case. Thanks for your help :) – News
3

Non-English names should not be a problem. Just add them to your XPath. (XML is defined as using Unicode).

As for case-insensitivity, ...

XPath 1.0 includes the following statement:

Two strings are equal if and only if they consist of the same sequence of UCS characters.

So even using explicit predicates on the local-name will not help.

XPath 2 includes functions to map case. E.g. fn:upper-case


Additional: using XPath's translate function should allow case mapping to be faked in XPath 1, but the input will need to include every cased code point you and your users will ever need. Because this call maps lower case to upper case, the literal being compared must be upper case as well:

"TEST" = translate($inputString, "abcdefghijklmnopqrstuvwxyz", "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
Rosenkrantz answered 9/3, 2009 at 12:27 Comment(7)
As I commented below, PHP tells me that the function lower-case and upper-case can't be found.. :/ – News
@termserv: XML is always unicode. Even if your XML files are not in a Unicode-capable encoding, once in memory this will make no difference. – Rosenkrantz
@Richard: An up-vote for the answer you took the "translate()" idea from would have been fair. – Methaemoglobin
@Tomalak: I forgot, sorry, but asking for an up-vote pretty much negates it. – Rosenkrantz
I know. ;-) It's also not that I would desperately need it (in fact, if you had simply credited me without up-voting it would have been okay). Maybe I should have made a smiley right away, as it wasn't meant to be aggressive or anything. – Methaemoglobin
Your translate clause is not quite right, your alphabet is little screwed up. translate(..,'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ') – Holmen
@Holmen The order was the same (so would work), but correctly in alphabetical order is better.... – Rosenkrantz
0

In addition:

$xml->xpath("//StopPoint[contains(StopName, '$query')]");

You will need to strip out any apostrophe characters from $query to avoid breaking your expression.

In XPath 2.0 you can double-up the quote being used in the delimiter to put that quote into a string literal, but in XPath 1.0 it's impossible to include the delimiter in the string.
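
As a hedged alternative to stripping apostrophes, a value that contains both quote types can still be passed to XPath 1.0 by assembling it with concat(). The sketch below assumes a hypothetical helper name, xpathLiteral(), and hypothetical sample data. It returns a plain quoted literal when one quote type is free, and otherwise splits the value on apostrophes and re-joins the pieces with "'" inside concat().

```php
<?php
// Hypothetical helper: turn any PHP string into an XPath 1.0 string expression.
function xpathLiteral(string $value): string
{
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";      // no apostrophe: single-quote it
    }
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';      // no double quote: double-quote it
    }
    // Both quote types occur: split on ' and glue the pieces back together
    // with a double-quoted apostrophe, e.g. concat('a', "'", 'b"c').
    $parts = explode("'", $value);

    return "concat('" . implode("', \"'\", '", $parts) . "')";
}

// Hypothetical sample data; the element names follow the question.
$xml = simplexml_load_string(<<<'XML'
<Stops>
  <StopPoint><StopName>St. Olav's "plass"</StopName></StopPoint>
  <StopPoint><StopName>Majorstuen</StopName></StopPoint>
</Stops>
XML
);

$query   = 'Olav\'s "pl';
$matches = $xml->xpath('//StopPoint[contains(StopName, ' . xpathLiteral($query) . ')]');

foreach ($matches as $stop) {
    echo $stop->StopName, "\n";
}
```

Note that the helper output replaces the quoted '$query' in the expression entirely; it must not be wrapped in additional quotes.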

Foliole answered 9/3, 2009 at 16:19 Comment(0)
0

You can use a PHP function from XPath to iterate through each character to do a case fold:

function fold_case($t=''){
 // IF BLANK, RETURN BLANK
 if($t==''){return'';}

 // FOR EACH CODEPOINT, FOLD CASE & ADD TO NEW TEXT
 $n='';
 $i=IntlBreakIterator::createCodePointInstance();
 $i->setText($t);
 foreach($i->getPartsIterator() as$c) 
 {$n.=IntlChar::foldCase($c,IntlChar::FOLD_CASE_DEFAULT);}

 // RETURN NEW TEXT
 return$n;
}

An alternative uses the transliterator to convert to lowercase:

function lower_case($t=''){return($t==''?'':transliterator_transliterate('Lower',$t));}

Unfortunately, PHP's transliterator does not accept a Fold ID in place of Lower, so full case folding still requires the per-code-point iteration shown above.

These can be enabled in the XSLT processor by:

$t=new XSLTProcessor;
$t->importStylesheet($s);
$t->registerPHPFunctions(['fold_case','lower_case']);

They can be included in any XPath expression, as in:

<xsl:variable name="search_text" select="php:function('fold_case',$text)"></xsl:variable>

If the text is in an attribute, force it to be a string by passing string(@text).
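
For the original search, which does not involve XSLT at all, a similar mechanism exists on DOMXPath. The following is a sketch under stated assumptions rather than part of the answer above: it swaps in PHP's built-in mb_strtolower() (mbstring extension) for the intl-based helpers, uses hypothetical sample data, and assumes the query contains no quote characters. Note that DOMXPath expects the php prefix to be bound to http://php.net/xpath, not to the XSLT namespace.

```php
<?php
// Hypothetical sample data; the element names follow the question.
$doc = new DOMDocument();
$doc->loadXML(
    '<Stops>'
    . '<StopPoint><StopName>Ås stasjon</StopName></StopPoint>'
    . '<StopPoint><StopName>Økern</StopName></StopPoint>'
    . '</Stops>'
);

$xpath = new DOMXPath($doc);
// Bind the php: prefix and allow exactly one PHP function to be called.
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('mb_strtolower');

$query  = 'ØKERN';                          // assumed to contain no quotes
$needle = mb_strtolower($query, 'UTF-8');   // fold the query once, in PHP

// php:functionString() passes the node's string value to the PHP function.
$nodes = $xpath->query(
    "//StopPoint[contains(php:functionString('mb_strtolower', StopName), '$needle')]"
);

foreach ($nodes as $node) {
    echo $node->textContent, "\n";
}
```

Registering only the functions that are needed, rather than calling registerPhpFunctions() without arguments, keeps the set of PHP functions reachable from an XPath expression small.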

These functions use the intl extension, which must be enabled (in cPanel, or however extensions are enabled on your platform).

To use PHP functions in an XSLT file, specify at the top of the file:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:php="http://php.net/xsl"
   exclude-result-prefixes="php">

Note that if you are going to use the XSLT file only with PHP, you can shorten the xsl and php namespace prefixes to one character like:

<?xml version="1.0" encoding="utf-8"?>
<x:stylesheet version="1.0"
   xmlns:x="http://www.w3.org/1999/XSL/Transform"
   xmlns:p="http://php.net/xsl"
   exclude-result-prefixes="p">

This can reduce the size of a large XSLT file and save a lot of typing, since elements and functions are then prefixed with just x: and p: respectively.

While PHP is not likely to upgrade its XML, XSLT or XPath processors beyond version 1.0, being able to call user-defined or built-in PHP functions from XPath provides a lot more flexibility, such as actual mutable variables, or an inline ternary "if" function like php:function('iif',(@i),string(@i),string(@c)) for use in xsl:for-each or xsl:sort statements (though, as with VB6's IIf, both value arguments are always evaluated).

See https://smallsite-design.com/art/a-php-functions-in-xslt/ for some guidance in how to reliably pass XML elements and attributes to PHP functions in XPath.

Benne answered 15/7 at 13:36 Comment(0)
