How to use PHP to annotate a string with HTML (i.e. how to insert HTML tags into a string by offsets while maintaining valid HTML)?

I'm trying to add HTML tags between words inside a string (i.e. wrap words in HTML tags, as HTML annotations). The positions where the HTML tags should be inserted are given by an array of offsets, for example:

//array(Start offset, End offset) in characters
//Note that an annotation starts at the Start offset and ends before the End offset
$annotationCharactersPositions= array(
   0=>array(0,3),
   1=>array(2,6),
   2=>array(8,10)
);

I want to annotate the following HTML text ($source) with the following HTML tag ($tag), i.e. wrap the characters delimited by the $annotationCharactersPositions array (without counting the HTML tags of the source).

$source="<div>This is</div> only a test for stackoverflow";
$tag="<span class='annotation n-$cont'>";

the result should be the following (https://jsfiddle.net/cotg2pn1/):

charPos   =--------------------------------- 01---------------------------- 2-------------------------------------------3------------------------------------------45-------67-----------------------------89-------10,11,12,13......
$output = "<div><span class='annotation n-1'>Th<span class='annotation n-2'>i</span></span><span class='annotation n-2'>s</span><span class='annotation n-2'> i</span>s</div> <span class='annotation n-3'>on</span>ly a test for stackoverflow"

How can I implement the addHTMLtoString function used in the following loop?

    $cont=0;
    $myAnnotationClass="placesOfTheWorld";
    foreach ($annotationCharactersPositions as $position) {
         $tag="<span class='annotation $myAnnotationClass'>";
         $source=addHTMLtoString($source,$tag,$position);
         $cont++;
    }

The HTML tags of the input string must not be counted when applying the character offsets described in the $annotationCharactersPositions array, and each annotation (i.e. $tag) inserted into the $source text must be taken into account when wrapping the subsequent annotations.

The idea of this whole process is that, given an input text (that may or may not contain HTML tags), a group of characters (belonging to one or several words) is annotated: the selected characters (defined by an array that states where each annotation begins and ends) are wrapped in an HTML tag that can vary (a, span, mark) with a variable number of HTML attributes (name, class, id, data-*). In addition, the result must be well-formed, valid HTML, so if an annotation overlaps other annotations, the tags must be written to the output accordingly.

Do you know any library or solution to do this? Maybe the PHP DOMDocument functionality can be useful, but how can the offsets be applied to the DOMDocument functions? Any idea or help is welcome.

Note 1: The input text is raw UTF-8 text with any number (0-n) of embedded HTML entities.

Note 2: The input tag could be any HTML tag with a variable number of attributes (0-n).

Note 3: The initial position must be inclusive and the final position must be exclusive, i.e. the annotation with index 1, (2,6), starts before character 2 (including character 2, 'i') and ends before character 6 (excluding character 6, 's').
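
To make the offset convention in the notes concrete, here is a minimal sketch of which characters each annotation refers to. The helper annotatedText is hypothetical (it is not part of the question), and the sketch assumes the mbstring extension is available and that the input is well-formed enough for strip_tags to remove only real tags.

```php
<?php
// Hypothetical helper: returns the characters an annotation [start, end)
// refers to, counted on the text without HTML tags.
function annotatedText(string $html, int $start, int $end): string
{
    // Drop the tags, then decode entities so "&amp;" counts as one character.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Start is inclusive and end is exclusive, so the length is end - start.
    return mb_substr($text, $start, $end - $start, 'UTF-8');
}

$source = "<div>This is</div> only a test for stackoverflow";

$first  = annotatedText($source, 0, 3);   // "Thi"
$second = annotatedText($source, 2, 6);   // "is i"
$third  = annotatedText($source, 8, 10);  // "on"
```

This only reads the annotated characters; it does not insert any tags. Note that mb_substr counts code points, which can differ from user-perceived characters for combining sequences.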

Encephaloma answered 27/5, 2019 at 15:10 Comment(15)
Yes, you'll need to use DomDocument; build it as Dom nodes and forget about using string concatenation if you want any kind of sanity left at the end of the process. But honestly, I'm struggling to work out what you're actually trying to achieve here?Nectareous
What do different values in the arrays mean?Spire
Looks like "starts at char X", and "finishes after char Y". So the first starts as character 1, and finishes after character 3Repute
@spudley Using DomDocument can be an option, but how to add the tags in the indicated positions? I am trying to show an HTML document annotated on the fly by HTML elements.Encephaloma
@AleksG The different values in the array are the start and end offsets of each annotation.Encephaloma
What are the units? Is this characters, words, tags, etc?Spire
@THM Thanks for the observation! I have fixed this issue.Encephaloma
@AleksG Thanks for the question. The units are characters, as in the examples.Encephaloma
Dear @mickmackusa Many thanks for pointing this out. I have fixed the issues mentioned and tried to improve the question in this respect.Encephaloma
@mickmackusa Thanks for the observation. Consider the first annotation. If you want to make an annotation starting at 0 and ending after character 2, the annotation should be (0,3). The initial position must be inclusive and the final position must be exclusive. Otherwise, the annotation of the first character would be (0,0), which has no length. The methodology is similar to selecting a fragment of text: if you want to mark the first character in a string, the cursor must start at position 0 and end before the second character. That is (0,1).Encephaloma
@mickmackusa The second span must be inside the first span for the output to be well-formed HTML and to correspond to the given array of annotations. Otherwise, the following would happen: <div><span class='annotation n-1'>Th<span class='annotation n-2'>i</span>s</span>. As you can see, this HTML does not correspond to the offsets of the given annotations. With this methodology, annotation 2 only includes the character 'i'.Encephaloma
Can you double-check the example result you give ("the result should be the following")? It appears that the 1th span (n-2) begins before the 2nd character, but the example $annotationCharactersPositions has 1=>array(3,6). Also consider explaining the motivation for this whole process a little more clearly; it seems likely that someone will suggest a completely different approach that may work better in the long run.Insurrection
@Insurrection Many thanks for pointing this out! I have fixed this issue and added a little more information to try to improve the question. Thanks!Encephaloma
@mickmackusa Many thanks for your questions! I have modified the thread to answer them. The unique identifier for each class was only an example to indicate that the annotation classes can vary. Thank you very much for the feedback.Encephaloma
@mickmackusa Thanks for pointing this out! I have fixed the issue.Encephaloma

After loading the HTML into a DOM document, you can fetch every text node descendant of an element node with an XPath expression (.//text()) as an iterable list. This lets you keep track of the number of characters before the current text node. For each text node, check whether its text content (or a part of it) has to be wrapped in the annotation tag. If so, split it and create a fragment with up to 3 nodes (text before, annotation, text after), then replace the text node with the fragment.

function annotate(
  \DOMElement $container, int $start, int $end, string $name
) {
  $document = $container->ownerDocument;
  $xpath = new DOMXpath($document);
  $currentOffset = 0;
  // fetch and iterate all text node descendants 
  $textNodes = $xpath->evaluate('.//text()', $container);
  foreach ($textNodes as $textNode) {
    $text = $textNode->textContent;
    $nodeLength = grapheme_strlen($text);
    $nextOffset = $currentOffset + $nodeLength;
    if ($currentOffset >= $end) {
      // after annotation (end offset is exclusive): break
      break;
    }
    if ($start >= $nextOffset) {
      // before annotation: continue
      $currentOffset = $nextOffset;
      continue;
    }
    // make string offsets relative to node start
    $relativeStart = $start - $currentOffset;
    $relativeLength = $end - $start;
    if ($relativeStart < 0) {
      // annotation started in an earlier node: shorten it by the
      // characters already annotated there ($relativeStart is negative)
      $relativeLength += $relativeStart;
      $relativeStart = 0;
    }
    $relativeEnd = $relativeStart + $relativeLength;
    // create a fragment for the annotation nodes
    $fragment = $document->createDocumentFragment();
    if ($relativeStart > 0) {
      // append string before annotation as text node
      $fragment->appendChild(
        $document->createTextNode(grapheme_substr($text, 0, $relativeStart))
      );
    }
    // create annotation node, configure and append
    $span = $document->createElement('span');
    $span->setAttribute('class', 'annotation '.$name);
    $span->textContent = grapheme_substr($text, $relativeStart, $relativeLength);
    $fragment->appendChild($span);
    if ($relativeEnd < $nodeLength) {
      // append string after annotation as text node
      $fragment->appendChild(
        $document->createTextNode(grapheme_substr($text, $relativeEnd))
      );
    }
    // replace current text node with new fragment
    $textNode->parentNode->replaceChild($fragment, $textNode);
    $currentOffset = $nextOffset;
  }
}

$html = <<<'HTML'
<div><div>This is</div> only a test for stackoverflow</div>
HTML;

$annotations = [
  0 => [0, 3],
  1 => [2, 6],
  2 => [8, 10]
];

$document = new DOMDocument();
$document->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach ($annotations as $index => $offsets) {
  annotate($document->documentElement, $offsets[0], $offsets[1], 'n-'.$index);
}

echo $document->saveHTML();

Output:

<div><div><span class="annotation n-0">Th<span class="annotation n-1">i</span></span><span class="annotation n-1">s i</span>s</div> <span class="annotation n-2">on</span>ly a test for stackoverflow</div>
Francyne answered 10/6, 2019 at 12:21 Comment(8)
Bravo +1. This is superior to the hacky nonsense I was lumping together. Definitely not posting my trash anymore. Definitely bounty-worthy. Nice work.Gratuitous
Bravo +2. Wonderful response!!! In principle everything works correctly! If I find a bug in the future I will point it out. Just one note for future developers: this solution requires the php-intl package to be installed. Many thanks @ThwEncephaloma
@Francyne What is the meaning of the <<<'HTML' in your response? How it works?Encephaloma
That is a string syntax called NOWDOC (php.net/manual/de/…). I like to use it for sample data because it needs less escaping.Francyne
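
For readers unfamiliar with the syntax, a minimal sketch of the difference between HEREDOC and NOWDOC. The variable $name and the markup are made up for illustration.

```php
<?php
$name = 'world';

// HEREDOC: behaves like a double-quoted string, variables are interpolated.
$heredoc = <<<HTML
<p class="greeting">Hello $name</p>
HTML;

// NOWDOC: behaves like a single-quoted string, nothing is interpolated.
$nowdoc = <<<'HTML'
<p class="greeting">Hello $name</p>
HTML;
```

With NOWDOC, $name stays as literal text, which is why it suits sample HTML that should be passed through untouched.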
@Francyne I found a bug in the code when the input string contains elements such as <notHtmlString>. For example, for the string "different P<3> structures" or "adenine dinucleotide (NAD<+>)" PHP throws "htmlParseStartTag: invalid element name in Entity". How could this be solved, given that the PHP htmlspecialchars function cannot be used in this context (since using it would break the annotation offsets)?Encephaloma
Not a bug. <3> is invalid HTML, so the parser throws a warning and repairs the HTML. You can use libxml's internal error handling to capture the errors: 3v4l.org/1nbFm . A more complex option, and sometimes the only way, is to repair the HTML using string functions (and PCRE) before loading it as HTML.Francyne
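
A minimal sketch of capturing the parser errors with libxml's internal error handling, assuming the sample string from the comment above wrapped in a div. Which errors are reported, if any, depends on the installed libxml2 version.

```php
<?php
$html = '<div>different P<3> structures</div>';

// Collect parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Each entry is a LibXMLError object with level, code, message and line.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error handling mode.
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), PHP_EOL;
}
```

The document is still loaded with whatever repairs the parser applied, so the text offsets may differ from those of the raw input.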
@Francyne Yes, I already knew that it is not valid HTML. However, the solution provided cannot handle characters that are special in HTML, like "<", "&", ">": if these characters are encoded (as HTML entities) before calling your code, the input annotation offsets no longer match, because characters are added to the input string when transforming "&" to "&amp;".Encephaloma
DOMNode::$textContent contains the content with decoded entities: 3v4l.org/jGo8c and DOMDocument::saveHTML() will encode them as needed.Francyne
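
A minimal sketch of that behaviour, assuming the special characters in the input are already encoded as entities and that the mbstring extension is available. The sample string is made up from the examples in the comments.

```php
<?php
$document = new DOMDocument();
$document->loadHTML(
    '<div>NAD&lt;+&gt; &amp; P&lt;3&gt;</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

$div = $document->documentElement;

// textContent holds the decoded characters, so "&amp;" counts as one "&".
$text = $div->textContent;
$length = mb_strlen($text, 'UTF-8');

// Serializing the node encodes the special characters again.
$html = $document->saveHTML($div);
```

Because the offsets are applied to the decoded text nodes, an entity such as "&lt;" counts as a single character when computing annotation positions.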