How to use PHP to annotate a string with HTML (i.e. how to insert HTML tags into a string by offsets while maintaining valid HTML)?

Asked 27/5, 2019 at 15:10 Answered 10/6, 2019 at 12:21

Solved php html string dom annotate

I'm trying to add HTML tags around words inside a string (i.e. wrap words in HTML tags as annotations). The positions where the HTML tags should be written are given by an array of offsets, for example:

//array(Start offset, End offset) in characters
//Note that annotation starts in the Start offset number and ends before the End offset number
$annotationCharactersPositions= array(
   0=>array(0,3),
   1=>array(2,6),
   2=>array(8,10)
);

So, to annotate the following HTML text ($source) with the following HTML tag ($tag), the characters delimited by the $annotationCharactersPositions array must be wrapped, without counting the HTML tags already present in the source.

$source="<div>This is</div> only a test for stackoverflow";
$tag="<span class='annotation n-$cont'>";

the result should be the following (https://jsfiddle.net/cotg2pn1/):

charPos   =--------------------------------- 01---------------------------- 2-------------------------------------------3------------------------------------------45-------67-----------------------------89-------10,11,12,13......
$output = "<div><span class='annotation n-1'>Th<span class='annotation n-2'>i</span></span><span class='annotation n-2'>s</span><span class='annotation n-2'> i</span>s</div> <span class='annotation n-3'>on</span>ly a test for stackoverflow"

How can I program the next function:

    $cont=0;
    $myAnnotationClass="placesOfTheWorld";
    foreach ($annotationCharactersPositions as $position) {
         $tag="<span class='annotation $myAnnotationClass'>";             
         $source=addHTMLtoString($source,$tag,$position);
         $cont++;
    }

Two constraints apply. First, the HTML tags of the input string must not be counted when locating the characters described in the $annotationCharactersPositions array. Second, each annotation (i.e. $tag) inserted into $source must be taken into account when wrapping the annotations that follow it.

The idea of the whole process is that, given an input text (which may or may not contain HTML tags), groups of characters (belonging to one or several words) are annotated. The selected characters, defined by an array that says where each annotation begins and ends, end up wrapped in an HTML tag that can vary (a, span, mark) and can carry a variable number of HTML attributes (name, class, id, data-*). In addition, the result must be a well-formed, valid HTML document, so if an annotation overlaps other annotations, the HTML must be written to the output accordingly.

Do you know any library or solution to do this? Maybe the PHP DOMDocument functionality can be useful, but how do I apply the offsets to the DOMDocument functions? Any idea or help is welcome.

Note 1: The input text is raw UTF-8 text with any number (0-n) of HTML entities embedded.

Note 2: The input tag could be any HTML tag with a variable number of attributes (0-n).

Note 3: The start position is inclusive and the end position is exclusive, e.g. annotation 1, array(2,6), starts at offset 2 (including the character 'i' at offset 2) and ends before offset 6 (excluding the character 's' at offset 6).
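
As a quick check of the offset semantics in Notes 1-3, the following minimal sketch prints the characters each half-open (start, end) pair selects once the tags are ignored. The helper name selectedCharacters is hypothetical, and decoding entities so that each one counts as a single character is an assumption, not something stated above. It requires the mbstring extension.

```php
<?php
// Hypothetical helper (not part of the question): returns the characters that a
// half-open [start, end) offset pair selects when HTML tags are not counted.
function selectedCharacters(string $html, int $start, int $end): string
{
    // Assumption: tags are ignored and every entity counts as one character.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // The end offset is exclusive, so the length is end - start.
    return mb_substr($text, $start, $end - $start, 'UTF-8');
}

$source = "<div>This is</div> only a test for stackoverflow";
foreach ([[0, 3], [2, 6], [8, 10]] as [$start, $end]) {
    printf("(%d,%d) => '%s'\n", $start, $end, selectedCharacters($source, $start, $end));
}
// Prints:
// (0,3) => 'Thi'
// (2,6) => 'is i'
// (8,10) => 'on'
```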

Encephaloma answered 27/5, 2019 at 15:10 Comment(15)

Yes, you'll need to use DomDocument; build it as Dom nodes and forget about using string concatenation if you want any kind of sanity left at the end of the process. But honestly, I'm struggling to work out what you're actually trying to achieve here? – Nectareous 5/6, 2019 at 14:58

What do different values in the arrays mean? – Spire 5/6, 2019 at 15:2

Looks like "starts at char X", and "finishes after char Y". So the first starts as character 1, and finishes after character 3 – Repute 5/6, 2019 at 15:5

@spudley Using DomDocument can be an option, but how to add the tags in the indicated positions? I am trying to show an HTML document annotated on the fly by HTML elements. – Encephaloma 5/6, 2019 at 15:6

@AleksG different values on the array means the start and end annotation offsets – Encephaloma 5/6, 2019 at 15:7

What are the units? Is this characters, words, tags, etc? – Spire 5/6, 2019 at 15:30

@THM thanks for the observation! I fixed this issue. – Encephaloma 5/6, 2019 at 15:39

@AleksG Thanks for the question. The units are characters, as the examples – Encephaloma 5/6, 2019 at 15:40

Dear @mickmackusa Many thanks for pointing this out. I fixed the issues mentioned and have tried to improve the question in this respect – Encephaloma 6/6, 2019 at 9:12

@mickmackusa Thanks for the observation. Consider the first annotation. If you want to make an annotation starting at 0 and ending after character 2, the annotation should be (0,3). The start position is inclusive and the end position is exclusive. Otherwise, the annotation of the first character alone would be (0,0), which spans nothing. The methodology is similar to selecting a fragment of text: if you want to mark the first character in a string, the cursor must start at position 0 and end before the second character. That is (0,1) – Encephaloma 6/6, 2019 at 9:44

@mickmackusa The second span must be nested within the first span so that the output is well-formed HTML and corresponds to the given array of annotations. Otherwise the following would happen: <div> <span class = 'annotation n-1'> Th <span class = 'annotation n-2'> i </ span> s </ span>. As you can see, this HTML does not correspond to the offsets of the given annotations. With this approach, annotation 2 only includes the character 'i' – Encephaloma 6/6, 2019 at 9:52

Can you double-check the example result you give ("the result should be the following")? It appears that the 1st span (n-2) begins before the 2nd character, but the example $annotationCharactersPositions has 1=>array(3,6). Also consider explaining the motivation for this whole process a little more clearly; it seems likely that someone will suggest a completely different approach that may work better in the long run. – Insurrection 7/6, 2019 at 16:22

@Insurrection many thanks for pointing this out! I fixed this issue. I've added a little more information to try to improve the question. Thanks! – Encephaloma 10/6, 2019 at 9:3

@mickmackusa many thanks for your questions! I have modified the thread to answer them. The unique identifier for each class was only an example to indicate that the annotation classes can vary. Thank you very much for the observation – Encephaloma 10/6, 2019 at 9:19

@mickmackusa Thanks for pointing this out! I fixed the issue. – Encephaloma 11/6, 2019 at 8:40

After loading the HTML into a DOM document, you can fetch every text node descendant of an element node as an iterable list with an XPath expression (.//text()). Iterating that list lets you keep track of the number of characters before the current text node. For each text node, check whether its text content (or a part of it) has to be wrapped in the annotation tag. If so, split it and create a fragment with up to 3 nodes (text before, annotation, text after), then replace the text node with the fragment.

function annotate(
  \DOMElement $container, int $start, int $end, string $name
) {
  $document = $container->ownerDocument;
  $xpath = new DOMXpath($document);
  $currentOffset = 0;
  // fetch and iterate all text node descendants 
  $textNodes = $xpath->evaluate('.//text()', $container);
  foreach ($textNodes as $textNode) {
    $text = $textNode->textContent;
    $nodeLength = grapheme_strlen($text);
    $nextOffset = $currentOffset + $nodeLength;
    if ($currentOffset >= $end) {
      // after annotation: break
      break;
    }
    if ($start >= $nextOffset) {
      // before annotation: continue
      $currentOffset = $nextOffset;
      continue;
    }
    // make string offsets relative to node start and clamp them to the node,
    // so an annotation that began in an earlier node only covers what is left
    $relativeStart = max($start - $currentOffset, 0);
    $relativeEnd = min($end - $currentOffset, $nodeLength);
    $relativeLength = $relativeEnd - $relativeStart;
    // create a fragment for the annotation nodes
    $fragment = $document->createDocumentFragment();
    if ($relativeStart > 0) {
      // append string before annotation as text node
      $fragment->appendChild(
        $document->createTextNode(grapheme_substr($text, 0, $relativeStart))
      );
    }
    // create annotation node, configure and append
    $span = $document->createElement('span');
    $span->setAttribute('class', 'annotation '.$name);
    $span->textContent = grapheme_substr($text, $relativeStart, $relativeLength);
    $fragment->appendChild($span);
    if ($relativeEnd < $nodeLength) {
      // append string after annotation as text node
      $fragment->appendChild(
        $document->createTextNode(grapheme_substr($text, $relativeEnd))
      );
    }
    // replace current text node with new fragment
    $textNode->parentNode->replaceChild($fragment, $textNode);
    $currentOffset = $nextOffset;
  }
}

$html = <<<'HTML'
<div><div>This is</div> only a test for stackoverflow</div>
HTML;

$annotations = [
  0 => [0, 3],
  1 => [2, 6],
  2 => [8, 10]
];

$document = new DOMDocument();
$document->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach ($annotations as $index => $offsets) {
  annotate($document->documentElement, $offsets[0], $offsets[1], 'n-'.$index);
}

echo $document->saveHTML();

Output:

<div><div><span class="annotation n-0">Th<span class="annotation n-1">i</span></span><span class="annotation n-1">s i</span>s</div> <span class="annotation n-2">on</span>ly a test for stackoverflow</div>

Francyne answered 10/6, 2019 at 12:21 Comment(8)

Bravo +1. This is superior to the hacky nonsense I was lumping together. Definitely not posting my trash anymore. Definitely bounty-worthy. Nice work. – Gratuitous 10/6, 2019 at 12:39

Bravo +2. Wonderful response!!! In principle everything works correctly! If I find a bug in the future I will report it. Just one note for future developers: this solution requires the php-intl package to be installed. Many thanks @Thw – Encephaloma 10/6, 2019 at 15:27

@Francyne What is the meaning of the <<<'HTML' in your response? How does it work? – Encephaloma 11/6, 2019 at 8:43

That is a string syntax called NOWDOC (php.net/manual/de/…). I like to use it for sample data because it needs less escaping. – Francyne 11/6, 2019 at 10:43
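
To illustrate the NOWDOC syntax mentioned above, here is a minimal sketch comparing it with HEREDOC; the variable names and the sample markup are made up for the example. A quoted identifier (<<<'HTML') disables variable interpolation, which is why it needs less escaping for sample data.

```php
<?php
$name = 'world';

// HEREDOC (unquoted identifier): variables are interpolated.
$heredoc = <<<HTML
<p class="greeting">Hello $name</p>
HTML;

// NOWDOC (single-quoted identifier): the text is taken literally.
$nowdoc = <<<'HTML'
<p class="greeting">Hello $name</p>
HTML;

echo $heredoc, "\n"; // <p class="greeting">Hello world</p>
echo $nowdoc, "\n";  // <p class="greeting">Hello $name</p>
```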

@Francyne I found a bug in the code when the input string contains elements like <notHtmlString>. For example, for the string "different P<3> structures" or "adenine dinucleotide (NAD<+>)" PHP throws "htmlParseStartTag: invalid element name in Entity". How could this be solved, given that the PHP htmlspecialchars function cannot be used in this context (since using it would break the annotation offsets)? – Encephaloma 11/6, 2019 at 15:32

Not a bug. <3> is invalid HTML, so the parser throws a warning and repairs the HTML. You can use libxml's internal error handling to capture the errors. 3v4l.org/1nbFm . A more complex approach, and sometimes the only one, is to repair the HTML using string functions (and PCRE) before loading it as HTML. – Francyne 11/6, 2019 at 15:48

@Francyne Yes, I already knew that it is not valid HTML. However, the solution provided cannot handle any character that is special in HTML, like "<", "&", ">": if these characters are encoded (as HTML entities) before calling your code, the input annotation offsets array would no longer match, since characters are added to the input string when transforming "&" into "&amp;" – Encephaloma 11/6, 2019 at 15:59

The DOMNode::$textContent contains content with decoded entities: 3v4l.org/jGo8c and DOMDocument::saveHTML() will encode them as needed. – Francyne 11/6, 2019 at 23:7
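
A minimal sketch of the libxml internal error handling mentioned in the comment above (this is not the code behind the 3v4l link, which is not reproduced here). Whether any errors are reported, and how the parser repairs the markup, depends on the installed libxml2 version, so no output is shown.

```php
<?php
$html = '<p>different P<3> structures</p>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Fetch the collected errors, then clear the buffer and restore the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError object with message, line and column properties.
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}

// The (possibly repaired) document can still be serialized.
echo $document->saveHTML();
```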
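
A small sketch of the point made in the comment above, using a made-up input in which the special characters are already encoded as entities: textContent returns the decoded characters (so offsets can be counted on the raw text), and saveHTML() encodes them again on output.

```php
<?php
// Input with the special characters already encoded as entities.
$html = '<p>different P&lt;3&gt; structures &amp; more</p>';

$document = new DOMDocument();
$document->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// textContent holds the decoded text: each entity counts as one character.
$text = $document->documentElement->textContent;
echo $text, "\n"; // different P<3> structures & more

// saveHTML() re-encodes the special characters when serializing.
$serialized = $document->saveHTML();
echo $serialized; // <p>different P&lt;3&gt; structures &amp; more</p>
```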
