Truncate text containing HTML, ignoring tags

13

43

I want to truncate some text (loaded from a database or text file), but it contains HTML, so the tags count toward the length and less visible text is returned. Truncation can also leave tags unclosed or cut in half (so Tidy may not repair them properly, and there is still less content). How can I truncate based on the visible text only (probably stopping at a table, since that could cause more complex issues)?

substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."

Would result in:

Hello, my <strong>name</st...

What I would want is:

Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...

How can I do this?

While my question is for how to do it in PHP, it would be good to know how to do it in C#... either should be OK as I think I would be able to port the method over (unless it is a built in method).

Also note that I have included an HTML entity &acute; - which would have to be considered as a single character (rather than 7 characters as in this example).

strip_tags is a fallback, but I would lose formatting and links and it would still have the problem with HTML entities.

Protium answered 28/7, 2009 at 11:30 Comment(0)
52

Assuming you are using valid XHTML, it's simple to parse the HTML and make sure tags are handled properly. You simply need to track which tags have been opened so far, and make sure to close them again "on your way out".

<?php
header('Content-type: text/plain; charset=utf-8');

function printTruncated($maxLength, $html, $isUtf8=true)
{
    $printedLength = 0;
    $position = 0;
    $tags = array();

    // For UTF-8, we need to count multibyte sequences as one character.
    $re = $isUtf8
        ? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
        : '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}';

    while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength)
        {
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        print($str);
        $printedLength += strlen($str);
        if ($printedLength >= $maxLength) break;

        if ($tag[0] == '&' || ord($tag) >= 0x80)
        {
            // Pass the entity or UTF-8 multibyte sequence through unchanged.
            print($tag);
            $printedLength++;
        }
        else
        {
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // This is a closing tag.

                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.

                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/')
            {
                // Self-closing tag.
                print($tag);
            }
            else
            {
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // Close any open tags.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
}


printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!'); print("\n");

printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");

printTruncated(10, "<em><b>Hello</b>&#20;w\xC3\xB8rld!</em>"); print("\n");

Encoding note: The above code assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported; just pass false as the third argument. Other multibyte encodings are not supported, though you may hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement.
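
The conversion round trip described in the encoding note could be sketched roughly as follows. This is only an illustration under assumptions: truncateInEncoding() is a hypothetical helper, it expects a truncation callable that prints its result (as printTruncated does), and it captures that output once with output buffering instead of converting inside every print statement. It needs the mbstring extension.

```php
<?php
// Hypothetical helper (not part of the code above): run a UTF-8-only
// truncation function on markup stored in another multibyte encoding.
function truncateInEncoding(callable $truncateUtf8, $maxLength, $html, $encoding)
{
    // 1. Convert the input to UTF-8 so the truncation function can count characters.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $encoding);

    // 2. Capture whatever the truncation function prints.
    ob_start();
    $truncateUtf8($maxLength, $utf8);
    $truncated = ob_get_clean();

    // 3. Convert the truncated markup back to the original encoding.
    return mb_convert_encoding($truncated, $encoding, 'UTF-8');
}
```

With the function above it might be called as truncateInEncoding('printTruncated', 10, $html, 'SJIS'), assuming $html holds Shift_JIS markup.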

(You should always be using UTF-8, though.)

Edit: Updated to handle character entities and UTF-8. Fixed bug where the function would print one character too many, if that character was a character entity.

Oberg answered 28/7, 2009 at 11:50 Comment(8)
That looks like it might work... although what about HTML entities? – Protium
This does not work with international characters because PHP preg_match counts by byte instead of character, for the offset. To see the gist of the solution for that: #9951342 – Supernumerary
@DaveStein Thanks for pointing that out. Considering that I myself always use UTF-8, that bug is a bit embarrassing. It's fixed in the code now (along with another counting bug I just spotted). – Repressive
Would it be true to say this could stumble on <!-- comment opening tags? I just had a bit of a headache trying to work out where some of our page content had gone, and it turned out the above function had truncated after a <!-- ... – Flattish
@Flattish Yes. The code above is designed for controlled content sources and doesn't support all features of XHTML, only <htmltags> and &entity; references. Comments, CDATA sections, processing instructions, XML declarations, DOCTYPE declarations and tag names containing characters outside a-z are all out of scope of this function. It sounds like you're trying to parse general HTML, in which case you should strongly consider preprocessing using a real HTML parser, and possibly take precautions against malicious content. – Repressive
While this function works perfectly for "saving" HTML tags inside a string, it breaks words in half at the same time. This seems quite unusual: most functions I've met so far that do the same as yours preserve both the HTML and whole words, which seems natural. Plus, I was quite surprised to see a direct print inside the function instead of it returning the modified content. – Skittish
@supervacuo: That's a non sequitur. Your link (correctly) asserts that an HTML document cannot be parsed using regular expressions (because HTML is not regular). However, I'm not parsing an HTML document using regular expressions, I'm tokenizing individual tags and entities, which is not only perfectly feasible using REs, but indeed quite common. Notice how the overall document structure is handled not using regular expressions, but with a big while loop and a stack of open $tags? – Repressive
@SørenLøvborg Fair, I assumed using regex in the tokeniser would have the same issue but misunderstood how you're using it. Comment withdrawn. – Leede
5

I've written a function that truncates HTML just as you suggest, but instead of printing the result it keeps it all in a string variable. It handles HTML entities as well.

 /**
     *  function to truncate and then clean up end of the HTML,
     *  truncates by counting characters outside of HTML tags
     *  
     *  @author alex lockwood, alex dot lockwood at websightdesign
     *  
     *  @param string $str the string to truncate
     *  @param int $len the number of characters
     *  @param string $end the end string for truncation
     *  @return string $truncated_html
     *  
     *  **/
        public static function truncateHTML($str, $len, $end = '&hellip;'){
            //find all tags
            $tagPattern = '/(<\/?)([\w]*)(\s*[^>]*)>?|&[\w#]+;/i';  //match html tags and entities
            preg_match_all($tagPattern, $str, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER );
            //WSDDebug::dump($matches); exit; 
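            $i = 0;
            $openTags = array();   // stack of tag names that are currently open
            $warnings = array();   // collected messages about mismatched tags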
            //loop through each found tag that is within the $len, add those characters to the len,
            //also track open and closed tags
            // $matches[$i][0] = the whole tag string  --the only applicable field for html entities
            // IF it's not matching an &htmlentity; the following apply
            // $matches[$i][1] = the start of the tag either '<' or '</'
            // $matches[$i][2] = the tag name
            // $matches[$i][3] = the end of the tag
            //$matches[$i][$j][0] = the string
            //$matches[$i][$j][1] = the str offset

            // check that the match exists before reading its offset, to avoid undefined-index notices
            while(!empty($matches[$i]) && $matches[$i][0][1] < $len){

                $len = $len + strlen($matches[$i][0][0]);
                if(substr($matches[$i][0][0],0,1) == '&' )
                    $len = $len-1;


                //if $matches[$i][2] is undefined then its an html entity, want to ignore those for tag counting
                //ignore empty/singleton tags for tag counting
                if(!empty($matches[$i][2][0]) && !in_array($matches[$i][2][0],array('br','img','hr', 'input', 'param', 'link'))){
                    //double check 
                    if(substr($matches[$i][3][0],-1) !='/' && substr($matches[$i][1][0],-1) !='/')
                        $openTags[] = $matches[$i][2][0];
                    elseif(end($openTags) == $matches[$i][2][0]){
                        array_pop($openTags);
                    }else{
                        $warnings[] = "html has some tags mismatched in it:  $str";
                    }
                }


                $i++;

            }

            $closeTagString = '';

            if (!empty($openTags)){
                $openTags = array_reverse($openTags);
                foreach ($openTags as $t){
                    $closeTagString .="</".$t . ">"; 
                }
            }

            if(strlen($str)>$len){
                // Finds the first space at or after the adjusted length
                $lastWord = strpos($str, ' ', $len);
                if ($lastWord) {
                    //truncate at that space
                    $str = substr($str, 0, $lastWord);
                    //finds last character
                    $last_character = (substr($str, -1, 1));
                    //add the end text
                    $truncated_html = ($last_character == '.' ? $str : ($last_character == ',' ? substr($str, 0, -1) : $str) . $end);
                } else {
                    //no space after the cut point: fall back to a hard cut so the text is not lost
                    $truncated_html = substr($str, 0, $len) . $end;
                }
                //restore any open tags
                $truncated_html .= $closeTagString;


            }else
            $truncated_html = $str;


            return $truncated_html; 
        }
Guzzle answered 6/3, 2012 at 18:30 Comment(4)
This is a really good idea, but I'm getting errors as well as warnings. PHP version 5.5. – Meingoldas
Thanks @Matt! I'll have to take a look; it's been a while since I wrote that bit of code. – Guzzle
Somewhat limited. "<div>data that is too big to fit into the truncated size</div>" returns </div> instead of the text up to the truncated size. Is this a bug or a feature? – Bonacci
@mike echo(truncateHTML('<div>something long here that will get truncated</div>', 10)) // => "<div>something long&hellip;</div>" Unsure what's going on in your case. Note that as is, this is a class method without a class, so to use it in a test setting I removed public static from the function declaration. I haven't used PHP in some time now. – Guzzle
4

100% accurate, but pretty difficult approach:

  1. Iterate characters using DOM
  2. Use DOM methods to remove remaining elements
  3. Serialize the DOM
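
The three DOM steps above might look roughly like the sketch below. It is an illustration under assumptions rather than a finished implementation: truncateHtmlDom() and truncateNode() are hypothetical names, the input is assumed to be a UTF-8 HTML fragment, and the dom and mbstring extensions are required. DOMDocument normalises markup while parsing, so the output may differ in formatting from the input (for instance, entities may come back as literal characters).

```php
<?php
// Hypothetical sketch: truncate by visible text length using the DOM.
function truncateHtmlDom($html, $maxLength)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    // Wrap the fragment in a full document and declare UTF-8 for the parser.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
        . '<body>' . $html . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    $remaining = $maxLength;
    truncateNode($body, $remaining);   // steps 1 and 2

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child); // step 3: serialize what is left
    }
    return $out;
}

// Walk the tree in document order, spending $remaining on text nodes
// and removing every node that starts after the budget is used up.
function truncateNode(DOMNode $node, &$remaining)
{
    // Copy the child list first: removing nodes from a live list skips entries.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);
        } elseif ($child instanceof DOMText) {
            $length = mb_strlen($child->data, 'UTF-8');
            if ($length > $remaining) {
                $child->data = mb_substr($child->data, 0, $remaining, 'UTF-8');
                $remaining = 0;
            } else {
                $remaining -= $length;
            }
        } elseif ($child->hasChildNodes()) {
            truncateNode($child, $remaining);
        }
    }
}

echo truncateHtmlDom('<p>Hello, my <strong>name</strong> is <em>Sam</em>.</p>', 13), "\n";
```

Because the tree is edited rather than the string, every element that survives is closed by the serializer, which is what makes this approach accurate.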

Easy brute-force approach:

  1. Split string into tags (not elements) and text fragments using preg_split('/(<tag>)/') with PREG_SPLIT_DELIM_CAPTURE.
  2. Measure text length you want (it'll be every second element from split, you might use html_entity_decode() to help measure accurately)
  3. Cut the string (trim &[^\s;]+$ at the end to get rid of possibly chopped entity)
  4. Fix it with HTML Tidy
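
A rough sketch of the brute-force steps follows. It is a variation rather than a literal transcription: instead of trimming a chopped entity afterwards, it decodes entities before cutting and re-escapes the cut text. cutHtmlText() and truncateHtmlSplit() are hypothetical names, the tag pattern is deliberately naive (a '>' inside an attribute value will confuse it), UTF-8 input is assumed, and the Tidy step only runs if the tidy extension happens to be installed.

```php
<?php
// Hypothetical sketch: steps 1-3, returning markup that may still have unclosed tags.
function cutHtmlText($html, $maxLength)
{
    // With PREG_SPLIT_DELIM_CAPTURE, even indexes are text and odd indexes are tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE) ?: array($html);
    $out = '';
    $remaining = $maxLength;

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {          // a tag: copy it, it costs nothing
            $out .= $part;
            continue;
        }
        // Decode entities so that e.g. &acute; is measured as one character.
        $decoded = html_entity_decode($part, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        $length = mb_strlen($decoded, 'UTF-8');
        if ($length <= $remaining) {
            $out .= $part;
            $remaining -= $length;
            continue;
        }
        // Cut the decoded text, then escape it again so the result is valid markup.
        $cut = mb_substr($decoded, 0, $remaining, 'UTF-8');
        $out .= htmlspecialchars($cut, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        break;
    }
    return $out;
}

// Step 4: let Tidy close whatever was left open, if the extension is available.
function truncateHtmlSplit($html, $maxLength)
{
    $cut = cutHtmlText($html, $maxLength);
    if (function_exists('tidy_repair_string')) {
        $fixed = tidy_repair_string($cut, array('show-body-only' => true, 'wrap' => 0), 'utf8');
        return $fixed === false ? $cut : $fixed;
    }
    return $cut; // without Tidy, open tags stay unclosed
}
```

Keeping the cut and the repair in separate functions makes it easy to swap Tidy for another repair step, such as a stack of open tags.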
Yacht answered 28/7, 2009 at 12:4 Comment(5)
I upvoted for the accurate approach, but would downvote for the brute-force method. – Tocology
Is the brute force method that bad? First part of it can be made quite accurate (if you're good with regexps), and with Tidy you'll support optional HTML start tags properly (<table><tr><td></tbody></table> is valid HTML4 :), which a naive stack-based solution wouldn't. – Yacht
If only someone could give an example of the accurate approach :( – Fated
Can't PHP do this kind of manipulation natively with its DOM classes without the need of a new class?? In jQuery it would take me half a second to program this! – Deflective
@GuillaumeBois W3C DOM has some support for ranges and iterators that could help, but I'm not aware of a single function specifically for truncation. Similarly I don't think jQuery can do this correctly. You can truncate HTML in a 1-liner, but it could leave unclosed entities or truncate attributes. – Yacht
4

I used a nice function found at http://alanwhipple.com/2011/05/25/php-truncate-string-preserving-html-tags-words, apparently taken from CakePHP

Orenorenburg answered 12/1, 2012 at 19:43 Comment(0)
3

The following is a simple state-machine parser which handles your test case successfully. It fails on nested tags, though, as it doesn't track the tags themselves. It also chokes on entities within HTML tags (e.g. in an href attribute of an <a> tag). So it cannot be considered a 100% solution to this problem, but because it's easy to understand it could be the basis for a more advanced function.

function substr_html($string, $length)
{
    $count = 0;
    /*
     * $state = 0 - normal text
     * $state = 1 - in HTML tag
     * $state = 2 - in HTML entity
     */
    $state = 0;    
    for ($i = 0; $i < strlen($string); $i++) {
        $char = $string[$i];
        if ($char == '<') {
            $state = 1;
        } else if ($char == '&') {
            $state = 2;
            $count++;
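        } else if ($char == ';' && $state === 2) {
            // Only a ';' that ends an entity resets the state; a literal ';' in text is counted below.
            $state = 0;
        } else if ($char == '>') {
            $state = 0;
        } else if ($state === 0) {
            $count++;
        }

        // Never return in the middle of an entity: wait until we are back in normal text.
        if ($count === $length && $state === 0) {
            return substr($string, 0, $i + 1);
        }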
    }
    return $string;
}
Interregnum answered 28/7, 2009 at 12:1 Comment(0)
3

You can use Tidy as well:

function truncate_html($html, $max_length) {   
  return tidy_repair_string(substr($html, 0, $max_length),
     array('wrap' => 0, 'show-body-only' => TRUE), 'utf8'); 
}
Hersh answered 10/9, 2012 at 8:23 Comment(0)
2

You could use DOMDocument in this case, with a nasty regex hack. The worst that would happen is a warning if there's a broken tag:

$dom = new DOMDocument();
$dom->loadHTML(substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26));
$html = preg_replace("/\<\/?(body|html|p)>/", "", $dom->saveHTML());
echo $html;

Should give output: Hello, my <strong>name</strong>.

Tilbury answered 28/7, 2009 at 12:41 Comment(0)
2

I've made light changes to Søren Løvborg's printTruncated function, making it UTF-8 compatible:

   /* Truncate HTML, close opened tags
    *
    * @param int, maxlength of the string
    * @param string, html       
    * @return $html
    */  
    function html_truncate($maxLength, $html){

        mb_internal_encoding("UTF-8");

        $printedLength = 0;
        $position = 0;
        $tags = array();

        ob_start();

        while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position)){

            list($tag, $tagPosition) = $match[0];

            // Print text leading up to the tag.
            $str = mb_strcut($html, $position, $tagPosition - $position);

            if ($printedLength + mb_strlen($str) > $maxLength){
                print(mb_strcut($str, 0, $maxLength - $printedLength));
                $printedLength = $maxLength;
                break;
            }

            print($str);
            $printedLength += mb_strlen($str);

            if ($tag[0] == '&'){
                // Handle the entity.
                print($tag);
                $printedLength++;
            }
            else{
                // Handle the tag.
                $tagName = $match[1][0];
                if ($tag[1] == '/'){
                    // This is a closing tag.

                    $openingTag = array_pop($tags);
                    assert($openingTag == $tagName); // check that tags are properly nested.

                    print($tag);
                }
                else if ($tag[mb_strlen($tag) - 2] == '/'){
                    // Self-closing tag.
                    print($tag);
                }
                else{
                    // Opening tag.
                    print($tag);
                    $tags[] = $tagName;
                }
            }

            // Continue after the tag.
            $position = $tagPosition + mb_strlen($tag);
        }

        // Print any remaining text.
        if ($printedLength < $maxLength && $position < mb_strlen($html))
            print(mb_strcut($html, $position, $maxLength - $printedLength));

        // Close any open tags.
        while (!empty($tags))
             printf('</%s>', array_pop($tags));


        $bufferOuput = ob_get_contents();

        ob_end_clean();         

        $html = $bufferOuput;   

        return $html;   

    }
Trapan answered 22/11, 2011 at 14:53 Comment(0)
2

Bounce added multi-byte character support to Søren Løvborg's solution. I've added:

  • support for unpaired HTML tags (e.g. <hr>, <br>, <col> etc. don't get closed; in HTML a '/' is not required at the end of these, though it is in XHTML),
  • customisable truncation indicator (defaults to &hellip; i.e. … ),
  • return as a string without using output buffer, and
  • unit tests with 100% coverage.

All this at Pastie.

Marilla answered 28/12, 2011 at 11:19 Comment(1)
This one is working fine, but what if I need to cut only after words? – Cirone
2

Here is another light change to Søren Løvborg's printTruncated function, making it UTF-8 compatible (needs mbstring) and making it return a string rather than print one, which I think is more useful. My code doesn't use output buffering like Bounce's variant, just one more variable.

UPD: to make it work properly with utf-8 chars in tag attributes you need mb_preg_match function, listed below.

Great thanks to Søren Løvborg for that function, it's very good.

/* Truncate HTML, close opened tags
*
* @param int, maxlength of the string
* @param string, html       
* @return $html
*/

function htmlTruncate($maxLength, $html)
{
    mb_internal_encoding("UTF-8");
    $printedLength = 0;
    $position = 0;
    $tags = array();
    $out = "";

    while ($printedLength < $maxLength && mb_preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = mb_substr($html, $position, $tagPosition - $position);
        if ($printedLength + mb_strlen($str) > $maxLength)
        {
            $out .= mb_substr($str, 0, $maxLength - $printedLength);
            $printedLength = $maxLength;
            break;
        }

        $out .= $str;
        $printedLength += mb_strlen($str);

        if ($tag[0] == '&')
        {
            // Handle the entity.
            $out .= $tag;
            $printedLength++;
        }
        else
        {
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // This is a closing tag.

                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.

                $out .= $tag;
            }
            else if ($tag[mb_strlen($tag) - 2] == '/')
            {
                // Self-closing tag.
                $out .= $tag;
            }
            else
            {
                // Opening tag.
                $out .= $tag;
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + mb_strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < mb_strlen($html))
        $out .= mb_substr($html, $position, $maxLength - $printedLength);

    // Close any open tags.
    while (!empty($tags))
        $out .= sprintf('</%s>', array_pop($tags));

    return $out;
}

function mb_preg_match(
    $ps_pattern,
    $ps_subject,
    &$pa_matches,
    $pn_flags = 0,
    $pn_offset = 0,
    $ps_encoding = NULL
) {
    // WARNING! - All this function does is to correct offsets, nothing else:
    //(code is independent of PREG_PATTERN_ORDER / PREG_SET_ORDER)

    if (is_null($ps_encoding)) $ps_encoding = mb_internal_encoding();

    $pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
    $ret = preg_match($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);

    if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE))
        foreach($pa_matches as &$ha_match) {
                $ha_match[1] = mb_strlen(substr($ps_subject, 0, $ha_match[1]), $ps_encoding);
        }

    return $ret;
}
Mediatory answered 15/1, 2012 at 9:34 Comment(1)
How do I add "..." to the last text? – Aesculapian
2

The CakePHP framework has an HTML-aware truncate() function in the Text Helper that works for me. See Text. MIT license. Link to source (provided by @Quentin).

Overcharge answered 20/3, 2013 at 18:18 Comment(1)
Here's the source, helped me: github.com/cakephp/cakephp/blob/master/src/Utility/Text.php – Vie
2

Use the function truncateHTML() from: https://github.com/jlgrall/truncateHTML

Example: truncate after 9 characters including the ellipsis:

truncateHTML(9, "<p><b>A</b> red ball.</p>", ['wholeWord' => false]);
// =>           "<p><b>A</b> red ba…</p>"

Features: UTF-8, configurable ellipsis, include/exclude length of ellipsis, self-closing tags, collapsing spaces, invisible elements (<head>, <script>, <noscript>, <style>, <!-- comments -->), HTML &entities;, truncating at last whole word (with option to still truncate very long words), PHP 5.6 and 7.0+, 240+ unit tests, returns a string (doesn't use the output buffer), and well commented code.

I wrote this function, because I really liked Søren Løvborg's function above (especially how he managed encodings), but I needed a bit more functionality and flexibility.

Renn answered 7/2, 2018 at 19:33 Comment(0)
0

This is very difficult to do without using a validator and a parser. Imagine you have:

<div id='x'>
    <div id='y'>
        <h1>Heading</h1>
        500 
        lines 
        of 
        html
        ...
        etc
        ...
    </div>
</div>

How do you plan to truncate that and end up with valid HTML?

After a brief search, I found this link which could help.

Lituus answered 28/7, 2009 at 11:44 Comment(0)
