How do I cut HTML so that the closing tags are preserved?
How can I create a preview of a blog post stored as HTML? In other words, how can I "cut" HTML while making sure the tags still close properly? Currently, I'm rendering the whole thing on the frontend (with React's dangerouslySetInnerHTML) and then setting overflow: hidden and height: 150px. I would much prefer to cut the HTML directly, so that I don't need to send the entire post to the frontend; with 10 blog post previews, that would be a lot of HTML sent that the visitor would never even see.

If I had the HTML (say this was the entire blog post):

<body>
   <h1>Test</h1>
   <p>This is a long string of text that I may want to cut.. blah blah blah foo bar bar foo bar bar</p>
</body>

Trying to slice it (to make a preview) wouldn't work because the tags would become unmatched:

<body>
   <h1>Test</h1>
   <p>This is a long string of text <!-- Oops! unclosed tags -->

Really what I want is this:

<body>
   <h1>Test</h1>
   <p>This is a long string of text</p>
</body>

I'm using Next.js, so any Node.js solution should work fine. Is there a way to do this (e.g. with a library on the Next.js server side)? Or will I just have to parse the HTML myself (server-side) and then fix the unclosed tags?

Cryometer answered 26/12, 2020 at 16:54 Comment(2)
you can also add a preview text in the database field which takes like 100 characters of the body.Pasteurization
The problem is the posts are literally written in HTML (so I can have easier styling, etc), which means there may be a tag in the body.Cryometer
J
2

post-preview


This was a challenging task: I struggled with it for about two days and ended up publishing my first NPM package, post-preview, which can solve your problem. Everything is described in its readme, but here is how to use it for your specific problem:

First, install the package using NPM or download its source code from GitHub.

Then run it on the client before the user submits the blog post, and send the result (the preview) to the backend together with the full post. On the backend, validate the preview's length, sanitize its HTML, and save it to your storage (DB etc.). Later, when you want to show a blog post preview instead of the full post, send the stored preview back to users.
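
The backend length check in that workflow could look roughly like the sketch below. The helper name `isPreviewWithinLimit` and the limit of 200 are assumptions for illustration and are not part of post-preview. The tag stripping is only there to count visible characters; it is not a substitute for a real HTML sanitizer.

```javascript
// Hypothetical helper, not part of post-preview: checks that the visible text
// of a client-supplied preview does not exceed the allowed length.
function isPreviewWithinLimit(previewHtml, maxChars) {
  // Remove anything that looks like a tag; used for counting only, NOT for sanitizing.
  const text = previewHtml.replace(/<[^>]*>/g, '');
  // Collapse whitespace runs so indentation in the markup is not counted.
  const visible = text.replace(/\s+/g, ' ').trim();
  return visible.length <= maxChars;
}

console.log(isPreviewWithinLimit('<p>Hello <b>world</b></p>', 200)); // true
console.log(isPreviewWithinLimit('<p>' + 'x'.repeat(300) + '</p>', 200)); // false
```

Rejecting an over-long preview on the server keeps a modified client from storing an arbitrarily large "preview".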

example:

The following code accepts the .blogPostContainer HTMLElement as input and returns a summarized HTML string version of it with a maximum length of 200 characters.

You can see the preview in previewContainer (the .preview element):

js:

import postPreview from "post-preview";
const postContainer = document.querySelector(".blogPostContainer");
const previewContainer = document.querySelector(".preview");
previewContainer.innerHTML = postPreview(postContainer, 200);

html (complete blog post):

<div class="blogPostContainer">
  <div>
    <h2>Lorem ipsum</h2>
    <p>
      Lorem ipsum, dolor sit amet consectetur adipisicing elit. Neque, fugit hic! Quas similique
      cupiditate illum vitae eligendi harum. Magnam quam ex dolor nihil natus dolore voluptates
      accusantium. Reprehenderit, explicabo blanditiis?
    </p>
  </div>
  <p>
    Lorem ipsum dolor sit amet consectetur adipisicing elit. Ipsam non incidunt, corporis debitis
    ducimus eum iure sed ab. Impedit, doloribus! Quos accusamus eos, incidunt enim amet maiores
    doloribus placeat explicabo.Eaque dolores tempore, quia temporibus placeat, consequuntur hic
    ullam quasi rem eveniet cupiditate est aliquam nisi aut suscipit fugit maiores ad neque sunt
    atque explicabo unde! Explicabo quae quia voluptatem.
  </p>
</div>

<div class="preview"></div>

result (blog post preview):

<div class="preview">
  <div class="blogPostContainer">
    <div>
      <h2>Lorem ipsum</h2>
      <p>
        Lorem ipsum, dolor sit amet consectetur adipisicing elit. Neque, fugit hic! Quas similique
        cupiditate illum vitae eligendi ha
      </p>
    </div>
  </div>
</div>

It's a synchronous task, so if you want to run it against multiple posts at once, you'd better run it in a worker for better performance.

Thank you for making me do some research!

Good luck!

Jamshedpur answered 28/12, 2020 at 14:34 Comment(2)
Hi, sorry for not responding sooner. The problem is this answer is for the frontend (with the document.querySelector), which I can't do (I'm not going to use jsdom or something). I wanted to be able to process the HTML in the backend, so I probably have to write it myself. Cutting the text on the frontend sort of defeats the purpose because the load times would stay the same, if not get slower.Cryometer
Hi, as you already mentioned, this solution is intended to solve the problem on the frontend before sending the posts to the backend. So when a user writes a post for the first time, you can send the preview along with the complete post, then retrieve only the preview from the backend when you want to show multiple post previews, and retrieve the actual post when the user requests that specific post. It's a synchronous task, so you'd better run it in a worker for performance reasons.Jamshedpur
S
0

It is pretty complicated to guess the height of each element before it is rendered. However, you can cut the entry by number of characters with these pseudo-rules:

    1. First, define the maximum number of characters you want to keep.
    2. From the start: when you meet an HTML tag (identify it with a regex for < .. > or < .. />), find its closing tag.
    3. Then continue from where you stopped to search for the next tag.

A quick suggestion in JavaScript that I just wrote (it can probably be improved, but that's the idea):

let str = `<body>
   <h1>Test</h1>
   <p>This is a long string of text that I may want to cut.. blah blah blah foo bar bar foo bar bar</p>
</body>`;

const MAXIMUM = 100; // Maximum characters for the preview
let currentChars = 0; // Will hold how many characters we kept until now

let list = str.split(/(<\/?[A-Za-z0-9]*>)/g); // split by tags

const isATag = (s) => (s[0] === '<'); // Returns true if it is a tag
const tagName = (s) => (s.replace('<', '').replace('>', '').replace('\/', '')) // Get the tag name
const findMatchingTag = (list, i) => {
    let name = tagName(list[i]);
    let searchingregex = new RegExp(`<\/ *${name} *>`,'g'); // The regex for the matching closing tag
    let sametagregex = new RegExp(`< *${name} *>`,'g'); // The regex for an opening tag with the same name (nested same-name tags must be skipped)
    let buffer = 0; // Counts how many same-name tags are open at an inner hierarchy level; their closing tags must be skipped
    for(let j=i+1;j<list.length;j++){
        if(list[j].match(sametagregex)!=null) buffer++;
        if(list[j].match(searchingregex)!=null){
            if(buffer>0) buffer--;
            else{
                return j;
            }
        }
    }
    return -1;
}

let k = 0;
let endCut = false;
let cutArray = new Array(list.length);
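while (currentChars < MAXIMUM && !endCut && k < list.length) { // As long as we are still within the character limit and within the array
    if (cutArray[k] !== undefined) {
        // Closing tag that was already emitted and counted together with its opening tag: skip it so it is not counted twice
    }
    else if (isATag(list[k])) { // Handling tags, finding the matching tag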
        let matchingTagindex = findMatchingTag(list, k);
        if (matchingTagindex != -1) {
            if (list[k].length + list[matchingTagindex].length + currentChars < MAXIMUM) { // Include the tag and its closing tag only if both fit within the limit; otherwise end the cut process
                currentChars += list[k].length + list[matchingTagindex].length;
                cutArray[k] = list[k];
                cutArray[matchingTagindex] = list[matchingTagindex];
            }
            else {
                endCut = true;
            }
        }
        else {
            if (list[k].length + currentChars < MAXIMUM) { // Include the unmatched tag only if it fits within the limit; otherwise end the cut process
                currentChars += list[k].length;
                cutArray[k] = list[k];
            }
            else {
                endCut = true;
            }
        }
    }
    else { // In case it isn't a tag - trim the text
        let cutstr = list[k].substring(0, MAXIMUM - currentChars)
        currentChars += cutstr.length;
        cutArray[k] = cutstr;
    }
    k++;
}

console.log(cutArray.join(''))
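
As a variation on the same idea, the search for matching tags can be replaced by a stack of open tags: copy markup through, count only text, and when the limit is reached, close whatever is still on the stack. The sketch below runs in plain Node.js without a DOM. The name `truncateHtml` is hypothetical, and the code assumes reasonably well-formed HTML: comments or attribute values containing `>`, unescaped `<` in text, and `<script>`/`<style>` content are not handled.

```javascript
// Tags that never have a closing tag, so they must not go on the stack.
const VOID_TAGS = new Set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr']);

// Hypothetical helper: keep at most maxChars characters of text (whitespace
// included) and close whatever tags are still open at the cut point.
function truncateHtml(html, maxChars) {
  // With a capturing group, split() alternates: even indexes are text, odd indexes are markup.
  const tokens = html.split(/(<[^>]+>)/);
  const openTags = []; // stack of names of currently open elements
  let remaining = maxChars;
  let out = '';

  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    if (i % 2 === 1) {
      const m = token.match(/^<(\/?)([A-Za-z][A-Za-z0-9-]*)[^>]*>$/);
      if (m) {
        const name = m[2].toLowerCase();
        if (m[1] === '/') {
          // Closing tag: pop only when it matches the innermost open element.
          if (openTags[openTags.length - 1] === name) openTags.pop();
        } else if (!VOID_TAGS.has(name) && !token.endsWith('/>')) {
          openTags.push(name);
        }
      }
      out += token; // markup is copied through and never counted
    } else if (token.length >= remaining) {
      out += token.slice(0, remaining); // the text budget is used up here
      break;
    } else {
      out += token;
      remaining -= token.length;
    }
  }

  // Close the elements that were still open, innermost first.
  while (openTags.length > 0) out += `</${openTags.pop()}>`;
  return out;
}

console.log(truncateHtml('<div><p>Hello <b>world</b>, bye</p></div>', 8));
// <div><p>Hello <b>wo</b></p></div>
```

Because tags are never counted against the limit, attributes of any length do not shorten the visible preview. For untrusted or messy input, a real HTML parser would be a safer base than the regex split used here.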
Shepperd answered 26/12, 2020 at 18:58 Comment(1)
Yeah, cutting by characters will do fine. I'll accept this answer if there's not a better solution in a reasonable amount of time, I was hoping for a nicer way to do it, but yeah, seems like I'll have to process it myself as you did. Thanks!Cryometer
P
0

I used the solution proposed by SomoKRoceS, and it did help me. But later I discovered a few issues:

  1. If the HTML content that exceeds the limit is wrapped in a single tag, it is omitted entirely.
  2. If a tag contains any attributes, like class="width100" or style="text-align:center", it won't be matched by the provided regExp.
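
The second issue is easy to reproduce with the split pattern from that answer. In this small check (the variable names are only for illustration), the tag with an attribute is not split off as its own item and stays glued to the text that follows it:

```javascript
const TAG_SPLIT = /(<\/?[A-Za-z0-9]*>)/g; // the split pattern from the earlier answer

const plain = '<p>Hello</p>'.split(TAG_SPLIT);
console.log(plain);
// [ '', '<p>', 'Hello', '</p>', '' ]

const withAttribute = '<p class="width100">Hello</p>'.split(TAG_SPLIT);
console.log(withAttribute);
// [ '<p class="width100">Hello', '</p>', '' ]
```

Since the merged item starts with `<`, the `isATag` check in that answer treats the whole chunk, text included, as a tag.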

I made some adjustments to overcome these. This solution cuts the plain text to exactly fit the limit and preserves all the HTML wrappers.

class HtmlTrimmer {
  HTML_TAG_REGEXP = /(<\/?[a-zA-Z]+[\s a-zA-Z0-9="'-;:%]*[^<]*>)/g;
  // <p style="align-items: center; width: 100%;">

  HTML_TAGNAME_REGEXP = /<\/?([a-zA-Z0-9]+)[\sa-zA-Z0-9="'-_:;%]*>/;

  getPlainText(html) {
    return html
      .split(this.HTML_TAG_REGEXP)
      .filter(text => !this.isTag(text))
      .map(text => text.trim())
      .join('');
  }

  isTag(text) {
    return text[0] === '<';
  }

  getTagName(tag) {
    return tag.replace(this.HTML_TAGNAME_REGEXP, '$1');
  }

  findClosingTagIndex(list, openedTagIndex) {
    const name = this.getTagName(list[openedTagIndex]);

    // The regex for closing matching tag
    const closingTagRegex = new RegExp(`</ *${name} *>`, 'g');

    // The regex for an opening tag with the same name (nested same-name tags must be skipped).
    // The \b keeps a name like "p" from also matching longer names such as "pre".
    const sameTagRegex = new RegExp(`< *${name}\\b[\\sa-zA-Z0-9="'-_:;%]*>`, 'g');

    // Will count how many tags with the same name are in an inner hierarchy level, we need to pass those
    let sameTagsInsideCount = 0;
    for (let j = openedTagIndex + 1; j < list.length; j++) {
      if (list[j].match(sameTagRegex) !== null) sameTagsInsideCount++;
      if (list[j].match(closingTagRegex) !== null) {
        if (sameTagsInsideCount > 0) sameTagsInsideCount--;
        else {
          return j;
        }
      }
    }
    return -1;
  }

  trimHtmlContent(html: string, limit: number): string {
    let trimmed = '';
    const innerItems = html.split(this.HTML_TAG_REGEXP);
    for (let i = 0; i < innerItems.length; i++) {

      const item = innerItems[i];
      const trimmedTextLength = this.getPlainText(trimmed).length;
      if (this.isTag(item)) {
        const closingTagIndex = this.findClosingTagIndex(innerItems, i);
        if (closingTagIndex === -1) {
          trimmed = trimmed + item;
        } else {
          const innerHtml = innerItems.slice(i + 1, closingTagIndex).join('');
          trimmed = trimmed
            + item
            + this.trimHtmlContent(innerHtml, limit - trimmedTextLength )
            + innerItems[closingTagIndex];

          i = closingTagIndex;
        }
      } else {
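        if (trimmedTextLength + item.length > limit) {
          // Clamp at zero: once the limit is used up, a negative slice index would keep almost the whole item
          const remaining = Math.max(0, limit - trimmedTextLength);
          trimmed = trimmed + item.slice(0, remaining);
          return remaining > 0 ? trimmed + '...' : trimmed;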
        } else {
          trimmed = trimmed + item;
        }
      }
    }
    return trimmed;
  }
}


const htmlTrimmer = new HtmlTrimmer();
const trimmedHtml = htmlTrimmer.trimHtmlContent(html, 100);
Pogue answered 21/12, 2022 at 16:46 Comment(0)
