Prevent HTML Tidy from moving meta tags (schema markup)
I am facing a serious problem with HTML Tidy (latest version, https://html-tidy.org).

In short, HTML Tidy converts this HTML:

<div class="breadcrumbs" typeof="BreadcrumbList" vocab="http://schema.org/">
<div class="wrap">
    <span property="itemListElement" typeof="ListItem">
        <a property="item" typeof="WebPage" title="Codes Category" href="https://mysite.works/codes/" class="taxonomy category">
            <span property="name">Codes</span>
        </a>
        <meta property="position" content="1">
    </span>
</div>

into this (look closely at where the <meta> tag ends up):

<div class="breadcrumbs" typeof="BreadcrumbList" vocab="http://schema.org/">
<div class="wrap">
    <span property="itemListElement" typeof="ListItem">
        <a property="item" typeof="WebPage" title="Codes Category" href="https://mysite.works/codes/" class="taxonomy category">
            <span property="name">Codes</span>
        </a>
    </span>
    <meta property="position" content="1">
</div>

This breaks schema validation. You can check the markup here: https://search.google.com/structured-data/testing-tool/u/0/

Because of this issue, the client's breadcrumb navigation (URL: https://techswami.in) is not visible in search results.
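
To confirm whether a formatter has moved the tag, one option is to compare the element enclosing each <meta> before and after formatting. The following is a minimal sketch, not part of the original post: `meta_parents` is a hypothetical helper, and it uses a simple regex tag walker that assumes attribute values contain no `>` characters.

```php
<?php
// Hypothetical helper (an assumption, not from the original post):
// returns the name of the element enclosing each <meta> tag,
// or null when a <meta> has no open ancestor.
function meta_parents(string $html): array
{
    // Void elements never get pushed on the stack of open elements.
    $void = ['meta', 'br', 'img', 'input', 'link', 'hr'];
    $stack = [];
    $parents = [];

    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>~', $html, $tags, PREG_SET_ORDER);

    foreach ($tags as $tag) {
        $isClose = $tag[1] === '/';
        $name = strtolower($tag[2]);

        if ($isClose) {
            // Only unwind when a matching open tag exists
            // (this also ignores stray closers such as </meta>).
            if (in_array($name, $stack, true)) {
                while (array_pop($stack) !== $name) {
                    // discard unclosed children
                }
            }
            continue;
        }

        if ($name === 'meta') {
            $parents[] = $stack ? end($stack) : null;
        } elseif (!in_array($name, $void, true)) {
            $stack[] = $name;
        }
    }

    return $parents;
}

// Simplified versions of the two snippets above.
$before = '<div><span><a></a><meta property="position" content="1"></span></div>';
$after  = '<div><span><a></a></span><meta property="position" content="1"></div>';

var_export(meta_parents($before)); // parent is 'span'
var_export(meta_parents($after));  // parent is 'div'
```

If the parent changes from span to div, the position property is no longer attached to the ListItem, which matches the validation failure described above.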

What am I beautifying?

My client wanted me to make the website's source code look "clean, readable and tidy".

So I am using the following code to do that.

Note: this code works perfectly on the following WordPress setup.

  • Nginx with FastCGI Cache/MariaDB
  • PHP7
  • Ubuntu 18.04.1
  • Latest WordPress and is compatible with every cache plugin.

Code:

if( !is_user_logged_in() || !is_admin() ) {
    // Output-buffer callback: runs the whole page through Tidy.
    function callback($buffer) {
        $tidy = new Tidy();
        $options = array(
            'indent' => true,
            'markup' => true,
            'indent-spaces' => 2,
            'tab-size' => 8,
            'wrap' => 180,
            'wrap-sections' => true,
            'output-html' => true,
            'hide-comments' => true,
            'tidy-mark' => false
        );
        $tidy->parseString($buffer, $options);
        $tidy->cleanRepair();
        // Return the repaired markup as a string rather than the Tidy object.
        return tidy_get_output($tidy);
    }
    function buffer_start() { ob_start("callback"); }
    function buffer_end() { if (ob_get_length()) ob_end_flush(); }
    add_action('wp_loaded', 'buffer_start');
    add_action('shutdown', 'buffer_end');
}

What help do I need?

Can you please tell me how to prevent HTML Tidy from moving the <meta> tags? I need the configuration parameters.

Thanks.

Fradin answered 21/8, 2018 at 8:40 Comment(4)
Have you tried another HTML Tidy approach? Looking at tidy-html5 on GitHub, there was an issue very similar to what you describe here that was resolved: github.com/htacg/tidy-html5/issues/333 - Jannet
Tell your client it is not possible: their website is made up of dynamic components that do not talk to each other, so each component does not know how it needs to change its own output format. The best you can do is make sure the PHP code you created is clean and tidy. Then inform your uneducated client that view-source output is not the source code of the website; it's the generated code for the web browser. - Balboa
@MartinBarker I think you should read my question once again. I am saying that I am able to beautify the code; I am just facing a single issue with <meta> tags within <span> tags. Coming to your second point, when you view the source, it is actually the code of the "current" web page or application. I know it's generated for the web browser, and even my client knows it. Thanks for your not-so-useful comment. - Fradin
I did read it, and my overall point is: stop trying to mess with generated source code. The validators may report it as correct, but they are experimental and so not to be trusted. That meta tag is not valid (w3schools.com/tags/tag_meta.asp): property is not valid on a meta tag or in the global attributes list, and meta should not appear outside of the head. So not only is your client asking for the impossible, you're unable to read the standards for what you're using... - Balboa
F
2

First of all, my sincere thanks to everyone who tried to help me.

I have found a solution; its only drawback is that it doesn't fix the HTML Tidy issue itself.

So now, instead of using HTML Tidy, I am using this: https://github.com/ivanweiler/beautify-html/blob/master/beautify-html.php

My new code is:

if( !is_user_logged_in() || !is_admin() ) {
    function callback($buffer) {
        $html = $buffer;
        $beautify = new Beautify_Html(array(
          'indent_inner_html' => false,
          'indent_char' => " ",
          'indent_size' => 2,
          'wrap_line_length' => 32786,
          'unformatted' => ['code', 'pre'],
          'preserve_newlines' => false,
          'max_preserve_newlines' => 32786,
          'indent_scripts'  => 'normal' // keep|separate|normal
        ));

        $buffer = $beautify->beautify($html);
        return $buffer;
    }
    function buffer_start() { ob_start("callback"); }
    function buffer_end() { if (ob_get_length()) ob_end_flush(); }
    add_action('wp_loaded', 'buffer_start');
    add_action('shutdown', 'buffer_end');
}

And now every issue related to schema markup has been fixed and the client's site has beautified source code.

Fradin answered 24/8, 2018 at 13:30 Comment(3)
And you're destroying site performance for no good reason. This also means that your output HTML is not valid, because the plugin produces invalid HTML by having <meta property at all and <meta inside the <body> tag! So all that is doing is spacing out your code, not validating it in any way. - Balboa
@MartinBarker The site was generating valid code; actually, my client is using my custom-built theme. Then she wanted me to use "HTML-Tidy" to beautify the code, and HTML-Tidy messed up the schema markup. So I started looking for alternatives, as I had even asked on the official repo but got no reply. The PHP code I am using this time just beautifies the code and doesn't mess up the schema -- exactly what I need. As for the performance, that's up to the client. She paid me for the work, and I have to deliver it. Hope you understand that. Best regards. - Fradin
No, your validator is incorrectly reporting it as valid. Go read the simple version of the specs at w3schools.com/tags/tag_meta.asp, or the fuller details at developer.mozilla.org/en-US/docs/Web/HTML/Element/meta, and you will find that the attribute property IS NOT VALID (search the page for the word 'property' and you will find nothing). This proves your validators are not working 100% to the specifications. - Balboa
I
4

The <meta> element is normally only permitted inside <head>; that is the permitted parent listed for <meta charset> and <meta http-equiv>. Additionally, property is not one of the attributes that the core HTML specification defines for <meta>; it comes from RDFa.

These are most likely the reasons that HTML-Tidy moves the tag when cleaning the markup.


Indre answered 24/8, 2018 at 8:19 Comment(1)
Hi. First, the above HTML code is not static; it's generated by a plugin called "Breadcrumb NavXT". Second, according to both Google's schema markup test tool and the W3C validator, the code provided above (the first snippet) is 100% valid. - Fradin
H
0

Just for perspective, I tried implementing a minimal, self-contained example based on the HTML and Tidy options from the question.

I ended up with the following code:

<?php
ob_start();
?>

<div class="breadcrumbs" typeof="BreadcrumbList" vocab="http://schema.org/">
    <div class="wrap">
        <span property="itemListElement" typeof="ListItem">
            <a property="item" typeof="WebPage" title="Codes Category" href="https://mysite.works/codes/" class="taxonomy category">
                <span property="name">Codes</span>
            </a>
            <meta property="position" content="1">
        </span>
    </div>
</div>

<?php

$buffer = ob_get_clean();
$tidy = new Tidy();
$options = array(
    'indent' => true,
    'markup' => true,
    'indent-spaces' => 2,
    'tab-size' => 8,
    'wrap' => 180,
    'wrap-sections' => true,
    'output-html' => true,
    'hide-comments' => true,
    'tidy-mark' => false
);
$tidy->parseString("$buffer", $options);
$tidy->cleanRepair();

echo $tidy;
?>

The output is quite informative on how Tidy restructures your HTML. Here it goes:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
  <head>
    <meta property="position" content="1">
    <title></title>
  </head>
  <body>
    <div class="breadcrumbs" typeof="BreadcrumbList" vocab="http://schema.org/">
      <div class="wrap">
        <span property="itemListElement" typeof="ListItem"><a property="item" typeof="WebPage" title="Codes Category" href="https://mysite.works/codes/" class=
        "taxonomy category"><span property="name">Codes</span></a> </span>
      </div>
    </div>
  </body>
</html>

The meta tag has not disappeared. Instead, Tidy has moved it into <head>, where it normally belongs, as other commenters pointed out.

If you want Tidy to reformat the markup without applying its HTML content rules, add the 'input-xml' option and set it to true:

$options = array(
    'indent' => true,
    'markup' => true,
    'indent-spaces' => 2,
    'tab-size' => 8,
    'wrap' => 180,
    'wrap-sections' => true,
    'output-html' => true,
    'hide-comments' => true,
    'tidy-mark' => false,
    'input-xml' => true
);

This will output the following:

<div class="breadcrumbs" typeof="BreadcrumbList" vocab="http://schema.org/">
  <div class="wrap">
    <span property="itemListElement" typeof="ListItem">
      <a property="item" typeof="WebPage" title="Codes Category" href="https://mysite.works/codes/" class="taxonomy category">
        <span property="name">Codes</span>
      </a>
      <meta property="position" content="1"></meta>
    </span>
  </div>
</div>
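
For the WordPress setup in the question, the same option could be wired into the output-buffer callback. This is only a sketch under assumptions: `tidy_buffer_as_xml` is a hypothetical name, the pass-through fallback is my addition, and I have only seen 'input-xml' applied to the small fragment above, so its behaviour on a full page should be checked before relying on it.

```php
<?php
// Hypothetical callback name; not from the original post.
function tidy_buffer_as_xml(string $buffer): string
{
    // Pass the page through untouched if the Tidy extension is missing.
    if (!class_exists('tidy')) {
        return $buffer;
    }

    $options = [
        'indent' => true,
        'markup' => true,
        'indent-spaces' => 2,
        'tab-size' => 8,
        'wrap' => 180,
        'wrap-sections' => true,
        'output-html' => true,
        'hide-comments' => true,
        'tidy-mark' => false,
        'input-xml' => true, // treat input as XML so elements are not relocated
    ];

    $tidy = new tidy();
    if (!$tidy->parseString($buffer, $options)) {
        return $buffer; // parsing failed: keep the original markup
    }
    $tidy->cleanRepair();

    $output = tidy_get_output($tidy);
    return $output !== '' ? $output : $buffer;
}

// Register the hooks only when running inside WordPress.
if (function_exists('add_action')) {
    add_action('wp_loaded', function () {
        ob_start('tidy_buffer_as_xml');
    });
    add_action('shutdown', function () {
        if (ob_get_length()) {
            ob_end_flush();
        }
    });
}
```

Falling back to the unmodified buffer means a formatting failure degrades to "untidy" output rather than a blank page.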
Hematology answered 29/8, 2018 at 22:1 Comment(1)
BTW, it's not HTML Tidy, it's the PHP Tidy implementation. - Hematology
