Parse 'ul' and 'ol' tags

Asked 14/5, 2018 at 8:54 Answered 24/5, 2018 at 7:5

Solved ruby-on-rails ruby algorithm ruby-on-rails-4 nokogiri

I have to handle deeply nested ul, ol, and li tags and reproduce the layout a browser would show. I want to render the following example in a PDF file:

text = "
<body>
    <ol>
        <li>One</li>
        <li>Two

            <ol>
                <li>Inner One</li>
                <li>inner Two

                    <ul>
                        <li>hey

                            <ol>
                                <li>hiiiiiiiii</li>
                                <li>why</li>
                                <li>hiiiiiiiii</li>
                            </ol>
                        </li>
                        <li>aniket </li>
                    </li>
                </ul>
                <li>sup </li>
                <li>there </li>
            </ol>
            <li>hey </li>
            <li>Three</li>
        </li>
    </ol>
    <ol>
        <li>Introduction</li>
        <ol>
            <li>Introduction</li>
        </ol>
        <li>Description</li>
        <li>Observation</li>
        <li>Results</li>
        <li>Summary</li>
    </ol>
    <ul>
        <li>Introduction</li>
        <li>Description

            <ul>
                <li>Observation

                    <ul>
                        <li>Results

                            <ul>
                                <li>Summary</li>
                            </ul>
                        </li>
                    </ul>
                </li>
            </ul>
        </li>
        <li>Overview</li>
    </ul>
</body>"

I have to use Prawn for this task, but Prawn doesn't support HTML tags. So I came up with a solution using Nokogiri: I parse the HTML and later remove the tags with gsub. I wrote the solution below for part of the content above, but the problem is that the ul and ol nesting can vary.

require 'nokogiri'

RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{}" },
    3 => ->(index) { "#{}" },
    4 => ->(index) { "#{}" }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "" },
    3 => ->(_) { "" },
    4 => ->(_) { "" },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ul][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.fragment(text)

doc.search('ol').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul').each do |group|
  ul_rule(group, deepness: 1)
end


puts doc.inner_text

Current output:


1. One
2. Two

1. Inner One
2. inner Two

• hey

1. hiiiiiiiii
2. why
3. hiiiiiiiii


• aniket 


3. sup 
4. there 

3. hey 
4. Three



1. Introduction

1. Introduction

2. Description
3. Observation
4. Results
5. Summary



• Introduction
• Description

• Observation

• Results

• Summary






• Overview

Problem

1) How do I handle spacing (indentation) when working with ul and ol tags?
2) How do I handle deep nesting, when li items appear inside a nested ul or ol?

Hurtful answered 14/5, 2018 at 8:54 Comment(2)

Is this a homework problem on recursion? It sure seems to be one. Not that anything is wrong with that, but it's a weird real-world problem. – Raber 17/5, 2018 at 19:11

It is not a homework problem. It is a problem I am facing at work. – Badderlocks 18/5, 2018 at 3:44

I've come up with a solution that handles multiple indentation levels with configurable numbering rules per level:

require 'nokogiri'
ROMANS = %w[i ii iii iv v vi vii viii ix]

RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{('a'..'z').to_a[index]}. " },
    3 => ->(index) { "#{ROMANS[index]}. " },
    4 => ->(index) { "#{ROMANS[index].upcase}. " }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "\u25E6 " },
    3 => ->(_) { "* " },
    4 => ->(_) { "- " },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ul][deepness].call(i)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.fragment(text)

doc.search('ol:root').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul:root').each do |group|
  ul_rule(group, deepness: 1)
end

You can then remove the tags or use doc.inner_text depending on your environment.

Two caveats though:

1) Your entry selector must be chosen carefully. I used your snippet verbatim, without a root element, so I had to use ul:root/ol:root. Maybe "body > ol" works for your situation too. Another option is to select every ol/ul and then keep only those that have no list parent.
2) Using your example verbatim, Nokogiri does not handle the last two list items of the first ol group ("hey", "Three") very well. When parsing with Nokogiri, these elements have already "left" their ol tree and are placed in the root tree.

Current Output:

  1. One
  2. Two
      a. Inner One
      b. inner Two
        ◦ hey
        ◦ hey
      3. hey
      4. hey
  hey
  Three

  1. Introduction
    a. Introduction
  2. Description
  3. Observation
  4. Results
  5. Summary

  • Introduction
  • Description
      ◦ Observation
          * Results
              - Summary
  • Overview

Primal answered 17/5, 2018 at 8:16 Comment(3)

The whole content will be inside the body. But the first two nested items, Inner One and inner Two, should get numbers rather than letters. Also, will it work with any other structure of ul and ol? Lastly, where do we print the whole data from? – Badderlocks 17/5, 2018 at 8:41

Then you can just change the code to pass down two "deepness" params, ol_deepness and ul_deepness, and only increment when descending into the same group. I used doc.inner_text to extract the text, but that will leave some newlines in between. Sorry, I have no more time now. – Primal 17/5, 2018 at 8:44

My code above works with the Nokogiri::HTML.fragment method + ul:root selectors. If your structure is different and you are using the full Nokogiri::HTML.parse() method, then you need to replace the root selector doc.search('ol:root') with e.g. doc.search('body > ol'). I can only use the example you provided. – Primal 17/5, 2018 at 9:8

First, to handle spacing, I used a hack in the lambda calls (the space rules below). I also use Nokogiri's add_previous_sibling method to insert that spacing before each item. Lastly, Prawn doesn't keep the leading spaces when we deal with ul and ol tags, so I used this gsub, which replaces each leading whitespace character with a non-breaking space: gsub(/^([^\S\r\n]+)/m) { |m| "\xC2\xA0" * m.size }. You can read more from this link

Note: Nokogiri repairs invalid HTML rather than rejecting it, and the repaired tree may not be what you intended, so always provide valid HTML

require 'nokogiri'
require 'prawn'

RULES = {
  ol: {
    1 => ->(index) { "#{index + 1}. " },
    2 => ->(index) { "#{}" },
    3 => ->(index) { "#{}" },
    4 => ->(index) { "#{}" }
  },
  ul: {
    1 => ->(_) { "\u2022 " },
    2 => ->(_) { "" },
    3 => ->(_) { "" },
    4 => ->(_) { "" },
  },
  space: {
    1 => ->(index) { " "  },
    2 => ->(index) { "  " },
    3 => ->(index) { "   " },
    4 => ->(index) { "    " },
  }
}

def ol_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    prefix = RULES[:ol][deepness].call(i)
    space = RULES[:space][deepness].call(i)
    item.add_previous_sibling(space)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def ul_rule(group, deepness: 1)
  group.search('> li').each_with_index do |item, i|
    space = RULES[:space][deepness].call(i)
    prefix = RULES[:ul][deepness].call(i)
    item.add_previous_sibling(space)
    item.prepend_child(prefix)
    descend(item, deepness + 1)
  end
end

def descend(item, deepness)
  item.search('> ol').each do |ol|
    ol_rule(ol, deepness: deepness)
  end
  item.search('> ul').each do |ul|
    ul_rule(ul, deepness: deepness)
  end
end

doc = Nokogiri::HTML.parse(text)

doc.search('ol').each do |group|
  ol_rule(group, deepness: 1)
end

doc.search('ul').each do |group|
  ul_rule(group, deepness: 1)
end

Prawn::Document.generate("hello.pdf") do
  # puts doc.inner_text
  html = doc.at('body').children.to_html
  # Turn leading whitespace into non-breaking spaces so the indentation survives in Prawn.
  html = html.gsub(/^([^\S\r\n]+)/m) { |m| "\xC2\xA0" * m.size }
  # Strip the list tags and literal "\n" sequences, then collapse repeated newlines.
  text html.gsub(%r{</?(?:ul|ol|li)>}, "").gsub("\\n", "").gsub(/\n+/, "\n")
end
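The whitespace step from the gsub chain above can be tried on its own, without Prawn. This is a small sketch; `preserve_indent` and `NBSP` are hypothetical names, and "\u00A0" is the same character as the "\xC2\xA0" byte sequence used in the answer.

```ruby
NBSP = "\u00A0" # non-breaking space, same character as "\xC2\xA0" in UTF-8

# Replace every leading space or tab on each line with a non-breaking space.
# [^\S\r\n] matches whitespace except line terminators, so blank lines are untouched.
def preserve_indent(text)
  text.gsub(/^[^\S\r\n]+/) { |run| NBSP * run.size }
end

sample = "1. One\n  a. Inner\n\t- Deep\n"
puts preserve_indent(sample).inspect
```

Only whitespace at the start of a line is converted; spaces inside the text, such as the one after each marker, stay as ordinary spaces.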

Hurtful answered 24/5, 2018 at 7:5 Comment(0)

Whenever you are in an ol, li, or ul element, you must recursively check for nested ol, li, and ul elements. If there are none, return what has been discovered so far as a substructure; if there are, call the same function on the new node and add its return value to the current structure.

You perform a different action on each node depending on its type, no matter where it is, and the function then automatically repackages everything.
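The recursion described above can be illustrated without any HTML parsing. The sketch below works on a hypothetical in-memory tree (`ListNode` and `ItemNode` are made-up structs, not Nokogiri classes) and returns the rendered lines from each call so the caller can merge them into its own result.

```ruby
# Hypothetical tree: a list has a type (:ol or :ul) and items;
# each item has text and zero or more nested lists.
ListNode = Struct.new(:type, :items)
ItemNode = Struct.new(:text, :sublists)

# Render one list and, recursively, everything nested below it.
# Each call returns its lines; the caller splices them in after the item line.
def render_list(list, depth = 0)
  list.items.each_with_index.flat_map do |item, i|
    marker = list.type == :ol ? "#{i + 1}." : "\u2022"
    line = "#{'  ' * depth}#{marker} #{item.text}"
    nested = item.sublists.flat_map { |sub| render_list(sub, depth + 1) }
    [line, *nested]
  end
end

tree = ListNode.new(:ol, [
  ItemNode.new('One', []),
  ItemNode.new('Two', [ListNode.new(:ul, [ItemNode.new('hey', [])])]),
  ItemNode.new('Three', [])
])
puts render_list(tree)
```

The base case is an item with no sublists, which contributes just its own line.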

Etheridge answered 14/5, 2018 at 9:30 Comment(3)

@AniketShivamTiwari sorry, I am on my phone. Idea: wouldn't it be easier to just use CSS selectors to select every li and then check whether its parent is an ordered or unordered list? Also, I just noticed that I can't see why your code doesn't behave the way you want it to. – Etheridge 14/5, 2018 at 9:48

@AniketShivamTiwari When I run your code and do a puts content.text, I get this. Isn't that what you want? I can't understand how your solution doesn't match your expectations. – Etheridge 14/5, 2018 at 11:54

I wrote the code above for a sample of ul and li tags. It won't work with the example I have given. – Hurtful 14/5, 2018 at 12:4
