Stripping HTML From A String
Asked Answered
P

6

5

I've tried a number of things but nothing seems to be working properly. I have an Access DB and am writing code in VBA. I have a string of HTML source code that I am interested in stripping all of the HTML code and Tags out of so that I just have plain text string with no html or tags left. What is the best way to do this?

Thanks

Predecessor answered 9/10, 2012 at 16:8 Comment(0)
R
7

One way that's as resilient as possible to bad markup;

with createobject("htmlfile")
    .open
    .write "<p>foo <i>bar</i> <u class='farp'>argle </zzzz> hello </p>"
    .close
    msgbox "text=" & .body.outerText
end with
Rachele answered 9/10, 2012 at 16:14 Comment(4)
+1 Nice idea, just have to remove the special characters thenKingfish
It should translate entities too &amp; -> &Rachele
oh yeah, could've sworn it didn't ;)Kingfish
Add a reference to "Microsoft HTML Object Library" to get Intellisense and early binding (see example below).Babb
C
6
    Function StripHTML(cell As Range) As String  
 Dim RegEx As Object  
 Set RegEx = CreateObject("vbscript.regexp")  

 Dim sInput As String  
 Dim sOut As String  
 sInput = cell.Text  

 With RegEx  
   .Global = True  
   .IgnoreCase = True  
   .MultiLine = True  
.Pattern = "<[^>]+>" 'Regular Expression for HTML Tags.  
 End With  

 sOut = RegEx.Replace(sInput, "")  
 StripHTML = sOut  
 Set RegEx = Nothing  
End Function  

This might help you, Good luck.

Chairman answered 9/10, 2012 at 16:10 Comment(0)
K
3

It depends how complex the html structure is and how much data you're wanting out of it.

Depending on the complexity you might get away with regular expressions, but for complex markup trying to parse data from html with regex is like trying to eat soup with a fork.

You can use the htmFile object to turn the flat file into objects that you can interact with, for example:

Function ParseATable(url As String) As Variant 

    Dim htm As Object, table As Object 
    Dim data() As String, x As Long, y As Long 
    Set htm = CreateObject("HTMLfile") 
    With CreateObject("MSXML2.XMLHTTP") 
        .Open "GET", url, False 
        .send 
        htm.body.innerhtml = .responsetext 
    End With 

    With htm 
        Set table = .getelementsbytagname("table")(0) 
        Redim data(1 To table.Rows.Length, 1 To 10) 
        For x = 0 To table.Rows.Length - 1 
            For y = 0 To table.Rows(x).Cells.Length - 1 
                data(x + 1, y + 1) = table.Rows(x).Cells(y).InnerText 
            Next y 
        Next x 

        ParseATable = data 

    End With 
End Function 
Kingfish answered 9/10, 2012 at 16:14 Comment(1)
+1 worth it alone for parse data from html with regex is like trying to eat soup with a forkDexamyl
B
0

Using early binding:

Public Function GetText(inputHtml As String) As String
With New HTMLDocument
    .Open
    .write "<p>foo <i>bar</i> <u class='farp'>argle </zzzz> hello </p>"
    .Close
   StripHtml = .body.outerText
End With
End Function
Babb answered 10/10, 2012 at 8:50 Comment(0)
T
0

An improvement over one of the above... It finds quotes and line feeds and replaces them with the non-HTML equivalents. Also, the original function had a problem with embedded UNC references (ie: <\server\share\folder\file.ext>). It would remove the entire UNC string due to < at the beginning and > at the end. This function fixes that so the UNC gets inserted into the string correctly:

Function StripHTML(strString As String) As String
 Dim RegEx As Object
 Set RegEx = CreateObject("vbscript.regexp")

 Dim sInput As String
 Dim sOut As String
 sInput = Replace(strString, "<\\", "\\")

 With RegEx
   .Global = True
   .IgnoreCase = True
   .MultiLine = True
.Pattern = "<[^>]+>" 'Regular Expression for HTML Tags.
 End With

 sOut = RegEx.Replace(sInput, "")
 StripHTML = Replace(Replace(Replace(sOut, "&nbsp;", vbCrLf, 1, -    1), "&quot;", "'", 1, -1), "\\", "<\\", 1, -1)
 Set RegEx = Nothing
End Function
Telephonic answered 27/5, 2015 at 3:3 Comment(0)
E
0

I found a really simple solutions to this. I currently run an access database and use excel forms to update the system due to system restrictions and shared drive privileges. when I call the data from Access I use: Plaintext(YourStringHere) this will remove all html parts and only leave the text.

hope this works.

Euler answered 6/6, 2016 at 5:37 Comment(2)
What's the definition of "Plaintext" here?Christoffer
StripHTML = Plaintext(StingToConvert)Euler

© 2022 - 2024 — McMap. All rights reserved.