How to remove tags from a string in python using regular expressions? (NOT in HTML)

Asked 7/9, 2010 at 19:48 Answered 30/12, 2015 at 18:18

I need to remove tags from a string in python.

<FNT name="Century Schoolbook" size="22">Title</FNT>

What is the most efficient way to remove the entire tag on both ends, leaving only "Title"? I've only seen ways to do this with HTML tags, and that hasn't worked for me in Python. I'm using this for ArcMap, a GIS program. It has its own tags for its layout elements, and I just need to remove the tags for two specific title text elements. I believe regular expressions should work fine for this, but I'm open to any other suggestions.

Overside answered 7/9, 2010 at 19:48 Comment(4)

Do you want <FNT name="Century Schoolbook" size="22">Title</FNT> to become <FNT>Title</FNT>, or Title, or <>Title<>, or something else? It's not clear from your question what you're after. – Jamshedpur 7/9, 2010 at 19:51

So what should this string look like after processing? I'm not entirely clear on what you want to do. – Hedonism 7/9, 2010 at 19:51

Sorry. The string should be "Title" after processing. – Overside 7/9, 2010 at 19:52

As a sibling of html, xml is no more regular or context-free than html. I'm not sure the entire scope of your situation, but at a quick glance, regular expressions still look like the wrong tool for the job. – Wolf 7/9, 2010 at 19:57

This should work:

import re
mystring = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
print(re.sub(r'<[^>]*>', '', mystring))  # prints: Title

To everyone saying that regexes are not the correct tool for the job:

The context of the problem is such that all the objections regarding regular/context-free languages are invalid. His language essentially consists of three entities: a = <, b = >, and c = [^><]+. He wants to remove any occurrences of acb. This fairly directly characterizes his problem as one involving a context-free grammar, and it is not much harder to characterize it as a regular one.

I know everyone likes the "you can't parse HTML with regular expressions" answer, but the OP doesn't want to parse it, he just wants to perform a simple transformation.
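As a minimal sketch of the point above: the pattern reads as `<`, then any run of characters other than `>`, then `>`. The helper name `strip_tags` and the second sample string (with an invented `ITA` tag) are illustrative assumptions, not something from the question. It also assumes no attribute value contains a literal `>`.

```python
import re

# '<', then anything that is not '>', then '>' (the "a c b" shape described above).
TAG_PATTERN = re.compile(r'<[^>]*>')

def strip_tags(text):
    """Return text with every <...> span removed (hypothetical helper)."""
    return TAG_PATTERN.sub('', text)

print(strip_tags('<FNT name="Century Schoolbook" size="22">Title</FNT>'))
print(strip_tags('<FNT size="22">Map <ITA>Title</ITA></FNT>'))
```

Because each tag is removed independently, nested tags need no special handling here; only the text between them is kept.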

Twelvemo answered 7/9, 2010 at 20:7 Comment(2)

This didn't work. It returned the original string. Thanks though – Overside 7/9, 2010 at 20:25

Sorry, I forgot the all-important * character. Try again? – Twelvemo 7/9, 2010 at 20:43

Please avoid using regex. Even though a regex will work on your simple string, you'll run into problems later if you get a more complex one.

You can use BeautifulSoup's get_text() method.

from bs4 import BeautifulSoup

text = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
soup = BeautifulSoup(text, 'html.parser')  # name the parser explicitly to avoid a warning

print(soup.get_text())

Acetone answered 30/12, 2015 at 18:18 Comment(0)

Searching this regex and replacing it with an empty string should work.

/<[A-Za-z\/][^>]*>/

Example (from python shell):

>>> import re
>>> my_string = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
>>> print(re.sub(r'<[A-Za-z\/][^>]*>', '', my_string))
Title
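For what it's worth, requiring a letter or `/` right after the `<` matters when the text contains bare comparison signs. The sample string below is made up for illustration; it compares this pattern with the more permissive `<[^>]*>`.

```python
import re

# Hypothetical input mixing bare '<' and '>' characters with a real tag pair.
sample = '1 < 2 and 3 > 2 <b>ok</b>'

# Permissive pattern: anything between '<' and the next '>' counts as a tag,
# so the span '< 2 and 3 >' is removed as well.
permissive = re.sub(r'<[^>]*>', '', sample)

# Stricter pattern: '<' must be followed by a letter or '/',
# so only <b> and </b> are removed.
strict = re.sub(r'<[A-Za-z/][^>]*>', '', sample)

print(repr(permissive))
print(repr(strict))
```

The stricter pattern leaves `1 < 2 and 3 > 2 ok`, while the permissive one also swallows the text between the two comparison signs.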

Tale answered 7/9, 2010 at 20:10 Comment(2)

That didn't work either. Could you give me an example of how you would search and replace using this? I tried, and it returned the original string. – Overside 7/9, 2010 at 20:46

Added an example. Did you forget import re? – Tale 7/9, 2010 at 21:32

If it's only for parsing and retrieving a value, you might take a look at BeautifulStoneSoup.

Exponential answered 7/9, 2010 at 20:4 Comment(0)

If the source text is well-formed XML, you can use the stdlib module ElementTree:

import xml.etree.ElementTree as ET
mystring = """<FNT name="Century Schoolbook" size="22">Title</FNT>"""
element = ET.XML(mystring)
print(element.text)  # prints: Title

If the source isn't well-formed, BeautifulSoup is a good suggestion. Using regular expressions to parse tags is not a good idea, as several posters have pointed out.

Valenzuela answered 7/9, 2010 at 20:59 Comment(1)

If FNT contained another tag in the middle of "Title", only the text up to the inner tag would be printed. – Lectionary 7/2, 2014 at 10:18
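A small sketch of that caveat and one possible way around it, using ElementTree's `itertext()`. The nested `ITA` tag and the sample text are invented for illustration; the input still has to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: an <ITA> tag nested inside the title text.
nested = '<FNT name="Century Schoolbook" size="22">Map <ITA>Title</ITA> 2010</FNT>'
element = ET.XML(nested)

# .text holds only the text that comes before the first child element.
first_text = element.text

# itertext() walks the element and all its descendants in document order.
all_text = ''.join(element.itertext())

print(repr(first_text))
print(repr(all_text))
```

Here `element.text` stops at `'Map '`, while joining `itertext()` gives `'Map Title 2010'`.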


Use an XML parser, such as ElementTree. Regular expressions are not the right tool for this job.

Kennakennan answered 7/9, 2010 at 21:0 Comment(2)

Unless the input is not guaranteed to be well-formed XML, in which case regex is arguably the only reasonable tool for the job. I'm also willing to bet that regex will perform significantly faster than handling the string as an XML document. – Tale 7/9, 2010 at 21:41

If the input is not well-formed XML, then implementing a full parser would be the proper way to do this. The grammar is complex enough that regular expressions are not enough. – Kennakennan 8/9, 2010 at 0:43
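A possible middle ground for input that isn't well-formed, sketched with the stdlib `html.parser` module. This is neither a regex nor a full parser for ArcMap's tag grammar; it assumes the lenient HTML tokenizer is acceptable for this markup, which nothing in the thread confirms. The class and function names are made up. Note that it also decodes character references such as `&amp;`, which may or may not be wanted.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collects character data and ignores tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.parts.append(data)

def strip_tags(markup):
    """Return only the text content of markup (hypothetical helper)."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered at end of input
    return ''.join(collector.parts)

print(strip_tags('<FNT name="Century Schoolbook" size="22">Title</FNT>'))
# An unclosed tag would make ET.XML() raise; here the text is still collected.
print(strip_tags('<FNT size="22">Unclosed <ITA>Title'))
```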
