Remove a DOCTYPE containing an ENTITY declaration from XML using Java

Asked 16/11, 2018 at 9:8 Answered 16/11, 2018 at 9:44

Solved java regex xml string regular-language

I'm trying to process an XML document. Before that, I need to remove the DOCTYPE and ENTITY declarations from the input XML.

I'm using the below code to remove the doctype and entity:

fileContent = fileContent.replaceAll("<!ENTITY ((.|\n|\r)*?)\">", "");
fileContent = fileContent.replaceAll("<!DOCTYPE((.|\n|\r)*?)>", "");

This removes the entity declaration first and then the doctype. It works fine if the XML contains either of the following doctype declarations:

<!DOCTYPE ichicsr SYSTEM "http://www.w3.org/TR/html4/frameset.dtd">

<!DOCTYPE ichicsr SYSTEM "D:\UPGRADE\NTServices\Server\\Xml21.dtd"
[<!ENTITY % entitydoc SYSTEM "D:\UPGRADE\NTServices\Server\\latin-entities.dtd"> %entitydoc;]>

But with the doctype below, it doesn't work, and the root tag of the XML gets stripped off:

<!DOCTYPE ichicsr SYSTEM "D:\UPGRADE\NTServices\Server\\Xml21.dtd" 
[<!ENTITY % entitydoc SYSTEM 'D:\UPGRADE\NTServices\Server\\Xml21.dtd'>
]>

Please let me know whether the regular expression I'm using is incorrect or whether I need to do something else.

Criollo answered 16/11, 2018 at 9:8 Comment(5)

Never use (.|\n|\r)*?, use .*? with Pattern.DOTALL (or inline (?s) variant), or at least [\s\S]*?. – Winkle 16/11, 2018 at 9:11

Try a single replacement replaceAll("<!DOCTYPE[^<>]*(?:<!ENTITY[^<>]*>[^<>]*)?>", "") – Winkle 16/11, 2018 at 9:19

Thanks Wiktor. It worked for me. But, is there a way to handle upper case and lower case doctype and entity using a single pattern ? – Criollo 16/11, 2018 at 9:39

Well, yours does not work because you have " required before > in the ENTITY regex. You may just replace \" with ['\"] there. – Winkle 16/11, 2018 at 9:41

ok.. got it. But, is there a way to handle upper case and lower case doctype and entity using a single pattern – Criollo 16/11, 2018 at 9:43

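Your approach does not work because the ENTITY regex requires a " before the final >. In the failing input the entity declaration ends with '>, so the lazy match keeps scanning until the next "> further on in the document. You may just replace \" with ['\"] there.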
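To make the failure concrete, here is a small reproduction using the two patterns from the question. The sample document is made up for illustration: the short DTD names and the lang attribute on the root element are assumptions, chosen so that the next "> after the entity declaration sits at the end of the opening root tag.

```java
public class QuestionRepro {
    // Hypothetical sample: the DTD names and the lang attribute are made up for this sketch.
    static final String SAMPLE =
            "<!DOCTYPE ichicsr SYSTEM \"Xml21.dtd\" \n"
            + "[<!ENTITY % entitydoc SYSTEM 'Xml21.dtd'>\n"
            + "]>\n"
            + "<ichicsr lang=\"en\"><a/></ichicsr>";

    // The two replacements exactly as posted in the question.
    static String originalStrip(String fileContent) {
        fileContent = fileContent.replaceAll("<!ENTITY ((.|\n|\r)*?)\">", "");
        fileContent = fileContent.replaceAll("<!DOCTYPE((.|\n|\r)*?)>", "");
        return fileContent;
    }

    public static void main(String[] args) {
        // The first replacement has no "> inside the declaration to stop at,
        // so it runs on to the "> that closes the opening root tag.
        System.out.println(originalStrip(SAMPLE));
    }
}
```

With this sample, the opening ichicsr tag is consumed together with the declarations, which matches the symptom described in the question.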

Besides, never use (.|\n|\r)*? in any regex since it is a performance killer. Instead, use .*? with Pattern.DOTALL (or inline (?s) variant), or at least [\s\S]*?.
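As a rough sketch, the three recommended spellings can be compared side by side on a doctype that spans several lines. The class name, field names, and sample document are illustrative assumptions, not part of the original code.

```java
import java.util.regex.Pattern;

public class DotallVariants {
    // Three ways to say "any character, line breaks included, as few as possible".
    static final Pattern WITH_FLAG = Pattern.compile("<!DOCTYPE.*?>", Pattern.DOTALL);
    static final String INLINE_FLAG = "(?s)<!DOCTYPE.*?>";
    static final String CHAR_CLASS = "<!DOCTYPE[\\s\\S]*?>";

    // Hypothetical sample with a multi-line doctype and an empty internal subset.
    static final String SAMPLE =
            "<!DOCTYPE ichicsr SYSTEM \"Xml21.dtd\"\n[\n]>\n<ichicsr/>";

    public static void main(String[] args) {
        // Each variant removes the whole declaration and leaves the root element.
        System.out.println(WITH_FLAG.matcher(SAMPLE).replaceAll("").trim());
        System.out.println(SAMPLE.replaceAll(INLINE_FLAG, "").trim());
        System.out.println(SAMPLE.replaceAll(CHAR_CLASS, "").trim());
    }
}
```

All three behave the same here, so the choice is mostly about readability. The Pattern.DOTALL form also lets the compiled pattern be reused across calls.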

However, there is a better way: merge the two regexps into one:

fileContent = fileContent.replaceAll("(?i)<!DOCTYPE[^<>]*(?:<!ENTITY[^<>]*>[^<>]*)?>", "");

See the regex demo.

Details

(?i) - inline modifier for case-insensitive matching (same as Pattern.CASE_INSENSITIVE)
<!DOCTYPE - literal text
[^<>]* - 0+ chars other than < and >
(?:<!ENTITY[^<>]*>[^<>]*)? - an optional occurrence of
- <!ENTITY
- [^<>]* - 0+ chars other than < and >
- > - a > char
- [^<>]* - 0+ chars other than < and >
> - a > char.
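The breakdown above can be tried out with a small harness such as the one below. It is only a sketch: the class name, the strip helper, and the shortened DTD names are assumptions made for the example, while the pattern itself is the merged one from the answer.

```java
import java.util.regex.Pattern;

public class DoctypeStripper {
    // The merged pattern from the answer, compiled once so it can be reused.
    private static final Pattern DOCTYPE = Pattern.compile(
            "(?i)<!DOCTYPE[^<>]*(?:<!ENTITY[^<>]*>[^<>]*)?>");

    // Hypothetical helper: removes every match of the pattern from the input.
    static String strip(String xml) {
        return DOCTYPE.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String root = "<ichicsr lang=\"en\"/>";
        // Shapes modelled on the question's samples, plus a lower-case variant.
        String[] doctypes = {
            "<!DOCTYPE ichicsr SYSTEM \"a.dtd\">",
            "<!DOCTYPE ichicsr SYSTEM \"a.dtd\"\n[<!ENTITY % entitydoc SYSTEM \"b.dtd\"> %entitydoc;]>",
            "<!DOCTYPE ichicsr SYSTEM \"a.dtd\" \n[<!ENTITY % entitydoc SYSTEM 'b.dtd'>\n]>",
            "<!doctype ichicsr system \"a.dtd\" [<!entity % entitydoc system 'b.dtd'>]>"
        };
        for (String doctype : doctypes) {
            // Only the root element should remain after stripping.
            System.out.println(strip(doctype + "\n" + root).trim());
        }
    }
}
```

Note that the optional group covers at most one ENTITY declaration. An internal subset with several declarations would presumably need the ? quantifier changed to *, and entity values that themselves contain < or > are not handled by this pattern.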

Inductive answered 16/11, 2018 at 9:44 Comment(1)

Thanks a lot for the solution wiktor. – Criollo 16/11, 2018 at 9:49
