Apache Tika exclude some html tags

I am testing the Apache Tika REST API from Python to parse HTML files. Everything works except one thing: the contents of <noscript> tags are also extracted as text, so CSS styling content ends up in my output, which is undesirable. The body of <div style="display:none"> is extracted as well. Is there a way to blacklist certain HTML tags in the Tika REST API?

Anhedral answered 22/2, 2019 at 15:0 Comment(0)

I don't have an immediate solution, but the request seems reasonable, so please open an issue on our JIRA for the team to discuss: https://issues.apache.org/jira/projects/TIKA/summary

Gemina answered 28/2, 2019 at 21:42 Comment(2)
Is the solution available now with the latest version of Tika? @Tim Allison I am also facing the same issue. - Garnierite
Doesn't look like it: issues.apache.org/jira/browse/TIKA-2805. Ping that issue and see if you can get some attention... - Gemina
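
Until something like that lands in Tika itself, one possible workaround is to strip the unwanted elements on the client side before the HTML is sent to the server. The following is only a minimal sketch, not a Tika feature: `HiddenStripper`, `strip_hidden` and `tika_text` are hypothetical names, the URL assumes a tika-server on its default port 9998 on localhost, and the depth tracking assumes reasonably well-formed markup (an unclosed tag inside a hidden block would throw it off). It drops <noscript> subtrees and any element whose inline style contains display:none; comments and the doctype are dropped as a side effect.

```python
from html.parser import HTMLParser
import re
import urllib.request

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.I)


class HiddenStripper(HTMLParser):
    """Re-emits HTML, skipping <noscript> and inline display:none subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped subtree

    def _hidden(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        return tag == "noscript" or bool(HIDDEN_STYLE.search(style))

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth += 1
        elif self._hidden(tag, attrs):
            if tag not in VOID:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags (<br/>) never open a subtree.
        if not self.skip_depth and not self._hidden(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_hidden(html):
    """Return html without <noscript> and inline display:none elements."""
    parser = HiddenStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def tika_text(html, url="http://localhost:9998/tika"):
    """PUT the cleaned HTML to a Tika server and return the plain text."""
    req = urllib.request.Request(
        url,  # assumed endpoint: tika-server on its default port
        data=strip_hidden(html).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/html; charset=utf-8",
                 "Accept": "text/plain"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

With a server running, `tika_text(open("page.html", encoding="utf-8").read())` would send the filtered markup instead of the raw file. Note that this only catches inline styles; elements hidden through a CSS class or an external stylesheet would still be extracted.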
