I am testing the Apache Tika REST API from Python for parsing HTML files. Everything works except one thing: the contents of <noscript> tags are also parsed as text, so CSS styling content ends up in my extracted text, which is undesirable. The body of <div style="display:none"> is extracted as well. Is there a way to blacklist some HTML tags in the Tika REST API?
Apache Tika exclude some html tags
I don't have an immediate solution, but the request seems reasonable so please open an issue on our JIRA for the team to discuss: https://issues.apache.org/jira/projects/TIKA/summary
Is a solution available now in the latest version of Tika? @Tim Allison, I am facing the same issue. –
Garnierite
Doesn't look like it: issues.apache.org/jira/browse/TIKA-2805. Ping that issue and see if you can get some attention... –
Gemina
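Since the comments above suggest there is no built-in tag blacklist yet, one possible workaround is to strip the unwanted elements on the client before the HTML is sent to Tika. The sketch below is only an illustration, not a Tika feature: `HiddenContentStripper`, `strip_hidden`, and `extract_text` are hypothetical names, and the server URL `http://localhost:9998/tika` assumes a Tika server running locally on its default port. It uses only the Python standard library and assumes reasonably well-formed HTML (unclosed non-void tags inside a hidden element would throw off the depth counter).

```python
import re
import urllib.request
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
HIDDEN_STYLE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


class HiddenContentStripper(HTMLParser):
    """Hypothetical helper: drops <noscript> and display:none subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []          # pieces of the HTML we keep
        self.skip_depth = 0    # > 0 while inside a subtree being dropped

    @staticmethod
    def _is_hidden(tag, attrs):
        if tag == "noscript":
            return True
        style = dict(attrs).get("style") or ""
        return bool(HIDDEN_STYLE.search(style))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if not self.skip_depth and not self._is_hidden(tag, attrs):
                self.out.append(self.get_starttag_text())
        elif self.skip_depth:
            self.skip_depth += 1
        elif self._is_hidden(tag, attrs):
            self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth and not self._is_hidden(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_hidden(html):
    """Return the HTML with <noscript> and display:none elements removed."""
    parser = HiddenContentStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


def extract_text(html, tika_url="http://localhost:9998/tika"):
    """Send the filtered HTML to a Tika server (URL is an assumption)."""
    request = urllib.request.Request(
        tika_url,
        data=strip_hidden(html).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/html", "Accept": "text/plain"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text(open("page.html", encoding="utf-8").read())` would then return the plain text without the hidden parts. Note that this only catches inline `style` attributes; elements hidden through a stylesheet class would need a different check.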