Hi @radkovo,
We are using the CSSBox DOM parser to parse HTML source. Here is the implementation:
try (DocumentSource docSource = new StreamDocumentSource(JAFIOUtils.toInputStream(htmlSource),
        null, "text/html;charset=UTF-8")) {
    LOGGER.error("Before parse " + htmlSource);
    // Parse the input document
    DOMSource parser = new DefaultDOMSource(docSource);
    Document doc = parser.parse();
    LOGGER.error("After parse " + doc.getFirstChild().getTextContent());
}
For example, suppose the input (htmlSource) is <style></style>Test User &lt;test.user@test.com&gt;
After parsing, the output is Test User <test.user@test.com>.
The email address in the text content is enclosed in &lt; and &gt;, and the parser decodes these to < and >. Our requirement is that the parser should not decode &lt; and &gt; to < and >.
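The decoding described above matches standard DOM behaviour, where getTextContent() returns text with entity references already resolved. As a rough illustration, here is a minimal sketch that uses the JDK's built-in XML parser instead of CSSBox (so it is only an analogy, and the input is wrapped in a <p> element to make it well-formed). The class name EntityDecodingDemo and the helper escapeText are hypothetical and not part of any library; the helper simply re-escapes the characters after parsing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class EntityDecodingDemo {

    // Parses a well-formed XML fragment with the JDK parser (not CSSBox)
    // and returns the text content of the root element.
    static String textContentOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    // Hypothetical helper: re-escapes markup-significant characters.
    // The ampersand is replaced first so existing output is not double-escaped.
    static String escapeText(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String decoded = textContentOf("<p>Test User &lt;test.user@test.com&gt;</p>");
        System.out.println(decoded);             // Test User <test.user@test.com>
        System.out.println(escapeText(decoded)); // Test User &lt;test.user@test.com&gt;
    }
}
```

Re-escaping after the fact gives back the entity form for this input, but it cannot tell whether a given < in the original source was written literally or as &lt;, which is why we are asking whether the parser itself can preserve the text.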
How can we retain the text exactly as it is, without any decoding or encoding? @radkovo, could you please suggest a solution for this issue?