Text containing &lt; or &gt; is decoded to < or > when parsed #71

@GovardhanNag

Hi @radkovo,

We are using the CSSBox DOM parser to parse HTML source. Here is our implementation:

    try (DocumentSource docSource = new StreamDocumentSource(
            JAFIOUtils.toInputStream(htmlSource), null, "text/html;charset=UTF-8")) {
        LOGGER.error("Before parse " + htmlSource);
        // Parse the input document
        DOMSource parser = new DefaultDOMSource(docSource);
        Document doc = parser.parse();
        LOGGER.error("After parse " + doc.getFirstChild().getTextContent());
    }

For example, suppose the input (htmlSource) is:

    <style></style>Test User &lt;test.user@test.com&gt;

After parsing, the logged text content is:

    Test User <test.user@test.com>

The email address in the text is enclosed in &lt; and &gt;, and the parser decodes these to < and >. Our requirement is that the parser leave &lt; and &gt; as they are.

How can we retain the text exactly as written, without any entity decoding or encoding? @radkovo, could you please suggest a solution?
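
For context, a DOM text node stores the decoded characters, so getTextContent() returns < and > rather than the entity references. Until there is a parser-side answer, one workaround would be to re-escape the text after reading it from the DOM. The sketch below is only an assumption about what such a step could look like: the class TextEscaper and the method escapeText are hypothetical names, not part of CSSBox, and it only handles &, < and >.

```java
// Hypothetical helper, not part of CSSBox: re-encodes the characters
// that an HTML parser decodes when it builds DOM text nodes.
public final class TextEscaper {

    private TextEscaper() {
    }

    /** Returns the text with '&', '<' and '>' replaced by entity references. */
    public static String escapeText(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The decoded value, as returned by getTextContent() in the example above
        String decoded = "Test User <test.user@test.com>";
        System.out.println(escapeText(decoded));
        // prints: Test User &lt;test.user@test.com&gt;
    }
}
```

This does not restore the original source byte for byte. For example, a literal & in the input would come back as &amp;, so it only helps when the goal is entity-safe output rather than the exact original string.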
