Stripping Invalid XML characters in Java

Accepted answer
Score: 22

I used Xalan's org.apache.xml.utils.XMLChar class:

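import org.apache.xml.utils.XMLChar;

public static String stripInvalidXmlCharacters(String input) {
    StringBuilder sb = new StringBuilder(input.length());
    // Walk the string by code point rather than by char, so that valid
    // supplementary characters (stored as surrogate pairs) are not dropped.
    for (int i = 0; i < input.length(); ) {
        int codePoint = input.codePointAt(i);
        if (XMLChar.isValid(codePoint)) {
            sb.appendCodePoint(codePoint);
        }
        i += Character.charCount(codePoint);
    }

    return sb.toString();
}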
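
If adding Xalan as a dependency is not an option, roughly the same filter can be sketched directly from the XML 1.0 Char production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). The class and method names below are hypothetical, and this is only a minimal sketch, not a drop-in replacement for XMLChar:

```java
class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every disallowed code point removed.
    // Unpaired surrogates are reported by codePointAt as values in
    // 0xD800-0xDFFF, so they fail the check and are dropped as well.
    static String strip(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Calling XmlCharFilter.strip(text) before handing text to a parser or serializer removes control characters while leaving tabs, line breaks and characters outside the BMP intact.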

Score: 10

I haven't used this personally, but Atlassian made a command-line XML cleaner that may suit your needs (it was made mainly for JIRA, but XML is XML):

Download atlassian-xml-cleaner-0.1.jar

Open a DOS console or shell, and locate the XML or ZIP backup file on your computer, here assumed to be called data.xml

Run: java -jar atlassian-xml-cleaner-0.1.jar data.xml > data-clean.xml

This will write a copy of data.xml to data-clean.xml, with invalid characters removed.

Score: 8

I use the following regexp, which seems to work as expected on JDK 6:

Pattern INVALID_XML_CHARS = Pattern.compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\uD800\uDC00-\uDBFF\uDFFF]");
...
INVALID_XML_CHARS.matcher(stringToCleanup).replaceAll("");

In JDK 7 it might be possible to use the notation \x{10000}-\x{10FFFF} for the last range, which lies outside the BMP, instead of the \uD800\uDC00-\uDBFF\uDFFF notation, which is harder to read.
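
A sketch of that variant, assuming Java 7 or later (where java.util.regex.Pattern accepts the \x{h...h} code point escape). The class name is hypothetical, and the pattern is meant to cover the same ranges as the one above:

```java
import java.util.regex.Pattern;

class XmlRegexCleaner {

    // Matches every code point that is NOT allowed in XML 1.0.
    // The final range uses the \x{...} escape for code points above the BMP.
    static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Removes all matches from the input string.
    static String clean(String stringToCleanup) {
        return INVALID_XML_CHARS.matcher(stringToCleanup).replaceAll("");
    }
}
```

Here every backslash is doubled, so the escapes are interpreted by the regex engine rather than by the Java compiler.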

Score: 3

I have a similar problem when parsing the content of an Australian export tariffs file into an XML document. I cannot use the solutions suggested here, such as:
- Use an external tool (a jar) invoked from the command line.
- Ask Australian Customs to clean up the source file.

The only way to solve this problem at the moment is to iterate through the entire content of the source file, character by character, and test whether each character falls outside the ASCII range 0x00 to 0x1F inclusive. It can be done, but I was wondering if there is a better way using the methods of Java's String type.

EDIT: I found a solution that may be useful to others: use the Java method String#replaceAll to replace or remove any undesirable characters in the XML document.

Example code (I removed some necessary statements to avoid clutter):

BufferedReader reader = null;
...
String line = reader.readLine().replaceAll("[\\x00-\\x1F]", "");

In this example I remove (i.e. replace with an empty string) non-printable characters in the range 0x00 to 0x1F inclusive. You can change the second argument of replaceAll() to replace those characters with whatever string your application requires.
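
Note that the range 0x00 to 0x1F also contains tab (0x09), line feed (0x0A) and carriage return (0x0D), which XML 1.0 does allow. If those should survive, a narrower character class might look like the following sketch (the class and method names are hypothetical):

```java
class ControlCharStripper {

    // Removes control characters below 0x20, except tab (0x09),
    // line feed (0x0A) and carriage return (0x0D).
    static String stripControls(String line) {
        return line.replaceAll("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]", "");
    }
}
```

With BufferedReader#readLine the line terminators are already gone, so in practice the difference from the original pattern is that tabs are kept.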

Score: 0

Is it possible your invalid characters are present only within the values and not the tags themselves, i.e. the XML notionally meets the schema but the values have not been properly sanitized? If so, what about overriding InputStream to create a CleansingInputStream that replaces your invalid characters with their XML equivalents?
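
A minimal sketch of that idea follows. It swaps the InputStream for a java.io.FilterReader, on the assumption that the check is easier on decoded characters than on raw bytes, and it substitutes a caller-chosen replacement character, since characters outside the XML 1.0 Char production have no escaped equivalent (character references must also refer to allowed characters). CleansingReader and its methods are hypothetical names, and surrogates are accepted without checking that they are correctly paired:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CleansingReader extends FilterReader {
    private final char replacement;

    CleansingReader(Reader in, char replacement) {
        super(in);
        this.replacement = replacement;
    }

    // Simplification: every char from 0x20 to 0xFFFD is accepted,
    // including surrogates, so pairing is not validated here.
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD || (c >= 0x20 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return (c == -1 || isAllowed((char) c)) ? c : replacement;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of stream, in which case the loop body never runs.
        for (int i = off; i < off + n; i++) {
            if (!isAllowed(buf[i])) {
                buf[i] = replacement;
            }
        }
        return n;
    }

    // Convenience helper for trying the reader on an in-memory string.
    static String cleanse(String s, char replacement) {
        try (Reader r = new CleansingReader(new StringReader(s), replacement)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because each invalid character is replaced one-for-one, the reader never changes the length of the data, which keeps the buffered read method simple. Such a reader could be wrapped around the source before it is passed to an XML parser.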

Score: 0

Your problem does not concern XML: it concerns character encodings. What it comes down to is that every string, be it XML or otherwise, consists of bytes, and you cannot know what characters these bytes represent unless you are told what character encoding the string has. If, for instance, the supplier tells you it's UTF-8 and it's actually something else, you are bound to run into problems. In the best case, everything works, but some bytes are translated into 'wrong' characters. In the worst case you get errors like the one you encountered.

Actually, your problem is even worse: your string contains byte sequences that do not represent characters in any character encoding. There is no text-handling tool, let alone an XML parser, that can help you here. This needs byte-level cleaning up.
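
One possible form of byte-level cleanup in Java is to decode the raw bytes with a CharsetDecoder that silently drops malformed sequences. The sketch below assumes the data is supposed to be UTF-8, which is an assumption and not something stated in the question; the class and method names are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LenientDecoder {

    // Decodes the bytes as UTF-8, discarding any byte sequence
    // that is not valid in that encoding instead of failing.
    static String decodeDroppingMalformed(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both error actions are IGNORE.
            throw new IllegalStateException(e);
        }
    }
}
```

The decoded string may still contain characters that XML does not allow, so a character-level filter would normally run after this step.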
