DomXmlParserWhitespace.java - Parse XML File without Whitespaces

Q

How do I parse an XML file with the DOM API without including the whitespace between XML elements?

✍: FYIcenter

A

In many cases, whitespace is added to XML files before and after XML elements to make them more readable.

For example, the following XML file, User.xml, includes whitespace:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Copyright (c) 2017 FYIcenter.com -->
<User>
    <ID>101</ID>
    <BirthDate>1970-01-01+00:01</BirthDate>
    <Name>Frank Y. Ivy</Name>
    <Sex>  Male</Sex>
</User>

To make the DOM XML parser ignore this whitespace, you need to do two things:

1. Add a DTD (Document Type Definition) that defines the element structure, as shown in UserDTD.xml:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Copyright (c) 2017 FYIcenter.com -->
<!DOCTYPE User [
   <!ELEMENT User (ID, BirthDate, Name, Sex)>
   <!ELEMENT ID (#PCDATA)>
   <!ELEMENT BirthDate (#PCDATA)>
   <!ELEMENT Name (#PCDATA)>
   <!ELEMENT Sex (#PCDATA)>
]>

<User>
    <ID>101</ID>
    <BirthDate>1970-01-01+00:01</BirthDate>
    <Name>Frank Y. Ivy</Name>
    <Sex>  Male</Sex>
</User>

2. Tell the parser to ignore whitespace by calling setIgnoringElementContentWhitespace(true), as shown in DomXmlParserWhitespace.java:

// Copyright (c) 2017 FYIcenter.com
import java.io.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;

public class DomXmlParserWhitespace {
   // Source of the leading dots that show each node's depth
   static String dot = "............................................................";

   public static void main(String[] args) throws Exception {
      if (args.length < 2) {
         System.err.println("Usage: java DomXmlParserWhitespace <xml-file> <true|false>");
         return;
      }
      DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
      // args[1] is "true" or "false": ignore whitespace in element content or not
      f.setIgnoringElementContentWhitespace(Boolean.parseBoolean(args[1]));

      DocumentBuilder b = f.newDocumentBuilder();
      Document d = b.parse(new File(args[0]));
      System.out.println("Implementation class:\n   "+d.getClass().getName());

      System.out.println("DOM object elements and text contents:");
      Node n = d.getDocumentElement();
      printText(n, 1);
   }

   // Print the node name (plus the text, for text nodes), then recurse into the children
   public static void printText(Node n, int l) {
      String v = "";
      if (n.getNodeType()==Node.TEXT_NODE) v = n.getTextContent();
      System.out.println(dot.substring(0,Math.min(l,dot.length()))+n.getNodeName()+":"+v);
      NodeList c = n.getChildNodes();
      for (int i=0; i<c.getLength(); i++) {
         printText(c.item(i),l+1);
      }
   }
}
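
Both steps are needed together: without element declarations, the parser has no way to tell which whitespace can be ignored. The following is a minimal sketch of that dependency, assuming the JDK's built-in parser. The class name DtdRequiredSketch, its helper method, and the shortened in-memory documents are made up for illustration and are not part of the original example. It counts the child nodes of the root element with the flag set to true, once with a DTD and once without.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DtdRequiredSketch {
   // Shortened, hypothetical variant of User.xml with two child elements
   static final String BODY =
      "<User>\n"
      + "    <ID>101</ID>\n"
      + "    <Name>Frank Y. Ivy</Name>\n"
      + "</User>\n";

   // Internal DTD subset declaring that User holds only child elements
   static final String DTD =
      "<!DOCTYPE User [\n"
      + "   <!ELEMENT User (ID, Name)>\n"
      + "   <!ELEMENT ID (#PCDATA)>\n"
      + "   <!ELEMENT Name (#PCDATA)>\n"
      + "]>\n";

   // Parse the given XML text with the flag on and count the root's child nodes
   static int countRootChildren(String xml) throws Exception {
      DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
      f.setIgnoringElementContentWhitespace(true);
      Document d = f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
      return d.getDocumentElement().getChildNodes().getLength();
   }

   public static void main(String[] args) throws Exception {
      System.out.println("With DTD:    " + countRootChildren(DTD + BODY));
      System.out.println("Without DTD: " + countRootChildren(BODY));
   }
}
```

With the JDK's built-in parser, the first count should be 2 (only ID and Name), and the second should be 5, because the three whitespace-only text nodes stay in the tree when no DTD is present.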

Compile and run the example program, DomXmlParserWhitespace.java, with setIgnoringElementContentWhitespace(false):

>\fyicenter\jdk-1.8.0\bin\javac DomXmlParserWhitespace.java

>\fyicenter\jdk-1.8.0\bin\java DomXmlParserWhitespace UserDTD.xml false

Implementation class:
   com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl
DOM object elements and text contents:
.User:
..#text:

..ID:
...#text:101
..#text:

..BirthDate:
...#text:1970-01-01+00:01
..#text:

..Name:
...#text:Frank Y. Ivy
..#text:

..Sex:
...#text:  Male
..#text:

Run it again with setIgnoringElementContentWhitespace(true):

>\fyicenter\jdk-1.8.0\bin\java DomXmlParserWhitespace UserDTD.xml true

Implementation class:
   com.sun.org.apache.xerces.internal.dom.DeferredDocumentImpl
DOM object elements and text contents:
.User:
..ID:
...#text:101
..BirthDate:
...#text:1970-01-01+00:01
..Name:
...#text:Frank Y. Ivy
..Sex:
...#text:  Male

The output shows that, once the DTD declares that User contains only child elements, Apache Xerces can recognize the whitespace between those elements as ignorable and leave it out of the DOM tree.
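
Note that the leading spaces in "  Male" appear in both runs: the setting only affects whitespace-only text between elements, not whitespace inside an element's own text. If such values need cleaning, that is left to the application. Here is a minimal sketch of one way to do it, assuming the JDK's built-in parser. The class name TrimTextSketch, the helper readSex(), and the reduced in-memory document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TrimTextSketch {
   // Reduced, hypothetical variant of UserDTD.xml with only the Sex element
   static final String XML =
      "<!DOCTYPE User [\n"
      + "   <!ELEMENT User (Sex)>\n"
      + "   <!ELEMENT Sex (#PCDATA)>\n"
      + "]>\n"
      + "<User>\n"
      + "    <Sex>  Male</Sex>\n"
      + "</User>\n";

   // Return the text of the Sex element exactly as the parser delivers it
   static String readSex() throws Exception {
      DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
      f.setIgnoringElementContentWhitespace(true);
      Document d = f.newDocumentBuilder().parse(new InputSource(new StringReader(XML)));
      return d.getElementsByTagName("Sex").item(0).getTextContent();
   }

   public static void main(String[] args) throws Exception {
      String raw = readSex();
      System.out.println("[" + raw + "]");        // leading spaces still present
      System.out.println("[" + raw.trim() + "]"); // trimmed by the application
   }
}
```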

 


2017-12-13