COMPARING JAVA XML PARSERS
PRESENTED BY
SASANKA SEKHAR BANERJEE
During this presentation, we will discuss the following:
- Need for XML
- Brief overview of XML
- Different methods of parsing XML
    - DOM [Document Object Model]
    - SAX [Simple API for XML]
    - JAXP [Java API for XML Processing]
    - JAXB [Java Architecture for XML Binding]
    - StAX [Streaming API for XML]
    - XPath
- Choosing the right parser
NEED FOR XML
Applications essentially consist of two parts: the functionality described by the code, and the data that is manipulated by the code.
The in-memory storage and management of data is a key part of any programming language and environment.
Within a single application, the programmer is free to decide how the data is stored and represented.
Problem - An application must exchange data with another application.
An intermediary storage medium, such as a database, can be used. But what if the data is to be exchanged directly between two applications, or the applications cannot access the same database?
In this case, the data must be encoded in some particular format as it is produced. This has often resulted in the creation of application-specific data formats. These formats can be text-based, such as HTML, which encodes how to display the encapsulated data, or binary, such as those used for sending remote procedure calls.
Problem - In either case, the data representation tends to lack flexibility, which causes problems when versions change or when data needs to be exchanged between disparate applications, frequently from different vendors.
XML USAGE
XML was developed to address these issues. XML is written in plain text, uses self-describing elements, and provides a data encoding format that is:
- Generic
- Simple
- Flexible
- Extensible
- Portable
XML offers a method of putting structured data in a text file. Structured data conforms to a particular format; examples are spreadsheets, address books, configuration parameters, and financial transactions.
Plain text provides a software- and hardware-independent way of storing data, making it easier to create data that different applications can share.
Exchanging data as XML greatly reduces the complexity of data exchange, since the data can be read by different, otherwise incompatible applications.
When upgrading to a new system, large volumes of data must be converted, and incompatible data is often lost. XML data is stored in plain text format. This makes it easier to expand or upgrade to new systems without losing data.
With XML, data can be available to all kinds of "reading machines" (handheld computers, voice machines, news feeds, etc.).
OVERVIEW OF XML
An XML document consists of elements; each element has a start tag, content, and an end tag. An XML document must have exactly one root element, i.e. one tag that encloses all the remaining tags. XML is case-sensitive, and a document is required to be well-formed. The following conditions must be satisfied for a document to be well-formed:
- The document starts with a prolog (the XML declaration, which XML 1.0 recommends but does not strictly require).
- Every tag has a closing tag.
- All tags are completely nested.
An XML document is valid if it is well-formed, contains a link to an XML schema, and is valid according to that schema.
The following is a well-formed XML file:
    <?xml version="1.0"?>
    <!-- This is a comment -->
    <address>
        <name>Lars </name>
        <street> Test </street>
        <telephone number="0123"/>
    </address>
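The well-formedness rules above can be checked mechanically. The following is a minimal sketch that uses the parser bundled with the JDK and assumes the document is available as a string; the class and method names (WellFormedCheck, isWellFormed) are illustrative and not part of the presentation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Raised for unclosed or badly nested tags, among other problems.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String good = "<?xml version=\"1.0\"?><address><name>Lars</name>"
                + "<telephone number=\"0123\"/></address>";
        String bad = "<address><name>Lars</address>"; // <name> is never closed
        System.out.println(isWellFormed(good));
        System.out.println(isWellFormed(bad));
    }
}
```

This only tests well-formedness; validating against a schema would need additional configuration that is not shown here.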
PARSING XML
Java offers several ways to access XML. The following is a short overview of the available methods.
Document Object Model or DOM
- Defines a mechanism for accessing and manipulating well-formed XML.
- Using the DOM API, the XML document is transformed into a tree structure in memory. The application then navigates the tree to read the document.
- If the document is large, it can place a strain on system resources.
Simple API For XML or SAX
- Defines event-based XML parsing methods: the SAX parser streams a series of events while it reads the document. These events are forwarded to event handlers, which also provide access to the data of the document.
- Consumes very little memory, since the XML does not have to be loaded into memory all at once.
- The application needs to implement event handlers to handle the incoming events.
- SAX offers nothing like the DOM's element support, so the application has to keep track of the parser's position in the document hierarchy itself. The application logic gets harder as the document gets bigger and more complicated.
- The entire document need not be loaded into memory, but a SAX parser still has to parse the whole document, as a DOM parser does.
- It lacks built-in support for document navigation such as that provided by XPath. One-pass parsing also rules out random access.
Java API for XML Processing or JAXP
- It provides a common interface for creating and using SAX and DOM parsers in Java.
- It does not implement a parser itself, but defines the behavior that a parser must (at least) support. The actual parser has to extend these abstract classes and provide concrete implementations.
- It uses the Factory pattern to create a concrete class and then calls methods on it to parse. The DocumentBuilderFactory class is used for DOM parsing and SAXParserFactory is used for SAX parsing.
Traversing the DOM using JAXP:
1. Instantiate a factory class.
2. Using the factory class, instantiate the provider class.
3. Using the provider class created in the previous step, perform the XML processing/parsing.
    DocumentBuilderFactory factoryBuilder = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factoryBuilder.newDocumentBuilder();
    Document doc = builder.parse(fileName);
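Once the Document has been obtained, the application walks the tree itself. The following is a minimal sketch of such a traversal, assuming the XML is supplied as a string rather than a file; DomSketch, elementNames and collect are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomSketch {
    // Parses the XML text into a DOM tree and returns the element names in document order.
    static List<String> elementNames(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        List<String> names = new ArrayList<>();
        collect(doc.getDocumentElement(), names);
        return names;
    }

    // Depth-first walk: record element nodes, then recurse into all children.
    private static void collect(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(node.getNodeName());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), names);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<address><name>Lars</name><street>Test</street></address>";
        System.out.println(elementNames(xml)); // prints [address, name, street]
    }
}
```

Because the whole tree is in memory, the walk can visit nodes in any order and as often as needed, which is the random access that SAX does not offer.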
SAX Parsing using JAXP
With the DOM parser, responsibility is passed to the parser to parse the XML document and return the DOM Document object.
For SAX, the approach is the opposite. We call the parse method and pass it a handler object; this handler receives notifications about parsing progress, errors encountered, and so on.
    SAXParserFactory factorySAX = SAXParserFactory.newInstance();
    SAXParser sax = factorySAX.newSAXParser();
    DefaultHandler handler = new XMLParser();
    sax.parse(inputStream, handler);
The major differences are in the parse function: first, it doesn't return a Document object and, second, we need to supply an instance of a DefaultHandler-derived class (XMLParser here). If an in-memory model of the document is needed, the handler class has to build it up itself.
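The slide does not show what a DefaultHandler-derived class looks like. The following is a minimal sketch of one, assuming the goal is to record the text of each element together with its position in the hierarchy; SaxSketch and TextHandler are hypothetical names and are not the XMLParser class used above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxSketch {
    // Hypothetical handler: records "path=text" for every element that has text content.
    static class TextHandler extends DefaultHandler {
        final List<String> entries = new ArrayList<>();
        private final Deque<String> open = new ArrayDeque<>(); // names of currently open elements
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            open.addLast(qName);
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one text run, so accumulate.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            if (!value.isEmpty()) {
                entries.add(String.join("/", open) + "=" + value);
            }
            text.setLength(0);
            open.removeLast();
        }
    }

    static List<String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TextHandler handler = new TextHandler();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.entries;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<address><name>Lars</name><street>Test</street></address>";
        System.out.println(parse(xml)); // prints [address/name=Lars, address/street=Test]
    }
}
```

The stack of open element names is the bookkeeping the SAX overview refers to: the parser does not report where it is in the document, so the handler tracks it.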
Java Architecture for XML Binding or JAXB
DOM is a useful API for building and transforming XML documents in memory. Unfortunately, DOM is somewhat slow and resource hungry. To address these problems, the Java Architecture for XML Binding (JAXB) was developed.
JAXB provides a mechanism that simplifies the creation and maintenance of XML-enabled Java applications. It does this by using an XML schema compiler (only DTDs and a subset of XML schemas and namespaces at the time of this writing) that translates XML DTDs into one or more Java classes, thereby relieving the developer of the burden of writing complex parsing code.
The generated classes handle all the details of XML parsing and formatting, including code to perform error and validity checking of incoming and outgoing XML documents, which ensures that only valid, error-free XML is accepted.
Because the code is generated for a specific schema, the generated classes can be more efficient than a generic SAX or DOM parser. Most important, a JAXB parser often requires a much smaller memory footprint than a generic parser.
Classes created with JAXB do not include tree-manipulation capability, which is one factor that contributes to the small memory footprint of a JAXB object tree.
JAXB consists of two main components:
- The binding compiler, which binds a given XML schema to a set of generated Java classes.
- The binding runtime framework, which provides unmarshalling, marshalling, and validation functionality.
Unmarshalling an XML document
Unmarshalling is the process of converting an XML document into a corresponding set of Java objects.
The first step is to create a JAXBContext object, which is the starting point for marshalling, unmarshalling, and validation operations.
    JAXBContext jaxbContext = JAXBContext.newInstance("com.xmlparsers.jaxb.xsd.marketerprofile");
To unmarshal an XML document, create an Unmarshaller from the context:
    Unmarshaller unmarshaller = jaxbContext.createUnmarshaller();
The unmarshaller returns the unmarshalled object:
    CreateCustomerProfileResponse profileElement = (CreateCustomerProfileResponse) unmarshaller.unmarshal(
            new File("src/com/xmlparsers/jaxb/xsd/CIMMarketerProfile.xml"));
    String marketerProfile = profileElement.getCustomerProfileId();
Marshalling an XML document
Marshalling involves transforming Java objects into XML format.
    // Build the object tree to be written out
    MessageType msgType = new MessageType();
    msgType.setCode("0");
    msgType.setText("Successful");
    MessagesType msgTypes = new MessagesType();
    msgTypes.setResultCode("OK");
    msgTypes.getMessageType().add(msgType);
    CreateCustomerProfileResponse marketerProfile = new CreateCustomerProfileResponse();
    marketerProfile.getMessagesType().add(msgTypes);
    marketerProfile.setCustomerProfileId("21345678");
    // Marshal the object tree to System.out as formatted XML
    JAXBContext context = JAXBContext.newInstance(CreateCustomerProfileResponse.class);
    Marshaller m = context.createMarshaller();
    m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
    m.marshal(marketerProfile, System.out);
Use JAXB when you want to:
- Access data in memory, but do not need tree manipulation capabilities
- Process only data that is valid
- Convert data to different types
- Generate classes based on a DTD or XML schema
- Build object representations of XML data
Use JAXP when you want to:
- Have flexibility with regard to the way you access the data, either serially with SAX or randomly in memory with DOM
- Use the same processing code with documents based on different DTDs
- Parse documents that are not necessarily valid
- Apply XSLT transformations
- Insert or remove components from an in-memory XML tree
Streaming API For XML or StAX
Traditionally, XML APIs are either:
- Tree based - the entire document is read into memory as a tree structure for random access by the calling application
- Event based - the application registers to receive events as entities are encountered within the source document
Tree-based APIs are less memory-efficient. When memory is a concern, a streaming API is preferred: it uses much less memory since it doesn't have to hold the entire document in memory at once, and because it processes the document in small pieces it is often faster.
SAX is one such event-based streaming API; it pushes data into the application. Push parsers feed the content of the document to the application as soon as they see it, whether the application is ready to receive that data or not.
StAX was designed as a middle ground between these two opposites. The programmatic entry point is a cursor that represents a point within the document. The application moves the cursor forward, 'pulling' the information from the parser as it needs it.
Pull API has the following advantages:
- Pull APIs are a more comfortable alternative for streaming processing of XML.
- A pull API is based around the more familiar Iterator design pattern rather than the less well-known Observer design pattern.
- In a pull API, the client program asks the parser for the next piece of information rather than the parser telling the client program when the next datum is available.
- In a pull API the client program drives the parser, whereas in a push API the parser drives the client.
Why StAX?
- StAX shares with SAX the ability to read arbitrarily large documents.
- However, in StAX the application is in control rather than the parser. The application tells the parser when it wants to receive the next data chunk, rather than the parser telling the application.
- StAX goes beyond SAX by allowing programs both to read existing XML documents and to create new ones. Unlike SAX, StAX is a bidirectional API.
Reading XML with StAX:
XMLStreamReader is the key interface in StAX.
- This interface represents a cursor that is moved across an XML document from beginning to end.
- At any given time, this cursor points at one event: text node, start-tag, comment, etc.
- The cursor always moves forward, never backward, and normally only moves one item at a time.
- Methods like getName and getText can be invoked to retrieve information.
A typical StAX program begins by using the XMLInputFactory class to load an implementation-dependent instance of XMLStreamReader.
    InputStream in = new FileInputStream(new File("src/com/xmlparsers/jaxb/xsd/CIMMarketerProfile.xml"));
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader staxParser = factory.createXMLStreamReader(in);
    while (staxParser.hasNext()) {
        int event = staxParser.next();
        if (event == XMLStreamConstants.END_DOCUMENT) {
            staxParser.close();
            break;
        }
        if (event == XMLStreamConstants.START_ELEMENT) {
            System.out.println(staxParser.getLocalName());
        }
    }
The advantage of StAX parsing over SAX parsing is that a parse event may be skipped by invoking the next() method, as shown in the following code. For example, if the parse event is of type START_ELEMENT, a developer may decide whether to obtain the event information or to retrieve the next event:
    if (event == XMLStreamConstants.START_ELEMENT) {
        System.out.println(staxParser.getLocalName());
    }
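Because the application drives the cursor, it can also stop pulling as soon as it has what it needs. The following is a minimal sketch of that idea, assuming the XML is supplied as a string and the target element contains only text; StaxReadSketch and firstText are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxReadSketch {
    // Returns the text of the first element with the given name, or null if there is none.
    static String firstText(String xml, String elementName) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // getElementText() expects a text-only element and throws otherwise.
                    return reader.getElementText();
                }
                // Every other event is skipped simply by looping to next().
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<createCustomerProfileResponse><messages><resultCode>Ok</resultCode></messages>"
                + "<customerProfileId>1103042</customerProfileId></createCustomerProfileResponse>";
        System.out.println(firstText(xml, "customerProfileId")); // prints 1103042
    }
}
```

With SAX the parser would keep delivering events to the handler until the document ends; here the method returns as soon as the element is found.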
Writing with StAX
    // An XMLStreamWriter is obtained from an XMLOutputFactory
    XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
    XMLStreamWriter writer = outputFactory.createXMLStreamWriter(System.out);
    // Start the document with the writeStartDocument() method
    writer.writeStartDocument("UTF-8", "1.0");
    writer.writeComment("Testing with StAX ");
    // Output the start of the root element using the writeStartElement() method
    writer.writeStartElement("createCustomerProfileResponse");
    writer.writeNamespace("xsi", "http://www.w3.org/2001/XMLSchema-instance");
    writer.writeStartElement("messages");
    writer.writeStartElement("resultCode");
    writer.writeCharacters("Ok");
    writer.writeEndElement(); // resultCode
    writer.writeStartElement("message");
    writer.writeStartElement("code");
    writer.writeCharacters("I00001");
    writer.writeEndElement(); // code
    writer.writeStartElement("text");
    writer.writeCharacters("Successful");
    writer.writeEndElement(); // text
    writer.writeEndElement(); // message
    writer.writeEndElement(); // messages
    writer.writeStartElement("customerProfileId");
    writer.writeCharacters("1103042");
    writer.writeEndElement(); // customerProfileId
    writer.writeEndElement(); // createCustomerProfileResponse
    writer.writeEndDocument();
    writer.flush();
    writer.close();
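The same writer calls can target an in-memory buffer instead of System.out, which makes the result easy to inspect or pass on. The following is a minimal sketch under that assumption, with a shortened version of the document; StaxWriteSketch and buildResponse are hypothetical names.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

class StaxWriteSketch {
    // Builds a small response document in memory and returns it as a string.
    static String buildResponse(String resultCode, String profileId) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument("1.0");
        writer.writeStartElement("createCustomerProfileResponse");
        writer.writeStartElement("messages");
        writer.writeStartElement("resultCode");
        writer.writeCharacters(resultCode); // markup characters in the text are escaped
        writer.writeEndElement(); // resultCode
        writer.writeEndElement(); // messages
        writer.writeStartElement("customerProfileId");
        writer.writeCharacters(profileId);
        writer.writeEndElement(); // customerProfileId
        writer.writeEndDocument(); // also closes the root element, which is still open
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildResponse("Ok", "1103042"));
    }
}
```

Every writeStartElement call needs a matching writeEndElement at the right depth; the writer does not check the intended nesting, so a missing call silently places later elements inside the wrong parent.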
XPATH
XPath (XML Path Language) is an expression language for addressing portions of an XML document or navigating within it.
- XPath is really helpful for parsing XML-based configuration or properties files.
- XPath uses path expressions to select nodes or node-sets in an XML document. These path expressions look very much like URLs and traditional file system paths.
- XPath also supports functions for string manipulation, comparison, and other operations.
- XML documents are treated as trees of nodes, and the root of the tree is called the document node or root node.
- There are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and root nodes.
Let us consider the following XML sample:
    <createCustomerProfileResponse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <messages>
            <resultCode>Ok</resultCode>
            <message>
                <code>I00001</code>
                <text>Successful.</text>
            </message>
        </messages>
        <customerProfileId>1103042</customerProfileId>
    </createCustomerProfileResponse>
- The root element is <createCustomerProfileResponse>.
- <messages> and <customerProfileId> are its two child elements.
- The <resultCode> node is a child of the <messages> element.
- The resultCode value Ok is a text node.
XPATH Path Expression syntax
Expression   Description
nodename     Selects child elements of the current node with this name
/            Selects from the root node
//           Selects nodes from the current node that match the selection, no matter where they are
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
*            Matches any element node
@*           Matches any attribute node
node()       Matches any node of any kind
XPATH Reading XML
    // Read the whole file into a string
    InputStream resultStream = new FileInputStream(new File("src/com/xmlparsers/jaxb/xsd/CIMMarketerProfile.xml"));
    java.io.BufferedReader aReader = new java.io.BufferedReader(new java.io.InputStreamReader(resultStream, "UTF-8"));
    StringBuffer aResponse = new StringBuffer();
    String aLine = aReader.readLine();
    while (aLine != null) {
        aResponse.append(aLine);
        aLine = aReader.readLine();
    }
    resultStream.close();
    // Strip a leading byte order mark (U+FEFF), if present
    if (aResponse.length() > 0 && (int) aResponse.charAt(0) == 0xFEFF) {
        aResponse.deleteCharAt(0);
    }
    // Parse the text into a DOM document and evaluate an XPath expression against it
    javax.xml.parsers.DocumentBuilder docBuilder = javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder();
    java.io.StringReader stringReader = new java.io.StringReader(aResponse.toString());
    org.w3c.dom.Document doc = docBuilder.parse(new org.xml.sax.InputSource(stringReader));
    javax.xml.xpath.XPath xpath = javax.xml.xpath.XPathFactory.newInstance().newXPath();
    String customerProfileId = xpath.evaluate("/*/customerProfileId/text()", doc);
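To see how the path expressions from the syntax table behave, they can be evaluated against the sample document. The following is a minimal sketch, assuming the sample is embedded as a string (without the namespace declaration, for brevity); XPathSketch, select and count are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {
    static final String SAMPLE =
            "<createCustomerProfileResponse>"
            + "<messages><resultCode>Ok</resultCode>"
            + "<message><code>I00001</code><text>Successful.</text></message></messages>"
            + "<customerProfileId>1103042</customerProfileId>"
            + "</createCustomerProfileResponse>";

    private static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the string value of the first node the expression selects ("" if none).
    static String select(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, parse(xml));
    }

    // Returns how many nodes the expression selects.
    static int count(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, parse(xml), XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Absolute path starting at the root
        System.out.println(select(SAMPLE, "/createCustomerProfileResponse/customerProfileId")); // 1103042
        // "//" matches at any depth
        System.out.println(select(SAMPLE, "//code")); // I00001
        // "*" matches any element: the children of <messages>
        System.out.println(count(SAMPLE, "/*/messages/*")); // 2
    }
}
```

Each call here reparses the document for simplicity; an application evaluating many expressions would normally parse once and reuse the Document.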