BOX FAQ

FAQ

What does BOX stand for?
What are the goals of BOX?
What license governs the use of BOX?
Is this only for Java?
Shouldn't we just pass standard XML?
Hasn't someone already done this?
How much smaller is it?
How much faster is it?
Would it be even faster to zip BOX?
What kinds of XML documents does BOX handle best?
Why doesn't BOX support feature XYZ?
Why is Xerces required?
Why is Xalan required?
Why are attributes written out before the element in which they appear?
Why are namespace declarations written out before the element in which they appear?
My question wasn't answered! Who do I ask?

What are the goals of BOX?

Simplicity
Compactness
Speed
Multi-language support

What license governs the use of BOX?

Currently the GNU Lesser General Public License (LGPL) is used. The goal is to require improvements made to BOX to be published. If you find this license too restrictive, please let us know and we'll consider changing it.

Is this only for Java?

Currently yes. However, for BOX to be truly successful we need implementations in a variety of programming languages. Maybe you can create an implementation in another language and contribute it!

Hasn't someone already done this?

Yes and no. There have been other attempts to define more compact, binary representations of XML. The most notable is perhaps the WAP Binary XML Content Format. The problem with previous approaches is that they are too complex. BOX strives to be simple so that creating implementations in a wide variety of programming languages is practical.

How much smaller is it?

For really small XML documents, the difference may not be significant. In general, larger XML documents repeat text more. Element and attribute names are repeated, as are some of their text values. This leads to greater compression. The book "XML Bible" by Elliotte Rusty Harold comes with a CD that contains a file called 1998fullstatistics.xml. This contains statistics from the 1998 Major League Baseball season. The file has a size of 1042K (1,066,769 bytes). The BOX representation of this file is only 157K (160,672 bytes) which is 15% of the original size.

For comparison, WinZip compresses it to 69K (70,258 bytes). While this is smaller, it can't be processed as rapidly because it has to be unzipped and then parsed with an XML parser.
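The percentages follow directly from the byte counts quoted above. A minimal sketch that reproduces the arithmetic (the class and method names are illustrative and not part of BOX):

```java
// Illustrative only: reproduces the size arithmetic from the byte counts quoted above.
class SizeComparison {
    // Returns the encoded size as a percentage of the original size.
    static double percentOfOriginal(long encodedBytes, long originalBytes) {
        return 100.0 * encodedBytes / originalBytes;
    }

    public static void main(String[] args) {
        long xmlBytes = 1_066_769L; // 1998fullstatistics.xml as XML text
        long boxBytes = 160_672L;   // BOX representation
        long zipBytes = 70_258L;    // WinZip output
        System.out.println("BOX: " + percentOfOriginal(boxBytes, xmlBytes) + "% of original");
        System.out.println("Zip: " + percentOfOriginal(zipBytes, xmlBytes) + "% of original");
    }
}
```

With these inputs BOX comes out at roughly 15% of the original size and the zipped file at roughly 6.6%.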

How much faster is it?

In tests that shipped this document across a socket connection using both BOX encoding and XML text, BOX averaged 46% of the time required to ship XML text. The test parses the XML file into a DOM Document. It then measures the time required to encode the DOM Document into BOX, send it across the socket connection, and decode it back into a DOM Document on the other side. The same is done with XML text.

One caveat to this result is the impact of HotSpot. It introduces a "priming" effect that causes the first usage of BOX and the XML parser to take considerably longer than subsequent requests. BOX took 1203ms for the first run and averaged 205ms for 25 runs after that. XML took 1250ms for the first run and averaged 458ms for 25 runs after that.
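One common way to keep the priming effect out of an average is to run the task a few times untimed before measuring. The following is a sketch of that idea only; it is not the harness that produced the numbers above, and the class name, method name, and placeholder workload are assumptions made for illustration.

```java
// A sketch of a warm-up-aware timing loop (hypothetical helper, not part of BOX).
class WarmupTimer {
    // Runs the task warmupRuns times untimed, then returns the mean time in
    // milliseconds over measuredRuns timed runs.
    static double averageMillis(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // gives HotSpot a chance to compile hot paths first
        }
        long totalNanos = 0;
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / measuredRuns;
    }

    public static void main(String[] args) {
        // Placeholder workload; substitute the encode/send/decode round trip.
        Runnable work = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        };
        System.out.println(averageMillis(work, 1, 25) + " ms on average");
    }
}
```

Using one warm-up run and 25 measured runs mirrors the "first run, then 25 runs" split described above.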

Would it be even faster to zip BOX?

Somewhat surprisingly, no! To test this, the sendBOX method in test.socket.Client was changed to use a ZipOutputStream on top of the BufferedOutputStream used to send BOX across a socket connection. The run method in test.socket.BOXServer was changed to use a ZipInputStream for reading the zipped BOX sent from the Client. This was not only slower than sending BOX, it was also slower than sending plain XML!
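For readers who want to repeat the experiment, here is a rough sketch of layering a ZipOutputStream over a BufferedOutputStream and reading the result back with a ZipInputStream. It is not the code from test.socket.Client or test.socket.BOXServer: a ByteArrayOutputStream stands in for the socket stream, and the class name and the entry name "payload" are assumptions.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical stand-in for the zipped send/receive path described above.
class ZipWrapSketch {
    static byte[] zip(byte[] payload) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the socket
        try (ZipOutputStream zos = new ZipOutputStream(new BufferedOutputStream(sink))) {
            zos.putNextEntry(new ZipEntry("payload")); // a zip stream needs an entry before any write
            zos.write(payload);
            zos.closeEntry();
        }
        return sink.toByteArray();
    }

    static byte[] unzip(byte[] zipped) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipped))) {
            if (zis.getNextEntry() == null) {
                throw new IOException("no zip entry found");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = zis.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

The extra compression and decompression work happens on every message, which is consistent with the slowdown reported above.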

What kinds of XML documents does BOX handle best?

BOX works best on XML documents that have lots of repeated text. This includes element names, text outside tags, attribute names and values, comments, namespace prefixes and URIs, and processing instruction targets and values.
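This FAQ does not describe the BOX wire format, so the following is only a generic illustration of why repetition helps a dictionary-style encoding: each distinct string is stored once and later occurrences are replaced by a small index. The class and method names are hypothetical and should not be read as BOX's actual design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Generic string-table sketch; NOT the actual BOX encoding.
class StringTableSketch {
    private final Map<String, Integer> indexOf = new LinkedHashMap<>();

    // Returns the index for s, assigning the next free index the first time s is seen.
    int intern(String s) {
        Integer index = indexOf.get(s);
        if (index == null) {
            index = indexOf.size();
            indexOf.put(s, index);
        }
        return index;
    }

    // Number of distinct strings stored so far.
    int distinctCount() {
        return indexOf.size();
    }

    public static void main(String[] args) {
        StringTableSketch table = new StringTableSketch();
        String[] names = {"player", "name", "player", "name", "team"};
        for (String name : names) {
            System.out.println(name + " -> " + table.intern(name));
        }
        System.out.println(names.length + " strings, " + table.distinctCount() + " stored");
    }
}
```

The more often a string recurs, the more occurrences are reduced to an index, which is why documents with heavy repetition benefit most.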

Shouldn't we just pass standard XML?

XML is a great data format. Some of its greatest strengths are that it is human-readable and it can be processed by a large number of languages on a large number of platforms. While BOX is not really human-readable, it can be easily decoded into standard XML for human consumption. The advantages of BOX over XML are that it is more compact and it can be parsed into DOM objects or SAX events without using an XML parser, faster than an XML parser can parse standard XML. BOX may be ideal as a replacement for XML in SOAP messages. Wireless devices with limited space for software and limited processing speed would also benefit.

Why doesn't BOX support feature XYZ?

Perhaps it's an oversight. Let us know what is missing and we'll consider adding it. It's also possible that the feature was intentionally omitted because adding it would increase the complexity of the implementation. Simplicity is deemed very important. We want to make it as easy as possible to develop BOX encoding and decoding implementations in a large number of languages. We don't want BOX to be limited to Java-to-Java applications. This is key to achieving widespread acceptance of BOX as an alternative to XML.

Why is Xerces required?

Unit tests use classes in the "samples" directory that ships with Xerces in order to compare the canonical versions of XML documents. The source for the classes used has been copied into the BOX source tree and has been modified slightly. Those classes depend on classes in xerces.jar. This is the only dependency on Xerces. Otherwise an alternate parser such as Crimson could be used.

Why is Xalan required?

Xalan contains an implementation of JAXP.
The following JAXP packages are used by BOX:

javax.xml.parsers
javax.xml.transform
javax.xml.transform.dom
javax.xml.transform.sax
javax.xml.transform.stream

JDK 1.4 bundles the Xalan classes (in rt.jar within the JRE directory), so xalan.jar only needs to be downloaded and explicitly added to CLASSPATH when working with an earlier version of the JDK.
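As a rough picture of what those packages provide, the sketch below parses XML text into a DOM Document with javax.xml.parsers and writes it back out with an identity Transformer from javax.xml.transform. It is generic JAXP usage, not code taken from BOX, and the class and method names are assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Generic JAXP round trip (hypothetical helper, not part of BOX).
class JaxpRoundTrip {
    // Parses XML text into a DOM Document.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Serializes a DOM Document back to XML text with an identity transform.
    static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<team><player name=\"Ann\"/></team>");
        System.out.println(serialize(doc));
    }
}
```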

Why are attributes written out before the element in which they appear?

When BOX is being decoded into SAX events, all the attributes of an element must be known before the ContentHandler startElement method can be called. If the attribute data appeared after the corresponding element data, we'd have to hold onto the element name and call startElement when any node type other than an attribute was encountered. Coding this is cumbersome. I know because that's the way I originally wrote it! ;-)

In addition, since namespace declarations appear as attributes, the rationale explained in the next question applies.
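The benefit of the attributes-first ordering can be sketched as follows. This is not the BOX decoder; the class and its attribute and element methods are hypothetical stand-ins for "an attribute record was read" and "an element record was read". Because the attributes arrive first, the decoder can fire startElement the moment it sees the element, with nothing held back.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical decoder fragment; NOT the actual BOX implementation.
class AttributesFirstDecoder {
    private final ContentHandler handler;
    private final AttributesImpl pending = new AttributesImpl();

    AttributesFirstDecoder(ContentHandler handler) {
        this.handler = handler;
    }

    // Called for each attribute record; these arrive before their element.
    void attribute(String name, String value) {
        pending.addAttribute("", name, name, "CDATA", value);
    }

    // Called for the element record; every attribute is already known.
    void element(String name) throws SAXException {
        handler.startElement("", name, name, pending);
        pending.clear(); // ready for the next element's attributes
    }

    public static void main(String[] args) throws SAXException {
        AttributesFirstDecoder decoder = new AttributesFirstDecoder(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                System.out.println("startElement " + qName + " with " + atts.getLength() + " attribute(s)");
            }
        });
        decoder.attribute("name", "Ann");
        decoder.element("player");
    }
}
```

With the opposite ordering, element would have to remember the name and defer the startElement call until a non-attribute record appeared.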

Why are namespace declarations written out before the element in which they appear?

Namespace declarations appear as attributes of an element. Namespaces defined in the start tag of an element are in effect for that element. This means that a namespace prefix defined as an attribute of an element can be used on that element. For example,

<foo:myElement xmlns:foo="http://www.ociweb.com/myNamespace">

Writing out information on namespace declarations before the element in which they are defined allows the element to be processed when it is encountered in a stream of BOX data instead of waiting until all its attributes are processed.

In addition, when BOX is being decoded into SAX events, this makes it easy to call the ContentHandler startPrefixMapping method before the startElement method.
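The event order that a SAX consumer expects can be observed with an ordinary namespace-aware XML parser. The sketch below uses plain JAXP and SAX rather than BOX, and the class name is an assumption. It records the callbacks fired for the example element above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Records SAX callback order for a document (illustrative helper, not part of BOX).
class PrefixMappingOrder {
    static List<String> events(String xml) throws Exception {
        final List<String> log = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startPrefixMapping(String prefix, String uri) {
                log.add("startPrefixMapping " + prefix + " -> " + uri);
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("startElement {" + uri + "}" + localName);
            }
        });
        return log;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<foo:myElement xmlns:foo=\"http://www.ociweb.com/myNamespace\"/>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

The startPrefixMapping entry is logged before the startElement entry, which is the order a BOX decoder can reproduce directly when the namespace declaration precedes the element in the stream.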

My question wasn't answered! Who do I ask?

Send it to Mark. He'll answer you directly and consider adding it to the FAQ.