FAQ
- What does BOX stand for?
- What are the goals of BOX?
- What license governs the use of BOX?
- Is this only for Java?
- Shouldn't we just pass standard XML?
- Hasn't someone already done this?
- How much smaller is it?
- How much faster is it?
- Would it be even faster to zip BOX?
- What kinds of XML documents does BOX handle best?
- Why doesn't BOX support feature XYZ?
- Why is Xerces required?
- Why is Xalan required?
- Why are attributes written out before the element in which they appear?
- Why are namespace declarations written out before the element in which they appear?
- My question wasn't answered! Who do I ask?
What does BOX stand for?
Binary Optimized XML
What are the goals of BOX?
- Simplicity
- Compactness
- Speed
- Multi-language support
What license governs the use of BOX?
Currently the GNU Lesser General Public License (LGPL) is used.
The goal is to require improvements made to BOX to be published.
If you find this license too restrictive, please let us know
and we'll consider changing it.
Is this only for Java?
Currently yes. However, for BOX to be truly successful we need
implementations in a variety of programming languages.
Maybe you can create an implementation in another language
and contribute it!
Hasn't someone already done this?
Yes and no. There have been other attempts to define more compact,
binary representations of XML. The most notable is perhaps the
WAP Binary XML Content Format.
The problem with previous approaches is that they are too complex.
BOX strives to be simple so that creating implementations in a
wide variety of programming languages is practical.
How much smaller is it?
For really small XML documents, the difference may not be significant.
In general, larger XML documents repeat text more.
Element and attribute names are repeated,
as are some of their text values.
This leads to greater compression.
The book "XML Bible" by Elliotte Rusty Harold comes with a CD that
contains a file called 1998fullstatistics.xml.
This contains statistics from the 1998 Major League Baseball season.
The file has a size of 1042K (1,066,769 bytes).
The BOX representation of this file is only 157K (160,672 bytes)
which is 15% of the original size.
For comparison, WinZip compresses it to 69K (70,258 bytes).
While this is smaller, it can't be processed as rapidly because it
has to be unzipped and parsed with an XML parser.
How much faster is it?
In tests comparing shipping this document across a socket connection
using both BOX encoding and XML text, BOX averaged 46% of the time
required to ship XML text.
The test parses the XML file into a DOM Document.
It then measures the time required to encode the DOM Document into BOX,
send it across the socket connection, and decode it back into a
DOM Document on the other side. The same is done with XML text.
One caveat to this result is the impact of HotSpot.
It introduces a "priming" effect that causes the first usage of BOX
and the XML parser to take considerably longer than subsequent requests.
BOX took 1203ms for the first run and
averaged 205ms for 25 runs after that.
XML took 1250ms for the first run and
averaged 458ms for 25 runs after that.
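The priming effect described above can be kept out of the average by timing the first run separately from the later ones. The following is a minimal sketch of that measurement pattern, not the actual BOX test code: the class name WarmupTiming, the Result type, and the placeholder workload in main are all hypothetical.

```java
final class WarmupTiming {
    /** First-run time and the mean of the later runs, both in nanoseconds. */
    static final class Result {
        final long firstNanos;
        final double laterMeanNanos;

        Result(long firstNanos, double laterMeanNanos) {
            this.firstNanos = firstNanos;
            this.laterMeanNanos = laterMeanNanos;
        }
    }

    /** Runs the task once (the "priming" run), then laterRuns more times. */
    static Result measure(Runnable task, int laterRuns) {
        if (laterRuns < 1) {
            throw new IllegalArgumentException("laterRuns must be at least 1");
        }
        long first = time(task);
        long total = 0;
        for (int i = 0; i < laterRuns; i++) {
            total += time(task);
        }
        return new Result(first, (double) total / laterRuns);
    }

    private static long time(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Placeholder workload (an assumption); a real test would encode,
        // send, and decode a document here.
        Runnable work = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        };
        Result r = measure(work, 25);
        System.out.printf("first run: %.1f ms, later mean: %.1f ms%n",
                r.firstNanos / 1e6, r.laterMeanNanos / 1e6);
    }
}
```

Reporting the two numbers separately, as the figures above do, makes it clear how much of the first-run cost is JVM warm-up rather than encoding or parsing work.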
Would it be even faster to zip BOX?
Somewhat surprisingly, no! To test this, the sendBOX method in
test.socket.Client was changed to use a ZipOutputStream on top of
the BufferedOutputStream used to send BOX across a socket connection.
The run method in test.socket.BOXServer was changed to use a
ZipInputStream for reading the zipped BOX sent from the Client.
This was not only slower than sending BOX, it was also slower than
sending plain XML!
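The stream layering used in that experiment can be sketched with in-memory streams standing in for the socket. This is only an illustration of a ZipOutputStream placed on top of a BufferedOutputStream and the matching ZipInputStream on the reading side. The class name ZipLayerSketch and the entry name "payload" are assumptions, and the sketch does not reproduce the timing result.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipLayerSketch {
    /** Writes the payload as a single zip entry through a buffered stream. */
    static byte[] zip(byte[] payload) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ZipOutputStream zos =
                 new ZipOutputStream(new BufferedOutputStream(sink))) {
            zos.putNextEntry(new ZipEntry("payload")); // entry name is arbitrary
            zos.write(payload);
            zos.closeEntry();
        }
        return sink.toByteArray();
    }

    /** Reads back the first zip entry. */
    static byte[] unzip(byte[] zipped) throws IOException {
        try (ZipInputStream zis =
                 new ZipInputStream(new ByteArrayInputStream(zipped))) {
            if (zis.getNextEntry() == null) {
                throw new IOException("no zip entry found");
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = zis.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

Both directions add a compression or decompression pass on top of the BOX encoding itself, which is the extra work the experiment above was measuring.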
What kinds of XML documents does BOX handle best?
BOX works best on XML documents that have lots of repeated text.
This includes element names, text outside tags,
attribute names and values, comments, namespace prefixes and URIs,
and processing instruction targets and values.
Shouldn't we just pass standard XML?
XML is a great data format.
Some of its greatest strengths are that it is human-readable
and it can be processed by a large number of languages
on a large number of platforms.
While BOX is not really human-readable,
it can be easily decoded into standard XML for human consumption.
The advantages of BOX over XML are that it is more compact and
it can be parsed into DOM objects or SAX events
without using an XML parser,
faster than an XML parser can parse standard XML.
BOX may be ideal as a replacement for XML in SOAP messages.
Wireless devices with limited space for software and
limited processing speed would also benefit.
Why doesn't BOX support feature XYZ?
Perhaps it's an oversight.
Let us know what is missing and we'll consider adding it.
It's also possible that the feature was intentionally omitted
because adding it would increase the complexity of the implementation.
Simplicity is deemed very important.
We want to make it as easy as possible to develop
BOX encoding and decoding implementations in a large number of languages.
We don't want BOX to be limited to Java-to-Java applications.
This is key to achieving widespread acceptance of BOX
as an alternative to XML.
Why is Xerces required?
Unit tests use classes in the "samples" directory that ships with Xerces
in order to compare the canonical versions of XML documents.
The source for the classes used has been copied into the BOX source tree
and has been modified slightly.
Those classes depend on classes in xerces.jar.
This is the only dependency on Xerces.
Otherwise an alternate parser such as Crimson could be used.
Why is Xalan required?
Xalan contains an implementation of JAXP.
The following JAXP packages are used by BOX:
- javax.xml.parsers
- javax.xml.transform
- javax.xml.transform.dom
- javax.xml.transform.sax
- javax.xml.transform.stream
JDK 1.4 includes xalan.jar (in rt.jar within the JRE directory)
so it only needs to be downloaded and explicitly added to CLASSPATH
when working with an earlier version of the JDK.
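As an illustration of what several of these packages provide, the sketch below parses XML text into a DOM Document with javax.xml.parsers and writes it back out as text with an identity Transformer. It uses only standard JAXP calls and does not show BOX's own code. The class name JaxpRoundTrip and its method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class JaxpRoundTrip {
    /** Parses XML text into a DOM Document (javax.xml.parsers). */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Serializes a DOM Document to text with an identity transform. */
    static String serialize(Document doc) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<team city=\"St. Louis\"><player>Mark</player></team>");
        System.out.println(serialize(doc));
    }
}
```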
Why are attributes written out before the element in which they appear?
When BOX is being decoded into SAX events,
all the attributes of an element must be known before the
ContentHandler startElement method can be called.
If the attribute data appeared after the corresponding element data,
we'd have to hold onto the element name and call startElement when
any node type other than an attribute was encountered.
Coding this is cumbersome.
I know because that's the way I originally wrote it! ;-)
In addition, since namespace declarations appear as attributes,
the rationale explained in the next question applies.
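The benefit of this ordering can be seen in a small decoder sketch. The token model below (Kind, Token) is purely hypothetical and is not the BOX wire format. It only shows that when attributes arrive first, the decoder can call startElement as soon as it sees the element, with no deferred state.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

final class AttributesFirstSketch {
    enum Kind { ATTRIBUTE, START_ELEMENT, END_ELEMENT }

    /** Hypothetical decoded token; value is only used for attributes. */
    static final class Token {
        final Kind kind;
        final String name;
        final String value;

        Token(Kind kind, String name, String value) {
            this.kind = kind;
            this.name = name;
            this.value = value;
        }
    }

    /** Fires SAX events for a token list in which attributes precede their element. */
    static void decode(List<Token> tokens, ContentHandler handler) throws SAXException {
        AttributesImpl pending = new AttributesImpl();
        for (Token t : tokens) {
            switch (t.kind) {
                case ATTRIBUTE:
                    // Collect attributes until their element shows up.
                    pending.addAttribute("", t.name, t.name, "CDATA", t.value);
                    break;
                case START_ELEMENT:
                    // All attributes are already known, so fire immediately.
                    handler.startElement("", t.name, t.name, pending);
                    pending = new AttributesImpl();
                    break;
                case END_ELEMENT:
                    handler.endElement("", t.name, t.name);
                    break;
            }
        }
    }

    /** Records the events produced by decode, for inspection. */
    static List<String> trace(List<Token> tokens) throws SAXException {
        List<String> events = new ArrayList<>();
        decode(tokens, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                StringBuilder sb = new StringBuilder("start " + qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    sb.append(' ').append(atts.getQName(i)).append('=').append(atts.getValue(i));
                }
                events.add(sb.toString());
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                events.add("end " + qName);
            }
        });
        return events;
    }
}
```

With the opposite ordering, the START_ELEMENT case would have to stash the element name and every other case would need to check whether a deferred startElement call is still outstanding.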
Why are namespace declarations written out before the element in which they appear?
Namespace declarations appear as attributes of an element.
Namespaces defined in the start tag of an element
are in effect for that element.
This means that a namespace prefix defined as
an attribute of an element can be used on that element.
For example,
<foo:myElement
xmlns:foo="http://www.ociweb.com/myNamespace">
Writing out information on namespace declarations
before the element in which they are defined
allows the element to be processed when it is encountered
in a stream of BOX data instead of waiting until
all its attributes are processed.
In addition, when BOX is being decoded into SAX events,
this makes it easy to call the ContentHandler startPrefixMapping method
before the startElement method.
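The event order that SAX consumers expect can be observed with an ordinary namespace-aware SAX parser. The sketch below logs the two callbacks for a document like the example above. It uses only standard JAXP and SAX calls and says nothing about how BOX itself is implemented. The class name PrefixMappingOrder is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class PrefixMappingOrder {
    /** Parses the XML text and returns the callbacks in the order they fired. */
    static List<String> events(String xml) throws Exception {
        List<String> log = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // required for prefix-mapping events
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startPrefixMapping(String prefix, String uri) {
                log.add("startPrefixMapping " + prefix + " -> " + uri);
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                log.add("startElement " + localName);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return log;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<foo:myElement xmlns:foo=\"http://www.ociweb.com/myNamespace\"/>";
        for (String event : events(xml)) {
            System.out.println(event);
        }
    }
}
```

A decoder that reads the namespace declaration before the element can emit the same sequence in a single pass, without buffering the element.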
My question wasn't answered! Who do I ask?
Send it to Mark.
He'll answer you directly and consider adding it to the FAQ.