XML vs YAML vs JSON: A Study To Find Answers
XML is commonly used for web application messaging - sending information back to a browser from a web server, or sending information between web services. It's dead easy to do this and it works very well, hence XML has become the de facto choice for data exchange for web applications.
Alternatives such as YAML and JSON have found significant support in recent years. Both aim to be a more suitable alternative to XML in some cases.
How much interest is there in knowing which is best? Let's see.
- Google: xml vs yaml - 66 million
- Google: xml vs json - 323 thousand
- Google: yaml vs json - 1 million
Ok, so that's not an exact search. But it does suggest a huge amount of interest in a comparison between XML, YAML and JSON. (And no, Google, I didn't mean "xml vs xml" nor "yaml vs jason", but thanks anyway).
What's the problem?
XML might not be the best choice in all cases, but that's no revelation.
Dare Obasanjo referred to JSON as being "another nail in the coffin of XML on the Web".
Tim Bray solved the problem for us 2 years ago.
David Megginson decided it all ends up looking like XML when you add a little complexity, but did note that:
JSON [has] the important advantage [of making] the most trivial cases easy to represent.
James Bennett reminds us that JSON works:
because most people don't really need all that overhead, and because it's often possible to do really interesting things with really simple formats
Even 6 years ago David Mertz pointed out "some situations where YAML provides a better object serialization format than XML".
And, of course, Dustin Diaz informed the masses that JSON was not only fast but so easy it'll make you sick.
There's no end to the argument, but not much factual evidence either.
Ultimately, I think Jeff Atwood best sums up the gist of the issue.
I don't necessarily think XML sucks, but the mindless, blanket application of XML as a dessert topping and a floor wax certainly does. Like all tools, it's a question of how you use it.
So we know XML is not ideal, and JSON or YAML may be better in some cases. JSON might be faster, YAML might be better (and more beautiful).
But in what cases would you go for one instead of the other? What benefits might you see and where? I want cold hard facts, numbers, charts and answers.
Who knows?
How much academic research has been done in the field? Let's see what journal articles have been published that compare XML, YAML or JSON in any way.
- ACM Digital Library: zero, from about 250 thousand articles
- IEEE: zero, from about 2 million articles
Ok, I'm getting desperate. The ACM and the IEEE are not small. They should have at least something of relevance.
Searching ... searching ... searching ...
Nope, turns out the ACM and IEEE journal archives contain nothing of direct relevance. There's even one article that relates to a completely different YAML.
Google Scholar, can you help?
Well, there is one academic article that explicitly compares XML, YAML and JSON (PDF, 200 KB).
It seems that both YAML and JSON are faster to encode for up to about 5000 elements, then XML takes over. It also looks like both YAML and JSON require twice as much memory as XML when decoding. I couldn't determine whether the article relates this to real world performance (the article speaks Portuguese, I don't).
The point: not much academic research appears to have been undertaken and there's a huge amount of interest in some form of performance comparison.
There is no clear sign of any scientifically-arranged, repeatable, verifiable hard-evidence-based comparison. So I'll do just that.
Goodbye life for the next 2-3 months, and hello data object serialization formats for the new world.
What will be studied?
I'll run some tests to determine which of the three technologies offers the least:
- encode time
- decode time
- transmission time
- overhead
The tests will be strictly scientific - I'll be doing my best to remove or minimise any influencing factors. Everything is going to be precise, exact and - most importantly - repeatable.
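To give a rough idea of what such a measurement could look like, here is a minimal sketch in Python. It is not the test harness used in the study: the sample records, the function names and the XML layout are all made up for illustration. It times encoding and decoding for JSON and XML using only the standard library, and reports the encoded size in bytes as a crude stand-in for overhead and transmission cost. YAML is left out only because Python has no standard-library YAML parser.

```python
import json
import timeit
import xml.etree.ElementTree as ET

# Hypothetical sample payload (not data from the study): 1000 flat records.
records = [{"id": i, "name": f"item-{i}", "price": i * 1.5} for i in range(1000)]


def encode_json(rows):
    return json.dumps(rows)


def decode_json(text):
    return json.loads(text)


def encode_xml(rows):
    # Assumed layout: <records><record><id>..</id>...</record>...</records>
    root = ET.Element("records")
    for row in rows:
        rec = ET.SubElement(root, "record")
        for key, value in row.items():
            ET.SubElement(rec, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def decode_xml(text):
    # XML carries no type information here, so every value comes back as a string.
    root = ET.fromstring(text)
    return [{child.tag: child.text for child in rec} for rec in root]


def measure(encode, decode, rows, repeats=5):
    """Return best-of-N encode/decode times in seconds and the encoded size."""
    text = encode(rows)
    return {
        "encode_s": min(timeit.repeat(lambda: encode(rows), number=1, repeat=repeats)),
        "decode_s": min(timeit.repeat(lambda: decode(text), number=1, repeat=repeats)),
        "bytes": len(text.encode("utf-8")),
    }


results = {
    "json": measure(encode_json, decode_json, records),
    "xml": measure(encode_xml, decode_xml, records),
}
for name, stats in results.items():
    print(name, stats)
```

Taking the minimum of several runs is one common way to reduce noise from other processes. A real comparison would also need to control for the library, the language and the shape of the data, and would measure transmission time over an actual network rather than inferring it from size.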
The results themselves might excite or scare a small number of developers. For the benefit of the rest of the world, I'll also be looking into why this is actually useful.
- Will people notice that JSON is 607 ms faster?
- Will your web server explode less frequently if you use YAML to talk to web services?
Just to top it all off, I'll also look at whether we need to be sending string-based serialised data between web services and whether we might be better off opting for much much faster choices such as Google Protocol Buffers. And anything else along the way that may be relevant, time permitting.
I need your help!
I've set up some tests into the perception of time delays. I'd like to initiate some form of distributed stress testing on some web services. There will surely be plenty of tests and tasks that would benefit from a few minutes of everyone's time.
When will the results be ready?
This is part of my final year project, due at the end of April 2009. I'll have some results before then (I hope!) and will write up short pieces where possible. I'll try to make full and final results available after I finish my final exams, so that'll be some time around the end of June 2009.
Great!!! Waiting for April 2009 :)
Some things to consider:
- different language speeds when working with strings/XML
- different implementations/libraries: e.g. PHP JSON libraries differ very much: http://gggeek.altervista.org/sw/article_20061113.html
Thanks Vasile!
For PHP-based tests, I had planned on using just the built-in JSON functions. From the link you gave, it looks like that's the fastest option which is good.
Sounds great!
End of June 2009 is just around the corner, how goes the study?
Would be really interesting to see some results! :)
Thanks for the comments Jason and Marko!
The study finished up around mid-April. The results are in some cases quite interesting and in other cases quite predictable.
I'm working on a couple of academic journal articles on this subject, and for those to be accepted for publication I can't publish too much on the same subject beforehand.
I'd like to have the journal articles finished within the next month and then it's just a matter of waiting for them to be accepted and then published.
The really really interesting results come from the side-study I was carrying out looking into the extent to which people perceive delays. This study was important as I needed to answer the question: if data format X is "faster" than data format Y, to what degree must it be faster for end users to perceive a difference?
It is traditionally held that people cannot generally perceive delays of less than 100ms (1/10th of a second) and that it's only around the 100ms level that the percentage of people capable of perceiving such delays becomes statistically significant. I have shown this not to be true and that there are very commonly occurring conditions under which the majority of people can perceive delays of 100ms or less.
I can't really say anything about the XML vs JSON vs YAML subject just yet. But I shall try to write a follow-up article soon that reveals at least something for now.
I wish I could say more but for the next couple of months I'll still have to keep things to myself.
After the journal articles are published (or rejected!) I can say much more. I have a whole series of articles ready which will reveal much more.
Drop me an email if you want to know when I release some useful, interesting results.
Is that how you compare them? But what about fitness to the task? As far as I know JSON does not have Schema or DTD. You can't validate JSON or YAML so you need to define your criteria. You are only comparing the small subset of schemaless data transfer. And even then only the relative "performance".
True, schemaless only - or self-describing, for associative arrays.
At $JOB we sometimes use XML to communicate data structures, but the problem is that XML isn't a good fit for common data structures: scalars (numbers and strings), arrays and associative arrays. YAML and JSON encodings, by contrast, both exactly match the data structure in the scripting language, so we know that fidelity will be retained.
any news about your publications?
Hey, I'm interested in your results, partial or completed. Are they available?
The author of webignition.net has written an excellent article. You have made your point and there is not much to argue about. It is like the following universal truth that you can not argue with: I smoke weed. Thanks for the info.