Rick


Friday, January 17, 2014

Boon / Jackson discussion between Hightower and Cowtowncoder.

I am mostly putting this here so I can read it later, just in case the issue gets deleted. Boon got a visit from Cowtowncoder.

cowtowncoder opened this issue 

Improvement to `JacksonSerializer.roundTrip()`

Existing code serializes content as a String, but given that Jackson heavily optimizes byte stream case, and since parse tests take in a byte[] or InputStream, it would make more sense to use:
byte[] content = mapper.writeValueAsBytes(alltype);
and feed that to parser. Combination will be significantly faster than construction to and parsing from a String (esp. when using StringWriter; there is mapper.writeValueAsString() method as well that is marginally faster).
The same output method probably also makes sense for serialize() method, looking at matching Boon serializer implementation.
Sorry, I don't get it. The benchmark focuses on deserialization, not serialization => read from bytes, not write to bytes.
Ah, maybe you're talking about additional tests that Boon's author has done in his fork?
Ok sorry if that is the case: I did start with the work, and did not verify the original code base.
I will verify and close this if irrelevant.
Indeed. Did not realize there is an actual fork there. Apologies for noise.
 cowtowncoder closed the issue 
It has code for serializing and parsing as you mentioned.
RE: Existing code serializes content as a String, but given that Jackson heavily optimizes byte stream case, and since parse tests take in a byte[] or InputStream, it would make more sense to use:
byte[] content = mapper.writeValueAsBytes(alltype);
Seems logical as a benchmark. Not sure it makes the other benchmark invalid, as there are users who do want strings. Boon is optimized to serialize with char[]. I don't have a byte[] serializer.
RE: and feed that to parser. Combination will be significantly faster than construction to and parsing from a String (esp. when using StringWriter; there is mapper.writeValueAsString() method as well that is marginally faster).
RE: The same output method probably also makes sense for serialize() method, looking at matching Boon serializer implementation.
The thought of a byte serializer never crossed my mind. Currently there is only a char[] serializer. I like the idea. Not sure it is high on my list of priorities, as most of the buffer sizes that I am dealing with in production are satisfied fine with String.getBytes. I'll test Jackson vs. Boon with to/fro bytes only. This should put Boon at an equal disadvantage of where Jackson is now. I don't think it negates the earlier benchmark, and in fact I think the benchmark in question is a much more common case. But that is conjecture for both of us.
@RichardHightower right, it is difficult to have proper apples-to-apples tests. In this case it is not so much that output as byte[] is more efficient (it may be marginally so but not significantly), but that parsing is.
I think it is perfectly reasonable to have different sources/targets to test, and report these separately (esp. for parsing). For serialization, perhaps it would make sense to also have OutputStream output type, which should be relatively fair, not assuming that serializer has an efficient way of producing a byte[] or not?
For round-tripping use case it seems to me that the intermediate format should be the most efficient one for implementation; this does require some knowledge of implementation. I was thinking that this use case perhaps emulates case like writing a JSON file to disk, reading it back; or web service call (although payload would differ in directions), or maybe send-modify-return.
One last thing: I would have filed this at the forked repo, but I couldn't find an issue tracker. Maybe github does not have all the facilities for forks.
@RichardHightower I thought I recognized your name from NFJS? Unless I am confusing you with another Mr. Hightower. :)
I appreciate your explanation on goals for Boon. It is always interesting to read about that part, since it makes it much easier to understand various implementations choices and strategies.
The part about writeValueAsString() is really just about how StringWriter works, which, like ByteArrayOutputStream, keeps doubling and reallocating its buffer. If the end result is a String, that isn't optimal unless the initial size guess is correct. With segments, one can reduce allocations. But it's probably not a huge deal in the end. The same is true for writeValueAsBytes: intermediate storage is segmented, and the final allocation is done when the total length is known, instead of doubling up until the full size is known and then making one more copy.
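To illustrate the segmented-buffer idea described above, here is a minimal sketch using only the JDK. The class name `SegmentedBuffer` and its methods are hypothetical and written for this note; this is not Jackson's actual buffer implementation, only an approximation of the strategy as described in the comment: full segments are kept without copying, and a single exact-size array is allocated once the total length is known.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of segmented intermediate storage; not Jackson's real class. */
public class SegmentedBuffer {
    private final List<byte[]> fullSegments = new ArrayList<>();
    private byte[] current;          // segment currently being filled
    private int currentLength = 0;   // bytes used in the current segment
    private int fullLength = 0;      // total bytes held in full segments

    public SegmentedBuffer(int firstSegmentSize) {
        if (firstSegmentSize < 1) {
            throw new IllegalArgumentException("firstSegmentSize must be >= 1");
        }
        current = new byte[firstSegmentSize];
    }

    public void write(byte b) {
        if (currentLength == current.length) {
            // Segment is full: keep it as-is (no copy) and start a larger one.
            fullSegments.add(current);
            fullLength += current.length;
            current = new byte[current.length * 2];
            currentLength = 0;
        }
        current[currentLength++] = b;
    }

    public void write(byte[] data) {
        for (byte b : data) {
            write(b);
        }
    }

    public int size() {
        return fullLength + currentLength;
    }

    /** Allocates the result once, at its exact final size, then copies each segment in. */
    public byte[] toByteArray() {
        byte[] result = new byte[size()];
        int offset = 0;
        for (byte[] segment : fullSegments) {
            System.arraycopy(segment, 0, result, offset, segment.length);
            offset += segment.length;
        }
        System.arraycopy(current, 0, result, offset, currentLength);
        return result;
    }
}
```

By contrast, a single growing array copies all existing content each time it is enlarged, and `ByteArrayOutputStream.toByteArray()` then makes one further copy. Whether the difference matters in practice depends on payload size, as the comment itself notes.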
Bigger impact for round-trip was simply just that the intermediate Object would be efficient to create & consume. I had a look at results you found, and round-trip was one where I couldn't quite see where the difference comes from.
Other cases (where Boon does very well -- impressive!) make more sense to me.
Jackson optimizes heavily for the POJO data-binding case to/from raw byte input, and binding as Maps specifically is probably somewhat sub-optimal. I suspect that the one dictionary JSON where keys are numbers is particularly tricky for Jackson due to symbol table churn. Basically, Jackson assumes that keys are mostly repetitive; but if all keys are unique, this does not hold. Which is fine, except that for larger files it starts to degrade performance.
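The symbol table churn mentioned above can be sketched with a toy key table. This is a simplified assumption of how canonicalizing field names behaves, not Jackson's real symbol table (which, as far as I understand, works on the raw input before a String is even built). The class `KeySymbolTable` and its counters are hypothetical names made up for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy symbol table: returns one shared String instance per distinct key. */
public class KeySymbolTable {
    private final Map<String, String> symbols = new HashMap<>();
    private int hits = 0;    // lookups that found an existing entry
    private int misses = 0;  // lookups that had to add a new entry

    public String canonicalize(String key) {
        String existing = symbols.get(key);
        if (existing != null) {
            hits++;
            return existing;
        }
        // Miss: the table grows by one entry for every key not seen before.
        misses++;
        symbols.put(key, key);
        return key;
    }

    public int hits() { return hits; }
    public int misses() { return misses; }
    public int size() { return symbols.size(); }

    public static void main(String[] args) {
        // Repetitive keys, as in a list of similar objects: almost every lookup is a hit.
        KeySymbolTable repetitive = new KeySymbolTable();
        for (int i = 0; i < 1000; i++) {
            repetitive.canonicalize("id");
            repetitive.canonicalize("name");
        }
        System.out.println("repetitive: size=" + repetitive.size() + " hits=" + repetitive.hits());

        // Unique numeric keys, as in the dictionary JSON: every lookup is a miss.
        KeySymbolTable unique = new KeySymbolTable();
        for (int i = 0; i < 1000; i++) {
            unique.canonicalize(Integer.toString(i));
        }
        System.out.println("unique: size=" + unique.size() + " hits=" + unique.hits());
    }
}
```

With repetitive keys the table stays tiny and nearly all lookups are hits. With unique keys the table never pays off: each key costs a failed lookup plus an insertion, and the table keeps growing with the input.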
One thing I have been curious about is the performance cost of doing range checks (to allow incremental input): that is, whether requiring all input to be in memory would allow short-cuts. I tried removing those checks once, with a pre-allocated buffer, but did not see much difference. So for the way the Jackson streaming parser is implemented, there isn't much benefit from requiring full input. But this could well be different with a different parsing technique.
Maybe I should see how Boon does parsing: it's been a while since I have had a look elsewhere. The last one was FastJSON, which actually had very cool tricks for data-binding. Its approach is much more integrated than Jackson's (sort of like SAX + data-binding in one bundle), and that is something that could yield improvements too. The current division between streaming and data-binding is useful, but it has some non-zero cost.