Dmitry Jemerov's Weblog: Blowing it all out

February 14, 2005

Blowing it all out

It looks like a recent post in Michael Kaplan’s blog, where he demonstrates the use of surrogate pairs, exposes a bug in the implementation of System.IO.BinaryReader.ReadString() in .NET 1.1. The bug appears in Omea as an ArgumentException “Conversion buffer overflow” when trying to read the body of the post from the resource store (which stores strings in UTF-8 encoding).

I have studied the Rotor sources of binaryreader.cs and utf8encoding.cs, and while they probably don’t exactly match the .NET 1.1 implementation, I think they give me a good idea of what’s actually going on.

As far as I understand, the problem is the following. BinaryReader.ReadString() reads the string in 128-byte chunks, using the Decoder class to keep the intermediate state of the encoding conversion between chunks. It also creates a 128-char buffer where it puts the results of converting each chunk. Thus, it assumes that Decoder.GetChars() will never return more characters than it was given bytes.

However, if I understand correctly, the assumption is violated when the 4-byte UTF-8 sequence encoding a surrogate pair straddles the chunk boundary so that only its last byte falls into the next 128-byte chunk, and all other bytes in that chunk represent regular ASCII characters. In this case, the UTF-8 decoder returns the complete surrogate pair as the first two characters of the new chunk, followed by 127 regular ASCII characters. The result: an attempt to store 129 characters in a 128-character buffer.
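The arithmetic can be checked with a small C# sketch. It does not call BinaryReader at all; it only feeds two 128-byte chunks to a UTF-8 Decoder by hand, with a deliberately oversized char buffer so that nothing overflows. The class and method names (ChunkBoundaryDemo, CharsPerChunk) are made up for illustration, and U+10000 simply stands in for any character outside the BMP.

```csharp
using System;
using System.Text;

static class ChunkBoundaryDemo
{
    const int ChunkSize = 128; // chunk size described in the post

    // Returns the number of UTF-16 chars produced for each of two 128-byte chunks.
    public static int[] CharsPerChunk()
    {
        // Two chunks of plain ASCII...
        byte[] data = new byte[2 * ChunkSize];
        for (int i = 0; i < data.Length; i++) data[i] = (byte)'a';

        // ...with one 4-byte sequence placed so that only its last byte
        // lands in the second chunk. U+10000 encodes as F0 90 80 80.
        byte[] pair = Encoding.UTF8.GetBytes("\U00010000");
        Array.Copy(pair, 0, data, ChunkSize - 3, pair.Length);

        Decoder decoder = Encoding.UTF8.GetDecoder();
        char[] chars = new char[2 * ChunkSize]; // oversized on purpose

        // First chunk: 125 ASCII bytes plus 3 bytes the decoder has to hold back.
        int first = decoder.GetChars(data, 0, ChunkSize, chars, 0);
        // Second chunk: 1 byte completing the pair plus 127 ASCII bytes.
        int second = decoder.GetChars(data, ChunkSize, ChunkSize, chars, 0);
        return new int[] { first, second };
    }

    static void Main()
    {
        int[] counts = CharsPerChunk();
        Console.WriteLine(counts[0]); // 125
        Console.WriteLine(counts[1]); // 129
    }
}
```

The second call yields 129 chars from 128 bytes, which is one more than a 128-char buffer can hold.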

I guess I am really lucky to have hit this problem… fortunately, it is fairly easy to replace BinaryReader.ReadString() with custom code that will not have this problem, and I’ll do just that.
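A replacement could look roughly like the C# sketch below. This is not the code that went into Omea; it assumes the strings were written in the BinaryWriter.Write(string) format (a 7-bit encoded byte count followed by the UTF-8 bytes), and the names SafeStringReader and ReadUtf8String are hypothetical. It sidesteps the chunking problem by reading all the bytes first and decoding them in one call.

```csharp
using System;
using System.IO;
using System.Text;

static class SafeStringReader
{
    // Hypothetical helper: reads a length-prefixed UTF-8 string without
    // decoding it in fixed-size chunks.
    public static string ReadUtf8String(BinaryReader reader)
    {
        int byteCount = Read7BitEncodedInt(reader);
        if (byteCount < 0) throw new IOException("Invalid string length.");
        if (byteCount == 0) return string.Empty;

        byte[] bytes = reader.ReadBytes(byteCount);
        if (bytes.Length != byteCount) throw new EndOfStreamException();

        // Decoding the whole array at once lets the encoding size its own output.
        return Encoding.UTF8.GetString(bytes);
    }

    // Decodes the variable-length prefix: 7 bits per byte, high bit means "more".
    static int Read7BitEncodedInt(BinaryReader reader)
    {
        int result = 0;
        int shift = 0;
        while (true)
        {
            if (shift >= 35) throw new FormatException("Bad 7-bit encoded length.");
            byte b = reader.ReadByte();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }

    static void Main()
    {
        // Round trip a string that contains a surrogate pair.
        string original = new string('a', 125) + "\U00010000" + new string('b', 127);
        MemoryStream stream = new MemoryStream();
        BinaryWriter writer = new BinaryWriter(stream, Encoding.UTF8);
        writer.Write(original);
        writer.Flush();
        stream.Position = 0;

        BinaryReader reader = new BinaryReader(stream, Encoding.UTF8);
        Console.WriteLine(ReadUtf8String(reader) == original); // True
    }
}
```

The trade-off is that the whole encoded string is held in memory at once, which seems acceptable for strings of the size a post body would have.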

Posted by Dmitry Jemerov at February 14, 2005 10:16 AM

Comments

I'm coming across this same issue in System.IO.StreamReader.ReadLine(). The exception is listed at the bottom for reference.

The line being read is only 49 bytes long, "Domain Name: XN--H32B15ENYC30BG1K.COM (.....com)". The values for the first four dots are "F1 BD BA B8".

This wouldn't seem to exceed 128 bytes, but maybe the buffer size is only set to the size of the read buffer.

Did you find a workaround for this issue?

Exception listing:
System.ArgumentException: Conversion buffer overflow.
at System.Text.UTF8Encoding.GetChars(Byte[] bytes, Int32 byteIndex, Int32 byteCount, Char[] chars, Int32 charIndex, UTF8Decoder decoder)
at System.Text.UTF8Decoder.GetChars(Byte[] bytes, Int32 byteIndex, Int32 byteCount, Char[] chars, Int32 charIndex)
at System.IO.StreamReader.ReadBuffer()
at System.IO.StreamReader.ReadLine()
at INTZ.FormDailyStats.CountRecords(String file, Int32& queried, Int32& returned) in (source filename at line number.)

Thanks,
Ed :)
http://www.BestPricedDomains.com

Posted by: Ed Amaral at August 24, 2005 09:36 PM

I too was encountering a Conversion buffer overflow exception using the following code...

String contentString = "Some data";

ubyte[] docData = new ubyte[Convert.ToInt32(contentString.length())];
char [] data = new char[contentString.length()];
data = contentString.toCharArray();
System.Text.UTF8Encoding utf8 = new System.Text.UTF8Encoding();
System.Text.Encoder encoder = utf8.GetEncoder();
encoder.GetBytes(data, 0, contentString.length(), docData, 0, true);
b64Document = Convert.ToBase64String(docData);

I was able to get around the overflow error using the following...

ubyte docData[] = System.Text.Encoding.get_UTF8().GetBytes(contentString);
b64Document = Convert.ToBase64String(docData);

It's kind of strange that this worked, as it would appear to be doing the same thing.
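One possible explanation, assuming the real content contained non-ASCII characters: the first version sizes the byte array by the number of characters, but UTF-8 can need up to three bytes for a single UTF-16 char, whereas Encoding.GetBytes(string) allocates an array of the right size itself. A C# sketch of the Encoder-based variant that asks for the byte count first (EncodeDemo and EncodeWithExactBuffer are made-up names):

```csharp
using System;
using System.Text;

static class EncodeDemo
{
    // Encodes with an explicit Encoder, sizing the buffer by encoded bytes
    // rather than by the number of characters.
    public static byte[] EncodeWithExactBuffer(string content)
    {
        char[] data = content.ToCharArray();
        Encoder encoder = Encoding.UTF8.GetEncoder();
        int byteCount = encoder.GetByteCount(data, 0, data.Length, true);
        byte[] bytes = new byte[byteCount];
        encoder.GetBytes(data, 0, data.Length, bytes, 0, true);
        return bytes;
    }

    static void Main()
    {
        string content = "caf\u00E9"; // 4 chars, but the last one takes 2 bytes
        byte[] bytes = EncodeWithExactBuffer(content);
        Console.WriteLine(content.Length); // 4
        Console.WriteLine(bytes.Length);   // 5
        Console.WriteLine(Convert.ToBase64String(bytes));
    }
}
```

With pure ASCII content such as "Some data" both buffer sizes coincide, which would explain why the problem only shows up for some inputs.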

Anyway, I wanted to post this as there really wasn't much out there regarding this exception.

I hope this helps others.

Best Regards,

Mike Cronin
Data On Call

Posted by: Mike Cronin at March 28, 2006 04:15 AM