November 25th, 2008
Encoding C# strings as Byte[] (Byte Arrays) and back again
![Encoding and Converting C# strings into Byte[] byte arrays](http://www.dijksterhuis.org/wp-content/uploads/2008/11/byte.jpg)
When working with I/O streams (such as when sending and receiving information over a NetworkStream) you often have to convert C# strings into Byte[] (byte arrays) and back again. At this point it is important to consider how you would like to encode your string. This post shows how you can pass a string to a method that only accepts byte arrays, and how you can turn byte arrays back into strings again.
One major difference between C and C# is that in C# all strings are stored as Unicode (UTF-16, in 16 bit units), whereas a C string is just an array of bytes.
In the old days, when computers were still newish, the ASCII standard defined the first 128 characters, so a single byte (which can hold 256 different values) was more than sufficient for communication. As time went by and computers had to speak more languages, the second half (128-255) was mapped to various languages. Many different encoding schemes (also called code pages) were designed, including multi-byte ones for Japanese & Chinese (some 6000+ characters) that combine two or more bytes into a single character, since 256 values are not enough.
It was however still impossible to write a single e-mail that contained Ancient Greek, Chinese and modern Russian. So work was started on the Unicode project. For Unicode it was originally decided that a 2 byte combination (65,536 values) was sufficient to hold all the world's languages. That later proved too small, so characters beyond that range are now stored as a pair of 16 bit units (a surrogate pair).
The basic unit of a memory cell or a communication stream is still the byte. A function which sends or receives information thus has to work with Byte[] (byte arrays).
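To make that concrete, here is a minimal sketch of pushing a string through a byte-only stream. It uses a MemoryStream as a stand-in for a NetworkStream, and it assumes UTF-8 as the encoding (the choice of encoding is discussed in the solutions below). The class and method names are made up for this illustration.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only; not part of the original post.
public static class StreamRoundTrip
{
    // Encodes the string as bytes, writes them to a stream,
    // reads them back and decodes them into a string again.
    public static string SendAndReceive(string message)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            // Streams only accept bytes, so the string is encoded first
            // (UTF-8 is an assumption made for this sketch).
            byte[] outgoing = Encoding.UTF8.GetBytes(message);
            stream.Write(outgoing, 0, outgoing.Length);

            // Rewind and read the same number of bytes back.
            stream.Position = 0;
            byte[] incoming = new byte[outgoing.Length];
            int total = 0;
            while (total < incoming.Length)
            {
                int read = stream.Read(incoming, total, incoming.Length - total);
                if (read == 0) break; // end of stream
                total += read;
            }

            // Decode with the same encoding that was used for writing.
            return Encoding.UTF8.GetString(incoming, 0, total);
        }
    }

    public static void Main()
    {
        Console.WriteLine(SendAndReceive("Hello World"));
    }
}
```

The important part is that both sides agree on the encoding: whatever was used to produce the bytes must also be used to turn them back into a string.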
Solution #1 – Convert Unicode to ASCII / String to an ASCII Byte[]
If you intend to send only the most basic of messages which can be satisfied with just A-Z, a-z & 0-9 and a few other characters, you can convert the C# string using the ASCII encoder. You will however lose any characters that are not defined by ASCII: by default the encoder replaces each of them with a question mark. So while this is a good idea if your application is only used in North America, the rest of the world will probably not thank you for this design decision.
Convert a string to a byte[]
// Native C# strings are unicode encoded
string StringMessage = "Hello World How Are you? Pi \u03C0 Yen \uFFE5";
// We can show the characters on the command line
Console.WriteLine("{0}", StringMessage);
// We can convert directly to a byte array, but some information is lost
System.Text.ASCIIEncoding ASCII = new System.Text.ASCIIEncoding();
Byte[] BytesMessage = ASCII.GetBytes(StringMessage);
To convert a byte[] back into a string
Byte[] BytesMessage; // Your message
System.Text.ASCIIEncoding ASCII = new System.Text.ASCIIEncoding();
String StringMessage = ASCII.GetString( BytesMessage );
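To see what "some information is lost" means in practice, here is a small sketch that sends a string through the ASCII encoder and back. It relies on the default behaviour of the ASCII encoder, which substitutes a question mark for every character it cannot represent. The class and method names are made up for this illustration.

```csharp
using System;
using System.Text;

// Hypothetical demo class for illustration only.
public static class AsciiLossDemo
{
    // Encodes to ASCII bytes and decodes again.
    public static string RoundTrip(string message)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(message);
        return Encoding.ASCII.GetString(bytes);
    }

    public static void Main()
    {
        string original = "Pi \u03C0 Yen \uFFE5";
        string roundTripped = RoundTrip(original);

        Console.WriteLine(roundTripped);             // Pi ? Yen ?
        Console.WriteLine(original == roundTripped); // False
    }
}
```

Plain ASCII text survives the round trip unchanged; only the Pi and Yen characters are replaced, and there is no way to recover them from the bytes afterwards.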
Solution #2 – Convert the Unicode string to a UTF-8 representation / String to encoded byte[]
These days a Western web browser can read Chinese pages, and send and receive e-mails to and from anywhere. But as many existing systems (including e-mail!) still limit transmission to the ASCII set of characters, a number of standards exist to encode the 16 bit Unicode strings into 7 or 8 bit communication. The most commonly used encoding method is UTF-8, which reliably encodes Unicode as a sequence of 8 bit bytes and leaves plain ASCII text unchanged.
Convert a string to a UTF-8 encoded byte[]
// Native C# strings are unicode encoded
string StringMessage = "Hello World How Are you? Pi \u03C0 Yen \uFFE5";
// We can convert directly to a byte array
System.Text.UTF8Encoding UTF8 = new System.Text.UTF8Encoding();
Byte[] BytesMessage = UTF8.GetBytes(StringMessage);
Convert a UTF-8 Byte Array back into a string
Byte[] BytesMessage; // Your message
System.Text.UTF8Encoding UTF8 = new System.Text.UTF8Encoding();
String StringMessage = UTF8.GetString( BytesMessage );
As a side note: a UTF-8 encoded unicode character does not simply translate to 2 bytes. So the length of the created Byte[] is not simply 2 times the number of characters in the string.
In fact each Unicode character can possibly be encoded as 1 – 4 bytes. If you would like to know more about the encoding scheme, have a look at the Wikipedia UTF-8 page.
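The variable length is easy to check for yourself. The sketch below counts the UTF-8 bytes for one sample character of each size; the sample characters and the class name are my own choice for this illustration.

```csharp
using System;
using System.Text;

// Hypothetical demo class for illustration only.
public static class Utf8SizeDemo
{
    // Returns the number of bytes the string occupies once UTF-8 encoded.
    public static int Utf8Length(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        // One sample each for a 1, 2, 3 and 4 byte UTF-8 sequence.
        // The last one lies outside the 16 bit range, so C# stores it
        // as two 16 bit units (a surrogate pair).
        string[] samples = { "A", "\u03C0", "\uFFE5", "\U0001F600" };

        foreach (string sample in samples)
        {
            Console.WriteLine("{0} UTF-16 unit(s) -> {1} UTF-8 byte(s)",
                sample.Length, Utf8Length(sample));
        }
    }
}
```

If you need to size a buffer before sending, ask the encoder with GetByteCount rather than multiplying the string length by a fixed factor.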
Image credit: roland
Tags: byte, byte array, converting, string
January 31st, 2009 at 12:55 am
I needed a quick reminder on encoding byte arrays, and ended up rethinking my approach because of the extra detail in your post. Appreciate it.
February 13th, 2009 at 1:37 pm
Hello
I am writing an app in C# which reads in c/h (C language) files and removes certain comment lines that meet a criterion. These c/h files contain both English and Japanese comments, such as this line, if it will show up…
u1_ret = (U1)SOME_CONST; /* 仮に未確定とする */
The problem is that when writing the file back, the Japanese comments are corrupted and show up as squares like this…
u1_ret = (U1)SOME_CONST; /* ���ɖ��m��Ƃ���*/
I am creating the streams as follows:
srInputFile = new StreamReader(filename);
swOutputFile = new StreamWriter(outputDir + "\\_" + outputFileName, false, srInputFile.CurrentEncoding);
Here is how I am writing to the file:
while (!srInputFile.EndOfStream)
{
    LineIn = srInputFile.ReadLine();
    if (IsTextMatch(LineIn) == false)
    {
        swOutputFile.WriteLine(LineIn.Normalize());
    }
}
I tried different encodings but the problem is still the same…
I appreciate your help…
regards
ld
February 15th, 2009 at 12:58 pm
This is a shot in the dark without seeing the original files, of course. You will need to establish the source encoding first. If it isn’t Unicode then it’s probably a Japanese ISO-2022-JP or Shift-JIS encoded file. In Shift-JIS certain combinations of multiple high ASCII characters form a single Japanese character.
Did you try something like:
srInputFile = new StreamReader(filename, Encoding.GetEncoding("iso-2022-jp"));
That should make sure the text is imported from ISO-2022 and correctly converted to Unicode on reading.
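Putting that together, here is a rough sketch of a copy loop that uses the same explicit encoding for both reading and writing. It is only an illustration under a few assumptions: the helper name and its parameters are made up, the file really is in the encoding you name, and on .NET Core / .NET 5 and later the legacy code pages (such as shift_jis and iso-2022-jp) have to be registered first, which the RegisterProvider line takes care of. On the classic .NET Framework these encodings are available without that line.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only.
public static class LegacyEncodingCopy
{
    // Copies a text file line by line, decoding and re-encoding
    // with the named encoding (for example "shift_jis").
    public static void CopyWithEncoding(string inputPath, string outputPath, string encodingName)
    {
        // Needed on .NET Core / .NET 5+ to make legacy code pages available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding encoding = Encoding.GetEncoding(encodingName);

        using (StreamReader reader = new StreamReader(inputPath, encoding))
        using (StreamWriter writer = new StreamWriter(outputPath, false, encoding))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Any filtering of lines would go here.
                writer.WriteLine(line);
            }
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length == 3)
        {
            CopyWithEncoding(args[0], args[1], args[2]);
        }
    }
}
```

If the output still shows squares, the encoding name passed in is most likely not the one the file was actually saved in.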