
November 25th, 2008

Encoding C# strings as Byte[] (Byte Arrays) and back again

When working with I/O streams (such as when sending and receiving information over a NetworkStream) you often have to convert C# strings into Byte[] (byte arrays) and back again. At this point it is important to consider how you would like to encode your string. This post shows how you can pass a string to a method that only accepts byte arrays, and how you can turn byte arrays back into strings again.

One major difference between C and C# is that C# stores all strings as Unicode (internally as UTF-16), whereas a C string is simply an array of bytes.

In the old days, when computers were still newish, the ASCII standard defined the first 128 characters, so a single byte (which can hold 256 distinct values) was more than sufficient for communication. As time went by and computers had to speak more languages, the second half (128-255) was mapped to the characters of various languages. Many different encoding schemes (also called code pages) were designed, including ones for Japanese & Chinese (some 6000+ characters). Because 256 values are not nearly enough for those, such encodings combine two or more bytes to represent a single character.

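It was however still impossible to write a single e-mail that contained Ancient Greek, Chinese and modern Russian. So work was started on the Unicode project. For Unicode it was originally decided that a 2-byte combination (65,536 values) was sufficient to hold all the world's languages. That later proved too small, and Unicode now defines characters beyond that range; a C# string stores each of those as a pair of 16-bit values (a surrogate pair).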

The basic unit of a memory cell, or a communication stream is still the byte. A function which sends or receives information thus has to work with Byte[] (byte arrays).
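As a rough illustration of that byte-oriented boundary, the sketch below encodes a string, writes the bytes to a stream, reads them back and decodes them. It makes a few assumptions that are not part of the original post: a MemoryStream stands in for a NetworkStream, UTF-8 is the chosen encoding, and the names StreamRoundTrip and SendAndReceive are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamRoundTrip
{
    // Hypothetical helper: encodes the text, pushes the bytes through a stream,
    // then reads them back and decodes them again.
    public static string SendAndReceive(string text)
    {
        // Streams only accept bytes, so the string has to be encoded first.
        // UTF-8 is an assumption here; both ends must agree on the encoding.
        byte[] outgoing = Encoding.UTF8.GetBytes(text);

        using (MemoryStream stream = new MemoryStream())
        {
            stream.Write(outgoing, 0, outgoing.Length);
            stream.Position = 0; // rewind so the same stream can be read back

            // Read may return fewer bytes than requested, so loop until done.
            byte[] incoming = new byte[outgoing.Length];
            int total = 0;
            while (total < incoming.Length)
            {
                int read = stream.Read(incoming, total, incoming.Length - total);
                if (read == 0)
                {
                    break; // end of stream
                }
                total += read;
            }

            // Decode with the same encoding that was used for sending.
            return Encoding.UTF8.GetString(incoming, 0, total);
        }
    }
}
```

Calling StreamRoundTrip.SendAndReceive("Pi \u03C0") should hand back the same string. On a real network connection the receiver usually also needs some way to know how many bytes belong to one message, which this sketch leaves out.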

Solution #1 – Convert Unicode to ASCII / String to an ASCII Byte[]

If you intend to send only the most basic of messages, which can be satisfied with just A-Z, a-z & 0-9 and a few other characters, you can convert the C# string using the ASCII encoder. You will however lose any characters that are not defined by ASCII: the encoder replaces each of them with a question mark. So while this is a good idea if your application is only used in North America, the rest of the world will probably not thank you for this design decision.

Convert a string to a byte[]

// Native C# strings are Unicode (UTF-16) encoded
string StringMessage = "Hello World How Are you? Pi \u03C0 Yen \uFFE5";

// We can show the characters on the command line
Console.WriteLine("{0}", StringMessage);

// We can convert directly to a byte array, but some information is lost:
// the pi and yen characters are not ASCII and become '?'
System.Text.ASCIIEncoding ASCII = new System.Text.ASCIIEncoding();
Byte[] BytesMessage = ASCII.GetBytes(StringMessage);

To convert a byte[] back into a string

// Your message; as a placeholder, these are the ASCII bytes for "Hello"
Byte[] BytesMessage = new Byte[] { 72, 101, 108, 108, 111 };
System.Text.ASCIIEncoding ASCII = new System.Text.ASCIIEncoding();
String StringMessage = ASCII.GetString(BytesMessage);
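To see what losing characters looks like in practice, here is a small sketch that sends a string through the ASCII encoder and straight back. The class and method names (AsciiLossDemo, RoundTrip) are made up for this illustration.

```csharp
using System;
using System.Text;

public static class AsciiLossDemo
{
    // Hypothetical helper: string -> ASCII bytes -> string.
    public static string RoundTrip(string text)
    {
        // Characters outside the 7-bit ASCII range cannot be represented,
        // so the encoder substitutes a '?' (byte value 63) for each of them.
        byte[] bytes = Encoding.ASCII.GetBytes(text);

        // Decoding cannot recover what was lost; the '?' stays a '?'.
        return Encoding.ASCII.GetString(bytes);
    }
}
```

With the sample text, AsciiLossDemo.RoundTrip("Pi \u03C0 Yen \uFFE5") returns "Pi ? Yen ?", while a plain ASCII string such as "Hello World" comes back unchanged.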

Solution #2 – Convert the Unicode string to a UTF-8 representation / String to a UTF-8 encoded Byte[]

These days a Western web browser can read Chinese pages, and send and receive e-mails to and from anywhere. But as many existing systems (including e-mail!) still limit transmission to 7 or 8 bit characters, a number of standards exist to encode Unicode strings for 7 or 8 bit communication. The most commonly used encoding method is UTF-8, which represents every Unicode character as a sequence of one or more 8 bit bytes and leaves plain ASCII text unchanged.

Convert a string to a UTF-8 encoded byte[]

// Native C# strings are Unicode (UTF-16) encoded
string StringMessage = "Hello World How Are you? Pi \u03C0 Yen \uFFE5";

// We can convert directly to a byte array, this time without losing information
System.Text.UTF8Encoding UTF8 = new System.Text.UTF8Encoding();
Byte[] BytesMessage = UTF8.GetBytes(StringMessage);

Convert a UTF-8 Byte Array back into a string

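// Your message; as a placeholder, these are the UTF-8 bytes for "Pi \u03C0"
Byte[] BytesMessage = new Byte[] { 0x50, 0x69, 0x20, 0xCF, 0x80 };
System.Text.UTF8Encoding UTF8 = new System.Text.UTF8Encoding();
String StringMessage = UTF8.GetString(BytesMessage);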
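One thing to keep in mind when decoding bytes that arrive from elsewhere: by default the UTF-8 decoder does not complain about bytes that are not valid UTF-8, it quietly substitutes the replacement character U+FFFD. If you would rather detect bad input, the UTF8Encoding constructor takes a throwOnInvalidBytes flag. The sketch below is one possible way to use it; StrictUtf8 and TryDecode are hypothetical names, not part of the framework.

```csharp
using System;
using System.Text;

public static class StrictUtf8
{
    // First argument: do not emit a byte order mark.
    // Second argument: throw on invalid bytes instead of inserting U+FFFD.
    private static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    // Hypothetical helper: returns false when the bytes are not valid UTF-8.
    public static bool TryDecode(byte[] bytes, out string text)
    {
        try
        {
            text = Strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // The input contained a byte sequence that UTF-8 does not allow.
            text = string.Empty;
            return false;
        }
    }
}
```

Passing the bytes { 0x50, 0x69, 0x20, 0xCF, 0x80 } yields true and the text "Pi \u03C0", whereas a single 0xFF byte, which never occurs in valid UTF-8, yields false.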

As a side note: UTF-8 does not use a fixed 2 bytes per character, so the length of the created Byte[] is not simply 2 times the number of characters in the string.

In fact each Unicode character is encoded as 1 to 4 bytes: plain ASCII characters take a single byte, and the further a character lies from the ASCII range, the more bytes it needs. If you would like to know more about the encoding scheme, have a look at the Wikipedia UTF-8 page.
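The varying sizes are easy to check with Encoding.UTF8.GetByteCount, which reports how many bytes a string needs without actually building the array. The wrapper below (Utf8Sizes.Utf8ByteCount, a name invented for this example) is only a thin sketch around that call.

```csharp
using System;
using System.Text;

public static class Utf8Sizes
{
    // Hypothetical helper: number of bytes the text occupies once UTF-8 encoded.
    public static int Utf8ByteCount(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }
}
```

For example, "A" needs 1 byte, "\u03C0" (pi) needs 2, "\uFFE5" (the fullwidth yen sign) needs 3, and "\U0001F600" (an emoji outside the original 16-bit range, stored as two chars in a C# string) needs 4. The 35-character sample message used in this post comes out at 38 bytes.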

3 Responses to “Encoding C# strings as Byte[] (Byte Arrays) and back again”

  1. Tyler Says:

    I needed a quick reminder on encoding byte arrays, and ended up rethinking my approach because of the extra detail in your post. Appreciate it.

  2. Laith Says:

    Hello

    I am writing an app in C# which reads in c/h (C language) files and removes certain comment lines that meet a criterion. These c/h files contain both English and Japanese comments, such as this line, if it will show up…
    u1_ret = (U1)SOME_CONST; /* 仮に未確定とする */

    The problem is that when writing the file back, the Japanese comments are corrupted and show up as squares like this…
    u1_ret = (U1)SOME_CONST; /* ���ɖ��m��Ƃ���*/

    I am creating the streams as follows:
    srInputFile = new StreamReader(filename);
    swOutputFile = new StreamWriter(outputDir + "\\_" + outputFileName, false, srInputFile.CurrentEncoding);

    Here is how I am writing to file
    while (!srInputFile.EndOfStream)
    {
        LineIn = srInputFile.ReadLine();
        if (IsTextMatch(LineIn) == false)
        {
            swOutputFile.WriteLine(LineIn.Normalize());
        }
    }

    I tried different encodings but the problem is still the same…
    I appreciate your help…

    regards
    ld

  3. Martijn Says:

    This is a shot in the dark without seeing the original files of course. You will need to establish the source encoding first. If it isn't Unicode then it's probably a Japanese ISO-2022-JP or Shift-JIS encoded file. In Shift-JIS a two-byte sequence that starts with a high (non-ASCII) byte forms a single Japanese character.

    Did you try something like:

    srInputFile = new StreamReader(filename, Encoding.GetEncoding("iso-2022-jp"));

    That should make sure the text is imported from ISO-2022 and correctly converted to Unicode on reading.

