Parsing Binary Data in C? (network-programming)

Accepted answer
Score: 35

I have to disagree with many of the responses here. I strongly suggest you avoid the temptation to cast a struct onto the incoming data. It seems compelling and might even work on your current target, but if the code is ever ported to another target/environment/compiler, you'll run into trouble. A few reasons:

Endianness: The architecture you're using right now might be big-endian, but your next target might be little-endian. Or vice versa. You can overcome this with macros (ntoh and hton, for example), but it's extra work and you have to make sure you call those macros every time you reference the field.

Alignment: The architecture you're using might be capable of loading a multi-byte word at an odd-addressed offset, but many architectures cannot. If a 4-byte word straddles a 4-byte alignment boundary, the load may pull garbage. Even if the protocol itself doesn't have misaligned words, sometimes the byte stream itself is misaligned. (For example, although the IP header definition puts all 4-byte words on 4-byte boundaries, often the Ethernet header pushes the IP header itself onto a 2-byte boundary.)

Padding: Your compiler might choose to pack your struct tightly with no padding, or it might insert padding to deal with the target's alignment constraints. I've seen this change between two versions of the same compiler. You could use #pragmas to force the issue, but #pragmas are, of course, compiler-specific.

Bit Ordering: The ordering of bits inside C bitfields is compiler-specific. Plus, the bits are hard to "get at" for your runtime code. Every time you reference a bitfield inside a struct, the compiler has to use a set of mask/shift operations. Of course, you're going to have to do that masking/shifting at some point, but best not to do it at every reference if speed is a concern. (If space is the overriding concern, then use bitfields, but tread carefully.)
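
To make the padding and endianness points concrete, here is a minimal sketch. The wire format (one flag byte followed by a 4-byte big-endian word) and the names `struct Overlay`, `overlay_padding` and `word_from_wire` are made up for illustration. On many common ABIs the struct ends up 8 bytes long with 3 bytes of padding, so it cannot be laid over the 5 wire bytes, while the byte-at-a-time fetch gives the same answer on any host.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire format: 1 flag byte, then a 4-byte big-endian word. */
enum { WIRE_SIZE = 5 };

/* The "obvious" struct for that format. Its layout is up to the compiler. */
struct Overlay {
    uint8_t  flags;
    uint32_t word;
};

/* Bytes of padding the compiler placed between flags and word. */
size_t overlay_padding(void)
{
    return offsetof(struct Overlay, word) - sizeof(uint8_t);
}

/* Assemble the word one byte at a time: no alignment or endianness
 * assumptions about the host are needed. */
uint32_t word_from_wire(const uint8_t *wire)
{
    return ((uint32_t)wire[1] << 24) |
           ((uint32_t)wire[2] << 16) |
           ((uint32_t)wire[3] << 8)  |
            (uint32_t)wire[4];
}
```

Comparing `sizeof(struct Overlay)` with `WIRE_SIZE` on your own toolchain is a quick way to see whether an overlay cast would even line up.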

All this is not to say "don't use structs." My favorite approach is to declare a friendly native-endian struct of all the relevant protocol data without any bitfields and without concern for the issues above, then write a set of symmetric pack/parse routines that use the struct as a go-between.

typedef struct _MyProtocolData
{
    Bool myBitA;  // Using a "Bool" type wastes a lot of space, but it's fast.
    Bool myBitB;
    Word32 myWord;  // You have a list of base types like Word32, right?
} MyProtocolData;

Void myProtocolParse(const Byte *pProtocol, MyProtocolData *pData)
{
    // Somewhere, your code has to pick out the bits.  Best to just do it one place.
    // Parenthesize the mask: >> binds tighter than &, so "x & MASK >> SHIFT" is wrong.
    pData->myBitA = (*(pProtocol + MY_BITS_OFFSET) & MY_BIT_A_MASK) >> MY_BIT_A_SHIFT;
    pData->myBitB = (*(pProtocol + MY_BITS_OFFSET) & MY_BIT_B_MASK) >> MY_BIT_B_SHIFT;

    // Endianness and Alignment issues go away when you fetch byte-at-a-time.
    // Here, I'm assuming the protocol is big-endian.
    // You could also write a library of "word fetchers" for different sizes and endiannesses.
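    // Cast before shifting: a Byte promotes to int, and shifting into the sign bit is undefined.
    pData->myWord  = (Word32)*(pProtocol + MY_WORD_OFFSET + 0) << 24;
    pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 1) << 16;
    pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 2) << 8;
    pData->myWord += (Word32)*(pProtocol + MY_WORD_OFFSET + 3);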

    // You could return something useful, like the end of the protocol or an error code.
}

Void myProtocolPack(const MyProtocolData *pData, Byte *pProtocol)
{
    // Exercise for the reader!  :)
}

Now, the rest of your code just manipulates data inside the friendly, fast struct objects and only calls the pack/parse when you have to interface with a byte stream. There's no need for ntoh or hton, and no bitfields to slow down your code.
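
For readers who want a starting point on the pack routine left as an exercise, here is one possible sketch, not the author's version. It uses `<stdint.h>` types in place of the Bool/Byte/Word32 typedefs, and the offsets, masks and `MY_PROTOCOL_LEN` are hypothetical values chosen only so the example is self-contained. A matching parse routine is included so the pair can be checked as a round trip.

```c
#include <stdint.h>

/* Hypothetical layout: byte 0 carries two flag bits,
 * bytes 1..4 carry a big-endian 32-bit word. */
#define MY_BITS_OFFSET  0
#define MY_BIT_A_MASK   0x80u
#define MY_BIT_A_SHIFT  7
#define MY_BIT_B_MASK   0x40u
#define MY_BIT_B_SHIFT  6
#define MY_WORD_OFFSET  1
#define MY_PROTOCOL_LEN 5

typedef struct {
    uint8_t  myBitA;   /* treated as a boolean */
    uint8_t  myBitB;   /* treated as a boolean */
    uint32_t myWord;
} MyProtocolData;

void myProtocolPack(const MyProtocolData *pData, uint8_t *pProtocol)
{
    /* Rebuild the flags byte; any other bits in it are written as zero. */
    uint8_t bits = 0;
    if (pData->myBitA) bits |= MY_BIT_A_MASK;
    if (pData->myBitB) bits |= MY_BIT_B_MASK;
    pProtocol[MY_BITS_OFFSET] = bits;

    /* Store the word most significant byte first (big-endian). */
    pProtocol[MY_WORD_OFFSET + 0] = (uint8_t)(pData->myWord >> 24);
    pProtocol[MY_WORD_OFFSET + 1] = (uint8_t)(pData->myWord >> 16);
    pProtocol[MY_WORD_OFFSET + 2] = (uint8_t)(pData->myWord >> 8);
    pProtocol[MY_WORD_OFFSET + 3] = (uint8_t)(pData->myWord);
}

void myProtocolParse(const uint8_t *pProtocol, MyProtocolData *pData)
{
    pData->myBitA = (pProtocol[MY_BITS_OFFSET] & MY_BIT_A_MASK) >> MY_BIT_A_SHIFT;
    pData->myBitB = (pProtocol[MY_BITS_OFFSET] & MY_BIT_B_MASK) >> MY_BIT_B_SHIFT;

    pData->myWord  = (uint32_t)pProtocol[MY_WORD_OFFSET + 0] << 24;
    pData->myWord |= (uint32_t)pProtocol[MY_WORD_OFFSET + 1] << 16;
    pData->myWord |= (uint32_t)pProtocol[MY_WORD_OFFSET + 2] << 8;
    pData->myWord |= (uint32_t)pProtocol[MY_WORD_OFFSET + 3];
}
```

Keeping pack and parse side by side like this makes it easy to test them against each other with a simple round trip.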

Score: 14

The standard way to do this in C/C++ is really casting to structs, as 'gwaredd' suggested.

It is not as unsafe as one would think. You first cast to the struct that you expected, as in his/her example, then you test that struct for validity. You have to test for max/min values, termination sequences, etc.

Whatever platform you are on, you must read Unix Network Programming, Volume 1: The Sockets Networking API. Buy it, borrow it, steal it (the victim will understand, it's like stealing food or something...), but do read it.

After reading the Stevens, most of this will make a lot more sense.

Score: 12

Let me restate your question to see if I understood properly. You are looking for software that will take a formal description of a packet and then will produce a "decoder" to parse such packets?

If so, the reference in that field is PADS. A good article introducing it is PADS: A Domain-Specific Language for Processing Ad Hoc Data. PADS is very complete but unfortunately under a non-free licence.

There are possible alternatives (I did not mention non-C solutions). Apparently, none can be regarded as completely production-ready:

If you read French, I summarized these issues in Génération de décodeurs de formats binaires.

Score: 10

In my experience, the best way is to first write a set of primitives to read/write a single value of some type from a binary buffer. This gives you high visibility, and a very simple way to handle any endianness issues: just make the functions do it right.

Then you can, for instance, define structs for each of your protocol messages, and write pack/unpack (some people call them serialize/deserialize) functions for each.

As a base case, a primitive to extract a single 8-bit integer could look like this (assuming an 8-bit char on the host machine; you could add a layer of custom types to ensure that too, if needed):

const void * read_uint8(const void *buffer, unsigned char *value)
{
  const unsigned char *vptr = buffer;
  *value = *vptr++;   /* a void pointer cannot be dereferenced, so read through vptr */
  return vptr;
}

Here, I chose to return the value by reference, and return an updated pointer. This is a matter of taste; you could of course return the value and update the pointer by reference. It is a crucial part of the design that the read function updates the pointer, to make these chainable.

Now, we can write a similar function to read a 16-bit unsigned quantity:

const void * read_uint16(const void *buffer, unsigned short *value)
{
  unsigned char lo, hi;

  buffer = read_uint8(buffer, &hi);
  buffer = read_uint8(buffer, &lo);
  *value = (hi << 8) | lo;
  return buffer;
}

Here I assumed incoming data is big-endian, which is common in networking protocols (mainly for historical reasons). You could of course get clever and do some pointer arithmetic and remove the need for a temporary, but I find this way makes it clearer and easier to understand. Having maximal transparency in this kind of primitive can be a good thing when debugging.
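
Following the same pattern, a 32-bit reader can be layered on the 16-bit one. This is a sketch rather than part of the original answer: `read_uint32` is a name I am assuming, and I have swapped `unsigned char`/`unsigned short` for the `<stdint.h>` fixed-width types so the sizes are explicit. The data is again assumed to be big-endian.

```c
#include <stdint.h>

const void *read_uint8(const void *buffer, uint8_t *value)
{
    const uint8_t *vptr = buffer;
    *value = *vptr++;
    return vptr;
}

const void *read_uint16(const void *buffer, uint16_t *value)
{
    uint8_t hi, lo;

    buffer = read_uint8(buffer, &hi);
    buffer = read_uint8(buffer, &lo);
    *value = (uint16_t)((hi << 8) | lo);
    return buffer;
}

/* Hypothetical addition: built from two 16-bit reads, high half first. */
const void *read_uint32(const void *buffer, uint32_t *value)
{
    uint16_t hi, lo;

    buffer = read_uint16(buffer, &hi);
    buffer = read_uint16(buffer, &lo);
    *value = ((uint32_t)hi << 16) | lo;
    return buffer;
}
```

Because each reader returns the advanced pointer, decoding consecutive fields is a matter of threading one pointer through the calls: `p = read_uint8(p, &a); p = read_uint16(p, &b); p = read_uint32(p, &c);`.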

The next step would be to start defining your protocol-specific messages, and write read/write primitives to match. At that level, think about code generation; if your protocol is described in some general, machine-readable format, you can generate the read/write functions from that, which saves a lot of grief. This is harder if the protocol format is clever enough, but often doable and highly recommended.
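
As a rough illustration of that next step, the sketch below adds write primitives that mirror the readers and uses them to pack one message. The message type `struct hello_msg`, its fields and `hello_msg_pack` are invented for the example; a real protocol would define its own messages (or generate them from a description).

```c
#include <stdint.h>

void *write_uint8(void *buffer, uint8_t value)
{
    uint8_t *vptr = buffer;
    *vptr++ = value;
    return vptr;
}

/* Big-endian on the wire: high byte first. */
void *write_uint16(void *buffer, uint16_t value)
{
    buffer = write_uint8(buffer, (uint8_t)(value >> 8));
    buffer = write_uint8(buffer, (uint8_t)(value & 0xFF));
    return buffer;
}

/* Hypothetical protocol message: a version byte and a 16-bit port. */
struct hello_msg {
    uint8_t  version;
    uint16_t port;
};

/* Packs the message field by field and returns the end of the written data. */
void *hello_msg_pack(void *buffer, const struct hello_msg *msg)
{
    buffer = write_uint8(buffer, msg->version);
    buffer = write_uint16(buffer, msg->port);
    return buffer;
}
```

The returned pointer minus the start of the buffer gives the encoded length, which is handy when the message is followed by further data.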

Score: 5

You might be interested in Google Protocol Buffers, which is basically a serialization framework. It's primarily for C++/Java/Python (those are the languages supported by Google) but there are ongoing efforts to port it to other languages, including C. (I haven't used the C port at all, but I'm responsible for one of the C# ports.)

Score: 3

You don't really need to parse binary data in C, just cast some pointer to whatever you think it should be.

struct SomeDataFormat
{
    ....
};

struct SomeDataFormat *pParsedData = (struct SomeDataFormat *) pBuffer;

Just be wary of endian issues, type sizes, reading off the end of buffers, etc.

Score: 2

Parsing/formatting binary structures is one of the very few things that is easier to do in C than in higher-level/managed languages. You simply define a struct that corresponds to the format you want to handle and the struct is the parser/formatter. This works because a struct in C represents a precise memory layout (which is, of course, already binary). See also kervin's and gwaredd's replies.

Score: 1

I don't really understand what kind of library you are looking for. A generic library that will take any binary input and parse it into an unknown format? I'm not sure such a library can exist in any language. I think you need to elaborate on your question a little bit.

Edit:
OK, so after reading Jon's answer, it seems there is such a library, or rather a code generation tool. But as many have stated, just casting the data to the appropriate data structure with appropriate care (i.e. using packed structures and taking care of endian issues) is good enough. Using such a tool with C is overkill.

Score: 1

Basically, the suggestions about casting to a struct work, but please be aware that numbers can be represented differently on different architectures.

To deal with endian issues, network byte order was introduced: the common practice is to convert numbers from host byte order to network byte order before sending the data and to convert back to host order on receipt. See the functions htonl, htons, ntohl and ntohs.

And really consider kervin's advice: read UNP. You won't regret it!
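
A small sketch of that convention, assuming a POSIX system where `htonl` and `ntohl` come from `<arpa/inet.h>`. The helper names `put_u32_network` and `get_u32_network` are made up for the example, and `memcpy` is used instead of a pointer cast so the buffer does not have to be aligned.

```c
#include <arpa/inet.h>   /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Convert to network byte order, then copy the 4 bytes into the buffer. */
void put_u32_network(uint8_t *out, uint32_t host_value)
{
    uint32_t net = htonl(host_value);
    memcpy(out, &net, sizeof net);
}

/* Copy the 4 bytes out of the buffer, then convert back to host order. */
uint32_t get_u32_network(const uint8_t *in)
{
    uint32_t net;
    memcpy(&net, in, sizeof net);
    return ntohl(net);
}
```

Whatever the host's byte order, the bytes placed in the buffer come out most significant first, which is what the receiver expects.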
