Processing binary structured data with Python


For the Chinese version, see here.

Please keep the original address of this post when reprinting. http://blog.csdn.net/ir0nf1st/article/details/70151190

<0x00> Preface

When processing a binary file or receiving a byte stream from the network, the binary structured data in the stream may contain signed numbers. From the pre-defined stream protocol, the developer already knows the alignment, byte order, word length and sign bit position of the binary structured data. When developing with Python, however, there is no explicit way to pass this information to the Python interpreter, which makes it difficult to process binary data, especially binary signed numbers. This article shows a way to process binary signed numbers correctly with Python.

<0x01> A glimpse into Python numeric type

In many other programming languages, fixed-width signed integers use the most significant bit as the sign bit. Python's numeric types are implemented differently. Here is the CPython definition of the Python long type (all of the following code is based on Python 2.7):

include/longobject.h

/* Long (arbitrary precision) integer object interface */

typedef struct _longobject PyLongObject; /* Revealed in longintrepr.h */

include/longintrepr.h

/* Long integer representation.
   The absolute value of a number is equal to
        SUM(for i=0 through abs(ob_size)-1) ob_digit[i] * 2**(SHIFT*i)
   Negative numbers are represented with ob_size < 0;
   zero is represented by ob_size == 0.
   In a normalized number, ob_digit[abs(ob_size)-1] (the most significant
   digit) is never zero.  Also, in all cases, for all valid i,
        0 <= ob_digit[i] <= MASK.
   The allocation function takes care of allocating extra memory
   so that ob_digit[0] ... ob_digit[abs(ob_size)-1] are actually available.

   CAUTION:  Generic code manipulating subtypes of PyVarObject has to
   aware that longs abuse  ob_size's sign bit.
*/

struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};

include/object.h

/* PyObject_VAR_HEAD defines the initial segment of all variable-size
 * container objects.  These end with a declaration of an array with 1
 * element, but enough space is malloc'ed so that the array actually
 * has room for ob_size elements.  Note that ob_size is an element count,
 * not necessarily a byte count.
 */
#define PyObject_VAR_HEAD               \
    PyObject_HEAD                       \
    Py_ssize_t ob_size; /* Number of items in variable part */

From the source code and comments above, we can see that the Python _longobject (the long type, or long object) uses the 'ob_size' field in PyObject_VAR_HEAD to represent the sign of a number, while its 'ob_digit' field contains only the magnitude.
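The sign-magnitude layout described above can be imitated in a few lines of Python. This is only a toy model, not CPython's implementation: the names `to_digits` and `from_digits` are made up for illustration, and the digit width `SHIFT = 15` is an assumption (the real value depends on how CPython was built).

```python
# Toy model of the sign-magnitude layout (not the real CPython code).
SHIFT = 15                  # assumed digit width; depends on the CPython build
MASK = (1 << SHIFT) - 1


def to_digits(value):
    """Return (ob_size, ob_digit) in the spirit of struct _longobject."""
    magnitude = abs(value)
    digits = []
    while magnitude:
        digits.append(magnitude & MASK)   # least significant digit first
        magnitude >>= SHIFT
    size = len(digits) if value >= 0 else -len(digits)
    return size, digits


def from_digits(size, digits):
    """Rebuild the integer: sign from ob_size, magnitude from the digits."""
    magnitude = sum(d << (SHIFT * i) for i, d in enumerate(digits))
    return -magnitude if size < 0 else magnitude


print(to_digits(500))     # (1, [500])
print(to_digits(-500))    # (-1, [500])
print(to_digits(0))       # (0, [])
```

Note that 500 and -500 share the same digit list; only the sign of the size differs, which is exactly what the comment in longintrepr.h describes.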

On the other hand, when we initialize a Python long object (that is, assign a numeric value to it), the Python interpreter does not treat the most significant bit of the number as a sign bit but as an ordinary value bit.

Let's take the number -500 as an example to show what happens. To keep it simple, I use a 16-bit word length here.

We know that a negative number is represented in a computer by its two's complement, so first convert -500 to its two's complement:

Representation                                            Value
Decimal                                                   -500
Hexadecimal                                               -0x 01 F4
Binary                                                    -0b 0000 0001 1111 0100
Sign-magnitude (put the sign into the MSB)                0b 1000 0001 1111 0100
One's complement (invert all bits except the sign bit)    0b 1111 1110 0000 1011
Two's complement (add one to the one's complement)        0b 1111 1110 0000 1100
Two's complement in hexadecimal                           0x FE 0C
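The conversion steps above can be cross-checked with a short sketch. The variable names are illustrative only. Inverting all 16 bits of the magnitude gives the same pattern as inverting every bit except the sign bit of the sign-magnitude form.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1                          # 0xFFFF

magnitude = 500                                  # 0x01F4
ones_complement = magnitude ^ MASK               # invert all 16 bits
twos_complement = (ones_complement + 1) & MASK   # add one, stay in 16 bits

print(format(ones_complement, '016b'))   # 1111111000001011
print(format(twos_complement, '016b'))   # 1111111000001100
print(hex(twos_complement))              # 0xfe0c

# Masking a negative Python integer yields the same bit pattern directly.
shortcut = -500 & MASK
print(shortcut == twos_complement)       # True
```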
The following code simulates receiving the string '\xFE\x0C' from a stream, then processing it and assigning it to a Python integer object.

>>> stream = '\xFE\x0C'
>>> number = (ord(stream[0]) << 8) + ord(stream[1])
>>> '0x{:0X}'.format(number)
'0xFE0C'
>>> print number
65036
>>>

As explained, the result is not what we want: the code needs to be revised so that a Python number object is initialized correctly from a binary signed number.
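The same effect can be reproduced for a byte string of any length. The following is a minimal sketch with a hypothetical helper name, `unsigned_from_big_endian`. It uses `bytearray` so that it runs on both Python 2.7 and Python 3. Every bit, including the top one, ends up as part of the magnitude.

```python
def unsigned_from_big_endian(data):
    """Combine big-endian bytes into an integer; no bit is treated as a sign."""
    value = 0
    for byte in bytearray(data):   # bytearray yields ints on Python 2 and 3
        value = (value << 8) | byte
    return value


number = unsigned_from_big_endian(b'\xFE\x0C')
print(number)                      # 65036, not -500
```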

<0x02> Convey sign information using minus sign

Now that we know Python does not take sign information from the sign bit of a number, we need another way to convey the sign to the Python interpreter so that it processes our number correctly. The minus sign '-', the unary negation operator in Python, can be used for this purpose.

Let's take a look at how the negation operator is implemented in CPython:

objects/longobject.c

static PyObject *
long_neg(PyLongObject *v)
{
    PyLongObject *z;
    if (v->ob_size == 0 && PyLong_CheckExact(v)) {
        /* -0 == 0 */
        Py_INCREF(v);
        return (PyObject *) v;
    }
    z = (PyLongObject *)_PyLong_Copy(v);
    if (z != NULL)
        z->ob_size = -(v->ob_size);
    return (PyObject *)z;
}

So the negation operator only negates the 'ob_size' field of a long object and leaves the 'ob_digit' field untouched.

We also know that if a number 'value' is negative, then value = -abs(value). If we pass the negation operator along with the absolute value of a negative number to the Python interpreter, it will handle the negative number correctly.

Continuing with the above example, we have the two's complement of a negative number; the next step is to calculate its absolute value from its two's complement.

The pseudocode looks like this:

if sign_bit_of_value is 1 {
    abs_value = bit_wise_invert(value - 1)
}

Implementation in Python:

>>> number = 0xFE0C
>>> if (number & 0x8000) != 0:
...     number = -((number - 1) ^ 0xFFFF)
...
>>> print number
-500
>>>
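The 16-bit case above generalizes to any word length. The following is a minimal sketch, assuming the raw value has already been read as an unsigned integer. The helper name `to_signed` and its `bits` parameter are hypothetical, not part of any library.

```python
def to_signed(value, bits=16):
    """Interpret an unsigned `bits`-wide integer as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    mask = (1 << bits) - 1
    if value & sign_bit:
        # Same steps as above: subtract one, invert within the word, negate.
        return -((value - 1) ^ mask)
    return value


print(to_signed(0xFE0C))       # -500
print(to_signed(0x01F4))       # 500
print(to_signed(0xFF, 8))      # -1
print(to_signed(0x8000))       # -32768
```

For a value with the sign bit set, the result equals `value - (1 << bits)`, which is a common shorter way to write the same conversion.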

You have probably noticed that I did not use '~', the Python invert operator, but an exclusive or with 0xFFFF to implement bit_wise_invert.

What happens if we use the invert operator instead?

>>> number = 0xFE0C
>>> if (number & 0x8000) != 0:
...     number = -(~(number - 1))
...
>>> print number
65036
>>>

Take a look at the CPython implementation of the invert operator:

static PyObject *
long_invert(PyLongObject *v)
{
    /* Implement ~x as -(x+1) */
    PyLongObject *x;
    PyLongObject *w;
    w = (PyLongObject *)PyLong_FromLong(1L);
    if (w == NULL)
        return NULL;
    x = (PyLongObject *) long_add(v, w);
    Py_DECREF(w);
    if (x == NULL)
        return NULL;
    Py_SIZE(x) = -(Py_SIZE(x));
    return (PyObject *)x;
}

As stated by the comment in the code, the invert operator computes ~x as -(x+1). It does not flip the bits within a fixed 16-bit word, so on its own it cannot be used for our purpose.
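If '~' is preferred anyway, its result can be confined to the word length with a mask. This is a small sketch of that variation, not something taken from the CPython source: masking with 0xFFFF keeps only the low 16 bits of the (negative) result, which gives the same value as the exclusive or.

```python
number = 0xFE0C

inverted = ~(number - 1)       # -(0xFE0B + 1) == -65036, not a 16-bit value
masked = inverted & 0xFFFF     # keep the low 16 bits: 0x01F4 == 500
result = -masked

print(inverted)                # -65036
print(masked)                  # 500
print(result)                  # -500
```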

<0x03> A more general module for binary structured data processing in Python

The above sections revealed some of Python's internal implementation of numeric types and may help you understand Python a little more deeply. There is also a struct module that can be used to process binary structured data. It is more developer-friendly and has stronger error handling and reporting mechanisms. For real application development, I suggest using struct.

Here is sample code using struct:

>>> import struct
>>> stream = '\xFE\x0C'
>>> number, = struct.unpack('>h', stream)
>>> print number
-500
>>>
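The format string is where the byte order, word length and signedness known from the protocol are handed to Python. The sketch below uses a made-up 7-byte record layout purely for illustration; the format characters themselves ('>', '<', 'B', 'h', 'H', 'I') are standard struct codes. Byte literals are used so that it runs on both Python 2.7 and Python 3.

```python
import struct

# Hypothetical big-endian record: 1-byte tag, signed 16-bit value,
# unsigned 32-bit counter. With '>' no padding is inserted between fields.
RECORD_FORMAT = '>BhI'
record = b'\x01\xFE\x0C\x00\x00\x01\x00'

tag, value, counter = struct.unpack(RECORD_FORMAT, record)
# tag == 1, value == -500, counter == 256

# The same two bytes under other interpretations.
(as_unsigned,) = struct.unpack('>H', b'\xFE\x0C')   # unsigned: 65036
(as_little,) = struct.unpack('<h', b'\x0C\xFE')     # little-endian: -500

# Python 3 only: int.from_bytes converts a single integer the same way.
if hasattr(int, 'from_bytes'):
    from_bytes_value = int.from_bytes(b'\xFE\x0C', byteorder='big', signed=True)
else:
    from_bytes_value = value
```

If the byte string is shorter or longer than the format requires, `struct.unpack` raises `struct.error` instead of silently producing a wrong number.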

See the struct module documentation for a more detailed introduction.

0 0