Memcpy(), a fast and portable implementation


Original: http://www.vik.cc/daniel/portfolio/memcpy.htm

Note: Please do not repost this article; reposting will be treated as infringement.

Post URL: http://blog.csdn.net/linyt/archive/2009/04/07/4053164.aspx

1. Introduction

A co-worker of mine, Fredrik Bredberg, once made an implementation of memcpy() and he was very proud of the result. His implementation was faster than many standardized C library routines found in the embedded market. When looking at his code, I found several places where improvements could be made. I made an implementation that was quite a lot faster than Fredrik's, and this started a friendly competition to make the fastest portable C implementation of memcpy(). Both our implementations got better and better and looked more and more alike, and finally we had an implementation that was very fast and that beat the native library routines in both Windows and Linux, especially when the memory to be copied is not aligned on a 32-bit boundary.

The following paragraphs describe some of the techniques used in the final implementation.

2. Mimic the CPU's Architecture

One of the biggest advantages of the original Bredberg implementation was that the C code was written to imitate the instruction set of the target processor. He discovered that different processors had different instructions for handling memory pointers. On a Motorola 68K processor the code

*dst8++ = *src8++;

which copies one byte from the address src8 to the address dst8 and increments both pointers, compiles into a single instruction:

MOV.B (A0)+, (A2)+

This piece of code can be put into a while loop that copies memory from the address src to the address dest:

        void *memcpy(void *dest, const void *src, size_t count) {
            char *dst8 = (char *)dest;
            const char *src8 = (const char *)src;

            /* Copy one byte per iteration, post-incrementing both pointers. */
            while (count--) {
                *dst8++ = *src8++;
            }
            return dest;
        }

While this is pretty good for the Motorola processor, it is not very efficient on a PowerPC, which does not have any instructions for post-incrementing pointers. The PowerPC uses four instructions for the same task that required only one instruction on the Motorola processor. However, the PowerPC has a set of load and store instructions that pre-increment pointers, which means that the following code compiles to only two instructions on the PowerPC:

*++dst8 = *++src8;

In order to use this construction, the pointers have to be decremented before the loop begins, and the final code becomes:

        void *memcpy(void *dest, const void *src, size_t count) {
            char *dst8 = (char *)dest;
            const char *src8 = (const char *)src;

            /* Step back one byte so the first pre-increment lands on byte 0. */
            --src8;
            --dst8;

            while (count--) {
                *++dst8 = *++src8;
            }
            return dest;
        }

Unfortunately, the ARM processor has no instructions for either pre-increment or post-increment of pointers. This means that the pointers need to be incremented manually. If the example above is compiled for an ARM processor, the while loop would actually look something like:

        while (count--) {
            dst8[0] = src8[0];
            ++dst8;
            ++src8;
        }

Luckily, the ARM processor has another feature: it can read and write memory at a fixed offset from a base pointer. This resulted in a third way of implementing the same task:

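        void *memcpy(void *dest, const void *src, size_t count) {
            char *dst8 = (char *)dest;
            const char *src8 = (const char *)src;

            /* Copy a single byte first if the count is odd. */
            if (count & 1) {
                dst8[0] = src8[0];
                dst8 += 1;
                src8 += 1;
            }

            /* Copy the rest two bytes per iteration using fixed offsets. */
            count /= 2;
            while (count--) {
                dst8[0] = src8[0];
                dst8[1] = src8[1];

                dst8 += 2;
                src8 += 2;
            }
            return dest;
        }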

Here the loop executes half as many times as in the earlier examples, and the pointers are updated only half as often.
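The same idea extends to more copies per iteration. The following is a minimal sketch, not the author's code: the helper name copy_unrolled4 and the unroll factor of four are assumptions chosen for illustration. It copies the leftover bytes first so that the remaining count is a multiple of four, the same pattern the complete source uses with its `length & 7` loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte copy with the loop body unrolled four times. */
static void *copy_unrolled4(void *dest, const void *src, size_t count) {
    unsigned char *dst8 = (unsigned char *)dest;
    const unsigned char *src8 = (const unsigned char *)src;

    assert(count == 0 || (dest != NULL && src != NULL));

    /* Copy the 0..3 leftover bytes first so the rest is a multiple of 4. */
    while (count & 3) {
        *dst8++ = *src8++;
        count--;
    }

    count /= 4;
    while (count--) {
        /* Fixed offsets from the base pointers; one pointer update per 4 bytes. */
        dst8[0] = src8[0];
        dst8[1] = src8[1];
        dst8[2] = src8[2];
        dst8[3] = src8[3];

        dst8 += 4;
        src8 += 4;
    }
    return dest;
}
```

Whether a larger unroll factor pays off depends on the target and the compiler, so it is worth measuring rather than assuming.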

3. Optimizing memory accesses

In most systems, the CPU clock runs at a much higher frequency than the memory bus. My first improvement to the Bredberg original was to read 32 bits at a time from memory. It is of course possible to read larger chunks of data on some targets with a wider data bus and wider data registers. The goal of the C implementation of memcpy() was to get portable code, mainly for embedded systems. On such systems it is often expensive to use data types like double, and some systems don't have an FPU (Floating Point Unit).

By reading and writing memory in 32-bit blocks as often as possible, the speed of the implementation is increased dramatically, especially when copying data that is not aligned on a 32-bit boundary.
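For the simplest case, where the source and the destination both already start on a 32-bit boundary, word-wise copying could look like the sketch below. The helper name copy_aligned_words is hypothetical, the code assumes a C99 compiler for <stdint.h>, and it assumes the caller's buffers may legitimately be accessed as 32-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: both buffers must start on a 32-bit boundary. */
static void *copy_aligned_words(void *dest, const void *src, size_t count) {
    uint32_t *dst32 = (uint32_t *)dest;
    const uint32_t *src32 = (const uint32_t *)src;
    unsigned char *dst8;
    const unsigned char *src8;
    size_t words = count / 4;

    assert(((uintptr_t)dest & 3) == 0 && ((uintptr_t)src & 3) == 0);

    /* Move four bytes per memory access. */
    while (words--) {
        *dst32++ = *src32++;
    }

    /* Copy the 0..3 trailing bytes one at a time. */
    dst8 = (unsigned char *)dst32;
    src8 = (const unsigned char *)src32;
    count &= 3;
    while (count--) {
        *dst8++ = *src8++;
    }
    return dest;
}
```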

It is, however, quite tricky to do this. The accesses to memory need to be aligned on 32-bit addresses. The implementation needs two temporary variables that implement a 64-bit sliding window, where the source data is kept temporarily while being copied into the destination. The example below shows how this can be done when the destination buffer is aligned on a 32-bit address and the source buffer is 8 bits off the alignment (the shift directions shown are for a big-endian target):

        srcWord = *src32++;

        while (len--) {
            dstWord  = srcWord << 8;
            srcWord  = *src32++;
            dstWord |= srcWord >> 24;
            *dst32++ = dstWord;
        }
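The fragment above leaves out the declarations and uses big-endian shift directions. The following self-contained sketch shows the same sliding window with the directions chosen at run time. The names host_is_little_endian and copy_off_by_one are hypothetical, and the code assumes a C99 compiler for <stdint.h>. Note that it reads len + 1 aligned words from the source, so the source buffer must extend to the end of the last word touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns non-zero on a little-endian host. */
static int host_is_little_endian(void) {
    const uint32_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/*
 * Fills dst32[0..len-1] with the bytes that start one byte past src32.
 * Both pointers must be 32-bit aligned; len + 1 words are read from src32.
 */
static void copy_off_by_one(uint32_t *dst32, const uint32_t *src32, size_t len) {
    const int little = host_is_little_endian();
    uint32_t srcWord;
    uint32_t dstWord;

    assert(((uintptr_t)dst32 & 3) == 0 && ((uintptr_t)src32 & 3) == 0);

    srcWord = *src32++;
    while (len--) {
        if (little) {
            dstWord  = srcWord >> 8;   /* drop the lowest-addressed byte */
            srcWord  = *src32++;
            dstWord |= srcWord << 24;  /* append first byte of the next word */
        } else {
            dstWord  = srcWord << 8;   /* same steps, big-endian directions */
            srcWord  = *src32++;
            dstWord |= srcWord >> 24;
        }
        *dst32++ = dstWord;
    }
}
```

Testing the byte order inside the loop keeps the sketch short; the complete source instead fixes the directions at compile time through its SHL and SHR macros.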

4. Optimizing branches

Another improvement is to make it easier for the compiler to generate code that uses the processor's compare instructions efficiently. This means creating loops that terminate when a compare value is zero. The loop

while (++i < count)

often generates more complicated and less efficient code than

while (count--)

The code also becomes more efficient when the CPU's native loop instructions can be used. A compiler often generates better code for the loop above than for the following loop expression:

while (count -= 2)
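To see the difference on a particular compiler, the two loop styles can be placed side by side and their generated assembly compared. The sketch below uses the hypothetical names copy_count_up and copy_count_down. Both functions copy the same bytes; only the loop test differs. An optimizing compiler may well emit the same code for both, so treat this as something to measure rather than a guaranteed gain.

```c
#include <assert.h>
#include <stddef.h>

/* Counts up and compares the index against a non-zero limit each iteration. */
static void copy_count_up(unsigned char *dst, const unsigned char *src,
                          size_t count) {
    size_t i = 0;

    assert(count == 0 || (dst != NULL && src != NULL));

    while (i < count) {
        dst[i] = src[i];
        ++i;
    }
}

/* Counts down so the loop test is a comparison against zero. */
static void copy_count_down(unsigned char *dst, const unsigned char *src,
                            size_t count) {
    assert(count == 0 || (dst != NULL && src != NULL));

    while (count--) {
        *dst++ = *src++;
    }
}
```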

5. Conclusion

The techniques described here make the C implementation of memcpy() a lot faster, and in many cases faster than commercial ones. The implementation can probably be improved even more, especially by using wider data types when available. If the target and the compiler support 64-bit arithmetic operations such as the shift operator, these techniques can be used to implement a 64-bit version as well. I tried to find a compiler with this support for SPARC but didn't find one. If 64-bit operations can be done in one instruction, the implementation will be faster than the native Solaris memcpy(), which is probably written in assembly.
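As a rough illustration of what such a 64-bit version might look like, here is a sketch of the sliding window widened to 64-bit words and generalized to any source offset from 1 to 7 bytes. It is not part of the original implementation: the names copy_shifted64 and host_is_little_endian64 are hypothetical, and the code assumes a C99 compiler that provides uint64_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns non-zero on a little-endian host. */
static int host_is_little_endian64(void) {
    const uint64_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/*
 * Fills dst64[0..words-1] with the bytes that start `offset` bytes
 * (1..7) past src64. Reads words + 1 aligned words from src64.
 */
static void copy_shifted64(uint64_t *dst64, const uint64_t *src64,
                           size_t words, unsigned offset) {
    const int little = host_is_little_endian64();
    unsigned keep_shift;
    unsigned fill_shift;
    uint64_t srcWord;
    uint64_t dstWord;

    /* offset 0 would need a shift by 64 bits, which C does not define. */
    assert(offset >= 1 && offset <= 7);
    keep_shift = 8u * offset;
    fill_shift = 64u - keep_shift;

    srcWord = *src64++;
    while (words--) {
        if (little) {
            dstWord  = srcWord >> keep_shift;
            srcWord  = *src64++;
            dstWord |= srcWord << fill_shift;
        } else {
            dstWord  = srcWord << keep_shift;
            srcWord  = *src64++;
            dstWord |= srcWord >> fill_shift;
        }
        *dst64++ = dstWord;
    }
}
```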

Another improvement that can be made on some 32-bit architectures is that if both the source and the destination buffers are 64-bit aligned or have the same alignment, a double type can be used to speed up the copying. I tried this with the gcc compiler for x86 and the result was a memcpy() that was faster in almost all cases than the native Linux memcpy().

6. The complete source code

The following code contains the complete memcpy() implementation. The code is configured for an Intel x86 target, but it is easy to change the configuration as desired.

/********************************************************************
** File:     memcpy.c
**
** Copyright (C) 2005 Daniel Vik
**
** This software is provided 'as-is', without any express or implied
** warranty. In no event will the authors be held liable for any
** damages arising from the use of this software.
** Permission is granted to anyone to use this software for any
** purpose, including commercial applications, and to alter it and
** redistribute it freely, subject to the following restrictions:
**
** 1. The origin of this software must not be misrepresented; you
**    must not claim that you wrote the original software. If you
**    use this software in a product, an acknowledgment in the
**    product documentation would be appreciated but is not
**    required.
**
** 2. Altered source versions must be plainly marked as such, and
**    must not be misrepresented as being the original software.
**
** 3. This notice may not be removed or altered from any source
**    distribution.
**
**
** Description: Implementation of the standard library function memcpy.
**             This implementation of memcpy() is ANSI-C89 compatible.
**
**             The following configuration options can be set:
**
**           LITTLE_ENDIAN   - Uses processor with little endian
**                             addressing. Default is big endian.
**
**           PRE_INC_PTRS    - Use pre increment of pointers.
**                             Default is post increment of
**                             pointers.
**
**           INDEXED_COPY    - Copying data using array indexing.
**                             Using this option, disables the
**                             PRE_INC_PTRS option.
**
**
** Best Settings:
**
** Intel x86:  LITTLE_ENDIAN and INDEXED_COPY
**
*******************************************************************/

/********************************************************************
** Configuration definitions.
*******************************************************************/

#define LITTLE_ENDIAN
#define INDEXED_COPY

/********************************************************************
** Includes for size_t definition
*******************************************************************/

#include <stddef.h>

/********************************************************************
** Typedefs
*******************************************************************/

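/* u32 must be exactly 32 bits wide and wide enough to hold a pointer,
** as on the 32-bit targets this code was written for. */
typedef unsigned char  u8;
typedef unsigned short u16;
typedef unsigned long  u32;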

/********************************************************************
** Remove definitions when INDEXED_COPY is defined.
*******************************************************************/

#if defined (INDEXED_COPY)
#if defined (PRE_INC_PTRS)
#undef PRE_INC_PTRS
#endif /*PRE_INC_PTRS*/
#endif /*INDEXED_COPY*/

/********************************************************************
** Definitions for pre and post increment of pointers.
*******************************************************************/

#if defined (PRE_INC_PTRS)

#define INC_VAL(x) *++(x)
#define START_VAL(x) (x)--
#define CAST_32_TO_8(p, o)       (u8 *)((u32)p + o + 4)
#define WHILE_DEST_BREAK         3
#define PRE_LOOP_ADJUST        - 3
#define PRE_SWITCH_ADJUST      + 1

#else /*PRE_INC_PTRS*/

#define START_VAL(x)
#define INC_VAL(x) *(x)++
#define CAST_32_TO_8(p, o)       (u8 *)((u32)p + o)
#define WHILE_DEST_BREAK         0
#define PRE_LOOP_ADJUST
#define PRE_SWITCH_ADJUST

#endif /*PRE_INC_PTRS*/

/********************************************************************
** Definitions for endians
*******************************************************************/

#if defined (LITTLE_ENDIAN)

#define SHL >>
#define SHR <<

#else /*LITTLE_ENDIAN*/

#define SHL <<
#define SHR >>

#endif /*LITTLE_ENDIAN*/

/********************************************************************
** Macros for copying 32 bit words of different alignment.
** Uses incrementing pointers.
*******************************************************************/

#define CP32_INCR() {                       \
    INC_VAL(dst32) = INC_VAL(src32);        \
}

#define CP32_INCR_SH(shl, shr) {            \
    dstWord   = srcWord SHL shl;            \
    srcWord   = INC_VAL(src32);             \
    dstWord  |= srcWord SHR shr;            \
    INC_VAL(dst32) = dstWord;               \
}

/********************************************************************
** Macros for copying 32 bit words of different alignment.
** Uses array indexes.
*******************************************************************/

#define CP32_INDEX(idx) {                   \
    dst32[idx] = src32[idx];                \
}

#define CP32_INDEX_SH(x, shl, shr) {        \
    dstWord   = srcWord SHL shl;            \
    srcWord   = src32[x];                   \
    dstWord  |= srcWord SHR shr;            \
    dst32[x] = dstWord;                     \
}

/********************************************************************
** Macros for copying 32 bit words of different alignment.
** Uses incrementing pointers or array indexes depending on
** configuration.
*******************************************************************/

#if defined (INDEXED_COPY)

#define CP32(idx)               CP32_INDEX(idx)
#define CP32_SH(idx, shl, shr)  CP32_INDEX_SH(idx, shl, shr)

#define INC_INDEX(p, o)         ((p) += (o))

#else /*INDEXED_COPY*/

#define CP32(idx)               CP32_INCR()
#define CP32_SH(idx, shl, shr)  CP32_INCR_SH(shl, shr)

#define INC_INDEX(p, o)

#endif /*INDEXED_COPY*/

/********************************************************************
**
** void *memcpy(void *dest, const void *src, size_t count)
**
** Args:     dest    - pointer to destination buffer
**           src     - pointer to source buffer
**           count   - number of bytes to copy
**
** Return:   A pointer to destination buffer
**
** Purpose:  Copies count bytes from src to dest. No overlap check
**           is performed.
**
*******************************************************************/

void *memcpy(void *dest, const void *src, size_t count) {
    u8 *dst8 = (u8 *)dest;
    u8 *src8 = (u8 *)src;

    if (count < 8) {
        if (count >= 4 && ((((u32)src8 | (u32)dst8)) & 3) == 0) {
            *((u32 *)dst8) = *((u32 *)src8);
            dst8  += 4;
            src8  += 4;
            count -= 4;
        }

        START_VAL(dst8);
        START_VAL(src8);

        while (count--) {
            INC_VAL(dst8) = INC_VAL(src8);
        }

        return dest;
    }

    START_VAL(dst8);
    START_VAL(src8);

    while (((u32)dst8 & 3L) != WHILE_DEST_BREAK) {
        INC_VAL(dst8) = INC_VAL(src8);
        count--;
    }

    switch ((((u32)src8) PRE_SWITCH_ADJUST) & 3L) {
    default:
        {
            u32 *dst32 = (u32 *)(((u32)dst8) PRE_LOOP_ADJUST);
            u32 *src32 = (u32 *)(((u32)src8) PRE_LOOP_ADJUST);
            u32 length = count / 4;

            while (length & 7) {
                CP32_INCR();
                length--;
            }

            length /= 8;

            while (length--) {
                CP32(0);
                CP32(1);
                CP32(2);
                CP32(3);
                CP32(4);
                CP32(5);
                CP32(6);
                CP32(7);

                INC_INDEX(dst32, 8);
                INC_INDEX(src32, 8);
            }

            src8 = CAST_32_TO_8(src32, 0);
            dst8 = CAST_32_TO_8(dst32, 0);

            if (count & 2) {
                *dst8++ = *src8++;
                *dst8++ = *src8++;
            }

            if (count & 1) {
                *dst8 = *src8;
            }

            return dest;
        }

    case 1:
        {
            u32 *dst32  = (u32 *)((((u32)dst8) PRE_LOOP_ADJUST) & ~3L);
            u32 *src32  = (u32 *)((((u32)src8) PRE_LOOP_ADJUST) & ~3L);
            u32 length  = count / 4;
            u32 srcWord = INC_VAL(src32);
            u32 dstWord;

            while (length & 7) {
                CP32_INCR_SH(8, 24);
                length--;
            }

            length /= 8;

            while (length--) {
                CP32_SH(0, 8, 24);
                CP32_SH(1, 8, 24);
                CP32_SH(2, 8, 24);
                CP32_SH(3, 8, 24);
                CP32_SH(4, 8, 24);
                CP32_SH(5, 8, 24);
                CP32_SH(6, 8, 24);
                CP32_SH(7, 8, 24);

                INC_INDEX(dst32, 8);
                INC_INDEX(src32, 8);
            }

            src8 = CAST_32_TO_8(src32, -3);
            dst8 = CAST_32_TO_8(dst32, 0);

            if (count & 2) {
                *dst8++ = *src8++;
                *dst8++ = *src8++;
            }

            if (count & 1) {
                *dst8 = *src8;
            }

            return dest;
        }

    case 2:
        {
            u32 *dst32  = (u32 *)((((u32)dst8) PRE_LOOP_ADJUST) & ~3L);
            u32 *src32  = (u32 *)((((u32)src8) PRE_LOOP_ADJUST) & ~3L);
            u32 length  = count / 4;
            u32 srcWord = INC_VAL(src32);
            u32 dstWord;

            while (length & 7) {
                CP32_INCR_SH(16, 16);
                length--;
            }

            length /= 8;

            while (length--) {
                CP32_SH(0, 16, 16);
                CP32_SH(1, 16, 16);
                CP32_SH(2, 16, 16);
                CP32_SH(3, 16, 16);
                CP32_SH(4, 16, 16);
                CP32_SH(5, 16, 16);
                CP32_SH(6, 16, 16);
                CP32_SH(7, 16, 16);

                INC_INDEX(dst32, 8);
                INC_INDEX(src32, 8);
            }

            src8 = CAST_32_TO_8(src32, -2);
            dst8 = CAST_32_TO_8(dst32, 0);

            if (count & 2) {
                *dst8++ = *src8++;
                *dst8++ = *src8++;
            }

            if (count & 1) {
                *dst8 = *src8;
            }

            return dest;
        }

    case 3:
        {
            u32 *dst32  = (u32 *)((((u32)dst8) PRE_LOOP_ADJUST) & ~3L);
            u32 *src32  = (u32 *)((((u32)src8) PRE_LOOP_ADJUST) & ~3L);
            u32 length  = count / 4;
            u32 srcWord = INC_VAL(src32);
            u32 dstWord;

            while (length & 7) {
                CP32_INCR_SH(24, 8);
                length--;
            }

            length /= 8;

            while (length--) {
                CP32_SH(0, 24, 8);
                CP32_SH(1, 24, 8);
                CP32_SH(2, 24, 8);
                CP32_SH(3, 24, 8);
                CP32_SH(4, 24, 8);
                CP32_SH(5, 24, 8);
                CP32_SH(6, 24, 8);
                CP32_SH(7, 24, 8);

                INC_INDEX(dst32, 8);
                INC_INDEX(src32, 8);
            }

            src8 = CAST_32_TO_8(src32, -1);
            dst8 = CAST_32_TO_8(dst32, 0);

            if (count & 2) {
                *dst8++ = *src8++;
                *dst8++ = *src8++;
            }

            if (count & 1) {
                *dst8 = *src8;
            }

            return dest;
        }
    }
}
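An implementation with this many alignment cases is worth checking against every combination of source offset, destination offset and length. The following harness is a minimal sketch and not part of the original article; the names copy_fn, byte_copy and check_copy are hypothetical. It takes the function under test as a pointer, so the implementation above can be compiled under a different name (to avoid clashing with the library memcpy) and passed in. Both buffers carry a few bytes of padding on each side because a word-wise copy may read the whole aligned word that contains the first or last source byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dest, const void *src, size_t count);

/* Plain byte-by-byte reference copy. */
static void *byte_copy(void *dest, const void *src, size_t count) {
    unsigned char *d = (unsigned char *)dest;
    const unsigned char *s = (const unsigned char *)src;

    while (count--) {
        *d++ = *s++;
    }
    return dest;
}

/*
 * Returns 0 if fn copies correctly for every tested alignment and length
 * and leaves all bytes outside the destination range untouched.
 */
static int check_copy(copy_fn fn) {
    enum { PAD = 8, MAX_LEN = 64, BUF = 96 };
    unsigned char src[BUF];
    unsigned char dst[BUF];
    size_t s, d, len, i;

    assert(fn != NULL);

    for (s = 0; s < 8; s++) {
        for (d = 0; d < 8; d++) {
            for (len = 0; len <= MAX_LEN; len++) {
                size_t so = PAD + s;  /* source start inside src */
                size_t to = PAD + d;  /* destination start inside dst */

                for (i = 0; i < BUF; i++) {
                    src[i] = (unsigned char)(i * 7 + 1);
                    dst[i] = 0xEE;    /* sentinel for untouched bytes */
                }

                if (fn(dst + to, src + so, len) != dst + to) {
                    return 1;
                }

                for (i = 0; i < BUF; i++) {
                    unsigned char expect = 0xEE;
                    if (i >= to && i < to + len) {
                        expect = src[so + (i - to)];
                    }
                    if (dst[i] != expect) {
                        return 1;
                    }
                }
            }
        }
    }
    return 0;
}
```

Calling check_copy(byte_copy) or check_copy(memcpy) is a quick way to confirm the harness itself before pointing it at a new implementation.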