gem5 prefetcher


I have recently been running prefetching experiments on gem5 and adding my own prefetching algorithm, in this case a hardware stream prefetcher. I had to fix several bugs before the experiment ran, so this post records the process.
Steps for adding your own prefetching algorithm to gem5:
(1) In gem5-master/configs/common/Caches.py, enable prefetching:

class L1Cache(Cache):
    assoc = 2
    tag_latency = 2
    data_latency = 2
    response_latency = 2
    mshrs = 4
    tgts_per_mshr = 20
    prefetcher = StridePrefetcher(degree=8, latency=1.0)  # add this line
    # prefetch_policy='tagged'  # on older versions, add this line

class L2Cache(Cache):
    assoc = 8
    tag_latency = 20
    data_latency = 20
    response_latency = 20
    mshrs = 20
    tgts_per_mshr = 12
    write_buffers = 8
    prefetcher = StridePrefetcher(degree=8, latency=1.0)  # add this line
    # prefetch_policy='tagged'  # on older versions, add this line

(2) Under gem5-master/src/mem/cache/prefetch/, add your own prefetching algorithm, mainly the files stream.hh and stream.cc.
(3) Declare the new SimObject in Prefetcher.py in the same directory:

class StreamPrefetcher(QueuedPrefetcher):
    type = 'StreamPrefetcher'
    cxx_class = 'StreamPrefetcher'
    cxx_header = "mem/cache/prefetch/stream.hh"
    table_sets = Param.Int(16, "Number of sets in PC lookup table")
    table_assoc = Param.Int(4, "Associativity of PC lookup table")
    tableSize = Param.Int(8, "Number of entries in each stream table")
    distance = Param.Int(5, "Prefetch distance of a stream")
    use_master_id = Param.Bool(True, "Use master id based history")
    degree = Param.Int(4, "Number of prefetches to generate")

(4) Register the source file in the SConscript in the same directory:

Import('*')

SimObject('Prefetcher.py')

Source('base.cc')
Source('queued.cc')
Source('stride.cc')
Source('tagged.cc')
Source('stream.cc')  # add this line

Note: if this entry is missing, the build fails with:

build/X86/python/_m5/param_StreamPrefetcher_wrap.o: In function `_wrap_StreamPrefetcherParams_create':
/home/jyf/download/gem_nvmain/gem5-master/build/X86/python/_m5/param_StreamPrefetcher_wrap.cc:4549: undefined reference to `StreamPrefetcherParams::create()'
collect2: error: ld returned 1 exit status
scons: *** [build/X86/gem5.opt] Error 1
scons: building terminated because of errors.

The cause is that, without this entry, stream.cc is never compiled into stream.o, so the linker cannot find StreamPrefetcherParams::create(), which is defined in stream.cc.
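A quick way to catch this before a long build is to compare the .cc files in the directory with the Source(...) lines of the SConscript. The following is a minimal sketch, not part of gem5: the helper name unlisted_sources and the simple regular expression are assumptions for illustration, and it only recognises plain Source('file.cc') lines that are not commented out.

```python
import re

def unlisted_sources(sconscript_text, cc_files):
    """Return the .cc files that no active Source('...') line registers."""
    # Match only lines that start with Source(...); a commented-out
    # "# Source('x.cc')" line does not count as registered.
    listed = set(re.findall(r"^\s*Source\(\s*['\"]([^'\"]+)['\"]",
                            sconscript_text, re.MULTILINE))
    return sorted(f for f in cc_files if f not in listed)

if __name__ == "__main__":
    sconscript = (
        "Import('*')\n"
        "SimObject('Prefetcher.py')\n"
        "Source('base.cc')\n"
        "Source('queued.cc')\n"
        "Source('stride.cc')\n"
        "Source('tagged.cc')\n"
    )
    files = ["base.cc", "queued.cc", "stride.cc", "tagged.cc", "stream.cc"]
    # Any file reported here would be missing from the link step.
    print(unlisted_sources(sconscript, files))
```

In a real tree the file list could come from os.listdir() on the prefetch directory and the text from reading its SConscript.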

The build generates stream.o, StreamPrefetcher.hh (under gem5-master/build/ARM/params/) and param_StreamPrefetcher_wrap.cc (under build/ARM/python/_m5/). All of these files are tied to StreamPrefetcher * create().
(5) The hardware stream prefetcher code I found is fairly old and does not match the current gem5 version, so stream.cc and stream.hh need to be modified as well:

In stream.cc, change the element type of the address vector from Addr to AddrPriority:

void
StreamPrefetcher::calculatePrefetch(const PacketPtr &pkt,
                                    std::vector<AddrPriority> &addresses)
{
    // Skip requests that carry no context (core) ID. Testing
    // !pkt->req->contextId() here would also drop every request from core 0.
    //if (core_id < 0) {
    if (!pkt->req->hasContextId()) {
        DPRINTF(HWPrefetch, "ignoring request with no core ID\n");
        return;
    }
    uint32_t core_id = pkt->req->contextId();
    .......
    for (uint8_t d = 1; d <= degree; d++) {
        Addr pf_addr = table[i]->endAddr + blkSize * d;
        AddrPriority addrp;          // pair of (address, priority)
        addrp.first = pf_addr;
        addresses.push_back(addrp);
        //addresses.push_back(pf_addr);
        DPRINTF(HWPrefetch, "Queuing prefetch to %#x.\n", pf_addr);
    }
    ......
    for (uint8_t d = 1; d <= degree; d++) {
        Addr pf_addr = table[i]->endAddr - blkSize * d;
        AddrPriority addrp;          // pair of (address, priority)
        addrp.first = pf_addr;
        addresses.push_back(addrp);
        //addresses.push_back(pf_addr);
        DPRINTF(HWPrefetch, "Queuing prefetch to %#x.\n", pf_addr);
    }
}
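To make the two loops above concrete, here is a small standalone Python sketch of the address arithmetic only. It is not gem5 code: the function name prefetch_addresses and the example values (a 64-byte block, degree 4, stream end at 0x1000) are assumptions chosen for illustration.

```python
def prefetch_addresses(end_addr, blk_size, degree, ascending=True):
    """Block addresses queued after a hit on a monitored stream.

    Mirrors pf_addr = endAddr +/- blkSize * d for d = 1..degree.
    """
    step = blk_size if ascending else -blk_size
    return [end_addr + step * d for d in range(1, degree + 1)]

if __name__ == "__main__":
    # Ascending stream: the next four blocks past the current end.
    print([hex(a) for a in prefetch_addresses(0x1000, 64, 4)])
    # Descending stream: the four blocks below the current end.
    print([hex(a) for a in prefetch_addresses(0x1000, 64, 4, ascending=False)])
```

Each returned address corresponds to one AddrPriority entry pushed onto the addresses vector.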

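In stream.hh, make the same change from Addr to AddrPriority:

void calculatePrefetch(const PacketPtr &pkt, std::vector<AddrPriority> &addresses);

Without this change the compiler reports the error below, saying that the subclass does not implement the parent class's virtual function. The real cause is the version mismatch: a calculatePrefetch that takes std::vector<Addr> has a different signature, so it does not override the parent's pure virtual function and StreamPrefetcher remains abstract.

stream.cc:182:34: error: invalid new-expression of abstract class type 'StreamPrefetcher'
    virtual void calculatePrefetch(const PacketPtr &pkt, std::vector<AddrPriority> &addresses);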

(6) Set cpu-type to timing.
The relevant file is gem5-master/configs/common/CpuConfig.py.
Source walkthrough:
The available CPU types are listed below; the default is detailed.

_cpu_aliases_all = [
    ("timing", "TimingSimpleCPU"),
    ("atomic", "AtomicSimpleCPU"),
    ("minor", "MinorCPU"),
    ("detailed", "DerivO3CPU"),
    ("kvm", ("ArmKvmCPU", "ArmV8KvmCPU", "X86KvmCPU")),
    ("trace", "TraceCPU"),
    ]
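The alias table maps a short name given on the command line to a CPU class name, and an alias with several candidates (kvm) is resolved to whichever candidate is available. The sketch below is a simplified standalone model of that lookup, not the code in CpuConfig.py: the function resolve_cpu_alias and the available set are hypothetical names used only for illustration.

```python
CPU_ALIASES = [
    ("timing", "TimingSimpleCPU"),
    ("atomic", "AtomicSimpleCPU"),
    ("minor", "MinorCPU"),
    ("detailed", "DerivO3CPU"),
    ("kvm", ("ArmKvmCPU", "ArmV8KvmCPU", "X86KvmCPU")),
    ("trace", "TraceCPU"),
]

def resolve_cpu_alias(name, available):
    """Return the first class name for alias `name` found in `available`."""
    for alias, targets in CPU_ALIASES:
        if alias != name:
            continue
        # A plain string is a single candidate; a tuple is a priority list.
        if isinstance(targets, str):
            targets = (targets,)
        for target in targets:
            if target in available:
                return target
    return None  # unknown alias, or no candidate compiled in

if __name__ == "__main__":
    built = {"TimingSimpleCPU", "AtomicSimpleCPU", "DerivO3CPU", "X86KvmCPU"}
    for alias in ("timing", "detailed", "kvm", "minor"):
        print(alias, "->", resolve_cpu_alias(alias, built))
```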

Change the check to timing, i.e. m5.objects.TimingSimpleCPU:

def config_etrace(cpu_cls, cpu_list, options):
    if issubclass(cpu_cls, m5.objects.TimingSimpleCPU):
        # Assign the same file name to all cpus for now. This must be
        # revisited when creating elastic traces for multi processor systems.
        for cpu in cpu_list:
            # Attach the elastic trace probe listener. Set the protobuf trace
            # file names. Set the dependency window size equal to the cpu it
            # is attached to.
            cpu.traceListener = m5.objects.ElasticTrace(
                                instFetchTraceFile = options.inst_trace_file,
                                dataDepTraceFile = options.data_trace_file,
                                depWindowSize = 3 * cpu.numROBEntries)
            # Make the number of entries in the ROB, LQ and SQ very
            # large so that there are no stalls due to resource
            # limitation as such stalls will get captured in the trace
            # as compute delay. For replay, ROB, LQ and SQ sizes are
            # modelled in the Trace CPU.
            cpu.numROBEntries = 512
            cpu.LQEntries = 128
            cpu.SQEntries = 128
    else:
        fatal("%s does not support data dependency tracing. Use a CPU model of"
              " type or inherited from TimingSimpleCPU.", cpu_cls)

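(7) Rebuild:

sudo scons EXTRAS=../nvmain ./build/ARM/gem5.opt

This may fail with yet another error:

No module name specified using %module or -module.
scons: *** [build/ARM/python/_m5/param_VirtIO9PBase_wrap.cc] Error 1

This one was baffling. In the end I deleted the output of the previous build:

rm -rf ARM

and rebuilt, and this time the build succeeded.
Note: if you change any source code between builds, it is safest to delete the build output and rebuild from scratch; otherwise you can run into baffling errors like the one above.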

The source of stream.cc and stream.hh follows.
stream.cc

#include "debug/HWPrefetch.hh"
#include "mem/cache/prefetch/stream.hh"

StreamPrefetcher::StreamPrefetcher(const StreamPrefetcherParams *p)
    : QueuedPrefetcher(p),
      tableSize(p->tableSize),
      useMasterId(p->use_master_id),
      degree(p->degree),
      distance(p->distance)
{
    for (int i = 0; i < MaxContexts; i++) {
        StreamTable[i] = new StreamTableEntry*[tableSize];
        for (int j = 0; j < tableSize; j++) {
            // One entry per slot; the slot index seeds the LRU order.
            StreamTable[i][j] = new StreamTableEntry;
            StreamTable[i][j]->LRU_index = j;
            resetEntry(StreamTable[i][j]);
        }
    }
}

StreamPrefetcher::~StreamPrefetcher()
{
    for (int i = 0; i < MaxContexts; i++) {
        for (int j = 0; j < tableSize; j++) {
            delete StreamTable[i][j];
        }
        delete[] StreamTable[i];  // also free the per-core pointer array
    }
}

// Training and prefetching of streams
void
StreamPrefetcher::calculatePrefetch(const PacketPtr &pkt,
                                    std::vector<AddrPriority> &addresses)
{
    // Skip requests that carry no context (core) ID. Testing
    // !pkt->req->contextId() here would also drop every request from core 0.
    if (!pkt->req->hasContextId()) {
        DPRINTF(HWPrefetch, "ignoring request with no core ID\n");
        return;
    }
    uint32_t core_id = pkt->req->contextId();

    // Cache-block-aligned address
    Addr blk_addr = pkt->getAddr() & ~(Addr)(blkSize - 1);

    assert(core_id < MaxContexts);
    StreamTableEntry **table = StreamTable[core_id];  // per-core training

    // Index of the entry this access hits; tableSize means "no hit".
    // A break inside the switch only leaves the switch, so the hit is
    // recorded explicitly and the loop stops right after it.
    uint32_t HIT_index = tableSize;

    for (uint32_t i = 0; i < tableSize && HIT_index == tableSize; i++) {
        switch (table[i]->status) {
          case MONITOR:
            if (table[i]->trainedDirection == ASCENDING) {
                // Ascending order
                if ((table[i]->startAddr < blk_addr) &&
                    (table[i]->endAddr > blk_addr)) {
                    // Hit on a monitored stream: issue prefetch requests
                    // based on the degree and the direction.
                    for (uint8_t d = 1; d <= degree; d++) {
                        Addr pf_addr = table[i]->endAddr + blkSize * d;
                        addresses.push_back(AddrPriority(pf_addr, 0));
                        DPRINTF(HWPrefetch, "Queuing prefetch to %#x.\n",
                                pf_addr);
                    }
                    if ((table[i]->endAddr + blkSize * degree) -
                        table[i]->startAddr <= distance) {
                        table[i]->endAddr = table[i]->endAddr + blkSize * degree;
                    } else {
                        table[i]->startAddr = table[i]->startAddr + blkSize * degree;
                        table[i]->endAddr = table[i]->endAddr + blkSize * degree;
                    }
                    HIT_index = i;
                }
            } else if (table[i]->trainedDirection == DESCENDING) {
                // Descending order
                if ((table[i]->startAddr > blk_addr) &&
                    (table[i]->endAddr < blk_addr)) {
                    for (uint8_t d = 1; d <= degree; d++) {
                        Addr pf_addr = table[i]->endAddr - blkSize * d;
                        addresses.push_back(AddrPriority(pf_addr, 0));
                        DPRINTF(HWPrefetch, "Queuing prefetch to %#x.\n",
                                pf_addr);
                    }
                    if (table[i]->startAddr -
                        (table[i]->endAddr - blkSize * degree) <= distance) {
                        table[i]->endAddr = table[i]->endAddr - blkSize * degree;
                    } else {
                        table[i]->startAddr = table[i]->startAddr - blkSize * degree;
                        table[i]->endAddr = table[i]->endAddr - blkSize * degree;
                    }
                    HIT_index = i;
                }
            } else {
                assert(0);
            }
            break;

          case TRAINING: {
            // Addr is unsigned, so compute the absolute distance to the
            // allocating address explicitly instead of calling abs().
            Addr delta = (blk_addr > table[i]->allocAddr)
                ? blk_addr - table[i]->allocAddr
                : table[i]->allocAddr - blk_addr;
            // Check whether the address is within +/- distance/2 blocks
            if (delta <= (distance / 2) * blkSize) {
                StreamDirection dir = (blk_addr > table[i]->allocAddr)
                    ? ASCENDING : DESCENDING;
                if (table[i]->trendDirection[0] == INVALID) {
                    table[i]->trendDirection[0] = dir;
                } else {
                    assert(table[i]->trendDirection[1] == INVALID);
                    table[i]->trendDirection[1] = dir;
                    if (table[i]->trendDirection[0] ==
                        table[i]->trendDirection[1]) {
                        table[i]->trainedDirection = table[i]->trendDirection[0];
                        table[i]->startAddr = table[i]->allocAddr;
                        // Place the end of the stream `degree` blocks away
                        // in the trained direction. Multiplying the enum
                        // value (-1) by unsigned operands would wrap around.
                        Addr span = (Addr)blkSize * degree;
                        table[i]->endAddr =
                            (table[i]->trainedDirection == ASCENDING)
                            ? blk_addr + span : blk_addr - span;
                        // Entry is ready for issuing prefetch requests
                        table[i]->status = MONITOR;
                    } else {
                        resetEntry(table[i]);
                    }
                }
                HIT_index = i;
            }
            break;
          }

          default:
            break;
        }  // End of switch
    }  // End of for loop

    int INVALID_index = tableSize;
    for (int i = 0; i < tableSize; i++) {
        // Find an empty entry
        if (table[i]->status == INV) {
            INVALID_index = i;
            break;
        }
    }

    int TEMP_index = -1;
    int LRU_index = -1000000;
    for (int i = 0; i < tableSize; i++) {
        // Find the least recently used entry (largest LRU_index)
        if (table[i]->LRU_index > TEMP_index) {
            TEMP_index = table[i]->LRU_index;
            LRU_index = i;
        }
    }
    assert(TEMP_index == tableSize - 1);

    int entry_id;
    if (HIT_index != tableSize) {  // hit
        entry_id = HIT_index;
    } else if (INVALID_index != tableSize) {
        // An invalid stream entry exists: allocate it
        assert(table[INVALID_index]->status == INV);
        table[INVALID_index]->status = TRAINING;
        table[INVALID_index]->allocAddr = blk_addr;
        entry_id = INVALID_index;
    } else {
        // Replace the LRU stream entry
        assert(table[LRU_index]->status != INV);
        resetEntry(table[LRU_index]);
        table[LRU_index]->status = TRAINING;
        table[LRU_index]->allocAddr = blk_addr;
        entry_id = LRU_index;
    }

    // Age every entry that was more recent than entry_id, then make
    // entry_id the most recently used one.
    for (int i = 0; i < tableSize; i++) {
        if (table[i]->LRU_index < table[entry_id]->LRU_index) {
            table[i]->LRU_index = table[i]->LRU_index + 1;
        }
    }
    table[entry_id]->LRU_index = 0;
}

void
StreamPrefetcher::resetEntry(StreamTableEntry *this_entry)
{
    this_entry->status            = INV;
    this_entry->trendDirection[0] = INVALID;
    this_entry->trendDirection[1] = INVALID;
    this_entry->allocAddr         = 0;
    this_entry->startAddr         = 0;
    this_entry->endAddr           = 0;
    this_entry->trainedDirection  = INVALID;
}

StreamPrefetcher*
StreamPrefetcherParams::create()
{
    return new StreamPrefetcher(this);
}
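The training logic in calculatePrefetch is easier to follow on a single entry. The Python sketch below is a simplified model written for this post, not a translation of the gem5 code: it folds entry allocation into the same function, ignores the table and LRU handling, and the defaults (64-byte blocks, degree 4, distance 5) are assumptions taken from the parameter defaults shown earlier.

```python
INV, TRAINING, MONITOR = 0, 1, 2
ASCENDING, DESCENDING, INVALID = 1, -1, 0

class StreamEntry:
    """One stream-table entry, reduced to the fields used in training."""
    def __init__(self):
        self.reset()

    def reset(self):
        self.status = INV
        self.alloc_addr = 0
        self.start_addr = 0
        self.end_addr = 0
        self.trend = []          # up to two observed directions
        self.direction = INVALID

def train(entry, blk_addr, blk_size=64, degree=4, distance=5):
    """Feed one block-aligned address to a single entry."""
    if entry.status == INV:
        # First access allocates the entry and starts training.
        entry.status = TRAINING
        entry.alloc_addr = blk_addr
        return
    if entry.status != TRAINING:
        return
    # Only accesses within +/- distance/2 blocks of alloc_addr count.
    if abs(blk_addr - entry.alloc_addr) > (distance // 2) * blk_size:
        return
    entry.trend.append(ASCENDING if blk_addr > entry.alloc_addr else DESCENDING)
    if len(entry.trend) < 2:
        return
    if entry.trend[0] == entry.trend[1]:
        # Two consistent observations: the stream is trained.
        entry.direction = entry.trend[0]
        entry.start_addr = entry.alloc_addr
        entry.end_addr = blk_addr + entry.direction * blk_size * degree
        entry.status = MONITOR
    else:
        entry.reset()

if __name__ == "__main__":
    e = StreamEntry()
    for addr in (0x1000, 0x1040, 0x1080):
        train(e, addr)
    print(e.status == MONITOR, hex(e.start_addr), hex(e.end_addr))
```

With these assumed defaults, three accesses at consecutive blocks are enough to move an entry from INV through TRAINING to MONITOR, while two accesses in opposite directions reset it.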

stream.hh

#ifndef __MEM_CACHE_PREFETCH_STREAM_HH__
#define __MEM_CACHE_PREFETCH_STREAM_HH__

#include "mem/cache/prefetch/queued.hh"
#include "params/StreamPrefetcher.hh"

// Direction of stream for each stream entry in the stream table
enum StreamDirection {
    ASCENDING = 1,      // For example - A, A+1, A+2
    DESCENDING = -1,    // For example - A, A-1, A-2
    INVALID = 0
};

// Status of a stream entry in the stream table
enum StreamStatus {
    INV      = 0,
    TRAINING = 1,       // Stream training is not over yet. Once trained,
                        // the entry moves to MONITOR status
    MONITOR  = 2        // Monitor and request: stream entry ready for
                        // issuing prefetch requests
};

class StreamPrefetcher : public QueuedPrefetcher
{
  protected:
    // Creates per-core stream tables for up to 64 processor cores
    static const uint32_t MaxContexts = 64;
    uint32_t tableSize;      // Number of entries in a stream table
    const bool useMasterId;  // Use the master-id to train the streams
    uint32_t degree;         // Number of prefetch requests issued at a time
    uint32_t distance;       // Determines the prefetch distance

    /* StreamTableEntry
       Stores the basic attributes of a stream table entry.
    */
    class StreamTableEntry
    {
      public:
        int  LRU_index;
        Addr allocAddr;                     // Address that initiated the stream training
        Addr startAddr;                     // First address of a stream
        Addr endAddr;                       // Last address of a stream
        StreamDirection trainedDirection;   // Direction of trained stream (ascending or descending)
        StreamStatus    status;             // Status of the stream entry
        StreamDirection trendDirection[2];  // Stores the last two stream directions of an entry
    };

    void resetEntry(StreamTableEntry *this_entry);

    /* Creating a StreamTable for each core with
       tableSize as the number of stream entries
    */
    StreamTableEntry **StreamTable[MaxContexts];

  public:
    StreamPrefetcher(const StreamPrefetcherParams *p);
    ~StreamPrefetcher();

    /* Function called by cache controller to initiate
       the stream training process
    */
    void calculatePrefetch(const PacketPtr &pkt,
                           std::vector<AddrPriority> &addresses);
};

#endif // __MEM_CACHE_PREFETCH_STREAM_HH__
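The LRU_index field declared in StreamTableEntry acts as a recency rank: 0 marks the most recently used entry and tableSize - 1 the least recently used one, which is the replacement victim. The standalone Python sketch below models only that rank update on a plain list. The function names touch and lru_victim are made up for this illustration and do not exist in gem5.

```python
def touch(ranks, entry_id):
    """Make entry_id the most recently used entry.

    Every entry that was more recent than entry_id ages by one step,
    so the ranks stay a permutation of 0..len(ranks)-1.
    """
    old = ranks[entry_id]
    for i in range(len(ranks)):
        if ranks[i] < old:
            ranks[i] += 1
    ranks[entry_id] = 0

def lru_victim(ranks):
    """Index of the least recently used entry (largest rank)."""
    return ranks.index(max(ranks))

if __name__ == "__main__":
    ranks = [0, 1, 2, 3]       # initial state: slot j has rank j
    touch(ranks, 2)            # entry 2 is hit
    print(ranks, "victim:", lru_victim(ranks))
    touch(ranks, lru_victim(ranks))  # victim is replaced and becomes MRU
    print(ranks, "victim:", lru_victim(ranks))
```

Keeping the ranks a permutation is what lets stream.cc assert that the largest rank always equals tableSize - 1.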

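Reference:
gem5 prefetching experiment (gem5预取实验)
When adding your own prefetcher, it helps to consult the prefetchers that ship with gem5, such as stride.cc and stride.hh.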
