A Parallel Full-System Emulator for RISC Architecture Host


H.-Y. Jeong et al. (eds.), Advanced in Computer Science and Its Applications,
Lecture Notes in Electrical Engineering 279,
DOI: 10.1007/978-3-642-41674-3_145, © Springer-Verlag Berlin Heidelberg 2014
Xiao-Wu Jiang, Xiang-Lan Chen, Huang Wang, and Hua-Ping Chen
Department of Computer Science, University of Science and Technology of China
No. 443, Huangshan Road, Baohe District, Hefei City, Anhui Province, PRC
{wesker,ustc}@mail.ustc.edu.cn,
{xlanchen,hpchen}@ustc.edu.cn
Abstract. In this paper, we port a parallel full-system emulator to a RISC host to
achieve higher performance by utilizing all the cores of the physical CPU. In contrast,
a traditional full-system emulator emulates an SMP machine sequentially and can use
only one core of the host machine. We mainly deal with the translation of atomic
instructions to RISC ll/sc pairs, and we apply lightweight lock-free FIFO queue
algorithms using both interleaving and non-interleaving ll/sc pairs. Tests show that
the parallel full-system emulator is about 3x faster than sequential QEMU when
emulating a 4-core machine on a 4-core host.
Keywords: Parallel emulation, atomic instruction, lock-free queue.
1 Introduction
RISC is a type of microprocessor architecture that uses a small, highly optimized set
of instructions. The first RISC projects appeared in the late 1970s and early 1980s:
the IBM 801, Stanford MIPS, and Berkeley RISC I and II. In the 1980s and early 1990s,
a wide variety of similar RISC processors were used in the Unix workstation market as
well as in printers, routers, etc. By the beginning of the 21st century, RISC
architectures dominated the majority of low-end and mobile systems, mainly because of
their low power consumption and low cost compared to x86. Today the typical RISC
architectures are ARM and MIPS.
As the performance of a single processor has nearly reached its ceiling, other
technologies are used to keep pace with Moore's Law. Symmetric multiprocessing is the
most efficient one. Manufacturers typically integrate multiple cores onto a single
integrated circuit die, which is known as a chip multiprocessor (CMP). Around 2006,
Intel and AMD introduced the first x86 CMP CPUs for desktop users, the Core Duo and the
Athlon 64 X2. Four years later, the ARMv7-A based multi-core Cortex-A9 and the
MIPS64-compatible quad-core Loongson 3A became available. Desktop CPUs have now reached
ten cores (e.g. Intel Xeon E7-2850) and mobile CPUs have reached eight cores (e.g.
Samsung Exynos 5 Octa).
RISC architectures are now moving towards the desktop. AMD has announced ARM CPUs for
servers, Google already runs ChromeOS on ARM, and Microsoft has released Windows RT to
support ARM. In addition, some desktops and laptops are built around the Loongson 3
family, a multi-core MIPS64-compatible CPU developed by the Chinese Academy of
Sciences.
Even though RISC architectures, especially ARM and MIPS, are developing rapidly, on the
desktop they still lack applications compared to x86. Full-system emulation has proved
to be an effective way to migrate existing applications to other architectures. There
are already some full-system emulators that enable an x86 OS to run on ARM or MIPS. A
typical tool is QEMU, which is sequential in SMP emulation and can emulate multiple
guest architectures on multiple host architectures, including x86 on ARM/MIPS. Thanks
to the high single-core performance of x86, sequential QEMU is fast enough to use on an
x86 machine. Although MIPS and ARM chips have no fewer cores than x86 chips, a single
core is far slower. For example, the MIPS64-compatible Loongson 3A has 4 cores at
900 MHz while the Intel i5-2400 has 4 cores at 3.1 GHz. When running 7z in one thread,
the Intel CPU is 5.8x faster than the Loongson. So when emulating an x86 machine on
ARM/MIPS, it is very important to use all of the cores rather than only one to make the
guest machine run smoothly.
x86 hosts have several parallel full-system emulators that can use all of the resources
of the host machine, but on ARM or MIPS there is currently no parallel full-system
emulator.
To make full-system emulation faster, we port the parallel full-system emulator COREMU
to RISC architectures. The remainder of this paper presents how we port it. Section 2
introduces full-system emulators and the strategy for parallelizing a full-system
emulator on x86. Section 3 focuses on the translation of atomic instructions from CISC
x86 to RISC MIPS/ARM. Section 4 presents lock-free FIFO queue algorithms using ll/sc
pairs for interrupt simulation. Section 5 reports experimental results comparing our
parallel full-system emulator to an ordinary full-system emulator and to the host OS.
This paper mainly discusses MIPS; ARM is handled the same way except for the lock-free
queue algorithm in Section 4.3.
2 QEMU and Parallel Full-System Emulator
2.1 QEMU and QEMU’s Multiprocessor Emulation
QEMU[1] is a hosted virtual machine monitor: it emulates CPUs through dynamic binary
translation (DBT) and provides a set of device models, enabling it to run a variety of
unmodified guest operating systems. QEMU has a user-mode emulation mode and a
full-system emulation mode, in which QEMU emulates a full computer system, including
one or more processors and peripherals.
For a single emulated processor, QEMU translates the emulated code to TCG (Tiny Code
Generator) intermediate code and then translates the TCG code to host instructions.
After a block of emulated code has been translated to host instructions, QEMU executes
it. In multiprocessor emulation, QEMU emulates an SMP machine with multiple processors
and a device that supports inter-core communication (such as the APIC on x86). QEMU
emulates these processors sequentially in round-robin order: each emulated processor
gets a time slice in which to execute, after which the physical CPU moves on to the
next emulated processor. Between time slices, the physical CPU handles peripheral
simulation and inter-core communication.
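The round-robin scheme described above can be sketched in a few lines of C. This is only an illustration, not QEMU's actual main loop: the names vcpu_t, exec_translated_block, handle_io_and_ipi, NUM_VCPUS and TIME_SLICE are hypothetical, and the "execution" of a translated block is reduced to incrementing a counter.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_VCPUS  4     /* hypothetical number of emulated processors */
#define TIME_SLICE 1000  /* hypothetical slice length, in translated blocks */

typedef struct {
    unsigned long blocks_executed;  /* stands in for real guest progress */
} vcpu_t;

static unsigned long io_rounds;     /* how often device/IPI work was done */

/* Placeholder for "translate a block if needed, then run it". */
static void exec_translated_block(vcpu_t *cpu) { cpu->blocks_executed++; }

/* Placeholder for peripheral simulation and inter-core communication. */
static void handle_io_and_ipi(void) { io_rounds++; }

/* One host thread serves every emulated processor in turn. */
void run_sequential(vcpu_t cpus[], size_t ncpus, unsigned rounds) {
    assert(cpus != NULL && ncpus > 0);
    for (unsigned r = 0; r < rounds; r++) {
        for (size_t i = 0; i < ncpus; i++) {
            for (int s = 0; s < TIME_SLICE; s++)
                exec_translated_block(&cpus[i]);  /* this vCPU's time slice */
            handle_io_and_ipi();                  /* between two time slices */
        }
    }
}
```

However many emulated processors are configured, all of the work in this loop runs on a single host core, which is the limitation the parallel emulator removes.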
2.2 Parallel Full-System Emulator
While QEMU is a sequential full-system emulator, several parallel full-system
emulators exist: Parallel SimOS, COREMU[2], PQEMU[5] and HQEMU[6]. Parallel SimOS is
designed for the Alpha architecture and the other three are designed specifically for
x86 machines; none of them can currently run on the MIPS architecture. Compared to the
other parallel emulators, COREMU has high scalability and high performance. Our work is
mainly based on it.
COREMU is hosted on x86 and targets multiple architectures, especially x86 and ARM. It
wraps the translation-execution logic of each emulated processor in its own thread and
binds these threads to different physical CPU cores. In addition, it wraps all
peripheral emulation in a separate thread called the IO-thread. COREMU thus relies
mainly on multithreading to achieve parallelism. It provides an efficient emulated
synchronization primitive to coordinate concurrent accesses to the emulated shared
memory from the emulated processors. To deal with inter-core communication, COREMU uses
a lock-free FIFO queue.
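The thread structure described above can be sketched with POSIX threads. This is a sketch under assumptions, not COREMU's code: vcpu_thread, io_thread, run_parallel and the counters are hypothetical names, the guest work is a placeholder loop, and pinning each thread to a physical core (done through a platform-specific affinity call) is left out to keep the example portable.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_VCPUS 4  /* hypothetical number of emulated processors */

typedef struct {
    int id;
    unsigned long blocks_executed;  /* touched only by this vCPU's thread */
} vcpu_t;

static atomic_bool stop_io;   /* tells the IO thread to exit */
static atomic_ulong io_polls; /* work done by the IO thread */

/* One thread per emulated processor: translate and execute guest code. */
static void *vcpu_thread(void *arg) {
    vcpu_t *cpu = arg;
    for (int i = 0; i < 100000; i++)
        cpu->blocks_executed++;  /* placeholder for translate + execute */
    return NULL;
}

/* A single separate thread for all peripheral emulation. */
static void *io_thread(void *arg) {
    (void)arg;
    while (!atomic_load(&stop_io))
        atomic_fetch_add(&io_polls, 1);  /* placeholder for device models */
    return NULL;
}

void run_parallel(vcpu_t cpus[], int n) {
    pthread_t vt[NUM_VCPUS], io;
    int rc;
    assert(n > 0 && n <= NUM_VCPUS);
    atomic_store(&stop_io, false);
    rc = pthread_create(&io, NULL, io_thread, NULL);
    assert(rc == 0);
    for (int i = 0; i < n; i++) {
        cpus[i].id = i;
        rc = pthread_create(&vt[i], NULL, vcpu_thread, &cpus[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < n; i++)
        pthread_join(vt[i], NULL);   /* wait for every emulated processor */
    atomic_store(&stop_io, true);
    pthread_join(io, NULL);
    (void)rc;
}
```

Because the vCPU threads now run at the same time, any guest memory they share and any messages they exchange need the atomic primitives and lock-free queue discussed in the following sections.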
When building a parallel full-system emulator like COREMU on MIPS, we mainly deal with
two issues: the atomic instruction translation strategy for lightweight memory
transactions, and the lock-free FIFO queue for inter-core communication.
Fig. 1. Sequential and parallel full-system emulation
3 Atomic Instruction on MIPS Host
3.1 Atomic Instruction on X86
CAS (compare-and-swap) is an atomic instruction that is widely used in multithreaded
code to achieve synchronization. The C function in Figure 2 shows the basic behavior of
CAS; the real instruction performs these steps atomically.
/* Behavior of CAS; the hardware performs the whole body atomically. */
int CAS(int *mem, int oldval, int newval){
    int old_mem_val = *mem;
    if(old_mem_val == oldval)
        *mem = newval;
    return old_mem_val;
}
Fig. 2. CAS in C
COREMU uses the CASN (multiword CAS) algorithm in atomic instruction translation,
which executes multiple CAS operations to simulate an atomic instruction of the guest
machine.
void inc(int *reg){
    int old, new;
    do{
        old = *reg;
        new = old + 1;
    }while(CAS(reg, old, new) != old);  /* retry if *reg changed in between */
}
Fig. 3. Translation of atomic inc using CAS
Figure 3 shows the translation of atomic inc using CAS in C. COREMU uses CASN mainly
because it targets x86 hosts, and x86 has cmpxchg as its CAS instruction.
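For comparison, the same CAS retry loop can be written with the standard C11 atomics, which a compiler maps to cmpxchg on x86. This is a minimal sketch; atomic_inc_cas is a hypothetical name and not a COREMU function.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Atomically increment *mem with a CAS retry loop; returns the new value. */
int atomic_inc_cas(atomic_int *mem) {
    assert(mem != NULL);
    int old = atomic_load(mem);
    /* On failure, atomic_compare_exchange_weak stores the current value of
       *mem into `old`, so the next attempt recomputes old + 1 from it. */
    while (!atomic_compare_exchange_weak(mem, &old, old + 1)) {
        /* another thread changed *mem between the load and the CAS: retry */
    }
    return old + 1;
}
```

The loop has the same shape as the inc translation: read, compute, and publish only if nobody else wrote in between.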
3.2 Atomic Instruction on MIPS
Unlike x86, MIPS is a RISC architecture and has no CAS instruction. MIPS provides ll
(Load Linked) and sc (Store Conditional) to achieve atomic read-modify-write (RMW)
operations (on ARM the corresponding instructions are named ldrex and strex). ll
reg,mem loads a word from memory into reg and remembers this operation. sc reg,mem
stores a word to the same location in memory. When an sc instruction accesses memory,
it checks whether the location has been modified since the last ll instruction. If it
has not been modified, the store is performed and reg is set to 1 to indicate success;
if it has been modified, the store is not performed and reg is set to 0 to indicate
failure. LL/SC has two advantages over CAS: reads and writes are separate instructions,
and both instructions can be performed using only two registers.
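The success and failure rule of sc can be illustrated with a small software model. This is a toy, single-threaded model and not real hardware behavior: cell_t, link_t, load_linked, store_conditional and plain_store are hypothetical names, and the "link" is modeled by counting writes to the cell, whereas real processors track the reservation in hardware and may also fail an sc for other reasons.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int value;
    unsigned long writes;       /* bumped on every store to the cell */
} cell_t;

typedef struct {
    const cell_t *addr;         /* location remembered by the last ll */
    unsigned long writes_seen;  /* write count observed at that time */
} link_t;

/* ll: load the value and remember the location. */
int load_linked(link_t *lk, const cell_t *c) {
    assert(lk != NULL && c != NULL);
    lk->addr = c;
    lk->writes_seen = c->writes;
    return c->value;
}

/* sc: store only if the location was not written since the matching ll.
   Returns true (1) on success and false (0) on failure. */
bool store_conditional(link_t *lk, cell_t *c, int v) {
    bool ok = (lk->addr == c) && (lk->writes_seen == c->writes);
    lk->addr = NULL;            /* the link is consumed either way */
    if (ok) {
        c->value = v;
        c->writes++;
    }
    return ok;
}

/* An ordinary store by another processor; it breaks any pending link. */
void plain_store(cell_t *c, int v) {
    c->value = v;
    c->writes++;
}
```

In this model an sc fails even if the other processor wrote the cell and then restored the original value, because the condition is "was it written", not "does it still hold the same value".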
3.3 Aligned Instruction
x86 atomic instructions include inc, dec, add, xchg, and, or, xadd, bit_testandset,
bit_testandreset, etc., but operating systems and applications do not use them all.
From experiments we find that the Linux kernel and the applications running on it use
only inc, dec, xchg, cmpxchg and xadd.
This paper uses ll/sc pairs and inline assembly to achieve lightweight memory
transactions. Figures 4-7 show the translations of inc, xchg, xadd and cmpxchg for MIPS.
1: ll   t,*mem          # t = *mem, start of the linked sequence
   addi t,t,1
   sc   t,*mem          # try to store; t = 1 on success, 0 on failure
   beqz t,1b            # retry if the store failed
Fig. 4. inc mem
1: ll   temp1,*mem      # temp1 = old memory value
   move temp,reg
   sc   temp,*mem       # try to store reg
   beqz temp,1b         # retry; reg still holds the value to store
   move reg,temp1       # on success, return the old memory value in reg
Fig. 5. xchg reg,mem
1: ll   temp,*dst       # temp = old memory value
   add  temp1,reg,temp
   sc   temp1,*dst      # try to store reg + old value
   beqz temp1,1b        # retry; reg still holds the addend
   move reg,temp        # on success, return the old memory value in reg
Fig. 6. xadd reg,mem
1: ll   temp,*mem
   bne  temp,old,2f     # mismatch: the loaded value is written back unchanged
   move temp,new
2: sc   temp,*mem
   beqz temp,1b         # retry if the store failed
Fig. 7. cmpxchg mem,old,new
3.4 Unaligned Instruction
The translations above cover all 32-bit aligned memory accesses. As a CISC
architecture, however, x86 allows memory accesses that are not 32-bit aligned, while
MIPS requires 32-bit aligned accesses for ll/sc. Our experiments show that all of these
unaligned accesses come from 8-bit or 16-bit xchg and cmpxchg, and that the accessed
data never crosses a 32-bit word boundary.
Given this property, we handle an unaligned instruction as follows: when QEMU
encounters one, it rounds the address down to a 32-bit aligned address (new_addr = addr
& ~0x3) and operates on the whole 32-bit word atomically, which preserves the atomicity
of the original operation.
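The address-widening idea can be sketched for an 8-bit xchg as follows. This is an illustration under assumptions, not the emulator's code: xchg8 is a hypothetical name, and the example relies on the GCC/Clang __atomic built-ins, which on MIPS and ARM hosts are typically compiled to ll/sc (ldrex/strex) retry loops.

```c
#include <assert.h>
#include <stdint.h>

/* Atomically exchange one byte by operating on its enclosing aligned
   32-bit word. The byte never crosses a word boundary by construction. */
uint8_t xchg8(uint8_t *addr, uint8_t newval) {
    assert(addr != 0);
    uintptr_t a = (uintptr_t)addr;
    uint32_t *word = (uint32_t *)(a & ~(uintptr_t)0x3); /* addr & ~0x3 */
    unsigned byte = (unsigned)(a & 0x3);                /* offset in word */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    unsigned shift = (3u - byte) * 8u;
#else
    unsigned shift = byte * 8u;                         /* little endian */
#endif
    uint32_t mask = (uint32_t)0xFF << shift;
    uint32_t old = __atomic_load_n(word, __ATOMIC_RELAXED);
    uint32_t desired;
    do {
        /* keep the three neighbouring bytes, replace only the target byte */
        desired = (old & ~mask) | ((uint32_t)newval << shift);
        /* on failure `old` is refreshed with the current word contents */
    } while (!__atomic_compare_exchange_n(word, &old, desired, 1,
                                          __ATOMIC_SEQ_CST, __ATOMIC_RELAXED));
    return (uint8_t)((old & mask) >> shift);
}
```

The neighbouring bytes are rewritten with the values just read, so a concurrent change to any of them makes the word-sized update fail and retry instead of being overwritten.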
4 Lock-Free Queue in Interruption Simulation
4.1 Interruption Simulation
Because emulation in QEMU is sequential, asynchronous communication between cores, and
between cores and devices, is emulated synchronously. All processor execution is
scheduled in round-robin fashion. When a core is scheduled out, QEMU handles the
pending events, including device interrupts and inter-processor interrupts. In parallel
emulation, however, more than one emulated core runs at the same time, so the interrupt
vector may be modified concurrently by the running cores. COREMU uses a lock-free FIFO
queue to achieve asynchronous communication.
4.2 Lock-Free Queue in X86
Unlike the ll/sc pair on MIPS, CAS on x86 cannot detect the ABA problem[8]. A typical
ABA scenario is as follows:
• Process 1 reads value A from shared memory.
• Process 1 is then preempted, allowing process 2 to run.
• Process 2 changes the shared memory value from A to B and back to A before it is
preempted.
• Process 1 resumes execution, sees that the shared memory value has not changed, and
continues.
The ABA problem is a major issue when designing a lock-free queue algorithm because
queue nodes are usually pointers, and the same pointer value may reappear when a later
enqueue receives the same address from malloc. COREMU adds a counter to each queue node
and uses CAS2 to avoid the ABA problem in its lock-free queue. CAS2 checks a queue node
that contains both a pointer and a counter. The counter never has the same value after
an en/dequeue operation. Moreover, x86 has native CAS2 instructions: cmpxchg8b/16b.
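The counter technique can be sketched with a 64-bit CAS. This is a simplified illustration, not COREMU's data structure: to stay portable it pairs a 32-bit index with a 32-bit counter instead of a pointer with a counter, and the names tagged_t, pack, index_of, count_of and tagged_update are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* (index, counter) packed into one 64-bit word so that a single 64-bit CAS
   checks and updates both halves, in the spirit of cmpxchg8b. */
typedef _Atomic uint64_t tagged_t;

static uint64_t pack(uint32_t index, uint32_t count) {
    return ((uint64_t)count << 32) | index;
}
static uint32_t index_of(uint64_t v) { return (uint32_t)v; }
static uint32_t count_of(uint64_t v) { return (uint32_t)(v >> 32); }

/* Install new_index only if neither the index nor the counter has changed
   since the caller read `seen`. Every successful update bumps the counter. */
bool tagged_update(tagged_t *slot, uint64_t seen, uint32_t new_index) {
    assert(slot != NULL);
    uint64_t desired = pack(new_index, count_of(seen) + 1);
    return atomic_compare_exchange_strong(slot, &seen, desired);
}
```

If another thread changes the index from A to B and back to A, the index half matches again but the counter half does not, so an update based on the stale snapshot fails instead of silently succeeding.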
4.3 Lock-Free Queue in MIPS
Interleaving ll/sc pairs can be used directly to build a lock-free FIFO queue, as
Claude Evequoz describes[3]. We apply this algorithm on ARM because ARM supports
interleaving of ll/sc pairs.
node *Q[Q_LENGTH];          // Circular list initialized with null
unsigned int Head, Tail;    // Extraction and insertion indices

bool enqueue(node *p){
    unsigned int t, tail;
    node *slot;
    while(true){
        t = Tail;
        if(t == Head + Q_LENGTH)
            return FULL_QUEUE;
        tail = t % Q_LENGTH;
        slot = LL(&Q[tail]);
        if(t == Tail){
            if(slot != null){
                // Slot already filled by another enqueue: help advance Tail
                if(LL(&Tail) == t)
                    SC(&Tail, t+1);
            }
            else if(SC(&Q[tail], p)){
                // Node inserted: advance Tail if nobody has done so yet
                if(LL(&Tail) == t)
                    SC(&Tail, t+1);
                return OK;
            }
        }
    }
}

node *Dequeue(void){
    unsigned int h, head;
    node *slot;
    while(true){
        h = Head;
        if(h == Tail)
            return null;
        head = h % Q_LENGTH;
        slot = LL(&Q[head]);
        if(h == Head){
            if(slot == null){
                // Slot already emptied by another dequeue: help advance Head
                if(LL(&Head) == h)
                    SC(&Head, h+1);
            }
            else if(SC(&Q[head], null)){
                // Node removed: advance Head if nobody has done so yet
                if(LL(&Head) == h)
                    SC(&Head, h+1);
                return slot;
            }
        }
    }
}
Fig. 8. Lock free FIFO queue using ll/sc pair
Because ll/sc-based queue algorithms like this one always need either nesting or
interleaving of ll/sc pairs, we cannot use them on MIPS, which supports neither. A
single en/dequeue actually consists of two operations: modifying the Head/Tail pointer
and inserting/removing the node. The two operations must be executed as one atomic
step, while MIPS supports only one atomic update at a time. Harris et al.[7] offer a
way to build CASN, which atomically performs a multiword CAS. We first use an ll/sc
pair to build a software version of CAS, and then build CAS2 from it. In this way we
can use the lock-free queue in COREMU, but CAS2 on MIPS is much heavier than CAS, not
to mention that MIPS has no native CAS. It is therefore very important to reduce the
use of CAS2 in the lock-free queue algorithm.
We use the lock-free algorithm proposed by John D. Valois[4], which notably reduces the
number of CASN operations: both enqueue and dequeue need only one CAS2 and one xadd.
The algorithm is based on a standard circular array. A location holds either a node
value or one of three special values: HEAD, TAIL and EMPTY. Initially, two adjacent
locations are set to HEAD and TAIL while the others are set to EMPTY. To enqueue a
value x, a process finds the unique location containing the special TAIL value. CAS2 is
then used to change the two adjacent locations from <TAIL, EMPTY> to <x, TAIL>. The
dequeue operation is similar, using CAS2 to change <HEAD, x> to <EMPTY, HEAD> and then
returning x.
In addition, we keep two counters: the number of enqueues and the number of dequeues.
Both are incremented by FAA (fetch-and-add) whenever an en/dequeue operation completes.
These two counters help to find HEAD and TAIL quickly. FAA can be implemented by the
xadd reg,mem translation in Figure 6 with reg set to 1, and CAS2 can be built from
multiple CAS operations.
The algorithm still works when it reaches the beginning or end of the array, because
software CAS2 does not require the two memory locations to be adjacent.
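The array-based scheme described above can be sketched in C11 as follows. This is a simplified illustration under loud assumptions, not the implementation used in the emulator: all names (queue_t, cas2, QLEN and so on) are hypothetical, node values are plain 32-bit integers, and cas2 is a stand-in that uses a spinlock, so this sketch is not lock-free; a real port would replace it with the software CAS2 built from ll/sc.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QLEN  8u            /* power of two, so counter wrap-around is safe */
#define EMPTY 0xFFFFFFFFu
#define HEAD  0xFFFFFFFEu
#define TAIL  0xFFFFFFFDu   /* node values must be smaller than TAIL */

typedef struct {
    _Atomic uint32_t slot[QLEN];
    atomic_uint enq_count, deq_count;  /* hints for locating TAIL and HEAD */
    atomic_flag cas2_lock;             /* used only by the cas2 stand-in */
} queue_t;

void queue_init(queue_t *q) {
    for (unsigned i = 0; i < QLEN; i++) atomic_init(&q->slot[i], EMPTY);
    atomic_store(&q->slot[0], HEAD);   /* two adjacent locations: HEAD, TAIL */
    atomic_store(&q->slot[1], TAIL);
    atomic_init(&q->enq_count, 0);
    atomic_init(&q->deq_count, 0);
    atomic_flag_clear(&q->cas2_lock);
}

/* Stand-in for software CAS2: compare and swap two slots as one step. */
static bool cas2(queue_t *q, unsigned i, unsigned j, uint32_t old_i,
                 uint32_t old_j, uint32_t new_i, uint32_t new_j) {
    while (atomic_flag_test_and_set(&q->cas2_lock)) { /* spin */ }
    bool ok = atomic_load(&q->slot[i]) == old_i &&
              atomic_load(&q->slot[j]) == old_j;
    if (ok) {
        atomic_store(&q->slot[i], new_i);
        atomic_store(&q->slot[j], new_j);
    }
    atomic_flag_clear(&q->cas2_lock);
    return ok;
}

bool enqueue(queue_t *q, uint32_t x) {
    assert(x < TAIL);
    for (;;) {
        /* the enqueue counter says where TAIL should be; scan from there */
        unsigned i = (1u + atomic_load(&q->enq_count)) % QLEN;
        while (atomic_load(&q->slot[i]) != TAIL) i = (i + 1u) % QLEN;
        unsigned j = (i + 1u) % QLEN;
        if (atomic_load(&q->slot[j]) == HEAD) return false;  /* full */
        if (cas2(q, i, j, TAIL, EMPTY, x, TAIL)) {   /* <TAIL,EMPTY> -> <x,TAIL> */
            atomic_fetch_add(&q->enq_count, 1);      /* FAA */
            return true;
        }
    }
}

bool dequeue(queue_t *q, uint32_t *out) {
    for (;;) {
        unsigned i = atomic_load(&q->deq_count) % QLEN;
        while (atomic_load(&q->slot[i]) != HEAD) i = (i + 1u) % QLEN;
        unsigned j = (i + 1u) % QLEN;
        uint32_t x = atomic_load(&q->slot[j]);
        if (x == TAIL) return false;                 /* empty */
        if (x < TAIL && cas2(q, i, j, HEAD, x, EMPTY, HEAD)) { /* <HEAD,x> -> <EMPTY,HEAD> */
            atomic_fetch_add(&q->deq_count, 1);      /* FAA */
            *out = x;
            return true;
        }
    }
}
```

Because HEAD and TAIL each occupy a slot, this layout stores at most QLEN - 2 values. The index arithmetic wraps modulo QLEN, which is why the two slots passed to cas2 need not be adjacent in memory.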
5 Experiments and Discussion
To compare the performance of the original QEMU with our modified QEMU, and the
performance of a multithreaded program on the native machine with the same program in
our modified QEMU, two benchmarks are chosen. The benchmarks are run on a 4-core
(900 MHz) Loongson 3A, a quad-core MIPS CPU, running Debian 6 with kernel version
2.6.36.3. The guest OS is Debian 6 with kernel version 2.6.32-5.
First, we write a simple multithreaded pi program that calculates pi in a total of N
steps using T concurrent threads. Each thread t calculates a partial sum $\pi_t$, and
the partial sums are finally added to obtain $\pi$:
$$\pi_t = \sum_{i=1}^{N/T} \frac{8}{16(iT+t)^2 - 16(iT+t) + 3}$$
$$\pi = \sum_{i=1}^{T} \pi_i$$
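The per-thread partial sum can be sketched in C as follows. This is not the benchmark source: partial_pi and compute_pi are hypothetical names, thread creation is omitted so that the arithmetic stays visible, and the loop index is assumed to start at 0 so that k = iT + t runs over 1..N, which is what the series 8/((4k-1)(4k-3)) needs in order to converge to pi.

```c
#include <assert.h>

/* Partial sum computed by thread t (1 <= t <= T) over N/T steps. */
double partial_pi(unsigned t, unsigned T, unsigned long N) {
    assert(T > 0 && t >= 1 && t <= T);
    double sum = 0.0;
    for (unsigned long i = 0; i < N / T; i++) {
        double k = (double)i * T + t;   /* k = iT + t, interleaved by thread */
        sum += 8.0 / (16.0 * k * k - 16.0 * k + 3.0);
    }
    return sum;
}

/* Add up the T partial sums; in the benchmark each would run in a thread. */
double compute_pi(unsigned T, unsigned long N) {
    double pi = 0.0;
    for (unsigned t = 1; t <= T; t++)
        pi += partial_pi(t, T, N);
    return pi;
}
```

Each thread touches only its own accumulator, so the threads need no synchronization until the final addition, which makes the benchmark a nearly ideal case for parallel speedup.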
The results are shown in Figure 9. OriQemu stands for the original QEMU and ModQemu for
our modified QEMU; n (1, 2, 4) means that QEMU runs with the -smp n option (emulating
an n-core machine). The results show the efficiency of the modified QEMU: it is 3x
faster than the original QEMU when the emulated machine has 4 cores and the program
runs 4 threads. The speedup reaches 3, and the parallel efficiency reaches nearly 3/4
relative to the number of physical cores.
Fig. 9. The time of multithread pi
Second, we compare the performance of the modified QEMU with that of the native machine
using 7z, a widely used multithreaded compression application that contains a built-in
benchmark. The dictionary size is set to 256 KB. Figure 10 shows the compression and
decompression speeds on both the native machine and the modified QEMU. A higher thread
count yields a higher compression/decompression rate, and the application achieves
almost the same speedup on the native machine and in the modified QEMU: 2.79 versus
2.68 for compression and 3.68 versus 3.65 for decompression.
[Plot: SPEED (KB/s), axis ticks 10 to 10000, versus NUMBER OF THREADS (1, 2, 4, 8, 16); series: host_compress, host_decompress, guest_compress, guest_decompress]
Fig. 10. The speed of 7z compress/decompress
6 Conclusion
We present a strategy for translating x86 atomic instructions into ll/sc pairs on RISC
hosts and use a more lightweight lock-free FIFO queue for emulating asynchronous
communication. With these, we successfully emulate x86 in parallel on a MIPS host. The
experiments demonstrate its efficiency compared to the original QEMU and to the host
machine.
References
1. Bellard, F.: QEMU, a fast and portable dynamic translator. USENIX (2005)
2. Wang, Z., Liu, R., Chen, Y., Wu, X., Chen, H., Zhang, W., Zang, B.: COREMU: a scalable
and portable parallel full-system emulator. In: Cascaval, C., Yew, P.-C. (eds.) PPOPP, pp.
213–222. ACM (2011)
3. Evéquoz, C.: Non-Blocking Concurrent FIFO Queues with Single Word Synchronization
Primitives. In: ICPP, pp. 397–405. IEEE Computer Society (2008)
4. Valois, J.D.: Implementing Lock-Free Queues. In: Proceedings of the Seventh International
Conference on Parallel and Distributed Computing Systems, Las Vegas, NV (1994)
5. Ding, J.-H., Chang, P.-C., Hsu, W.-C., Chung, Y.-C.: PQEMU: A Parallel System Emulator
Based on QEMU. In: ICPADS, pp. 276–283. IEEE (2011)
6. Hong, D.-Y., Hsu, C.-C., Yew, P.-C., Wu, J.-J., Hsu, W.-C., Liu, P., Wang, C.-M., Chung,
Y.-C.: HQEMU: a multi-threaded and retargetable dynamic binary translator on multicores.
Paper presented at the Meeting of the CGO (2012)
7. Harris, T.L., Fraser, K., Pratt, I.: A Practical Multi-word Compare-and-Swap Operation. In:
Malkhi, D. (ed.) DISC 2002. LNCS, vol. 2508, Springer, Heidelberg (2002)
8. Dechev, D., Pirkelbauer, P., Stroustrup, B.: Understanding and Effectively Preventing the
ABA Problem in Descriptor-Based Lock-Free Designs. Paper presented at the Meeting of the
ISORC (2010)