Performance Insights to Intel® Hyper-Threading Technology



Source: https://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/


Executive Summary

Intel® Hyper-Threading Technology (Intel® HT Technology)¹ is a hardware feature, supported in many Intel® architecture-based server and client platforms, that enables one processor core to run two software threads simultaneously. Also known as simultaneous multi-threading, Intel HT Technology improves throughput and increases energy efficiency. In Intel® Core™ i7 processors and other processors based on the Nehalem core, including the Intel® Xeon® processor 5500 series, Intel HT Technology provides greater benefits than were possible in the Pentium® 4 processor era, when it was first introduced, producing performance and power-efficiency gains across a broad range of applications.

As with any hardware feature, not all software may benefit from this capability. See the section "Understanding Limitations and Maximizing Performance" for details. This paper explains how Intel HT Technology works and shows a variety of performance results across several classes of software running on clients, workstations, and servers. Methods for assessing Intel HT Technology performance are introduced, and analysis of performance degradations is included. Guidance is provided to both software developers and end customers about how to gauge the effect of Intel HT Technology on a particular application and how to use (or not use) the technology to best effect.

The discussion here refers to Intel® 64 architecture, with particular emphasis on Intel processors based on the Nehalem core, including the Intel Core i7 processor and the Intel Xeon processor 5500 series. This paper also applies to future Intel® processors derived from the Nehalem core. It does not apply to the Intel® Atom™ processor family, nor to the Itanium® processor family, as both have significantly different Intel HT Technology implementations.

Hardware Mechanisms of Intel HT Technology

Intel HT Technology allows one physical processor core to present two logical cores to the operating system, which allows it to support two threads at once. The key hardware mechanism underlying this capability is an extra architectural state supported by the hardware, as shown in Figure 1.

Figure 1. Intel® HT Technology enables a single processor core to maintain two architectural states, each of which can support its own thread. Many of the internal microarchitectural hardware resources are shared between the two threads.

The block diagram of the Nehalem core-based processor in Figure 2 shows multiple cores, each of which has two threads when Intel HT Technology is enabled. Processors based on the Nehalem architecture exist with varying numbers of cores, as shown in the graphic.

Figure 2. The Intel® Core™ i7 processor and architecturally similar processors can have varying numbers of cores, each of which can support two threads when Intel® HT Technology is enabled.

For each thread, the processor maintains a separate, complete architectural state that includes its own set of registers as defined by the Intel 64 architecture. The operating system (OS) manages the two threads just as it would if each were running on its own physical core. Some internal microarchitectural structures are shared between threads (see Table 1).

Table 1. The Nehalem microarchitecture uses four different implementation policies per core to handle thread resources: Replicated, Partitioned, Competitively Shared, or Unaware.

The execution pipeline of processors based on Intel® Core™ microarchitecture is four instructions wide, meaning that it can execute up to four instructions per clock cycle. As shown in Figure 3, however, the software thread being executed often does not have four instructions eligible for simultaneous execution. Common reasons for fewer than four instructions per clock being retired include dependencies between the output of one instruction and the input of another, as well as latency waiting for data to be fetched from cache or memory.

Intel HT Technology improves performance through increased instruction level parallelism by having two threads with independent instruction streams, eliminating data dependencies between threads and increasing utilization of the available execution units. This effect typically increases the number of instructions executed in a given amount of time within a core, as shown in Figure 3. The impact of this greater efficiency is experienced by users as higher throughput (since more work gets completed per clock cycle) and higher performance per watt (since fewer idle execution units consume power without contributing to performance). In addition, when one thread has a cache miss, branch mispredict, or any other pipeline stall, the other thread continues processing instructions at nearly the same rate as a single thread running on the core. Intel HT Technology augments other advanced architectural features, higher clock speeds, and additional cores with a capability that is relatively inexpensive in terms of space on the silicon and production cost.

Figure 3. By giving the processor access to two threads in the same time slice, Intel® HT Technology reduces the level of idle hardware resources, which typically increases efficiency and throughput.

Intel HT Technology was originally implemented in the Intel NetBurst® microarchitecture, including the Pentium® 4 processor. While the feature's core technology is much the same as it is implemented in the current generation of processors, the microarchitecture as a whole has evolved to the point where Intel HT Technology is dramatically more effective in current platforms, due to changes such as those summarized in Table 2.

Table 2. Some key microarchitectural changes between the Pentium® 4 Processor and the Nehalem Core

 

Software Use of Intel HT Technology

The performance gain attained on a given software application due to any specific hardware feature depends on the characteristics of the software and how the software interacts with that feature. For instance, an application that runs entirely out of processor cache will not benefit from more memory bandwidth. Likewise, how much gain an application gets from a particular feature depends on how well that feature complements the application. This is why different processors provide different speed benefits for different applications. In this regard, Intel HT Technology is like any other processor feature: the performance gain attainable depends not only on the processor design, but also on the characteristics of the software being executed. Software developers and end users alike can benefit from understanding the characteristics an application must have to be a candidate for Intel HT Technology performance improvement. These are outlined below.

  • Application is designed for parallel execution. Generally, this means the application is multi-threaded, can be executed with multiple processes, or both. An application with multiple processes usually executes distinct tasks in parallel. A threaded application can divide a given task into separate parts, each of which is given to a software thread to execute in parallel with the other software threads. This type of threading model is called domain decomposition or data decomposition. There are other threading models, but domain decomposition often provides the most flexibility by enabling the application to tailor itself to run on a variety of platforms that have N>1 hardware threads. For help with threading, see the Intel® Developer Zone Parallel Programming and Multi-Core Developer Community.
  • Application scales with the number of hardware threads. Many applications reach a limit where creating extra software threads no longer improves performance due to application and/or system bottlenecks. Creating more threads in the presence of application design bottlenecks or system resource bottlenecks will not increase performance and may create additional overhead that translates to lower performance. For an application to scale well with Intel HT Technology, it must scale well with increasing core counts. Various tools are available to help applications scale well, including Intel® VTune™ Analyzer with Intel® Thread Profiler, which can help identify bottlenecks, load-balance issues, and parallelism opportunities.
  • Application maximizes hardware parallelism capabilities. Applications maximize the hardware's parallelism capabilities by creating the right number of software threads. In most cases, this means using a full-subscription model (the same number of software threads as hardware threads). In some instances, especially in server-type applications, over-subscription (more software threads than hardware threads) may provide better performance. Under-subscription (fewer software threads than hardware threads) does not fully utilize the parallelism capabilities of the platform and may translate to a reduced performance gain. As stated above, the application must be able to scale before additional software threads are created. Use OS APIs or CPU enumeration routines to create the right number of software threads.

Measuring and Analyzing Intel HT Technology Performance


Distinguishing Cores and Intel HT Technology Threads

The OS does not indicate which threads are located together on one core. The screen shot of the Windows* Task Manager in Figure 4 shows four threads running on a system with two cores and Intel HT Technology enabled. There is no indication of which threads belong to which core. The same is true on Linux*, though the Linux equivalent is not shown here.

In the figure, the second and fourth hardware threads are being utilized the most. Are these hardware threads sibling Intel HT Technology threads or separate cores? It is not possible to tell from the Task Manager, though there are other methods to determine which threads belong to which core.

Figure 4. Microsoft Windows* Task Manager showing an application with two software threads running on a system based on the Intel® Core™ i7 processor (Nehalem core) with Intel® HT Technology.

An end user can distinguish between cores and sibling Intel HT Technology threads by using CPUID interfaces and APIC IDs to enumerate the CPU topology. Each logical processor has its own APIC ID. CPUID provides an interface to decompose an APIC ID (see Figure 5) into subfields corresponding to each hierarchical level. For a sample program and in-depth details, see the paper "Intel® 64 Architecture Processor Topology Enumeration."

Figure 5. Each hardware thread has its own APIC ID


Logging Core and Intel HT Technology Thread Utilization

Once the thread mapping onto the cores is understood, common OS utilities such as perfmon, typeperf, or sar can be used to log CPU activity, and that activity can be mapped onto hardware threads and cores. This approach can show whether sibling Intel HT Technology threads are being utilized instead of separate cores. The Microsoft Typeperf* utility can be used to log CPU utilization per hardware thread, as shown in Figures 6 and 7.

Figure 6. Calling the Microsoft Typeperf* utility

Figure 7. TypePerf output (foo.csv file) containing CPU utilization per hardware thread


Interpreting Perfmon and sar Output

Perfmon and sar are utilities provided with the Windows and Linux operating systems, respectively, that provide CPU utilization data. Interesting behaviors that are not necessarily intuitive can be observed in this data by enabling and disabling Intel HT Technology. These tools simply count active time for each processor in the system and report this active time for each CPU over some regular interval that is configurable by the user. The tools also report total CPU utilization as a percentage, which is simply an average of all CPUs for each interval.

Note that what is called CPU Utilization is actually average CPU utilization. On a per-thread basis, a thread is either running code or it is idle at any given instant of time. The tools provide the average percentage of time that the thread was active over the interval of interest. For total utilization, it is possible to be at 25 percent utilization at an instant of time if, for instance, three threads are idle and one is active.

For purposes of illustration, consider a dual-core processor that supports Intel HT Technology. If Intel HT Technology is disabled, perfmon and sar will report two CPUs, and if Intel HT Technology is enabled, the tools will report CPU utilization for four CPUs.


Examining Performance-Data Examples

For the following examples, assume that Intel HT Technology provides a 1.25x performance gain when two threads are running on a core versus one thread running on a core. Assume further that one thread running by itself is capable of one unit of work per second (1 u/s).

In the case shown in Figure 8, two threads are running at 100% utilization, each on its own core. Total CPU utilization calculated by perfmon and sar is 100%, (100+100)/2. The system is completing two units of work per second (2 u/s).

Figure 8. The system represented here has Intel® HT Technology disabled and two threads running, both at 100% utilization.

In the case shown in Figure 9, two threads are running at 100% utilization and two are idle. Each is running on its own core. Total CPU Utilization calculated by perfmon and sar is 50%, (100+0+100+0)/4. The system is again completing two units of work per second (2 u/s).

Figure 9. The system represented here has Intel® HT Technology enabled; two threads are running at 100% utilization and two threads are idle (at 0% utilization). Each core has one logical processor at 100% utilization and one logical processor at 0%.

In the case shown in Figure 10, two threads are running at 100% utilization and two are idle. Both active threads are running on the same core. Total CPU Utilization calculated by perfmon and sar is 50%, (100+100+0+0)/4. The system is completing 1.25 units of work per second (u/s) in this case. (Our assumed workload gets 1 u/s on one thread and an Intel HT Technology gain of 1.25X for 1.25 u/s).

Figure 10. The system represented here has Intel® HT Technology enabled; again, two threads are running at 100% utilization and two threads are idle (at 0% utilization). Here, one core has two threads running at 100% and one core has two threads at 0%.

Note that, while Perfmon and sar report 50% total average CPU utilization for the graphs in both Figures 9 and 10, the amount of work completed per unit time is substantially different (2 u/s versus 1.25 u/s.).
The above examples illustrate the following three points:

  • Ideal scheduling places active threads on separate cores before scheduling two threads on the same core when maximum performance is the goal. This is best left to the operating system. All modern multi-threaded operating systems support Intel HT Technology, and later versions schedule threads more intelligently to maximize performance gains.
  • CPU utilization is not a good estimate of the true load on the system or the headroom the system has remaining to do additional work. As we see in Figure 9 and Figure 10, both report 50% utilization, but the amount of work being accomplished differs due to the differences in thread scheduling. The system in Figure 9 is doing 2 u/s, while the system in Figure 10 is doing 1.25 u/s.
  • Scaling work output with CPU utilization to estimate throughput at 100% utilization is not an accurate means of estimating maximum throughput of the system. Indeed, this method of estimating maximum performance at 100% utilization is fraught with difficulties in general with or without Intel HT Technology.


Measuring Intel HT Technology Performance: CPU Headroom/Utilization Decrease versus Time or Rate Performance Metrics

Consider an application that is used as a middle layer for other applications. To simplify, let's call this middle-layer application AppX. The developers of AppX worked hard to minimize CPU utilization, and for years AppX has used CPU utilization as one of its performance indicators; CPU utilization indicated how much CPU headroom a third-party application had available while AppX was running. Figure 11 shows CPU utilization data for AppX on the Core i7 with Intel HT Technology disabled and enabled. The developers looked at the CPU utilization and incorrectly concluded that Intel HT Technology more than doubled the performance of AppX, since CPU utilization dropped from 45.5% to 21.6%!

Figure 11. AppX CPU utilization, incorrectly concluding a 2x performance improvement with Intel® HT Technology

After discussions with Intel engineers, AppX developers realized that CPU utilization was not the best way to measure Intel HT Technology performance improvements. AppX used the same number of software threads with Intel HT Technology enabled and disabled. The CPU utilization merely indicated that Intel HT Technology threads were not being utilized during the run; the other hardware threads still had the same average CPU utilization as before. Although it is accurate to say that the CPU utilization decreased by 2x, that does not translate to a 2x speedup.

To accurately assess an Intel HT Technology speedup, AppX developers decided to run a few of their customers' multi-threaded applications that scale to the number of hardware threads. The applications were tested with Intel HT Technology enabled and disabled, and performance metrics such as time or rate (not CPU utilization) were used. Intel recommends using elapsed time or measurements of work done per unit time to assess performance changes due to optimizations.

Latency versus Throughput

To get a better understanding of how Intel HT Technology impacts performance, consider a single-socket Intel Core i7 processor-based system with four cores (the same information applies to systems with more than one socket and more or fewer cores per socket). If Intel HT Technology is disabled, there are four threads of execution, one on each processor core. This increases to two threads per processor core when Intel HT Technology is enabled.

Let us assume an application is running that is capable of spawning enough threads to keep all CPU threads utilized 100% of the time, and all threads are independent of each other. (This is a valid assumption for many applications, especially in the server segment. We will examine applications where this isn't true shortly.) Let's assume that we have four threads running with Intel HT Technology disabled and each thread does one unit of work per second, so our system is doing four units of work per second (u/s). Let us also assume the application exhibits good scaling. A variety of scenarios using the above assumptions are described below.


Scenario 1: Intel HT Technology Enabled (Eight Software Threads): No Performance Gain

Consider first the effect on the application if Intel HT Technology provided zero performance benefit. In this case, we have twice as many threads executing simultaneously. For there to be no performance gain, each thread must complete one unit of work every 2 seconds, or 0.5 units of work per second (u/s). In other words, the time to complete the same amount of work per thread increased by 2x, but the number of threads also increased by 2x, so total throughput remains the same.


Scenario 2: Intel HT Technology Enabled (Eight Software Threads): Performance Gain of 1.25x

In another case, Intel HT Technology is enabled, the application now runs eight threads, and performance improves 1.25x versus the Intel HT Technology-disabled case. What is the effect on execution time of one thread to do one unit of work? Original performance was 4 u/s with Intel HT Technology disabled and is now 4 * 1.25 = 5.0 u/s with Intel HT Technology enabled. Since we have eight threads, the time to complete one unit of work on one thread = 8 / (5.0 u/s) = 1.6 seconds. In this case, throughput has increased by 25% and thread latency has increased to 1.6 seconds.

Note that while this would seem to indicate that response time would increase with Intel HT Technology enabled for a variety of server workloads, this is generally not the case. The CPU time does increase, but typically, wait time in OS queues decreases. An example is described later in this paper to illustrate this result. Please see the article "Hyper-Threading: Be Sure You Know How to Correctly Measure Your Server's End-User Response Time" for additional exploration of the impact of Intel Hyper-Threading Technology on end-user server response times.

Figure 12 shows the relative compute latency of threads running with Intel HT Technology enabled versus disabled, for different performance gains with Intel HT Technology. Note that this representation is only an accurate model when all threads are 100% utilized.

Figure 12. How single thread execution latency varies relative to the performance gain attained with Intel® HT Technology enabled.
Note: This graph is an accurate representation of thread latency versus performance gain only at or near 100% CPU utilization.

Core Cycles-per-Instruction (CPI) and Thread CPI

Cycles-per-instruction (CPI) is the average number of cycles it takes to execute a given number of instructions over a time interval, calculated simply as cycles/instructions for that interval. The CPI of any section of code is influenced by many factors, including the amount of instruction-level parallelism in the code. Processors based on the Nehalem core can retire four instructions per clock, which corresponds to a CPI of 0.25. Some software has little inherent instruction-level parallelism, and other factors such as cache misses and branch mispredictions add cycles, resulting in average CPI numbers closer to 1.0 or 2.0.

In order to understand the impact Intel HT Technology has on CPI, it is important to make the distinction between Core CPI and Thread CPI. Let us first examine the case where Intel HT Technology is disabled. In this case, there is one thread of execution running on one core. Over any given interval, the number of instructions and cycles per core or per thread is identical, because there is only one thread running on the core. Therefore, Core CPI is equivalent to Thread CPI.

When Intel HT Technology is enabled, there are two threads of execution running on one core. Over any interval of time where both threads are active, each thread will have the same number of cycles of execution, but each thread will have executed a different number of instructions. Therefore, in order to calculate thread CPI, we need to know the number of instructions executed on each thread. In this case, the two thread CPIs will differ from each other. The core CPI is the number of cycles in the interval divided by all the instructions retired by the core, which is the sum of the instructions retired by each thread.

Assume that, over an interval, two Intel HT Technology threads on one core execute for 1,000,000 cycles, and thread 1 executes 750,000 instructions while thread 2 executes 500,000 instructions. The thread CPI for thread 1 is 1.33 and the thread CPI for thread 2 is 2.0. The core CPI is 1,000,000 divided by 1,250,000, or 0.80. Note that while it is possible, with the right data, to calculate core CPI from per-thread data, the reverse is not true: one cannot calculate thread CPI with Intel HT Technology enabled from core cycles and core instruction counts. It is possible to calculate a weighted average thread CPI by doubling core CPI, although this result is only a weighted average and may deviate significantly from the actual thread CPIs of the two threads. It is also not accurate to estimate or calculate core CPI for a function using profile data collected on a system with Intel HT Technology enabled, as there is no way to know what was running on the other thread at the time. In fact, profile data within a function may have been sampled with different routines running each time on the other thread.

Understanding Limitations and Maximizing Performance

While Intel HT Technology improves thread-level parallelism, the two logical processors in each physical processor core share most execution resources. The focus of this capability is to improve the efficiency of instruction scheduling, keeping the execution resources occupied, increasing instruction-level parallelism, and keeping execution units busy during microarchitectural stalls. The majority of applications show a significant increase in performance as a result. There are circumstances that can limit the Intel HT Technology benefit and in rare cases cause performance degradation. Examples include the following:

  • Application Scaling: Intel HT Technology adds additional hardware threads to the system. Therefore, to take advantage of Intel HT Technology, an application must be able to launch additional threads in order to generate additional parallelism. Applications that do not scale well with Intel HT Technology disabled are more likely to exhibit performance issues when Intel HT Technology is enabled. The best solution in this case is to identify the scaling issues and address these first. An application that spends an increasing amount of time in critical resource handling (locks and synchronization) may overwhelm any Intel HT Technology improvement in CPI due to more instructions and contention in the pipeline. Other scaling issues may be due to utilization of all of a particular platform resource as addressed above. Specific limiters of performance scaling are discussed in detail below.
  • Extremely high memory bandwidth applications. Intel HT Technology increases the demand placed on the memory subsystem when running two threads. If an application can saturate available memory bandwidth with Intel HT Technology disabled, then performance will not increase when Intel HT Technology is enabled; in some circumstances, performance may even degrade, due to increased memory demands and/or data-caching effects. The good news is that systems based on the Nehalem core, with integrated memory controllers and the Intel® QuickPath Interconnect, greatly increase available memory bandwidth compared to older Intel CPUs with Intel HT Technology. As a result, far fewer applications will experience a degradation from Intel HT Technology on the Nehalem core due to lack of memory bandwidth.
  • Extremely compute-efficient applications. If the processor's execution resources are already well utilized, then there is little to be gained by enabling Intel HT Technology. For instance, code that can already execute four instructions per cycle will not increase performance when running with Intel HT Technology enabled, as the processor core can only execute a maximum of four instructions per cycle.
  • Thread imbalance. The increased parallelism is only as useful as the degree of concurrency of the workload. If the work happens on only a few threads, then the increased hardware parallelism will provide little or no performance benefit. Intel® Software Development Products include tools to diagnose thread imbalance and improve concurrency.
  • Parallelism bottlenecks. There are many barriers that can limit thread scaling, such as false sharing, excessive locking/synchronization, and small parallel regions relative to serial regions. Some barriers, such as the amount of work (and thus the amount of work per thread), may be difficult or impossible to change, but others, such as false sharing, can be fixed.

    Note: False sharing occurs when two logical processors unintentionally share the same cache line. This commonly occurs with global/static variables but can also occur with dynamic memory. The worst case for false sharing is when both processors are writing to the line.

    Intel's Nehalem core uses a write-allocate policy (as do other Intel CPUs), meaning the caches fetch on write (also known as Read-For-Ownership, or RFO). The Intel Core i7 processor's first-level cache is a write-back cache (as in other Intel Core CPUs such as those code-named Yonah, Merom, and Penryn, though the Pentium 4 processor had a write-through cache). A write-back first-level cache means data stays in the first-level cache until it is evicted; thus, when the CPU writes to memory, the data stays in the first-level cache in a modified state. This leads to the ping-pong effect: if another processor does a read-for-ownership, the current processor must write back its L1D cache line and invalidate its copy so the other processor can get the line and hold it in an exclusive state. If the first processor needs to write to the cache line again, it will again issue an RFO and cause the other processor to write back its cache line and invalidate its copy.

False Sharing Code Example:

/* Each thread uses its thread number as an index into the global sum array;
   THREAD_NUM is assumed to be defined elsewhere. */
int sum[THREAD_NUM];

int inc_sum(int my_thr_num) {
    sum[my_thr_num]++;
    return sum[my_thr_num];
}

In the example above, processor(n) only writes to sum[n] and processor(n+1) only writes to sum[n+1]. They are not sharing the exact same memory location, so no synchronization is needed, but they are sharing the same memory range at cache-line-size granularity. The problem is that unintended sharing occurs because the processor operates at cache-line granularity. False sharing can be easily found using the Memory Access Analysis feature of the Intel® Performance Tuning Utility, which uses Intel Core i7 processor precise HITM and store events to identify contested cache-line accesses. False sharing can be easily fixed by making sure the variables that cause it are separated by at least a cache line (64 bytes), as shown below using the Microsoft or Intel compilers:

Before:

int sum1;
int sum2;

Fix:

__declspec(align(64)) int sum1;
__declspec(align(64)) int sum2;

  • Undersubscribed operating conditions: This situation describes a case where there are not enough ready-to-run software threads to take advantage of all the logical processors on a system, so the OS scheduler may schedule two worker threads on one core while another core sits idle. For example, consider a dual-core processor with Intel HT Technology enabled (two physical cores supporting up to four threads). If only two software threads are ready to run in a given interval and both are scheduled on the same physical core (with the other physical core idle), throughput will be lower during that interval than it would be without Intel HT Technology. To reduce the impact of undersubscription, consider using common threading approaches:
    • OpenMP* is a set of compiler directives and application programming interface (API) that enables users to parallelize their shared-memory applications. Most compilers support OpenMP, including the Intel® Compilers. The most common use of OpenMP is to parallelize loops, where each iteration is a unit of work distributed among the threads.
    • Intel® Threading Building Blocks is a portable C++ template library that extends C++ by abstracting thread management. The user specifies tasks rather than threads and lets the library map/schedule the tasks onto threads. Intel Threading Building Blocks is another mechanism that enables users to improve performance, portability (Windows, Linux, Mac OS*), and scalability in an efficient manner.
  • Affinitization: Incorrect affinitization implementations will cause performance degradations. Affinitization masks can vary per OS and between 32-bit and 64-bit versions of the same OS. Scheduler behavior may also change between versions of the same OS and between BIOS implementations, and the number of cores per processor may change between systems. Care must be taken to address these situations. Intel recommends that developers use the following affinitization strategies:
    • For non-NUMA platforms, developers are strongly encouraged not to use affinitization.
    • For NUMA platforms, affinitization may be needed to improve accesses to local memory versus remote memory. Developers are encouraged to use middleware to do the affinitization such as OpenMP or MPI environment variables. If affinitization is used, the application requires correct CPU topology and enumeration information, as described in the white paper, "Intel® 64 Architecture Processor Topology Enumeration."
    • CPU Enumeration Shortcuts: In the past, CPUID leaf 4 (CPUID.4.EAX[31:26]) could be used to get the number of cores per package. This method no longer works on the Intel Core i7 processor and future processors based on the Nehalem core and its derivatives. Developers who use it risk creating the wrong number of software threads. The correct way to get the number of cores per package is to use OS APIs and/or Intel's CPU topology enumeration, as described in the white paper, "Intel® 64 Architecture Processor Topology Enumeration."
  • OS support: Any modern OS will run with Intel HT Technology enabled, and OSs have continued to enhance their scheduling engines to get the most out of it. Generally speaking, the newer the OS, the more aware the scheduler will be of Intel HT Technology and of how best to utilize its additional capabilities. Microsoft Windows 7 makes intelligent scheduler decisions for Intel HT Technology, with the following features designed for Intel HT Technology scheduling:
    • Placement of software threads at scheduling time that is cognizant of logical processor/core relationship
    • Detection of scenarios where two active threads are running on one core when another core is idle and migrates work
    • Parking and un-parking of second logical processor per core to match workload and system utilization needs

Intel recommends upgrading to newer OSs that have been optimized for Intel HT Technology.

  • Synchronization spin loops: The hardware cannot tell whether a loop is doing useful work or simply spinning while waiting for a resource to become available. Rather than spinning in a tight loop, developers should either use the pause instruction in their spin loops or use OS synchronization APIs.
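A minimal sketch of such a spin loop, assuming an x86 compiler that provides _mm_pause in immintrin.h (the flag and function names are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <immintrin.h>  // _mm_pause, available on x86 compilers
#include <thread>

std::atomic<bool> ready{false};

// A polite spin-wait: the pause instruction hints to the core that this
// is a spin loop, freeing shared execution resources for the sibling
// hyper-thread and reducing the pipeline penalty on loop exit.
void spin_wait() {
    while (!ready.load(std::memory_order_acquire)) {
        _mm_pause();  // compiles to the x86 `pause` instruction
    }
}
```

For waits longer than a few microseconds, an OS synchronization primitive (event, condition variable, futex) is preferable, since it frees the logical processor entirely instead of merely yielding execution resources to the sibling thread.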

Case Studies

In this section, we briefly cover a few interesting case studies and describe how to interpret results as well as improve performance with Intel HT Technology where appropriate.


Case Study: Insufficient Physical Memory Degrades Performance with Intel HT Technology

In this case, a vendor running a high performance computing application observed an overall performance reduction when Intel HT Technology was enabled. After using perfmon to capture data on the system, it was found that the system began swapping pages to the paging file when Intel HT Technology was enabled. The application was designed to allocate memory on a per-thread basis. Enabling Intel HT Technology resulted in a doubling of the threads as well as a doubling in allocated memory.
The solution in this case was to install additional physical RAM so the application did not need to swap to disk. Additional solutions involve modifying the memory allocation policy in the application. It is interesting to note that doubling physical cores in this case while not increasing physical RAM would also have resulted in performance issues due to paging.
This particular example illustrates an important concept in any performance analysis, regardless of the technology being evaluated. In order to maximize performance of the CPU(s), sufficient platform resources (e.g., memory, disk I/O, network I/O) must be present to feed the processor.
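The memory arithmetic in this case reduces to a simple linear model (the helper function and sizes below are hypothetical, not the vendor's code): when allocation is per-thread, the footprint scales with the logical processor count, so enabling Intel HT Technology doubles it.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative model: per-thread working buffers scale linearly with the
// number of logical processors, so a core that exposes two hardware
// threads doubles the application's memory footprint when HT is enabled.
std::size_t working_set_bytes(unsigned logical_processors,
                              std::size_t per_thread_bytes) {
    return static_cast<std::size_t>(logical_processors) * per_thread_bytes;
}
```

On an 8-core system with 1 GB allocated per thread, this model gives 8 GB with Intel HT Technology disabled and 16 GB with it enabled; if physical RAM sits between those two figures, enabling the feature tips the system into paging.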


Case Study: Insufficient Parallelism Resulted in Very Small Gain from Intel HT Technology

In this case, a vendor running a computation engine reported that enabling Intel HT Technology resulted in no performance gain on a dual-socket system using Nehalem-based Intel Xeon processors. The test ran eight threads with Intel HT Technology disabled and 16 threads with it enabled. Further analysis showed very little performance gain moving from four to eight threads whether Intel HT Technology was enabled or disabled, and CPU utilization during the workload averaged only 55%.
The VTune analyzer was used to capture profile data, which was examined using the "threads over time" view. There were large blocks of time where the application was running a single serial thread. The parallel time was becoming significantly smaller than the serial time, and very little additional performance was gained by adding threads, consistent with Amdahl's law. The solution currently being implemented is to modify the application architecture so that the currently serial sections can be parallelized. This scaling issue existed even with Intel HT Technology disabled.
CPU utilization being low or constantly oscillating between low and high values can be an indicator that there are serial sections of the code, or competition for shared resources that are causing or contributing to poor scaling.
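Amdahl's law makes the observation concrete. The sketch below is an illustrative model, not the vendor's code; the assumed serial fraction of roughly 45% is a hypothetical figure chosen to be consistent with the 55% average CPU utilization reported:

```cpp
#include <cassert>

// Amdahl's law: with serial fraction s, the speedup on n threads is
// 1 / (s + (1 - s) / n), which saturates at 1/s as n grows.
double amdahl_speedup(double serial_fraction, int threads) {
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads);
}
```

With these assumptions, amdahl_speedup(0.45, 8) ≈ 1.93 and amdahl_speedup(0.45, 16) ≈ 2.06: doubling the thread count buys almost nothing, matching the flat scaling observed in the profile.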


Case Study: Insufficient Parallelism (#2) Resulted in No Gain from Intel HT Technology

In several instances where Intel HT Technology has provided no gain, it has been found that there simply are not enough threads running in the application to realize a benefit. The most extreme example is a single-threaded application, where no additional performance can be achieved without additional threads. The core and thread count on processors will continue to increase for the foreseeable future, and scalable, multi-threaded applications will be necessary to take advantage of the processing power of these new processors. Future systems shipping with Intel HT Technology will have thread and core counts that are not a power of two. Some applications assume core/thread counts will always be a power of two; these applications will not be able to fully utilize the available processor performance without modification to use all available processor threads.
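Rather than hard-coding a power-of-two thread count, an application can ask the OS how many logical processors are actually available. A minimal sketch (the helper name is illustrative):

```cpp
#include <cassert>
#include <thread>

// Query the OS for the logical processor count instead of assuming a
// power of two; with HT enabled a part may expose e.g. 6 or 12 threads.
unsigned worker_count() {
    unsigned n = std::thread::hardware_concurrency();
    return n ? n : 1;  // hardware_concurrency may return 0 if unknown
}
```

Sizing the thread pool from this value (or from the topology enumeration method referenced earlier) lets the same binary scale across HT-enabled and HT-disabled systems with any core count.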


Case Study: Interpreting Profile Data With Versus Without Hyper-Threading Technology

This case study explores how to correctly interpret profile data when Intel HT Technology is enabled and how to compare it to the Intel HT Technology-disabled case. It is contained in the separate paper, "Intel® Hyper-Threading Technology: Analysis of the HT Effects on a Server Transactional Workload". Highlights include the following key points:

  • Performance improved 30% by enabling Intel HT Technology
  • End user response time decreased from 50ms to 37ms.
  • The average number of cycles to complete a transaction increased 49%.

The 30% performance gain and the 49% increase in computational cycles fit almost exactly on the curve in figure 12.
End user response time decreased, which is not necessarily an intuitive result. As referenced earlier, see the Intel Application Note "Hyper-Threading: Be Sure You Know How to Correctly Measure Your Server's End-User Response Time" for more information on this topic.
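A simple back-of-the-envelope model (an illustrative assumption, not the methodology of the referenced paper) ties the first and third numbers together: if each transaction costs r times as many cycles with Intel HT Technology enabled, but two logical processors share each core, the ideal throughput ratio is 2/r.

```cpp
#include <cassert>

// Illustrative model: two hardware threads per core, each transaction
// taking `cycle_ratio` times as many cycles as in the HT-disabled case.
double ht_throughput_gain(double cycle_ratio) {
    return 2.0 / cycle_ratio;
}
```

With the measured cycle_ratio of 1.49 the model predicts a gain of about 1.34, close to the observed 30%; scheduling overhead and shared-resource contention plausibly account for the small gap.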

Recommendations


Recommendations for Software Developers and Performance Teams

Intel HT Technology provides performance benefits that vary by application and by the characteristics of the execution platform. Typically it is a net gain; evaluating performance with Intel HT Technology both enabled and disabled determines the benefit for a specific workload. Little or no benefit from Intel HT Technology may indicate an opportunity to improve software scaling, which often yields performance advantages unrelated to Intel HT Technology. It is important to collect sufficient platform performance data on I/O and OS activity to determine the underlying reason for performance results that do not meet expectations.
Incorporating testing with and without Intel HT Technology into optimization activities can yield substantial benefits. Profiling and tuning with Intel® Threading Analysis Tools is a valuable means of improving threaded performance in general, as well as addressing issues that can limit the value of Intel HT Technology to an application, improving the end customer experience and perception of application value.

Software companies should use testing outcomes to make specific recommendations to their customers regarding software and hardware configurations that get the best performance from Intel HT Technology with their specific application. Those recommendations can be expanded to include guidelines for specific types of implementations, as well as system requirements in areas such as system memory and platform resources needed to take full advantage of Intel HT Technology.

When evaluating Intel HT Technology benefit, it is important to use metrics such as elapsed time to complete a task, or amount of work per unit time, and not to rely solely on CPU utilization. That said, low CPU utilization can still point to other scaling problems such as an insufficient number of threads to take advantage of the processor, insufficient platform resources, competition for shared program resources, or excessive serial sections of code.
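A minimal sketch of measuring work by elapsed wall-clock time (the helper name is illustrative): unlike per-thread CPU time or utilization percentages, elapsed time per task remains directly comparable whether Intel HT Technology is on or off.

```cpp
#include <cassert>
#include <chrono>

// Time a unit of work by wall-clock (steady) time; steady_clock is
// monotonic, so the result is unaffected by system clock adjustments.
template <typename F>
double elapsed_seconds(F&& work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Comparing this number (or its reciprocal, work per second) across HT-enabled and HT-disabled runs gives the benefit directly, without the interpretation pitfalls of utilization counters.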


Recommendations for End Users

End users should first rely on software vendors for the best configuration data for their application and hardware. Evaluating performance with Intel HT Technology can be done following the same guidelines as for developers.

Conclusion

Intel HT Technology boosts performance for many applications, resulting in higher performance and higher efficiency. Applications that scale well with cores will typically also scale well with Intel HT Technology. The Nehalem core brings many improvements that complement Intel HT technology, allowing significant performance gains.
Core and thread counts will continue to increase, and good multi-core scaling will continue to be important into the future.

When evaluating the performance of applications running with Intel HT Technology, it is important to understand how performance tool data differs between configurations; comparing data from Intel HT Technology-disabled and -enabled systems often requires more than an intuitive reading to accurately assess the performance implications.

Finally, Intel is committed to helping the ISV community and system users attain the best performance on Intel systems. We encourage you to visit the Intel Developer Zone Parallel Programming and Multi-Core Developer Community for any questions not addressed here on Intel HT Technology and other Intel products.
