SPARC DVMA and x86 S-G DMA


I think the SPARC-only feature you're trying to describe is called DVMA. The similar feature on x86 is usually called Scatter-Gather DMA. DVMA might actually be tied into the ncrs performance problem, but I suspect it's not the root cause. The fact that SPARC systems have DVMA and x86 systems have S-G DMA doesn't inherently penalize x86 systems and isn't really a problem if dealt with correctly. But if dealt with incorrectly, the lack of DVMA on x86 systems can cause performance problems. In other words, I think the root cause of the ncrs performance problems may be that someone at Sun simply "broke" the ncrs driver while putting the x86 Scatter-Gather DMA code back in to replace the SPARC DVMA code (and it has nothing to do with whether or not DVMA is more efficient than S-G DMA).

Assuming Bruce's tests are reasonable and accurate (which I have no reason to doubt), I have a theory about the ncrs performance problem. Unfortunately I don't have any access to the currently shipping ncrs driver sources, so I can only speculate based on the fact that the results Bruce describes match almost exactly the results I predicted two years ago: that Sun did exactly what I warned them not to do.

I suspect that the version of the ncrs driver that Sun is currently shipping has some bogus DMA breakup code, and that the DMA breakup code in the current ncrs driver is fundamentally different from the DMA breakup code in the original version of the ncrs driver I wrote six years ago. In other words, the last ncrs driver update introduced some bad code.

The reason I'm not certain of this, and only suspect it, is that at the point I last worked for Sun they were still in the middle of an extremely lengthy ncrs driver update project (over 18 months for a single update). At that point in time the ncrs driver was still being updated and did in fact have some bogus DMA breakup code in it. But the current version of the ncrs driver wasn't finished until long after I stopped working at Sun, so I've no idea exactly which version of the driver they finally decided to ship.

If some Sun engineer wants to look at the ncrs sources and tell me I'm completely wrong, and that the current ncrs driver's DMA breakup code is the same as the old ncrs driver's, then please forget I said anything.

First a bit of background.

There are three different versions of the ncrs driver. There's the original version I wrote, which only supports the '810, '815, '820, and '825 chips (and didn't support wide mode or tagged queuing). There's the SPARC glm version of the driver (which supports wide mode, tagged queuing, whatever '8xx chips are used on SPARC platforms, and a bunch of other features and bug fixes which most x86 users never missed). And then there's the new and improved x86 ncrs version, which is a merger of the old ncrs and the SPARC glm versions. It's the "new and improved" ncrs driver which has the suspected performance problems.

One of the fundamental differences between the x86 and SPARC platforms is that SPARC platforms have DVMA hardware and x86 platforms do not. On a SPARC platform, a PCI device accesses system memory through a virtual memory map which presents an I/O buffer (no matter how large) to the device as a single linear address space. On x86 platforms, PCI devices see system memory as discontiguous physical memory pages. Therefore, x86 devices and device drivers need to support Scatter-Gather DMA operations, with each discontiguous range of pages in an I/O request requiring at least one separate DMA cookie.
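For anyone who hasn't worked with the Solaris DDI-DMA interfaces, here's a minimal sketch (in C) of what walking a cookie list looks like in an x86 HBA driver. The ddi_dma_* routines are the real DDI calls; the function name and the hba_sg[] table are invented for illustration:

    /*
     * Sketch: bind a buffer for DMA and walk the resulting cookie
     * list.  On SPARC the DVMA hardware typically yields a single
     * cookie; on x86 the bind can yield one cookie per page.
     */
    #include <sys/types.h>
    #include <sys/buf.h>
    #include <sys/ddi.h>
    #include <sys/sunddi.h>

    static int
    hba_bind_and_build_sg(ddi_dma_handle_t handle, struct buf *bp,
        ddi_dma_cookie_t *hba_sg, uint_t max_sg)
    {
        ddi_dma_cookie_t cookie;
        uint_t ccount, i;

        if (ddi_dma_buf_bind_handle(handle, bp,
            DDI_DMA_READ | DDI_DMA_STREAMING,
            DDI_DMA_SLEEP, NULL, &cookie, &ccount) != DDI_DMA_MAPPED)
            return (DDI_FAILURE);

        if (ccount > max_sg)        /* too fragmented for our S-G list */
            return (DDI_FAILURE);   /* (this is where windowing comes in) */

        hba_sg[0] = cookie;
        for (i = 1; i < ccount; i++) {
            ddi_dma_nextcookie(handle, &cookie);
            hba_sg[i] = cookie;     /* one address/length pair per entry */
        }
        return (DDI_SUCCESS);
    }

On SPARC the ccount that comes back is 1 and the loop never executes; on x86 a single 1MB request can hand you those 257 cookies.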
In fact there may be a cookie per page, so a 1MB request could potentially require a 257-entry Scatter-Gather list if the driver attempts to process such a request in one swell foop.

When I wrote the original ncrs driver I decided that statically allocating a huge Scatter-Gather list buffer for every single I/O request, on the chance that any particular request might be very large, would be too inefficient. I also decided that dynamically adjusting the size of the S-G list over a very large range would needlessly complicate the device driver. What I decided to do was to use the Solaris Partial DMA mechanism and limit the size of the S-G list to 17 entries.

If a SCSI HBA driver implements Partial DMA, then any target driver which attempts to initialize a request packet that exceeds the HBA driver's maximum DMA capability is forced to either break up the request or reject it. All of Sun's x86 SCSI HBA drivers have in fact always included Partial DMA support, because they have all been modeled after drivers which were originally written for ISA-SCSI controllers. On an ISA-SCSI controller, Partial DMA is used in conjunction with the DMA Windowing interfaces to deal with the 16MB boundary and the bounce buffers.

Therefore, x86 HBA drivers that implement Partial DMA can appear to handle very large I/O requests by cooperating with the target driver to internally and transparently break up a single user I/O request into multiple smaller I/O requests. The user process is never aware that a large I/O request was actually handled as multiple partial sub-requests. It's all taken care of in the interface between the target driver, the HBA driver, and the DDI-DMA functions.

It turns out the exact same Partial DMA and DMA Windowing mechanisms are also necessary to deal with the 36-bit PAE feature which Intel introduced on Pentium Pro CPUs. A DDI-compliant SCSI HBA driver is expected to work on both 32-bit x86 CPUs and 36-bit PAE CPUs. I was indirectly aware of PAE at that time, but what concerned me even more was the fact that 64-bit x86 CPUs would show up eventually, and I didn't want to worry about re-writing the driver years in the future in order to make the 32-bit '8xx chips continue to work on 64-bit x86 systems. In other words, even though my driver didn't have to worry about ISA bus limitations, I still had to include Partial DMA support for the coming 36-bit PAE and 64-bit x86 CPUs. Given that I had to include Partial DMA support anyway, I decided that with a small bit of additional code I could use the exact same Partial DMA support to devise a space- and time-efficient Scatter-Gather mechanism for the x86 ncrs driver. So I in fact solved three x86-specific requirements with a single mechanism.

Of course, when the SMCC engineers ported my ncrs driver to the SPARC platform and renamed it glm, one of the first things they ripped out (even though they could have simply ignored it) was the multiple-entry Scatter-Gather list support. On a SPARC platform no I/O request, no matter how large, ever requires more than a single DMA cookie. The DVMA hardware on SPARC platforms re-maps the physically discontiguous system memory pages that make up any I/O request into a single virtual address range. In other words, the DVMA hardware does for I/O devices the same thing that the MMU does for the CPU. As far as the SPARC people were concerned, the ncrs driver didn't need the 16 extra S-G list entries and didn't need Partial DMA for 36-bit PAE or 64-bit CPUs.
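To make the Partial DMA mechanism concrete, here's a rough sketch of the HBA-driver side. Again the ddi_dma_* routines are real DDI interfaces; hba_issue_request() is an invented placeholder, and a real driver would issue each window from its completion interrupt rather than from a synchronous loop:

    /*
     * Sketch: bind with DDI_DMA_PARTIAL so an oversized request
     * succeeds as a partial mapping, then step through the DMA
     * windows, each of which fits the 17-entry S-G list.
     */
    #include <sys/types.h>
    #include <sys/buf.h>
    #include <sys/ddi.h>
    #include <sys/sunddi.h>

    /* placeholder: would program the '8xx S-G table and start I/O */
    static int
    hba_issue_request(ddi_dma_handle_t h, ddi_dma_cookie_t *sg, uint_t n)
    {
        return (DDI_SUCCESS);
    }

    static int
    hba_start_partial(ddi_dma_handle_t handle, struct buf *bp)
    {
        ddi_dma_cookie_t cookie;
        uint_t ccount, nwin, win;
        off_t off;
        size_t len;
        int rv;

        rv = ddi_dma_buf_bind_handle(handle, bp,
            DDI_DMA_WRITE | DDI_DMA_PARTIAL,
            DDI_DMA_SLEEP, NULL, &cookie, &ccount);

        if (rv == DDI_DMA_MAPPED)       /* whole request fit; no breakup */
            return (hba_issue_request(handle, &cookie, ccount));
        if (rv != DDI_DMA_PARTIAL_MAP)
            return (DDI_FAILURE);

        /* request exceeded our limits: one sub-request per window */
        (void) ddi_dma_numwin(handle, &nwin);
        for (win = 0; win < nwin; win++) {
            (void) ddi_dma_getwin(handle, win, &off, &len,
                &cookie, &ccount);
            /* (each window's remaining cookies come from
             *  ddi_dma_nextcookie(), as in the earlier sketch) */
            if (hba_issue_request(handle, &cookie, ccount) != DDI_SUCCESS)
                return (DDI_FAILURE);
        }
        return (DDI_SUCCESS);
    }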
Now, as you're all aware, some time ago (about 1.5 years ago) Sun finally released an updated ncrs driver which merged together all the latest and greatest features from the SPARC glm driver and the x86-specific requirements from the original ncrs driver. Unfortunately, the Sun manager in charge of that project hired a contractor who had never written a Solaris device driver. That contractor had his own ideas about what an '8xx device driver should look like, and repeatedly tried to convince anyone who would listen that Sun should pay him to port the BSD/Linux '8xx device driver rather than simply merge back the small amount of x86-specific code which had been excised from the SPARC glm driver. Needless to say, his first few attempts at merging the old ncrs driver and the SPARC glm driver didn't go well.

That contractor (who shall remain nameless) chose to completely ignore the advice of several much more experienced Solaris device driver writers (I know of at least three engineers who at one time or another tried to make management aware of the fact that, even though the work was unfinished, it was obvious that the merged driver wasn't going to work correctly). One of the major mistakes the contractor made was to completely ignore the fact that all x86 SCSI HBA drivers must support Partial DMA and DDI-DMA Windowing (because of 36-bit PAE and 64-bit CPUs with 32-bit devices).

The contractor did recognize that x86 systems don't have DVMA hardware and that he still needed some mechanism to deal with very large DMA requests. He decided (I think based on some BSD driver he'd previously worked on) that the way to deal with the limited size of the Scatter-Gather list was not to implement Partial DMA but rather to implement a completely new large-DMA mechanism. His DMA mechanism would accept any size I/O request (no matter how large) by changing the lowest-level interrupt handling code to emulate an infinitely long S-G list using just the fixed-length S-G list.

In other words, the actual S-G list accessed by the firmware has only 17 entries in it. When the '8xx firmware finishes the data transfer for the 17th entry in the S-G list, it PAUSES THE WHOLE SCSI BUS (by withholding the last ACK cycle) while it waits for the device driver's interrupt handler to generate the next 17 S-G list entries and fix up all the S-G list pointers. I believe he used the expression "continuous DMA" (or maybe it was "infinite DMA") to describe this mechanism.

Of course, normally the time interval between byte ACKs is 100 nsec or less. My best guess is that his "continuous DMA" Scatter-Gather list fix-up stuff would take multiple hundreds of microseconds to complete (perhaps even as long as 1 msec). A worst-case 1MB transfer would require 16 such fix-up interrupts (one every 64KB). So on an 80MB/sec bus the data transfer would take 12.5 msec, and the 16 fix-up interrupts might add something on the order of an additional 4 to 16 msec (depending on whether you think an x86 CPU can handle 1000 or 4000 PCI-SCSI interrupts per second).

Clearly, the "continuous DMA" mechanism is fundamentally different from how the old ncrs driver worked. The old driver would break up any request that exceeded 17 DMA cookies into multiple requests. Multiple requests of course require additional command and status phases, but they do *not* hang the whole bus between requests for hundreds of microseconds at a time.
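If you want to check the arithmetic, here's the same back-of-the-envelope estimate as a trivial C program, using exactly the assumptions from the previous paragraphs (1MB request, 80MB/sec bus, 16 fix-ups at somewhere between 250 usec and 1 msec each):

    /* worked version of the "continuous DMA" overhead estimate */
    #include <stdio.h>

    int
    main(void)
    {
        double data_ms  = 1.0 / 80.0 * 1000.0;     /* 1MB at 80MB/sec = 12.5 msec */
        int    fixups   = 16;                      /* one fix-up per 64KB */
        double best_ms  = data_ms + fixups * 0.25; /* 4000 interrupts/sec case */
        double worst_ms = data_ms + fixups * 1.00; /* 1000 interrupts/sec case */

        printf("data transfer alone: %.1f msec\n", data_ms);
        printf("with bus-hanging fix-ups: %.1f to %.1f msec\n",
            best_ms, worst_ms);
        return (0);
    }

That is, the fix-up interrupts alone add roughly 30% to 130% to the time the bus is tied up by a single large request.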
Also, any reasonable SCSI disk drive implements tagged queuing, so in fact the multiple "sub-requests" can potentially be handed to the drive all at once, before any data transfer starts, and can overlap the seek time of a prior request, during which the SCSI bus might be idle anyway. And the bus and device continue to operate in parallel while the device driver handles the "normal" I/O completion interrupts (rather than the bus-hanging fix-up interrupts).

What this all means is that "continuous DMA" turns a large I/O request into multiple bursts of data transfer with long periods between the bursts during which every device on the bus is unable to initiate any other bus or device activity. Whereas the Partial DMA mechanism splits up a large I/O request into multiple smaller requests and allows very efficient parallelization of all the phases of the multiple sub-requests using the well-proven tagged queuing and disconnect-reconnect SCSI features. The Partial DMA approach is analogous to a CPU with a multi-stage pipeline and multiple dispatch units; the "continuous DMA" approach is analogous to a monolithic CPU which is constrained by its slowest functional element.

Of course, all of the above discourse about how the merged ncrs driver is implemented is total speculation. My analysis could be completely wrong, because at the time I last reviewed the merged ncrs driver the "continuous DMA" code was incomplete and couldn't be tested or benchmarked.

My recollection is that long before Sun finalized the ncrs driver, I warned the Sun managers that the merged ncrs driver appeared wrong to me and would probably perform worse than the older ncrs. I suggested that since adding the Partial DMA support was unavoidable (because of 36-bit PAE and 64-bit CPUs) they should discard *all* the "continuous DMA" changes and stick with the proven Partial DMA design. Perhaps they listened to my advice and removed the "continuous DMA" code, in which case the ncrs performance problem would be due to some other problem. Perhaps even the old ncrs driver has the same performance problem.

I strongly suspect (but have no way to verify) that what Sun ended up doing was including both Partial DMA and "continuous DMA", and that Partial DMA is only activated on 36-bit or 64-bit systems when bounce buffers are needed, and that in normal operation large I/O requests are subjected to "continuous DMA" and the Scatter-Gather list fix-up interrupt overhead.
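For completeness, here's roughly what the target-driver half of the Partial DMA cooperation looks like. This is a sketch from memory, not code from any shipping driver: scsi_init_pkt(), PKT_DMA_PARTIAL, pkt_resid, and scsi_transport() are the real SCSA interfaces, but the helper names are invented and the CDB setup is elided:

    /*
     * Sketch: target-driver side of Partial DMA.  With
     * PKT_DMA_PARTIAL the HBA may map only part of the buffer;
     * pkt_resid reports what's left, and re-calling scsi_init_pkt()
     * with the same pkt and bp moves the DMA window forward.
     */
    #include <sys/scsi/scsi.h>
    #include <sys/errno.h>

    static void
    tgt_done(struct scsi_pkt *pkt)
    {
        /*
         * Would check pkt->pkt_resid here and, if nonzero, re-call
         * scsi_init_pkt() with this same pkt/bp and transport the
         * next sub-request; the user process never sees the breakup.
         */
    }

    static int
    tgt_start_io(struct scsi_address *ap, struct buf *bp)
    {
        struct scsi_pkt *pkt;

        pkt = scsi_init_pkt(ap, NULL, bp, CDB_GROUP1, 1, 0,
            PKT_DMA_PARTIAL, SLEEP_FUNC, NULL);
        if (pkt == NULL)
            return (EIO);

        /* (CDB setup elided) */
        pkt->pkt_comp = tgt_done;
        return (scsi_transport(pkt) == TRAN_ACCEPT ? 0 : EIO);
    }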
