A first look at how CAPI accelerates: understanding cache coherency and weighing CAPI's advantages and disadvantages against ordinary PCIe accelerator devices. The article partially summarizes how CAPI 1.0 is used, briefly surveys the current state of CAPI, relevant websites, and how it compares with 2.0, and gives a short introduction to the three new open high-speed coherent CPU interfaces (CCIX, Gen-Z, OpenCAPI).
The principles of CAPI, including the CAPI 2.0 bus interface, workflow, and simulation steps (noting the history and the detours I took along the way).
To serve accelerators, the industry is defining open standards for a high-performance coherent CPU interface; in 2016 three open standards appeared (OpenCAPI, Gen-Z, CCIX), which this article also touches on briefly.
I call this a first look because it lacks a software-side analysis of CAPI — for example, exactly how it reduces I/O overhead — and there is no performance comparison against plain I/O-attached acceleration. In particular, I have no measurements of my own for the benefit that cache coherency brings, although IBM's own POWER figures are cited.
CAPI, short for Coherent Accelerator Processor Interface, is an important acceleration feature of the POWER processor architecture. It gives users a customizable, efficient, easy-to-use hardware-acceleration solution that offloads the CPU, implemented on an FPGA. In the POWER8 era, CAPI's PSL (an encrypted IP core) was implemented on Altera FPGAs; after Altera was acquired by Intel, it moved to an IP core on Xilinx parts. PSL resource usage has to be looked up separately; the material I have covers CAPI 1.0 resource usage on Altera. Because CAPI 2.0 shares the same basic principles as 1.0, and because I have mainly worked with 1.0, "CAPI" in this article means 1.0 unless otherwise noted[dream1].
Limited by time and resources, I have not studied OpenCAPI in depth either.
Limited by my ability and time, this article surely contains errors; corrections are welcome at yixiangrong@hotmail.com, and I look forward to discussion. Since the vast majority is original work, and sources are indicated even where material is copied, please credit the source when reposting:
http://www.cnblogs.com/e-shannon/p/7495618.html
1) <OpenPOWER_CAPI_Education_Intro_Latest.ppt>
2) <CCIX,Gen-Z,OpenCAPI_Overview&Comparison.pdf>
3) <OpenPOWER and the Roadmap Ahead.pdf>
4) Web sources
https://openpowerfoundation.org/?resource_lib=psl-afu-interface-capi-2-0
http://www.csdn.net/article/2015-06-17/2824990
http://www.openhw.org/module/forum/thread-597651-1-1.html
www-304.ibm.com/webapp/set2/sas/f/capi/CAPI_POWER8.pdf
5) <POWER9-VUG.pdf>
https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/61ad9cf2-c6a3-4d2c-b779-61ff0266d32a/page/1cb956e8-4160-4bea-a956-e51490c2b920/attachment/56cea2a9-a574-4fbb-8b2c-675432367250/media/POWER9-VUG.pdf
CAPI : Coherent Accelerator Processor Interface
POWER: Performance Optimization With Enhanced RISC
HDK: Hardware development kit
SDK: Software development kit
CCIX: Cache Coherent Interconnect for Accelerators. www.ccixconsortium.com
OpenCAPI: Open Coherent Accelerator Processor Interface. opencapi.org
Gen-Z: genzconsortium.org
LRU: least recently used
HPC: High Performance Computing
DMI: Durable Memory interface (OpenPOWER and the Roadmap Ahead.pdf)
QPI: The Intel QuickPath Interconnect (QPI) is a point-to-point processor interconnect developed by Intel which replaced the front-side bus (FSB) in Xeon, Itanium, and certain desktop platforms starting in 2008 (wiki); it competed with AMD's HyperTransport (HT).
https://jingyan.baidu.com/article/6525d4b11f2c2bac7d2e943e.html
SMP: Symmetric Multi-Processor, a UMA architecture in which the multi-core CPUs share all resources; the POWER architecture uses SMP[dream2]
NUMA: Non-Uniform Memory Access. In contrast to SMP, the CPUs are divided into groups, and access to local memory is faster than access to remote memory — hence "non-uniform". The trend in hardware has been towards more than one system bus, each serving a small set of processors. Each group of processors has its own memory and possibly its own I/O channels. However, each CPU can access memory associated with the other groups in a coherent way. Each group is called a NUMA node. The number of CPUs within a NUMA node depends on the hardware vendor. It is faster to access local memory than the memory associated with other NUMA nodes. This is the reason for the name: non-uniform memory access architecture.
http://www.cnblogs.com/yubo/archive/2010/04/23/1718810.html
https://technet.microsoft.com/en-us/library/ms178144(v=sql.105).aspx
MPP: Massively Parallel Processing. Multiple groups of SMP CPUs; memory cannot be accessed across groups, which are interconnected through network nodes, so the system can scale almost without limit[dream3]
Differences between NUMA and MPP
http://www.cnblogs.com/yubo/archive/2010/04/23/1718810.html
Architecturally, NUMA and MPP share many similarities: both consist of multiple nodes, each node has its own CPU, memory, and I/O, and nodes can exchange information through an interconnect mechanism. So where do they differ? Examining the internal architecture and operation of NUMA and MPP servers below makes the differences clear.
First, the node interconnect mechanism differs. NUMA's interconnect is implemented inside a single physical server; when a CPU needs to access remote memory, it must wait, which is the main reason NUMA servers cannot scale performance linearly as CPUs are added. MPP's interconnect is implemented externally, across separate SMP servers, via I/O; each node accesses only its local memory and storage, and inter-node information exchange proceeds in parallel with each node's own processing. MPP performance therefore scales essentially linearly as nodes are added.
Second, the memory access mechanism differs. Inside a NUMA server, any CPU can access the entire system's memory, but remote access performs far worse than local access, so applications should avoid remote memory access as much as possible. In an MPP server, each node accesses only local memory, and remote memory access does not exist.
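The two access models above can be sketched as a toy cost model in Python. The latency numbers and function names are invented purely for illustration; real figures depend entirely on the machine:

```python
# Toy cost model contrasting NUMA and MPP memory access.
# LOCAL_NS / REMOTE_NS are made-up illustrative latencies.

LOCAL_NS, REMOTE_NS = 100, 300

def access_latency(cpu_node, addr_node):
    """NUMA: every address is reachable, but remote nodes cost more."""
    return LOCAL_NS if cpu_node == addr_node else REMOTE_NS

def mpp_access(cpu_node, addr_node):
    """MPP: only local memory is addressable; remote data needs messaging."""
    if cpu_node != addr_node:
        raise ValueError("MPP: remote memory is not addressable; exchange messages instead")
    return LOCAL_NS

print(access_latency(0, 0))   # local access  -> 100
print(access_latency(0, 1))   # remote access -> 300
```

This also shows why NUMA scaling flattens out (remote accesses get more frequent as nodes are added) while MPP scales by forbidding them outright.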
NUCA: Non-Uniform Cache Architecture, used for the POWER9 L3 cache
ISA: instruction set architecture
CAIA : Coherent Accelerator Interface Architecture defines a coherent accelerator interface structure for coherently attaching accelerators to POWER systems using a standard PCIe bus. The intent is to allow implementation of a wide range of accelerators in order to optimally address many different market segments.
CAPP : Coherent Accelerator Processor Proxy
Design unit that snoops the PowerBus commands and provides coherency responses reflecting the state of the caches in PSL. Issues commands to PSL so that it can provide data responses.
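As a rough illustration of what "snoops the PowerBus commands and provides coherency responses" means, here is a toy MSI-style directory in Python. The states, method names, and response strings are invented for illustration and are not the real PowerBus protocol:

```python
# Toy MSI-style snooper illustrating the CAPP's role: it mirrors the state
# of lines held in the PSL cache and answers bus snoops on their behalf.

INVALID, SHARED, MODIFIED = "I", "S", "M"

class CappProxy:
    def __init__(self):
        self.dir = {}                 # line address -> state in the PSL cache

    def psl_load(self, addr):
        self.dir[addr] = SHARED       # PSL cached a clean copy

    def psl_store(self, addr):
        self.dir[addr] = MODIFIED     # PSL owns a dirty copy

    def snoop_read(self, addr):
        """Another master reads: a dirty line must be supplied and demoted."""
        state = self.dir.get(addr, INVALID)
        if state == MODIFIED:
            self.dir[addr] = SHARED
            return "intervention"     # PSL supplies the data response
        return "clean"

    def snoop_write(self, addr):
        """Another master writes: any copy in the PSL cache is invalidated."""
        self.dir[addr] = INVALID
```

The point is only that the CAPP sits on the processor bus so the PSL cache participates in coherency like any other cache, without the accelerator itself being on the bus.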
PSL : Power Service Layer
The PSL provides the address translation and system memory cache for the AFUs. In addition, the PSL provides miscellaneous facilities for the host processor to manage the virtualization of the AFUs, interrupts, and memory management.
AFU : Accelerator Function Unit
EA/RA: Effective Address (EA) / Real Address (RA)… see Power ISA Book III
The AFU uses effective addresses, that is, the process's address space (which the industry calls "virtual"). The PSL translates the effective address into a real address (which the industry calls "physical") for accessing memory within the PowerPC system.
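The EA-to-RA step can be sketched as a toy page-table lookup in Python. All names here are invented; the real PSL performs the translation in hardware against the POWER page tables and interrupts the host when a translation is missing:

```python
# Toy model of the EA -> RA translation the PSL performs for the AFU.

PAGE_SHIFT = 12                       # assume 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

class TranslationFault(Exception):
    """A miss; the real PSL would raise an interrupt for the OS to resolve."""

class PslTranslator:
    def __init__(self):
        self.page_table = {}          # effective page number -> real page number

    def map_page(self, epn, rpn):
        self.page_table[epn] = rpn

    def translate(self, ea):
        epn = ea >> PAGE_SHIFT
        rpn = self.page_table.get(epn)
        if rpn is None:
            raise TranslationFault(hex(ea))
        return (rpn << PAGE_SHIFT) | (ea & PAGE_MASK)

psl = PslTranslator()
psl.map_page(0x1234, 0x00AB)          # map one effective page to a real page
print(hex(psl.translate(0x1234567)))  # -> 0xab567
```

Because the AFU works in the application's own address space, software can hand it ordinary pointers without pinning and translating buffers itself, which is one of CAPI's main programming-model advantages over a plain PCIe device.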
MMIO: Memory-mapped input/output.
WED: Work Element Descriptor. When an application requests use of an AFU, a process element is added to the process-element linked list that describes the application's process state. The process element also contains a work element descriptor (WED) provided by the application. The WED can contain the full description of the job to be performed, or a pointer to other main-memory structures in the application's memory space that tell the AFU what to do. Several programming models are described, allowing an AFU to be used by any application or to be dedicated to a single application.
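The process-element list and WED described above can be modeled roughly as follows; the class and field names are invented for illustration and do not match the CAIA structure layouts:

```python
# Toy model of the process-element linked list and its WED.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProcessElement:
    pid: int
    wed: dict                              # full job description, or a pointer-like ref
    next: Optional["ProcessElement"] = None

class ProcessElementList:
    def __init__(self):
        self.head = None

    def attach(self, pid, wed):
        """Called when an application requests use of the AFU."""
        pe = ProcessElement(pid, wed, self.head)
        self.head = pe
        return pe

# The AFU walks the list and reads each WED to learn its work:
plist = ProcessElementList()
plist.attach(101, {"op": "memcpy", "src": 0x1000, "dst": 0x2000, "len": 4096})
```

In the dedicated-process model this list effectively has one entry; in the shared models the AFU time-shares across the attached process elements.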