At the SC21 supercomputing conference this week, Xilinx introduced its Alveo U55C data center accelerator card and a new standards-based, API-driven clustering solution for deploying FPGAs at massive scale. The company said that by enabling clustering of hundreds of Alveo cards, along with high-level programmability of both the application and the cluster, the new card makes scaling out Alveo compute capabilities to target high performance computing (HPC) workloads easier and more efficient than before.
Xilinx said the Alveo U55C card is purpose-built for HPC and big data workloads, offering the highest compute density and HBM (high bandwidth memory) capacity in the Alveo accelerator portfolio. Together with the new Xilinx RoCE v2-based clustering solution, a broad spectrum of customers with large-scale compute workloads can now implement powerful FPGA-based HPC clustering using their existing data center infrastructure and network. Architecturally, Xilinx claims the FPGA-based accelerator provides the highest performance at the lowest cost for many compute-intensive workloads, and the company is introducing a standards-based methodology that enables the creation of Alveo HPC clusters using a customer’s existing infrastructure and network.
The company said this is a major leap forward for broader adoption of Alveo and adaptive computing throughout the data center.
In an interview with embedded.com, Nathan Chang, HPC product manager for data centers at Xilinx, said, “We’re starting to see that compute isn’t always the bottleneck. Actually, more often than not it tends to be the memory bandwidth. More and more compute problems are becoming memory bandwidth bound. So, we slimmed down our card to a single slot, and also doubled the HBM on that card. But more importantly, we provided the ability to scale out across these cards, with the ability to create large clusters with hundreds of cards and target all the HBM on those cards.”
He continued, “Unlocking the bandwidth across clusters of Alveo cards has always been a big endeavour for our community. Developers had to create teams and then create their own clustering designs to meet their needs. Now we are coming forward with an open, standards-based clustering package – meaning we will be leveraging RoCE v2 and data center bridging, all over Ethernet with 200 Gbps of bandwidth per card.”
“This means that in existing infrastructure in data centers, you’ll be able to put these cards in existing servers, be able to leverage them on existing Ethernet networks, and compete with InfiniBand on performance and latency.”
“Another key point is that not only are we creating room for bigger workloads, but we are also ensuring Vitis is more accessible to the development community. No longer do you need to understand RTL or Verilog. You are able to program Alveo cards and target Alveo boards with existing high-level languages like C, C++ and Python.”
Alveo U55C features for HPC and big data
The Alveo U55C card combines many key features that today’s HPC workloads require. It delivers more parallelism of data pipelines, superior memory management, optimized data movement throughout the pipeline, and the highest performance-per-watt in the Alveo portfolio, according to Xilinx. The card is a single-slot, full-height, half-length (FHHL) form factor with a low 150W maximum power draw. It offers superior compute density and doubles the HBM2 to 16GB compared with its predecessor, the dual-slot Alveo U280 card. The new U55C therefore provides more compute in a smaller form factor for creating dense Alveo accelerator-based clusters, targeting high-density streaming data, high-I/O math, and big compute problems that require scale-out, such as big data analytics and AI applications.
Leveraging RoCE v2 and data center bridging, coupled with 200 Gbps of bandwidth, the API-driven clustering solution enables an Alveo network that competes with InfiniBand networks in performance and latency, with no vendor lock-in. MPI integration allows HPC developers to scale out Alveo data pipelining from the Xilinx Vitis unified software platform. By utilizing existing open standards and frameworks, the company said, it is now possible to scale out across hundreds of Alveo cards regardless of the server platform and network infrastructure, with shared workloads and memory.
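The shared-workload model described here — one job partitioned across many networked cards, with partial results reduced back together — can be sketched conceptually in plain Python. This is a stand-in only (worker threads in place of Alveo cards, a simple sum in place of an accelerated kernel), not Xilinx’s actual API:

```python
# Conceptual sketch of scale-out workload partitioning. Worker threads
# stand in for Alveo cards; a real cluster would dispatch Vitis-built
# kernels across cards over MPI and RoCE v2. All names are hypothetical.
from concurrent.futures import ThreadPoolExecutor

NUM_CARDS = 4  # hypothetical cluster size

def process_partition(chunk):
    """Stand-in for one card's accelerated pipeline stage."""
    return sum(x * x for x in chunk)

def scale_out(data, num_cards=NUM_CARDS):
    # Split the workload evenly across the "cards".
    size = -(-len(data) // num_cards)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=num_cards) as pool:
        partials = pool.map(process_partition, chunks)
    # Combine the partial results, as an MPI reduce would.
    return sum(partials)

if __name__ == "__main__":
    data = list(range(1000))
    assert scale_out(data) == sum(x * x for x in data)
```

The key design point the clustering solution promises is that this partition/reduce pattern runs over standard Ethernet rather than a proprietary fabric.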
Software developers and data scientists can obtain the benefits of Alveo and adaptive computing through high-level programmability of both the application and the cluster using the Vitis platform. Xilinx said it has invested heavily in the Vitis development platform and tools flow to make adaptive computing more accessible to software developers and data scientists without hardware expertise. Major AI frameworks such as PyTorch and TensorFlow are supported, as well as high-level programming languages like C, C++ and Python, allowing developers to build domain solutions using specific APIs and libraries, or to utilize Xilinx software development kits, to easily accelerate key HPC workloads within an existing data center.
Who’s using the cards?
Chang said the company has been working with several organizations on proof-of-concept designs using the U55C cards.
One of them is CSIRO, Australia’s national research organization, which operates the world’s largest radio astronomy antenna array. CSIRO chose the U55C over GPUs because the Alveo card fits in a single slot and does not require a separate NIC (network interface card). CSIRO is utilizing Alveo U55C cards for signal processing in the Square Kilometre Array radio telescope. Deploying the Alveo cards as network-attached accelerators with HBM allows for massive throughput at scale across the HPC signal processing cluster. The Alveo accelerator-based cluster allows CSIRO to tackle the massive compute task of aggregating, filtering, preparing and processing data from 131,000 antennas in real time. The 460GB/s of HBM2 bandwidth per card across the signal processing cluster is served by 420 Alveo U55C cards fully networked together across P4-enabled 100 Gbps switches. The Alveo U55C cluster delivers processing performance with overall throughput of 15Tb/s in a compact, power- and cost-efficient footprint. CSIRO is now completing an example Alveo reference design to help other radio astronomy projects and adjacent industries achieve the same success.
Another use case example is with Ansys LS-DYNA crash simulation software, which is used by nearly every automotive company in the world. The design of safety and structural systems hinges on the performance of models as they mitigate the costs of physical crash testing with computer-aided finite element method (FEM) simulations. FEM solvers are the primary algorithms driving simulations with hundreds of millions of degrees of freedom; these enormous computations can be broken down into more rudimentary solvers such as preconditioned conjugate gradient (PCG), sparse matrix operations, and incomplete Cholesky conjugate gradient (ICCG). By scaling out across many Alveo cards with hyper-parallel data pipelining, LS-DYNA can accelerate performance by more than 5X compared with x86 CPUs. This results in more work per clock cycle in an Alveo pipeline, with LS-DYNA customers benefiting from game-changing simulation times. “In the spirit of relentless innovation, we’re excited about collaborating with Xilinx to significantly accelerate the finite-element solvers, which can represent 90% of the compute workload for implicit mechanics, in our LS-DYNA simulation application,” said Wim Slagter, strategic partnerships director at Ansys. “We look forward to Xilinx acceleration helping us in our mission to support innovators in engineering what’s ahead.”
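To make the solver class concrete, a minimal unpreconditioned conjugate gradient can be written in a few lines of Python. This is a toy CPU sketch on a made-up 2×2 system — the accelerated LS-DYNA solvers operate on sparse systems with hundreds of millions of unknowns, pipelined in hardware:

```python
# Minimal conjugate gradient (CG) solver for a symmetric positive-
# definite system Ax = b -- the kind of kernel FEM solvers reduce to.
# Toy dense version for illustration; production FEM systems are
# sparse and preconditioned (PCG, ICCG).

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    x = [0.0] * len(b)          # initial guess x0 = 0
    r = b[:]                    # residual r = b - A x
    p = r[:]                    # search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]   # small SPD example matrix
    b = [1.0, 2.0]
    print(conjugate_gradient(A, b))  # converges to [1/11, 7/11]
```

Each iteration is dominated by a sparse matrix-vector product and a handful of dot products, which is exactly the streaming, memory-bandwidth-bound arithmetic the article argues HBM-equipped accelerators suit.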
Xilinx cited a third example, that of TigerGraph, provider of a leading graph analytics platform. The company is using multiple Alveo U55C cards to cluster and accelerate the two most prolific algorithms that drive graph-based recommendation and clustering engines. Graph databases are a disruptive platform for data scientists: graphs take data out of silos and bring focus to the relationships between data points, and the next frontier for graph is finding those answers in real time. Alveo U55C accelerates the query times and predictions for recommendation engines from minutes down to milliseconds. By utilizing multiple U55C cards to scale up analytics, the superior computational power and memory bandwidth accelerate graph query speeds by up to 45X compared with CPU-based clusters. The quality of scores also increases by up to 35 percent, resulting in greater confidence and dramatically lowering false positives to low single digits.
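The article does not name the two algorithms, but cosine-similarity scoring is a typical kernel behind graph-based recommendation engines of this kind. A toy pure-Python version, with entirely hypothetical data, looks like this:

```python
# Toy cosine-similarity scoring -- a common kernel in graph-based
# recommendation engines. Illustrative only; an accelerated version
# would distribute the candidate set across many cards' HBM.
import math

def cosine_similarity(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def recommend(query, candidates, top_k=3):
    """Rank candidate feature vectors by similarity to the query."""
    scored = [(name, cosine_similarity(query, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]

if __name__ == "__main__":
    # Hypothetical item/feature vectors.
    items = {
        "A": [1.0, 0.0, 2.0],
        "B": [0.9, 0.1, 2.1],
        "C": [0.0, 3.0, 0.0],
    }
    print(recommend([1.0, 0.0, 2.0], items, top_k=2))
```

Scoring every candidate is embarrassingly parallel, which is why spreading the vectors across the HBM of many cards can collapse query times from minutes to milliseconds.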
The Alveo U55C card is currently available on Xilinx’s website and through Xilinx authorized distributors. It is also available for evaluation via public cloud-based FPGA-as-a-service providers, as well as at select colocation data centers for private previews. Clustering is available now for private previews, with general availability expected in the second quarter of next year.