Search | BVS Regional Portal

1.

Chi, Yuze; Guo, Licheng; Cong, Jason.

FPGA ; 2022: 190-200, 2022 Feb.

Article in English | MEDLINE | ID: mdl-35300320

ABSTRACT

The single-source shortest path (SSSP) problem is one of the most important and well-studied graph problems, widely used in many application domains such as road navigation, neural image reconstruction, and social network analysis. Although various SSSP algorithms have been known for decades, implementing one efficiently for large-scale power-law graphs is still highly challenging today, because ① a work-efficient SSSP algorithm requires priority-order traversal of graph data, ② the priority queue needs to be scalable both in throughput and capacity, and ③ priority-order traversal requires extensive random memory accesses on graph data. In this paper, we present SPLAG to accelerate SSSP for power-law graphs on FPGAs. SPLAG uses a coarse-grained priority queue (CGPQ) to enable high-throughput priority-order graph traversal with a large frontier. To mitigate the high-volume random accesses, SPLAG employs a customized vertex cache (CVC) to reduce off-chip memory access and improve the throughput to read and update vertex data. Experimental results on various synthetic and real-world datasets show up to a 4.9× speedup over state-of-the-art SSSP accelerators, a 2.6× speedup over a 32-thread CPU running at 4.4 GHz, and a 0.9× speedup over an A100 GPU that has 4.1× the power budget and 3.4× the HBM bandwidth. Such high performance would place SPLAG in the 14th position of the Graph 500 benchmark for data-intensive applications (the highest using a single FPGA) with only a 45 W power budget. SPLAG is written in high-level synthesis C++ and is fully parameterized, which means it can be easily ported to various FPGAs with different configurations. SPLAG is open-source at https://github.com/UCLA-VAST/splag.
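The priority-order traversal mentioned in the abstract can be illustrated with a plain software baseline. The sketch below is a textbook binary-heap Dijkstra, not SPLAG's CGPQ or CVC design; the function name `sssp` and the adjacency-dictionary layout are assumptions made for illustration only.

```python
import heapq


def sssp(adj, source):
    """Baseline Dijkstra: visit vertices in order of tentative distance.

    adj maps a vertex to a list of (neighbor, non-negative weight) pairs.
    This is an illustrative software baseline, not the SPLAG accelerator.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]  # priority queue keyed by tentative distance
    while heap:
        d, u = heapq.heappop(heap)
        # Skip stale entries: a shorter path to u was already settled.
        if d > dist.get(u, float("inf")):
            continue
        # Each relaxation reads and may update a neighbor's distance,
        # which is the random memory access pattern the abstract refers to.
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

The heap here is strictly ordered and single-threaded; the abstract's point is that scaling such a queue in both throughput and capacity is the hard part on large power-law graphs.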

2.

Extending High-Level Synthesis for Task-Parallel Programs.

Chi, Yuze; Guo, Licheng; Lau, Jason; Choi, Young-Kyu; Wang, Jie; Cong, Jason.

Proc Annu IEEE Symp Field Program Cust Comput Mach ; 2021, 2021 May.

Article in English | MEDLINE | ID: mdl-34497978

ABSTRACT

C/C++/OpenCL-based high-level synthesis (HLS) has become increasingly popular for field-programmable gate array (FPGA) accelerators in many application domains in recent years, thanks to its competitive quality of results (QoR) and short development cycles compared with the traditional register-transfer level design approach. Yet, limited by the sequential C semantics, it remains challenging to adopt the same highly productive high-level programming approach in many other application domains, where coarse-grained tasks run in parallel and communicate with each other at a fine-grained level. While current HLS tools do support task-parallel programs, the productivity is greatly limited ① in the code development cycle due to the poor programmability, ② in the correctness verification cycle due to restricted software simulation, and ③ in the QoR tuning cycle due to slow code generation. Such limited productivity often defeats the purpose of HLS and hinders programmers from adopting HLS for task-parallel FPGA accelerators. In this paper, we extend the HLS C++ language and present a fully automated framework with programmer-friendly interfaces, unconstrained software simulation, and fast hierarchical code generation to overcome these limitations and demonstrate how task-parallel programs can be productively supported in HLS. Experimental results based on a wide range of real-world task-parallel programs show that, on average, the lines of kernel and host code are reduced by 22% and 51%, respectively, which considerably improves the programmability. The correctness verification and the iterative QoR tuning cycles are both greatly shortened, by 3.2× and 6.8×, respectively. Our work is open-source at https://github.com/UCLA-VAST/tapa/.

3.

HBM Connect: High-Performance HLS Interconnect for FPGA HBM.

Choi, Young-Kyu; Chi, Yuze; Qiao, Weikang; Samardzic, Nikola; Cong, Jason.

FPGA ; 2021: 116-126, 2021 Feb.

Article in English | MEDLINE | ID: mdl-33817702

ABSTRACT

With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bound applications to benefit from FPGA acceleration. However, fully utilizing the available bandwidth may not be an easy task. When an application requires multiple processing elements to access multiple HBM channels, we observed a significant drop in the effective bandwidth. The existing high-level synthesis (HLS) programming environment has limitations in producing an efficient communication architecture. To solve this problem, we propose HBM Connect, a high-performance customized interconnect for FPGA HBM boards. Novel HLS-based optimization techniques are introduced to increase the throughput of AXI bus masters and switching elements. We also present a high-performance customized crossbar that may replace the built-in crossbar. The effectiveness of HBM Connect is demonstrated using Xilinx's Alveo U280 HBM board. Based on bucket sort and merge sort case studies, we explore several design spaces and find the design point with the best resource-performance trade-off. The results show that HBM Connect improves the resource-performance metrics by 6.5×-211×.

4.

AutoBridge: Coupling Coarse-Grained Floorplanning and Pipelining for High-Frequency HLS Design on Multi-Die FPGAs.

Guo, Licheng; Chi, Yuze; Wang, Jie; Lau, Jason; Qiao, Weikang; Ustun, Ecenur; Zhang, Zhiru; Cong, Jason.

FPGA ; 2021: 81-92, 2021 Feb.

Article in English | MEDLINE | ID: mdl-33851145

ABSTRACT

Despite an increasing adoption of high-level synthesis (HLS) for its design productivity advantages, there remains a significant gap in the achievable frequency between an HLS design and a handcrafted RTL one. A key factor that limits the timing quality of the HLS outputs is the difficulty in accurately estimating the interconnect delay at the HLS level. This problem becomes even worse when large HLS designs are implemented on the latest multi-die FPGAs. To tackle this challenge, we propose AutoBridge, an automated framework that couples a coarse-grained floorplanning step with pipelining during HLS compilation. First, our approach provides HLS with a view on the global physical layout of the design, allowing HLS to more easily identify and pipeline the long wires, especially those crossing the die boundaries. Second, by exploiting the flexibility of HLS pipelining, the floorplanner is able to distribute the design logic across multiple dies on the FPGA device without degrading clock frequency. This prevents the placer from aggressively packing the logic on a single die which often results in local routing congestion that eventually degrades timing. Since pipelining may introduce additional latency, we further present analysis and algorithms to ensure the added latency will not compromise the overall throughput. AutoBridge can be integrated into the existing CAD toolflow for Xilinx FPGAs. In our experiments with a total of 43 design configurations, we improve the average frequency from 147 MHz to 297 MHz (a 102% improvement) with no loss of throughput and a negligible change in resource utilization. Notably, in 16 experiments we make the originally unroutable designs achieve 274 MHz on average. The tool is available at https://github.com/Licheng-Guo/AutoBridge.

5.

Exploiting Computation Reuse for Stencil Accelerators.

Chi, Yuze; Cong, Jason.

Proc Des Autom Conf ; 2020, 2020 Jul.

Article in English | MEDLINE | ID: mdl-33796879

ABSTRACT

The stencil kernel is an important type of kernel used extensively in many application domains. Over the years, researchers have studied optimizations of parallelization, communication reuse, and computation reuse for various target platforms. However, challenges still exist, especially on the computation reuse problem for accelerators, due to the lack of complete design-space exploration and effective design-space pruning. In this paper, we present solutions to the above challenges for a wide range of stencil kernels (i.e., stencils with reduction operations), where the computation reuse patterns are extremely flexible due to the commutative and associative properties. We formally define the complete design space, based on which we present a provably optimal dynamic programming algorithm and a heuristic beam search algorithm that provides near-optimal solutions under an architecture-aware model. Experimental results show that for synthesizing stencil kernels to FPGAs, compared with a state-of-the-art stencil compiler without computation reuse capability, our proposed algorithm can reduce the look-up table (LUT) and digital signal processor (DSP) usage by 58.1% and 54.6% on average, respectively, which leads to an average speedup of 2.3× for compute-intensive kernels, outperforming the latest CPU/GPU results.
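To make the idea of computation reuse in a reduction stencil concrete, the sketch below compares a naive 3×3 box sum (8 additions per output) with a version that computes each vertical 3-element partial sum once and shares it among three neighboring outputs (about 4 additions per output). The reuse pattern is chosen by hand for illustration; it is not the dynamic programming or beam search algorithm from the abstract, and the function names `box_naive` and `box_reuse` are assumptions.

```python
def box_naive(a):
    """3x3 box sum over interior points, recomputing every term."""
    rows, cols = len(a), len(a[0])
    out = [[0] * (cols - 2) for _ in range(rows - 2)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i - 1][j - 1] = sum(
                a[i + di][j + dj] for di in (-1, 0, 1) for dj in (-1, 0, 1)
            )
    return out


def box_reuse(a):
    """Same result, but each vertical partial sum is computed only once.

    Relies on addition being commutative and associative, so the nine
    terms can be regrouped into three shared column sums.
    """
    rows, cols = len(a), len(a[0])
    # col[r][j] is the sum of the three vertically adjacent cells
    # centered on interior row r + 1, column j.
    col = [
        [a[i - 1][j] + a[i][j] + a[i + 1][j] for j in range(cols)]
        for i in range(1, rows - 1)
    ]
    # Each output combines three neighboring column sums.
    return [
        [r[j - 1] + r[j] + r[j + 1] for j in range(1, cols - 1)]
        for r in col
    ]
```

Integer inputs give identical results from both versions; with floating-point data the regrouping can change rounding slightly, which is one reason reuse choices are a design decision rather than a free rewrite.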

6.

Test-retest reliability of graph metrics in high-resolution functional connectomics: a resting-state functional MRI study.

Du, Hai-Xiao; Liao, Xu-Hong; Lin, Qi-Xiang; Li, Gu-Shu; Chi, Yu-Ze; Liu, Xiang; Yang, Hua-Zhong; Wang, Yu; Xia, Ming-Rui.

CNS Neurosci Ther ; 21(10): 802-16, 2015 Oct.

Article in English | MEDLINE | ID: mdl-26212146

ABSTRACT

BACKGROUND: The combination of resting-state functional MRI (R-fMRI) technique and graph theoretical approaches has emerged as a promising tool for characterizing the topological organization of brain networks, that is, functional connectomics. In particular, the construction and analysis of high-resolution brain connectomics at a voxel scale are important because they do not require prior regional parcellations and provide finer spatial information about brain connectivity. However, the test-retest reliability of voxel-based functional connectomics remains largely unclear. AIMS: This study aimed to investigate both short-term (~20 min apart) and long-term (6 weeks apart) test-retest (TRT) reliability of graph metrics of voxel-based brain networks. METHODS: Based on graph theoretical approaches, we analyzed R-fMRI data from 53 young healthy adults who completed two scanning sessions (session 1 included two scans 20 min apart; session 2 included one scan that was performed after an interval of ~6 weeks). RESULTS: The high-resolution networks exhibited prominent small-world and modular properties and included functional hubs mainly located at the default-mode, salience, and executive control systems. Further analysis revealed that test-retest reliabilities of network metrics were sensitive to the scanning orders and intervals, with fair to excellent long-term reliability between Scan 1 and Scan 3 and lower reliability involving Scan 2. In the long-term case (Scan 1 and Scan 3), most network metrics were generally test-retest reliable, with the highest reliability in global metrics in the clustering coefficient and in the nodal metrics in nodal degree and efficiency. CONCLUSION: We showed high test-retest reliability for graph properties in the high-resolution functional connectomics, which provides important guidance for choosing reliable network metrics and analysis strategies in future studies.

Subjects

Brain/physiology , Connectome/methods , Magnetic Resonance Imaging/methods , Adult , Female , Head Movements , Humans , Male , Neural Pathways/physiology , Reproducibility of Results , Rest , Time Factors , Young Adult
