Search | Portal Regional da BVS

Fiber Clustering Acceleration With a Modified Kmeans++ Algorithm Using Data Parallelism.

Goicovich, Isaac; Olivares, Paulo; Román, Claudio; Vázquez, Andrea; Poupon, Cyril; Mangin, Jean-François; Guevara, Pamela; Hernández, Cecilia.

Front Neuroinform ; 15: 727859, 2021.

Article in English | MEDLINE | ID: mdl-34539370

ABSTRACT

Fiber clustering methods are typically used in brain research to study the organization of white matter bundles from large diffusion MRI tractography datasets. These methods enable exploratory bundle inspection using visualization and other methods that require identifying brain white matter structures in individuals or a population. Some applications, such as real-time visualization and inter-subject clustering, need fast and high-quality intra-subject clustering algorithms. This work proposes a parallel algorithm using a General Purpose Graphics Processing Unit (GPGPU) for fiber clustering based on the FFClust algorithm. The proposed GPGPU implementation exploits data parallelism using both multicore and GPU fine-grained parallelism present in commodity architectures, including current laptops and desktop computers. Our approach implements all FFClust steps in parallel, improving execution times in all of them. In addition, our parallel approach includes a parallel Kmeans++ algorithm implementation and defines a new variant of Kmeans++ to reduce the impact of choosing outliers as initial centroids. The results show that our approach provides clustering quality results very similar to FFClust, and it requires an execution time of 3.5 s for processing about a million fibers, achieving a speedup of 11.5 times compared to FFClust.
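For readers unfamiliar with the seeding step mentioned in this abstract, the following is a minimal sketch of standard Kmeans++ initialization in plain Python. It is not the authors' modified variant or their GPU implementation, whose details the abstract does not give. The function names `squared_distance` and `kmeanspp_seeds` are hypothetical, and points are assumed to be tuples of numbers.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, seed=0):
    """Pick k initial centroids with standard Kmeans++ seeding.

    The first centroid is drawn uniformly at random. Each later centroid is
    drawn with probability proportional to the squared distance from a point
    to its nearest already-chosen centroid.
    """
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest centroid.
    d2 = [squared_distance(p, centroids[0]) for p in points]
    while len(centroids) < k:
        if sum(d2) == 0:
            # Every point coincides with a centroid; fall back to uniform.
            nxt = rng.choice(points)
        else:
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centroids.append(nxt)
        # Only the new centroid can shrink a point's nearest distance.
        d2 = [min(d, squared_distance(p, nxt)) for d, p in zip(d2, points)]
    return centroids


points = [(0, 0), (0, 1), (1, 0), (50, 50), (50, 51), (100, 0), (101, 0)]
seeds = kmeanspp_seeds(points, k=3, seed=1)
```

Because the draw is weighted by squared distance, a single far-away outlier has a high chance of being picked as a centroid, which is the weakness the abstract says the new variant aims to reduce.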

GPU algorithms for density matrix methods on MOPAC: linear scaling electronic structure calculations for large molecular systems.

Maia, Julio Daniel Carvalho; Dos Anjos Formiga Cabral, Lucidio; Rocha, Gerd Bruno.

J Mol Model ; 26(11): 313, 2020 Oct 22.

Article in English | MEDLINE | ID: mdl-33090341

ABSTRACT

Density matrix purification methods should be employed when dealing with complex chemical systems containing many atoms. The running times of these methods scale linearly with the number of atoms when the sparsity of the density matrix is exploited. Since the efficiency expected from these methods is closely tied to the underlying parallel implementations of the linear algebra operations (e.g., P² = P × P), we proposed a central processing unit (CPU) and graphics processing unit (GPU) parallel matrix-matrix multiplication in SVBR (symmetrical variable block row) format for energy calculations through the SP2 algorithm. This algorithm was inserted into MOPAC's MOZYME method, using the original LMO Fock matrix assembly and the atomic integral calculation implemented in it. Correctness and performance tests show that the implemented SP2 is accurate and fast: the GPU achieves speedups of up to 40 times for a water cluster system with 42,312 orbitals running on one NVIDIA K40 GPU card, compared to the single-threaded version. The GPU-accelerated SP2 algorithm using the MOZYME LMO framework enables the calculation of semiempirical wavefunctions with stricter SCF criteria for localized charged molecular systems, as well as the single-point energies of molecules with more than 100,000 LMO orbitals in less than 1 h.

Graphical abstract: Parallel CPU and GPU purification algorithms for electronic structure calculations were implemented in MOPAC's MOZYME method. Some matrices in these calculations, e.g., the electron density P, are compressed, and the developed linear algebra operations deal with non-zero entries only. We employed the NVIDIA/CUDA platform to develop GPU algorithms, and accelerations of up to 40 times for larger systems were achieved.
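To make the role of the P × P product concrete, here is a minimal sketch of a textbook-style SP2 (second-order spectral projection) purification loop on small dense matrices in plain Python. It is an illustration under stated assumptions, not the MOPAC/MOZYME code: it uses dense lists instead of the SVBR format, estimates spectral bounds with Gershgorin circles (an assumed choice), and counts one electron per occupied orbital. All function names are hypothetical.

```python
def matmul(a, b):
    """Dense product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def gershgorin_bounds(h):
    """Lower and upper bounds on the eigenvalues of a symmetric matrix."""
    n = len(h)
    radii = [sum(abs(h[i][j]) for j in range(n) if j != i) for i in range(n)]
    return (min(h[i][i] - radii[i] for i in range(n)),
            max(h[i][i] + radii[i] for i in range(n)))


def sp2_density_matrix(h, n_occ, tol=1e-10, max_iter=100):
    """Purify toward the projector onto the n_occ lowest eigenvectors of h."""
    n = len(h)
    e_min, e_max = gershgorin_bounds(h)
    # Initial guess: map the spectrum of h into [0, 1], lowest energies near 1.
    x = [[((e_max if i == j else 0.0) - h[i][j]) / (e_max - e_min)
          for j in range(n)] for i in range(n)]
    for _ in range(max_iter):
        x2 = matmul(x, x)  # the P x P product that dominates the cost
        tr_x, tr_x2 = trace(x), trace(x2)
        if abs(tr_x - tr_x2) < tol:  # idempotency error tr(X - X^2)
            break
        # Keep whichever update brings the trace closer to n_occ.
        if abs(tr_x2 - n_occ) < abs(2 * tr_x - tr_x2 - n_occ):
            x = x2
        else:
            x = [[2 * x[i][j] - x2[i][j] for j in range(n)] for i in range(n)]
    return x


h = [[-1.0, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 1.5]]
p = sp2_density_matrix(h, n_occ=1)
```

Each iteration needs exactly one matrix-matrix product, which is why the abstract ties the overall efficiency to that operation.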

Application Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs: A Case Study with Microscopy Image Analysis.

Teodoro, George; Kurc, Tahsin; Andrade, Guilherme; Kong, Jun; Ferreira, Renato; Saltz, Joel.

Int J High Perform Comput Appl ; 31(1): 32-51, 2017 Jan.

Article in English | MEDLINE | ID: mdl-28239253

ABSTRACT

We carry out a comparative performance study of multi-core CPUs, GPUs and Intel Xeon Phi (Many Integrated Core, MIC) with a microscopy image analysis application. We experimentally evaluate the performance of these computing devices on core operations of the application. We correlate the observed performance with the characteristics of the computing devices and with the data access patterns, computation complexities, and parallelization forms of the operations. The results show significant variability in the performance of operations with respect to the device used. The performance of operations with regular data access is comparable or sometimes better on a MIC than on a GPU. GPUs are more efficient than MICs for operations that access data irregularly, because of the lower bandwidth of the MIC for random data accesses. We propose new performance-aware scheduling strategies that consider variabilities in operation speedups. Our scheduling strategies significantly improve application performance compared to classic strategies in hybrid configurations.
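The abstract does not spell out the scheduling strategies, so the following is only a toy sketch of one way a speedup-aware policy could differ from first-come, first-served (FCFS) on a machine with one CPU core and one GPU. The policy shown (the GPU takes the pending operation with the highest estimated speedup, the CPU the lowest), the task tuples, and all names are assumptions made for illustration.

```python
def simulate(tasks, pick_for):
    """Return the makespan of running tasks on one CPU and one GPU.

    tasks: list of (name, cpu_time, gpu_speedup) tuples with unique names.
    pick_for: function (device, pending) -> task chosen for that device.
    """
    pending = list(tasks)
    free_at = {"cpu": 0.0, "gpu": 0.0}
    while pending:
        # The device that becomes idle first asks for work (ties go to cpu).
        device = min(free_at, key=free_at.get)
        task = pick_for(device, pending)
        pending.remove(task)
        _, cpu_time, speedup = task
        free_at[device] += cpu_time if device == "cpu" else cpu_time / speedup
    return max(free_at.values())


def fcfs(device, pending):
    """Classic policy: hand out operations in arrival order."""
    return pending[0]


def speedup_aware(device, pending):
    """GPU takes the highest estimated speedup, CPU takes the lowest."""
    by_speedup = sorted(pending, key=lambda t: t[2])
    return by_speedup[-1] if device == "gpu" else by_speedup[0]


# Alternating GPU-friendly (20x) and GPU-neutral (1x) operations.
tasks = [("h1", 10.0, 20.0), ("l1", 10.0, 1.0),
         ("h2", 10.0, 20.0), ("l2", 10.0, 1.0),
         ("h3", 10.0, 20.0), ("l3", 10.0, 1.0),
         ("h4", 10.0, 20.0), ("l4", 10.0, 1.0)]
makespan_fcfs = simulate(tasks, fcfs)
makespan_aware = simulate(tasks, speedup_aware)
```

In this toy workload FCFS keeps sending GPU-friendly operations to the CPU, so the speedup-aware policy finishes sooner. Real strategies would also have to handle inaccurate speedup estimates and data transfer costs.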

Region Templates: Data Representation and Management for High-Throughput Image Analysis.

Teodoro, George; Pan, Tony; Kurc, Tahsin; Kong, Jun; Cooper, Lee; Klasky, Scott; Saltz, Joel.

Parallel Comput ; 40(10): 589-610, 2014 Dec 01.

Article in English | MEDLINE | ID: mdl-26139953

ABSTRACT

We introduce a region template abstraction and framework for the efficient storage, management and processing of common data types in the analysis of large datasets of high-resolution images on clusters of hybrid computing nodes. The region template abstraction provides a generic container template for common data structures, such as points, arrays, regions, and object sets, within a spatial and temporal bounding box. It allows for different data management strategies and I/O implementations, while providing a homogeneous, unified interface to applications for data storage and retrieval. A region template application is represented as a hierarchical dataflow in which each computing stage may be represented as another dataflow of finer-grain tasks. The execution of the application is coordinated by a runtime system that implements optimizations for hybrid machines, including performance-aware scheduling for maximizing the utilization of computing devices and techniques to reduce the impact of data transfers between CPUs and GPUs. An experimental evaluation on a state-of-the-art hybrid cluster using a microscopy imaging application shows that the abstraction adds negligible overhead (about 3%) and achieves good scalability and high data transfer rates. Optimizations in a high-speed, disk-based storage implementation of the abstraction to support asynchronous data transfers and computation result in an application performance gain of about 1.13×. Finally, a processing rate of 11,730 4K×4K tiles per minute was achieved for the microscopy imaging application on a cluster with 100 nodes (300 GPUs and 1,200 CPU cores). This computation rate enables studies with very large datasets.
