Search | VHL Regional Portal

1.

Complexes++: Efficient and versatile coarse-grained simulations of protein complexes and their dense solutions.

Linke, Max; Quoika, Patrick K; Bramas, Berenger; Köfinger, Jürgen; Hummer, Gerhard.

J Chem Phys ; 157(20): 204802, 2022 Nov 28.

Article in English | MEDLINE | ID: mdl-36456243

ABSTRACT

The interior of living cells is densely filled with proteins and their complexes, which perform multitudes of biological functions. We use coarse-grained simulations to reach the system sizes and time scales needed to study protein complexes and their dense solutions and to interpret experiments. To take full advantage of coarse-graining, the models have to be efficiently implemented in simulation engines that are easy to use, modify, and extend. Here, we introduce the Complexes++ simulation software to simulate a residue-level coarse-grained model for proteins and their complexes, applying a Markov chain Monte Carlo engine to sample configurations. We designed a parallelization scheme for the energy evaluation capable of simulating both dilute and dense systems efficiently. Additionally, we designed the software toolbox pycomplexes to easily set up complex topologies of multi-protein complexes and their solutions in different thermodynamic ensembles and in replica-exchange simulations, to grow flexible polypeptide structures connecting ordered protein domains, and to automatically visualize structural ensembles. Complexes++ simulations can easily be modified and they can be used for efficient explorations of different simulation systems and settings. Thus, the Complexes++ software is well suited for the integration of experimental data and for method development.

Subject(s)

Software , Computer Simulation , Markov Chains , Monte Carlo Method , Protein Domains

2.

Automated prioritizing heuristics for parallel task graph scheduling in heterogeneous computing.

Flint, Clément; Paillat, Ludovic; Bramas, Bérenger.

PeerJ Comput Sci ; 8: e969, 2022.

Article in English | MEDLINE | ID: mdl-36262161

ABSTRACT

High-performance computing (HPC) relies increasingly on heterogeneous hardware and especially on the combination of central and graphical processing units. The task-based method has demonstrated promising potential for parallelizing applications on such computing nodes. With this approach, the scheduling strategy becomes a critical layer that describes where and when the ready-tasks should be executed among the processing units. In this study, we describe a heuristic-based approach that assigns priorities to each task type. We rely on a fitness score for each task/worker combination for generating priorities and use these for configuring the Heteroprio scheduler automatically within the StarPU runtime system. We evaluate our method's theoretical performance on emulated executions and its real-case performance on multiple different HPC applications. We show that our approach is usually equivalent or faster than expert-defined priorities.

3.

A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE).

Bramas, Bérenger.

PeerJ Comput Sci ; 7: e769, 2021.

Article in English | MEDLINE | ID: mdl-34901427

ABSTRACT

The way developers implement their algorithms and how these implementations behave on modern CPUs are governed by the design and organization of these. The vectorization units (SIMD) are among the few CPUs' parts that can and must be explicitly controlled. In the HPC community, the x86 CPUs and their vectorization instruction sets were de-facto the standard for decades. Each new release of an instruction set was usually a doubling of the vector length coupled with new operations. Each generation was pushing for adapting and improving previous implementations. The release of the ARM scalable vector extension (SVE) changed things radically for several reasons. First, we expect ARM processors to equip many supercomputers in the next years. Second, SVE's interface is different in several aspects from the x86 extensions as it provides different instructions, uses a predicate to control most operations, and has a vector size that is only known at execution time. Therefore, using SVE opens new challenges on how to adapt algorithms including the ones that are already well-optimized on x86. In this paper, we port a hybrid sort based on the well-known Quicksort and Bitonic-sort algorithms. We use a Bitonic sort to process small partitions/arrays and a vectorized partitioning implementation to divide the partitions. We explain how we use the predicates and how we manage the non-static vector size. We also explain how we efficiently implement the sorting kernels. Our approach only needs an array of O(log N) for the recursive calls in the partitioning phase, both in the sequential and in the parallel case. We test the performance of our approach on a modern ARMv8.2 (A64FX) CPU and assess the different layers of our implementation by sorting/partitioning integers, double floating-point numbers, and key/value pairs of integers. Our results show that our approach is faster than the GNU C++ sort algorithm by a speedup factor of 4 on average.

4.

Improving parallel executions by increasing task granularity in task-based runtime systems using acyclic DAG clustering.

Bramas, Bérenger; Ketterlin, Alain.

PeerJ Comput Sci ; 6: e247, 2020.

Article in English | MEDLINE | ID: mdl-33816899

ABSTRACT

The task-based approach is a parallelization paradigm in which an algorithm is transformed into a direct acyclic graph of tasks: the vertices are computational elements extracted from the original algorithm and the edges are dependencies between those. During the execution, the management of the dependencies adds an overhead that can become significant when the computational cost of the tasks is low. A possibility to reduce the makespan is to aggregate the tasks to make them heavier, while having fewer of them, with the objective of mitigating the importance of the overhead. In this paper, we study an existing clustering/partitioning strategy to speed up the parallel execution of a task-based application. We provide two additional heuristics to this algorithm and perform an in-depth study on a large graph set. In addition, we propose a new model to estimate the execution duration and use it to choose the proper granularity. We show that this strategy allows speeding up a real numerical application by a factor of 7 on a multi-core system.

5.

Increasing the degree of parallelism using speculative execution in task-based runtime systems.

Bramas, Bérenger.

PeerJ Comput Sci ; 5: e183, 2019.

Article in English | MEDLINE | ID: mdl-33816836

ABSTRACT

Task-based programming models have demonstrated their efficiency in the development of scientific applications on modern high-performance platforms. They allow delegation of the management of parallelization to the runtime system (RS), which is in charge of the data coherency, the scheduling, and the assignment of the work to the computational units. However, some applications have a limited degree of parallelism such that no matter how efficient the RS implementation, they may not scale on modern multicore CPUs. In this paper, we propose using speculation to unleash the parallelism when it is uncertain if some tasks will modify data, and we formalize a new methodology to enable speculative execution in a graph of tasks. This description is partially implemented in our new C++ RS called SPETABARU, which is capable of executing tasks in advance if some others are not certain to modify the data. We study the behavior of our approach to compute Monte Carlo and replica exchange Monte Carlo simulations.

6.

Impact study of data locality on task-based applications through the Heteroprio scheduler.

Bramas, Bérenger.

PeerJ Comput Sci ; 5: e190, 2019.

Article in English | MEDLINE | ID: mdl-33816843

ABSTRACT

The task-based approach has emerged as a viable way to effectively use modern heterogeneous computing nodes. It allows the development of parallel applications with an abstraction of the hardware by delegating task distribution and load balancing to a dynamic scheduler. In this organization, the scheduler is the most critical component that solves the DAG scheduling problem in order to select the right processing unit for the computation of each task. In this work, we extend our Heteroprio scheduler that was originally created to execute the fast multipole method on multi-GPUs nodes. We improve Heteroprio by taking into account data locality during task distribution. The main principle is to use different task-lists for the different memory nodes and to investigate how locality affinity between the tasks and the different memory nodes can be evaluated without looking at the tasks' dependencies. We evaluate the benefit of our method on two linear algebra applications and a stencil code. We show that simple heuristics can provide significant performance improvement and cut by more than half the total memory transfer of an execution.

7.

Computing the sparse matrix vector product using block-based kernels without zero padding on processors with AVX-512 instructions.

Bramas, Bérenger; Kus, Pavel.

PeerJ Comput Sci ; 4: e151, 2018.

Article in English | MEDLINE | ID: mdl-33816805

ABSTRACT

The sparse matrix-vector product (SpMV) is a fundamental operation in many scientific applications from various fields. The High Performance Computing (HPC) community has therefore continuously invested a lot of effort to provide an efficient SpMV kernel on modern CPU architectures. Although it has been shown that block-based kernels help to achieve high performance, they are difficult to use in practice because of the zero padding they require. In the current paper, we propose new kernels using the AVX-512 instruction set, which makes it possible to use a blocking scheme without any zero padding in the matrix memory storage. We describe mask-based sparse matrix formats and their corresponding SpMV kernels highly optimized in assembly language. Considering that the optimal blocking size depends on the matrix, we also provide a method to predict the best kernel to be used utilizing a simple interpolation of results from previous executions. We compare the performance of our approach to that of the Intel MKL CSR kernel and the CSR5 open-source package on a set of standard benchmark matrices. We show that we can achieve significant improvements in many cases, both for sequential and for parallel executions. Finally, we provide the corresponding code in an open source library, called SPC5.

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL