1.
Data Brief ; 54: 110458, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38711739

ABSTRACT

This paper presents a dataset comprising 700 video sequences encoded in the two most popular video formats (codecs) of today, H.264 and H.265 (HEVC). Six reference sequences were encoded under different quality profiles, including several bitrates and resolutions, and were affected by various packet loss rates. Subsequently, the image quality of the encoded video sequences was assessed by both subjective and objective evaluation. The enclosed spreadsheet therefore contains the results of both assessment approaches in the form of MOS (Mean Opinion Score) delivered by the absolute category rating (ACR) procedure, SSIM (Structural Similarity Index Measure) and VMAF (Video Multimethod Assessment Fusion). All assessments are available for each test sequence. This allows a comprehensive evaluation of coding efficiency under different test scenarios without the necessity of real observers or a secure laboratory environment, as recommended by the ITU (International Telecommunication Union). As there is currently no standardized mapping function between the results of subjective and objective methods, this dataset can also be used to design and verify experimental machine learning algorithms that contribute to solving the relevant research issues.

2.
Int J Biol Macromol ; 266(Pt 2): 130984, 2024 May.
Article in English | MEDLINE | ID: mdl-38513910

ABSTRACT

Genome sequence analysis and classification play critical roles in properly understanding an organism's main characteristics, functionalities, and changing (evolving) nature. However, the rapid expansion of genomic data makes genome sequence analysis and classification a challenging task, owing to the high computational requirements and the need for proper management and understanding of genomic data. Recently proposed models yielded promising results for the task of genome sequence classification. Nevertheless, these models often ignore the sequential nature of nucleotides, which is crucial for revealing their underlying structure and function. To address this limitation, we present SPM4GAC, a sequential pattern mining (SPM)-based framework to analyze and classify the macromolecule genome sequences of viruses. First, a large dataset containing the genome sequences of various RNA viruses is developed and transformed into a suitable format. On the transformed dataset, algorithms for SPM are used to identify frequent sequential patterns of nucleotide bases. The obtained frequent sequential patterns of bases are then used as features to classify different viruses. Ten classifiers are employed, and their performance is assessed by using several evaluation measures. Finally, a performance comparison of SPM4GAC with state-of-the-art methods for genome sequence classification/detection reveals that SPM4GAC performs better than those methods.
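
The idea of using frequent patterns of bases as classification features can be illustrated with a minimal sketch. It is not the SPM4GAC implementation: it assumes, for simplicity, that a pattern is a contiguous run of k bases (real SPM algorithms may also allow gaps), and the names frequent_kmers, to_features, k and min_support are hypothetical.

```python
from collections import Counter
from typing import List


def frequent_kmers(sequences: List[str], k: int, min_support: float) -> List[str]:
    """Return the k-base patterns found in at least min_support of the sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        # Use a set so that each pattern is counted once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    threshold = min_support * len(sequences)
    return sorted(p for p, c in counts.items() if c >= threshold)


def to_features(seq: str, patterns: List[str]) -> List[int]:
    """Binary feature vector: 1 if the pattern occurs in the sequence, else 0."""
    return [1 if p in seq else 0 for p in patterns]


# Toy sequences, invented for illustration only.
genomes = ["ACGTAC", "ACGGAC", "TTGTAC"]
patterns = frequent_kmers(genomes, k=3, min_support=0.5)
vectors = [to_features(g, patterns) for g in genomes]
```

The resulting vectors could then be passed to any standard classifier; which classifiers and evaluation measures the authors used is not stated in the abstract beyond their number.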


Subject(s)
Algorithms , Genome, Viral , Genomics/methods , Computational Biology/methods , Macromolecular Substances/chemistry , Data Mining , RNA Viruses/genetics , RNA Viruses/classification
4.
Comput Biol Med ; 158: 106814, 2023 May.
Article in English | MEDLINE | ID: mdl-36989742

ABSTRACT

This paper presents a novel framework, called PSAC-PDB, for analyzing and classifying protein structures from the Protein Data Bank (PDB). PSAC-PDB first finds, analyzes, and identifies protein structures in PDB that are similar to a protein structure of interest using a protein structure comparison tool. Second, the amino acid (AA) sequences of the identified protein structures (obtained from PDB), their aligned amino acids (AAA) and aligned secondary structure elements (ASSE) (obtained by structural alignment), and frequent AA (FAA) patterns (discovered by sequential pattern mining) are used for the reliable detection/classification of protein structures. Eleven classifiers are used and their performance is compared using six evaluation metrics. Results show that three classifiers perform well overall, and that FAA patterns can be used to efficiently classify protein structures in place of providing the whole AA sequences, AAA or ASSE. Furthermore, better classification results are obtained using the AAA of protein structures rather than AA sequences. PSAC-PDB also performed better than state-of-the-art approaches for SARS-CoV-2 genome sequence classification.


Subject(s)
COVID-19 , Humans , SARS-CoV-2 , Protein Structure, Secondary , Amino Acids , Databases, Protein , Protein Conformation
5.
Knowl Inf Syst ; 65(5): 2017-2042, 2023.
Article in English | MEDLINE | ID: mdl-36683607

ABSTRACT

An obvious defect of the extreme learning machine (ELM) is that its prediction performance is sensitive to the random initialization of input-layer weights and hidden-layer biases. To make ELM insensitive to random initialization, GPRELM adopts the simple and effective strategy of integrating Gaussian process regression into ELM. However, there is a serious overfitting problem in kernel-based GPRELM (kGPRELM). In this paper, we investigate the theoretical reasons for the overfitting of kGPRELM and further propose a correlation-based GPRELM (cGPRELM), which uses a correlation coefficient to measure the similarity between two different hidden-layer output vectors. cGPRELM reduces the likelihood that the covariance matrix becomes an identity matrix when the number of hidden-layer nodes is increased, effectively controlling overfitting. Furthermore, cGPRELM works well for improper initialization intervals where ELM and kGPRELM fail to provide good predictions. The experimental results on real classification and regression data sets demonstrate the feasibility and superiority of cGPRELM, as it not only achieves better generalization performance but also has a lower computational complexity.

6.
ISA Trans ; 131: 460-475, 2022 Dec.
Article in English | MEDLINE | ID: mdl-35636986

ABSTRACT

High occupancy pattern mining has been recently studied as an improved method for frequent pattern mining. It considers the proportion of each pattern in the transactions where the pattern occurred. The results of high occupancy pattern mining can be employed for automated control systems in order to make decisions. Meanwhile, the features of the databases have changed, because information technology has advanced. In real-world databases, new transactions are inserted in real time. However, the state-of-the-art approach to high occupancy pattern mining cannot handle incremental databases. Moreover, the existing method also requires a large amount of memory space, because it adopted a BFS-based search in order to find patterns. In this paper, we propose an approach, which is called HOMI (High Occupancy pattern Mining on Incremental databases), that uses a DFS-based search in order to detect patterns, and it mines high occupancy patterns on incremental databases. The performance analysis for both real and synthetic datasets indicates that HOMI has better performance than the state-of-the-art approaches and related algorithms.
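
The occupancy measure described above can be sketched in a few lines. This is only an illustration of the measure, not the HOMI algorithm: it assumes occupancy is the arithmetic mean, over the transactions containing the pattern, of the share of each transaction's items that the pattern covers, and the function name and toy data are hypothetical.

```python
from typing import FrozenSet, List


def occupancy(pattern: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Mean share of each supporting transaction that the pattern covers."""
    # Supporting transactions are those that contain every item of the pattern.
    supporting = [t for t in transactions if pattern <= t]
    if not supporting:
        return 0.0
    return sum(len(pattern) / len(t) for t in supporting) / len(supporting)


# Toy database, invented for illustration only.
transactions = [frozenset("ab"), frozenset("abcd"), frozenset("cd")]
value = occupancy(frozenset("ab"), transactions)  # (2/2 + 2/4) / 2 = 0.75
```

A frequent pattern can still score low here when it appears only inside much larger transactions, which is the gap that occupancy-based mining is meant to close.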


Subject(s)
Data Mining , Pattern Recognition, Automated , Pattern Recognition, Automated/methods , Data Mining/methods , Algorithms , Databases, Factual
7.
Appl Intell (Dordr) ; 52(14): 16458-16474, 2022.
Article in English | MEDLINE | ID: mdl-35340983

ABSTRACT

Online learning is playing an increasingly important role in education. Massive open online course (MOOC) platforms are among the most important tools in online learning, and record historical learning data from an extremely large number of learners. To enhance the learning experience, a promising approach is to apply sequential pattern mining (SPM) to discover useful knowledge in these data. Following that approach, this paper proposes mining sequential patterns (SPs) with flexible constraints in MOOC enrollment data. Three constraints are proposed: the length constraint, discreteness constraint, and validity constraint. They are used to describe the effect of the length of enrollment sequences, variance of enrollment dates, and enrollment moments, respectively. To improve the mining efficiency, the three constraints are pushed into the support, which is the most typical parameter in SPM, to form a new parameter called support with flexible constraints (SFC). SFC is proved to satisfy the downward closure property, and two algorithms are proposed to discover SPs with flexible constraints. They traverse the search space in a breadth-first and depth-first manner, respectively. The experimental results demonstrate that the proposed algorithms effectively reduce the number of patterns, with comparable performance to classical SPM algorithms.

8.
Appl Intell (Dordr) ; 51(5): 3086-3103, 2021.
Article in English | MEDLINE | ID: mdl-34764587

ABSTRACT

The genome of the novel coronavirus that causes COVID-19 was first sequenced in January 2020, approximately a month after its emergence in Wuhan, capital of Hubei province, China. COVID-19 genome sequencing is critical to understanding the virus's behavior, its origin, and how fast it mutates, and to developing drugs/vaccines and effective preventive strategies. This paper investigates the use of artificial intelligence techniques to learn interesting information from COVID-19 genome sequences. First, sequential pattern mining (SPM) is applied to a computer-understandable corpus of COVID-19 genome sequences to see whether interesting hidden patterns can be found that reveal frequent patterns of nucleotide bases and their relationships with each other. Second, sequence prediction models are applied to the corpus to evaluate whether nucleotide base(s) can be predicted from previous ones. Third, for mutation analysis in genome sequences, an algorithm is designed to find the locations in the genome sequences where the nucleotide bases are changed and to calculate the mutation rate. The results suggest that SPM and mutation analysis techniques can reveal interesting information and patterns in COVID-19 genome sequences to examine the evolution and variations in COVID-19 strains, respectively.
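
The mutation analysis step can be sketched as follows. This is not the authors' algorithm: it assumes the two sequences are already aligned to the same length and that only point substitutions matter, and it takes the mutation rate to be the fraction of positions that differ. The function name and the toy sequences are hypothetical.

```python
from typing import List, Tuple


def point_mutations(reference: str, sample: str) -> Tuple[List[int], float]:
    """Return the 0-based positions where the bases differ, and the mutation rate."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    if not reference:
        return [], 0.0
    positions = [i for i, (r, s) in enumerate(zip(reference, sample)) if r != s]
    # Assumed definition: rate = number of changed positions / sequence length.
    return positions, len(positions) / len(reference)


# Toy sequences, invented for illustration only.
positions, rate = point_mutations("ACGTACGT", "ACGAACGG")
```

Insertions and deletions would require an alignment step first, which this sketch deliberately leaves out.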

9.
IEEE Trans Cybern ; 51(2): 487-500, 2021 Feb.
Article in English | MEDLINE | ID: mdl-32142464

ABSTRACT

High-utility sequential pattern (HUSP) mining is an emerging topic in the field of knowledge discovery in databases. It consists of discovering subsequences that have a high utility (importance) in sequences, which can be referred to as HUSPs. HUSPs can be applied to many real-life applications, such as market basket analysis, e-commerce recommendations, click-stream analysis, and route planning. Several algorithms have been proposed to efficiently mine utility-based useful sequential patterns. However, due to the combinatorial explosion of the search space for low utility threshold and large-scale data, the performances of these algorithms are unsatisfactory in terms of runtime and memory usage. Hence, this article proposes an efficient algorithm for the task of HUSP mining, called HUSP mining with UL-list (HUSP-ULL). It utilizes a lexicographic q-sequence (LQS)-tree and a utility-linked (UL)-list structure to quickly discover HUSPs. Furthermore, two pruning strategies are introduced in HUSP-ULL to obtain tight upper bounds on the utility of the candidate sequences and reduce the search space by pruning unpromising candidates early. Substantial experiments on both real-life and synthetic datasets showed that HUSP-ULL can effectively and efficiently discover the complete set of HUSPs and that it outperforms the state-of-the-art algorithms.

10.
IEEE Trans Cybern ; 50(3): 1195-1208, 2020 Mar.
Article in English | MEDLINE | ID: mdl-30794524

ABSTRACT

Mining useful patterns from varied types of databases is an important research topic, which has many real-life applications. Most studies have considered frequency as the sole interestingness measure to identify high-quality patterns. However, each object is different in nature. The relative importance of objects is not equal in terms of criteria such as utility, risk, or interest. Another limitation of frequent patterns is that they generally have a low occupancy, that is, they often represent small sets of items in transactions containing many items and, thus, may not be truly representative of these transactions. To extract high-quality patterns in real-life applications, this paper extends the occupancy measure to also assess the utility of patterns in transaction databases. We propose an efficient algorithm named high-utility occupancy pattern mining (HUOPM). It considers user preferences in terms of frequency, utility, and occupancy. A novel frequency-utility tree and two compact data structures, called the utility-occupancy list and frequency-utility table, are designed to provide global and partial downward closure properties for pruning the search space. The proposed method can efficiently discover the complete set of high-quality patterns without candidate generation. Extensive experiments have been conducted on several datasets to evaluate the effectiveness and efficiency of the proposed algorithm. Results show that the derived patterns are intelligible, reasonable, and acceptable, and that HUOPM with its pruning strategies outperforms the state-of-the-art algorithm in terms of runtime and search space.

11.
PLoS One ; 12(7): e0180931, 2017.
Article in English | MEDLINE | ID: mdl-28742847

ABSTRACT

High-utility sequential pattern mining (HUSPM) has become an important issue in the field of data mining. Several HUSPM algorithms have been designed to mine high-utility sequential patterns (HUSPs). They have been applied in several real-life situations, such as consumer behavior analysis and event detection in sensor networks. Nonetheless, most studies on HUSPM have focused on mining HUSPs in precise data. In real life, however, uncertainty is an important factor, as data is collected using various types of sensors that are more or less accurate. Hence, data collected in a real-life database can be annotated with existing probabilities. This paper presents a novel pattern mining framework called high utility-probability sequential pattern mining (HUPSPM) for mining high utility-probability sequential patterns (HUPSPs) in uncertain sequence databases. A baseline algorithm with three optional pruning strategies is presented to mine HUPSPs. Moreover, to speed up the mining process, a projection mechanism is designed to create a database projection for each processed sequence, which is smaller than the original database. Thus, the number of unpromising candidates can be greatly reduced, as well as the execution time for mining HUPSPs. Substantial experiments on both real-life and synthetic datasets show that the designed algorithm performs well in terms of runtime, number of candidates, memory usage, and scalability for different minimum utility and minimum probability thresholds.


Subject(s)
Data Mining/methods , Pattern Recognition, Automated/methods , Algorithms , Databases, Factual , Knowledge Bases , Probability , Uncertainty