ABSTRACT
The rapid degradation of tropical forests makes it urgent to improve the efficiency of large-scale biodiversity assessment. DNA barcoding can assist greatly in this task, but the phenetic approaches commonly used for DNA-based identification rely on comprehensive reference databases, which are infeasible for hyperdiverse tropical ecosystems. Phylogenetic methods are more robust to sparse taxon sampling but are time-consuming, and multiple alignment of species-diagnostic markers, which are typically length-variable, can be problematic across divergent taxa. We advocate combining phylogenetic and phenetic methods for the taxonomic assignment of DNA-barcode sequences against incomplete reference databases such as GenBank, and we developed a pipeline to implement this approach in large-scale plant diversity projects. The pipeline workflow includes several steps: database construction and curation, query sequence clustering, sequence retrieval, distance calculation, multiple alignment, and phylogenetic inference. We describe the strategies used to establish these steps and the optimization of parameters to fit the selected psbA-trnH marker. We tested the pipeline on infertile plant samples and herbivore diet sequences from the highly threatened Nicaraguan seasonally dry forest, exploiting a valuable purpose-built resource: a partial local reference database of plant psbA-trnH sequences. The selected methodology proved efficient and reliable for high-throughput taxonomic assignment, and our results corroborate the advantage of applying 'strict' tree-based criteria to avoid false positives. The pipeline tools are distributed as the script suite 'BAGpipe' (pipeline for Biodiversity Assessment using GenBank data), which can be readily adapted to other projects and applied to sequence-based identification for any marker or taxon.