ABSTRACT
Skin is an important ecosystem that links the human body and the external environment. Previous studies have shown that the skin microbial community can remain stable even after long-term exposure to the external environment. In this study, we explore two questions. First, do skin microorganisms carry strains or genetic variants that are individual-specific, temporally stable, and body site-independent? Second, if so, can such microbial genetic variants be used as markers, called "fingerprints" in our study, to identify donors? We propose a framework to capture individual-specific microbial DNA fingerprints from skin metagenomic sequencing data. The fingerprints are identified from the frequencies of 31-mers, without relying on reference genomes or sequence alignment. We used 616 metagenomic samples from the Integrative Human Microbiome Project, collected from 17 skin sites at three time points in 12 healthy individuals. Ultimately, one contig is assembled for each individual as a fingerprint. The results show that 89.78% of the skin samples, regardless of body site, could be correctly assigned to their donors. Ten of the 12 individual-specific fingerprints could be aligned to Cutibacterium acnes. Our study shows that the identified fingerprints are temporally stable, body site-independent, and individual-specific, and can identify their donors with sufficient accuracy. The source code of the genetic identification framework is freely available at https://github.com/Ying-Lab/skin_fingerprint.
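To illustrate what an alignment-free, reference-free k-mer frequency profile looks like, the following is a minimal Python sketch of 31-mer counting over sequencing reads. It is not taken from the skin_fingerprint repository: the function names `canonical` and `kmer_frequencies`, the use of canonical (strand-independent) k-mers, and the skipping of k-mers containing ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID_BASES = set("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement.

    Assumption: strands are merged so a k-mer and its reverse complement
    count as the same feature.
    """
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_frequencies(reads, k=31):
    """Count canonical k-mers in an iterable of reads and return relative frequencies.

    No reference genome or alignment is involved: each read is simply
    scanned with a sliding window of width k.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= _VALID_BASES:
                counts[canonical(kmer)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}
```

Calling `kmer_frequencies(reads)` on the reads of one sample yields a profile that could, in principle, be compared across samples and individuals; the actual selection and assembly steps of the published framework are not reproduced here.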
ABSTRACT
Capturing group-specific sequences between two groups of genomic/metagenomic sequences is critical for the subsequent identification of single nucleotide variants (SNVs), gene families, microbial species, or other elements associated with each group. In our study, a sequence that is present or abundant in one group but absent or scarce in the other is considered a "group-specific" sequence. We developed a user-friendly tool, KmerGO, to identify group-specific sequences between two groups of genomic/metagenomic long sequences or high-throughput sequencing datasets. Compared with other tools, KmerGO captures group-specific k-mers (k up to 40 bp) with much lower computing resource requirements and in much shorter running time. For a 1.05 TB dataset (.fasta), KmerGO takes about 21.5 h (including k-mer counting) to return assembled group-specific sequences on a regular stand-alone workstation, using no more than 1 GB of memory. Furthermore, KmerGO can also be applied to capture trait-associated sequences for continuous traits. KmerGO uses multi-process parallel computing and provides both a graphical user interface and a command-line interface on Linux and Windows, without requiring any pre-installed supporting environments, packages, or complex configuration. The group-specific k-mers or sequences output by KmerGO can serve as inputs to other tools for the downstream discovery of biomarkers, such as genetic variants, species, or genes. KmerGO is available at https://github.com/ChnMasterOG/KmerGO.
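The definition of a group-specific sequence given above (present or abundant in one group, absent or scarce in the other) can be sketched in Python as a simple presence-rate filter. This is a simplified illustration and not KmerGO's actual algorithm or interface: the function name `group_specific_kmers`, the thresholds `min_present` and `max_present`, and the representation of each sample as a set of k-mers are all hypothetical.

```python
def group_specific_kmers(group_a, group_b, min_present=0.8, max_present=0.2):
    """Return (a_specific, b_specific) sets of k-mers.

    Each group is a list of samples, and each sample is a set of k-mers.
    A k-mer is called specific to a group if it occurs in at least
    `min_present` of that group's samples and in at most `max_present`
    of the other group's samples. Both thresholds are illustrative.
    """

    def presence_rate(samples):
        # Fraction of samples in which each k-mer occurs at least once.
        seen = {}
        for sample in samples:
            for kmer in sample:
                seen[kmer] = seen.get(kmer, 0) + 1
        return {kmer: n / len(samples) for kmer, n in seen.items()}

    rate_a = presence_rate(group_a)
    rate_b = presence_rate(group_b)

    a_specific = {
        kmer for kmer, rate in rate_a.items()
        if rate >= min_present and rate_b.get(kmer, 0.0) <= max_present
    }
    b_specific = {
        kmer for kmer, rate in rate_b.items()
        if rate >= min_present and rate_a.get(kmer, 0.0) <= max_present
    }
    return a_specific, b_specific
```

A presence-rate threshold is only one way to formalize "present or rich" versus "absent or scarce"; a statistical test on k-mer frequencies would be a natural alternative, and holding one set per sample in memory, as done here, would not scale to terabyte-sized inputs.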