Hadoop-CNV-RF: A scalable copy number variation detection tool for next-generation sequencing data

Getiria Onsongo, Ham Ching Lam, Matthew Bower, Bharat Thyagarajan

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Detection of small copy number variations (CNVs) in clinically relevant genes is routinely being used to aid diagnosis. We recently developed a tool, CNV-RF, capable of detecting clinically relevant CNVs with a high degree of sensitivity. CNV-RF implementation was designed for small gene panels and did not scale to large gene panels. Analyzing large gene panels with several hundred genes routinely failed due to memory limitations on a single computer,and, when successful, analysis took on average over 24 hours, making it impractical for routine use in the clinic. We need a reliable tool capable of accurately identifying clinically relevant CNVs on large gene panels within a more practical time frame. We have developed Hadoop-CNV-RF, a freely available, scalable, and more user-friendly implementation of CNV-RF capable of rapidly analyzing large datasets. Hadoop-CNV-RF takes advantage of Hadoop, a framework developed to analyze large amounts of data. In its implementation, we demonstrate the feasibility of developing scalable pipelines on Hadoop that integrate popular bioinformatics software developed for usage on traditional single-user computers without the need for special-purpose routines developed for Hadoop. Results show that Hadoop-CNV-RF reduces analysis time on large gene panels from over 24 hours to about 4 hours on a 20 node Hadoop cluster. Additionally, we demonstrate its ability to scale by analyzing a whole-exome dataset with close to a billion reads. Hadoop-CNV-RF has been clinically validated for large gene panels (up to 4800 genes) and is currently being used in the clinic. It is publicly available at: https://github.com/getiria-onsongo/hadoopcnvrf-public.

Original languageEnglish (US)
Title of host publicationProceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
PublisherAssociation for Computing Machinery, Inc
ISBN (Electronic)9781450379649
DOIs
StatePublished - Sep 21 2020
Event11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020 - Virtual, Online, United States
Duration: Sep 21 2020Sep 24 2020

Publication series

NameProceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020

Conference

Conference11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
Country/TerritoryUnited States
CityVirtual, Online
Period9/21/209/24/20

Bibliographical note

Publisher Copyright:
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Keywords

  • Amazon Web Services
  • Copy Number Variation
  • Hadoop
  • High-Performance Computing
  • Next-Generation Sequencing

Fingerprint

Dive into the research topics of 'Hadoop-CNV-RF: A scalable copy number variation detection tool for next-generation sequencing data'. Together they form a unique fingerprint.

Cite this