EE Systems Seminar

Wednesday May 24, 2017 4:00 PM

DNA Sequencing: From Information Limits to Assembly Software

Speaker: Thomas Courtade , Department of Electrical Engineering and Computer Sciences , UC Berkeley
Location: Moore B280

Abstract: Emerging long-read sequencing technologies promise to enable near-perfect reconstruction of whole genomes. Assembly of long reads is often accomplished using a read-overlap graph, in which the vertices correspond to reads and the target sequence corresponds to a Hamiltonian path. As such, the assembly problem appears NP-hard under most formulations, and most of the known algorithmic approaches are heuristic in nature.

Surprisingly, it turns out that by focusing only on "information-feasible" instances of this problem, we can sidestep these apparent hardness issues and design efficient assembly algorithms with correctness guarantees. To show this, we begin with a basic feasibility question: when does the set of reads contain enough information to allow unambiguous reconstruction of the genome? We show that in most instances of the problem where the reads contain enough information for assembly, the read-overlap graph can be sparsified, allowing the problem to be solved in linear time. Along similar lines, we show that a greedy-like algorithm can maximize assembly quality in information-limited problem instances. These insights formed the theoretical foundation of our long-read assembly software HINGE, which has been shown to outperform existing tools through extensive validation on long-read datasets.

Bio: Thomas Courtade is an Assistant Professor in the Department of Electrical Engineering and Computer Sciences at UC Berkeley. He received his Ph.D. from UCLA in 2012 and did a postdoc at Stanford, where he was supported by the NSF Center for Science of Information. He does research on both fundamental and applied aspects of information theory, with work ranging from data compression and complexity to problems in genomics. Courtade is the recipient of several awards, including a Distinguished Ph.D. Dissertation award, a best paper award at the International Symposium on Information Theory, and a Hellman Fellowship.

Host: Ehsan Abbasi

Series: Electrical Engineering Systems Seminar Series
Contact: Katie Pichotta pichotta@caltech.edu
Department of Computing + Mathematical Sciences