Good Truth is Hard to Find

Sun, 19/01/2020 14:00
Rothberg C-220
Dr. Yossi Farjoun, The Broad Institute of MIT and Harvard

Sequencing technologies now let us sequence RNA in millions of individual cells in a single experiment, and sequence data is becoming cheaper and more abundant. Yet the methods for finding and classifying variants are being held back by the lack of high-quality truth data. In the past few years, we have taken it upon ourselves to generate truth data for both germline and somatic variation. For germline, we made use of publicly available PacBio sequencing data of two cases of complete hydatidiform mole (CHM), which were combined informatically into a single "synthetic diploid" dataset. For somatic, we examined the recently published LinSeq dataset and used its lineage structure to validate the variants. In both cases, the result is a large, high-quality, and realistic set of variants that can be used to develop, improve, and evaluate variant-calling pipelines.