Precise regulation of gene expression is required for virtually all biological processes, such as cell and tissue development or the response to external stimuli. Misregulation of gene expression can lead to diseases such as cancer. It is therefore crucial to improve our understanding of the components underlying gene regulation.
Enhancers are genomic regions (sequences) that act as regulatory elements by cooperating with core promoter regions to recruit the transcription machinery and drive gene expression. Recently developed genome-wide experimental assays, such as STAP-seq, aim to quantify the ability of genomic fragments to respond to an enhancer and drive the transcription of a gene.
In this seminar, I will outline how statistical learning models enable us to extract biological insights from large, genome-wide assays such as STAP-seq. Because the vast majority of DNA sequences are unable to respond to enhancers and drive gene expression, we employ a zero-inflated model to account for the excess of zeros in the data set. We use convolutional neural networks (ConvNets) to automatically discover which DNA sequence motifs serve as ingredients of a responsive promoter. Our interpretable, zero-inflated, nonlinear Poisson regression model allows us to distinguish minimal core promoter properties from those that cooperate with the enhancer to modulate the level of gene expression.
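To make the zero-inflation idea concrete, the following is a minimal sketch of a textbook zero-inflated Poisson likelihood, not the model presented in the seminar. It assumes each fragment has a Poisson rate `lam` and a probability `pi` of being a structural zero (a sequence that cannot respond at all); in a neural-network setting both would plausibly be outputs of the network, which is an assumption here. The function names `zip_log_pmf` and `zip_nll` are hypothetical.

```python
import math


def zip_log_pmf(y, lam, pi):
    """Log-probability of count y under a zero-inflated Poisson.

    lam: Poisson rate (> 0)
    pi:  probability of a structural zero (0 <= pi < 1)
    """
    # Log-probability of y under the plain Poisson component.
    log_poisson = y * math.log(lam) - lam - math.lgamma(y + 1)
    if y == 0:
        # A zero is either structural (prob. pi) or a Poisson zero.
        return math.log(pi + (1.0 - pi) * math.exp(log_poisson))
    # A positive count can only come from the Poisson component.
    return math.log1p(-pi) + log_poisson


def zip_nll(counts, lams, pis):
    """Negative log-likelihood summed over observations (the training loss)."""
    return -sum(zip_log_pmf(y, l, p) for y, l, p in zip(counts, lams, pis))


# Illustrative values only: mostly zeros with a few positive counts.
counts = [0, 0, 0, 5, 0, 2]
loss = zip_nll(counts, lams=[3.0] * 6, pis=[0.6] * 6)
```

Minimizing this loss lets the model attribute most zeros to the structural component instead of forcing the Poisson rate toward zero, so the rate can describe how strongly responsive sequences are expressed.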