C2D3 ECR and student conference
C2D3 seeks to create an interdisciplinary data science and AI community for Early Career Researchers (ECRs) and students, as a place for supporting researchers and their ideas, sharing solutions and networking.
This half-day conference will provide a forum to exchange ideas, discuss research problems and solutions, and make new connections. During the conference we will hear presentations from the C2D3 ECR Seed Fund Awardees (2022 and 2023) and lightning talks from the ECR and student community.
Registration
This event is specifically aimed at University of Cambridge ECRs and students, and we warmly invite you all to participate. We also welcome all other members of the University to hear about the amazing research being conducted by the ECRs and students.
(Registrations will remain open, but we can no longer accommodate any specific dietary requirements other than vegetarian, as final catering requirements have been submitted.)
Programme
09.00-09.20 Registration
09.20-09.30 Opening with Matt Castle (Cambridge Centre for Data-Driven Discovery / Genetics)
Session 1: Showcase: ECR Seed Fund 2022 Awardee talks Chair: Cheng-Yu Huang (Physiology/Neuroscience)
- Edward Harding Forming an expert network for AI-informed drug repurposing in neurodegenerative disease (Institute of Metabolic Science, Department of Clinical Biochemistry)
- Angelica Aviles-Rivero TripletMIS: AI for Triplet Recognition in Minimally Invasive Surgery (Department of Applied Mathematics & Theoretical Physics)
- Lin Wang Using deep learning techniques to optimize genetics studies of infectious diseases (Department of Genetics)
Session 2: Showcase: ECR Seed Fund 2023 Awardee talks Chair: Edward Harding (Biochemistry)
- Henry Moss A Bayesian approach to equation discovery: improving reliability, robustness and data efficiency (Institute of Computing for Climate Science, Department of Applied Mathematics & Theoretical Physics) (no longer available to speak)
- Simon Carrignon Beyond Projections: An Interdisciplinary Investigation of Socio-Economic Processes in Sustainable Food System Adoption using Agent Based Models of Cultural Evolution (McDonald Institute for Archaeological Research)
- Mia Tackney Using Digital Technologies in Clinical Trials for Pulmonary Hypertension (MRC Biostatistics Unit)
Q&A
- Chris Edsall Resources for ML and Data Science from Research Computing Services (Research Computing Services)
11.10-11.40 Tea/Coffee Break
Session 3: Lightning Talks "I have a solution to share" Chair: Edward Harding (Biochemistry)
- Zihao Fu Large Language Model for Sciences: Exploring the Potential of Knowledge Injection for Enhanced Research in Various Domains (Language Technology Lab, LTL)
- Alexandros Kontogiannis Learning Bayesian digital twins of physical processes from data (Engineering)
- Jiachen Cai Dynamic Factor Analysis with Dependent Gaussian Processes for High-Dimensional Gene Expression Trajectories (MRC Biostatistics Unit)
"I have a problem to solve"
- Tom Allen Personalising Infant Cognitive Assessments with Machine Learning (Department of Applied Mathematics & Theoretical Physics)
- Wallace Peaslee Art, Cultural Heritage, and Mathematics: From Image Processing to Investigating Punch Marks (Department of Applied Mathematics & Theoretical Physics)
- Jack Hughes Threading of conversations on chat discussions (Computer Science and Technology)
- Sida Chen Deep generative multistate models for longitudinal multimorbidity analysis (MRC Biostatistics Unit)
Q&A: Problem solving discussions
13.00-14.00 Light lunch, breakout groups & networking
Lightning Talk Abstracts - confirmed
Tom Allen Personalising Infant Cognitive Assessments with Machine Learning (DAMTP)
Abstract: Cognitive assessments in infancy have focussed predominantly on the differences between age groups, without considering individual developmental trajectories. Personalising feedback from cognitive tests is a difficult problem but would help parents better understand what infants need to progress their learning. Modern computer vision algorithms unlock a wealth of information from video data which could be applied to automatically individualise feedback. Pose, emotion and gaze detection could all be used to gain a deeper understanding of why individuals perform well or poorly on cognitive tasks. In this lightning talk, I will discuss how our lab is currently utilising machine learning algorithms in scalable, automated analysis and how these algorithms could solve the problem of individualising developmental assessments.
Jiachen Cai Dynamic Factor Analysis with Dependent Gaussian Processes for High-Dimensional Gene Expression Trajectories (MRC Biostatistics Unit)
Abstract: The increasing availability of high-dimensional, longitudinal measures of genetic expression can facilitate analysis of the biological mechanisms of disease and prediction of future trajectories, as required for precision medicine. Biological knowledge suggests that it may be best to describe complex diseases at the level of underlying pathways, which may interact with one another. We propose a Bayesian approach that allows for characterising such correlation among different pathways through Dependent Gaussian Processes (DGP) and mapping the observed high-dimensional gene expression trajectories into unobserved low-dimensional pathway expression trajectories via Bayesian Sparse Factor Analysis. Compared to previous approaches that model each pathway expression trajectory independently, our model demonstrates better performance in recovering the shape of pathway expression trajectories, revealing the relationships between genes and pathways, and predicting gene expressions (closer point estimates and narrower predictive intervals), as demonstrated in the simulation study and real data analysis. To fit the model, we propose a Monte Carlo Expectation Maximization (MCEM) scheme that can be implemented conveniently by combining a standard Markov Chain Monte Carlo sampler and an R package GPFDA, which returns the maximum likelihood estimates of DGP parameters. The modular structure of MCEM makes it generalizable to other complex models involving the DGP model component. An R package has been developed that implements the proposed approach.
Sida Chen Deep generative multistate models for longitudinal multimorbidity analysis (MRC Biostatistics Unit)
Abstract: Multimorbidity, defined as the presence of two or more chronic conditions in an individual, is a growing concern in aging societies. A better understanding of how these conditions develop and progress over time is crucial for more effective management and treatment. Multistate models, which generalize traditional survival models to encompass more than two states, provide a promising probabilistic approach for modelling disease progression. This process can be conceptualized as an individual transitioning through various disease states. However, current applications of such models often rely on strong modelling and data-generating assumptions, which can be unrealistic in many real data scenarios, thus negatively affecting model performance and reliability. While there are a few recent attempts to integrate deep learning techniques into multistate frameworks to relax some of these assumptions, complications in data scenarios, such as interval-censoring, are still overlooked. In addition, these approaches focus on estimating specific multistate quantities and are not generative in nature; for example, they cannot simulate trajectories of disease progression. We aim to leverage modern deep learning techniques and computational advancements to develop a scalable, flexible, and generative multistate modelling framework addressing the aforementioned limitations. The focus will be on predictive inferences such as the generation and analysis of longitudinal multimorbidity trajectories at both individual and population levels. The developed approach could be applied to large electronic health record data, such as the Clinical Practice Research Datalink, to answer a wealth of interesting clinical questions. A by-product of the approach is the ability to generate pseudo-realistic data to form new benchmarks for testing and validating new prediction models and methods for such data, which are lacking at the moment.
Zihao Fu Large Language Model for Sciences: Exploring the Potential of Knowledge Injection for Enhanced Research in Various Domains (Language Technology Lab (LTL), Dept. of Theoretical & Applied Linguistics)
Abstract: Large Language Models (LLMs) have transformed multiple fields, offering unprecedented capabilities in understanding and generating human-like text. However, their application in scientific research is often constrained by their knowledge cutoff, leaving them unaware of the latest advancements. This lightning talk introduces a conceptual framework for enhancing the utility of LLMs in scientific research through a process we term 'knowledge injection'. We propose a mechanism to integrate up-to-date, domain-specific knowledge from scientific literature, databases, and ontologies into these models. We explore potential applications of this approach across diverse scientific fields such as biomedicine and computer science. By enabling LLMs to comprehend and generate responses based on the most recent scientific knowledge, we envision a future where they can become powerful tools for driving innovation and discovery in science. This talk aims to spark discourse on the possibilities of this approach, encouraging further exploration and development in the field.
Jack Hughes Threading of conversations on chat discussions (Computer Science and Technology)
Abstract: Cybercrime discussions are moving from forums to discussion channels such as Discord and Telegram. Previously, measurements of forums were simplified by utilising the structure of forums (e.g. boards, threads) to provide coherent topics for analysis. However, discussion channels typically lack structure, introducing additional challenges of detangling multiple conversations happening in parallel and identifying when a conversation topic changes. We are looking into developing techniques for converting long ongoing series of short texts into discrete chunks of conversations on topics.
Alexandros Kontogiannis Learning Bayesian digital twins of physical processes from data (Engineering)
Abstract: We formulate a Bayesian digital twin approach to the reconstruction of physical processes in silico from noisy and possibly sparse experimental data. The method learns the most probable physical model that fits the data by solving a Bayesian inverse problem for the unknown model parameters. The learned digital twin reconstructs the experimental data and at the same time infers hidden quantities that cannot be measured otherwise but arise from the physical model. Unlike neural networks, the learned model has significant explanatory power which allows it to extrapolate to different conditions for which data are not available. We have created an algorithm that solves this inverse problem and have successfully applied it to the reconstruction of noisy and sparse magnetic resonance velocimetry data of porous media flows and the flow through a 3D-printed physical model of an aortic arch. In the future we aim to apply this methodology to in vivo cardiovascular flows and other physical problems for which qualitative models exist, such as cancer tumour growth modelling and infectious disease modelling.
Wallace Peaslee Art, Cultural Heritage, and Mathematics: From Image Processing to Investigating Punch Marks (DAMTP)
Abstract: New imaging technologies, from high-resolution visible-light photography to X-ray fluorescence, have become useful tools for investigating old-master paintings and other cultural heritage objects. These imaging modalities can, for example, reveal where certain pigments have been used in a painting or capture fine brushstroke details. Data science and artificial intelligence have proven useful for supporting cultural heritage investigations, whether by inpainting damage or graffiti in artworks, classifying pottery, separating a painting’s X-ray into surface and concealed components, or digitally rejuvenating old artworks; however, many problems remain unsolved. Punch marks, small imprints made in gold foil present in many medieval artworks, serve as a case study of artificial intelligence and data science in cultural heritage and highlight the possibility of interdisciplinary collaboration with art historians and conservators. One task involves correctly classifying a punch mark, usually via indexing systems from decades-long studies by art historians.
Convolutional neural networks can classify punch marks with high accuracy when given training images from the same painting as the given punch mark. However, many challenges remain as the classification problem is broadened and made more complex with more varied imaging data, different data modalities, and especially an increased number of punch classes and paintings to consider.
Research Computing Services
Chris Edsall Resources for ML and Data Science from Research Computing Services (Research Computing Services)
Research Software Engineering is an approach to making the software used in research more sustainable and FAIR (Findable, Accessible, Interoperable, and Reusable). Research Computing Services (RCS) at the University of Cambridge hosts a team of Research Software Engineers (RSEs) who can advise if you are writing your own software.
RCS also hosts the Cambridge Service for Data Driven Discovery (CSD3), through which researchers, PIs and their students at the University of Cambridge have access to world-class computing resources to further their work at scale. CSD3 comprises a large number of modern CPU and GPU nodes, with new hardware being added continually. Software frameworks are provided on CSD3 for common ML and data science applications.
Organising Committee
- Raghada Al-Najjar, PhD student (Department of Clinical Neuroscience)
- Jack Atkinson, ECR (Institute of Computing for Climate Science)
- Edward Harding, ECR (Department of Clinical Biochemistry)
- Cheng-Yu Huang, PhD student (Department of Physiology, Development and Neurosciences)
- Oliver Wissett, PhD student (Yusuf Hamied Department of Chemistry)
- Ellen Ashmore Marsh (C2D3)
- Lisa Curtis (C2D3)
