On 23rd June 2021, C2D3 held a collaborative data science and AI research event. This series of online sessions reflected the strengths and collaborations of data science and AI research across the University and many of the presentations showcased the University’s partnership with The Alan Turing Institute. Our speakers covered a wide range of themes and disciplines including the humanities, natural language processing for mental health, digital twins, weather and climate, and healthcare.
Watch the sessions on-demand
C2D3 members can watch the videos on-demand here.
Programme
13:00-13:20 Session 1: Introducing The Alan Turing Institute, chaired by Professor Zoe Kourtzi (Turing University Lead for the University of Cambridge)
- 13:00: Introduction to the Turing, Turing for universities, Turing for industry and other organisations by Daniel Lovelock (Senior Academic Engagement Manager, The Alan Turing Institute) and Katrina Payne (Partnerships Development Lead, The Alan Turing Institute).
13:20-14:25 Session 2: Presenting the Turing Fellow research projects, chaired by Professor Zoe Kourtzi (Turing University Lead for the University of Cambridge)
- 13:20: Session introduction
- 13:25 Data science and AI in the humanities: Professor Robert Foley, Dr Jason Gellis and Dr Camila Rangel Smith, ‘Data science and the reconstruction of past behaviour: capturing the stone tool technology of prehistoric people’
- 13:45: Speaker Q&A
- 13:55 Data science and AI for mental health: Dr Sarah Morgan, ‘Assessing psychosis risk using quantitative markers of transcribed speech’
- 14:15: Speaker Q&A
14:25-14:40 Break
14:40-16:30 Session 3: Research showcase, chaired by Professor Gos Micklem (Department of Genetics and Department of Applied Mathematics and Theoretical Physics, University of Cambridge)
- 14:40 Data-centric engineering: Rebecca Ward, ‘Growing Underground: towards a digital twin for crop production’
- 15:00 Data science and the environment: Rachel Furner, ‘Developing data-driven forecast systems’
- 15:20 Data science and AI for healthcare: Adam Berman, ‘Automated approaches to diagnosing Barrett's Oesophagus from Cytosponge’
- 15:40 Data science for history and social sciences: Dr Alexis Litvine, ‘Extracting structured data from historical insurance records for Aviva. An application of a new tool (THOTH) to extract tabular data at scale’
- 16:00 Foundations of data science: Professor Carola-Bibiane Schönlieb, ‘Looking into the black box: how mathematics can help to turn deep learning inside out’
- 16:20: Closing remarks and opening of the networking
16:30-17:00 Session 4: Get connected
- Interactive networking and discussions
- Ask a Turing Fellow
- Connect to a Cambridge academic and researcher
- Q&As with speakers and session chairs
All times above are in British Summer Time (BST).
Abstracts
Prof. Robert Foley (1), Dr Jason Gellis (2), Dr Camila Rangel Smith (3)
(1) Leverhulme Centre for Human Evolutionary Studies, University of Cambridge; Interdisciplinary Centre for Archaeology and Evolution of Human Behaviour, University of Algarve, Faro, Portugal; Turing Fellow. (2) Leverhulme Centre for Human Evolutionary Studies, University of Cambridge. (3) The Alan Turing Institute.
Title: Data science and the reconstruction of past behaviour: capturing the stone tool technology of prehistoric people
For most of human history, stone was the primary raw material for much of the technological basis for human adaptation. The flaking of stone to create sharp edges and particular shapes and sizes of tools represents one of our major evolutionary advances. Once the skill was acquired, stone tools were made and discarded in prolific quantities, and changed in ways that mapped developments in cognition, behaviour and ecology. Archaeologists and anthropologists have, over more than one hundred years, collected vast numbers of stone tools, and developed intensive methods of analysis. The result is a major resource of archived photographs and drawings of lithics. The Turing-funded project, PALAEOANALYTICS, aims to develop AI/machine learning approaches to automate the retrieval of this information and to expand the potential data collected. In this talk we will present the progress we have made in developing computer vision techniques to collect key morphometric information from drawings of stone tools, focusing on those that indicate the technological processes used by prehistoric people to produce them.
Dr Sarah Morgan
Accelerate Science Research Fellow, Department of Computer Science and Technology, University of Cambridge; Senior Research Associate, Cambridge Brain Mapping Unit, Department of Psychiatry, University of Cambridge; Turing Fellow.
Title: Assessing psychosis risk using quantitative markers of transcribed speech
There is a pressing clinical demand for tools to predict individual patients' disease trajectories for schizophrenia and other conditions involving psychosis; however, to date such tools have proved elusive. Behaviourally and cognitively, psychosis expresses itself through subtle alterations in language. Recent work has suggested that Natural Language Processing (NLP) markers of transcribed speech might be powerful predictors of later psychosis (Mota et al 2017, Corcoran et al 2018). For example, Corcoran et al 2018 used quantitative markers of semantic coherence, collected at baseline from individuals at clinical high risk for psychosis, to predict transition to psychosis with 79% accuracy.
However, it remains unclear which NLP measures are most likely to be predictive, how different NLP measures relate to each other and how best to collect speech data from patients. In this talk, I will discuss our research tackling these questions, as well as the wider challenges of translating this type of approach to the clinic. Ultimately, computational markers of speech have the potential to transform healthcare for mental health conditions such as schizophrenia, since they are relatively easy to collect and could be measured longitudinally to quickly identify changes in patients’ disease trajectories.
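To make the idea of a quantitative coherence marker concrete, the following is a minimal sketch only. It assumes a simple bag-of-words cosine similarity between consecutive sentences, which is a stand-in chosen for illustration and not the measure used in the cited studies or in the speaker's research. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Bag-of-words count vector for one sentence (lower-cased word tokens)."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def coherence_scores(sentences):
    """Similarity of each sentence to the one that follows it."""
    vectors = [sentence_vector(s) for s in sentences]
    return [cosine(a, b) for a, b in zip(vectors, vectors[1:])]


def mean_coherence(sentences):
    """Average sentence-to-sentence similarity; 0.0 if fewer than two sentences."""
    scores = coherence_scores(sentences)
    return sum(scores) / len(scores) if scores else 0.0
```

In a sketch like this, a transcript whose consecutive sentences share few words yields a low mean score. Published approaches typically replace the word counts with learned semantic embeddings, so this should be read as an illustration of the general shape of such a marker rather than a usable clinical tool.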
Rebecca Ward
Postdoctoral Research Associate, The Alan Turing Institute, with Dr Ruchi Choudhary (University of Cambridge), Data-centric engineering group (ASG)
Title: Growing Underground: towards a digital twin for crop production
The growth of building-integrated agriculture as one potential solution to reducing food miles for city-dwellers brings with it disparate problems that present new challenges to the agriculture industry. The dual aim of keeping energy costs to a minimum while maximising crop growth is particularly challenging. Extensive monitoring provides valuable information, and statistical analysis of historic conditions can be used to generate forecasting models. Physics-based simulation can also help; by simulating the interaction of the vegetation with the surrounding environment, ‘what-if’ scenario tests can be performed efficiently and the impact of potentially sub-optimal conditions explored. We are fortunate to work closely with the operators of Growing Underground, a farm located in previously disused tunnels in Clapham, London. As part of a long-running project encompassing both monitoring and simulation, a digital twin is being developed which will enable the operator both to visualise current and forecast farm conditions and to explore future scenarios. In this talk the development of, and future plans for, the digital twin will be described, and lessons learnt for wider application of digital twin technology will be discussed.
Rachel Furner
PhD student at the British Antarctic Survey, and Department of Applied Mathematics and Theoretical Physics, University of Cambridge.
Title: Developing data-driven forecast systems
The recent boom in machine learning and data science has led to a number of new opportunities in the environmental sciences. In particular, process-based weather and climate models (simulators) represent the best tools we have to predict, understand and potentially mitigate the impacts of climate change and extreme weather. However, these models are incredibly complex and require huge amounts of high-performance computing (HPC) resources. Machine learning offers opportunities to greatly improve the computational efficiency of these models.
Here I discuss recent work to develop a data-driven model of the ocean, an integral part of the weather and climate system. We train a neural network on the output from an expensive process-based simulator of an idealised channel configuration of oceanic flow. We show that the model is able to learn the complex dynamics of the system well, replicating the mean flow and details within the flow over single prediction steps. We also see that when iterating the model, predictions remain stable, and continue to match the ‘truth’ over a short-term forecast period, here around a week.
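The distinction between a single prediction step and an iterated forecast can be sketched as follows. This is an illustration under stated assumptions: `toy_step` is a hypothetical stand-in for a trained network, the state is a plain list of numbers, and none of the names come from the work described.

```python
import math


def rollout(step, initial_state, n_steps):
    """Iterate a single-step predictor, feeding each output back in as the next input."""
    states = [initial_state]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states


def toy_step(state):
    """Hypothetical stand-in for a trained single-step model: halves every value."""
    return [0.5 * x for x in state]


def rms_error(predicted, truth):
    """Root-mean-square difference between a predicted state and a reference state."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, truth)]
    return math.sqrt(sum(squared) / len(squared))


# Forecast three steps ahead from an initial state.
forecast = rollout(toy_step, [8.0, 4.0], 3)
```

Because each step consumes the previous prediction, small single-step errors can compound over the rollout. Comparing each forecast state against the simulator output at the same time, for instance with `rms_error`, is one simple way to check whether the iterated predictions stay close to the reference.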
Adam Berman
PhD student at Cancer Research UK Cambridge Institute, University of Cambridge
Title: Automated approaches to diagnosing Barrett's Oesophagus from Cytosponge
Deep learning methods have been shown to achieve excellent performance on diagnostic tasks, but how to optimally combine them with expert knowledge and existing clinical decision pathways is still an open challenge. This question is particularly important for the early detection of cancer, where high-volume workflows may benefit from (semi-)automated analysis. Here we present a deep learning framework to analyze samples of the Cytosponge, showing that learning methods can perform quality control, diagnosis, and atypia detection with high accuracy.
Dr Alexis Litvine
Faculty of History and Cambridge Digital Humanities
Title: Extracting structured data from historical insurance records for Aviva. An application of a new tool (THOTH) to extract tabular data at scale
Handwritten Text Recognition technology (HTR) has recently become a viable method for transcribing handwritten historical documents. This has made archival documents ‘machine-readable’ for the first time. HTR is now successfully used by thousands of amateur and professional historians, archivists, genealogists, librarians, and social scientists around the world. However, it is not yet suitable for sources with complex layouts or tabular structures. This key limitation prevents social scientists and beneficiaries from applying HTR to documents such as censuses, civil or military records, and tax lists. Our technology (THOTH), developed by Yiannos Stathopoulos (Computer Science), Oliver Dunn (History) and Alexis Litvine (History/CDH), makes this possible by streamlining data extraction from tables.
Archived manuscript tables provide detailed insights into more than 500 years of history, covering nearly every aspect of people's lives, from their birth, education, health, profession, and housing, up to their death and legacy, representing tens of thousands of shelf kilometres of historical documents preserved in European archives alone. Furthermore, large portions of these documents are becoming available as digital images due to systematic digitisation efforts. As one shelf kilometre corresponds on average to ten million page-images, most large national archives will soon be hosting several billion images. THOTH can make a large amount of the data contained in such documents analysable and searchable both by researchers and the public.
Since 2018, THOTH has combined several proven computer vision technologies into our own AI workflow. What started as a tool for our own data needs soon attracted the interest of non-academic partners who wished to adopt our technology. We recently benefitted from a £3,000 ESRC-IAA discretionary fund grant to buy equipment to process large-scale images and set up Osiris-AI Ltd (www.osiris-ai.com). We are currently working with Aviva Plc on the digitisation of a large number of records from the former Hand-in-Hand fire insurer, held in the London Metropolitan Archive on behalf of Aviva. To do this, we will digitise the complete collection of Hand-in-Hand insurance policy registers held there and create a structured dataset for use by researchers and the public. These registers are particularly interesting because their long chronological coverage provides scholars with a wealth of data, and because they serve as an advertisement for Aviva's heritage as a national insurance institution. The registers contain approximately 1.8M unique observations of London addresses insured against fire for the period 1697-1865, and these data will be geolocated to specific locations in London. Partnering with Layers of London (layersoflondon.org), we will then be able to offer a striking visualisation of the data created with THOTH.
A video about the project can be accessed here.
Professor Carola-Bibiane Schönlieb
Department of Applied Mathematics and Theoretical Physics; Turing Fellow.
Title: Looking into the black box: how mathematics can help to turn deep learning inside out
Deep learning has had a transformative impact on a wide range of tasks related to Artificial Intelligence, ranging from computer vision and speech recognition to playing games. Still, the inner workings of deep neural networks are far from clear, and designing and training them is seen as almost a black art. In this talk we will try to open this black box a little by using the mathematical structure of neural networks, described through differential equations and mathematical optimisation. The talk is illustrated with examples in image processing and computer vision, ranging from biomedical imaging to remote sensing.
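One well-known link between neural networks and differential equations, which may or may not be the one the talk draws on, is that a residual layer of the form x + h·f(x) has the same shape as a forward Euler step for dx/dt = f(x). The sketch below illustrates that correspondence for a scalar state; the `tanh` layer function, the step size `h` and all names are assumptions chosen for illustration.

```python
import math


def residual_layer(x, weight, bias, h):
    """One residual update x + h * f(x): a forward Euler step of dx/dt = tanh(w*x + b)."""
    return x + h * math.tanh(weight * x + bias)


def forward(x, params, h):
    """Apply a stack of residual layers; each (weight, bias) pair is one Euler step."""
    for weight, bias in params:
        x = residual_layer(x, weight, bias, h)
    return x


# Four layers with hypothetical parameters, each advancing the state by step size 0.25.
output = forward(1.0, [(0.3, -0.1)] * 4, 0.25)
```

Read this way, the depth of the network plays the role of integration time, which is what allows tools from numerical analysis and optimisation to be brought to bear on questions such as stability.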