AI CURRENTS ai research you should know about


AI Currents is a report by AI researcher Libby Kinsey and Exponential View which aims to showcase and explain the most important recent research in Artificial Intelligence and Machine Learning.

Breakthroughs in AI come at an ever more rapid pace. We, at Exponential View, wanted to build a bridge between the newest research and its relevance for businesses, product leaders and data science teams across the world.

AI Currents is our first attempt at this: to distil the most important papers of the previous calendar quarter into a digestible and accessible guide for the practitioner or unit head. It is not an easy task, with more than 60,000 AI papers published last year.

We’re grateful to Libby Kinsey for stepping up to the task!

We hope you enjoy it.

Azeem Azhar

Founder, Exponential View

After twelve years working in technology and VC, I went back to university to study machine learning at UCL. It was 2014, Deepmind had recently been acquired by Google, Andrew Ng’s introductory course on machine learning was already an online sensation, yet mostly, machine learning hadn’t hit the mainstream.

Upon graduating, I felt super-powered for all of a few minutes, and then the truth hit me that I didn’t know how to do anything… real. I’d focused so much on learning the maths and implementing the algorithms that I had a lot of context to catch up on.

Since then, I’ve been trying to parse what is happening in the world of research into what that means for commercial opportunity, societal impact and widespread adoption. I work in a variety of roles, as technologist, advisor, and analyst, for startups, larger organisations and government.

In this first edition of AI Currents, I’ve selected five recent papers and one bigger theme that I think are interesting from this wider perspective. The mix spans topics that sit firmly in the ‘fundamental research’ category yet are advancing towards capabilities at the heart of artificial intelligence, work that examines the complexities of deployment, and more application-focused concerns.

Research in AI is moving at such a fast pace that it’s impossible to cover everything. What I hope this report does is to slow down, to take papers that are interesting in their own right but that also act as exemplars for some of the ways researchers talk about their work or the claims they sometimes make, and to respectfully consider them outside of their primary, academic audience. Your time is precious and the report is already long enough!

Azeem, Marija and I hope that you enjoy this report, that you learn something and that you’ll let us know what you think.

Libby Kinsey

AI Researcher


Transformers

Not based on a single paper but a chance to look back at a step-change in natural language understanding and its journey from experimental architecture to widely available building block deployed in hyper-scale settings.


Language understanding has long been a focus of AI research, exemplified by Turing’s famous empirical test for machine intelligence. We can argue whether success in a Turing-type test actually constitutes ‘intelligence’, but it’s clear that the practical applications of machines that (in some sense) understand language are numerous. Such machines could answer questions, translate from one language to another, summarise lengthy documents or conduct automated reasoning; and they could interact more naturally with, or learn more readily from, humans.

Progress towards language understanding experienced a leap with the introduction of the ‘Transformer’ by Google in 2017. The Transformer is a deep learning architecture designed to improve performance on natural language tasks efficiently. Deep neural networks for learning from text had previously used layers based on local convolutions and recurrence, which analyse words in the context of a few surrounding words together with an approximate compression of words further away. The Transformer instead pairs simple point-wise layers with a new mechanism called attention – in particular self-attention – which allows words to be analysed in a much wider context: whole surrounding sentences, paragraphs or more. With it, the team beat previous state-of-the-art models on English-to-French and English-to-German translation benchmarks, at a fraction of the training cost.
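The self-attention computation described above can be sketched in a few lines. This is a minimal, single-head illustration with toy dimensions and random weights (the sizes and names here are illustrative, not the paper’s): every token’s output mixes information from every other token in the sequence, which is what gives the model its wide context.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of word vectors.

    X: (seq_len, d_model) -- one embedding per token.
    Each token's output is a weighted mix of *all* tokens' values,
    which is how the Transformer sees whole-sentence context at once.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v             # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # similarity of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the sequence
    return weights @ V                              # context-aware representations

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4                     # toy sizes, not the paper's
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)                                    # one mixed vector per token
```

A real Transformer stacks many such layers, each with several attention heads, but the core operation is no more than this.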

Research interest in attention ignited by the Transformer paper was hot throughout 2018, producing a deluge of architectural and training refinements and expansion to many different languages and tasks. But it was the release of work using simplified structures at a much greater scale (particularly Google’s BERT and OpenAI’s GPT-2) that lit up 2019. What this allowed was the ability to efficiently (everything’s relative) process truly huge amounts of text so as to learn some general attributes of language. This knowledge can then be used as a good starting point for finessing specific language understanding tasks, and thereby it can vastly reduce the amount of labelled training data required – and achieve state-of-the-art results. The first task is known as ‘pre-training’, the second ‘fine-tuning’.

Throughout 2019, the likes of Google, OpenAI, Microsoft and Facebook released pre-trained models. Now anyone can download one and fine-tune it for their particular task, so avoiding the prodigious expense of training from scratch. Hugging Face has collated all of these pre-trained models under a unified API together with everything needed to get started. In addition, in only two years, this novel architecture has gone from being effectively a prototype to large-scale deployments, such as in search at Microsoft Bing and Google Search.

Why is it interesting?

The computational parallelism that the Transformer model allows originally made for faster training times than the methods that had previously been in vogue. However, since then, transformers have become synonymous with massive compute budgets as researchers test the limits of giant architectures and abundant training data. The training data is abundant because it’s just text – often scraped from the internet – and it doesn’t need labelling as the language models are trained in a self-supervised manner to predict words contained in the text.
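The self-supervision point can be made concrete with a toy sketch of how unlabelled text yields training pairs for free. Real systems mask subword tokens and operate at vastly greater scale; the word-level masking and the mask rate below are simplifying assumptions.

```python
import random

def masked_lm_examples(text, mask_rate=0.3, seed=0):
    """Turn raw text into (masked sentence, target word) training pairs.

    No human labelling is needed: the 'label' is just the word that was
    hidden, which is why web-scale text is usable as-is. Real models
    mask subword tokens, not whole words -- this is a toy illustration.
    """
    rng = random.Random(seed)
    words = text.split()
    examples = []
    for i, word in enumerate(words):
        if rng.random() < mask_rate:
            masked = words[:i] + ["[MASK]"] + words[i + 1:]
            examples.append((" ".join(masked), word))
    return examples

pairs = masked_lm_examples("the cat sat on the mat because the mat was warm")
for masked, target in pairs:
    print(f"{masked!r} -> {target!r}")
```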

If more compute plus more data equals better language models, then the only limitation to progress is computing resource, which makes it the purview of only a very small number of well-resourced entities. This raises concerns about who can participate in fundamental research and who will ultimately benefit from it.

Having said that, there’s much to celebrate about how open research has been to date and how so many pre-trained language models have been made available for use. The option of downloading a pre-trained language model and fine-tuning it for specific applications (instead of building a task-specific architecture and training it end-to-end) massively reduces the difficulty of creating cutting-edge applications. For example, it took about 250 chip-days to pre-train the full-sized version of BERT, but it requires only a few chip-hours to fine-tune it to a particular task.

Before we get too carried away, though, there is a reason to question just what deep language models such as the Transformer are learning and what they can be expected to do. Simplistically, these models work by learning associations between words from how they are used. In the words of John Firth, ‘You shall know a word by the company it keeps.’ This results in language generation that is fluent but not typically coherent. The models have in some sense acquired knowledge, perhaps about the structure of language, but they haven’t necessarily learned about the structure of the world – knowledge that is required for reasoning and to make inferences, or to answer common-sense questions.

A naive look at research results might obscure this fact, since performance is steadily improving against benchmark tasks in reading comprehension, question answering, text summarisation, etc. But the benchmark tasks and evaluation metrics used in research imperfectly capture what we want from language understanding systems, and are themselves subject to research interrogation and refinement.

That’s one charge against transformers, that they haven’t learned anything about language that can be equated with intelligence or understanding. Gary Marcus is typically forthright on this topic. Another charge is that their successes have been such that they have diverted attention and funding from other approaches, potentially less resource-hungry ones. Stephen Merity’s ‘Single Headed Attention RNN: Stop Thinking With Your Head’ paper is an amusing examination of this point. Perhaps better evaluation metrics, along with less-hyped results, could’ve helped avoid too much focus on one approach, but papers like his do offer glimmers of hope to those not affiliated with FAANG budgets that alternative approaches are worthwhile.

Finally, there’s been no end of hoo-ha about the potential misuse of transformers, specifically the risk of mass-production of high-quality ‘fake news’ and propaganda, or of propagating biases learned from the text corpora they train on. OpenAI staged the release of their language model, finally releasing the largest, best-performing one only in November last year, after conducting a year-long conversation about what responsible publication looked like. No one would think to do this if the text output by transformer models weren’t so convincing.

So, yes, there are limitations to what transformers can do, but they’ve really made a splash, they’ve accelerated from experimental architecture to production at remarkable speed, and anyone can take advantage of their power by downloading a pre-trained model and fine-tuning it for their specific uses.


A Transformer reading list:

  • The original Transformer paper: ‘Attention Is All You Need’ (paper / blog)
  • Google’s BERT paper: ‘BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding’ (paper / blog)
  • OpenAI’s GPT-2 paper: ‘Language Models Are Unsupervised Multitask Learners’ (paper / blog)
  • An examination of model performance: ‘What Is BERT Actually Learning? Probing Neural Network Comprehension of Natural Language Arguments’ (paper)
  • Gary Marcus’s talk at the NeurIPS 2019 workshop on context and compositionality in biological and artificial systems, ‘Deep Understanding, the Next Challenge for AI’ (video)
  • An argument against too much focus on transformers: ‘Single Headed Attention RNN: Stop Thinking With Your Head’ (paper)
  • Starting to probe how transformer models actually work: ‘How Much Knowledge of Language Does BERT Actually Learn? Revealing the Dark Secrets of BERT’ (paper / blog)
  • OpenAI’s GPT-2 1.5bn parameter pre-trained model is released (blog)
  • Google AI’s Meena 2.6bn parameter chatbot: ‘Towards a Human-Like Open-Domain Chatbot’ (paper / blog)
  • ‘Turing-NLG: A 17bn parameter Language Model by Microsoft’ (blog)
  • Try generating some text here: Write with Transformer: Get a modern neural network to auto-complete your thoughts

Deep Learning for Symbolic Mathematics

G. Lample and F. Charton / December 2019 / paper


Neural networks have a reputation for being better at solving statistical or approximate problems than at performing calculations or working with symbolic data. In this paper, we show that they can be surprisingly good at more elaborated tasks in mathematics, such as symbolic integration and solving differential equations. We propose a syntax for representing mathematical problems, and methods for generating large datasets that can be used to train sequence-to-sequence models. We achieve results that outperform commercial Computer Algebra Systems such as Matlab or Mathematica.


In this work, the authors generate a dataset of mathematical problem-solution pairs and train a deep neural network to learn to solve them. The particular problems in question are function integration and ordinary differential equations of the first and second order, which many of us will remember learning to solve using various techniques and tricks that required quite a lot of symbol manipulation, experience and memory (I seem to have forgotten all of it – such a waste of all that study).

By converting the problem-solution pairs into sequences, the authors were able to use a transformer network and hence to take advantage of the attention mechanisms that have been found so useful in processing text sequences.

They evaluated their trained model on a held-out test set, deeming the solution ‘correct’ if any one of the top 10 candidate solutions outputted by the transformer was correct. In practice, more than one may be correct as they found that many outputs were equivalent after simplification, or differed only by a constant.

The trained model was able not only to learn how to solve these particular mathematical problems but also to outperform the commercial systems (so-called Computer Algebra Systems) using complex algorithms to do the same thing, albeit constraining those systems to a time cut-off.

Why is it interesting?

In Section 1, we saw how transformers (sequence modelling with attention) have become the dominant approach to language modelling in the last couple of years. They’ve also been applied with success to other domains that use sequential data such as in protein sequencing and in reinforcement learning where a sequence of actions is taken. What’s more surprising is their use here, with mathematical expressions. On the face of it, these aren’t sequences and ought not to be susceptible to a ‘pattern-matching’ approach like this.

What I mean by ‘pattern-matching’ is the idea that the transformer learns associations from a large dataset of examples rather than understanding how to solve differential equations and calculate integrals analytically. It was really non-obvious to me that this approach could work (despite prior work; see ‘More’ below). It’s one thing to accept that it’s possible to convert mathematical expressions into sequence representations; quite another to think that deep learning can do hard maths!

Maths has been the subject of previous work in deep learning, but that focused on the intuitively easier problem of developing classifiers to predict whether a given solution is ‘correct’ or not. Actually generating a correct solution represents a step change in utility.

The paper attracted quite a lot of comment on social media and on the OpenReview platform, highlighting concerns about whether the dataset that the authors built introduces any favourable learning biases or whether it satisfactorily covers all potential cases. There were also questions about the fairness of comparing the top-10 accuracy of this system against time-limited Computer Algebra Systems (the benchmark analytical approaches) that had only one shot at a solution, but I think that the authors addressed these well in their revised paper.

It’s a stretch to imagine that a deep learning approach will replace Computer Algebra Systems, but it could readily support them, providing candidate answers for problems that those systems fail (or take too long) to solve. Because the cost of checking candidate solutions is low, having 10 candidates need not be considered an impediment. I’d love to see some commentary on this from any of the Computer Algebra Systems publishers, but I haven’t been able to find anything yet.
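The ‘readily checked’ point can be sketched concretely: a candidate antiderivative can be screened cheaply by differentiating it numerically and comparing against the integrand at a few sample points. The function names and tolerances below are illustrative assumptions, not the authors’ verification procedure.

```python
def looks_like_antiderivative(candidate, integrand, points, h=1e-6, tol=1e-4):
    """Cheap screen: does candidate'(x) ~= integrand(x) at sample points?

    This is why top-10 output is workable in practice: each candidate
    can be filtered numerically before any symbolic verification.
    """
    for x in points:
        slope = (candidate(x + h) - candidate(x - h)) / (2 * h)  # central difference
        if abs(slope - integrand(x)) > tol:
            return False
    return True

# Candidates for the integral of f(x) = 6x (true answer: 3x^2 + C).
f = lambda x: 6 * x
good = lambda x: 3 * x**2 + 7      # differs from 3x^2 only by a constant
bad = lambda x: x**3               # wrong

print(looks_like_antiderivative(good, f, points=[-2.0, 0.5, 3.0]))  # True
print(looks_like_antiderivative(bad, f, points=[-2.0, 0.5, 3.0]))   # False
```

Note that the constant of integration drops out under differentiation, which is exactly why the authors count answers differing by a constant as correct.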

What other seemingly ‘symbolic’ problems will prove susceptible to a pattern-matching approach?

Will we see the impact of this in the next year?

It’s possible that we will see commercial Computer Algebra Systems integrating deep learning approaches like this one in the next year, but the biggest impact is likely to be in inspiring the application of transformers to other domains in which there is readily available (or synthesizable) labelled data that can be transformed into sequences. In language translation, the sequence [je suis étudiant] maps to the sequence [I am a student]. The mathematical expression 2 + 3x² can be written as a sequence in normal Polish (prefix) form, [+ 2 * 3 pow x 2], and its derivative, 6x, as [* 6 x].
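The mapping from expression tree to sequence can be sketched as follows. The tuple encoding here is an assumption for illustration; the paper defines its own tokenisation.

```python
def to_prefix(expr):
    """Flatten a nested (op, operand, operand) expression tree into a
    Polish-notation token list (operator first, then operands) -- the
    kind of sequence a sequence-to-sequence model can consume."""
    if isinstance(expr, tuple):
        op, *args = expr
        tokens = [op]
        for a in args:
            tokens += to_prefix(a)
        return tokens
    return [str(expr)]

# 2 + 3x^2 as a tree: (+ 2 (* 3 (pow x 2)))
expr = ("+", 2, ("*", 3, ("pow", "x", 2)))
print(to_prefix(expr))   # ['+', '2', '*', '3', 'pow', 'x', '2']

# its derivative, 6x:
deriv = ("*", 6, "x")
print(to_prefix(deriv))  # ['*', '6', 'x']
```

Once both problem and solution are flat token sequences like these, integration becomes, formally, a ‘translation’ task.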


  • Signal: Accepted as a ‘spotlight’ talk at ICLR2020
  • Who to follow: Paper authors @GuillaumeLample, @f_charton (Facebook AI Research)

Other things to read:

Neural Programming involves training neural networks to learn programs, mathematics or logic from data. Some prior work:

  • This paper was one of three awarded ‘best paper’ at ICLR in 2017 and trains a neural network to do grade-school addition, among other things: ‘Making Neural Programming Architectures Generalize via Recursion’ (paper)
  • This ICLR 2018 paper trained neural networks to do equation verification and equation completion: ‘Combining Symbolic Expressions and Blackbox Function Evaluations in Neural Programs’ (paper)
  • In NeurIPS 2018, Trask et al. proposed a new module designed to learn systematic numerical computation that can be used within any neural network: ‘Neural Arithmetic Logic Units’ (paper)
  • In ICLR 2019, Saxton et al. presented a new synthetic dataset to evaluate the mathematical reasoning ability of sequence-to-sequence models and trained several models on a wide range of problems: ‘Analysing Mathematical Reasoning Abilities of Neural Models’ (paper)

Selective Brain Damage: Measuring the Disparate Impact of Model Pruning

S. Hooker, A. Courville, Y. Dauphin, and A. Frome / November 2019 / paper / blog


Neural network pruning techniques have demonstrated that it is possible to remove the majority of weights in a network with surprisingly little degradation to test set accuracy. However, this measure of performance conceals significant differences in how different classes and images are impacted by pruning. We find that certain examples, which we term pruning identified exemplars (PIEs), and classes are systematically more impacted by the introduction of sparsity. Removing PIE images from the test-set greatly improves top-1 accuracy for both pruned and non-pruned models. These hard-to-generalize-to images tend to be mislabelled, of lower image quality, depict multiple objects or require fine-grained classification. These findings shed light on previously unknown trade-offs, and suggest that a high degree of caution should be exercised before pruning is used in sensitive domains.


A trained neural network consists of a model architecture and a set of weights (the learned parameters of the model). These are typically large (they can be very large – the largest of OpenAI’s pre-trained GPT-2 language models, referred to in Section 1, is 6.2GB!). Their size inhibits their storage and transmission and limits where they can be deployed. In resource-constrained settings, such as ‘at the edge’, compact models are clearly preferable.

With this in mind, methods to compress models have been developed. ‘Model pruning’ is one such method, in which some of the neural network’s weights are removed (set to zero) and hence do not need to be stored (reducing memory requirements) and do not contribute to computation at run time (reducing energy consumption and latency). Rather surprisingly, numerous experiments have shown that removing weights in this way has negligible effect on the performance of the model. The inspiration behind this approach is the human brain, which loses ‘50% of all synapses between the ages of two and ten’ in a process called synaptic pruning that improves ‘efficiency by removing redundant neurons and strengthening synaptic connections that are most useful for the environment’.

Because it’s ‘puzzling’ that neural networks are so robust to high levels of pruning, the authors of this paper probe what is actually lost. They find that although global degradation of a pruned network may be almost negligible, certain inputs or classes are disproportionately impacted, and this can have knock-on effects for other objectives such as fairness. They call this ‘selective brain damage’.

The pruning method that the authors use is ‘magnitude’ pruning, which is easy to understand and to implement (weights are successively removed during training if they are below a certain magnitude until a model sparsity target is reached) and very commonly used. The same formal evaluation methodology can be extended to other pruning techniques.
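A one-shot version of magnitude pruning takes only a few lines of NumPy. This is an illustrative sketch, not the implementation used in the paper; as noted above, in practice weights are removed gradually during training until the sparsity target is reached.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    Zeroed weights need not be stored and contribute nothing at run
    time, which is where the memory and energy savings come from.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)               # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(1)
W = rng.normal(size=(100, 100))
W90 = magnitude_prune(W, sparsity=0.9)
print((W90 == 0).mean())   # roughly 0.9 of the weights are now zero
```

The paper’s question is what happens to *which* inputs once 90% of W is gone, not just to the headline accuracy.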

Why is it interesting?

The paper was rejected for ICLR2020. The discussion on OpenReview suggested that it was obvious that pruning would result in non-uniform degradation of performance and that, having proven this to be the case, the authors did not provide a solution.

It may indeed be obvious to the research community, but I wonder whether it is still obvious when it comes to production. Pruning is already a standard library function for compression. For instance, a magnitude pruning algorithm is part of TensorFlow’s Model Optimization toolkit, and pruning is one of the optimisation strategies implemented by semiconductor IP and embedded device manufacturers to automate compression of models for use on their technologies. What this paper highlights is that a naive use of pruning in production, one that looks only at overall model performance, might have negative implications for robustness and fairness objectives.

Will we see the impact of this in the next year?

I hope we’ll see much more discussion like this in the next year. As machine learning models increasingly become part of complex supply chains and integrations, we will require new methods to ensure that optimisations applied along the way, such as pruning, do not quietly compromise robustness or fairness.


Signal: Rejected for ICLR2020

Who to follow: @sarahookr

Other things to read:

  • One of the first papers (published in 1990) to investigate whether a version of synaptic pruning in human brains might work for artificial neural networks: ‘Optimal Brain Damage’ (paper)
  • This is the paper that introduced me to the idea that one could remove 90% or more of weights without losing accuracy, at NeurIPS 2015: ‘Learning Both Weights and Connections for Efficient Neural Networks’ (paper)
  • The ‘magnitude’ pruning algorithm, workshop track ICLR2018: ‘To Prune, or Not to Prune: Exploring the Efficacy of Pruning for Model Compression’ (paper)

Explainable Machine Learning in Deployment

U. Bhatt, A. Xiang et al. / 10 December 2019 / paper


Explainable machine learning offers the potential to provide stakeholders with insights into model behavior by using various methods such as feature importance scores, counterfactual explanations, or influential training data. Yet there is little understanding of how organizations use these methods in practice. This study explores how organizations view and use explainability for stakeholder consumption. We find that, currently, the majority of deployments are not for end users affected by the model but rather for machine learning engineers, who use explainability to debug the model itself. There is thus a gap between explainability in practice and the goal of transparency, since explanations primarily serve internal stakeholders rather than external ones. Our study synthesizes the limitations of current explainability techniques that hamper their use for end users. To facilitate end user interaction, we develop a framework for establishing clear goals for explainability. We end by discussing concerns raised regarding explainability.


The goal of this research was to study who is actually using ‘explainability’ techniques and how they are using them. The team conducted interviews with ‘roughly fifty people in approximately thirty organisations’. Twenty of these were data scientists not currently using explainability tools; the other thirty were individuals in organisations that have deployed explainability techniques.

The authors supply local definitions of terms such as ‘explainability’, ‘transparency’ and ‘trustworthiness’ for clarity and because they realised that they needed a shared language for the interviews, which were with individuals from a range of backgrounds – data science, academia and civil society. Therefore, ‘[e]xplainability refers to attempts to provide insights into a model’s behavior’, ‘[t]ransparency refers to attempts to provide stakeholders (particularly external stakeholders) with relevant information about how the model works’, while ‘[t]rustworthiness refers to the extent to which stakeholders can reasonably trust a model’s outputs’.

The authors found that where explanation techniques were in use, they were normally in the service of providing data scientists with insight to debug their systems or to ‘sanity check’ model outputs with domain experts rather than to help those affected by a model output to understand it. The types of technique favoured were those that were easy to implement rather than potentially the most illuminating; and causal explanations were desired but unavailable.
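As an example of the kind of easy-to-implement technique referred to above, here is a sketch of permutation feature importance for a black-box model: shuffle one feature’s column and measure how much accuracy drops. The toy model and data are assumptions for illustration, not drawn from the study.

```python
import numpy as np

def permutation_importance(model, X, y, seed=0):
    """Importance of each feature = accuracy drop when that feature's
    column is shuffled, breaking its link to the target. Works for any
    black-box model -- which is precisely why it is easy to adopt."""
    rng = np.random.default_rng(seed)
    base = (model(X) == y).mean()
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])                  # destroy feature j only
        scores.append(base - (model(Xp) == y).mean())
    return np.array(scores)

# Toy 'model': predicts from feature 0 and ignores feature 1 entirely.
model = lambda X: (X[:, 0] > 0).astype(int)
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)                  # target depends only on feature 0

imp = permutation_importance(model, X, y)
print(imp)   # feature 0 scores high; feature 1 scores ~0
```

Note what such a score does and does not tell you: it says which inputs the model leans on, not *why*, which is one reason interviewees found these techniques better suited to debugging than to explaining decisions to those affected.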

Why is it interesting?

Explainable machine learning is a busy area of research, driven by regulations such as Europe’s GDPR and by its centrality to the many attempts to articulate what is responsible AI – either as an explicit goal or as an enabler for higher-level principles such as justice, fairness and autonomy (explanations should facilitate informed consent and meaningful recourse for the subjects of algorithmic decision-making).

What this paper finds is that the explanation techniques currently available fall short of what is required in practice, both substantively and technically. The authors offer some suggestions of what organisations should do and where research should focus next to achieve the aim of building ‘trustworthy explainability solutions’. Their approach of interviewing practitioners – i.e. user-centred design – is one I like and would like to see more of (the Holstein et al. paper, under ‘More’, is another notable recent example). It allows for a much broader consideration of utility than simple evaluation metrics – in this case, asking how easy a given solution is to use or to scale, and how useful the explanations it gives are for a given type of stakeholder – and highlights that there is often a gap between research evaluation criteria and deployment needs.

Complementary work has sought to understand what constitutes an explanation in a given context. I recommend looking up Project ExplAIn by the UK’s Information Commissioner’s Office (ICO) and The Alan Turing Institute, which sought to establish norms for the type of explanation required in given situations via Citizen Juries and put together practical guidance on AI explanation.

Starting with the type of explanation that is expected allows us to ask ourselves whether it is actually feasible. This is why it has been argued that there are some domains in which black box algorithms ought never to be deployed (see Cynthia Rudin under ‘More’). The finding that organisations want causal explanations is illustrative of this concern. Deep learning algorithms are good at capturing correlations between phenomena, but not at establishing which caused the other. Attempts to integrate causality into machine learning are an exciting frontier of research (see Bernhard Scholkopf under ‘More’).

As such, the paper is a welcome check to the proliferation of software libraries and commercial services that claim to offer explanation solutions and from which it’s easy to imagine that this problem is essentially solved. As lead author Umang Bhatt said when he presented the paper at FAT* in January, ‘Express scepticism about anyone claiming to be providing explanations.’

Will we see the impact of this in the next year?

Like research into machine learning algorithms themselves, research into explainability has been subject to cycles of hype and correction, and this is already leading to more nuanced discussions which should benefit everyone.


Signal: Accepted for FAT*2020

Who to follow: @umangsbhatt, @alicexiang, @RDBinns, @ICOnews, @turinginst

Other things to read:

  • Holstein et al.: ‘Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need?’ (paper)
  • At NeurIPS 2018’s Critiquing and Correcting Trends in Machine Learning workshop, Cynthia Rudin argued, ‘Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead’ (paper)
  • Bernhard Schölkopf: ‘Causality for Machine Learning’ (paper)
  • The Alan Turing Institute and the Information Commissioner’s Office (ICO) Project ExplAIn (Interim report here, Guidance (draft) here). Final guidance due to be released later in 2020.

International Evaluation of an AI System for Breast Cancer Screening

S. Mayer McKinney et al. / January 2020 / paper (viewable, not downloadable without subscription)


Screening mammography aims to identify breast cancer at earlier stages of the disease, when treatment can be more successful. Despite the existence of screening programmes worldwide, the interpretation of mammograms is affected by high rates of false positives and false negatives. Here we present an artificial intelligence (AI) system that is capable of surpassing human experts in breast cancer prediction. To assess its performance in the clinical setting, we curated a large representative dataset from the UK and a large enriched dataset from the USA. We show an absolute reduction of 5.7% and 1.2% (USA and UK) in false positives and 9.4% and 2.7% in false negatives. We provide evidence of the ability of the system to generalize from the UK to the USA. In an independent study of six radiologists, the AI system outperformed all of the human readers: the area under the receiver operating characteristic curve (AUC-ROC) for the AI system was greater than the AUC-ROC for the average radiologist by an absolute margin of 11.5%. We ran a simulation in which the AI system participated in the double-reading process that is used in the UK, and found that the AI system maintained non-inferior performance and reduced the workload of the second reader by 88%. This robust assessment of the AI system paves the way for clinical trials to improve the accuracy and efficiency of breast cancer screening.


This paper from Deepmind made a splash at the beginning of the year when it was published in Nature and the story was widely picked up by mainstream media outlets. It describes the successful use of an AI system to identify the presence of breast cancer from mammograms and favourably compares performance against expert human radiologists in the US and the UK.

The paper is the culmination of slow and careful work from a multidisciplinary team with the involvement of patients and clinicians. It relies on annotated mammography data (via Cancer Research UK and a US hospital) collected over extended time periods, since the ‘ground truth’ (whether cancer was actually present or not) requires information from subsequent screening events.

Deepmind submits that the performance that it achieved suggests that AI systems might have a role as a decision support tool, either to reduce the reliance on expert human labour (such as in the UK setting, which currently requires the consensus view from two radiologists) or to flag suspicious regions for review by experts. It is possible that use of AI detectors could help to detect cancers earlier than is currently the case and to catch cancers that are missed, but clinical studies are required to test this out.

Why is it interesting?

Detecting cancer earlier and more reliably is a hugely emotional topic and one with news value beyond the technical media. It’s refreshing to see a counterpoint to the negative press that has attended fears of AI-driven targeting, deep fakes and tech-accelerated discrimination in recent months. I, for one, am starving for evidence of applications of AI with positive real-world impact. But does the research justify the hype?

The first thing to note is that this kind of approach to AI and mammography is not novel; it’s of established interest in academia and the commercial sector. For instance, a team from NYU published similar work last summer comparing neural network performance against radiologists, and London’s Kheiron Medical is engaged with clinicians in the NHS to evaluate whether their ‘model is suitable for consideration as an independent reader in double-read screening programmes’. DeepMind’s reputation and effective PR department are perhaps such that the media is more likely to notice its results than these others’.

Where AI performance is evaluated against clinicians, we should also be a little careful. The AI system has the advantage of being able to select the decision threshold (the score above which a mammogram is classified as cancer) that best showcases its abilities. Even with that advantage, it performed no better overall than the two-reader system used in the UK, although there were some interesting cases where the AI system spotted things that humans did not. This suggests an economic argument for use, but it isn’t yet representative of a step-change in capability.
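To make the threshold point concrete, here is a minimal sketch (toy simulated data, not DeepMind's evaluation; it assumes scikit-learn) of how a full ROC curve lets a model be reported at whichever operating point flatters it most, whereas each human reader is a single fixed (false-positive rate, true-positive rate) point:

```python
# Toy illustration of operating-point selection on a ROC curve.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
# Simulated ground truth and model scores for 1,000 screening cases.
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(y_true * 0.3 + rng.normal(0.35, 0.25, size=1000), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC-ROC:", round(roc_auc_score(y_true, scores), 3))

# Pick the threshold maximising Youden's J (TPR - FPR): one common way
# to choose a flattering operating point after the fact.
best = np.argmax(tpr - fpr)
print("threshold:", round(float(thresholds[best]), 3),
      "TPR:", round(float(tpr[best]), 3),
      "FPR:", round(float(fpr[best]), 3))
```

A human reader, by contrast, cannot sweep their own threshold: their sensitivity/specificity trade-off is whatever it is on the day.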

There’s still a very long way to go from here to deployment. First, as the authors note, understanding the ‘full extent to which this technology can benefit patient care’ will require clinical studies. That means evaluation of performance in clinically realistic settings, across representative patient cohorts and in randomised controlled trials. Then, if the evidence supports deployment, there are some non-trivial updates to service design and technology integration required to incorporate AI into large screening programmes. One might say that DeepMind has demonstrated that we are at the end of the beginning.

Scepticism has also been expressed about the use of AI for mammography at all. There are three major concerns: 1) that research (or perhaps derivative media reports) might be overstating the importance of results since these systems have not been tested for robustness and generalisability in real-world settings; 2) that using AI will distort outcomes if the wrong questions are asked; and 3) about how the data were obtained, who owns the models and who benefits.

DeepMind has clearly taken great care in its experiments and in what it reports, but it has not released its code, and the supplementary information that it has released is very light on the model design and implementation details that would be required to reproduce the experiments that it reports. The US dataset used in the study is not publicly available and we do not know the details of the licence under which the UK OPTIMAM dataset was granted. We don’t know what DeepMind intends to do with its models, either; thus, we don’t have enough information to conduct a cost–benefit analysis.

In this light, an uncharitable judgement would be that the paper published in Nature is more like a white paper than a research one. However, I am optimistic that research conducted in step with engagement around the concerns outlined above will ultimately prove a net positive.

Will we see the impact of this in the next year?

It will take longer than a year to see the results of this work in any clinically realistic setting. We do appear to be at an inflexion point for AI radiology generally, with many companies making progress and starting to move into trials. This reflects in part the suitability of radiology for machine learning (it is tech-driven, with large amounts of data), but is not necessarily evidence of demand.


Signal: Published in Nature; all over the mainstream press; subject of social media discussion

Who to follow: @DrHughHarvey, @EricTopol, @DeepMind, @KheironMedical, @screenpointmed and @Lunit_AI

Other things to read:

  • Lunit AI (and collaborators) (2020): ‘Changes in Cancer Detection and False-Positive Recall in Mammography Using Artificial Intelligence: A Retrospective, Multireader Study’ (paper)
  • NYU (2019): ‘Deep Neural Networks Improve Radiologists’ Performance in Breast Cancer Screening’ (paper)
  • Nico Karssemeijer and co-authors (2016): ‘A Comparison between a Deep Convolutional Neural Network and Radiologists for Classifying Regions of Interest in Mammography’ (paper)

View of current status of radiology AI:

  • From a radiologist’s perspective: ‘RSNA 2019 Roundup’ (blog)
  • From a machine learning perspective (paywall): ‘Artificial Intelligence for Mammography and Digital Breast Tomosynthesis: Current Concepts and Future Perspectives’ (paper)

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

J. Schrittwieser et al. / November 2019 / paper / poster


Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games - the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled - our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.


Reinforcement learning is the subfield of machine learning concerned with learning through interaction with the environment. Given an environment and a goal, a reinforcement learning algorithm tries out (lots of) actions in order to learn which ones allow it to achieve the goal, or maximise its reward.

There are two principal categories of reinforcement learning: model-based and model-free. As you would expect, a model-based algorithm has a model of the environment. Say the objective is to learn to play chess: using model-based learning, the algorithm knows what the legal moves are, given a state of play, so it can plan (or ‘look ahead’) accordingly to optimise its next move – high-performance planning. Model-based reinforcement learning typically works well for logically complex problems, such as chess, in which the rules are known or where the environment can be accurately simulated.
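The ‘look ahead’ step can be sketched in a few lines (a toy example of mine, not DeepMind's code): because the dynamics are known, the agent can exhaustively simulate action sequences before committing to a move.

```python
# Model-based planning sketch: a tiny deterministic world where states are
# integers, actions move -1/+1, and reward is highest at the goal state 3.
def step(state, action):
    """Known model: returns the next state and its reward."""
    nxt = state + action
    return nxt, -abs(nxt - 3)  # reward peaks at the goal state 3

def plan(state, depth):
    """Exhaustively look ahead `depth` moves; return (total reward, first action)."""
    if depth == 0:
        return 0.0, None
    best = (float("-inf"), None)
    for action in (-1, +1):
        nxt, r = step(state, action)
        future, _ = plan(nxt, depth - 1)
        if r + future > best[0]:
            best = (r + future, action)
    return best

_, first_action = plan(0, depth=4)
print(first_action)  # → 1, a step towards the goal state
```

Real tree-search planners such as AlphaZero's Monte Carlo tree search are far more sophisticated, but they rest on the same ability to simulate ‘what happens if I do this?’ without acting.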

In model-free reinforcement learning, the optimal actions are learnt directly from interaction with the environment. When the learner encounters a state that it has seen before, it does not know what the allowed moves are, only that one action previously resulted in a better outcome than another. Model-free reinforcement learning tends not to work well in domains that require precise planning, but it is the state of the art for domains that are difficult to define or simulate precisely, such as visually rich Atari games.

In this paper, DeepMind presents MuZero, a new approach to model-based reinforcement learning that combines the benefits of both high-performance planning and model-free reinforcement learning by learning a representation of the environment. The representation it learns is not the ‘actual’ environment but a pared-down version that needs only to ‘represent state in whatever way is relevant to predicting current and future values and policies. Intuitively, the agent can invent, internally, the rules or dynamics that lead to most accurate planning.’ It matches the performance of existing model-based approaches in games such as chess, Go and Shogi and also achieves state-of-the-art performance in many Atari games.
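Structurally, MuZero learns three functions, named in the paper as the representation, dynamics and prediction networks. The sketch below shows only that structure; the toy bodies are stand-ins of mine, not the real neural networks. The key idea is that planning unrolls a learned hidden-state model, and the game's real rules are never consulted.

```python
# Schematic of MuZero's three learned functions (toy stand-in bodies).
import numpy as np

def representation(observation):
    """h: encode the raw observation into an initial hidden state."""
    return np.tanh(np.asarray(observation, dtype=float))

def dynamics(hidden_state, action):
    """g: predict the next hidden state and the immediate reward."""
    nxt = np.tanh(hidden_state + action)  # stand-in for a learned network
    return nxt, float(nxt.sum())          # toy reward prediction

def prediction(hidden_state):
    """f: predict a policy over actions and a value for this state."""
    logits = np.array([hidden_state.sum(), -hidden_state.sum()])
    policy = np.exp(logits) / np.exp(logits).sum()
    return policy, float(hidden_state.mean())

def rollout_value(observation, action_sequence):
    # Unroll the learned model along a candidate action sequence; the real
    # MuZero does this inside Monte Carlo tree search to score candidate moves.
    s = representation(observation)
    total = 0.0
    for a in action_sequence:
        s, r = dynamics(s, a)
        total += r
    _, value = prediction(s)
    return total + value

print(rollout_value([0.1, -0.2], [1, 0, 1]))
```

Because the hidden state only has to support accurate predictions of reward, policy and value, it is free to discard everything about the environment that is irrelevant to planning.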

Why is it interesting?

DeepMind has been making headlines since 2013, when it first used deep reinforcement learning to play ’70s Atari games such as Pong and Space Invaders. In a series of papers since then, DeepMind has improved its performance in the computer games domain (achieving superhuman performance in many Atari games and StarCraft II) and also smashed records in complex planning games such as chess, Go and Shogi (Japanese chess) that have previously been tackled with ‘brute force’ (that is to say, rules plus processing power, rather than learning).

DeepMind’s researchers show with this latest paper that the same algorithm can be used effectively in both domains – planning and visual ones – where before different learning architectures were required. It does this by taking DeepMind’s AlphaZero architecture (which achieved superhuman performance in chess, Go and Shogi in 2017) and adding the capability to learn its own model of the environment. This makes MuZero a general purpose reinforcement learning approach.

To be clear, MuZero, which was not supplied with the games’ rules, matched the performance of AlphaZero, which was. It also achieved state-of-the-art performance on nearly all of the Atari-57 dataset of games by some margin. That’s an impressive arc of achievement from 2013 to now.

But DeepMind has always had a bigger picture in mind than success in games and simulated environments, and that is to be able to use deep reinforcement learning in complex, real-world systems, enabling it to model the economy, the environment, the weather, pandemics and so on. MuZero takes us one step closer to being able to apply reinforcement learning methods in such situations where we are not even sure of the environment dynamics. In the authors’ words, ‘our method does not require any knowledge of the game rules or environment dynamics, potentially paving the way towards the application of powerful learning methods to a host of real-world domains for which there exists no perfect simulator’.

This transition from constrained research problems to real-world applicability may still be a long way off, but we can already see distinct research problems that would extend MuZero’s capabilities on this path. At the moment, MuZero works for deterministic environments with discrete actions. This means that when an action is chosen, the effect on the environment is always the same: if I use my games console control to move right two steps, my game avatar moves right two steps, for instance. In many reinforcement learning problems, this is not true, and we instead have stochastic environments with more realistic and continuous actions.

Similarly, MuZero was extraordinarily effective at most of the Atari games it played, but it was really challenged on a couple, notably Montezuma’s Revenge, which requires long-term planning and with which deep reinforcement learning algorithms have always struggled.

I look forward to seeing the progress against these and other challenges, bringing the dream of scaling reinforcement learning to large-scale, complex environments that much closer.

Will we see the impact of this in the next year?

Deep reinforcement learning has lagged other machine learning techniques in transferring from ‘lab to live’. We have seen applications in autonomous transportation (e.g. Wayve) and robotics (e.g. Covariant), but in principle, the ability to adapt to an environment over time to maximise a reward should have many applications. Research like MuZero brings these closer.


Signal: NeurIPS Deep RL workshop 2019 (video)

Who to follow: @OpenAI, @DeepMind

Other resources:

Select DeepMind deep reinforcement learning papers:

  • (2013) The original Atari playing deep reinforcement learning model: ‘Playing Atari with Deep Reinforcement Learning’ (paper)
  • (2015) Deep Q-Networks achieve human-like performance on 49 Atari 2600 games (paywall): ‘Human-Level Control through Deep Reinforcement Learning’ (paper)
  • (2015) AlphaGo beats the European Go champion Fan Hui, five games to zero (paywall): ‘Mastering the Game of Go with Deep Neural Networks and Tree Search’ (paper, blog)
  • (2017) AlphaGo Zero learns from self-play: ‘Mastering the Game of Go Without Human Knowledge’ (paper, blog)
  • (2017) A generalised form of AlphaGo Zero, AlphaZero, achieves superhuman performance in high-performance planning scenarios: ‘A General Reinforcement Learning Algorithm that Masters Chess, Shogi and Go Through Self-Play’ (paper, blog)
  • (2019) AlphaStar achieves superhuman performance at StarCraft II (paywall): ‘Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning’ (paper, blog)

Other papers:

  • A previous paper with a method that integrates model-free and model-based RL methods into a single neural network: Junhyuk Oh, Satinder Singh and Honglak Lee (2017): ‘Value Prediction Network’ (paper)
  • Model-based reinforcement learning with a robot arm: A. Zhang et al. (ICLR2019), ‘SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning’ (paper, blog)
  • Complexity, Information and AI session at CogX 2019 with Thore Graepel, Eric Beinhocker and Cesar Hidalgo (video)

Libby Kinsey

Libby is an AI researcher and practitioner. She spent ten years as a VC investing in technology start-ups, and is co-founder of UK AI ecosystem promoter Project Juno. Libby is a Dean's List graduate in Machine Learning from University College London, and has most recently focused her efforts on working with organisations large and small, public and private, to build AI capabilities responsibly.

Azeem Azhar

Azeem is an award-winning entrepreneur, analyst, strategist, investor. He produces Exponential View, the leading newsletter and podcast on the impact of technology on our future economy and society.

Marija Gavrilov

Marija leads business operations at Exponential View. She is also a producer of the Exponential View podcast (Harvard Business Presents Network).

Contact: aicurrents@exponentialview.co