Proofpoint sent me to NAACL 2019, which was my first time attending an NLP conference. I have a few main takeaways that I wanted to share!

Transfer Learning Tutorial

Sebastian Ruder and his co-authors from AI2, CMU, and Hugging Face marched through 220 slides of practical tips and tricks for applying this new class of Transformer-based language models, notably BERT, to particular target tasks. I give a brief summary below in case you don't have 4 hours to re-watch the tutorial.

The goal of transfer learning is to improve performance on a target task by applying knowledge gained through sequential or simultaneous training on a set of source tasks, as summarized by the diagram below from A Survey of Transfer Learning:

Traditional Machine Learning vs. Transfer Learning

There are three general keys to successful transfer learning: finding the set of source tasks that produce generalizable knowledge, selecting a method of knowledge transfer, and combining the generalizable knowledge with task-specific knowledge. Learning higher-order concepts that generalize is crucial to the transfer. In image processing, those concepts are lines, shapes, and patterns. In natural language processing, they are syntax, semantics, morphology, and subject-verb agreement.

Finding the right set of source tasks is important! Language modeling has been the task of choice for a while now. The transfer medium has also been maturing over the years: word2vec and skip-thoughts stored their knowledge in the vectors they produced, but now the language model itself is the generalized knowledge. Quite the paradigm shift! Adapting these contextual models to a target task then requires the gradual introduction of target-specific language.
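To make that shift concrete, here is a rough sketch of my own (not from the tutorial), assuming the Hugging Face transformers library, with a made-up model choice and example sentence: instead of looking up frozen vectors, you load the whole pretrained network and keep training it on the target task.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# The pretrained language model itself is the transferred knowledge,
# not a fixed table of word vectors.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Fine-tune on a (made-up) target-task example: every weight in the
# pretrained network can be nudged toward the target task.
inputs = tokenizer("This message looks like a phishing attempt.", return_tensors="pt")
labels = torch.tensor([1])
loss = model(**inputs, labels=labels).loss
loss.backward()
```

The contrast with word2vec is the point: there the transfer stops at the embedding lookup, whereas here every layer carries over and keeps learning.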

Finally, how do you optimize these models? A variety of techniques were proposed.
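If I remember right, one technique in this family is discriminative learning rates: the layers closest to the pretrained knowledge get smaller updates than the freshly initialized task head. A rough sketch, reusing the model from the snippet above (the exact rates here are made up, not a recommendation):

```python
from torch.optim import AdamW

# Lower layers hold the most general knowledge, so they get the smallest
# learning rates; the task-specific classifier head gets the largest.
optimizer = AdamW([
    {"params": model.bert.embeddings.parameters(), "lr": 1e-5},
    {"params": model.bert.encoder.parameters(), "lr": 2e-5},
    {"params": model.classifier.parameters(), "lr": 5e-5},
])
optimizer.step()  # called after loss.backward(); applies the per-group rates
```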

Probing Language Models

Researchers are only beginning to develop the tooling necessary to understand these large models, and a number of papers at the conference highlighted this research effort.
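The basic recipe behind a lot of this probing work is simple: freeze the pretrained encoder and train only a small classifier on top of its hidden states, so whatever signal the probe finds must already be present in the representations. Here's a rough sketch of my own (again assuming Hugging Face transformers, with a made-up tag count):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
for param in encoder.parameters():
    param.requires_grad = False  # the encoder is inspected, never updated

num_tags = 17  # e.g. a part-of-speech tag set; illustrative only
probe = torch.nn.Linear(encoder.config.hidden_size, num_tags)

inputs = tokenizer("Proofpoint sent me to NAACL.", return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
logits = probe(hidden)                        # per-token predictions to train the probe on
```

If the frozen representations let a linear probe predict, say, part-of-speech tags well, that's evidence the model already encodes that syntactic information.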

BERT

BERT won best paper, which was no surprise. Because of the impact of preprints, ELMo felt like old news by the time the conference actually arrived. This created a dissonance between what many of the papers were adapting (ELMo) and what the state of the art was at the moment (BERT).