Check out the GitHub repo associated with this post!

Introduction

After finishing the Advanced NLP with spaCy course, I wanted to apply the rule-matching and model-training capabilities of the spaCy framework to a text classification problem. I stumbled upon a large batch of Jeopardy! questions and answers and chose the following challenge to tackle: can we classify Jeopardy! questions as being answered with a single person or not?

The dataset contains 216,930 question-and-answer pairs from Jeopardy! shows that aired between 1984 and 2012. Below is an example of a Jeopardy! question whose answer is a single person; we’ll refer to such pairs as “positive” for the rest of the post.

  • Q: Best known as a writer of children’s stories, he was also a fine photographer; Alice Liddell was a subject
  • A: Lewis Carroll

A single person answer is in contrast to an answer with multiple people:

  • Q: From this duo’s classic routine: “Who’s on first?” “Yes.” “I mean the fellow’s name.” “Who.” “The guy on first.” “Who”
  • A: Abbott and Costello

The variety and complexity of this challenge favor a solution centered around machine learning. But before we can train a model, we need to find and label positive and negative pairs in the dataset. This post will walk through methods to select and label such training data with high precision and to train a text classification model, using the spaCy 2.x Matcher and TextCategorizer modules respectively.

Labeling Training Data

The output of a high-precision labeling function contains a high percentage of positive examples (more than 95%). The recall of any single such function may not be high, but overall recall increases as more of these functions are combined.

Why strive for high precision labeling functions?

  1. The rate-limiting step in almost every nuanced text classification problem is getting labeled data. When a labeler reviews a datum that isn’t a positive example, inefficiency is introduced into the process; therefore, it’s expedient to squeeze out negative examples using automation before passing them along to the labeler.
  2. Some labeling functions have such high precision that review by a labeler is not necessary; essentially, these types of functions are rules (analogous to laws in physics) that the dataset predictably follows. These functions are not only useful for quickly labeling entire pockets of a dataset, but they can also be added as an aid or a feature to the downstream machine learning model.

Let’s further motivate these high precision functions with a lower precision approach to harvesting positive examples.

Low Precision Harvesting

Here we apply spaCy’s built-in NER model to the answers in the dataset and select pairs whose answer contains a single PERSON entity. The function returned 29,064 pairs, and the majority were positive hits; however, enough mistakes were made that we would need to look through those pairs to judiciously select positive examples.
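Here’s a minimal sketch of that selection step, assuming the dataset has been loaded into a hypothetical list of dicts named pairs with “question” and “answer” keys:

```python
import spacy

nlp = spacy.load("en_core_web_lg")  # any pretrained pipeline with an NER component

def answer_is_single_person(answer: str) -> bool:
    """True if the answer text consists of exactly one PERSON entity."""
    doc = nlp(answer)
    return len(doc.ents) == 1 and doc.ents[0].label_ == "PERSON"

# pairs is a hypothetical list of {"question": ..., "answer": ...} dicts
candidates = [p for p in pairs if answer_is_single_person(p["answer"])]
```

Here’s an overview of the mistakes the model made: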

Differential Reference

A person’s name can also be the title of a film or a book.

  • Q: Andrew Marton received a special award for directing the chariot race in this 1959 film
  • A: Ben-Hur
  • Q: Dickens mentions this Defoe work in many of his tales, including “Bleak House” & “A Christmas Carol”
  • A: Robinson Crusoe

Partial Match

Occasionally, the model labeled an answer as a person based on only part of the phrase.

  • Q: During World War II, many women wore this shoulder-length style with the ends curled under from ear-to-ear
  • A: Page Boy
  • Q: Americans who crossed the Berlin Wall to East Berlin used the guard station on Friedrichstrasse nicknamed this
  • A: Checkpoint Charlie
  • Q: Ghosts at the Mounds Theatre in this “twin city” of Minneapolis are said to sit with the audience & watch shows
  • A: St. Paul

Head-Scratchers

There were a good number of just plain mistakes.

  • Q: A culinary profession: ‘panadero’
  • A: baker (breadmaker acceptable)
  • Q: This white whale related to the narwhal is the only whale that can bend its neck
  • A: beluga whale
  • Q: Mel Gibson charges into battle as Scottish avenger William Wallace in this 1995 epic
  • A: Braveheart
  • Q: It seems like I’ve seen this Bill Murray-Andie MacDowell film about a zillion times
  • A: Groundhog Day
  • Q: In 1951 the Peron government seized control of this paper whose name is Spanish for “the press”
  • A: La Prensa

Annotations

In a number of instances the peculiarities of the annotations threw off the model.

  • Q: 1984: “Leap”
  • A: Jump (by Van Halen)

Ambiguous

A couple of the results would be ambiguous even to humans.

  • Q: This foe of Bugs Bunny is a marsupial
  • A: Tasmanian Devil

Low Precision Results Examination

Let’s examine the positive hits returned by the NER labeling function to prepare for the later construction of high-precision labeling functions. Each question contains contextual words and phrases that allow us to infer whether the answer is a PERSON. The sections below take a look at the different forms that context takes.

Subject Pronouns

In many questions, the context is a singular pronoun (he, she) as the subject of a sentence.

  • Q: In 1996, after winning his third U.S. Amateur title, he left Stanford to turn pro
  • A: Tiger Woods
  • Q: She advised Macbeth to “look like th’ innocent flower, But be the serpent under’t”
  • A: Lady Macbeth

Possessive Pronouns

In other questions, the context is a singular possessive pronoun (his, her).

  • Q: “Man with a Broken Nose”, his first sculpture submitted for exhibit to the Paris Salon, was rejected
  • A: Auguste Rodin
  • Q: You may remember there are 3 melting watches in his “The Persistence of Memory”
  • A: Salvador Dali
  • Q: His recently won Nobel Peace Prize is now on display at his presidential library
  • A: Jimmy Carter

Personal Words

Oftentimes, the context takes the form of a personal word (e.g. a job, occupation, or gendered reference) as the subject or object of a sentence or phrase, combined with the word “this”:

  • Q: This Jello pudding pitchman said the darnedest things to kids at Xavier University’s commencement
  • A: Bill Cosby
  • Q: “Whenever we go out, the people always shout, there goes” this man
  • A: John Jacob Jingleheimer Schmidt
  • Q: Jack Ryan helps avert a nuclear war in “The Sum of All Fears” by this author; thanks, Jack!
  • A: (Tom) Clancy
  • Q: “Hocus Pocus” was a 1990 book by this “Cat’s Cradle” novelist
  • A: Kurt Vonnegut
  • Q: Bully for Len Cariou, who played this famous man in the musical “Teddy And Alice”
  • A: Theodore Roosevelt

Quotations

Some questions are beyond the scope of this blog post, such as those where you must infer that the question is a quotation and then name the person to whom it is attributed.

  • Q: “If one leads a country such as Britain… then you must have a touch of iron about you”
  • A: Margaret Thatcher

High Precision Function Construction

In the previous section we outlined three general categories for labeling functions, but we’ll only implement subject pronouns and personal words since the possessive pronouns weren’t high precision.

We use spaCy’s Matcher module with the patterns below, which catch “he” or “she” as the subject of a sentence and any instance of “this {personal word}”. The list of personal words was scraped from this source and contains common job and occupation titles.

[{"TAG": "PRP", "DEP": "nsubj", "LOWER": {"IN": ["he", "she"]}}]
[{"LOWER": "this"}, {"LOWER": personal_word}] for personal_word in PERSONAL_WORDS

Using these filters, we whittle the 29,064 low-precision positives down to 9,128 high-precision positives! We have several options for what to do next:

  • Train a model with the high precision positives (this post)
  • Create more high precision labeling functions to increase the training set
  • Remove the high-precision positives from the initial set and use a labeling tool to annotate the remaining 19,936 (yikes)

We used the inverse of our original labeling function to select our negative examples, as sketched below.
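Reusing the hypothetical answer_is_single_person helper from the earlier sketch:

```python
# Inverse of the original NER labeling function: pairs whose answer is not a
# single PERSON entity become our negative examples.
negatives = [p for p in pairs if not answer_is_single_person(p["answer"])]
```

Now we are ready for machine learning!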

Train and Evaluate a Text Classifier

Using spaCy’s TextCategorizer module, we trained a CNN text classifier on 80% of the labeled data and evaluated it on the other 20%. Details on the training can be found in the [Jupyter notebook].
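Here’s a minimal sketch of the training loop using the spaCy 2.x API; train_data stands in for the 80% split, and the batch size and dropout values are illustrative:

```python
import random
import spacy
from spacy.util import minibatch

nlp = spacy.blank("en")
textcat = nlp.create_pipe(
    "textcat",
    config={"exclusive_classes": True, "architecture": "simple_cnn"},
)
nlp.add_pipe(textcat)
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# train_data: [(question, {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}), ...]
optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=8):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
    print(epoch + 1, losses["textcat"])
```

Here are the stats for the 10 epochs: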

Epoch  Loss   Precision  Recall  F-score
  1    4.270    0.706     0.805    0.752
  2    0.167    0.717     0.806    0.759
  3    0.151    0.728     0.800    0.762
  4    0.133    0.731     0.791    0.760
  5    0.119    0.731     0.787    0.758
  6    0.102    0.729     0.776    0.752
  7    0.086    0.729     0.769    0.749
  8    0.072    0.734     0.771    0.752
  9    0.061    0.733     0.755    0.744
 10    0.054    0.729     0.743    0.736

We’ll apply the model from the third epoch, with an F-score of 0.762, to the test set of 43,386 pairs. With a threshold of 0.5, the model tags 3,565 pairs as positive.
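Scoring a trained TextCategorizer is a matter of reading doc.cats; a minimal sketch, assuming nlp is the pipeline trained above:

```python
def tag_pair(question: str, threshold: float = 0.5) -> bool:
    """Run the trained pipeline and apply the decision threshold."""
    doc = nlp(question)
    return doc.cats["POSITIVE"] >= threshold
```

Below is a random sample of the tagged pairs: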

  • Q: Lebanese designer Elie Saab created the gown she wore at the 1999 coronation of her husband, King Abdullah
  • A: Rania
  • Score: {'POSITIVE': 0.8545047044754028, 'NEGATIVE': 0.1454952508211136}
  • Q: Samuel Adams referred to this April 19, 1775 battle when he said, “What a glorious morning for America!”
  • A: Lexington, Concord
  • Score: {'POSITIVE': 0.6066977977752686, 'NEGATIVE': 0.39330214262008667}
  • Q: Weakened by scarlet fever, Beth March was sentenced to death by this author
  • A: (Louisa May) Alcott
  • Score: {'POSITIVE': 0.8984272480010986, 'NEGATIVE': 0.10157273709774017}
  • Q: Born in Vancouver, he’s brought his curly hair & rubber face to “Funny People” as well as to “Superbad”
  • A: Seth Rogen
  • Score: {'POSITIVE': 0.5894494652748108, 'NEGATIVE': 0.4105505347251892}
  • Q: He’s credited with the remark “L’etat c’est moi”, “I am the state”
  • A: Louis XIV
  • Score: {'POSITIVE': 0.5193939208984375, 'NEGATIVE': 0.4806060492992401}

This very small sample shows a precision of around 0.8, and reviewing 100 samples confirms that estimate. It is right in line with our validation metrics!

A review of 100 random samples from the entire test set suggests the recall is low (approximately 0.3, with the model correctly labeling 5 of the 15 positives). Logical next steps would be to write more labeling functions or to start manually labeling the bulk of the training data.

Conclusion

When diving into an NLP problem from scratch, high-precision labeling functions can help reduce the burden of labeling by either aiding or fully automating the labeling process. The spaCy library makes it easy to do just that!

Acknowledgements

Much of what I’ve learned about labeling in high precision for NLP has its genesis in projects I’ve worked on while at Proofpoint. I’m grateful for my immediate team of Brian Jones, Jeremy Jordan, and Zack Abzug.