Open-Domain Question Answering: An Introduction to the Topic

Text can be regarded as an unstructured knowledge base. That implies that extracting information from it is not straightforward and can be work-intensive. This is in contrast to a structured knowledge base, for example Freebase, WikiData, or any other database. It is easy to extract information from structured knowledge bases, but it is also time-consuming to organize data in that form. There is no doubt that it would be useful to be able to ask any question and have a system come up with an answer. For a long time, access to information was a privilege of the wealthy few. It no longer is. The popularity of search engines was born from the need to organize freely available knowledge. They index a vast number of websites to extract the ones most relevant to the provided query. But to appreciate the engineering excellence that gave humankind Google, we need to look a bit into the problem at hand.

Table of contents

  • Introduction to QA systems – IR systems
  • What is closed-domain question answering?
  • What is open-domain question answering?
  • What are open-book and closed-book systems?
  • Examples of popular open-source QA systems
  • Extractive and abstractive QA systems
  • How are QA systems trained?
  • The most popular datasets for Question Answering
  • Summary

Introduction to QA systems – IR systems

Before delving into Question Answering (QA) systems, let’s discuss Information Retrieval (IR) systems. IR is the problem of selecting relevant information (or a document containing relevant information) from a storage system. An IR system should return most of the relevant documents (have high recall), most of the returned documents should be accurately on the searched topic (have high precision), and it should return them in order of relevance. A good example is the previous generation of the Google engine, the one before answers came highlighted; the current one behaves more like a QA system.

To name just a few IR techniques (a minimal BM25 sketch follows the list):

  • Inverted index
  • BM25
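
As an illustration, here is a minimal, self-contained sketch of BM25 scoring in pure Python (k1 = 1.5 and b = 0.75 are conventional defaults; a real engine would additionally use an inverted index to score only candidate documents rather than the whole collection):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in tokenized:
        df.update(set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "einstein won the nobel prize in physics in 1921",
    "the photoelectric effect was explained by einstein",
    "bm25 ranks documents by term frequency and term rarity",
]
print(bm25_scores("einstein nobel prize", docs))
```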

There are three steps in any QA system:

  1. Analysis of a question
  2. Search
  3. Choosing an answer

A natural extension of an IR system is a QA system, which not only returns ranked documents but also returns a selected part of a document – much like today’s Google. The toy skeleton below puts these three steps together.
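
It reuses the bm25_scores function sketched above; extract_span is a deliberately naive stand-in for a reader (a real system would use a trained reading-comprehension model here):

```python
def extract_span(query, doc):
    # Toy "reader": return the sentence sharing the most words with the query.
    sentences = doc.split(". ")
    overlap = lambda s: len(set(s.lower().split()) & set(query.lower().split()))
    return max(sentences, key=overlap)

def answer(question, corpus):
    # 1. Analysis of the question (here only trivial normalisation).
    query = question.strip().rstrip("?")
    # 2. Search: rank all documents with BM25 and keep the best one.
    scores = bm25_scores(query, corpus)
    best_doc = corpus[scores.index(max(scores))]
    # 3. Choosing an answer: the reader selects a span from that document.
    return extract_span(query, best_doc)
```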

Question answering systems (QA systems for short) come in the following types:

  1. closed domain
  2. open domain

and each of these types can be divided into open and closed book QA systems.

What is closed-domain question answering?

Closed-domain question answering (CDQA) is a broad name for the task of answering questions from a single domain only, for example legal, medical, or engineering.

What is open-domain question answering?

Open-domain question answering (ODQA) is the task of answering a question from any domain. In this way, a trained model can be asked a question about anything.

ODQA systems are typically based on information retrieval techniques: first a document relevant to the query is located, and then NLP techniques are used to extract the relevant parts of the document.

An ODQA system is given a question and should formulate an answer. For example, for “What did Albert Einstein win the Nobel Prize for?” the answer is “The law of the photoelectric effect” – a question whose true answer is clearly objective.

Following [1], there are three classes of questions that a trained ODQA system should be able to answer, listed in increasing order of difficulty:

  1. the most basic behaviour is to reliably recall, at test time, the answer to a question that the model has seen at training time
  2. a model should be able to answer novel questions at test time, choosing an answer from the set of answers it has seen during training
  3. a strong system should be able to answer novel questions whose answers are not contained in the training data

Fig. 1. Overview of three open-domain QA frameworks. Source: https://lilianweng.github.io/lil-log/2020/10/29/open-domain-question-answering.html

What are open-book and closed-book systems?

Following [1], both ODQA and CDQA systems can be divided into open-book and closed-book systems.

Open-book systems are systems that retrieve a relevant passage about any topic from some knowledge base, for example Wikipedia. An example of such a system is the Dense Passage Retrieval (DPR) model, which retrieves documents based on dense embeddings before feeding them into a conventional reader–reranker that extracts spans of text as answers.
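
For illustration, here is a minimal retrieval sketch built on the publicly released DPR checkpoints from the Hugging Face hub (the passages are toy examples; a real system would pre-encode millions of passages and search them with an index such as FAISS):

```python
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_name = "facebook/dpr-question_encoder-single-nq-base"
c_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_tok = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
q_enc = DPRQuestionEncoder.from_pretrained(q_name)
c_tok = DPRContextEncoderTokenizer.from_pretrained(c_name)
c_enc = DPRContextEncoder.from_pretrained(c_name)

passages = [
    "Einstein received the 1921 Nobel Prize in Physics for the law of the photoelectric effect.",
    "Denver is the capital and most populous city of Colorado.",
]

with torch.no_grad():
    # Dense embeddings for the question and for every candidate passage.
    q_emb = q_enc(**q_tok("What did Einstein win the Nobel Prize for?",
                          return_tensors="pt")).pooler_output
    p_emb = c_enc(**c_tok(passages, return_tensors="pt",
                          padding=True, truncation=True)).pooler_output

scores = q_emb @ p_emb.T          # dot-product similarity, shape (1, n_passages)
print(passages[scores.argmax()])  # the passage handed on to the reader
```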

Fig. 2. The retriever-reader QA framework combines information retrieval with machine reading comprehension. Source: https://lilianweng.github.io/lil-log/2020/10/29/open-domain-question-answering.html

Another model worth mentioning is Retrieval-Augmented Generation (RAG), which jointly learns to retrieve and to generate answers, combining dense retrieval with a BART model (it follows a sequence-to-sequence setup).

The current state-of-the-art system is Fusion-in-Decoder (FiD), which is based on a large T5 model: it retrieves 100 documents, encodes each of them independently, and fuses the encodings so that the decoder can attend to all documents at once.

Some of the most popular open-source Question Answering systems are:

  • START (ODQA) – answers natural-language questions by presenting components of text and multimedia information
  • QuALiM (published by Microsoft, ODQA) – retrieves textual and graphical information by means of Wikipedia
  • HONqa (a domain-restricted, i.e. closed-domain, QA system) – operated by the Health On the Net Foundation (Switzerland), which aims to promote reliable medical information
  • MedQA (a closed-domain question answering system) – concerned with health care; analyses thousands of documents to arrive at a coherent response

Closed-book question answering systems feature enormous models with millions or billions of trainable parameters, which store a lot of knowledge about the world absorbed from their training datasets.

Closed-book models have to preserve information in their parameters, so they are like students at a final exam: they can be asked any question regarding past years of study and, without looking into their notes, they should come up with an answer. Typically, these models follow a sequence-to-sequence architecture.
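
A minimal closed-book sketch, assuming the google/t5-small-ssm-nq checkpoint (a T5 fine-tuned for closed-book answering on Natural Questions) is available on the Hugging Face hub:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed checkpoint: T5 fine-tuned for closed-book QA on Natural Questions.
name = "google/t5-small-ssm-nq"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# No context passage is provided: the answer must come from the model's parameters.
inputs = tok("What did Albert Einstein win the Nobel Prize for?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(output[0], skip_special_tokens=True))
```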

Extractive and abstractive QA systems

Open-book CDQA and ODQA tasks are based on the following premise: a system is presented with a piece of text based on which it is expected to formulate an answer. It is the same as the reading comprehension task that many students are familiar with and that appears on a wide variety of exams (SAT, ACT, etc.). For example, given a 100-word abstract about the life of Henry Ford, the system should be able to conclude that the person discussed was an engineer born in the 19th century. Question answering systems can be divided into the following:

  • extractive question answering
  • abstractive question answering

where the first focuses on extracting the most relevant piece of information from a text, under the restriction that the answer must be a sequence of consecutive words, while abstractive question answering is a task in which a system abstracts an answer from what is contained in the text. An example comes in handy here. Given the text “Building is located in a city of Denver at the street number of 1st Avenue South, 80014 being the zip code, Colorado” and the question “Where is the building located?”, an extractive system can only come up with a literal span such as “Denver at the street number of 1st Avenue South, 80014 being the zip code, Colorado”. The abstractive system would “say”: “Denver, 1st Avenue South, 80014, Colorado”. It is clear that abstractive question answering is of much more interest here and in most conceivable scenarios, because it is not reasonable to expect that the desired answer will always appear as a consecutive sequence of words.
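
The extractive behaviour is easy to reproduce with the Hugging Face question-answering pipeline; the example below assumes the deepset/roberta-base-squad2 checkpoint, but any SQuAD-style extractive model would do:

```python
from transformers import pipeline

# An extractive reader: it can only return a span copied from the context.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = ("Building is located in a city of Denver at the street number of "
           "1st Avenue South, 80014 being the zip code, Colorado")
result = qa(question="Where is the building located?", context=context)
print(result["answer"], result["score"])
```

Note that the returned answer is always a literal substring of the context, which is exactly the limitation discussed above.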

How are Question Answering systems trained?

BERT is a well-recognized language model which can be used for question answering (ODQA, open book). It is available in the Hugging Face library. It is pre-trained on two objectives: Masked Language Modelling and Next Sentence Prediction. To turn it into a QA system, the pre-trained model is then fine-tuned on a dataset such as SQuAD to predict the start and end positions of the answer span within a passage.
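
Under the hood, a SQuAD-fine-tuned BERT produces one start logit and one end logit per token of the passage, and the predicted answer is the span between the two argmaxes; a minimal sketch using a publicly available fine-tuned checkpoint:

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tok = BertTokenizer.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)

question = "What did Albert Einstein win the Nobel Prize for?"
context = ("Albert Einstein was awarded the 1921 Nobel Prize in Physics "
           "for his discovery of the law of the photoelectric effect.")

inputs = tok(question, context, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# One start logit and one end logit per token; the answer is the span
# between the argmax of the start logits and the argmax of the end logits.
start = out.start_logits.argmax()
end = out.end_logits.argmax()
print(tok.decode(inputs["input_ids"][0][start:end + 1]))
```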

The most popular datasets for Question Answering are the following:

  • SQuAD – the Stanford Question Answering Dataset
  • WebQuestions – contains 3,778 training and 2,032 test question-answer pairs. The questions were obtained by mining a search engine, and the answers are Freebase entities annotated by crowd workers. The ODQA task consists of predicting the name of the Freebase entity.
  • TriviaQA – a dataset of ~79k training, ~9k development, and ~11k test question-answer pairs obtained by scraping trivia websites. Answers consist of Wikipedia entities, and any alias of the answer entity is considered a correct answer.
  • Open Natural Questions – consists of search-engine questions with answers annotated as spans in Wikipedia articles by crowd workers. Usually the answer has fewer than 6 tokens (words).
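
All of these datasets are easy to inspect with the Hugging Face datasets library; for example SQuAD, where each example carries a question, a context passage, and character-indexed answer spans:

```python
from datasets import load_dataset

squad = load_dataset("squad", split="train")
example = squad[0]
print(example["question"])
print(example["context"][:100])
print(example["answers"])  # {'text': [...], 'answer_start': [...]}
```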

Summary

Open-domain question answering systems are a subfield of NLP (used also for more effective search engine optimization) that has recently attracted much more research, with a surge of publications. As of now the results, although often not 100% accurate, provide real value to users. The increase in research effort in this field promises new breakthroughs, which will surely improve the efficiency of information-intensive work.

[1] Lewis, P., Stenetorp, P. and Riedel, S., 2020. Question and answer test-train overlap in open-domain question answering datasets. arXiv preprint arXiv:2008.02637.