Introduction:

Extractive Question Answering systems usually generate answers from raw text, but in some cases it can be useful to run them on an existing set of FAQs instead.
This can be appealing for several reasons:
- We already have the data
- Inference time is reduced
- We have more control over the answers
The drawback is that this approach generalizes well only to questions similar to those already in the FAQs, so it suits only certain situations.

Create a simple DocumentStore:

As we saw before, InMemoryDocumentStore is an easy way to create a DocumentStore for simple prototyping.

from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore()

Create a Retriever using embeddings:

The idea here is to create embeddings for the questions we will get from users.
These embeddings are then matched against the embeddings of the FAQ questions we already have.
First we set up a retriever with the embedding model we want to use.
Then we use the same model to embed the FAQ questions that make up our dataset.

from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    use_gpu=True,
    scale_score=False,
)
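
To make the matching idea concrete, here is a minimal sketch in plain Python. It is not how the retriever is implemented: the three-dimensional vectors are made up (a real model produces vectors with hundreds of dimensions), the helper names `cosine_similarity` and `most_similar_questions` are hypothetical, and cosine similarity is an assumption, since the document store may be configured to use a different measure such as the dot product.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_questions(query_embedding, faq_embeddings, top_k=1):
    """Return the top_k FAQ questions ranked by similarity to the query."""
    ranked = sorted(
        faq_embeddings.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [question for question, _ in ranked[:top_k]]


# Made-up toy embeddings for two FAQ questions (illustrative values only)
faq_embeddings = {
    "What is a novel coronavirus?": [0.9, 0.1, 0.0],
    "What is the source of the virus?": [0.1, 0.9, 0.2],
}

# Made-up embedding of an incoming user question
query_embedding = [0.2, 0.8, 0.1]

best = most_similar_questions(query_embedding, faq_embeddings, top_k=1)
print(best)
```

The user question is never compared with the FAQ text directly; only the vectors are compared, which is why the same model has to embed both sides.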

Download the FAQ dataset

# Download the zipped FAQ dataset and unpack it into doc_dir
from haystack.utils import fetch_archive_from_http
doc_dir = 'data/tutorial'
urls = 'https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip'
fetch_archive_from_http(url = urls, output_dir = doc_dir)

True

Here we use pandas to manipulate the dataset we just downloaded.
First, create the DataFrame:

import pandas as pd
df = pd.read_csv(f'{doc_dir}/small_faq_covid.csv')
df.head()

	question	answer	answer_html	link	name	source	category	country	region	city	lang	last_update
0	What is a novel coronavirus?	A novel coronavirus is a new coronavirus that ...	<p>A novel coronavirus is a new coronavirus th...	\nhttps://www.cdc.gov/coronavirus/2019-ncov/fa...	Frequently Asked Questions	Center for Disease Control and Prevention (CDC)	Coronavirus Disease 2019 Basics	USA	NaN	NaN	en	2020/03/17
1	Why is the disease being called coronavirus di...	On February 11, 2020 the World Health Organiza...	<p>On February 11, 2020 the World Health Organ...	\nhttps://www.cdc.gov/coronavirus/2019-ncov/fa...	Frequently Asked Questions	Center for Disease Control and Prevention (CDC)	Coronavirus Disease 2019 Basics	USA	NaN	NaN	en	2020/03/17
2	Why might someone blame or avoid individuals a...	People in the U.S. may be worried or anxious a...	<p>People in the U.S. may be worried or anxiou...	\nhttps://www.cdc.gov/coronavirus/2019-ncov/fa...	Frequently Asked Questions	Center for Disease Control and Prevention (CDC)	Coronavirus Disease 2019 Basics	USA	NaN	NaN	en	2020/03/17
3	How can people help stop stigma related to COV...	People can fight stigma and help, not hurt, ot...	<p>People can fight stigma and help, not hurt,...	\nhttps://www.cdc.gov/coronavirus/2019-ncov/fa...	Frequently Asked Questions	Center for Disease Control and Prevention (CDC)	How It Spreads	USA	NaN	NaN	en	2020/03/17
4	What is the source of the virus?	Coronaviruses are a large family of viruses. S...	<p>Coronaviruses are a large family of viruses...	\nhttps://www.cdc.gov/coronavirus/2019-ncov/fa...	Frequently Asked Questions	Center for Disease Control and Prevention (CDC)	How It Spreads	USA	NaN	NaN	en	2020/03/17

Since we downloaded the data from the internet, we should clean it properly before any further processing.

# Replace missing values (NaN) with empty strings
df.fillna(value='', inplace=True)
# Remove leading and trailing whitespace from every question
df["question"] = df["question"].apply(lambda x: x.strip())
questions = list(df.question.values)

Create embeddings for the questions

# Use item assignment so that pandas creates a real "embedding" column;
# attribute assignment (df.embedding = ...) would not add a column
df["embedding"] = retriever.embed_queries(queries=questions).tolist()
len(df["embedding"])
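
Pandas creates a column only through item assignment (`df["embedding"] = ...`); assigning to a new attribute name (`df.embedding = ...`) just sets a Python attribute on the object, so the values never reach `to_dict` or the DocumentStore. The sketch below illustrates the difference on a tiny DataFrame with made-up vectors; the variable names are illustrative and the pandas warning is silenced only to keep the output clean.

```python
import warnings

import pandas as pd

# Tiny stand-in for the FAQ DataFrame, with made-up embedding vectors
df_demo = pd.DataFrame({"question": ["first question?", "second question?"]})
vectors = [[0.1, 0.2], [0.3, 0.4]]

# Attribute assignment: sets a plain Python attribute, no column is created
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    df_demo.embedding = vectors
column_after_attribute = "embedding" in df_demo.columns

# Item assignment: creates a real column that is exported with the rows
df_demo["embedding"] = vectors
column_after_item = "embedding" in df_demo.columns

records = df_demo.to_dict(orient="records")
print(column_after_attribute, column_after_item)
print(records[0])
```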

Rename the question column to content, the field that holds the document text

df = df.rename(columns={'question':'content'})

Convert the DataFrame into a list of dicts and index them in our DocumentStore

docs_to_index = df.to_dict(orient='records')
document_store.write_documents(docs_to_index)
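
Before indexing, it can help to confirm that every record really carries both the text and its embedding. The following is a minimal sketch of such a check in plain Python; `records_missing_fields` is a hypothetical helper, not part of Haystack, and the sample records use placeholder answers and made-up vectors.

```python
def records_missing_fields(records, required=("content", "embedding")):
    """Return (index, missing_field_names) for every incomplete record."""
    problems = []
    for index, record in enumerate(records):
        missing = [name for name in required if name not in record]
        if missing:
            problems.append((index, missing))
    return problems


# Placeholder records standing in for the output of df.to_dict(orient='records')
sample_records = [
    {"content": "What is a novel coronavirus?", "answer": "placeholder", "embedding": [0.1, 0.2]},
    {"content": "What is the source of the virus?", "answer": "placeholder"},
]

problems = records_missing_fields(sample_records)
print(problems)
```

An empty result means every record has the required fields; any entry in the list points at a row that would be indexed without an embedding.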

Ask questions

We first need to initialize the pipeline:

from haystack.pipelines import FAQPipeline
pipe = FAQPipeline(retriever=retriever)

from haystack.utils import print_answers

# Run any question and change top_k to see more or fewer answers
prediction = pipe.run(query="How is the virus spreading?", params={"Retriever": {"top_k": 1}})

print_answers(prediction, details="medium")

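The recorded output below shows what happens when the documents reach the store without an embedding field (for example, when the embeddings are set as a DataFrame attribute rather than as a column): the store skips them and no answers are returned.

WARNING:haystack.document_stores.memory:Skipping some of your documents that don't have embeddings. To generate embeddings, run the document store's update_embeddings() method.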

'Query: How is the virus spreading?'
'Answers:'
[]