Utilizing Existing FAQs for Question Answering

Haystack
LLMs
QAs
NLP
Pandas
Author

Ismail TG

Published

October 17, 2023

Introduction:

  • Extractive question answering systems usually pull their answers out of raw text, but in some cases it is useful to answer from an existing set of FAQs instead.
  • This can be appealing for several reasons:
    • We already have the data
    • Inference time is reduced
    • We get more control over the answers
  • The drawback is that this approach only generalizes to questions similar to those already in the FAQ, so it is a good fit for certain situations only.

Create simple DocumentStore:

  • As we saw before, InMemoryDocumentStore is an easy way to create a DocumentStore for simple prototyping.
from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore()

Create a Retriever using embeddings:

  • The idea here is to create embeddings for the questions we get from users
  • These embeddings are then matched against the embeddings of the FAQ questions we already have
  • First we create a retriever with the embedding model we want to use
  • Then we apply the same model to the FAQ questions that serve as our dataset
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(document_store = document_store,
                               embedding_model = "sentence-transformers/all-MiniLM-L6-v2",
                               use_gpu = True,
                               scale_score = False)
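To make the matching idea concrete, here is a minimal sketch that does not use Haystack at all. It assumes a toy `embed` function based on word counts in place of the sentence-transformers model, and two made-up FAQ entries; all names in it are illustrative only. It shows the general principle (embed the stored questions once, embed the incoming query the same way, rank by similarity), not how EmbeddingRetriever is implemented internally.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence-embedding model: bag-of-words counts."""
    return Counter(text.lower().strip("?").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical FAQ entries, not taken from the dataset used in this post
faqs = [
    {"question": "How do I reset my password?",
     "answer": "Use the 'Forgot password' link on the login page."},
    {"question": "How do I delete my account?",
     "answer": "Open Settings and choose 'Delete account'."},
]

# Embed the stored questions once, up front
for faq in faqs:
    faq["embedding"] = embed(faq["question"])

def retrieve(query, top_k=1):
    """Return the top_k (score, faq) pairs, most similar first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, faq["embedding"]), faq) for faq in faqs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

score, best = retrieve("How can I reset my password?")[0]
print(round(score, 3), best["answer"])

# A question unlike anything in the FAQ still gets a "best" match, only with a low score
unrelated_score, _ = retrieve("What is the refund policy?")[0]
print(unrelated_score)
```

The second query illustrates the limitation mentioned in the introduction: the retriever always returns its nearest stored question, so the score is the only signal that the match may be poor.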
  • Download the FAQ dataset
# download
from haystack.utils import fetch_archive_from_http
doc_dir = 'data/tutorial'
urls = 'https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip'
fetch_archive_from_http(url = urls, output_dir = doc_dir)
True
  • Here we use Pandas to manipulate the dataset we just downloaded.
  • First create the dataframe
import pandas as pd
df = pd.read_csv(f'{doc_dir}/small_faq_covid.csv')
df.head()
|   | question | answer | answer_html | link | name | source | category | country | region | city | lang | last_update |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What is a novel coronavirus? | A novel coronavirus is a new coronavirus that ... | <p>A novel coronavirus is a new coronavirus th... | \nhttps://www.cdc.gov/coronavirus/2019-ncov/fa... | Frequently Asked Questions | Center for Disease Control and Prevention (CDC) | Coronavirus Disease 2019 Basics | USA | NaN | NaN | en | 2020/03/17 |
| 1 | Why is the disease being called coronavirus di... | On February 11, 2020 the World Health Organiza... | <p>On February 11, 2020 the World Health Organ... | \nhttps://www.cdc.gov/coronavirus/2019-ncov/fa... | Frequently Asked Questions | Center for Disease Control and Prevention (CDC) | Coronavirus Disease 2019 Basics | USA | NaN | NaN | en | 2020/03/17 |
| 2 | Why might someone blame or avoid individuals a... | People in the U.S. may be worried or anxious a... | <p>People in the U.S. may be worried or anxiou... | \nhttps://www.cdc.gov/coronavirus/2019-ncov/fa... | Frequently Asked Questions | Center for Disease Control and Prevention (CDC) | Coronavirus Disease 2019 Basics | USA | NaN | NaN | en | 2020/03/17 |
| 3 | How can people help stop stigma related to COV... | People can fight stigma and help, not hurt, ot... | <p>People can fight stigma and help, not hurt,... | \nhttps://www.cdc.gov/coronavirus/2019-ncov/fa... | Frequently Asked Questions | Center for Disease Control and Prevention (CDC) | How It Spreads | USA | NaN | NaN | en | 2020/03/17 |
| 4 | What is the source of the virus? | Coronaviruses are a large family of viruses. S... | <p>Coronaviruses are a large family of viruses... | \nhttps://www.cdc.gov/coronavirus/2019-ncov/fa... | Frequently Asked Questions | Center for Disease Control and Prevention (CDC) | How It Spreads | USA | NaN | NaN | en | 2020/03/17 |
  • Since we downloaded the data from the internet, we should clean it before any further processing: fill the missing values and strip stray whitespace from the questions
df.fillna(value='', inplace= True)
df["question"] = df["question"].apply(lambda x: x.strip())
questions = list(df.question.values)
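The effect of these cleaning steps is easier to see on a tiny frame. The sketch below uses two made-up rows (padded question text and an empty region column) rather than the downloaded file, and uses `.str.strip()` as an equivalent of the `apply` call above.

```python
import pandas as pd

# Hypothetical rows mimicking the FAQ file: padded question text and missing values
toy = pd.DataFrame({
    "question": ["  What is a novel coronavirus? ", "What is the source of the virus?\n"],
    "region": [None, None],
})

# Replace missing values with empty strings so no NaN/None ends up in the indexed metadata
toy = toy.fillna(value="")

# Remove leading/trailing whitespace (including newlines) from every question
toy["question"] = toy["question"].str.strip()

toy_questions = toy["question"].tolist()
print(toy_questions)
print(toy["region"].tolist())
```

Stripping matters here because the questions are what gets embedded, so stray whitespace or newlines would become part of the model input.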
  • Create embeddings for the questions
# Assign with brackets: df.embedding = ... would only set an attribute and not create a column
df["embedding"] = retriever.embed_queries(queries=questions).tolist()
len(df["embedding"])
213
  • Rename the question column to content, the key under which the DocumentStore expects the text of each document
df = df.rename(columns={'question':'content'})
  • Convert the dataframe into a list of dicts and index them in our DocumentStore
docs_to_index = df.to_dict(orient='records')
document_store.write_documents(docs_to_index)
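How the embedding column is assigned decides whether the embeddings reach the DocumentStore at all. The following sketch uses a two-row frame and short made-up vectors in place of the real output of `retriever.embed_queries(...)`. It compares attribute assignment with bracket assignment and shows what `to_dict(orient='records')` produces in each case.

```python
import warnings
import pandas as pd

faq_df = pd.DataFrame({"question": ["q1", "q2"], "answer": ["a1", "a2"]})
vectors = [[0.1, 0.2], [0.3, 0.4]]  # stand-in for retriever.embed_queries(...).tolist()

# Attribute assignment: pandas warns and stores a plain Python attribute, not a column
broken = faq_df.copy()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    broken.embedding = vectors
attribute_created_column = "embedding" in broken.columns
broken_records = broken.to_dict(orient="records")

# Bracket assignment: creates a real column that survives to_dict
fixed = faq_df.copy()
fixed["embedding"] = vectors
fixed = fixed.rename(columns={"question": "content"})
records = fixed.to_dict(orient="records")

print(attribute_created_column)          # no column was created
print("embedding" in broken_records[0])  # so the records carry no embedding
print(records[0])
```

Only the records built from the bracket-assigned frame carry an `embedding` key, which is what the retriever needs at query time.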

Ask questions:

  • We first need to initialize the pipeline:
from haystack.pipelines import FAQPipeline
pipe = FAQPipeline(retriever=retriever)
from haystack.utils import print_answers

# Run any question and change top_k to see more or fewer answers
prediction = pipe.run(query="How is the virus spreading?", params={"Retriever": {"top_k": 1}})

print_answers(prediction, details="medium")
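WARNING:haystack.document_stores.memory:Skipping some of your documents that don't have embeddings. To generate embeddings, run the document store's update_embeddings() method.
'Query: How is the virus spreading?'
'Answers:'
[]
  • The empty answer list follows from the warning: in the run recorded here, the documents were written to the DocumentStore without embeddings, so the retriever had nothing to compare the query against. This is what happens when the embedding column is set with df.embedding = ... instead of df["embedding"] = ...; once the documents carry embeddings (or after running the update_embeddings() method named in the warning), the same query should return the closest FAQ entry.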
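A small check on the records before calling write_documents can surface missing embeddings earlier than the warning above does. This is a sketch in plain Python; the helper name `records_missing_embedding` and the sample records are hypothetical, and it assumes the embeddings are plain lists (as produced by `.tolist()`).

```python
def records_missing_embedding(records):
    """Return the positions of records that have no usable 'embedding' entry."""
    missing = []
    for position, record in enumerate(records):
        embedding = record.get("embedding")
        if embedding is None or len(embedding) == 0:
            missing.append(position)
    return missing

# Hypothetical records: the second one lacks an embedding
sample_records = [
    {"content": "q1", "answer": "a1", "embedding": [0.1, 0.2]},
    {"content": "q2", "answer": "a2"},
]

missing = records_missing_embedding(sample_records)
if missing:
    print(f"{len(missing)} record(s) without embeddings at positions {missing}")
```

In the workflow of this post, such a check would be run on docs_to_index right before indexing, and an empty result would mean every document can be matched by the retriever.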