from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
Introduction:
- Extractive question answering systems usually generate answers from raw text, but in some cases it can be useful to reuse existing FAQs as the dataset instead
- This can be appealing for several reasons:
  - We already have the data
  - Inference time is reduced
  - We have more control over the answers
- The limitation is that this approach only generalizes well to questions similar to those already in the FAQs, so it suits certain situations only.
Create a simple DocumentStore:
- As we saw before, InMemoryDocumentStore is an easy way to create a DocumentStore for simple prototyping.
Create a Retriever using embeddings:
- The idea here is to create embeddings for the questions we get from users
- These embeddings are then matched against the embeddings of the FAQ questions we have
- First we need to create the retriever with the embedding model we want to use
- Then we apply the same embedding model to the FAQ questions that serve as our dataset
from haystack.nodes import EmbeddingRetriever
retriever = EmbeddingRetriever(document_store = document_store,
                               embedding_model = "sentence-transformers/all-MiniLM-L6-v2",
                               use_gpu = True,
                               scale_score = False)
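To make the matching idea concrete, here is a minimal sketch of embedding-based FAQ lookup using only the standard library. The `embed` function is a hypothetical bag-of-words stand-in for a real sentence-embedding model, and the FAQ entries are made up for illustration; this is not how EmbeddingRetriever is implemented, only the general principle of comparing a query vector with stored question vectors.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Made-up FAQ entries: (question, answer)
faq = [
    ("How do I reset my password?", "Use the reset link on the login page."),
    ("How do I delete my account?", "Open settings and choose delete account."),
    ("Where can I download my invoice?", "Invoices are listed under billing."),
]

# Embed the FAQ questions once, up front
faq_vectors = [embed(question) for question, _ in faq]

def retrieve(query, top_k=1):
    """Embed the query with the same function and return the top_k closest FAQ entries."""
    query_vector = embed(query)
    scored = sorted(
        ((cosine(query_vector, vec), q, a) for vec, (q, a) in zip(faq_vectors, faq)),
        reverse=True,
    )
    return scored[:top_k]

best = retrieve("reset my password", top_k=1)
print(best[0][1], "->", best[0][2])
```

The query and the stored questions must go through the same embedding function, otherwise their vectors are not comparable. A real model replaces `embed`, but the retrieval step keeps this shape.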
- Download the FAQ dataset
# download
from haystack.utils import fetch_archive_from_http
doc_dir = 'data/tutorial'
urls = 'https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip'
fetch_archive_from_http(url = urls, output_dir = doc_dir)
True
- Here we use Pandas to manipulate the dataset we just downloaded.
- First create the dataframe
import pandas as pd
df = pd.read_csv(f'{doc_dir}/small_faq_covid.csv')
df.head()
index | question | answer | answer_html | link | name | source | category | country | region | city | lang | last_update |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | What is a novel coronavirus? | A novel coronavirus is a new coronavirus that ... | <p>A novel coronavirus is a new coronavirus th... | \nhttps://www.cdc.gov/coronavirus/2019-ncov/fa... | Frequently Asked Questions | Center for Disease Control and Prevention (CDC) | Coronavirus Disease 2019 Basics | USA | NaN | NaN | en | 2020/03/17 |
1 | Why is the disease being called coronavirus di... | On February 11, 2020 the World Health Organiza... | <p>On February 11, 2020 the World Health Organ... | \nhttps://www.cdc.gov/coronavirus/2019-ncov/fa... | Frequently Asked Questions | Center for Disease Control and Prevention (CDC) | Coronavirus Disease 2019 Basics | USA | NaN | NaN | en | 2020/03/17 |
2 | Why might someone blame or avoid individuals a... | People in the U.S. may be worried or anxious a... | <p>People in the U.S. may be worried or anxiou... | \nhttps://www.cdc.gov/coronavirus/2019-ncov/fa... | Frequently Asked Questions | Center for Disease Control and Prevention (CDC) | Coronavirus Disease 2019 Basics | USA | NaN | NaN | en | 2020/03/17 |
3 | How can people help stop stigma related to COV... | People can fight stigma and help, not hurt, ot... | <p>People can fight stigma and help, not hurt,... | \nhttps://www.cdc.gov/coronavirus/2019-ncov/fa... | Frequently Asked Questions | Center for Disease Control and Prevention (CDC) | How It Spreads | USA | NaN | NaN | en | 2020/03/17 |
4 | What is the source of the virus? | Coronaviruses are a large family of viruses. S... | <p>Coronaviruses are a large family of viruses... | \nhttps://www.cdc.gov/coronavirus/2019-ncov/fa... | Frequently Asked Questions | Center for Disease Control and Prevention (CDC) | How It Spreads | USA | NaN | NaN | en | 2020/03/17 |
- Since we downloaded the data from the internet, we should clean it properly before using it in any kind of data processing
df.fillna(value='', inplace=True)
df["question"] = df["question"].apply(lambda x: x.strip())
questions = list(df.question.values)
- Create embeddings for the questions
# Use item assignment so that pandas creates a real "embedding" column
df["embedding"] = retriever.embed_queries(queries=questions).tolist()
len(df["embedding"])
213
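One pitfall at this step is worth illustrating: pandas creates a column only through item assignment (`df["embedding"] = ...`). Attribute assignment (`df.embedding = ...`) with a new name sets a plain Python attribute and triggers a UserWarning, so the values never reach `to_dict`. The sketch below assumes pandas is installed and uses made-up questions and vectors.

```python
import warnings
import pandas as pd

vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # made-up embeddings

# Attribute assignment: pandas warns and does NOT create a column
df_attr = pd.DataFrame({"question": ["q1", "q2"]})
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the UserWarning for this demo
    df_attr.embedding = vectors
attr_has_column = "embedding" in df_attr.columns

# Item assignment: a real column that survives to_dict
df_item = pd.DataFrame({"question": ["q1", "q2"]})
df_item["embedding"] = vectors
records = df_item.to_dict(orient="records")

print(attr_has_column)          # False
print(list(records[0].keys()))  # ['question', 'embedding']
```

Checking `"embedding" in df.columns` before converting the dataframe is a cheap way to confirm that the vectors will actually be indexed.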
- Rename the question column to content
df = df.rename(columns={'question':'content'})
- Convert the dataframe into a list of dicts and index them in our DocumentStore
docs_to_index = df.to_dict(orient='records')
document_store.write_documents(docs_to_index)
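It helps to picture what each record looks like once it is indexed. The sketch below rests on an assumption about Haystack 1.x: keys that are not document fields (such as answer or lang) are moved into the document's meta, which is where the FAQ pipeline is expected to find the answer. `split_record` and `DOCUMENT_FIELDS` are hypothetical names for illustration, not Haystack APIs, and the record is made up.

```python
# Simplified, assumed subset of the fields a document accepts directly
DOCUMENT_FIELDS = {"content", "embedding", "id"}

def split_record(record):
    """Keep document fields at the top level and collect everything else under 'meta'."""
    doc = {"meta": {}}
    for key, value in record.items():
        if key in DOCUMENT_FIELDS:
            doc[key] = value
        else:
            doc["meta"][key] = value
    return doc

record = {
    "content": "How do I reset my password?",  # the renamed question column
    "answer": "Use the reset link on the login page.",
    "embedding": [0.1, 0.2, 0.3],              # made-up vector
    "lang": "en",
}

doc = split_record(record)
print(doc["content"])
print(doc["meta"])
```

Under this assumption, the rename from question to content matters because content is the text the store treats as the document itself, while the answer travels along as metadata.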
Ask questions:
- We first need to initialize the pipeline:
from haystack.pipelines import FAQPipeline
pipe = FAQPipeline(retriever=retriever)
from haystack.utils import print_answers
# Run any question and change top_k to see more or fewer answers
prediction = pipe.run(query="How is the virus spreading?", params={"Retriever": {"top_k": 1}})
print_answers(prediction, details="medium")
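WARNING:haystack.document_stores.memory:Skipping some of your documents that don't have embeddings. To generate embeddings, run the document store's update_embeddings() method.
'Query: How is the virus spreading?'
'Answers:'
[]
- The empty answer list goes together with the warning: documents indexed without an embedding field are skipped at query time. The embeddings have to be stored as a real dataframe column (df["embedding"] = ...) before calling to_dict, otherwise the retriever has nothing to compare the query with.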
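A quick check before write_documents can catch documents without embeddings early. The sketch below is plain Python over a list of dicts shaped like docs_to_index; `missing_embeddings` is a hypothetical helper, not part of Haystack, and the records are made up.

```python
def missing_embeddings(records):
    """Return the indices of records that have no usable 'embedding' entry."""
    missing = []
    for i, record in enumerate(records):
        embedding = record.get("embedding")
        if embedding is None or len(embedding) == 0:
            missing.append(i)
    return missing

# Made-up records: the second one lacks an embedding
records = [
    {"content": "How do I reset my password?", "embedding": [0.1, 0.2, 0.3]},
    {"content": "How do I delete my account?"},
]

bad = missing_embeddings(records)
if bad:
    print(f"{len(bad)} record(s) without embeddings at positions {bad}")
```

If the list is not empty, it is better to fix the dataframe and rebuild the records than to index documents the retriever will skip.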