Building a Q/A System Using Haystack

Haystack
LLMs
QAs
NLP
Author

Ismail TG

Published

October 17, 2023

Introduction:

  • Haystack is an end-to-end framework that lets us orchestrate many components in order to build LLM applications with a minimal amount of code.

Question/Answer Systems: An Overview

  • Question answering (Q/A) is a discipline focused on building systems that answer human questions by retrieving the answers from a collection of stored documents.
  • This is a very (very!) simplistic description of how Q/A systems work, but it is all we need to know at this moment.
  • We have some stored documents, we use a retriever to pick the documents that match the question, and we use a language model to extract the answer from them.
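
The retrieve-then-read idea described above can be sketched in a few lines of plain Python. This is only a toy illustration, not how Haystack works internally: the `retrieve` and `read` functions, the word-overlap scoring, and the three sample documents are all assumptions made up for this example.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(question, documents, top_k=2):
    """Rank documents by how many question words they share; keep the top_k."""
    q_tokens = set(tokenize(question))
    scored = [(len(q_tokens & set(tokenize(doc))), doc) for doc in documents]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def read(question, documents):
    """Stand-in for a reader: return the sentence that best overlaps the question."""
    q_tokens = set(tokenize(question))
    sentences = [s.strip() for doc in documents for s in doc.split(".") if s.strip()]
    if not sentences:
        return None
    return max(sentences, key=lambda s: len(q_tokens & set(tokenize(s))))

# Toy document collection (hypothetical, for illustration only)
documents = [
    "Paris is the capital of France. It lies on the Seine.",
    "Berlin is the capital of Germany. It lies on the Spree.",
    "Bread is made from flour and water.",
]
question = "What is the capital of France?"

candidates = retrieve(question, documents, top_k=2)  # step 1: narrow down
answer = read(question, candidates)                  # step 2: extract an answer
print(answer)
```

A real system replaces the word-overlap heuristics with a proper ranking algorithm and a trained language model, but the two-stage shape stays the same.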

Building a Simple Q/A System for a Specific Task:

  • In this tutorial we will build a simple Q/A system that can answer questions about a specific topic using the **Haystack** library.

    • Following the simple definition of Q/A given above, we start by collecting and storing documents about our topic, which is Game of Thrones. To achieve that we will use a DocumentStore.
    • Then we need a mechanism that helps us find the documents most likely to contain an answer. We will use the Retriever API.
    • Finally, we need a deep learning model that highlights possible answers in the context of the selected documents. We will use the Reader to achieve that goal.

Initializing the DocumentStore:

  • In this tutorial we will use the most basic type of DocumentStore, the InMemoryDocumentStore:
from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore(use_bm25=True)
  • A DocumentStore is like a database or a warehouse that needs to be filled with documents.
  • Here we use the fetch_archive_from_http function to download our documents from the web.
  • The downloaded documents will then need to be prepared and organized before they can be processed in the next step:
from haystack.utils import fetch_archive_from_http
doc_dir = 'data/directory_project'
fetch_archive_from_http(
    url='https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip',
    output_dir=doc_dir,
)
True
  • Now all our data is stored under the path data/directory_project, which is assigned to doc_dir.
  • At this moment our data is raw text and should be converted into Document objects according to the Haystack standards. To do that we will use TextIndexingPipeline, which also writes them into the DocumentStore:
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline

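# Build the full path of every file in the download directory
files_to_index = [os.path.join(doc_dir, f) for f in os.listdir(doc_dir)]
# Convert the files into Document objects and write them into the DocumentStore
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)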
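
One caveat about the file listing: os.listdir returns every entry in the directory, including subdirectories and files that are not plain text. A slightly more defensive way to collect the paths is sketched below. The helper name `list_text_files` is hypothetical (it is not part of Haystack), and the demo runs on a throwaway temporary directory so it does not depend on the downloaded dataset.

```python
import tempfile
from pathlib import Path

def list_text_files(directory):
    """Return sorted paths (as strings) of the .txt files directly inside directory."""
    return sorted(str(p) for p in Path(directory).glob("*.txt") if p.is_file())

# Demo on a temporary directory so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.txt").write_text("second document", encoding="utf-8")
    (Path(tmp) / "a.txt").write_text("first document", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("not indexed", encoding="utf-8")
    (Path(tmp) / "subdir").mkdir()
    found = [Path(p).name for p in list_text_files(tmp)]

print(found)
```

In the tutorial, the resulting list could be passed as file_paths in place of files_to_index, assuming all the documents in doc_dir are .txt files.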

Initializing a Retriever:

  • A retriever searches through all the documents we stored and finds the ones most likely to contain an answer to our question.
  • Here we initialize the BM25Retriever and give it access to the InMemoryDocumentStore:
from haystack.nodes import BM25Retriever
retriever = BM25Retriever(document_store=document_store)
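
To give an intuition for what a BM25 ranking does, here is a minimal pure-Python sketch of one common BM25 formulation. It is an illustration under stated assumptions, not Haystack's internal implementation: the function name `bm25_scores`, the parameter defaults k1=1.5 and b=0.75, the simple tokenizer, and the three sample documents are all chosen for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with a common BM25 variant."""
    if not documents:
        return []
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Rare terms get a higher weight than common ones
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency is dampened and normalized by document length
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy documents (hypothetical, for illustration only)
documents = [
    "the dragon breathes fire",
    "the wolf howls at night",
    "the dragon and the wolf meet at night",
]
scores = bm25_scores("dragon fire", documents)
# Document indices ordered from most to least relevant
ranking = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The first document ranks highest because it contains both query terms and is short, the third matches only one term, and the second matches none.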

Initializing the Reader:

  • The Reader receives the texts selected by the Retriever and extracts candidate answers from them.
  • The Reader is built on a pretrained language model.
  • In this tutorial we use the roberta-base-squad2 model:
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path='deepset/roberta-base-squad2', use_gpu=True)
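
Extractive models trained on SQuAD-style data typically score every token as a possible start and as a possible end of the answer, and the answer is the span with the best combined score. The sketch below shows that selection step in isolation. It is a simplified illustration, not FARMReader's actual code: the function `best_span`, the `max_answer_len` parameter, and the per-token scores are invented for this example, while the sentence itself comes from the Drogo article in the dataset.

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (start, end, score) of the best-scoring span with start <= end."""
    best = None
    for start, s_score in enumerate(start_scores):
        # Only consider spans of at most max_answer_len tokens
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = s_score + end_scores[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best

# Question (for context): "Who portrays Drogo?"
tokens = ["Drogo", "is", "portrayed", "by", "Jason", "Momoa"]
# Hypothetical per-token scores, invented for illustration only
start_scores = [0.5, 0.0, 0.1, 0.2, 3.0, 1.0]
end_scores = [0.3, 0.0, 0.1, 0.1, 1.0, 3.5]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

In a real reader the scores come from the model's output layer, and each candidate span is returned together with its surrounding context.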

Creating the Retriever-Reader Pipeline:

  • Now that we have everything we need to build this Q/A system, all we have to do is put the pieces together in one Pipeline:
from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)

Asking Questions:

predictions = pipe.run(
    query='who is the most powerful creature?',
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)
Inferencing Samples: 100%|██████████| 1/1 [00:03<00:00,  3.66s/ Batches]
from haystack.utils import print_answers
print_answers(predictions, details='minimum')
'Query: who is the most powerful creature?'
'Answers:'
[   {   'answer': 'Khal Drogo',
        'context': '\n'
                   '\n'
                   "'''Khal Drogo''' is a fictional character in the ''A Song "
                   "of Ice and Fire'' series of fantasy novels by American "
                   'author George R. R. Martin and in t'},
    {   'answer': 'Night King',
        'context': '\n'
                   '\n'
                   "The '''Night King''' is a fictional character appearing in "
                   "the HBO high fantasy television series ''Game of "
                   "Thrones'', based on George R. R. Martin'"},
    {   'answer': 'Drogo',
        'context': ' prove to be fundamental to her growth as both a ruler and '
                   'a conqueror.\n'
                   '\n'
                   'Drogo is portrayed by Jason Momoa in the HBO television '
                   'adaptation.\n'
                   '\n'
                   '==Overvi'},
    {   'answer': 'Jon Snow',
        'context': 'rge and sinewy man that towers over others, such as Davos '
                   'Seaworth and Jon Snow, a Baratheon trait. He lacks the '
                   'long black hair of his brothers, and '},
    {   'answer': 'Night King',
        'context': 'urdik in seasons 6 to 8.\n'
                   '\n'
                   '==Description==\n'
                   "In ''Game of Thrones'', the Night King is physically "
                   'distinguished from the other White Walkers by his "crow'}]
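
Note that the same answer string can appear more than once in the results ('Night King' is returned twice above, from two different contexts). A small post-processing step can collapse such repeats. The sketch below assumes the answers are available as a plain list of dicts with an 'answer' key, mirroring the printed output rather than Haystack's own answer objects; the helper name `unique_answers` is hypothetical.

```python
def unique_answers(answers):
    """Keep the first occurrence of each answer string, preserving rank order."""
    seen = set()
    unique = []
    for item in answers:
        key = item["answer"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(item["answer"])
    return unique

# Answer strings copied from the printed output; contexts omitted for brevity
printed = [
    {"answer": "Khal Drogo"},
    {"answer": "Night King"},
    {"answer": "Drogo"},
    {"answer": "Jon Snow"},
    {"answer": "Night King"},
]
deduped = unique_answers(printed)
print(deduped)
```

This simple exact-match comparison still treats 'Khal Drogo' and 'Drogo' as different answers; merging those would need fuzzier matching.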