from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore(use_bm25=True)
Introduction:
- Haystack is an end-to-end framework that lets us orchestrate many components to build LLM applications with minimal code.
Question/Answer Systems: An Overview
- Question answering (Q/A) is a discipline focused on building systems that answer human questions by retrieving the answers from stored documents.
- This is a very (very!) simplistic description of how Q/A systems work, but it is all we need to know at this moment.
- We have some stored documents, we use a retriever to pick the documents that match the question, and we use a language model to extract the answer from them.
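The retrieve-then-read flow described above can be sketched in a few lines of plain Python. This is only a toy under loud assumptions: `retrieve` and `read` are hypothetical stand-ins that use word overlap, not the Haystack components, and the three documents are short sentences adapted from the dataset used later in this tutorial.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(question, documents, top_k=2):
    """Toy retriever (assumption): rank documents by shared question words."""
    q_words = set(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(tokenize(d))),
        reverse=True,
    )
    return ranked[:top_k]

def read(question, documents):
    """Toy reader (assumption): return the sentence that overlaps most with the question."""
    q_words = set(tokenize(question))
    sentences = [s.strip() for d in documents for s in d.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q_words & set(tokenize(s))))

documents = [
    "Drogo is portrayed by Jason Momoa in the HBO television adaptation.",
    "The Night King is a fictional character appearing in the HBO series Game of Thrones.",
    "Khal Drogo is a fictional character in the A Song of Ice and Fire novels.",
]

question = "Who portrayed Drogo?"
candidates = retrieve(question, documents, top_k=2)  # stage 1: narrow down
answer = read(question, candidates)                  # stage 2: extract
print(answer)
```

A real retriever and reader are far more capable, but the division of labour is the same: a cheap first stage narrows the documents, and a more expensive second stage looks for the answer inside them.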
Building A simple Q/A system for a specific task:
In this tutorial we will build a simple Q/A system that can answer questions about a specific topic using the **Haystack** library.
- Following the simple definition of Q/A given above, we start by collecting and storing data/documents about that topic, which is, by the way, Game of Thrones. To achieve that we will use a DocumentStore.
- Then we need some mechanism that helps us find the documents that may contain answers. We will use the Retriever API.
- Finally, we need a deep learning model that highlights possible answers in the context of the retrieved documents. We will use the Reader to achieve that.
Initializing the DocumentStore:
- In this tutorial we will use the most basic type of DocumentStore, which is InMemoryDocumentStore.
- A DocumentStore is like a database or a warehouse that needs to be filled with data/documents.
- Here we use the fetch_archive_from_http function to download our documents from the web.
- The downloaded documents need to be prepared and organized before they can be processed in the next step:
from haystack.utils import fetch_archive_from_http
doc_dir = 'data/directory_project'
fetch_archive_from_http(url='https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip', output_dir=doc_dir)
True
- Now all our data is stored in the path data/directory_project, which is assigned to doc_dir.
- At this moment our data is raw and should be converted into Document objects according to Haystack's standards. To do that we will use TextIndexingPipeline, which also writes them into the DocumentStore.
import os
from haystack.pipelines.standard_pipelines import TextIndexingPipeline
files_to_index = [doc_dir + '/' + f for f in os.listdir(doc_dir)]
indexing_pipeline = TextIndexingPipeline(document_store)
indexing_pipeline.run_batch(file_paths=files_to_index)
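To get a feel for what this indexing step does conceptually, here is a rough pure-Python sketch that reads text files from a directory and wraps each one in a small record. It is not TextIndexingPipeline itself: the helper `files_to_documents`, the dict layout, and the two stand-in files are all assumptions made for illustration, and the sketch builds paths with `os.path.join` rather than string concatenation.

```python
import os
import tempfile

def files_to_documents(directory):
    """Read every .txt file in `directory` into a dict with 'content' and 'meta'."""
    documents = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".txt"):
            continue  # skip anything that is not a plain-text file
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as handle:
            documents.append({"content": handle.read(), "meta": {"name": name}})
    return documents

# Hypothetical stand-in files so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in [("a.txt", "Khal Drogo"), ("b.txt", "Night King")]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    docs = files_to_documents(tmp_dir)

print([d["meta"]["name"] for d in docs])
```

The real pipeline may also clean and split the text before storing it, which this sketch leaves out.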
Initializing A Retriever:
- A retriever goes through all the documents we stored and finds the ones most likely to contain an answer to our question.
- Here we initialize the BM25Retriever and give it access to the InMemoryDocumentStore.
from haystack.nodes import BM25Retriever
retriever = BM25Retriever(document_store=document_store)
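For intuition about what a BM25-style ranking computes, here is a minimal sketch of the textbook BM25 score in plain Python. It is an illustration under assumptions, not the code BM25Retriever runs: the function `bm25_scores`, the whitespace tokenization, and the values `k1=1.5` and `b=0.75` are choices made here, and the tiny corpus is adapted from phrases in the dataset.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus_tokens` against the query."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            # Length normalization: long documents are penalized via `b`.
            norm = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores

corpus = [
    "the night king is physically distinguished from the other white walkers".split(),
    "khal drogo is a fictional character".split(),
    "drogo is portrayed by jason momoa".split(),
]
scores = bm25_scores("night king".split(), corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

Only the first document contains the query terms, so it is the only one with a non-zero score. Note that BM25 matches words, not meaning, which is why the choice of words in a question matters.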
Initializing the Reader:
- The Reader receives the texts selected by the Retriever and extracts candidate answers from them.
- The Reader is built on a pretrained language model.
- In this tutorial we use the roberta-base-squad2 model:
from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path='deepset/roberta-base-squad2', use_gpu=True)
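Extractive readers of the SQuAD kind typically score every token as a possible start and as a possible end of the answer, then pick the best-scoring span. The sketch below shows only that selection step, as a general illustration rather than FARMReader's actual code: `best_span` is a hypothetical helper and the scores are made up, standing in for what a model would output.

```python
def best_span(start_scores, end_scores, max_len=5):
    """Return (start, end, score) of the highest-scoring span with start <= end."""
    best = None
    for i, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within `max_len` tokens.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start_score + end_scores[j]
            if best is None or score > best[2]:
                best = (i, j, score)
    return best

tokens = ["Drogo", "is", "portrayed", "by", "Jason", "Momoa"]
# Made-up scores standing in for a model's per-token start/end outputs.
start_scores = [0.1, 0.0, 0.2, 0.1, 2.5, 0.3]
end_scores = [0.2, 0.0, 0.1, 0.0, 0.4, 3.0]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

Because the answer is always a span copied out of the text, a reader like this can only return something that literally appears in the retrieved documents.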
Creating the Retriever-Reader Pipeline:
- Now that we have everything we need to build this Q/A system, all we have to do is put the pieces together in one Pipeline:
from haystack.pipelines import ExtractiveQAPipeline
pipe = ExtractiveQAPipeline(reader, retriever)
Asking Questions:
predictions = pipe.run(query='who is the most powerful creature?', params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})
Inferencing Samples: 100%|██████████| 1/1 [00:03<00:00, 3.66s/ Batches]
from haystack.utils import print_answers
print_answers(predictions, details='minimum')
'Query: who is the most powerful creature?'
'Answers:'
[ { 'answer': 'Khal Drogo',
'context': '\n'
'\n'
"'''Khal Drogo''' is a fictional character in the ''A Song "
"of Ice and Fire'' series of fantasy novels by American "
'author George R. R. Martin and in t'},
{ 'answer': 'Night King',
'context': '\n'
'\n'
"The '''Night King''' is a fictional character appearing in "
"the HBO high fantasy television series ''Game of "
"Thrones'', based on George R. R. Martin'"},
{ 'answer': 'Drogo',
'context': ' prove to be fundamental to her growth as both a ruler and '
'a conqueror.\n'
'\n'
'Drogo is portrayed by Jason Momoa in the HBO television '
'adaptation.\n'
'\n'
'==Overvi'},
{ 'answer': 'Jon Snow',
'context': 'rge and sinewy man that towers over others, such as Davos '
'Seaworth and Jon Snow, a Baratheon trait. He lacks the '
'long black hair of his brothers, and '},
{ 'answer': 'Night King',
'context': 'urdik in seasons 6 to 8.\n'
'\n'
'==Description==\n'
"In ''Game of Thrones'', the Night King is physically "
'distinguished from the other White Walkers by his "crow'}]
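A note on the `params` argument passed to `pipe.run`: it is keyed by node name, so the Retriever was asked for 10 documents and the Reader for 5 answers, which is why five answers are printed. The toy below sketches how such a per-node dictionary could be dispatched. Everything in it (`run_pipeline`, `toy_retriever`, `toy_reader`) is a hypothetical stand-in written for this illustration, not Haystack's implementation.

```python
def run_pipeline(query, nodes, params):
    """Pass `query` through each (name, function) node, giving each its own kwargs."""
    data = query
    for name, node in nodes:
        # Look up the keyword arguments addressed to this node, if any.
        data = node(data, **params.get(name, {}))
    return data

def toy_retriever(query, top_k=3):
    """Stand-in retriever: pretend every query matches documents doc0..doc19."""
    return [f"doc{i}" for i in range(20)][:top_k]

def toy_reader(documents, top_k=1):
    """Stand-in reader: pretend each document yields exactly one answer."""
    return [f"answer from {d}" for d in documents][:top_k]

nodes = [("Retriever", toy_retriever), ("Reader", toy_reader)]
params = {"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}}
answers = run_pipeline("who is the most powerful creature?", nodes, params)
print(len(answers))
```

Raising the Retriever's `top_k` gives the Reader more text to search at the cost of speed, while the Reader's `top_k` only controls how many candidate answers come back.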