LangChain custom prompts & input parameters not clear
I do not understand how the custom prompt works in the example documentation: https://python.langchain.com/en/latest/modules/chains/index_examples/qa_with_sources.html

The prompt object is defined as:

PROMPT = PromptTemplate(template=template, input_variables=["summaries", "question"])

which expects two inputs, summaries and question.

However, only question is passed in (as query), NOT summaries:

chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="stuff", prompt=PROMPT)
query = "What did the president say about Justice Breyer"
chain({"input_documents": docs, "question": query}, return_only_outputs=True)
where docs = docsearch.similarity_search(query)

Question 1 (for qa_with_sources): How does input_documents map to summaries?

Similarly, I am not clear about the documentation at this link: https://python.langchain.com/en/latest/modules/chains/index_examples/summarize.html

prompt_template = """Write a concise summary of the following:


{text}


CONCISE SUMMARY IN ITALIAN:"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])
chain = load_summarize_chain(OpenAI(temperature=0), chain_type="map_reduce", return_intermediate_steps=True, map_prompt=PROMPT, combine_prompt=PROMPT)
chain({"input_documents": docs}, return_only_outputs=True)

More questions (for summarize):
2. How does docs map to text? Where can I learn about this?
3. How does it work with map_prompt and combine_prompt being the same?
4. Where can I see the parameters of the function? When I run print('load_summarize_chain params: ', inspect.signature(load_summarize_chain)), I simply get the following output:

load_summarize_chain params: (llm: langchain.schema.BaseLanguageModel, chain_type: str = 'stuff', verbose: Optional[bool] = None, **kwargs: Any) -> langchain.chains.combine_documents.base.BaseCombineDocumentsChain

Drying answered 24/4, 2023 at 16:02

Still learning LangChain here myself, but I will share the answers I've come up with in my own search.

Notes:

  • OP questions edited lightly for clarity.
  • Each of these questions is probably better as its own separate post, but I did appreciate having them all together as it pushed me to connect the dots between them. So here's hoping this is useful to others as well.

Question 1

In load_qa_with_sources_chain(), PROMPT is defined as:

PROMPT = PromptTemplate(template=template, input_variables=["summaries", "question"])

which expects two inputs, 'summaries' and 'question'.

However, only question=query is passed in, NOT 'summaries':

chain = load_qa_with_sources_chain(OpenAI(temperature=0), chain_type="stuff", prompt=PROMPT)
query = "What did the president say about Justice Breyer"
chain({"input_documents": docs, "question": query}, return_only_outputs=True)

How does input_documents map to summaries?

Answer 1

First, the stuff chain loader takes the prompt we pass in and defines an LLMChain with it. Then you can see that this llm_chain is used to initialize a StuffDocumentsChain.

def _load_stuff_chain(
    llm: BaseLanguageModel,
    prompt: BasePromptTemplate = stuff_prompt.PROMPT,
    document_prompt: BasePromptTemplate = stuff_prompt.EXAMPLE_PROMPT,
    document_variable_name: str = "summaries",
    verbose: Optional[bool] = None,
    **kwargs: Any,
) -> StuffDocumentsChain:
    llm_chain = LLMChain(llm=llm, prompt=prompt, verbose=verbose)
    return StuffDocumentsChain(
        llm_chain=llm_chain,
        document_variable_name=document_variable_name,
        document_prompt=document_prompt,
        verbose=verbose,
        **kwargs,
    )

But also notice that there are two other arguments to _load_stuff_chain(): document_prompt and document_variable_name.

  • document_prompt: If we do not pass in a custom document_prompt, it relies on the EXAMPLE_PROMPT, which is quite specific. (It is long, so I won't repost it here.)
  • document_variable_name: Here you can see where 'summaries' first appears as a default value. And we can see it described in the docstring as

the variable name in the llm_chain to put the documents in

In that same stuff.py script there is a _get_inputs() method that collects all of the inputs that will go into the LLM for evaluation. One of those inputs is

inputs[self.document_variable_name] = self.document_separator.join(doc_strings)

So now we know this is actually inputs['summaries'] by default. As a side note, doc_strings is each doc in docs formatted using document_prompt (via format_document()).
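To make that concrete, here is a minimal, runnable sketch of what _get_inputs() produces. The document_prompt and separator below are illustrative stand-ins of my own; the actual defaults live in qa_with_sources/stuff_prompt.py:

from langchain.prompts import PromptTemplate
from langchain.schema import Document

# Illustrative stand-ins, not the real defaults
document_prompt = PromptTemplate(
    template="Content: {page_content}", input_variables=["page_content"]
)
document_separator = "\n\n"

docs = [Document(page_content="First chunk."), Document(page_content="Second chunk.")]

# Roughly what _get_inputs() does: format each doc, join the strings, and
# store the result under document_variable_name (default: "summaries")
doc_strings = [document_prompt.format(page_content=d.page_content) for d in docs]
inputs = {
    "summaries": document_separator.join(doc_strings),
    "question": "What did the president say about Justice Breyer",
}
print(inputs["summaries"])
# Content: First chunk.
#
# Content: Second chunk.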

OK, now we are almost there. The final step in the stuff system is to send all of the docs, formatted via document_prompt, to the llm_chain for evaluation. That is done in combine_docs(), ending in this call to llm_chain.predict():

return self.llm_chain.predict(callbacks=callbacks, **inputs), {}

Remember, we initialized llm_chain with the original PROMPT we passed in, and now it is clear that it expects both 'question' AND 'summaries' as input variables.
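A consequence is that you are not locked into the name 'summaries'. Since load_qa_with_sources_chain() forwards extra keyword arguments down to _load_stuff_chain(), a sketch like the following should work; the variable name "context" and the prompt text here are my own choices, not LangChain defaults:

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

template = """Use the following context to answer the question, citing sources.

{context}

Question: {question}
Answer:"""
PROMPT = PromptTemplate(template=template, input_variables=["context", "question"])

# document_variable_name travels through **kwargs to _load_stuff_chain(),
# so the joined documents are stored under "context" instead of "summaries"
chain = load_qa_with_sources_chain(
    OpenAI(temperature=0),
    chain_type="stuff",
    prompt=PROMPT,
    document_variable_name="context",
)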

Question 2

In the summarize_chain example:

prompt_template = """Write a concise summary of the following:


{text}


CONCISE SUMMARY IN ITALIAN:"""
PROMPT = PromptTemplate(template=prompt_template, input_variables=["text"])
chain = load_summarize_chain(OpenAI(temperature=0), chain_type="map_reduce", return_intermediate_steps=True, map_prompt=PROMPT, combine_prompt=PROMPT)
chain({"input_documents": docs}, return_only_outputs=True)

How does docs map to text?

Answer 2

This gets easier from here, as a lot of the summarize chain code follows similar patterns to the qa chain.

We can see in _load_map_reduce_chain() that there's a default value, 'text', which gets assigned to document_variable_name in the MapReduceDocumentsChain that is initialized and returned.

Also note L52 and L54 where two different LLMChain objects are initialized, one for map (takes map_prompt) and one for reduce (takes combine_prompt).

# L52
map_chain = LLMChain(llm=llm, prompt=map_prompt, verbose=verbose)
# L54
reduce_chain = LLMChain(llm=_reduce_llm, prompt=combine_prompt, verbose=verbose)

And then reduce_chain is built into combine_document_chain, which is where we first see the relationship between 'text' (the default value for combine_document_variable_name) and PROMPT (now built into reduce_chain).

combine_document_chain = StuffDocumentsChain(
    llm_chain=reduce_chain,
    document_variable_name=combine_document_variable_name,
    verbose=verbose,
)
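So the combine step simply re-uses the same {text} slot: StuffDocumentsChain joins the map-step outputs and drops the result into the combine prompt. A rough, self-contained illustration (the summaries below are made up):

from langchain.prompts import PromptTemplate

PROMPT = PromptTemplate(
    template="Write a concise summary of the following:\n\n{text}\n\nCONCISE SUMMARY IN ITALIAN:",
    input_variables=["text"],
)

# The map step fills {text} with each doc's page_content; the combine step
# fills the same {text} slot with the joined map outputs
mapped_summaries = ["Riassunto del primo documento.", "Riassunto del secondo documento."]
print(PROMPT.format(text="\n\n".join(mapped_summaries)))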

Question 3

How does it work with map_prompt and combine_prompt being the same?

Answer 3

The fact that both prompts are the same here looks like it may be for the convenience of the example, as the suggested prompt is generic: "Write a concise summary of the following: {text} ...".

The user can enter different values for map_prompt and combine_prompt; the map step applies a prompt to each document, and the combine step applies one prompt to bring the map results together.
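For example (these prompts are my own, not from the docs), you could summarize each document in the map step and synthesize bullet points in the combine step:

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain

map_prompt = PromptTemplate(
    template="Write a concise summary of the following:\n\n{text}\n\nCONCISE SUMMARY:",
    input_variables=["text"],
)
combine_prompt = PromptTemplate(
    template="Distill these summaries into five bullet points:\n\n{text}\n\nBULLET POINTS:",
    input_variables=["text"],
)

chain = load_summarize_chain(
    OpenAI(temperature=0),
    chain_type="map_reduce",
    map_prompt=map_prompt,          # applied once per document
    combine_prompt=combine_prompt,  # applied once to the joined map outputs
)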

You can see where these steps occur in the code:

The LLM chain for map is applied in the combine_docs() step of the MapReduceDocumentsChain:

# self.document_variable_name = 'text'
# d.page_content is the text content from each doc in docs

"""Combine documents in a map reduce manner.

Combine by mapping first chain over all documents, then reducing the results.
This reducing can be done recursively if needed (if there are many documents).
"""
# L144
results = self.llm_chain.apply(
    # FYI - this is parallelized and so it is fast.
    [{self.document_variable_name: d.page_content, **kwargs} for d in docs],
    callbacks=callbacks,
)

And then the reduce steps are called in the _process_results() method, specifically in the _collapse_chain() and combine_documents_chain() sections.
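If you want to see the overall control flow without digging into LangChain internals, here is a runnable toy of the recursive reduce, with a stub standing in for the LLM. The names and logic are paraphrased for illustration, not the actual source:

def stub_llm(prompt):
    # Stand-in for an LLM call; just fabricates a shorter string
    return f"SUMMARY({len(prompt)} chars in)"

def reduce_summaries(summaries, token_max=100):
    joined = "\n\n".join(summaries)
    if len(joined) <= token_max:
        # Small enough: one final combine call (combine_documents_chain)
        return stub_llm("Write a concise summary of the following:\n" + joined)
    # Too big: collapse pairs of summaries (roughly what _collapse_chain
    # does), then recurse on the shorter list
    collapsed = [
        stub_llm("\n\n".join(summaries[i : i + 2]))
        for i in range(0, len(summaries), 2)
    ]
    return reduce_summaries(collapsed, token_max)

mapped = [f"summary of document {i}" for i in range(8)]  # pretend map-step outputs
print(reduce_summaries(mapped))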

Question 4

Where can I see the parameters of the load_summarize_chain() function?

Answer 4

All of the summarize variants (stuff, map_reduce, etc.) are defined in summarize/__init__.py. In this particular example, the map_reduce chain parameters are on L40-51.
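Building on that, you can also surface those hidden parameters from code: load_summarize_chain() forwards **kwargs to a per-chain_type loader, and those loaders are importable. They are private API, so this assumes a 2023-era langchain layout and may break across versions:

import inspect
from langchain.chains.summarize import _load_map_reduce_chain, _load_stuff_chain

# The **kwargs accepted by load_summarize_chain() land in these loaders,
# so their signatures show the real per-chain-type parameters
print(inspect.signature(_load_map_reduce_chain))
print(inspect.signature(_load_stuff_chain))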

Conductor answered 2/5, 2023 at 19:13
