I am using LangChain to interact with a long piece of text. The text is split into chunks, which are passed in as the docs variable below.
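For context, docs is produced along these lines (a minimal sketch; the splitter choice, chunk sizes, and the longText variable are illustrative, not my exact code):

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Split the long text into overlapping chunks for the map-reduce chain.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 2000,
  chunkOverlap: 200,
});
const docs = await splitter.createDocuments([longText]);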
I have set up a chain that streams the response; however, I am finding that handleLLMNewToken is called whilst OpenAI is returning tokens from all Generations, including the Map step for each document. This results in a jumbled response that interleaves every Generation, rather than just the final output Generation. I cannot find a method that lets me stream only the final result as it is being generated.
I can access the completed response via the chainResult variable, but unlike the callback it is not streamable: it only arrives once the chain has finished.
const SUMMARISE_MODEL = {
  temperature: 0.5,
  modelName: "gpt-3.5-turbo-16k",
  maxTokens: 5000,
  verbose: false,
  maxConcurrency: 2,
  streaming: true,
};

//...OTHER CODE (model is created here from SUMMARISE_MODEL)

const chain = loadQAMapReduceChain(model, { returnIntermediateSteps: true });

const chainResult = await chain.call(
  {
    input_documents: docs,
    question: templateString,
  },
  [
    {
      // Fires for tokens from every Generation, including each Map step,
      // not just the final combine step.
      handleLLMNewToken(token: string) {
        updateResponse(token);
      },
    },
  ]
);
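One workaround I have considered is building the map-reduce chain manually with two model instances, so the streaming callback is bound only to the model used for the final combine step. Below is a rough sketch of that idea; the prompts are simplified placeholders I wrote myself (not the library defaults), and I have not verified that this wiring reproduces loadQAMapReduceChain exactly:

import { ChatOpenAI } from "langchain/chat_models/openai";
import { PromptTemplate } from "langchain/prompts";
import { LLMChain, MapReduceDocumentsChain, StuffDocumentsChain } from "langchain/chains";

// Map step: no streaming and no callback, so its tokens are never surfaced.
const mapModel = new ChatOpenAI({ ...SUMMARISE_MODEL, streaming: false });

// Combine step: streaming, with the token callback bound to this model only.
const combineModel = new ChatOpenAI({
  ...SUMMARISE_MODEL,
  callbacks: [{ handleLLMNewToken: (token: string) => updateResponse(token) }],
});

// Simplified placeholder prompts.
const mapPrompt = PromptTemplate.fromTemplate(
  "Use this portion of a long document to answer the question.\n{context}\nQuestion: {question}\nRelevant text, if any:"
);
const combinePrompt = PromptTemplate.fromTemplate(
  "Given these extracted parts of the document, answer the question.\n{summaries}\nQuestion: {question}\nFinal answer:"
);

const manualChain = new MapReduceDocumentsChain({
  llmChain: new LLMChain({ llm: mapModel, prompt: mapPrompt }),
  combineDocumentChain: new StuffDocumentsChain({
    llmChain: new LLMChain({ llm: combineModel, prompt: combinePrompt }),
    documentVariableName: "summaries",
  }),
});

const result = await manualChain.call({
  input_documents: docs,
  question: templateString,
});

Is there a built-in way to restrict handleLLMNewToken to the final Generation, rather than rebuilding the chain like this?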