How to convert LangChain documents back to strings?

5

6

I have built a splitter function with the LangChain library that splits a series of Python files. At another point in the code I need to convert these documents back into Python code, but I do not know how to do this.

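import os

# In newer releases these names are also exported by the langchain_text_splitters package.
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter


def index_repo(repo_url):

    os.environ['OPENAI_API_KEY'] = ""

    contents = []
    file_names = []
    fileextensions = (".py",)

    print('cloning repo')
    repo_dir = get_repo(repo_url)  # get_repo is not shown here

    for dirpath, dirnames, filenames in os.walk(repo_dir):
        for file in filenames:
            if file.endswith(fileextensions):
                path = os.path.join(dirpath, file)
                try:
                    with open(path, "r", encoding="utf-8") as f:
                        text = f.read()
                except (OSError, UnicodeDecodeError):
                    # skip files that cannot be read as UTF-8
                    continue
                # record the name only after a successful read, so that
                # file_names and contents stay aligned
                file_names.append(path)
                contents.append(text)

    # chunk the files
    text_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON, chunk_size=5000, chunk_overlap=0)
    texts = text_splitter.create_documents(contents)

    return texts, file_names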
Mourner answered 1/9, 2023 at 20:27 Comment(0)
S
4

Because LangChain's documentation and structure are somewhat messy, there is not much information to be found about the Document type.

Fortunately, you can convert a Document back to a dictionary with doc.dict(). The text is then stored under the 'page_content' key.
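As a minimal sketch of that conversion (not taken from the original answer): the helper name document_to_dict is hypothetical, and the fallback class is only a stand-in so the snippet runs without LangChain installed. Releases built on pydantic 2 also expose model_dump(), which the helper prefers when it is available.

```python
try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in (an assumption, not LangChain code) so the sketch runs
    # without LangChain installed. It mimics only the two fields used here.
    from dataclasses import asdict, dataclass, field

    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

        def dict(self):
            return asdict(self)


def document_to_dict(doc):
    """Return a plain dict for a Document (hypothetical helper name)."""
    # Prefer model_dump() where it exists, otherwise fall back to dict().
    dump = getattr(doc, "model_dump", None) or doc.dict
    return dump()


doc = Document(page_content="print('hello')", metadata={"source": "hello.py"})
as_dict = document_to_dict(doc)

# The original text lives under the 'page_content' key.
code_string = as_dict["page_content"]
```

If only the text is needed, reading doc.page_content directly avoids building the intermediate dictionary.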

Scandura answered 13/11, 2023 at 13:45 Comment(1)
This exactly. Finally found this after a lot of searching. Thanks, man! – Appellative
E
3

You can extract the contents of an individual LangChain document as a string by reading its page_content attribute (replace the index with that of the document you want):

string_text = texts[0].page_content

This does not work on "texts" as a whole, since it is a list, but you can extract the contents of every document with a list comprehension:

string_texts = [doc.page_content for doc in texts]
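A minimal sketch of the same idea, wrapped in two hypothetical helpers (documents_to_strings and join_documents are names made up for illustration). The namedtuple fallback is only a stand-in so the snippet runs without LangChain installed. Note that joining chunks may not reproduce the original file byte for byte, since the splitter can strip whitespace around chunks.

```python
try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in (an assumption) so the sketch runs without LangChain installed.
    from collections import namedtuple

    Document = namedtuple("Document", ["page_content", "metadata"])


def documents_to_strings(docs):
    """Return the text of each document as a list of plain strings."""
    return [doc.page_content for doc in docs]


def join_documents(docs, separator="\n"):
    """Concatenate the text of all documents into one string."""
    return separator.join(documents_to_strings(docs))


# Two small chunks standing in for the output of create_documents().
texts = [
    Document(page_content="import os", metadata={}),
    Document(page_content="print(os.getcwd())", metadata={}),
]

chunks = documents_to_strings(texts)  # one string per chunk
source = join_documents(texts)        # all chunks as a single string
```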
Epoch answered 8/12, 2023 at 21:25 Comment(0)
S
2

Try replacing this:

    texts = text_splitter.create_documents(contents)

With this:

    texts = [chunk for content in contents for chunk in text_splitter.split_text(content)]

create_documents returns a list of Document objects, each of which has two attributes: page_content (a string) and metadata (a dictionary). split_text instead returns the chunks as plain strings. It takes a single string rather than a list, so it has to be called once per file, as above; texts then holds every chunk from the RecursiveCharacterTextSplitter as a plain string.

Hope this helps!

Someplace answered 3/9, 2023 at 5:51 Comment(0)
R
0

There are good answers here, but I think an example of what a langchain_core.documents.base.Document actually contains helps to visualise it.

from langchain_community.document_loaders import UnstructuredPDFLoader
loader = UnstructuredPDFLoader("World-Bank-Notes-on-Debarred-Firms-and-Individuals.pdf", mode="elements")
data = loader.load()
data[0].dict()

output

{'page_content': 'World Bank Listing of Ineligible Firms and Individuals',
 'metadata': {'source': 'C:\\Users\\xxxxx\\Downloads\\World-Bank-Notes-on-Debarred-Firms-and-Individuals.pdf',
  'coordinates': {'points': ((137.64, 101.49324000000001),
    (137.64, 117.45323999999994),
    (477.84336, 117.45323999999994),
    (477.84336, 101.49324000000001)),
   'system': 'PixelSpace',
   'layout_width': 612.0,
   'layout_height': 792.0},
  'file_directory': 'C:\\Users\\xxxxx\\Downloads',
  'filename': 'World-Bank-Notes-on-Debarred-Firms-and-Individuals.pdf',
  'languages': ['eng'],
  'last_modified': '2024-01-19T17:35:06',
  'page_number': 1,
  'filetype': 'application/pdf',
  'category': 'Title'},
 'type': 'Document'}

Rather than calling data[0].dict(), you can access page_content or metadata directly:

data[0].page_content

output

'World Bank Listing of Ineligible Firms and Individuals'

or

data[0].metadata

output

{'source': 'C:\\Users\\xxxxx\\Downloads\\World-Bank-Notes-on-Debarred-Firms-and-Individuals.pdf',
 'coordinates': {'points': ((137.64, 101.49324000000001),
   (137.64, 117.45323999999994),
   (477.84336, 117.45323999999994),
   (477.84336, 101.49324000000001)),
  'system': 'PixelSpace',
  'layout_width': 612.0,
  'layout_height': 792.0},
 'file_directory': 'C:\\Users\\xxxxx\\Downloads',
 'filename': 'World-Bank-Notes-on-Debarred-Firms-and-Individuals.pdf',
 'languages': ['eng'],
 'last_modified': '2024-01-19T17:35:06',
 'page_number': 1,
 'filetype': 'application/pdf',
 'category': 'Title'}

metadata is a dict, so you can look up individual keys like this:

data[0].metadata["filename"]

output

'World-Bank-Notes-on-Debarred-Firms-and-Individuals.pdf'
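Metadata can also help when turning chunks back into whole files. The sketch below is an assumption layered on top of this answer: it supposes each document was created with a per-file metadata entry (for instance via the metadatas argument of create_documents) under a hypothetical "source" key, and rebuild_by_source is a made-up helper name. The namedtuple fallback is only a stand-in so the snippet runs without LangChain installed.

```python
from collections import defaultdict

try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in (an assumption) so the sketch runs without LangChain installed.
    from collections import namedtuple

    Document = namedtuple("Document", ["page_content", "metadata"])


def rebuild_by_source(docs, key="source", separator="\n"):
    """Group chunk texts by a metadata key and join each group in order."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata[key]].append(doc.page_content)
    return {name: separator.join(chunks) for name, chunks in grouped.items()}


# Chunks from two hypothetical files, tagged with their file name.
docs = [
    Document(page_content="import os", metadata={"source": "a.py"}),
    Document(page_content="print(os.name)", metadata={"source": "a.py"}),
    Document(page_content="x = 1", metadata={"source": "b.py"}),
]

files = rebuild_by_source(docs)
```

The result maps each file name to its rejoined text. As with any rejoining of split chunks, whitespace at chunk boundaries may differ from the original file.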
Radiolucent answered 12/2 at 17:24 Comment(0)
W
0

This will work, assuming result is your list of Document objects:

result_strings = [x.dict()['page_content'] for x in result]

result_strings
Weidman answered 13/2 at 13:55 Comment(0)
