Distinguishing Table of Contents in Word document

4

6

Does anyone know how, when programmatically iterating through a Word document, you can tell whether a paragraph forms part of a table of contents (or indeed anything else that forms part of a field)?

My reason for asking is that I have a VB program that is supposed to extract the first couple of paragraphs of substantive text from a document. It does so by iterating through the Word.Paragraphs collection. I don't want the results to include tables of contents or other fields; I only want content that a human being would recognize as a header, title or normal text paragraph. However, it turns out that if there's a table of contents, then not only the table of contents itself but EVERY line in it appears as a separate item in Word.Paragraphs. I don't want these, but I haven't been able to find any property on the Paragraph object that would let me distinguish and ignore them. (I'm guessing I need the solution to apply to other field types too, like tables of figures and tables of authorities, which I haven't actually encountered yet but which would potentially cause the same problem.)

Quiche answered 8/7, 2011 at 9:41 Comment(0)
4

Because of limitations in the Word object model, I think the best way to achieve this is to temporarily remove the TOC field, iterate through the Word document, and then re-insert the TOC. In VBA, it would look like this:

Dim doc As Document
Dim fld As Field
Dim rng As Range

Set doc = ActiveDocument

For Each fld In doc.Fields
    If fld.Type = wdFieldTOC Then
        fld.Select
        Selection.Collapse
        Set rng = Selection.Range 'capture place to re-insert TOC later
        fld.Cut
    End If
Next

Iterate through the document to extract the paragraphs, and then re-insert the TOC:

rng.Select 'move the selection back to where the TOC was cut
Selection.Paste

If you are coding in .NET, this should translate pretty closely. It should also work as is for Word 2003 and earlier. In Word 2007/2010, however, the TOC is sometimes surrounded by a Content Control-like region, depending on how it was created, and you may need to write additional code to detect and remove that region.
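If the document only needs to be read and not saved, the restore step can be skipped altogether. The following is a minimal sketch of that variant, not a tested implementation: `ReadTextWithoutToc` and its `filePath` parameter are hypothetical names, and including `wdFieldTOA` (table of authorities) alongside `wdFieldTOC` is an assumption about which field types should be dropped.

```vba
' Sketch: print paragraph text with TOC-like fields removed, without saving.
' ReadTextWithoutToc and filePath are hypothetical names used for illustration.
Public Sub ReadTextWithoutToc(ByVal filePath As String)
    Dim doc As Document
    Dim para As Paragraph
    Dim i As Long

    Set doc = Documents.Open(FileName:=filePath, ReadOnly:=True)

    ' Walk the Fields collection backwards so that deleting a field
    ' does not shift the indexes of fields still to be visited.
    For i = doc.Fields.Count To 1 Step -1
        Select Case doc.Fields(i).Type
            Case wdFieldTOC, wdFieldTOA
                doc.Fields(i).Delete
        End Select
    Next i

    For Each para In doc.Paragraphs
        Debug.Print para.Range.Text
    Next para

    ' Discard the in-memory deletions; the file on disk is not modified.
    doc.Close SaveChanges:=wdDoNotSaveChanges
End Sub
```

Deleting by index instead of with `For Each` avoids modifying the collection while it is being enumerated, which matters when a document contains more than one table.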

Conquistador answered 4/8, 2011 at 4:21 Comment(1)
Thanks, that's exactly what I needed - I tried it out and your code worked, allowing me to remove the table of contents nicely. I actually simplified your code a bit more, since I wasn't going to save the document, so didn't need to select the range to restore it at the end - just closed the document with SaveChanges:=False once I'd read the text I needed. - Quiche
3

This is not guaranteed to work, but if the standard Word styles are used for the TOC (highly likely) and no one has added a custom style whose name begins with "TOC", you can test the paragraph's style name. It is a crude approach, but workable.

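Dim parCurrentParagraph As Paragraph

For Each parCurrentParagraph In ActiveDocument.Paragraphs
    'Standard TOC styles have names that start with "TOC", e.g. "TOC 1"
    If Left(parCurrentParagraph.Format.Style.NameLocal, 3) = "TOC" Then
        '    Do something, e.g. skip this paragraph
    End If
Next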
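One caveat with the name test is that `NameLocal` returns the style name in the user's language, so the "TOC" prefix may not match in non-English installations of Word. A hedged alternative is to compare against the built-in style constants instead. The sketch below assumes the documents use the built-in styles; `UsesBuiltInTocStyle` is a hypothetical helper name, and the choice of which constants to include is an assumption.

```vba
' Sketch: True when para uses one of Word's built-in TOC-like styles.
' UsesBuiltInTocStyle is a hypothetical helper name.
Public Function UsesBuiltInTocStyle(ByVal para As Paragraph) As Boolean
    Dim builtIns As Variant
    Dim styleId As Variant
    Dim doc As Document

    builtIns = Array(wdStyleTOC1, wdStyleTOC2, wdStyleTOC3, wdStyleTOC4, _
                     wdStyleTOC5, wdStyleTOC6, wdStyleTOC7, wdStyleTOC8, _
                     wdStyleTOC9, wdStyleTableOfFigures, wdStyleTableOfAuthorities)
    Set doc = para.Range.Document

    For Each styleId In builtIns
        ' Styles(constant) looks up the built-in style by identifier,
        ' so the comparison does not depend on the localized name.
        If para.Style.NameLocal = doc.Styles(styleId).NameLocal Then
            UsesBuiltInTocStyle = True
            Exit Function
        End If
    Next styleId
End Function
```

Inside a loop over `ActiveDocument.Paragraphs`, a call such as `If Not UsesBuiltInTocStyle(para) Then ...` would keep only the paragraphs that are not styled as table entries.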
Ire answered 16/8, 2016 at 21:3 Comment(0)
0

What you could do is create a custom style for each section of your document.

Custom styles in Word 2003 (not sure which version of Word you're using)

Then, when iterating through your paragraph collection you can check the .Style property and safely ignore it if it equals your TOCStyle.

I believe the same technique would work fine for Tables as well.

Alleman answered 10/7, 2011 at 2:21 Comment(3)
Thanks. But unfortunately that solution isn't possible because I have no control over the Word documents themselves. I'm attempting to extract information from documents that have already been prepared, so I have to work with whatever styles happen to be already in them - which generally means the standard Word styles. - Quiche
Are there any consistent formatting rules applied to the TOC? - Alleman
Unfortunately no, the Word docs I'm processing come from a wide variety of sources, so I cannot rely on any consistent formatting rules. I'm basically relying on styles to distinguish headers and titles etc. I'm aware that will produce occasional bad results where documents are inappropriately styled - that's acceptable. But where I can use general principles to improve the results I'd like to, and being able to distinguish that a paragraph is part of a field like a table of contents would be a considerable improvement. - Quiche
0

The following Function will return a Range object that begins after any Table of Contents or Table of Figures. You can then use the Paragraphs property of the returned Range:

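Private Function GetMainTextRange() As Range
    Dim toc As TableOfContents
    Dim tof As TableOfFigures
    Dim mainTextStart As Long

    'Character positions are zero-based; starting at 1 would drop the
    'first character when the document has no tables at all
    mainTextStart = 0
    For Each toc In ActiveDocument.TablesOfContents
        If toc.Range.End > mainTextStart Then
            mainTextStart = toc.Range.End + 1
        End If
    Next
    For Each tof In ActiveDocument.TablesOfFigures
        If tof.Range.End > mainTextStart Then
            mainTextStart = tof.Range.End + 1
        End If
    Next

    'Guard against a table that sits at the very end of the document
    If mainTextStart > ActiveDocument.Range.End Then
        mainTextStart = ActiveDocument.Range.End
    End If

    Set GetMainTextRange = ActiveDocument.Range(mainTextStart, ActiveDocument.Range.End)
End Function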
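Skipping everything up to the last table also skips any title or heading that comes before it. If that text matters, a per-paragraph test may be a better fit. The sketch below checks whether a paragraph overlaps any table of contents or table of figures by comparing character positions. `OverlapsTocOrTof` is a hypothetical helper name, and the overlap comparison (instead of strict containment) is an assumption made so that the last entry, whose paragraph mark may lie just outside the table's range, is still caught.

```vba
' Sketch: True when para overlaps a table of contents or table of figures.
' OverlapsTocOrTof is a hypothetical helper name.
Public Function OverlapsTocOrTof(ByVal para As Paragraph) As Boolean
    Dim doc As Document
    Dim toc As TableOfContents
    Dim tof As TableOfFigures
    Dim pStart As Long
    Dim pEnd As Long

    Set doc = para.Range.Document
    pStart = para.Range.Start
    pEnd = para.Range.End

    For Each toc In doc.TablesOfContents
        ' Two ranges overlap when each one starts before the other ends.
        If pStart < toc.Range.End And pEnd > toc.Range.Start Then
            OverlapsTocOrTof = True
            Exit Function
        End If
    Next toc

    For Each tof In doc.TablesOfFigures
        If pStart < tof.Range.End And pEnd > tof.Range.Start Then
            OverlapsTocOrTof = True
            Exit Function
        End If
    Next tof
End Function
```

When iterating `ActiveDocument.Paragraphs`, paragraphs for which this returns True can simply be skipped, which leaves text before and after the tables intact.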
Postmeridian answered 24/8, 2022 at 11:17 Comment(0)
