docx4j find and replace

4

4

I have a docx document with some placeholders. I need to replace them with other content and save the result as a new docx document. I started with docx4j and found this method:

public static List<Object> getAllElementFromObject(Object obj, Class<?> toSearch) {
    List<Object> result = new ArrayList<Object>();
    if (obj instanceof JAXBElement) obj = ((JAXBElement<?>) obj).getValue();

    if (obj.getClass().equals(toSearch))
        result.add(obj);
    else if (obj instanceof ContentAccessor) {
        List<?> children = ((ContentAccessor) obj).getContent();
        for (Object child : children) {
            result.addAll(getAllElementFromObject(child, toSearch));
        }
    }
    return result;
}

public static void findAndReplace(WordprocessingMLPackage doc, String toFind, String replacer){
    List<Object> paragraphs = getAllElementFromObject(doc.getMainDocumentPart(), P.class);
    for(Object par : paragraphs){
        P p = (P) par;
        List<Object> texts = getAllElementFromObject(p, Text.class);
        for(Object text : texts){
            Text t = (Text)text;
            if(t.getValue().contains(toFind)){
                t.setValue(t.getValue().replace(toFind, replacer));
            }
        }
    }
}

But this rarely works, because a placeholder is usually split across multiple text runs.

I tried UnmarshallFromTemplate, but it rarely works either.

How can this problem be solved?

Cambodia answered 30/10, 2013 at 7:30 Comment(0)
C
13

You can use VariableReplace to achieve this; it may not have existed at the time of the other answers. It does not do a find/replace per se, but works on placeholders such as ${myField}:

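java.util.HashMap<String, String> mappings = new java.util.HashMap<>();
VariablePrepare.prepare(wordMLPackage); // tidies up the runs so that placeholders are less likely to be split
mappings.put("myField", "foo"); // the key is the bare name, without ${ }
wordMLPackage.getMainDocumentPart().variableReplace(mappings);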

Note that you do not pass ${myField} as the field name; pass the bare field name myField instead. This is rather inflexible: as it currently stands, your placeholders must have the format ${xyz}, whereas if you could pass in anything, you could use it for any find/replace. This feature is also available to C# users in docx4j.NET.

See here for more info on VariableReplace or here for VariablePrepare

Checkerwork answered 21/6, 2015 at 1:58 Comment(2)
I found your answer helpful but it worked only when the variable was with format ${xyz}, not $(xyz).Plaything
Neither of the links work, sadlyTman
S
7

I created a library to publish my solution because it's quite a lot of code: https://github.com/phip1611/docx4j-search-and-replace-util

The workflow is the following:

First step:

// (based on the method from your question, adapted to return List<Text> instead of List<Object>)
List<Text> texts = getAllElementFromObject(docxDocument.getMainDocumentPart(), Text.class);

This way we get all the actual Text content in the correct order, but without the style markup in between. We can edit the Text objects (via setValue) and keep the styles.

Resulting problem: search text/placeholders can be split across multiple Text instances (because the original document can contain style markup in between that is invisible to the reader), e.g. ${FOOBAR}, ${ + FOOBAR}, or $ + {FOOB + AR}
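
To make the problem concrete, here is a small sketch that uses plain strings in place of docx4j Text objects. The class name and the fragment values are made up for illustration. Searching each fragment on its own misses the placeholder, while searching the concatenation finds it.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration; each string stands in for the value of one Text object.
class SplitPlaceholderDemo {

    // True if any single fragment contains the placeholder on its own.
    static boolean foundInSingleFragment(List<String> fragments, String placeholder) {
        for (String fragment : fragments) {
            if (fragment.contains(placeholder)) {
                return true;
            }
        }
        return false;
    }

    // True if the placeholder appears once all fragments are joined in document order.
    static boolean foundInCompleteString(List<String> fragments, String placeholder) {
        return String.join("", fragments).contains(placeholder);
    }

    public static void main(String[] args) {
        // "${FOOBAR}" split as "$" + "{FOOB" + "AR}", as in the example above
        List<String> fragments = Arrays.asList("Hello $", "{FOOB", "AR}!");
        System.out.println(foundInSingleFragment(fragments, "${FOOBAR}"));
        System.out.println(foundInCompleteString(fragments, "${FOOBAR}"));
    }
}
```

Running main prints false and then true, which is exactly why a per-Text search is not enough.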

Second step:

Concatenate the values of all Text objects into one "complete string":

Optional<String> completeStringOpt = texts.stream().map(Text::getValue).reduce(String::concat);

Third step:

Create a class TextMetaItem. Each TextMetaItem knows, for its Text object, where that object's content begins and ends in the complete string. E.g. if the Text objects for "foo" and "bar" result in the complete string "foobar", then indices 0-2 belong to the "foo" Text object and 3-5 to the "bar" Text object. Build a List<TextMetaItem>:

static List<TextMetaItem> buildMetaItemList(List<Text> texts) {
    List<TextMetaItem> list = new ArrayList<>();
    int index = 0; // start of the current Text's value in the complete string
    for (int position = 0; position < texts.size(); position++) {
        Text text = texts.get(position);
        int length = text.getValue().length();
        // start and end are inclusive indices in the complete string
        list.add(new TextMetaItem(index, index + length - 1, text, position));
        index += length;
    }
    return list;
}
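
The TextMetaItem class itself is not shown in this answer. A minimal sketch of what it might look like follows. The field names, getStart(), and the tiny Text stand-in (used only so the example compiles without docx4j) are assumptions inferred from how the class is used here, not the library's actual code. getPositionInsideTextObject is assumed to translate an index in the complete string into an index inside the Text object's own value.

```java
// Stand-in for org.docx4j.wml.Text so that the sketch compiles without docx4j (assumption).
class Text {
    private String value;

    Text(String value) { this.value = value; }

    String getValue() { return value; }

    void setValue(String value) { this.value = value; }
}

// Hypothetical shape of TextMetaItem, inferred from its usage in this answer.
class TextMetaItem {
    private final int start;    // first index in the complete string (inclusive)
    private final int end;      // last index in the complete string (inclusive)
    private final Text text;    // the Text object this item describes
    private final int position; // index of the Text object in the list of all Texts

    TextMetaItem(int start, int end, Text text, int position) {
        this.start = start;
        this.end = end;
        this.text = text;
        this.position = position;
    }

    int getStart() { return start; }

    int getEnd() { return end; }

    Text getText() { return text; }

    int getPosition() { return position; }

    // Converts an index in the complete string into an index inside this Text's value.
    int getPositionInsideTextObject(int indexInCompleteString) {
        if (indexInCompleteString < start || indexInCompleteString > end) {
            throw new IndexOutOfBoundsException(
                    "Index " + indexInCompleteString + " is outside " + start + ".." + end);
        }
        return indexInCompleteString - start;
    }
}
```

With the "foo"/"bar" example above, the item for "bar" would be created as new TextMetaItem(3, 5, barText, 1), and index 3 of the complete string maps to index 0 inside "bar".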

Fourth step:

Build a Map<Integer, TextMetaItem> where the key is a character index in the complete string. This means the map's size equals completeString.length().

static Map<Integer, TextMetaItem> buildStringIndicesToTextMetaItemMap(List<Text> texts) {
    List<TextMetaItem> metaItemList = buildMetaItemList(texts);
    Map<Integer, TextMetaItem> map = new TreeMap<>();
    if (metaItemList.isEmpty()) {
        return map;
    }
    int currentStringIndicesToTextIndex = 0;
    // getEnd() is inclusive, so + 1 gives the length of the complete string
    int max = metaItemList.get(metaItemList.size() - 1).getEnd() + 1;
    for (int i = 0; i < max; i++) {
        // advance past items that end before i; this also skips empty Text objects
        while (metaItemList.get(currentStringIndicesToTextIndex).getEnd() < i) {
            currentStringIndicesToTextIndex++;
        }
        map.put(i, metaItemList.get(currentStringIndicesToTextIndex));
    }
    return map;
}

Interim result:

Now you have enough metadata to delegate every action you want to perform on the complete string to the corresponding Text object. To change the content of a Text object you just need to call setValue(). That's all that's needed in docx4j to edit text; all style info etc. will be preserved.

Last step: search and replace

  1. Build a method that finds all occurrences of your possible placeholders. You should create a class like FoundResult that stores the start and end indices of a found value (placeholder) in the complete string.

    public static List<FoundResult> findAllOccurrencesInString(String data, String search) {
        List<FoundResult> list = new ArrayList<>();
        if (search.isEmpty()) {
            // an empty search string would match everywhere and the loop would never advance
            return list;
        }
        int index = data.indexOf(search);
        while (index != -1) {
            // index is the start of the match in the complete string; the end index
            // is presumably derived inside FoundResult from the length of search
            list.add(new FoundResult(index, search));
            // continue behind this match, so that matches never overlap
            index = data.indexOf(search, index + search.length());
        }
        return list;
    }
    

    Using this, I build a list of ReplaceCommands. A ReplaceCommand is a simple class that stores a FoundResult and the new value.

  2. Next, order this list from the last item to the first (ordered by position in the complete string).

  3. Now you can write a replace-all algorithm, because you know which action needs to be done on which Text object. Step 2 ensures that a replace operation won't invalidate the indices of the other FoundResults.

    3.1. Find the Text object(s) that need to be changed.
    3.2. Call getValue() on them.
    3.3. Edit the string to the new value.
    3.4. Call setValue() on the Text objects.
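
Steps 1 and 2 could look roughly like the following sketch. FoundResult and ReplaceCommand are not shown in this answer, so the minimal versions below are assumptions; in particular, it is assumed that FoundResult takes the start index plus the search string and derives an inclusive end index from them. ReplaceCommandBuilder is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical minimal FoundResult: start index plus an inclusive end index.
class FoundResult {
    private final int start;
    private final int end;

    FoundResult(int start, String value) {
        this.start = start;
        this.end = start + value.length() - 1;
    }

    int getStart() { return start; }

    int getEnd() { return end; }
}

// Hypothetical minimal ReplaceCommand: where to replace and what to put there.
class ReplaceCommand {
    private final FoundResult foundResult;
    private final String newValue;

    ReplaceCommand(FoundResult foundResult, String newValue) {
        this.foundResult = foundResult;
        this.newValue = newValue;
    }

    FoundResult getFoundResult() { return foundResult; }

    String getNewValue() { return newValue; }
}

class ReplaceCommandBuilder {

    // Finds every key of the map in the complete string and returns the resulting
    // commands ordered from the last match to the first.
    static List<ReplaceCommand> build(String completeString, Map<String, String> replacements) {
        List<ReplaceCommand> commands = new ArrayList<>();
        for (Map.Entry<String, String> entry : replacements.entrySet()) {
            String search = entry.getKey();
            if (search.isEmpty()) {
                continue; // an empty search string would never advance
            }
            int index = completeString.indexOf(search);
            while (index != -1) {
                commands.add(new ReplaceCommand(new FoundResult(index, search), entry.getValue()));
                index = completeString.indexOf(search, index + search.length());
            }
        }
        // Last match first, so earlier indices stay valid while replacing.
        commands.sort(Comparator
                .comparingInt((ReplaceCommand c) -> c.getFoundResult().getStart())
                .reversed());
        return commands;
    }
}
```

This sketch assumes that no placeholder is a substring of another one; overlapping matches would need extra handling.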

This is the code that does all the magic. It executes a single ReplaceCommand.

    /**
     * @param texts All Text-objects
     * @param replaceCommand Command
     * @param map Lookup-Map from index in complete string to TextMetaItem
     */
    public static void executeReplaceCommand(List<Text> texts, ReplaceCommand replaceCommand, Map<Integer, TextMetaItem> map) {
        TextMetaItem tmi1 = map.get(replaceCommand.getFoundResult().getStart());
        TextMetaItem tmi2 = map.get(replaceCommand.getFoundResult().getEnd());
        if (tmi2.getPosition() - tmi1.getPosition() > 0) {
            // Text objects that lie completely between the first and the last one
            // only contain parts of the placeholder, so we can empty them.
            // An empty string is used instead of null so that later getValue() calls don't fail.
            int upperBorder = tmi2.getPosition();
            int lowerBorder = tmi1.getPosition() + 1;
            for (int i = lowerBorder; i < upperBorder; i++) {
                texts.get(i).setValue("");
            }
        }

        if (tmi1.getPosition() == tmi2.getPosition()) {
            // do replacement inside a single Text-object

            String t1 = tmi1.getText().getValue();
            int beginIndex = tmi1.getPositionInsideTextObject(replaceCommand.getFoundResult().getStart());
            int endIndex = tmi2.getPositionInsideTextObject(replaceCommand.getFoundResult().getEnd());

            String keepBefore = t1.substring(0, beginIndex);
            String keepAfter = t1.substring(endIndex + 1);

            tmi1.getText().setValue(keepBefore + replaceCommand.getNewValue() + keepAfter);
        } else {
            // do replacement across two Text-objects

            // check where to start and replace 
            // the Text-objects value inside both Text-objects
            String t1 = tmi1.getText().getValue();
            String t2 = tmi2.getText().getValue();

            int beginIndex = tmi1.getPositionInsideTextObject(replaceCommand.getFoundResult().getStart());
            int endIndex = tmi2.getPositionInsideTextObject(replaceCommand.getFoundResult().getEnd());

            t1 = t1.substring(0, beginIndex);
            t1 = t1.concat(replaceCommand.getNewValue());
            t2 = t2.substring(endIndex + 1);

            tmi1.getText().setValue(t1);
            tmi2.getText().setValue(t2);
        }
    }
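
To see the whole idea working without a docx file, here is a plain-string simulation. Each list entry plays the role of one Text value; the class and method names are hypothetical and this is a simplified sketch of the approach, not the library's code. Matches are processed from the last to the first, as described in step 2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-string simulation: each list entry stands in for the value of one Text object.
class FragmentReplaceDemo {

    static List<String> replaceAll(List<String> fragments, String search, String newValue) {
        List<String> result = new ArrayList<>(fragments);
        if (search.isEmpty()) {
            return result;
        }
        String complete = String.join("", fragments);

        // starts[i] = index in the complete string where fragment i begins
        int[] starts = new int[fragments.size()];
        int offset = 0;
        for (int i = 0; i < fragments.size(); i++) {
            starts[i] = offset;
            offset += fragments.get(i).length();
        }

        // Collect the start indices of all non-overlapping matches.
        List<Integer> matches = new ArrayList<>();
        for (int idx = complete.indexOf(search); idx != -1;
                idx = complete.indexOf(search, idx + search.length())) {
            matches.add(idx);
        }

        // Replace from the last match to the first so earlier indices stay valid.
        for (int m = matches.size() - 1; m >= 0; m--) {
            int start = matches.get(m);
            int end = start + search.length() - 1; // inclusive
            int first = fragmentAt(starts, fragments, start);
            int last = fragmentAt(starts, fragments, end);
            String head = result.get(first).substring(0, start - starts[first]);
            String tail = result.get(last).substring(end - starts[last] + 1);
            if (first == last) {
                result.set(first, head + newValue + tail);
            } else {
                result.set(first, head + newValue);
                result.set(last, tail);
                for (int i = first + 1; i < last; i++) {
                    result.set(i, ""); // fragments fully inside the match are emptied
                }
            }
        }
        return result;
    }

    // Returns the fragment that owns the given index of the complete string.
    private static int fragmentAt(int[] starts, List<String> fragments, int index) {
        for (int i = 0; i < starts.length; i++) {
            if (index >= starts[i] && index < starts[i] + fragments.get(i).length()) {
                return i;
            }
        }
        throw new IllegalArgumentException("No fragment contains index " + index);
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList("Hi $", "{FOO", "BAR}, bye ${FOOBAR}.");
        System.out.println(String.join("", replaceAll(fragments, "${FOOBAR}", "X")));
    }
}
```

Running main prints "Hi X, bye X.": the first placeholder spans all three fragments, the second sits inside the last one, and the number of fragments (and thus the styling attached to each run) stays the same.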
Slave answered 24/2, 2020 at 21:57 Comment(2)
I just made a quick test with the library and it works fine with java 11Beers
Nice job, seems like a complete solution.Gravely
T
5

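Good day. I made an example of how to quickly replace text using a regexp. It finds placeholders like ${param.sumname} and replaces them in the document. Note that you have to insert the placeholder text as 'text only' (unformatted); otherwise a placeholder may be split across several runs and will not match. Have fun!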

  WordprocessingMLPackage mlp = WordprocessingMLPackage.load(new File("filepath"));
  replaceText(mlp.getMainDocumentPart());

  static void replaceText(ContentAccessor c)
    throws Exception
  {
    for (Object p: c.getContent())
    {
      if (p instanceof ContentAccessor)
        replaceText((ContentAccessor) p);

      else if (p instanceof JAXBElement)
      {
        Object v = ((JAXBElement<?>) p).getValue();

        if (v instanceof ContentAccessor)
          replaceText((ContentAccessor) v);

        else if (v instanceof org.docx4j.wml.Text)
        {
          org.docx4j.wml.Text t = (org.docx4j.wml.Text) v;
          String text = t.getValue();

          if (text != null)
          {
            t.setSpace("preserve"); // keeps leading/trailing whitespace in the text node
            t.setValue(replaceParams(text));
          }
        }
      }
    }
  }

  static Pattern paramPatern = Pattern.compile("(?i)(\\$\\{([\\w\\.]+)\\})");

  static String replaceParams(String text)
  {
    Matcher m = paramPatern.matcher(text);

    if (!m.find())
      return text;

    StringBuffer sb = new StringBuffer();
    String param, replacement;

    do
    {
      param = m.group(2);

      if (param != null)
      {
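        replacement = getParamValue(param);
        // quote the value so that '$' and '\' in it are not treated as group references
        m.appendReplacement(sb, Matcher.quoteReplacement(replacement));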
      }
      else
        m.appendReplacement(sb, "");
    }
    while (m.find());

    m.appendTail(sb);
    return sb.toString();
  }

  static String getParamValue(String name)
  {
    // replace from map or something else
    return name;
  }
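
getParamValue above is only a stub. One possible way to fill it in is sketched below, using a hypothetical ParamReplacer class and a values map passed in by the caller. Names are looked up in the map, and unknown placeholders are left untouched so they stay visible in the output. Matcher.quoteReplacement keeps '$' and '\' in the values from being interpreted as group references.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a variation of replaceParams that takes its values from a Map.
class ParamReplacer {

    // Matches ${name}, where name consists of word characters and dots.
    private static final Pattern PARAM = Pattern.compile("\\$\\{([\\w.]+)\\}");

    static String replaceParams(String text, Map<String, String> values) {
        Matcher m = PARAM.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            // Keep the original placeholder when no value is known for it.
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

For example, with the mappings param.name to "Alice" and price to "$5", the text "Hi ${param.name}, pay ${price}, ${unknown}" becomes "Hi Alice, pay $5, ${unknown}".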
Tumular answered 8/11, 2013 at 11:56 Comment(0)
H
3

This can be a problem. I cover how to mitigate broken-up text runs in this answer here: https://mcmap.net/q/1330014/-docx-template-docx4j-replacing-text-in-java

... but you might want to consider content controls instead. The docx4j source site has various content control samples here:

https://github.com/plutext/docx4j/tree/master/src/samples/docx4j/org/docx4j/samples

Hintze answered 30/10, 2013 at 8:39 Comment(4)
thanks, this works for new documents. When I save old document(where 'rsid' entities existed) it still doesn't work. Is it possible to fix the "old" document?Cambodia
Only via the user interface I think. You'd need to disable the relevant tooling and re-save in Word, which would then blat the rsid entities from the underlying XML. More here: docx4java.org/forums/docx-java-f6/…Hintze
I disabled grammar and spelling checking, turned off rsid, re-saved the document but it is still not working. p.s. this is log: 'Invalid key '</w:t></w:r><w:r><w:rPr><w:b/><w:sz w:val="20"/><w:szCs w:val="20"/><w:lang w:val="en-US"/></w:rPr><w:t>e</w:t></w:r><w:r><w:rPr><w:b/><w:sz w:val="20"/><w:szCs w:val="20"/></w:rPr><w:t>+</w:t></w:r><w:r><w:rPr><w:b/><w:sz w:val="20"/><w:szCs w:val="20"/><w:lang w:val="en-US"/></w:rPr><w:t>x</w:t></w:r><w:r><w:rPr><w:b/><w:sz w:val="20"/><w:szCs w:val="20"/></w:rPr><w:t>001' or key not mapped to a value]]'Cambodia
Yeah that's still broken up clearly. Your key all needs to be in the same text node. If you rename the .docx file suffix to .zip, and then edit the document.xml file therein, you can fix it (not very elegant, but it will get your code running anyway).Hintze
