How to read bullets from an RTF file

I have an RTF file that contains some bulleted text, as shown in the screenshot below:

[screenshot of the RTF file with bulleted text]

I want to extract the text along with the bullets, but when I print it to the console I get junk values. How do I print exactly the same content to the console? This is what I tried:

public static void main(String[] args) throws IOException, BadLocationException {
    RTFEditorKit rtf = new RTFEditorKit();
    Document doc = rtf.createDefaultDocument();

    FileInputStream fis = new FileInputStream("C:\\Users\\Guest\\Desktop\\abc.rtf");
    InputStreamReader i =new InputStreamReader(fis,"UTF-8");
    rtf.read(i,doc,0);
    System.out.println(doc.getText(0,doc.getLength()));
}

Console output:

[screenshot of the console output showing junk values]

I assumed the junk values were due to the console not supporting the charset, so I tried generating a PDF file instead, but the PDF contains the same junk values. This is the PDF code:

Paragraph de = new Paragraph();
Phrase pde = new Phrase();
pde.add(new Chunk(getText("C:\\Users\\Guest\\Desktop\\abc.rtf"), smallNormal_11));
de.add(pde);

de.getFont().setStyle(BaseFont.IDENTITY_H);
document.add(de);

// Reads the RTF file at the given path and returns its plain text.
public static String getText(String path) throws IOException, BadLocationException {
    RTFEditorKit rtf = new RTFEditorKit();
    Document doc = rtf.createDefaultDocument();

    FileInputStream fis = new FileInputStream(path);
    InputStreamReader i = new InputStreamReader(fis, "UTF-8");
    rtf.read(i, doc, 0);
    String output = doc.getText(0, doc.getLength());
    return output;
}
Lyons answered 15/11, 2016 at 18:36 Comment(5)
Instead of writing to something as complex as a PDF file, write the same thing as your console output to a plain UTF-8 text file, then hex-dump that file to see the actual values being written. - Inadequate
I deleted the itext tag (edit pending moderator approval), because your question is not about iText. It's about RTF. Stephen is absolutely right. Split up your problem: first make sure it works in the console before you even start thinking about PDF. - Okubo
I think he switched to PDF to get around the problem; it's not his final goal. - Olivarez
You can walk through the document elements, starting from doc.getDefaultRootElement(). I expect the bullet type is stored in the paragraph attributes, see Element.getAttributes(). - Birdsong
Is it possible to make your input file (abc.rtf) available somewhere? I would like to test it in my environment. Thanks. - Mossy
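
The element walk suggested in the comments could look roughly like the sketch below. It lists each child of the root element with its text and whatever attributes that element carries. The class name ParagraphDump and the helper describeParagraphs are made up for illustration, and whether RTFEditorKit actually records bullet information in these attributes is not confirmed here, so treat the output as something to inspect rather than a guaranteed answer. The demo uses a hand-built DefaultStyledDocument in place of the parsed abc.rtf.

```java
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.Document;
import javax.swing.text.Element;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

public class ParagraphDump {
    // Hypothetical helper: one entry per child of the root element,
    // holding the element's text followed by its attribute names and values.
    public static List<String> describeParagraphs(Document doc) throws BadLocationException {
        List<String> lines = new ArrayList<>();
        Element root = doc.getDefaultRootElement();
        for (int i = 0; i < root.getElementCount(); i++) {
            Element para = root.getElement(i);
            int start = para.getStartOffset();
            // Clamp the end offset so it never passes the document length.
            int end = Math.min(para.getEndOffset(), doc.getLength());
            StringBuilder sb = new StringBuilder(doc.getText(start, end - start).trim());
            AttributeSet attrs = para.getAttributes();
            Enumeration<?> names = attrs.getAttributeNames();
            while (names.hasMoreElements()) {
                Object name = names.nextElement();
                sb.append(" [").append(name).append('=').append(attrs.getAttribute(name)).append(']');
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in document; in the question this would be the one filled by rtf.read(...).
        Document doc = new DefaultStyledDocument();
        doc.insertString(0, "first\nsecond", null);
        describeParagraphs(doc).forEach(System.out::println);
    }
}
```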

Despite what you said, my guess is that it is a console encoding problem.

Anyway, you can easily check it:

Just replace this line:

    System.out.println(doc.getText(0,doc.getLength()));

with these two lines:

    PrintStream ps = new PrintStream(System.out, true, "UTF-8");
    ps.println(doc.getText(0,doc.getLength()));

This forces the text written to the console to be encoded as UTF-8.
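
As a slightly fuller sketch of the same idea, the snippet below wraps an arbitrary output stream in a UTF-8 PrintStream. The class name Utf8Console and the helper writeUtf8 are made up for illustration; only PrintStream and its three-argument constructor come from the JDK. Keep in mind that this only controls the bytes Java writes. The terminal still has to interpret those bytes as UTF-8, which on a Windows console may require switching the code page first (an assumption about your setup).

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    // Hypothetical helper: writes the text to the given stream as UTF-8 bytes.
    public static void writeUtf8(String text, OutputStream out) throws UnsupportedEncodingException {
        PrintStream ps = new PrintStream(out, true, "UTF-8");
        ps.print(text);
        ps.flush();
    }

    public static void main(String[] args) throws Exception {
        // U+2022 is the standard bullet character.
        writeUtf8("\u2022 first item\n", System.out);
    }
}
```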

If it is still wrong, I would suspect that your file is not fully RTF-compliant.
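
One way to tell a console problem apart from a file or parsing problem is to print the code points of the extracted string instead of the characters themselves, since that output is plain ASCII and does not depend on what the console can display. The sketch below assumes Java 8 or later; the class name CodePointDump and the helper toCodePoints are made up for illustration. A standard bullet would show up as U+2022; any other value where the bullet should be would point at the file or the parser rather than the console.

```java
public class CodePointDump {
    // Hypothetical helper: renders every code point as U+XXXX, separated by spaces.
    public static String toCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // In the question, the argument would be doc.getText(0, doc.getLength()).
        System.out.println(toCodePoints("\u2022 a")); // prints "U+2022 U+0020 U+0061"
    }
}
```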


I ran some tests and your code (the console version; I did not try the PDF one) works well under Linux, where the console is natively UTF-8.

Mossy answered 20/11, 2016 at 22:32 Comment(0)
