I don't understand type conversion. I know this isn't right; all I get is a bunch of hieroglyphs.
f, _ := os.Open("test.pdf")
defer f.Close()
io.Copy(os.Stdout, f)
I want to work with the strings....
I tried several Go PDF libraries and found that sajari/docconv works the way I expect.
It is easy to use; here is an example:
package main

import (
    "fmt"
    "log"

    "code.sajari.com/docconv"
)

func main() {
    res, err := docconv.ConvertPath("your-file.pdf")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(res)
}
brew install poppler and brew install tesseract – Psychrometer
That's because a PDF contains not only the text but also formatting information (fonts, padding, margins, positions, shapes, images).
If you need to read the plain text without formatting, I have forked a repository and implemented a function to do that. You can check it out at https://github.com/ledongthuc/pdf
I have also included an example; I hope it is useful to you.
package main

import (
    "bytes"
    "fmt"

    "github.com/ledongthuc/pdf"
)

func main() {
    content, err := readPdf("test.pdf") // Read local pdf file
    if err != nil {
        panic(err)
    }
    fmt.Println(content)
}

func readPdf(path string) (string, error) {
    r, err := pdf.Open(path)
    if err != nil {
        return "", err
    }
    totalPage := r.NumPage()
    var textBuilder bytes.Buffer
    for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
        p := r.Page(pageIndex)
        if p.V.IsNull() {
            continue
        }
        textBuilder.WriteString(p.GetPlainText("\n"))
    }
    return textBuilder.String(), nil
}
ledongthuc/pdf Git. – Pounce
I get panic: malformed PDF: reading at offset 0: stream not present when I run ``` r, err := pdf.Open(path) r.Page(1).Content() ```, for example with this PDF: cs.utexas.edu/~roshan/CHET.pdf. r.NumPage() and r.Outline() work, though. – Unsteel
"all I get is a bunch of hieroglyphs."
What you get is the raw content of a PDF file, which is not plain text.
If you want to read a PDF file in Go, use one of the Go PDF libraries, such as rsc.io/pdf or yob/pdfreader.
As mentioned here:
I doubt there is any 'solid framework' for this kind of stuff. PDF format isn't meant to be machine-friendly by design, and AFAIK there is no guaranteed way to parse arbitrary PDFs.
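Since no pure-Go parser is guaranteed to handle every file, one pragmatic fallback is to shell out to Poppler's pdftotext (the poppler package mentioned in the comments above). This is only a sketch: it assumes pdftotext is installed and on your PATH, and the helper names pdftotextArgs and extractText are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// pdftotextArgs builds the argument list for pdftotext.
// "-layout" asks it to keep the physical layout of the page,
// and the trailing "-" sends the extracted text to stdout.
func pdftotextArgs(path string) []string {
	return []string{"-layout", path, "-"}
}

// extractText runs pdftotext on path and returns what it printed.
// Assumption: the pdftotext binary from Poppler is on PATH.
func extractText(path string) (string, error) {
	out, err := exec.Command("pdftotext", pdftotextArgs(path)...).Output()
	if err != nil {
		return "", fmt.Errorf("pdftotext %s: %w", path, err)
	}
	return string(out), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: extract <file.pdf>")
		return
	}
	text, err := extractText(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

The result is an ordinary Go string, so from there you can use the strings package as usual. Scanned PDFs with no text layer will come back empty, and those would need OCR instead.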
You can try the pdf2go library (github.com/rudolfoborges/pdf2go):
package main

import (
    "fmt"

    "github.com/rudolfoborges/pdf2go"
)

func main() {
    pdf, err := pdf2go.New("path/to/file.pdf", pdf2go.Config{
        LogLevel: pdf2go.LogLevelError,
    })
    if err != nil {
        panic(err)
    }

    // Text of the whole document.
    text, err := pdf.Text()
    if err != nil {
        panic(err)
    }
    fmt.Println(text)

    // Text page by page.
    pages, err := pdf.Pages()
    if err != nil {
        panic(err)
    }
    for _, page := range pages {
        fmt.Println(page.Text())
    }
}