extracts nothing with simple pdf table created with libreoffice calc #9

@cppljevans

A simple table of 3 columns by 4 rows is not extracted by this simple turtletext code:

require 'pdf-reader'
require 'pdf/reader/patch/object_hash'
require 'pdf/reader/positional_text_receiver'

require 'pdf/reader/turtletext'
require 'pdf/reader/turtletext/version'
require 'pdf/reader/turtletext/textangle'

require 'pp'

pdf_filename = ARGV[0]
pp(pdf_filename)
options = { :y_precision => 5}
reader = PDF::Reader::Turtletext.new(pdf_filename,options)
textangle = reader.bounding_box do
  below /__total_income__/
end
#pp textangle
pp textangle.text

The following is the .pdf file:
table-calc.pdf

The output produced is:

ruby2.3 -I/home/evansl/dwnlds/ruby/gems/pdf-reader-turtletext/kieleyt/pdf-reader-turtletext-git/lib -I/home/evansl/dwnlds/ruby/gems/pdf-reader/pdf-reader-git/lib  table-calc.rb table-calc.pdf
"table-calc.pdf"
[]

I found that when nothing is put inside the bounding_box do...end block, the output is garbage:

ruby2.3 -I/home/evansl/dwnlds/ruby/gems/pdf-reader-turtletext/kieleyt/pdf-reader-turtletext-git/lib  table-simple.rb ~/PDF/table-emacs.pdf
"/home/evansl/PDF/table-emacs.pdf"
[["☞✁✝✌"],
 ["✎✄☎", "✎✄☎"],
 ["✑✑", "✑✒"],
 ["✒✑", "✒✒"],
 ["✗✑", "✗✒"],
 ["✛✗", "✛✛"]]

This suggests the characters are not being interpreted correctly.
The .rb file does have:

coding: utf-8

at the top, but I guess that's not right. I've tried other values, such as utf-16, but those caused errors.
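One way to check whether the source-file encoding is really the culprit is to look at the code points of the garbage strings. The sketch below is only an illustration: the helper names codepoints_of and symbol_block? are made up for this example, and the two rows are copied from the output above. If the strings are valid UTF-8 and every character lands in the Miscellaneous Symbols or Dingbats blocks (U+2600 to U+27BF), that would be consistent with the glyph codes in the PDF being mapped through the wrong font encoding during extraction, rather than with a wrong coding: line in the .rb file. That is a guess, not a confirmed diagnosis.

```ruby
# coding: utf-8
# (UTF-8 is already the default source encoding since Ruby 2.0, so this
# magic comment only affects how the literals in this file are read.)

# Two rows copied from the garbage output above.
rows = [["☞✁✝✌"], ["✑✑", "✑✒"]]

# Hypothetical helper: list a string's characters as U+XXXX code points.
def codepoints_of(str)
  str.codepoints.map { |cp| format('U+%04X', cp) }
end

# Hypothetical helper: true when every character lies in the
# Miscellaneous Symbols or Dingbats blocks (U+2600..U+27BF).
def symbol_block?(str)
  str.codepoints.all? { |cp| (0x2600..0x27BF).cover?(cp) }
end

rows.flatten.each do |cell|
  puts "#{cell.inspect} encoding=#{cell.encoding} " \
       "valid=#{cell.valid_encoding?} " \
       "symbols_only=#{symbol_block?(cell)} " \
       "#{codepoints_of(cell).join(' ')}"
end
```

The coding: comment only tells Ruby how to read string literals in the script itself. It does not change how text extracted from the PDF is decoded, which may be why trying other values there did not help.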

However, generating the PDF with enscript and then ps2pdf14 solved the problem:

enscript -o table-emacs.gs table-emacs.txt ; ps2pdf14 table-emacs.gs table-emacs.pdf
[ 1 page * 1 copy ] left in table-emacs.gs
ruby2.3 -I/home/evansl/dwnlds/ruby/gems/pdf-reader-turtletext/kieleyt/pdf-reader-turtletext-git/lib  table-simple.rb table-emacs.pdf
"table-emacs.pdf"
[["   TotalIncome"],
 ["            CD Bond   "],
 ["month      val  val"],
 ["Jan         11   12"],
 ["Feb         21   22"],
 ["Mar         31   32"],
 ["Total-type  63   66"]]

Compilation finished at Fri Jun 22 05:39:21

However, this solution was not obvious :(
