A simple 3-column by 4-row table is not extracted by this simple turtletext code:
require 'pdf-reader'
require 'pdf/reader/patch/object_hash'
require 'pdf/reader/positional_text_receiver'
require 'pdf/reader/turtletext'
require 'pdf/reader/turtletext/version'
require 'pdf/reader/turtletext/textangle'
require 'pp'
pdf_filename = ARGV[0]
pp(pdf_filename)
options = { :y_precision => 5 }
reader = PDF::Reader::Turtletext.new(pdf_filename, options)
textangle = reader.bounding_box do
  below /__total_income__/
end
# pp textangle
pp textangle.text
The following is the .pdf file:
table-calc.pdf
The output produced is:
ruby2.3 -I/home/evansl/dwnlds/ruby/gems/pdf-reader-turtletext/kieleyt/pdf-reader-turtletext-git/lib -I/home/evansl/dwnlds/ruby/gems/pdf-reader/pdf-reader-git/lib table-calc.rb table-calc.pdf
"table-calc.pdf"
[]
I found that when nothing is put inside the bounding_box do...end block, the output is garbage:
ruby2.3 -I/home/evansl/dwnlds/ruby/gems/pdf-reader-turtletext/kieleyt/pdf-reader-turtletext-git/lib table-simple.rb ~/PDF/table-emacs.pdf
"/home/evansl/PDF/table-emacs.pdf"
[["☞✁✝✌"],
["✎✄☎", "✎✄☎"],
["✑✑", "✑✒"],
["✒✑", "✒✒"],
["✗✑", "✗✒"],
["✛✗", "✛✛"]]
This suggests the characters are not being interpreted correctly.
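Every character in the garbage output above falls in the Unicode range U+2600..U+27BF (Miscellaneous Symbols and Dingbats). A minimal sketch of a check for that symptom follows; the helper name looks_like_symbol_garbage? is hypothetical and not part of pdf-reader or turtletext, and the range test is only a heuristic, since a PDF that really contains such symbols would be flagged too.

```ruby
# Heuristic only: matches any character in U+2600..U+27BF
# (Miscellaneous Symbols and Dingbats).
SYMBOL_RANGE = /[\u2600-\u27BF]/

# Hypothetical helper: rows is an array of arrays of strings,
# the same shape as the output printed by `pp textangle.text`.
def looks_like_symbol_garbage?(rows)
  rows.flatten.any? { |cell| cell =~ SYMBOL_RANGE }
end

garbage = [["☞✁✝✌"], ["✎✄☎", "✎✄☎"], ["✑✑", "✑✒"]]
clean   = [["Jan 11 12"], ["Feb 21 22"]]

puts looks_like_symbol_garbage?(garbage)
puts looks_like_symbol_garbage?(clean)
```

A check like this could run right after extraction to warn that the PDF's fonts are probably not being decoded as expected, before any table parsing is attempted.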
The .rb file does have "coding: utf-8" at the top, but I guess that is not the right setting. I tried other values, such as utf-16, but those produced errors.
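For context, the coding comment in a Ruby file sets the encoding of that source file (and so of its string literals); as far as I know it has no influence on how pdf-reader decodes the text inside a PDF, which would explain why changing it did not help. The sketch below uses only core Ruby to show that the comment affects literals, while bytes coming from elsewhere keep whatever encoding they are tagged with.

```ruby
# coding: utf-8
# The magic comment above only tells Ruby how to read this source file.

literal = "✑"               # a literal takes the source-file encoding
raw     = "\xE2\x9C\x91".b  # the same three bytes, tagged as binary

puts literal.encoding       # encoding of the literal
puts raw.encoding           # encoding of the binary copy

# Same bytes, different encoding tags: the strings do not compare equal.
puts raw == literal

# Re-tagging a copy of the bytes as UTF-8 makes them equal again.
retagged = raw.dup.force_encoding(Encoding::UTF_8)
puts retagged == literal
```

The point of the comparison is that decoding is decided where the bytes are read, not by the script's own coding comment.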
However, generating the PDF with enscript and then ps2pdf14 solved the problem:
enscript -o table-emacs.gs table-emacs.txt ; ps2pdf14 table-emacs.gs table-emacs.pdf
[ 1 page * 1 copy ] left in table-emacs.gs
ruby2.3 -I/home/evansl/dwnlds/ruby/gems/pdf-reader-turtletext/kieleyt/pdf-reader-turtletext-git/lib table-simple.rb table-emacs.pdf
"table-emacs.pdf"
[[" TotalIncome"],
[" CD Bond "],
["month val val"],
["Jan 11 12"],
["Feb 21 22"],
["Mar 31 32"],
["Total-type 63 66"]]
Compilation finished at Fri Jun 22 05:39:21
However, this solution was not obvious :(
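In the working output above, each table row comes back as a single string rather than as separate cells. A minimal sketch of splitting those rows into columns is shown here, assuming that no cell contains whitespace and that data rows have the shape "label integer integer"; parse_data_rows is a hypothetical helper, not part of turtletext.

```ruby
# Rows as printed in the working output above (one string per row).
rows = [
  [" TotalIncome"],
  [" CD Bond "],
  ["month val val"],
  ["Jan 11 12"],
  ["Feb 21 22"],
  ["Mar 31 32"],
  ["Total-type 63 66"]
]

# Hypothetical helper: keep only rows shaped like "<label> <int> <int>"
# and return them as [label, [int, int]] pairs. Title and header rows
# are dropped because their trailing cells are not integers.
def parse_data_rows(rows)
  rows.flatten.map { |line|
    label, *values = line.split
    next unless values.size == 2 && values.all? { |v| v =~ /\A\d+\z/ }
    [label, values.map(&:to_i)]
  }.compact
end

*months, total = parse_data_rows(rows)

# Sum each value column over the month rows and compare with the total row.
sums = months.map(&:last).transpose.map { |col| col.inject(:+) }
puts sums == total.last
```

Comparing the column sums with the Total-type row is a cheap sanity check that the extraction kept the columns in order.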