Return text string from physical coordinates in a PDF with Python

Accepted answer
Score: 10

I've been writing a library to try to simplify this process, pdfquery. To extract text from a particular place in a particular page, you would do:

import pdfquery

# `file` is the path to your PDF
pdf = pdfquery.PDFQuery(file)
# load first, third, fourth pages
pdf.load(0, 2, 3)
# find text between 100 and 300 points from left bottom corner of first page
text = pdf.pq('LTPage[page_index=0] :in_bbox("100,100,300,300")').text()
# save tree as XML (`filename` is the output path) to try to figure out why the last line didn't work the way you expected :)
pdf.tree.write(filename, pretty_print=True)

If you want to find individual characters within that box, instead of text lines entirely within that box, pass merge_tags=None to PDFQuery (by default it merges consecutive characters into a single element to make the tree less ridiculous, so the whole line would have to be inside the box). If you want to find anything that partially overlaps the box, use :overlaps_bbox instead of :in_bbox.
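To make the difference between "entirely within" and "partially overlaps" concrete, here is a minimal sketch of the two geometric tests in plain Python. It is not pdfquery's own code: the Box type, the function names, and the sample coordinates are made up for illustration, and how pdfquery treats elements that exactly touch an edge is an assumption here.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Rectangle in PDF points, origin at the bottom-left of the page."""
    x0: float
    y0: float
    x1: float
    y1: float


def is_inside(element: Box, region: Box) -> bool:
    """True when the element lies entirely within the region."""
    return (element.x0 >= region.x0 and element.y0 >= region.y0
            and element.x1 <= region.x1 and element.y1 <= region.y1)


def overlaps(element: Box, region: Box) -> bool:
    """True when the element and the region share any area."""
    return (element.x0 < region.x1 and element.x1 > region.x0
            and element.y0 < region.y1 and element.y1 > region.y0)


region = Box(100, 100, 300, 300)
text_line = Box(250, 150, 420, 162)    # hypothetical line running past the right edge
single_char = Box(250, 150, 257, 162)  # hypothetical first character of that line

print(is_inside(text_line, region))    # False
print(overlaps(text_line, region))     # True
print(is_inside(single_char, region))  # True
```

The merged line sticks out of the box, so a containment test misses it, while its first character on its own passes. That is the situation where either unmerged characters or an overlap test would help.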

This is basically using PyQuery selector syntax to grab text from a PDFMiner layout, so if your document is too messy for PDFMiner, it may be too messy for this as well, but at least it will be faster to play with.

Score: 3

I was able to find my way around pdfminer thanks to some code by Denis Papathanasiou. The code is discussed in his blog, and you can find the source here: layout_scanner.py

In particular, take a look at the method parse_lt_objs(). In the final loop, k should be a pair containing the coordinates of that bit of text (and it is discarded). I don't have a working coordinate extractor to post here (I was not interested in them), but it sounds like you'll have no trouble finding your way from there.
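For anyone who wants a starting point, here is a rough sketch of keeping the coordinates instead of discarding them. It is not taken from layout_scanner.py. It assumes the current pdfminer.six package (its extract_pages helper, plus the bbox attribute and get_text() method on its text layout objects) rather than the older pdfminer API, and the function names are made up.

```python
def text_with_coordinates(layout_objects):
    """Yield (bbox, text) for every layout object that carries text.

    Relies only on the `bbox` attribute and `get_text()` method that
    pdfminer text containers expose; anything without both is skipped.
    """
    for obj in layout_objects:
        if hasattr(obj, "get_text") and hasattr(obj, "bbox"):
            # bbox is (x0, y0, x1, y1) in points from the bottom-left corner
            yield tuple(obj.bbox), obj.get_text().strip()


def scan_page(path, page_number=0):
    """Return (bbox, text) pairs for one page of a PDF (assumes pdfminer.six)."""
    # Imported here so the helper above stays usable without pdfminer installed.
    from pdfminer.high_level import extract_pages

    for page_layout in extract_pages(path, page_numbers=[page_number]):
        return list(text_with_coordinates(page_layout))
    return []
```

Calling scan_page("example.pdf") (a placeholder filename) should give a list of pairs for the first page, which you can then filter by whatever region you care about.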

Good luck with it!
