[ACCEPTED]-Return text string from physical coordinates in a PDF with Python-pdf
I've been writing a library, pdfquery, to try to simplify this process. To extract text from a particular place on a particular page, you would do:
import pdfquery

pdf = pdfquery.PDFQuery(file)
# load first, third, fourth pages
pdf.load(0, 2, 3)
# find text between 100 and 300 points from the bottom-left corner of the first page
text = pdf.pq('LTPage[page_index="0"] :in_bbox("100,100,300,300")').text()
# save the tree as XML to figure out why the last line didn't work the way you expected :)
pdf.tree.write(filename, pretty_print=True)
If you want to find individual characters within that box, instead of text lines that sit entirely within it, pass merge_tags=None to PDFQuery. By default it merges consecutive characters into a single element to make the tree less ridiculous, so the whole line would have to be inside the box to match. If you want to find anything that partially overlaps the box, use :overlaps_bbox instead of :in_bbox.
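To make the difference between the two selectors concrete, here is a plain-Python sketch of the two geometric tests. It is not pdfquery's own code: the helper names parse_bbox, in_bbox and overlaps_bbox are made up for illustration, and treating edges that merely touch as a match is an assumption.

```python
from typing import Tuple

# (x0, y0, x1, y1) in points, measured from the bottom-left corner of the page
BBox = Tuple[float, float, float, float]


def parse_bbox(spec: str) -> BBox:
    """Turn a string such as "100,100,300,300" into a tuple of four floats."""
    x0, y0, x1, y1 = (float(part) for part in spec.split(","))
    return (x0, y0, x1, y1)


def in_bbox(element: BBox, box: BBox) -> bool:
    """True when the element lies entirely inside the box."""
    return (element[0] >= box[0] and element[1] >= box[1]
            and element[2] <= box[2] and element[3] <= box[3])


def overlaps_bbox(element: BBox, box: BBox) -> bool:
    """True when the element and the box intersect at all (assumed: touching counts)."""
    return (element[0] <= box[2] and element[2] >= box[0]
            and element[1] <= box[3] and element[3] >= box[1])


box = parse_bbox("100,100,300,300")
line = (90.0, 150.0, 250.0, 162.0)   # a merged text line that starts left of the box
char = (120.0, 150.0, 126.0, 162.0)  # a single character inside the box

print(in_bbox(line, box), overlaps_bbox(line, box))  # False True
print(in_bbox(char, box), overlaps_bbox(char, box))  # True True
```

The merged line is rejected by the strict test because it pokes out of the box on the left, which is the situation where merge_tags=None or :overlaps_bbox helps.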
This is basically using PyQuery selector syntax to grab text from a PDFMiner layout, so if your document is too messy for PDFMiner, it may be too messy for this as well, but at least it will be faster to play with.
In particular, take a look at the method parse_lt_objs(). In the final loop, k should be a pair containing the coordinates of that bit of text (and it is discarded). I don't have a working coordinate extractor to post here (I was not interested in them), but it sounds like you'll have no trouble finding your way from there.
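As a rough starting point, here is a sketch of keeping the coordinates instead of discarding them. It is not the parse_lt_objs() method mentioned above: collect_text_with_coords and text_with_coords_from_pdf are hypothetical names, and the sketch assumes pdfminer.six, whose layout objects expose a bbox attribute and, for text containers, a get_text() method.

```python
from typing import Iterable, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in points


def collect_text_with_coords(layout_objs: Iterable) -> List[Tuple[BBox, str]]:
    """Return (bbox, text) pairs for every layout object that carries text."""
    results = []
    for obj in layout_objs:
        # Text containers have both get_text() and bbox; lines, rectangles
        # and other non-text objects are skipped.
        if hasattr(obj, "get_text") and hasattr(obj, "bbox"):
            text = obj.get_text().strip()
            if text:
                results.append((tuple(obj.bbox), text))
    return results


def text_with_coords_from_pdf(path: str, page_number: int = 0) -> List[Tuple[BBox, str]]:
    """Run the collector over one page of a PDF (page_number is zero-based)."""
    # Imported lazily so the collector above can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages

    for page_layout in extract_pages(path, page_numbers=[page_number]):
        return collect_text_with_coords(page_layout)
    return []
```

Each returned pair can then be filtered against whatever box you care about, for example by comparing the bbox to the region between 100 and 300 points.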
Good luck with it!