convore.json/groups/python/creating-a-tool-to-search-for-multiple-words-in-pdf-files-with-python/messages.json
[{"user_id": 10577, "stars": [], "topic_id": 11661, "date_created": 1299663748.5128961, "message": "The tool should accomplish the following tasks: 1. Open pdf-files 2. Search the file for a predefined set of words 3. Highlight the findings, would be cool if highlights were in different colors", "group_id": 292, "id": 303083}, {"user_id": 10577, "stars": [], "topic_id": 11661, "date_created": 1299663595.0983, "message": "Hey, i\u00b4m a programming novice. Fiddled around a bit with C on the Arduino platform. At the moment i have a job where i have to search for a set of words in many pdf files. I have been looking for a tool for 2 hours now and cannot find one. So i thought, why not dive into python and try to code it myself.", "group_id": 292, "id": 303061}, {"user_id": 5582, "stars": [], "topic_id": 11661, "date_created": 1299664175.1321011, "message": "Here's how I'd start:", "group_id": 292, "id": 303146}, {"user_id": 3354, "stars": [], "topic_id": 11661, "date_created": 1299664540.8679919, "message": "A direct answer: yes, it's going to be hard when \"process\" involves reading PDFs, which are not always as logically constructed as one might think. But I can guarantee it would be at least one order of magnitude more difficult in C.", "group_id": 292, "id": 303223}, {"user_id": 10577, "stars": [], "topic_id": 11661, "date_created": 1299665116.4702661, "message": "I can see the highlighting feature is not so easy to accomplish. Would it possibly make sense to convert the .pdf\u00b4s to .txt and write a tool to search and highlight in the raw text? For me, the tool only has to get the job done. 
I\u00b4m just looking for the minimalistic way", "group_id": 292, "id": 303311}, {"user_id": 10577, "stars": [], "topic_id": 11661, "date_created": 1299663632.7010159, "message": "is it hard to process pdf-files with python?", "group_id": 292, "id": 303066}, {"user_id": 10577, "stars": [], "topic_id": 11661, "date_created": 1299664396.998709, "message": "that's a nice starting point, thanks", "group_id": 292, "id": 303203}, {"user_id": 5582, "stars": [], "topic_id": 11661, "date_created": 1299664198.1524761, "message": "Go to pypi.python.org and search for PDF. You'll find quite a few choices http://pypi.python.org/pypi?%3Aaction=search&term=pdf&submit=search", "group_id": 292, "id": 303153}, {"user_id": 5582, "stars": [{"date_created": 1299733227.846519, "user_id": 5778}], "topic_id": 11661, "date_created": 1299664243.7093289, "message": "I picked PDFMiner: http://www.unixuser.org/~euske/python/pdfminer/index.html , which seems to be capable, at least for the opening and searching part. At the bottom of the page you can conveniently find other libraries the author recommends", "group_id": 292, "id": 303167}, {"user_id": 5582, "stars": [], "topic_id": 11661, "date_created": 1299664587.849762, "message": "@jfkreuter you're welcome. On the highlighting part, I'm not so sure it's trivial.", "group_id": 292, "id": 303230}, {"user_id": 5582, "stars": [], "topic_id": 11661, "date_created": 1299665811.776438, "message": "@jfkreuter Do you need to highlight it on the screen only? Or in an output file also?", "group_id": 292, "id": 303419}, {"user_id": 5582, "stars": [{"date_created": 1299733682.678833, "user_id": 5778}], "topic_id": 11661, "date_created": 1299665951.5192571, "message": "@jfkreuter regarding the minimalistic way: you can say \"less file.pdf\" and get the text from file.pdf ... 
but that's only for unix systems I guess", "group_id": 292, "id": 303454}, {"user_id": 10577, "stars": [], "topic_id": 11661, "date_created": 1299693695.2887969, "message": "@0chris sorry, had to study. i would say it\u00b4s not necessary to highlight the words in an output file. It\u00b4s ok if i see them on the screen. i have to do the further analysis with excel (yeah, i know). Also, i\u00b4ll try \"less\" on some pdf\u00b4s and see how the results turn out", "group_id": 292, "id": 307245}, {"user_id": 1822, "stars": [{"date_created": 1299733284.9218049, "user_id": 5778}], "topic_id": 11661, "date_created": 1299701098.173265, "message": "another straight *nix approach: strings xyz.pdf | grep <pattern>", "group_id": 292, "id": 307860}, {"user_id": 10577, "stars": [], "topic_id": 11661, "date_created": 1299704053.437042, "message": "heyhey..i\u00b4m getting near the solution", "group_id": 292, "id": 308250}, {"user_id": 10577, "stars": [], "topic_id": 11661, "date_created": 1299704545.0030439, "message": "i\u00b4m actually on a *nix system (Mac OS) and found this automator script called pdf to text which outputs the text data of the pdf file", "group_id": 292, "id": 308315}, {"user_id": 10577, "stars": [], "topic_id": 11661, "date_created": 1299704721.5473311, "message": "now i can search the .rtf with \"strings xyz.rtf | grep <pattern>\" and i get the lines where <pattern> is located. I think I\u00b4m understanding regular expressions well enough to get grep to search for multiple words. Is there any way to", "group_id": 292, "id": 308325}, {"user_id": 10577, "stars": [], "topic_id": 11661, "date_created": 1299704783.9597099, "message": "show the context of the words? 
Like, let\u00b4s say the 4 words before and after <pattern>?", "group_id": 292, "id": 308333}, {"user_id": 5778, "stars": [], "topic_id": 11661, "date_created": 1299733464.077287, "message": "and -A to specify the number of lines after the matching line", "group_id": 292, "id": 311040}, {"user_id": 5778, "stars": [], "topic_id": 11661, "date_created": 1299733454.3685269, "message": "you also usually have -B to specify the number of lines before the matching line", "group_id": 292, "id": 311039}, {"user_id": 5778, "stars": [], "topic_id": 11661, "date_created": 1299733361.940309, "message": "@jfkreuter I would go with egrep... like \"strings xyz.rtf | egrep '(word1|word2|word3)'\"", "group_id": 292, "id": 311031}, {"user_id": 5778, "stars": [], "topic_id": 11661, "date_created": 1299733401.8814001, "message": "most grep implementations these days have a -C option to specify how many lines of context you want surrounding the line that contains the matching string", "group_id": 292, "id": 311035}, {"user_id": 5778, "stars": [], "topic_id": 11661, "date_created": 1299733533.50931, "message": "not sure how you'd go about just getting X matching words on either side of your needle without some actual programming.... though I am curious if it's possible with regular command line tools!", "group_id": 292, "id": 311044}, {"user_id": 5436, "stars": [], "topic_id": 11661, "date_created": 1299765049.9044559, "message": "There are many pdftotxt tools but I would suggest you use pdfminer to turn it into HTML then use lxml.html to do the search and diffs.", "group_id": 292, "id": 313324}]
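
The open question in the thread (showing a few words on either side of each hit, rather than whole lines as grep's -A/-B/-C do) might be sketched in Python roughly like this. It assumes the PDF has already been converted to plain text by one of the tools mentioned above; the function name `find_in_context` and its parameters are made up for illustration and do not come from any library.

```python
import re


def find_in_context(text, words, window=4):
    """Return (matched_word, snippet) pairs, with up to `window` words
    of context on each side of every match. Hypothetical helper."""
    tokens = text.split()
    wanted = {w.lower() for w in words}
    hits = []
    for i, token in enumerate(tokens):
        # Strip leading/trailing punctuation so "dog," still matches "dog".
        bare = re.sub(r"^\W+|\W+$", "", token).lower()
        if bare in wanted:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            # Mark the hit with brackets instead of colors to keep it simple.
            snippet = " ".join(before + ["[" + token + "]"] + after)
            hits.append((bare, snippet))
    return hits


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog, and the fox sleeps."
    for word, snippet in find_in_context(sample, ["fox", "dog"], window=2):
        print(word, "->", snippet)
```

Matching here is case-insensitive and whole-word only; the text of a converted file could be passed in with something like `open("xyz.txt").read()`, where `xyz.txt` is a placeholder name.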