convore.json/groups/python/is-there-a-good-html-unescaping-function-in-the-27-stdlib/messages.json
							[{"user_id": 1284, "stars": [], "topic_id": 19634, "date_created": 1302878499.3406351, "message": "Guido van Rossum &lt;guido&#32;&#97;t&#32;python.org&gt;", "group_id": 292, "id": 700778}, {"user_id": 1284, "stars": [], "topic_id": 19634, "date_created": 1302878523.1811769, "message": "f(Guido van Rossum &lt;guido&#32;&#97;t&#32;python.org&gt;) == Guido van Rossum <guido at python.org>", "group_id": 292, "id": 700780}, {"user_id": 1284, "stars": [], "topic_id": 19634, "date_created": 1302878500.1518431, "message": "err, rather", "group_id": 292, "id": 700779}, {"user_id": 1284, "stars": [], "topic_id": 19634, "date_created": 1302878489.598248, "message": "What I want is that:", "group_id": 292, "id": 700777}, {"user_id": 1284, "stars": [], "topic_id": 19634, "date_created": 1302878549.508374, "message": "xml.sax.saxutils.unescape will do the &lt; -> '<', but doesn't get the HTML entities", "group_id": 292, "id": 700784}, {"user_id": 1284, "stars": [], "topic_id": 19634, "date_created": 1302878564.538933, "message": "and if not, is there a good lib out there for this?", "group_id": 292, "id": 700787}, {"user_id": 3354, "stars": [{"date_created": 1302889564.2421639, "user_id": 1284}], "topic_id": 19634, "date_created": 1302885305.8605371, "message": "You could just lift the but you need (washes mouth out with soap).", "group_id": 292, "id": 702283}, {"user_id": 3978, "stars": [{"date_created": 1302889561.7235711, "user_id": 1284}], "topic_id": 19634, "date_created": 1302884229.9668119, "message": ">>> from BeautifulSoup import BeautifulStoneSoup\n>>> BeautifulStoneSoup('&lt;guido&#32;&#97;t&#32;python.org&gt;', convertEntities=BeautifulStoneSoup.HTML_ENTITIES)\n<guido at python.org>", "group_id": 292, "id": 702027}, {"user_id": 1284, "stars": [], "topic_id": 19634, "date_created": 1302885266.0832629, "message": "I'd rather not add a BeautifulSoup dependency, but that was the first library I thought of", "group_id": 292, "id": 702271}, {"user_id": 29156, 
"stars": [], "topic_id": 19634, "date_created": 1302885408.8244779, "message": "not sure if it does what you need though ;]", "group_id": 292, "id": 702321}, {"user_id": 3354, "stars": [], "topic_id": 19634, "date_created": 1302885312.269511, "message": "s/but/bit/", "group_id": 292, "id": 702284}, {"user_id": 29156, "stars": [{"date_created": 1302889565.8921111, "user_id": 1284}], "topic_id": 19634, "date_created": 1302885376.039964, "message": "maybe this can help: http://code.google.com/p/jsonbot/source/browse/jsb/utils/textutils.py", "group_id": 292, "id": 702308}, {"user_id": 8391, "stars": [{"date_created": 1302898310.783107, "user_id": 1243}], "topic_id": 19634, "date_created": 1302895478.8893571, "message": "http://bugs.python.org/issue2927", "group_id": 292, "id": 704486}, {"user_id": 20326, "stars": [], "topic_id": 19634, "date_created": 1303308335.1692059, "message": "@llimllib How are you getting the data out of HTML? Your HTML parser can unescape for you\u2014lxml.html does this by default and BeautifulSoup does this if asked to.\n\nUsually when I hear someone asking this, it's because they're trying to extract data from HTML using regular expressions. However, parsing HTML with regex is not just bad form, it's *provably impossible*.\n\nUsing lxml.html as an HTML parser is preferable to BeautifulSoup. BeautifulSoup is on ultra-low-maintainable mode and the creator wishes he could abandon it. There are no plans for it to ever work well with 3.x, but lxml.html works great on 3.x.", "group_id": 292, "id": 755032}, {"user_id": 1284, "stars": [], "topic_id": 19634, "date_created": 1303309157.3926671, "message": "@mikegraham I don't want to depend on lxml either. In this particular instance, I'm scraping PEP pages, whose headers have text formatted in a particular way which makes them quite easy to grab with a regex, and so I do so. 
Look at just how simple the code is: \n\nhttps://github.com/llimllib/pyphage/blob/master/plugins/pep.py#L28", "group_id": 292, "id": 755168}, {"user_id": 1284, "stars": [], "topic_id": 19634, "date_created": 1303309235.133661, "message": "@mikegraham I'm aware of why to suggest a parsing library instead of regexen, but in this case a hack works and adding a dependency would be overkill, hence me asking if there is a function I could use in the stdlib", "group_id": 292, "id": 755181}, {"user_id": 20326, "stars": [], "topic_id": 19634, "date_created": 1303329784.3645101, "message": "@llimllib You don't seem nearly as aware as you say. =(", "group_id": 292, "id": 760573}, {"user_id": 1284, "stars": [], "topic_id": 19634, "date_created": 1303348638.9915891, "message": "@mikegraham *shrug* wfm", "group_id": 292, "id": 763916}]