convore.json/groups/hackreduce/new-dataset-for-hackreduce-3-boston/messages.json
[{"user_id": 36648, "stars": [], "topic_id": 40039, "date_created": 1308687977.8885751, "message": "I'm wondering if anyone would be interested in collaborating with me to create a new text-based dataset that would consist of ~2,900 Chinese county websites? The county is an increasingly important unit of governance in China, but no one has a clear idea of how counties differ, e.g., in terms of their relative focus on economic development vs. social welfare, or in terms of their level of independence from higher levels such as the city or prefecture. Counties are mandated by central authorities to host websites that provide information on history, leadership, policies, etc. for the general public. The idea is to scrape textual data from these websites, which range in size from 100 to over 200,000 pages. The resulting data set would provide a really unique look at how China works and open up the possibility of lots of different types of analysis, e.g., economic, political, policy-oriented, etc. If you're interested in the idea, please get in touch with me at jjpan -at- fas -dot- harvard -dot- edu. Comments/thoughts also appreciated. Thanks!!", "group_id": 9834, "id": 1451809}, {"user_id": 35186, "stars": [], "topic_id": 40039, "date_created": 1308698361.0165091, "message": "What do you plan to use to scrape from 2900 websites hosted in China?", "group_id": 9834, "id": 1453247}, {"user_id": 36648, "stars": [], "topic_id": 40039, "date_created": 1308710073.7787409, "message": "I'm quite new to this and am definitely open to other suggestions, but I am planning on using Python, either Scrapy or a combination of mechanize and BeautifulSoup", "group_id": 9834, "id": 1454301}, {"user_id": 35186, "stars": [], "topic_id": 40039, "date_created": 1308745300.3555839, "message": "I was the lead guy behind the Tira modd database: http://www.theregister.co.uk/2008/04/23/tira_modd/\nWe used WebQL and Groovy to scrape ~250 websites to harvest handset data. The scrape ran only once every Sunday, to avoid getting blacklisted and also because weekly updates were sufficient for us. That was in 2007. Today I'd consider using Scala. What you want to do is a much bigger challenge than what I had to do. Good luck with it.", "group_id": 9834, "id": 1456522}, {"user_id": 36648, "stars": [], "topic_id": 40039, "date_created": 1308777178.596312, "message": "Thanks William", "group_id": 9834, "id": 1460335}, {"user_id": 33069, "stars": [], "topic_id": 40039, "date_created": 1308782610.495044, "message": "Hope we find someone who'd be good at this. Just sent out an email to all the participants about the discussions on Convore, maybe more people will find this.", "group_id": 9834, "id": 1460943}, {"user_id": 33069, "stars": [], "topic_id": 40039, "date_created": 1308782858.0581429, "message": "I changed the topic to make it more descriptive", "group_id": 9834, "id": 1460966}, {"user_id": 35186, "stars": [], "topic_id": 40039, "date_created": 1314654374.590596, "message": "Jen, another suggestion for your scraping project is to use node.js. I participated in the node knockout contest over the weekend and my entry was http://nodeknockout.com/teams/toronto  I built the collage you see in my app from images scraped using the jsdom node module. It's pretty cool because it lets you do jQuery Sizzle parsing server side.", "group_id": 9834, "id": 1996854}]