2012-02-21 01:15:00 -05:00


[{"user_id": 36801, "stars": [], "topic_id": 40262, "date_created": 1308860315.5108881, "message": "When processing a 1-gram, what is the idiomatic Hadoop way to look it up in the geonames dataset? I wouldn't want each 1-gram to force another Hadoop job to search the geonames dataset, creating N+1 queries.", "group_id": 9834, "id": 1470493}, {"user_id": 36801, "stars": [], "topic_id": 40262, "date_created": 1308860189.048439, "message": "I'll be participating in the Boston Hack/Reduce. I'm most interested in learning Hadoop, and hence am not particularly interested in what I solve this weekend, i.e. I don't want to be struggling with the problem, I'd rather be wrangling the code. To keep things interesting, I want to combine two data sets.\n\nIdea: Process the Google 1-gram dataset, looking for 1-grams which are cities in the geonames dataset. I'm hoping to find out whether some cities are always referenced, a la Paris, or whether there is a distribution across the globe, over time.\n\nThoughts? Any problems? Any takers?", "group_id": 9834, "id": 1470435}, {"user_id": 33069, "stars": [], "topic_id": 40262, "date_created": 1309003962.382859, "message": "I would first create a job that extracts cities out of Geonames and creates a simplified dataset. If that dataset is small enough, I would use DistributedCache to load it into the memory of each task and then do the join with random lookups in a hash table. If memory constraints prevent that, come talk to us at the event: you will need to map both data sets and merge them.", "group_id": 9834, "id": 1483769}]