mirror of
https://github.com/not-kennethreitz/convore.json.git
synced 2026-06-21 07:31:00 +00:00
1 line
1.5 KiB
JSON
[{"user_id": 36801, "stars": [], "topic_id": 40262, "date_created": 1308860315.5108881, "message": "When processing a 1-gram, what is the idiomatic Hadoop way to find it in the geonames dataset? I wouldn't want each 1-gram forcing another Hadoop job to search the geonames dataset, creating N+1 queries.", "group_id": 9834, "id": 1470493}, {"user_id": 36801, "stars": [], "topic_id": 40262, "date_created": 1308860189.048439, "message": "I'll be participating in the Boston Hack/Reduce. I'm most interested in learning Hadoop, and hence am not particularly interested in what I solve this weekend, i.e. I don't want to be struggling with the problem, I'd rather be wrangling the code. To keep things interesting, I want to combine two data sets.\n\nIdea: Process the Google 1-gram dataset, looking for 1-grams which are cities in the geonames dataset. I'm hoping to find out whether some cities are always referenced, a la Paris, or whether there is a distribution across the globe, over time.\n\nThoughts? Any problems? Any takers?", "group_id": 9834, "id": 1470435}, {"user_id": 33069, "stars": [], "topic_id": 40262, "date_created": 1309003962.382859, "message": "I would first create a job that extracts cities out of Geonames and creates a simplified dataset. If that dataset is small enough, I would use DistributedCache to load it into the memory of each task and then do the join with random lookups in a hash. If memory constraints prevent that, come talk to us at the event: you will need to map two data sets and merge them.", "group_id": 9834, "id": 1483769}]