convore.json/groups/datasearchstuff/normalization-aggregate-similar-words-batch-and-on-the-fly/messages.json

[{"user_id": 14741, "stars": [], "topic_id": 48648, "date_created": 1323620287.264081, "message": "Think about music collections. We have some categorization, like last.fm tags (http://www.last.fm/music/Francesco+De+Gregori/+tags), or a user's music collection that has 9 different rock genres (punk rock, hard rock, rock 80s, prog-rock, prog rock...).", "group_id": 12421, "id": 2739505}, {"user_id": 14741, "stars": [], "topic_id": 48648, "date_created": 1323620351.1909831, "message": "Let's imagine a search like \"year = 1990\" where I want to facet over the music genre. But instead of a facet for each possible rock sub-genre, I want to cluster all those sub-genres into just \"Rock\".", "group_id": 12421, "id": 2739506}, {"user_id": 14741, "stars": [], "topic_id": 48648, "date_created": 1323620674.023531, "message": "I'd also be happy with a two-pass solution: first I normalize the data, extracting some macro-genres from the low-level genres. Either way, I need the right algorithm to process the data, and I think Lucene may have something for that. Maybe something simple like: search by genre = rock and use the score of the results.", "group_id": 12421, "id": 2739507}, {"user_id": 14741, "stars": [], "topic_id": 48648, "date_created": 1323620682.7872231, "message": "Any ideas or experiences?", "group_id": 12421, "id": 2739508}, {"user_id": 10340, "stars": [], "topic_id": 48648, "date_created": 1323637852.492413, "message": "better: 'tokens' instead of terms/words", "group_id": 12421, "id": 2739581}, {"user_id": 10340, "stars": [], "topic_id": 48648, "date_created": 1323637811.418407, "message": "No experience in this area, but I would tackle the problem as you described: use a standard analyzer directly to get an array of terms for the low-level genres => [punk, rock, hard, 80s, prog].\nThen I would remove all terms with a 'low' term frequency. But if you also want to detect two- or three-word terms, then you should use the shingle token filter: http://www.elasticsearch.org/guide/reference/index-modules/analysis/shingle-tokenfilter.html", "group_id": 12421, "id": 2739579}]