convore.json/groups/django-community/best-practices-for-rebuilding-large-haystacksolr-indexes/messages.json
[{"user_id": 3580, "stars": [], "topic_id": 44692, "date_created": 1314743025.537781, "message": "Relatively new to Solr & Haystack... I've got a number of indexes, and some of them are pretty costly (in CPU, memory, and time) to rebuild. They're being updated in real-time, so that's not a huge deal.. but it's possible for them to get out of sync if a service or network connection goes down.. and there's also the issue of making changes to the index and wanting to rebuild for that.", "group_id": 81, "id": 2007048}, {"user_id": 3580, "stars": [], "topic_id": 44692, "date_created": 1314743210.1742489, "message": "So far I've been rebuilding nightly, but that's looking unrealistic now.. wondering what others suggest?", "group_id": 81, "id": 2007075}, {"user_id": 28981, "stars": [], "topic_id": 44692, "date_created": 1314795021.6649351, "message": "https://groups.google.com/forum/#!topic/django-haystack/2H0UWhBB9HI", "group_id": 81, "id": 2010687}, {"user_id": 28981, "stars": [], "topic_id": 44692, "date_created": 1314794942.5431581, "message": "How about restricting the nightly rebuild to items updated in the past 24 hrs http://docs.haystacksearch.org/dev/searchindex_api.html#get-updated-field", "group_id": 81, "id": 2010674}, {"user_id": 28981, "stars": [], "topic_id": 44692, "date_created": 1314795027.5967669, "message": "Also how about using celery/queued search index instead of realtimeindex? You can add some retry logic on the task somehow?", "group_id": 81, "id": 2010688}, {"user_id": 3580, "stars": [], "topic_id": 44692, "date_created": 1314826476.6087339, "message": "@sidmitra thanks, yeah.. last 24 hours, etc makes sense.. and that's definitely gonna be part of my strategy.. but I'm wondering what folks do when they want to change the index substantially, or find a bug that suggests some indexed records might be invalid or stale.. The index_queryset() seems limiting.. in that it forces me to give a queryset that might return tons of rows.. 
So it'd seem I'd wanna do something like run tasks that build some reasonable number of chunks at a time, to keep resource overhead down.. Just wondering if that's something folks do, and whether there are hidden helpers in haystack or you have to roll your own.", "group_id": 81, "id": 2014695}, {"user_id": 3580, "stars": [], "topic_id": 44692, "date_created": 1314826567.265101, "message": "Oh, and yeah.. I do use a celery/queued index update for my real-time stuff. I also hooked it up to sort of 'distinct' the records I need to update post-commit so that I don't run tasks that can't see about-to-be-committed data, and I don't index twice if I update a record twice in a transaction. That stuff is working nicely for me.", "group_id": 81, "id": 2014707}, {"user_id": 3580, "stars": [{"date_created": 1315415090.427649, "user_id": 28981}], "topic_id": 44692, "date_created": 1314915582.6344471, "message": "In case anyone's curious, I hacked something together that works nicely. Basically I just wrote a celery task that re-builds the index for a model in batches. It starts at zero, optionally clears the index, builds a batch, then fires off another async task to do the next batch. Combined with my celery decorator that blows up a worker after jobs that I suspect will chew and hold memory.. this approach keeps my footprint way down. I'd still love to hear any other solutions though", "group_id": 81, "id": 2024129}, {"user_id": 28981, "stars": [], "topic_id": 44692, "date_created": 1315415053.9541049, "message": "@phill Would love to see your code on github, and collaborate on it.", "group_id": 81, "id": 2068994}, {"user_id": 28981, "stars": [], "topic_id": 44692, "date_created": 1315414998.5586979, "message": "@phill interesting. I might need to do something like that for a project of mine. I didn't have a project that didn't finish updating the entire index within 24hrs, which was fine for me as a daily cron. 
Although I was still using a queued index to do it faster.", "group_id": 81, "id": 2068990}, {"user_id": 3580, "stars": [], "topic_id": 44692, "date_created": 1315416118.192549, "message": "@sidmitra Here you go: https://gist.github.com/1201168", "group_id": 81, "id": 2069095}, {"user_id": 3580, "stars": [], "topic_id": 44692, "date_created": 1315416360.7012141, "message": "@sidmitra Also.. note that of course if you wanted you could deploy all of the batch-tasks and let your Celery concurrency decide how asynchronous the index is. That might be more desirable for most folks... Since the whole point for me was to control my resources though, I preferred to daisy-chain them and let them build slowly. OH! Also note.. your index_queryset() method should include an order_by.. otherwise the batch-to-batch queries might not sync up! I order mine by '-id' so that I get recent ones first, and so that new entries that come in while I'm building end up in the index, instead of screwing up the batches... but really I only use this to bootstrap, while I don't have users adding new items anyways. Hope it's handy.. suggestions welcome.", "group_id": 81, "id": 2069131}, {"user_id": 1126, "stars": [], "topic_id": 44692, "date_created": 1315425439.8981061, "message": "just to confirm, you _are_ actually updating the index, not trying to rebuild it every time, yes?", "group_id": 81, "id": 2069896}]