convore.json/groups/python-scientific-computing/strategy-for-consolidating-the-number-of-datasets-in-pytableshdf5/messages.json
[{"user_id": 10411, "stars": [{"date_created": 1300460234.157737, "user_id": 14562}], "topic_id": 13723, "date_created": 1300423051.1598711, "message": "This is from the PyTables mailing list:\n\n\nHaving data in fewer, but larger, datasets is how you will see i/o gains.  For example, 10 arrays of 1 million elements will perform far better than 1 million arrays of 10 elements. \n\nSo if possible, I would consolidate your datasets.  How you go about doing this depends highly on the structure of the data that you have.  One trick I sometimes use is that I have a single 'info' Table that contains metadata about the rows of (multiple) other data sets.   For example, say that you have weather data coming from multiple different locations, all sampled at possibly different times.  You could have the following structure:\n\nStation1 (Group)\n  |- times (Array)\n  |- temperature (Array)\n  |- windspeed (Array)\n  |- ...\nStation2 (Group)\n  |- times (Array)\n  |- temperature (Array)\n  |- windspeed (Array)\n  |- ...\nStation3 (Group)\n  |- times (Array)\n  |- temperature (Array)\n  |- windspeed (Array)\n  |- ...\n...\n\nBut a better way to do it might be to have an info table, with only one, much larger array for each measurement type:\n\nstation_times (Table):\n   |- station (StringCol)\n   |- time (Time64Col)\ntemperatures (Array)\nwindspeed (Array)\n...\n\nwhere the indices of the station_times table match the indices of the data arrays.", "group_id": 6727, "id": 379821}, {"user_id": 21310, "stars": [], "topic_id": 13723, "date_created": 1300425725.857317, "message": "Hmm, that's a good point. Interesting to me since working with genome data, typically the focus is on matching patterns, and the longer the strings, the worse performance gets. So I tend to chop up the data into smaller pieces as much as I can to avoid doing alignments of long strings. 
And I make new objects out of the smaller pieces, because that makes it easy to handle them, add metadata to them, access that selectively later and so on. But reading this gets me to worry about performance cost... So this may be a stupid question, but is there a ballpark order of magnitude (in terms of number of separate items) where this typically becomes a significant issue? And is parallelizing i/o operations a potential solution? if that makes sense?", "group_id": 6727, "id": 380013}, {"user_id": 10411, "stars": [], "topic_id": 13723, "date_created": 1300427707.8985131, "message": "@scopatz Generally, you want to try to steer away from parallelizing i/o operations yourself if you can.  That space gets very messy very quickly.  (GIL + your data + your disk drive = no fun.)", "group_id": 6727, "id": 380118}, {"user_id": 10411, "stars": [], "topic_id": 13723, "date_created": 1300427621.04494, "message": "@gglobster Unfortunately, it all depends on the structure of data that you are working with.  Personally, I don't have a lot of experience with genome data so I don't have any rules of thumb.\n\nThe reason this came up was because someone was creating 30,000 arrays of length 2000 and (rightly!) stating that this had worse performance than expected.   HDF5 is optimized for the arrays themselves, not for creating them.", "group_id": 6727, "id": 380113}, {"user_id": 7688, "stars": [], "topic_id": 13723, "date_created": 1300452764.7181759, "message": "But accessing large arrays usually needs larger buffer in memory.  Yes, then data structure becomes very important.", "group_id": 6727, "id": 381732}, {"user_id": 21728, "stars": [], "topic_id": 13723, "date_created": 1300455720.0348439, "message": "@yungyuc - only if you need it all in memory at once. 
But often you are processing the data as a stream, so it can be on-demand loaded into memory.", "group_id": 6727, "id": 381981}, {"user_id": 7688, "stars": [], "topic_id": 13723, "date_created": 1300455964.880048, "message": "I think HDF5 should be able to allow reading partial arrays?  But I am not very familiar with the API yet.", "group_id": 6727, "id": 381997}, {"user_id": 7688, "stars": [], "topic_id": 13723, "date_created": 1300455879.9817679, "message": "@nmichaudagrawal It's true, but I guess the first version of the code that most people write would just load the whole array into memory, at least for debugging.", "group_id": 6727, "id": 381992}, {"user_id": 10454, "stars": [{"date_created": 1300467347.8519571, "user_id": 10411}], "topic_id": 13723, "date_created": 1300463198.524076, "message": "Yes, you just slice (at least with PyTables; I think h5py has similar functionality).", "group_id": 6727, "id": 382945}, {"user_id": 10348, "stars": [], "topic_id": 13723, "date_created": 1300463807.0394411, "message": "As Robert mentions, h5py and PyTables have support for this.  My understanding is that h5py exposes much more of the underlying H5S* functionality, which is where the advanced dataspace selection occurs in HDF5.  It also translates slices of (non-loaded) datasets into dataspace selection.", "group_id": 6727, "id": 383034}, {"user_id": 10411, "stars": [], "topic_id": 13723, "date_created": 1300467521.7844989, "message": "Yup, slicing is your friend here.  On the C/C++/Fortran-level you can also select partial datasets (hypercubes).  But the slicing syntax is not supported.  
Python makes loading part of the array into memory much easier.", "group_id": 6727, "id": 383641}, {"user_id": 10411, "stars": [], "topic_id": 13723, "date_created": 1305836212.5693319, "message": "Most people use the standard HDF Group implementation, but you don't have to.", "group_id": 6727, "id": 1115759}, {"user_id": 10421, "stars": [], "topic_id": 13723, "date_created": 1305835056.0568991, "message": "That's great, if you're using pytables to write your hdf5 database...", "group_id": 6727, "id": 1115513}, {"user_id": 10421, "stars": [], "topic_id": 13723, "date_created": 1305837321.5077159, "message": "http://en.wikipedia.org/wiki/Sparse_matrix#Compressed_Sparse_Column_.28CSC_or_CCS.29", "group_id": 6727, "id": 1116006}, {"user_id": 10421, "stars": [], "topic_id": 13723, "date_created": 1305835034.5305409, "message": "Once, I asked :  In HDF5, how problematic is sparse data?\n\nAnd @scopatz, you answered : \nIt isn't a problem at all.  This is what HDF5 Tables are built for.  Sparse data is really just an array of the points you care about.  For example say you have a sparse matrix, \n\nM = [[ 0, 0, 0, 2, ],\n        [ 1, 0, 0, 0, ],\n        [ 0, 0, 0, 0, ] ,\n        [ 0, 0, 42, 0, ],]\n\nThe sparse version is simply:\n\nsparse(M) = [[0, 3, 2],\n                   [1, 0, 1],\n                   [3, 2, 42],]\n\nYou can imagine a sparse hierarchy therefore, where the first several columns are the path to the data and the last columns are the data itself.  In fact, you can cut the data out of the table entirely and store them in separate arrays.  The remaining path table becomes the 'info' table I referred to on the convore thread.", "group_id": 6727, "id": 1115506}, {"user_id": 10421, "stars": [], "topic_id": 13723, "date_created": 1305837051.9039781, "message": "Sorry @scopatz, I'm afraid I'm going to pollute this thread. I know all that, above. I'm afraid my question wasn't clear. 
I'll email you.", "group_id": 6727, "id": 1115924}, {"user_id": 10411, "stars": [], "topic_id": 13723, "date_created": 1305836183.939152, "message": "Really HDF5 is two things. 1) a binary file format specification, and 2) methods for doing i/o with that specification.", "group_id": 6727, "id": 1115746}, {"user_id": 10411, "stars": [], "topic_id": 13723, "date_created": 1305836301.974941, "message": "The C++ API really is just a class hierarchy tacked on top of the C-API.  PyTables uses the C-API directly and tacks on python bindings...so they even use the same implementation.", "group_id": 6727, "id": 1115780}, {"user_id": 10411, "stars": [], "topic_id": 13723, "date_created": 1305836324.197567, "message": "I hope this answered your question...", "group_id": 6727, "id": 1115783}, {"user_id": 10421, "stars": [], "topic_id": 13723, "date_created": 1305835946.50318, "message": "But, what if you're using the h5cpp api? So, I ask, if I create a 3D csc database in HDF5 via the c++ api, is there any reason pytables would have trouble reading it out properly? Certainly, this is one of those things I should try first and ask when there are problems, but in case you were online, I thought I'd check.", "group_id": 6727, "id": 1115709}, {"user_id": 10411, "stars": [], "topic_id": 13723, "date_created": 1305836086.450624, "message": "@katyhuff Pytables will have no problem reading in a database that was written using the HDF5 C++ API.", "group_id": 6727, "id": 1115720}, {"user_id": 10411, "stars": [], "topic_id": 13723, "date_created": 1305837242.296284, "message": "@katyhuff Ok.  Also what does 'csc' mean?", "group_id": 6727, "id": 1115993}, {"user_id": 10421, "stars": [], "topic_id": 13723, "date_created": 1305841036.455755, "message": "I'll say it again @scopatz , you're the wind beneath my wings.", "group_id": 6727, "id": 1116876}, {"user_id": 10411, "stars": [], "topic_id": 13723, "date_created": 1305839875.037508, "message": "Ahh yes, I see.  
The problem is that scipy has a sparse matrix but not a sparse array. I have a dummy sparse array that just needs to be filled out!", "group_id": 6727, "id": 1116643}, {"user_id": 10411, "stars": [], "topic_id": 13723, "date_created": 1305839944.0825529, "message": "import numpy as np\n\nclass csc_array(object):\n\n    def __init__(self, arr, shape, dtype):\n        self._arr = arr\n        self.shape = shape\n        self.dtype = dtype\n\n    def todense(self):\n        dense = np.zeros(self.shape, dtype=self.dtype)\n        dense[tuple(self._arr.T[:-1])] = self._arr.T[-1]\n        return dense\n\n\nif __name__ == '__main__':\n    a = np.array([[1, 2, 0, 42], [0, 1, 0, 16]])\n    c = csc_array(a, (3, 3, 3), float)\n    print(c.todense())\n", "group_id": 6727, "id": 1116658}, {"user_id": 23030, "stars": [], "topic_id": 13723, "date_created": 1305884692.703043, "message": "@scopatz Wouldn't that be a COO format array, in scipy.sparse jargon? CSC is slightly more indexing work.", "group_id": 6727, "id": 1123751}]