5.6. Getting the most from the node LRU cache

Starting with PyTables 1.2, a new LRU cache has been introduced that prevents all the nodes of the object tree from being loaded into memory. This cache is responsible for loading only up to a certain number of nodes and discarding the least recently used ones when new ones need to be loaded. This represents a big advantage over the old scheme, especially in terms of memory usage (there is no need to keep every node in memory), but it also brings very convenient optimizations for interactive work, such as speeding up the opening of files with lots of nodes: almost any kind of file can typically be opened in less than one tenth of a second (compare this with the more than 10 seconds needed for files with more than 10000 nodes in the PyTables pre-1.2 era). See [] for more information on the advantages (and also the drawbacks) of this approach.
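The eviction policy itself is easy to picture. The following minimal sketch (not PyTables' actual implementation, just an illustration of the least-recently-used idea built on Python's OrderedDict) shows how a bounded node cache can discard the node that has gone unused the longest:

```python
from collections import OrderedDict

class NodeLRUCache:
    """Minimal LRU cache sketch for illustration only; the real
    PyTables node cache is more elaborate."""

    def __init__(self, maxnodes):
        self.maxnodes = maxnodes
        self._nodes = OrderedDict()  # insertion order tracks recency

    def get(self, path):
        node = self._nodes.pop(path, None)
        if node is not None:
            self._nodes[path] = node  # re-insert: now most recently used
        return node

    def put(self, path, node):
        self._nodes.pop(path, None)
        self._nodes[path] = node
        if len(self._nodes) > self.maxnodes:
            self._nodes.popitem(last=False)  # evict least recently used

cache = NodeLRUCache(maxnodes=2)
cache.put('/group/a', 'node-a')
cache.put('/group/b', 'node-b')
cache.get('/group/a')            # touch 'a', so 'b' becomes the LRU entry
cache.put('/group/c', 'node-c')  # cache is full: '/group/b' is evicted
```

Once the cache is full, every newly loaded node pushes out whichever node has been idle the longest, which is why access patterns that revisit the same nodes benefit the most.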

One thing that deserves some discussion is the choice of the parameter that sets the maximum number of nodes to be held in memory at any time. As PyTables is meant to be deployed on machines that may have little memory, the default is quite conservative (you can look up its actual value in the NODE_CACHE_SIZE parameter in the module tables/constants.py). However, if you usually deal with files that have many more nodes than this default, and your system has plenty of free memory, you may want to experiment to find the value of NODE_CACHE_SIZE that best fits your needs.
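Assuming NODE_CACHE_SIZE does live in tables/constants.py as described above (later PyTables versions may store such parameters elsewhere, so check your installation), a sketch of inspecting and overriding it might look like this:

```python
# Sketch: override the default node cache size.  This assumes the
# tables/constants.py location described in the text; adjust the
# import for your PyTables version.
import tables.constants

print(tables.constants.NODE_CACHE_SIZE)   # inspect the conservative default
tables.constants.NODE_CACHE_SIZE = 1024   # must be set before opening files
```

Note that the new value only affects files opened after the assignment, since the cache is sized when a file is opened.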

As an example, look at the following code:


	    from tables import openFile

	    def browse_tables(filename):
	        fileh = openFile(filename, 'a')
	        group = fileh.root.newgroup
	        for j in range(10):
	            for tt in fileh.walkNodes(group, "Table"):
	                title = tt.attrs.TITLE
	                for row in tt:
	                    pass
	        fileh.close()
	  

We will run the code above against a couple of files having a /newgroup group containing 100 tables and 1000 tables, respectively. We will run this small benchmark for two different values of the LRU cache size, namely 256 and 1024. You can see the results in table 5.1.

Table 5.1. Retrieval speed and memory consumption as a function of the number of nodes in the LRU cache.

                              100 nodes               1000 nodes
                         Memory (MB)   Time (ms)   Memory (MB)   Time (ms)
  Node is coming from...
  Cache size                256  1024  256   1024   256   1024   256   1024
  From disk                  14    14  1.24  1.24    51     66   1.33  1.31
  From cache                 14    14  0.53  0.52    65     73   1.35  0.68
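To obtain figures like those in table 5.1 on your own files, a simple timing harness can be used. The sketch below is generic; browse_tables is the function from the example above, and the file names in the usage comment are hypothetical:

```python
import time

def benchmark(func, *args):
    """Return the wall-clock seconds taken by one call to func(*args)."""
    start = time.time()
    func(*args)
    return time.time() - start

# Hypothetical usage with the function and files discussed above:
# print(benchmark(browse_tables, 'file-100-tables.h5'))
# print(benchmark(browse_tables, 'file-1000-tables.h5'))
```

Running the harness twice per file, once with a cold cache and once with a warm one, separates the "from disk" and "from cache" rows of the table.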

From the data in table 5.1, one can see that, when the number of objects you are dealing with fits in the cache, you get better access times to them. Also, increasing the node cache size effectively consumes more memory only when the total number of nodes exceeds the number of slots in the cache; otherwise the memory consumption remains the same. It is also worth noting that, if you want to fit all your nodes in the cache, increasing its size does not take much more memory than keeping it too conservative. On the other hand, the speed-up achieved by allocating more slots in the cache may not be worth the additional memory used.

Anyway, if you feel that this issue is important to you, set up your own experiments and fine-tune the NODE_CACHE_SIZE parameter accordingly.