mirror of
https://github.com/kennethreitz-archive/redis-docs.git
synced 2026-06-05 15:30:19 +00:00
568 lines
28 KiB
ReStructuredText
568 lines
28 KiB
ReStructuredText
`|Redis Documentation| <index.html>`_
|
||
**VirtualMemorySpecification: Contents**
|
||
`Virtual Memory technical specification <#Virtual%20Memory%20technical%20specification>`_
|
||
`Keys vs Values: what is swapped out? <#Keys%20vs%20Values:%20what%20is%20swapped%20out?>`_
|
||
`How does a swapped value looks like internally <#How%20does%20a%20swapped%20value%20looks%20like%20internally>`_
|
||
`The Swap File <#The%20Swap%20File>`_
|
||
`Transfering objects from memory to swap <#Transfering%20objects%20from%20memory%20to%20swap>`_
|
||
`Loading objects back in memory <#Loading%20objects%20back%20in%20memory>`_
|
||
`How blocking VM works <#How%20blocking%20VM%20works>`_
|
||
`Blocking VM swapping <#Blocking%20VM%20swapping>`_
|
||
`What values to swap when we are out of memory? <#What%20values%20to%20swap%20when%20we%20are%20out%20of%20memory?>`_
|
||
`Blocking VM loading <#Blocking%20VM%20loading>`_
|
||
`Background saving when VM is active <#Background%20saving%20when%20VM%20is%20active>`_
|
||
`The problem with the blocking VM <#The%20problem%20with%20the%20blocking%20VM>`_
|
||
`Threaded VM <#Threaded%20VM>`_
|
||
`I/O Threads <#I/O%20Threads>`_
|
||
`Non blocking VM as probabilistic enhancement of blocking VM <#Non%20blocking%20VM%20as%20probabilistic%20enhancement%20of%20blocking%20VM>`_
|
||
`Blocking clients on swapped keys <#Blocking%20clients%20on%20swapped%20keys>`_
|
||
`Aborting I/O jobs <#Aborting%20I/O%20jobs>`_
|
||
`Questions? <#Questions?>`_
|
||
VirtualMemorySpecification
|
||
==========================
|
||
|
||
#sidebar `RedisInternals <RedisInternals.html>`_
|
||
Virtual Memory technical specification
|
||
======================================
|
||
|
||
This document details the internals of the Redis Virtual Memory
|
||
subsystem. The intended audience is not the final user but
|
||
programmers willing to understand or modify the Virtual Memory
|
||
implementation.
|
||
Keys vs Values: what is swapped out?
|
||
------------------------------------
|
||
|
||
The goal of the VM subsystem is to free memory transferring Redis
|
||
Objects from memory to disk. This is a very generic command, but
|
||
specifically, Redis transfers only objects associated with
|
||
*values*. In order to understand better this concept we'll show,
|
||
using the DEBUG command, how a key holding a value looks from the
|
||
point of view of the Redis internals:
|
||
::
|
||
|
||
redis> set foo bar
|
||
OK
|
||
redis> debug object foo
|
||
Key at:0x100101d00 refcount:1, value at:0x100101ce0 refcount:1 encoding:raw serializedlength:4
|
||
|
||
As you can see from the above output, the Redis top level hash
|
||
table maps Redis Objects (keys) to other Redis Objects (values).
|
||
The Virtual Memory is only able to swap *values* on disk, the
|
||
objects associated to *keys* are always taken in memory: this trade
|
||
off guarantees very good lookup performances, as one of the main
|
||
design goals of the Redis VM is to have performances similar to
|
||
Redis with VM disabled when the part of the dataset frequently used
|
||
fits in RAM.
|
||
How does a swapped value looks like internally
|
||
----------------------------------------------
|
||
|
||
When an object is swapped out, this is what happens in the hash
|
||
table entry:
|
||
|
||
- The key continues to hold a Redis Object representing the key.
|
||
- The value is set to NULL
|
||
|
||
So you may wonder where we store the information that a given value
|
||
(associated to a given key) was swapped out. Just in the key
|
||
object!
|
||
This is how the Redis Object structure *robj* looks like:
|
||
::
|
||
|
||
/* The actual Redis Object */
|
||
typedef struct redisObject {
|
||
void *ptr;
|
||
unsigned char type;
|
||
unsigned char encoding;
|
||
unsigned char storage; /* If this object is a key, where is the value?
|
||
* REDIS_VM_MEMORY, REDIS_VM_SWAPPED, ... */
|
||
unsigned char vtype; /* If this object is a key, and value is swapped out,
|
||
* this is the type of the swapped out object. */
|
||
int refcount;
|
||
/* VM fields, this are only allocated if VM is active, otherwise the
|
||
* object allocation function will just allocate
|
||
* sizeof(redisObjct) minus sizeof(redisObjectVM), so using
|
||
* Redis without VM active will not have any overhead. */
|
||
struct redisObjectVM vm;
|
||
} robj;
|
||
|
||
As you can see there are a few fields about VM. The most important
|
||
one is *storage*, that can be one of this values:
|
||
|
||
- REDIS\_VM\_MEMORY: the associated value is in memory.
|
||
- REDIS\_VM\_SWAPPED: the associated values is swapped, and the
|
||
value entry of the hash table is just set to NULL.
|
||
- REDIS\_VM\_LOADING: the value is swapped on disk, the entry is
|
||
NULL, but there is a job to load the object from the swap to the
|
||
memory (this field is only used when threaded VM is active).
|
||
- REDIS\_VM\_SWAPPING: the value is in memory, the entry is a
|
||
pointer to the actual Redis Object, but there is an I/O job in
|
||
order to transfer this value to the swap file.
|
||
|
||
If an object is swapped on disk (REDIS\_VM\_SWAPPED or
|
||
REDIS\_VM\_LOADING), how do we know where it is stored, what type
|
||
it is, and so forth? That's simple: the *vtype* field is set to the
|
||
original type of the Redis object swapped, while the *vm* field
|
||
(that is a *redisObjectVM* structure) holds information about the
|
||
location of the object. This is the definition of this additional
|
||
structure:
|
||
::
|
||
|
||
/* The VM object structure */
|
||
struct redisObjectVM {
|
||
off_t page; /* the page at which the object is stored on disk */
|
||
off_t usedpages; /* number of pages used on disk */
|
||
time_t atime; /* Last access time */
|
||
} vm;
|
||
|
||
As you can see the structure contains the page at which the object
|
||
is located in the swap file, the number of pages used, and the last
|
||
access time of the object (this is very useful for the algorithm
|
||
that select what object is a good candidate for swapping, as we
|
||
want to transfer on disk objects that are rarely accessed).
|
||
As you can see, while all the other fields are using unused bytes
|
||
in the old Redis Object structure (we had some free bit due to
|
||
natural memory alignment concerns), the *vm* field is new, and
|
||
indeed uses additional memory. Should we pay such a memory cost
|
||
even when VM is disabled? No! This is the code to create a new
|
||
Redis Object:
|
||
::
|
||
|
||
... some code ...
|
||
if (server.vm_enabled) {
|
||
pthread_mutex_unlock(&server.obj_freelist_mutex);
|
||
o = zmalloc(sizeof(*o));
|
||
} else {
|
||
o = zmalloc(sizeof(*o)-sizeof(struct redisObjectVM));
|
||
}
|
||
... some code ...
|
||
|
||
As you can see if the VM system is not enabled we allocate just
|
||
``sizeof(*o)-sizeof(struct redisObjectVM)`` of memory. Given that
|
||
the *vm* field is the last in the object structure, and that this
|
||
fields are never accessed if VM is disabled, we are safe and Redis
|
||
without VM does not pay the memory overhead.
|
||
The Swap File
|
||
-------------
|
||
|
||
The next step in order to understand how the VM subsystem works is
|
||
understanding how objects are stored inside the swap file. The good
|
||
news is that's not some kind of special format, we just use the
|
||
same format used to store the objects in .rdb files, that are the
|
||
usual dump files produced by Redis using the SAVE command.
|
||
The swap file is composed of a given number of pages, where every
|
||
page size is a given number of bytes. This parameters can be
|
||
changed in redis.conf, since different Redis instances may work
|
||
better with different values: it depends on the actual data you
|
||
store inside it. The following are the default values:
|
||
::
|
||
|
||
vm-page-size 32
|
||
vm-pages 134217728
|
||
|
||
Redis takes a "bitmap" (an contiguous array of bits set to zero or
|
||
one) in memory, every bit represent a page of the swap file on
|
||
disk: if a given bit is set to 1, it represents a page that is
|
||
already used (there is some Redis Object stored there), while if
|
||
the corresponding bit is zero, the page is free.
|
||
Taking this bitmap (that will call the page table) in memory is a
|
||
huge win in terms of performances, and the memory used is small: we
|
||
just need 1 bit for every page on disk. For instance in the example
|
||
below 134217728 pages of 32 bytes each (4GB swap file) is using
|
||
just 16 MB of RAM for the page table.
|
||
Transfering objects from memory to swap
|
||
---------------------------------------
|
||
|
||
In order to transfer an object from memory to disk we need to
|
||
perform the following steps (assuming non threaded VM, just a
|
||
simple blocking approach):
|
||
|
||
- Find how many pages are needed in order to store this object on
|
||
the swap file. This is trivially accomplished just calling the
|
||
function ``rdbSavedObjectPages`` that returns the number of pages
|
||
used by an object on disk. Note that this function does not
|
||
duplicate the .rdb saving code just to understand what will be the
|
||
length **after** an object will be saved on disk, we use the trick
|
||
of opening /dev/null and writing the object there, finally calling
|
||
``ftello`` in order check the amount of bytes required. What we do
|
||
basically is to save the object on a virtual very fast file, that
|
||
is, /dev/null.
|
||
- Now that we know how many pages are required in the swap file,
|
||
we need to find this number of contiguous free pages inside the
|
||
swap file. This task is accomplished by the
|
||
``vmFindContiguousPages`` function. As you can guess this function
|
||
may fail if the swap is full, or so fragmented that we can't easily
|
||
find the required number of contiguous free pages. When this
|
||
happens we just abort the swapping of the object, that will
|
||
continue to live in memory.
|
||
- Finally we can write the object on disk, at the specified
|
||
position, just calling the function ``vmWriteObjectOnSwap``.
|
||
|
||
As you can guess once the object was correctly written in the swap
|
||
file, it is freed from memory, the storage field in the associated
|
||
key is set to REDIS\_VM\_SWAPPED, and the used pages are marked as
|
||
used in the page table.
|
||
Loading objects back in memory
|
||
------------------------------
|
||
|
||
Loading an object from swap to memory is simpler, as we already
|
||
know where the object is located and how many pages it is using. We
|
||
also know the type of the object (the loading functions are
|
||
required to know this information, as there is no header or any
|
||
other information about the object type on disk), but this is
|
||
stored in the *vtype* field of the associated key as already seen
|
||
above.
|
||
Calling the function ``vmLoadObject`` passing the key object
|
||
associated to the value object we want to load back is enough. The
|
||
function will also take care of fixing the storage type of the key
|
||
(that will be REDIS\_VM\_MEMORY), marking the pages as freed in the
|
||
page table, and so forth.
|
||
The return value of the function is the loaded Redis Object itself,
|
||
that we'll have to set again as value in the main hash table
|
||
(instead of the NULL value we put in place of the object pointer
|
||
when the value was originally swapped out).
|
||
How blocking VM works
|
||
---------------------
|
||
|
||
Now we have all the building blocks in order to describe how the
|
||
blocking VM works. First of all, an important detail about
|
||
configuration. In order to enable blocking VM in Redis
|
||
``server.vm_max_threads`` must be set to zero. We'll see later how
|
||
this max number of threads info is used in the threaded VM, for now
|
||
all it's needed to now is that Redis reverts to fully blocking VM
|
||
when this is set to zero.
|
||
We also need to introduce another important VM parameter, that is,
|
||
``server.vm_max_memory``. This parameter is very important as it is
|
||
used in order to trigger swapping: Redis will try to swap objects
|
||
only if it is using more memory than the max memory setting,
|
||
otherwise there is no need to swap as we are matching the user
|
||
requested memory usage.
|
||
Blocking VM swapping
|
||
--------------------
|
||
|
||
Swapping of object from memory to disk happens in the cron
|
||
function. This function used to be called every second, while in
|
||
the recent Redis versions on git it is called every 100
|
||
milliseconds (that is, 10 times per second). If this function
|
||
detects we are out of memory, that is, the memory used is greater
|
||
than the vm-max-memory setting, it starts transferring objects from
|
||
memory to disk in a loop calling the function ``vmSwapOneObect``.
|
||
This function takes just one argument, if 0 it will swap objects in
|
||
a blocking way, otherwise if it is 1, I/O threads are used. In the
|
||
blocking scenario we just call it with zero as argument.
|
||
vmSwapOneObject acts performing the following steps:
|
||
|
||
- The key space in inspected in order to find a good candidate for
|
||
swapping (we'll see later what a good candidate for swapping is).
|
||
- The associated value is transfered to disk, in a blocking way.
|
||
- The key storage field is set to REDIS\_VM\_SWAPPED, while the
|
||
*vm* fields of the object are set to the right values (the page
|
||
index where the object was swapped, and the number of pages used to
|
||
swap it).
|
||
- Finally the value object is freed and the value entry of the
|
||
hash table is set to NULL.
|
||
|
||
The function is called again and again until one of the following
|
||
happens: there is no way to swap more objects because either the
|
||
swap file is full or nearly all the objects are already transfered
|
||
on disk, or simply the memory usage is already under the
|
||
vm-max-memory parameter.
|
||
What values to swap when we are out of memory?
|
||
----------------------------------------------
|
||
|
||
Understanding what's a good candidate for swapping is not too hard.
|
||
A few objects at random are sampled, and for each their
|
||
*swappability* is commuted as:
|
||
::
|
||
|
||
swappability = age*log(size_in_memory)
|
||
|
||
The age is the number of seconds the key was not requested, while
|
||
size\_in\_memory is a fast estimation of the amount of memory (in
|
||
bytes) used by the object in memory. So we try to swap out objects
|
||
that are rarely accessed, and we try to swap bigger objects over
|
||
smaller one, but the latter is a less important factor (because of
|
||
the logarithmic function used). This is because we don't want
|
||
bigger objects to be swapped out and in too often as the bigger the
|
||
object the more I/O and CPU is required in order to transfer it.
|
||
Blocking VM loading
|
||
-------------------
|
||
|
||
What happens if an operation against a key associated with a
|
||
swapped out object is requested? For instance Redis may just happen
|
||
to process the following command:
|
||
::
|
||
|
||
GET foo
|
||
|
||
If the value object of the ``foo`` key is swapped we need to load
|
||
it back in memory before processing the operation. In Redis the key
|
||
lookup process is centralized in the ``lookupKeyRead`` and
|
||
``lookupKeyWrite`` functions, this two functions are used in the
|
||
implementation of all the Redis commands accessing the keyspace, so
|
||
we have a single point in the code where to handle the loading of
|
||
the key from the swap file to memory.
|
||
So this is what happens:
|
||
|
||
- The user calls some command having as argumenet a swapped key
|
||
- The command implementation calls the lookup function
|
||
- The lookup function search for the key in the top level hash
|
||
table. If the value associated with the requested key is swapped
|
||
(we can see that checking the *storage* field of the key object),
|
||
we load it back in memory in a blocking way before to return to the
|
||
user.
|
||
|
||
This is pretty straightforward, but things will get more
|
||
*interesting* with the threads. From the point of view of the
|
||
blocking VM the only real problem is the saving of the dataset
|
||
using another process, that is, handling BGSAVE and BGREWRITEAOF
|
||
commands.
|
||
Background saving when VM is active
|
||
-----------------------------------
|
||
|
||
The default Redis way to persist on disk is to create .rdb files
|
||
using a child process. Redis calls the fork() system call in order
|
||
to create a child, that has the exact copy of the in memory
|
||
dataset, since fork duplicates the whole program memory space
|
||
(actually thanks to a technique called Copy on Write memory pages
|
||
are shared between the parent and child process, so the fork() call
|
||
will not require too much memory).
|
||
In the child process we have a copy of the dataset in a given point
|
||
in the time. Other commands issued by clients will just be served
|
||
by the parent process and will not modify the child data.
|
||
The child process will just store the whole dataset into the
|
||
dump.rdb file and finally will exit. But what happens when the VM
|
||
is active? Values can be swapped out so we don't have all the data
|
||
in memory, and we need to access the swap file in order to retrieve
|
||
the swapped values. While child process is saving the swap file is
|
||
shared between the parent and child process, since:
|
||
|
||
- The parent process needs to access the swap file in order to
|
||
load values back into memory if an operation against swapped out
|
||
values are performed.
|
||
- The child process needs to access the swap file in order to
|
||
retrieve the full dataset while saving the data set on disk.
|
||
|
||
In order to avoid problems while both the processes are accessing
|
||
the same swap file we do a simple thing, that is, not allowing
|
||
values to be swapped out in the parent process while a background
|
||
saving is in progress. This way both the processes will access the
|
||
swap file in read only. This approach has the problem that while
|
||
the child process is saving no new values can be transfered on the
|
||
swap file even if Redis is using more memory than the max memory
|
||
parameters dictates. This is usually not a problem as the
|
||
background saving will terminate in a short amount of time and if
|
||
still needed a percentage of values will be swapped on disk ASAP.
|
||
An alternative to this scenario is to enable the Append Only File
|
||
that will have this problem only when a log rewrite is performed
|
||
using the BGREWRITEAOF command.
|
||
The problem with the blocking VM
|
||
--------------------------------
|
||
|
||
The problem of blocking VM is that... it's blocking :) This is not
|
||
a problem when Redis is used in batch processing activities, but
|
||
for real-time usage one of the good points of Redis is the low
|
||
latency. The blocking VM will have bad latency behaviors as when a
|
||
client is accessing a swapped out value, or when Redis needs to
|
||
swap out values, no other clients will be served in the meantime.
|
||
Swapping out keys should happen in background. Similarly when a
|
||
client is accessing a swapped out value other clients accessing in
|
||
memory values should be served mostly as fast as when VM is
|
||
disabled. Only the clients dealing with swapped out keys should be
|
||
delayed.
|
||
All this limitations called for a non-blocking VM implementation.
|
||
Threaded VM
|
||
-----------
|
||
|
||
There are basically three main ways to turn the blocking VM into a
|
||
non blocking one.
|
||
|
||
- 1: One way is obvious, and in my opionion, not a good idea at
|
||
all, that is, turning Redis itself into a theaded server: if every
|
||
request is served by a different thread automatically other clients
|
||
don't need to wait for blocked ones. Redis is fast, exports atomic
|
||
operations, has no locks, and is just 10k lines of code,
|
||
**because** it is single threaded, so this was not an option for
|
||
me.
|
||
- 2: Using non-blocking I/O against the swap file. After all you
|
||
can think Redis already event-loop based, why don't just handle
|
||
disk I/O in a non-blocking fashion? I also discarded this
|
||
possiblity because of two main reasons. One is that non blocking
|
||
file operations, unlike sockets, are an incompatibility nightmare.
|
||
It's not just like calling select, you need to use OS-specific
|
||
things. The other problem is that the I/O is just one part of the
|
||
time consumed to handle VM, another big part is the CPU used in
|
||
order to encode/decode data to/from the swap file. This is I picked
|
||
option three, that is...
|
||
- 3: Using I/O threads, that is, a pool of threads handling the
|
||
swap I/O operations. This is what the Redis VM is using, so let's
|
||
detail how this works.
|
||
|
||
I/O Threads
|
||
-----------
|
||
|
||
The threaded VM design goals where the following, in order of
|
||
importance:
|
||
|
||
- Simple implementation, little room for race condtions, simple
|
||
locking, VM system more or less completeley decoupled from the rest
|
||
of Redis code.
|
||
- Good performances, no locks for clients accessing values in
|
||
memory.
|
||
- Ability to decode/encode objects in the I/O threads.
|
||
|
||
The above goals resulted in an implementation where the Redis main
|
||
thread (the one serving actual clients) and the I/O threads
|
||
communicate using a queue of jobs, with a single mutex. Basically
|
||
when main thread requires some work done in the background by some
|
||
I/O thread, it pushes an I/O job structure in the
|
||
``server.io_newjobs`` queue (that is, just a linked list). If there
|
||
are no active I/O threads, one is started. At this point some I/O
|
||
thread will process the I/O job, and the result of the processing
|
||
is pushed in the ``server.io_processed`` queue. The I/O thread will
|
||
send a byte using an UNIX pipe to the main thread in order to
|
||
signal that a new job was processed and the result is ready to be
|
||
processed.
|
||
This is how the ``iojob`` structure looks like:
|
||
::
|
||
|
||
typedef struct iojob {
|
||
int type; /* Request type, REDIS_IOJOB_* */
|
||
redisDb *db;/* Redis database */
|
||
robj *key; /* This I/O request is about swapping this key */
|
||
robj *val; /* the value to swap for REDIS_IOREQ_*_SWAP, otherwise this
|
||
* field is populated by the I/O thread for REDIS_IOREQ_LOAD. */
|
||
off_t page; /* Swap page where to read/write the object */
|
||
off_t pages; /* Swap pages needed to save object. PREPARE_SWAP return val */
|
||
int canceled; /* True if this command was canceled by blocking side of VM */
|
||
pthread_t thread; /* ID of the thread processing this entry */
|
||
} iojob;
|
||
|
||
There are just three type of jobs that an I/O thread can perform
|
||
(the type is specified by the ``type`` field of the structure):
|
||
|
||
- REDIS\_IOJOB\_LOAD: load the value associated to a given key
|
||
from swap to memory. The object offset inside the swap file is
|
||
``page``, the object type is ``key->vtype``. The result of this
|
||
operation will populate the ``val`` field of the structure.
|
||
- REDIS\_IOJOB\_PREPARE\_SWAP: compute the number of pages needed
|
||
in order to save the object pointed by ``val`` into the swap. The
|
||
result of this operation will populate the ``pages`` field.
|
||
- REDIS\_IOJOB\_DO\_SWAP: Transfer the object pointed by ``val``
|
||
to the swap file, at page offset ``page``.
|
||
|
||
The main thread delegates just the above three tasks. All the rest
|
||
is handled by the main thread itself, for instance finding a
|
||
suitable range of free pages in the swap file page table (that is a
|
||
fast operation), deciding what object to swap, altering the storage
|
||
field of a Redis object to reflect the current state of a value.
|
||
Non blocking VM as probabilistic enhancement of blocking VM
|
||
-----------------------------------------------------------
|
||
|
||
So now we have a way to request background jobs dealing with slow
|
||
VM operations. How to add this to the mix of the rest of the work
|
||
done by the main thread? While blocking VM was aware that an object
|
||
was swapped out just when the object was looked up, this is too
|
||
late for us: in C it is not trivial to start a background job in
|
||
the middle of the command, leave the function, and re-enter in the
|
||
same point the computation when the I/O thread finished what we
|
||
requested (that is, no co-routines or continuations or alike).
|
||
Fortunately there was a much, much simpler way to do this. And we
|
||
love simple things: basically consider the VM implementation a
|
||
blocking one, but add an optimization (using non the no blocking VM
|
||
operations we are able to perform) to make the blocking **very**
|
||
unlikely.
|
||
This is what we do:
|
||
|
||
- Every time a client sends us a command, **before** the command
|
||
is executed, we examine the argument vector of the command in
|
||
search for swapped keys. After all we know for every command what
|
||
arguments are keys, as the Redis command format is pretty simple.
|
||
- If we detect that at least a key in the requested command is
|
||
swapped on disk, we block the client instead of really issuing the
|
||
command. For every swapped value associated to a requested key, an
|
||
I/O job is created, in order to bring the values back in memory.
|
||
The main thread continues the execution of the event loop, without
|
||
caring about the blocked client.
|
||
- In the meanwhile, I/O threads are loading values in memory.
|
||
Every time an I/O thread finished loading a value, it sends a byte
|
||
to the main thread using an UNIX pipe. The pipe file descriptor has
|
||
a readable event associated in the main thread event loop, that is
|
||
the function ``vmThreadedIOCompletedJob``. If this function detects
|
||
that all the values needed for a blocked client were loaded, the
|
||
client is restarted and the original command called.
|
||
|
||
So you can think at this as a blocked VM that almost always happen
|
||
to have the right keys in memory, since we pause clients that are
|
||
going to issue commands about swapped out values until this values
|
||
are loaded.
|
||
If the function checking what argument is a key fails in some way,
|
||
there is no problem: the lookup function will see that a given key
|
||
is associated to a swapped out value and will block loading it. So
|
||
our non blocking VM reverts to a blocking one when it is not
|
||
possible to anticipate what keys are touched.
|
||
For instance in the case of the SORT command used together with the
|
||
GET or BY options, it is not trivial to know beforehand what keys
|
||
will be requested, so at least in the first implementation, SORT
|
||
BY/GET resorts to the blocking VM implementation.
|
||
Blocking clients on swapped keys
|
||
--------------------------------
|
||
|
||
How to block clients? To suspend a client in an event-loop based
|
||
server is pretty trivial. All we do is cancelling its read handler.
|
||
Sometimes we do something different (for instance for BLPOP) that
|
||
is just marking the client as blocked, but not processing new data
|
||
(just accumulating the new data into input buffers).
|
||
Aborting I/O jobs
|
||
-----------------
|
||
|
||
There is something hard to solve about the interactions between our
|
||
blocking and non blocking VM, that is, what happens if a blocking
|
||
operation starts about a key that is also "interested" by a non
|
||
blocking operation at the same time?
|
||
For instance while SORT BY is executed, a few keys are being loaded
|
||
in a blocking manner by the sort command. At the same time, another
|
||
client may request the same keys with a simple *GET key* command,
|
||
that will trigger the creation of an I/O job to load the key in
|
||
background.
|
||
The only simple way to deal with this problem is to be able to kill
|
||
I/O jobs in the main thread, so that if a key that we want to load
|
||
or swap in a blocking way is in the REDIS\_VM\_LOADING or
|
||
REDIS\_VM\_SWAPPING state (that is, there is an I/O job about this
|
||
key), we can just kill the I/O job about this key, and go ahead
|
||
with the blocking operation we want to perform.
|
||
This is not as trivial as it is. In a given moment an I/O job can
|
||
be in one of the following three queues:
|
||
|
||
- server.io\_newjobs: the job was already queued but no thread is
|
||
handling it.
|
||
- server.io\_processing: the job is being processed by an I/O
|
||
thread.
|
||
- server.io\_processed: the job was already processed.
|
||
|
||
The function able to kill an I/O job is ``vmCancelThreadedIOJob``,
|
||
and this is what it does:
|
||
|
||
- If the job is in the newjobs queue, that's simple, removing the
|
||
iojob structure from the queue is enough as no thread is still
|
||
executing any operation.
|
||
- If the job is in the processing queue, a thread is messing with
|
||
our job (and possibly with the associated object!). The only thing
|
||
we can do is waiting for the item to move to the next queue in a
|
||
**blocking way**. Fortunately this condition happens very rarely so
|
||
it's not a performance problem.
|
||
- If the job is in the processed queue, we just mark it as
|
||
*canceled* marking setting the ``canceled`` field to 1 in the iojob
|
||
structure. The function processing completed jobs will just ignored
|
||
and free the job instead of really processing it.
|
||
|
||
Questions?
|
||
----------
|
||
|
||
This document is in no way complete, the only way to get the whole
|
||
picture is reading the source code, but it should be a good
|
||
introduction in order to make the code review / understanding a lot
|
||
simpler.
|
||
Something is not clear about this page? Please leave a comment and
|
||
I'll try to address the issue possibly integrating the answer in
|
||
this document.
|
||
.. |Redis Documentation| image:: redis.png |