Changeset 991d2d8 in mainline
- Timestamp: 2012-11-15T23:52:18Z (12 years ago)
- Branches: lfn, master, serial, ticket/834-toolchain-update, topic/msim-upgrade, topic/simplify-dev-export
- Children: 4f5e1c7c
- Parents: dcb0751
- Files: 1 edited
kernel/generic/src/adt/cht.c
 * @file
 * @brief Scalable resizable concurrent lock-free hash table.
 *
 * CHT is a concurrent hash table that is scalable, resizable, and lock-free.
 * resizable = the number of buckets of the table increases or decreases
 *   depending on the average number of elements per bucket (ie the load)
 * scalable = accessing the table from more cpus increases performance
 *   almost linearly
 * lock-free = common operations never block; even if any of the operations
 *   is preempted or interrupted at any time, other operations will still
 *   make forward progress
 *
 * CHT is designed for read-mostly scenarios. Performance degrades as the
 * fraction of updates (inserts/removes) increases. Other data structures
 * significantly outperform CHT if the fraction of updates exceeds ~40%.
 *
 * CHT tolerates hardware exceptions and may be accessed from exception
 * handlers as long as the underlying RCU implementation is exception safe.
 *
 * @par Caveats
 *
 * 0) Never assume an item is still in the table.
 * The table may be accessed concurrently; therefore, other threads may
 * insert or remove an item at any time. Do not assume an item is still
 * in the table if cht_find() just returned it to you. Similarly, an
 * item may have already been inserted by the time cht_find() returns NULL.
 *
 * 1) Always use RCU read locks when searching the table.
 * Holding an RCU lock guarantees that an item found in the table remains
 * valid (eg is not freed) even if the item was removed from the table
 * in the meantime by another thread.
 *
 * 2) Never update values in place.
 * Do not update items in the table in place, ie directly. The changes
 * would not propagate to other readers (on other cpus) immediately or even
 * correctly. Some readers might then encounter items that have only some
 * of their fields changed or are completely inconsistent.
 *
 * Instead, consider inserting an updated/changed copy of the item and
 * removing the original item. Or contact the maintainer to provide
 * you with a function that atomically replaces an item with a copy.
 *
 * 3) Use cht_insert_unique() instead of checking for duplicates with cht_find().
 * The following code is prone to race conditions:
 * @code
 * if (NULL == cht_find(&h, key)) {
 *     // If another thread inserts an item here, we'll insert a duplicate.
 *     cht_insert(&h, item);
 * }
 * @endcode
 * See cht_insert_unique() on how to correctly fix this.
 *
 *
 * @par Semantics
 *
 * Lazy readers = cht_find_lazy(), cht_find_next_lazy()
 * Readers = lazy readers, cht_find(), cht_find_next()
 * Updates = cht_insert(), cht_insert_unique(), cht_remove_key(),
 *   cht_remove_item()
 *
 * Readers (but not lazy readers) are guaranteed to see the effects
 * of @e completed updates. In other words, if cht_find() is invoked
 * after a cht_insert() @e returned, eg on another cpu, cht_find() is
 * guaranteed to see the inserted item.
 *
 * Similarly, updates see the effects of @e completed updates. For example,
 * issuing cht_remove() after a cht_insert() for that key returned (even
 * on another cpu) is guaranteed to remove the inserted item.
 *
 * Reading or updating the table concurrently with other updates
 * always returns consistent data and never corrupts the table.
 * However, the effects of concurrent updates may or may not be
 * visible to all other concurrent readers or updaters. Eg, not
 * all readers may see that an item has already been inserted
 * if cht_insert() has not yet returned.
 *
 * Lazy readers are guaranteed to eventually see updates, but it
 * may take some time (possibly milliseconds) after the update
 * completes for the change to propagate to lazy readers on all
 * cpus.
 *
 * @par Implementation
 *
 * Collisions in CHT are resolved with chaining. The number of buckets
 * is always a power of 2. Each bucket is represented by a singly linked
 * lock-free list [1]. Items in buckets are sorted by their mixed hashes
 * in ascending order. All buckets are terminated with a single global
 * sentinel node whose mixed hash value is the greatest possible.
 *
 * A CHT with 2^k buckets uses the k most significant bits of a hash value
 * to determine the bucket number where an item is to be stored. To
 * avoid storing all items in a single bucket if the user-supplied
 * hash function does not produce uniform hashes, hash values are
 * mixed first so that the top bits of a mixed hash change even if hash
 * values differ only in the least significant bits. The mixed hash
 * values are cached in cht_link.hash (which is overwritten once the
 * item is scheduled for removal via rcu_call).
 *
 * A new item is inserted before all other existing items in the bucket
 * with the same hash value as the newly inserted item (a la the original
 * lock-free list [2]). Placing new items at the start of a same-hash
 * sequence of items (eg duplicates) allows us to easily check for duplicates
 * in cht_insert_unique(). The function can first check that there are
 * no duplicates of the newly inserted item amongst the items with the
 * same hash as the new item. If there were no duplicates, the new item
 * is linked before the same-hash items. Inserting a duplicate while
 * the function is checking for duplicates is detected as a change of
 * the link to the first checked same-hash item (and the search for
 * duplicates can be restarted).
 *
 * @par Table resize algorithm
 *
 * Table resize is based on [3] and [5]. First, a new bucket head array
 * is allocated and initialized.
 * Second, old bucket heads are moved
 * to the new bucket head array with the protocol mentioned in [5].
 * At this point, updaters start using the new bucket heads. Third,
 * buckets are split (or joined) so that the table can make use of
 * the extra bucket head slots in the new array (or stop wasting space
 * with the unnecessary extra slots in the old array). Splitting
 * or joining buckets employs a custom protocol. Last, the new array
 * replaces the original bucket array.
 *
 * A single background work item (of the system work queue) guides
 * resizing of the table. If an updater detects that the bucket it
 * is about to access is undergoing a resize (ie its head is moving,
 * or it needs to be split/joined), it helps out and completes the
 * head move or the bucket split/join.
 *
 * The table always grows or shrinks by a factor of 2. Because items
 * are assigned a bucket based on the top k bits of their mixed hash
 * values, when growing the table each bucket is split into two buckets,
 * and all items of the two new buckets come from the single bucket in the
 * original table. Ie, items from separate buckets in the original table
 * never intermix in the new buckets. Moreover,
 * since the buckets are sorted by their mixed hash values, the items
 * at the beginning of the old bucket will end up in the first new
 * bucket, while all the remaining items of the old bucket will end up
 * in the second new bucket.
 * Therefore, there is a single point where
 * to split the linked list of the old bucket into two correctly sorted
 * linked lists of the new buckets:
 *
 *                             .- bucket split
 *                             |
 *            <-- first -->    v  <-- second -->
 *  [old] --> [00b] -> [01b] -> [10b] -> [11b] -> sentinel
 *             ^                 ^
 *  [new0] ----+                 |
 *  [new1] ----------------------+
 *
 * Resize in greater detail:
 *
 * a) First, a resizer (a single background system work queue item
 * in charge of resizing the table) allocates and initializes a new
 * bucket head array. New bucket heads are pointed to the sentinel
 * and marked Invalid (in the lower order bits of the pointer to the
 * next item, ie the sentinel in this case):
 *
 *  [old, N] --> [00b] -> [01b] -> [10b] -> [11b] -> sentinel
 *                                                    ^   ^
 *  [new0, Inv] --------------------------------------+   |
 *  [new1, Inv] ------------------------------------------+
 *
 *
 * b) Second, the resizer starts moving old bucket heads with the following
 * lock-free protocol (from [5]), where cas(variable, expected_val, new_val)
 * is short for compare-and-swap:
 *
 *  old head     new0 head      transition to next state
 *  --------     ---------      ------------------------
 *  addr, N      sentinel, Inv  cas(old, (addr, N), (addr, Const))
 *                              .. mark the old head as immutable, so that
 *                                 updaters do not relink it to other nodes
 *                                 until the head move is done.
 *  addr, Const  sentinel, Inv  cas(new0, (sentinel, Inv), (addr, N))
 *                              .. move the address to the new head and mark
 *                                 the new head normal so updaters can start
 *                                 using it.
 *  addr, Const  addr, N        cas(old, (addr, Const), (addr, Inv))
 *                              .. mark the old head Invalid to signify
 *                                 the head move is done.
 *  addr, Inv    addr, N
 *
 * Notice that concurrent updaters may step in at any point and correctly
 * complete the head move without disrupting the resizer.
 * At worst, the
 * resizer or other concurrent updaters will attempt a number of cas()
 * operations that will correctly fail.
 *
 *  [old, Inv] -> [00b] -> [01b] -> [10b] -> [11b] -> sentinel
 *                 ^                                   ^
 *  [new0, N] -----+                                   |
 *  [new1, Inv] ---------------------------------------+
 *
 *
 * c) Third, buckets are split if the table is growing, or joined if
 * shrinking (by the resizer or updaters, depending on whoever accesses
 * the bucket first). See split_bucket() and join_buckets() for details.
 *
 * 1) Mark the last item of new0 with JOIN_FOLLOWS:
 *  [old, Inv] -> [00b] -> [01b, JF] -> [10b] -> [11b] -> sentinel
 *                 ^                                       ^
 *  [new0, N] -----+                                       |
 *  [new1, Inv] -------------------------------------------+
 *
 * 2) Mark the first item of new1 with JOIN_NODE:
 *  [old, Inv] -> [00b] -> [01b, JF] -> [10b, JN] -> [11b] -> sentinel
 *                 ^                                           ^
 *  [new0, N] -----+                                           |
 *  [new1, Inv] -----------------------------------------------+
 *
 * 3) Point new1 to the join-node and mark new1 NORMAL:
 *  [old, Inv] -> [00b] -> [01b, JF] -> [10b, JN] -> [11b] -> sentinel
 *                 ^                     ^
 *  [new0, N] -----+                     |
 *  [new1, N] ---------------------------+
 *
 *
 * d) Fourth, the resizer cleans up the extra marks added during bucket
 * splits/joins, but only when it is sure all updaters are accessing
 * the table via the new bucket heads only (ie it is certain there
 * are no delayed updaters unaware of the resize and accessing the
 * table via the old bucket heads).
 *
 *  [old, Inv] ---+
 *                v
 *  [new0, N] --> [00b] -> [01b, N] ---+
 *                                     v
 *  [new1, N] --> [10b, N] -> [11b] -> sentinel
 *
 *
 * e) Last, the resizer publishes the new bucket head array for everyone
 * to see and use. This signals the end of the resize, and the old bucket
 * array is freed.
 *
 *
 * To understand the details of how the table is resized, read [1, 3, 5]
 * and the comments in join_buckets() and split_bucket().
 *
 * [1] High performance dynamic lock-free hash tables and list-based sets,
 *     Michael, 2002
 *     http://www.research.ibm.com/people/m/michael/spaa-2002.pdf
 * [2] Lock-free linked lists using compare-and-swap,
 *     Valois, 1995
 *     http://people.csail.mit.edu/bushl2/rpi/portfolio/lockfree-grape/documents/lock-free-linked-lists.pdf
 * [3] Resizable, scalable, concurrent hash tables via relativistic programming,
 *     Triplett, 2011
 *     http://www.usenix.org/event/atc11/tech/final_files/Triplett.pdf
 * [4] Split-ordered Lists: Lock-free Extensible Hash Tables,
 *     Shavit, 2006
 *     http://www.cs.ucf.edu/~dcm/Teaching/COT4810-Spring2011/Literature/SplitOrderedLists.pdf
 * [5] Towards a Scalable Non-blocking Coding Style,
 *     Click, 2008
 *     http://www.azulsystems.com/events/javaone_2008/2008_CodingNonBlock.pdf
 */

 #include <adt/cht.h>

 …

  * table.
  *
- * The following is NOT thread-safe, so do not use:
- * \code
+ * The following is @e NOT thread-safe, so do not use:
+ * @code
  * if (!cht_find(h, key)) {
  *     // A concurrent insert here may go unnoticed by cht_find() above.
 …
  *     // Now we may have two items with equal search keys.
  * }
- * \endcode
+ * @endcode
  *
  * Replace such code with:
- * \code
+ * @code
  * item = malloc(..);
  * if (!cht_insert_unique(h, item)) {
 …
  *     // there are no other equal items.
  * }
- * \endcode
+ * @endcode
  *
  */

 …

  *     [dest_head | Inv]
  *
- *  2)            ,-- split_hash
+ *  3)            ,-- split_hash
  *                v
  *     [src_head | N] -> .. -> [JF] -> [JN]