docs/specs/ppc-spapr-numa.rst

   1
   2 NUMA mechanics for sPAPR (pseries machines)
   3 ============================================
   4
   5 NUMA in sPAPR works differently from the System Locality Distance
   6 Information Table (SLIT) in ACPI. The logic is explained in the LOPAPR
   7 1.1 chapter 15, "Non Uniform Memory Access (NUMA) Option". This
   8 document aims to complement this specification, providing details
   9 of the elements that impact how QEMU views NUMA in pseries.
  10
  11 Associativity and ibm,associativity property
  12 --------------------------------------------
  13
  14 Associativity is defined as a group of platform resources that have
  15 similar mean performance (or in our context here, distance) relative to
  16 everyone else outside of the group.
  17
  18 The format of the ibm,associativity property varies with the value of
  19 bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with
  20 bit 0 equal to zero is deprecated. In the current format, with bit 0
  21 set to one, the ibm,associativity property represents the physical
  22 hierarchy of the platform as one or more lists, each starting with the
  23 highest level grouping and going down to the smallest. Consider the
  24 following topology:
  25
  26 ::
  27
  28     Mem M1 ---- Proc P1    |
  29     -----------------      | Socket S1  ---|
  30           chip C1          |               |
  31                                            | HW module 1 (MOD1)
  32     Mem M2 ---- Proc P2    |               |
  33     -----------------      | Socket S2  ---|
  34           chip C2          |
  35
  36 The ibm,associativity property for the processors would be:
  37
  38 * P1: {MOD1, S1, C1, P1}
  39 * P2: {MOD1, S2, C2, P2}
  40
  41 Each allocable resource has an ibm,associativity property. The LOPAPR
  42 specification allows multiple lists to be present in this property,
  43 considering that the same resource can have multiple connections to the
  44 platform.
  45
  46 Relative Performance Distance and ibm,associativity-reference-points
  47 --------------------------------------------------------------------
  48
  49 The ibm,associativity-reference-points property is an array that defines
  50 the relevant performance/distance related boundaries, that is, the NUMA
  51 levels of the platform.
  52
  53 The definition of its elements also varies with the value of bit 0 of byte 5
  54 of the ibm,architecture-vec-5 property. The format with bit 0 equal to zero
  55 is also deprecated. With the current format, each integer of the
  56 ibm,associativity-reference-points represents a 1-based ordinal index (i.e.
  57 the first element is 1) of the ibm,associativity array. The first
  58 boundary is the most significant to application performance, followed by
  59 less significant boundaries. Allocated resources that belong to the
  60 same performance boundaries are expected to have relative NUMA distance
  61 that matches the relevancy of the boundary itself. Resources that belong
  62 to the same first boundary will have the shortest distance from each
  63 other. Subsequent boundaries represent greater distances and degraded
  64 performance.
  65
  66 Using the previous example, the following reference points setting defines
  67 three NUMA levels:
  68
  69 * ibm,associativity-reference-points = {0x3, 0x2, 0x1}
  70
  71 The first NUMA level (0x3) is interpreted as the third element of each
  72 ibm,associativity array, the second level is the second element and
  73 the third level is the first element. Let's also consider that elements
  74 belonging to the first NUMA level have distance equal to 10 from each
  75 other, and each NUMA level doubles the distance from the previous. This
  76 means that the second would be 20 and the third level 40. For the P1 and
  77 P2 processors, we would have the following NUMA levels:
  78
  79 ::
  80
  81   * ibm,associativity-reference-points = {0x3, 0x2, 0x1}
  82
  83   * P1: associativity{MOD1, S1, C1, P1}
  84
  85   First NUMA level (0x3) => associativity[2] = C1
  86   Second NUMA level (0x2) => associativity[1] = S1
  87   Third NUMA level (0x1) => associativity[0] = MOD1
  88
  89   * P2: associativity{MOD1, S2, C2, P2}
  90
  91   First NUMA level (0x3) => associativity[2] = C2
  92   Second NUMA level (0x2) => associativity[1] = S2
  93   Third NUMA level (0x1) => associativity[0] = MOD1
  94
  95   P1 and P2 have the same third NUMA level, MOD1: Distance between them = 40
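The lookup in the example above can be sketched in a few lines of Python. This is only an illustration of the 1-based indexing rule; the function name ``numa_levels`` is made up for this sketch and does not exist in QEMU or in the kernel.

```python
def numa_levels(associativity, reference_points):
    """Return the domain picked by each reference point, most significant first."""
    # Reference points are 1-based indexes into the associativity array,
    # so reference point 0x3 selects associativity[2].
    return [associativity[point - 1] for point in reference_points]


# Associativity arrays of the two processors in the example topology.
p1 = ["MOD1", "S1", "C1", "P1"]
p2 = ["MOD1", "S2", "C2", "P2"]
ref_points = [0x3, 0x2, 0x1]

print(numa_levels(p1, ref_points))  # ['C1', 'S1', 'MOD1']
print(numa_levels(p2, ref_points))  # ['C2', 'S2', 'MOD1']
```

The two results first agree at the third entry (MOD1), which is why the example places P1 and P2 at the third NUMA level.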
  96
  97 Changing the ibm,associativity-reference-points array changes the performance
  98 distance attributes for the same associativity arrays, as the following
  99 example illustrates:
 100
 101 ::
 102
 103   * ibm,associativity-reference-points = {0x2}
 104
 105   * P1: associativity{MOD1, S1, C1, P1}
 106
 107   First NUMA level (0x2) => associativity[1] = S1
 108
 109   * P2: associativity{MOD1, S2, C2, P2}
 110
 111   First NUMA level (0x2) => associativity[1] = S2
 112
 113   P1 and P2 do not have a common performance boundary. Since this is a one level
 114   NUMA configuration, distance between them is one boundary above the first
 115   level, 20.
 116
 117
 118 In a hypothetical platform where all resources inside the same hardware module
 119 are considered to be on the same performance boundary:
 120
 121 ::
 122
 123   * ibm,associativity-reference-points = {0x1}
 124
 125   * P1: associativity{MOD1, S1, C1, P1}
 126
 127   First NUMA level (0x1) => associativity[0] = MOD1
 128
 129   * P2: associativity{MOD1, S2, C2, P2}
 130
 131   First NUMA level (0x1) => associativity[0] = MOD1
 132
 133   P1 and P2 belong to the same first order boundary. The distance between them
 134   is 10.
 135
 136
 137 How the pseries Linux guest calculates NUMA distances
 138 =====================================================
 139
 140 Another key difference between ACPI SLIT and the LOPAPR regarding NUMA is
 141 how the distances are expressed. The SLIT table provides the NUMA distance
 142 value between the relevant resources. LOPAPR does not provide a standard
 143 way to calculate it. We have the ibm,associativity for each resource, which
 144 provides a common-performance hierarchy,  and the ibm,associativity-reference-points
 145 array that tells which level of associativity is considered to be relevant
 146 or not.
 147
 148 The result is that each OS is free to implement and to interpret the distance
 149 as it sees fit. For the pseries Linux guest, each level of NUMA doubles
 150 the distance of the previous level, and the maximum number of levels is
 151 limited to MAX_DISTANCE_REF_POINTS = 4 (from arch/powerpc/mm/numa.c in the
 152 kernel tree). This results in the following distances:
 153
 154 * both resources in the first NUMA level: 10
 155 * resources one NUMA level apart: 20
 156 * resources two NUMA levels apart: 40
 157 * resources three NUMA levels apart: 80
 158 * resources four NUMA levels apart: 160
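The doubling rule can be written down as a short Python sketch. It is an approximation of the behavior described above, not a copy of the kernel code; the names ``LOCAL_DISTANCE`` and ``node_distance`` are chosen for illustration only.

```python
LOCAL_DISTANCE = 10


def node_distance(assoc_a, assoc_b, reference_points):
    """Distance between two resources under the doubling rule."""
    distance = LOCAL_DISTANCE
    for point in reference_points:
        # Reference points are 1-based indexes into the associativity arrays.
        if assoc_a[point - 1] == assoc_b[point - 1]:
            break  # common boundary found, stop doubling
        distance *= 2
    return distance


p1 = ["MOD1", "S1", "C1", "P1"]
p2 = ["MOD1", "S2", "C2", "P2"]

print(node_distance(p1, p2, [0x3, 0x2, 0x1]))  # 40
print(node_distance(p1, p2, [0x2]))            # 20
print(node_distance(p1, p2, [0x1]))            # 10
```

With four reference points and no match at any of them, the loop doubles the distance four times, which gives the 160 ceiling listed above.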
 159
 160
 161 pseries NUMA mechanics
 162 ======================
 163
 164 Starting in QEMU 5.2, the pseries machine considers user input when setting the
 165 NUMA topology of the guest. The overall design is:
 166
 167 * ibm,associativity-reference-points is set to {0x4, 0x3, 0x2, 0x1}, allowing
 168   for 4 distinct NUMA distance values based on the NUMA levels
 169
 170 * ibm,max-associativity-domains supports multiple associativity domains in all
 171   NUMA levels, granting user flexibility
 172
 173 * ibm,associativity for all resources varies with user input
 174
 175 These changes are only effective for pseries-5.2 and newer machines that are
 176 created with more than one NUMA node (not counting NUMA nodes created by
 177 the machine itself, e.g. NVLink2 GPUs). The legacy behavior has been
 178 around for a long time, with users seeing NUMA distances 10 and 40
 179 (and 80 if using NVLink2 GPUs), and there is no need to disrupt the
 180 existing experience of those guests.
 181
 182 To give pseries users the experience x86 users have when tuning NUMA, we had
 183 to operate under the current pseries Linux kernel logic described in
 184 `How the pseries Linux guest calculates NUMA distances`_. The result
 185 is that we needed to translate NUMA distance user input to pseries
 186 Linux kernel input.
 187
 188 Translating user distance to kernel distance
 189 --------------------------------------------
 190
 191 User input for NUMA distance can vary from 10 to 254. We need to translate
 192 that to the values that the Linux kernel operates on (10, 20, 40, 80, 160).
 193 This is how it is being done:
 194
 195 * user distance 11 to 30 will be interpreted as 20
 196 * user distance 31 to 60 will be interpreted as 40
 197 * user distance 61 to 120 will be interpreted as 80
 198 * user distance 121 and beyond will be interpreted as 160
 199 * user distance 10 stays 10
 200
 201 The reasoning behind this approximation is to avoid rounding any value to the
 202 local distance (10), keeping it exclusive to the 4th NUMA level (which is still
 203 exclusive to the node_id). All other ranges were chosen at the developer's
 204 discretion as (somewhat) sensible given the user input.
 205 Any other strategy can be used here, but in the end the reality is that we'll
 206 have to accept that a wide range of values will be translated to the same
 207 NUMA topology in the guest, e.g. this user input:
 208
 209 ::
 210
 211       0   1   2
 212   0  10  31 120
 213   1  31  10  30
 214   2 120  30  10
 215
 216 And this other user input:
 217
 218 ::
 219
 220       0   1   2
 221   0  10  60  61
 222   1  60  10  11
 223   2  61  11  10
 224
 225 Will both be translated to the same values internally:
 226
 227 ::
 228
 229       0   1   2
 230   0  10  40  80
 231   1  40  10  20
 232   2  80  20  10
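The ranges listed earlier in this section can be expressed as a small Python sketch and checked against the two example inputs. The function names are illustrative; the range check simply reflects the 10 to 254 interval mentioned above and is an assumption about how invalid input could be rejected, not a description of QEMU's own validation.

```python
def user_to_kernel_distance(distance):
    """Translate a user supplied NUMA distance to the value the guest will see."""
    if not 10 <= distance <= 254:
        raise ValueError("user distance must be between 10 and 254")
    if distance == 10:
        return 10   # local distance is never the result of rounding
    if distance <= 30:
        return 20
    if distance <= 60:
        return 40
    if distance <= 120:
        return 80
    return 160


def translate_matrix(matrix):
    """Apply the translation to every entry of a distance matrix."""
    return [[user_to_kernel_distance(d) for d in row] for row in matrix]


first_input = [[10, 31, 120], [31, 10, 30], [120, 30, 10]]
second_input = [[10, 60, 61], [60, 10, 11], [61, 11, 10]]

print(translate_matrix(first_input))
print(translate_matrix(second_input))
```

Both calls print the same matrix, the one shown in the last block above.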
 233
 234 Users are encouraged to use only the kernel values in the NUMA definition to
 235 avoid being taken by surprise by what the guest is actually seeing in the
 236 topology. There are enough potential surprises that are inherent to the
 237 associativity domain assignment process, discussed below.
 238
 239
 240 How associativity domains are assigned
 241 --------------------------------------
 242
 243 LOPAPR allows more than one associativity array (or 'string') per allocated
 244 resource. This would be used to represent that the resource has multiple
 245 connections with the board, and then the operating system, when deciding
 246 NUMA distancing, should consider the associativity information that provides
 247 the shortest distance.
 248
 249 The spapr implementation does not support multiple associativity arrays per
 250 resource, and neither does the pseries Linux kernel. We'll have to represent the
 251 NUMA topology using one associativity per resource, which means that choices
 252 and compromises are going to be made.
 253
 254 Consider the following NUMA topology entered by user input:
 255
 256 ::
 257
 258       0   1   2   3
 259   0  10  40  20  40
 260   1  40  10  80  40
 261   2  20  80  10  20
 262   3  40  40  20  10
 263
 264 All the associativity arrays are initialized with NUMA id in all associativity
 265 domains:
 266
 267 * node 0: 0 0 0 0
 268 * node 1: 1 1 1 1
 269 * node 2: 2 2 2 2
 270 * node 3: 3 3 3 3
 271
 272
 273 Honoring just the relative distances of node 0 to every other node, we find the
 274 NUMA level matches (considering the reference points {0x4, 0x3, 0x2, 0x1}) for
 275 each distance:
 276
 277 * distance from 0 to 1 is 40 (no match at 0x4 and 0x3, will match
 278   at 0x2)
 279 * distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3)
 280 * distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match
 281   at 0x2)
 282
 283 We'll copy the associativity domains of node 0 to all other nodes, based on
 284 the NUMA level matches. Between 0 and 1 the match is at 0x2, so we'll copy
 285 the domains 0x2 and 0x1 from 0 to 1. This will give us:
 286
 287 * node 0: 0 0 0 0
 288 * node 1: 0 0 1 1
 289
 290 Doing the same to node 2 and node 3, these are the associativity arrays
 291 after considering all matches with node 0:
 292
 293 * node 0: 0 0 0 0
 294 * node 1: 0 0 1 1
 295 * node 2: 0 0 0 2
 296 * node 3: 0 0 3 3
 297
 298 The distances related to node 0 are accounted for. For node 1, and keeping
 299 in mind that we don't need to revisit node 0 again, the distance from
 300 node 1 to 2 is 80, matching at 0x1, and distance from 1 to 3 is 40,
 301 match in 0x2. Repeating the same logic of copying all domains up to
 302 the NUMA level match:
 303
 304 * node 0: 0 0 0 0
 305 * node 1: 1 0 1 1
 306 * node 2: 1 0 0 2
 307 * node 3: 1 0 3 3
 308
 309 In the last step we will analyze just nodes 2 and 3. The desired distance
 310 between 2 and 3 is 20, i.e. a match in 0x3:
 311
 312 * node 0: 0 0 0 0
 313 * node 1: 1 0 1 1
 314 * node 2: 1 0 0 2
 315 * node 3: 1 0 0 3
 316
 317
 318 The kernel will read these arrays and will calculate the following NUMA topology for
 319 the guest:
 320
 321 ::
 322
 323       0   1   2   3
 324   0  10  40  20  20
 325   1  40  10  40  40
 326   2  20  40  10  20
 327   3  20  40  20  10
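The walk-through above can be approximated with the following Python sketch. It is one possible reading of the copy step (for each pair, copy the domains from the matching level down to 0x1 from the lower numbered node to the higher one) and is not taken from the QEMU source; all names are illustrative. Because the sketch copies domains literally from the source node, its intermediate arrays are not guaranteed to match the listings above digit for digit. What it reproduces is the distance matrix the guest ends up calculating.

```python
REFERENCE_POINTS = [0x4, 0x3, 0x2, 0x1]


def numa_level(distance):
    """Reference point at which two nodes should match (0 means no match)."""
    return {10: 0x4, 20: 0x3, 40: 0x2, 80: 0x1}.get(distance, 0)


def assign_domains(distances):
    """Build one associativity array per node from a symmetric distance matrix."""
    count = len(distances)
    # Every array starts with the node id in all four domains.
    assoc = [[node] * 4 for node in range(count)]
    for src in range(count):
        for dst in range(src + 1, count):
            level = numa_level(distances[src][dst])
            # Copy domains 0x<level> down to 0x1 (1-based) from src to dst.
            for position in range(level, 0, -1):
                assoc[dst][position - 1] = assoc[src][position - 1]
    return assoc


def guest_distance(assoc_a, assoc_b):
    """Doubling rule used by the pseries Linux guest."""
    distance = 10
    for point in REFERENCE_POINTS:
        if assoc_a[point - 1] == assoc_b[point - 1]:
            break
        distance *= 2
    return distance


user_input = [
    [10, 40, 20, 40],
    [40, 10, 80, 40],
    [20, 80, 10, 20],
    [40, 40, 20, 10],
]
assoc = assign_domains(user_input)
guest_view = [[guest_distance(a, b) for b in assoc] for a in assoc]
for row in guest_view:
    print(row)
```

Under these assumptions the printed rows match the topology shown above, including the 0 to 3 distance of 20.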
 328
 329 Note that this is not what the user wanted - the desired distance between
 330 0 and 3 is 40, we calculated it as 20. This is what the current logic and
 331 implementation constraints of the kernel and QEMU can provide within the
 332 LOPAPR specification.
 333
 334 Users are welcome to use this knowledge and experiment with the input to get
 335 the NUMA topology they want, or as close to it as they can. The important thing
 336 is to keep expectations in line with what we are capable of providing at this
 337 moment: an approximation.
 338
 339 Limitations of the implementation
 340 ---------------------------------
 341
 342 As mentioned above, the pSeries NUMA distance logic is, in fact, a way to approximate
 343 user choice. The Linux kernel, and PAPR itself, do not provide QEMU with the means
 344 to fully map user input to the actual NUMA distance the guest will use. These constraints
 345 create two notable limitations in our support:
 346
 347 * Asymmetrical topologies aren't supported. We only support NUMA topologies where
 348   the distance from node A to B is always the same as B to A. We do not support
 349   any A-B pair where the distance back and forth is asymmetric. For example, the
 350   following topology isn't supported and the pSeries guest will not boot with this
 351   user input:
 352
 353 ::
 354
 355       0   1
 356   0  10  40
 357   1  20  10
 358
 359
 360 * 'non-transitive' topologies will be poorly translated to the guest. This is the
 361   kind of topology where the distance from a node A to B is X, B to C is X, but
 362   the distance A to C is not X. E.g.:
 363
 364 ::
 365
 366       0   1   2   3
 367   0  10  20  20  40
 368   1  20  10  80  40
 369   2  20  80  10  20
 370   3  40  40  20  10
 371
 372   In the example above, distance 0 to 2 is 20, 2 to 3 is 20, but 0 to 3 is 40.
 373   The kernel will always match with the shortest associativity domain possible,
 374   and we're attempting to retain the previously established relations between the
 375   nodes. This means that a distance equal to 20 between nodes 0 and 2 and the
 376   same distance 20 between nodes 2 and 3 will cause the distance between 0 and 3
 377   to also be 20.
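A user can check an input matrix for both conditions before starting the guest. The sketch below is a standalone helper, not part of QEMU; the function names are made up, and the second check follows the literal definition given above, so it may flag more triples than actually end up mistranslated.

```python
from itertools import permutations


def is_symmetric(distances):
    """True if the distance from A to B equals the distance from B to A."""
    count = len(distances)
    return all(
        distances[a][b] == distances[b][a]
        for a in range(count)
        for b in range(a + 1, count)
    )


def non_transitive_triples(distances):
    """Triples (A, B, C) where A-B and B-C share a distance that A-C lacks."""
    return [
        (a, b, c)
        for a, b, c in permutations(range(len(distances)), 3)
        if distances[a][b] == distances[b][c] != distances[a][c]
    ]


asymmetric = [[10, 40], [20, 10]]
non_transitive = [
    [10, 20, 20, 40],
    [20, 10, 80, 40],
    [20, 80, 10, 20],
    [40, 40, 20, 10],
]

print(is_symmetric(asymmetric))                            # False
print(is_symmetric(non_transitive))                        # True
print((0, 2, 3) in non_transitive_triples(non_transitive)) # True
```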
 378
 379
 380 Legacy (5.1 and older) pseries NUMA mechanics
 381 =============================================
 382
 383 In short, we can summarize the NUMA distances seen in pseries Linux guests, using
 384 QEMU up to 5.1, as follows:
 385
 386 * local distance, i.e. the distance of the resource to its own NUMA node: 10
 387 * if it's an NVLink GPU device, distance: 80
 388 * every other resource, distance: 40
 389
 390 The way the pseries Linux guest calculates NUMA distances has a direct effect
 391 on what QEMU users can expect when doing NUMA tuning. As of QEMU 5.1, this is
 392 the default ibm,associativity-reference-points being used in the pseries
 393 machine:
 394
 395 ibm,associativity-reference-points = {0x4, 0x4, 0x2}
 396
 397 The first and second level are equal, 0x4, and a third one was added in
 398 commit a6030d7e0b35 exclusively for NVLink GPUs support. This means that
 399 regardless of how the ibm,associativity properties are being created in
 400 the device tree, the pseries Linux guest will only recognize three scenarios
 401 as far as NUMA distance goes:
 402
 403 * if the resources belong to the same first NUMA level = 10
 404 * second level is skipped since it's equal to the first
 405 * all resources that aren't an NVLink GPU are guaranteed to belong
 406   to the same third NUMA level, having distance = 40
 407 * for NVLink GPUs, distance = 80 from everything else
 408
 409 This also means that user input in the QEMU command line does not change the
 410 NUMA distancing inside the guest for the pseries machine.
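The effect of the repeated 0x4 reference point can be seen by running the doubling rule over the legacy reference points. The associativity arrays below are hypothetical values picked for illustration (regular nodes sharing every domain except the last, a GPU sharing none); they are assumptions, not a dump of what QEMU 5.1 writes to the device tree.

```python
LEGACY_REFERENCE_POINTS = [0x4, 0x4, 0x2]


def guest_distance(assoc_a, assoc_b, reference_points):
    """Doubling rule used by the pseries Linux guest."""
    distance = 10
    for point in reference_points:
        if assoc_a[point - 1] == assoc_b[point - 1]:
            break
        distance *= 2
    return distance


# Hypothetical arrays: only the last domain carries the node id.
node0 = [0, 0, 0, 0]
node1 = [0, 0, 0, 1]
# Hypothetical GPU array that shares no domain with the regular nodes.
gpu = [9, 9, 9, 2]

print(guest_distance(node0, node0, LEGACY_REFERENCE_POINTS))  # 10
print(guest_distance(node0, node1, LEGACY_REFERENCE_POINTS))  # 40
print(guest_distance(node0, gpu, LEGACY_REFERENCE_POINTS))    # 80
```

Two different nodes fail the 0x4 comparison twice, so the distance has already doubled to 40 by the time the 0x2 comparison runs, and 20 can never be produced.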