TODO

iSCSI DEVELOPMENT HOWTO AND TODO
--------------------------------
July 7th 2011


If you are an admin or user and just want to send a fix, just send the fix
any way you can. We can port the patch to the proper tree and fix it up
for you. Engineers who would like to do more advanced development should
follow the guidelines below.

Submitting Patches
------------------
Code should follow the Linux kernel coding style doc:
http://www.kernel.org/doc/Documentation/CodingStyle

Patches should be submitted to the open-iscsi list open-iscsi@googlegroups.com.
They should be made with "git diff" or "diff -up" or "diff -uprN", and
kernel patches must have a "Signed-off-by" line. See section 12 of
http://www.kernel.org/doc/Documentation/SubmittingPatches for more
information on the Signed-off-by line.

Getting the Code
----------------
Kernel patches should be made against the linux-2.6-iscsi tree. This can
be downloaded from kernel.org with git using the following command:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/mnc/linux-2.6-iscsi.git

Userspace patches should be made against the open-iscsi git tree:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/mnc/open-iscsi.git



KERNEL TODO ITEMS
-----------------

1. Make iSCSI log messages human readable. In many cases the iscsi tools
and modules will log an error number value. The most well known is conn
error 1011. Users should not have to search on Google for what this means.

We should:

1. Write a simple table to convert the error values to a string and print
them out.

2. Document the values, how you commonly hit them and common solutions
in the iSCSI docs.

See scsi_transport_iscsi.c:iscsi_conn_error_event for where the evil
"detected conn error 1011" is printed. See the enum iscsi_err in iscsi_if.h
for a definition of the error code values.
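
A minimal sketch of what such a table could look like is below. It is only
an illustration: the struct and function names are made up, the entries are
a small subset, and both the numeric values and the wording are assumptions
that should be checked against enum iscsi_err in iscsi_if.h.

```c
#include <stddef.h>

/*
 * Illustrative subset only. The authoritative list is enum iscsi_err in
 * iscsi_if.h; the numeric values and the wording below are assumptions
 * and should be verified against that header.
 */
struct iscsi_err_desc {
	int code;
	const char *msg;
};

static const struct iscsi_err_desc iscsi_err_table[] = {
	{ 1011, "connection failed" },
	{ 1014, "header digest mismatch" },
	{ 1015, "data digest mismatch" },
	{ 1020, "TCP connection closed" },
};

/* Hypothetical helper: return a description, or a fallback string. */
static const char *iscsi_err_to_str(int code)
{
	size_t i;
	size_t n = sizeof(iscsi_err_table) / sizeof(iscsi_err_table[0]);

	for (i = 0; i < n; i++) {
		if (iscsi_err_table[i].code == code)
			return iscsi_err_table[i].msg;
	}
	return "unknown iSCSI error";
}
```

The log message could then print both the number and the string, so that
existing searches for "conn error 1011" keep working.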

---------------------------------------------------------------------------

2. Implement iSCSI dev loss support.

Currently if a session is down for longer than replacement/recovery_timeout
seconds, the iscsi layer will unblock the devices and fail IO. Other
transports, like FC and SAS, will do something similar. FC has a
fast_io_fail tmo which will unblock devices and fail IO, then it has a
dev_loss_tmo which will delete the devices accessed through that port.

iSCSI needs to implement dev_loss_tmo behavior, because apps are beginning
to expect this behavior. An initial patch was made here:
http://groups.google.com/group/open-iscsi/msg/031510ab4cecccfd?dmode=source

Since all drivers want this behavior we want to make it common. We need to
change the patch in that link to add a dev_loss_tmo handler callback to the
scsi_transport_template struct, and add some common sysfs and helper
functions to manage the dev_loss_tmo variable.
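
The two-stage behavior described above can be modeled roughly as follows.
This is a userspace sketch, not kernel code: the enum, struct and function
names are hypothetical, and it assumes dev_loss_tmo is not smaller than the
fast IO fail timeout.

```c
/* Hypothetical model of the two-stage timeout described above. */
enum port_action {
	PORT_KEEP_BLOCKED,	/* keep devices blocked and IO queued */
	PORT_FAIL_IO,		/* unblock devices and fail IO */
	PORT_REMOVE_DEVICES,	/* delete the devices */
};

struct port_timeouts {
	unsigned int fast_io_fail_tmo;	/* seconds */
	unsigned int dev_loss_tmo;	/* seconds, assumed >= fast_io_fail_tmo */
};

/* Decide what to do once the session/port has been down for secs_down. */
static enum port_action port_action_after(const struct port_timeouts *t,
					  unsigned int secs_down)
{
	if (secs_down >= t->dev_loss_tmo)
		return PORT_REMOVE_DEVICES;
	if (secs_down >= t->fast_io_fail_tmo)
		return PORT_FAIL_IO;
	return PORT_KEEP_BLOCKED;
}
```

In this model the existing replacement/recovery timeout plays the role of
the first stage, and dev_loss_tmo adds the second.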


---------------------------------------------------------------------------

3. Reduce lock contention on the session lock.

The session lock is basically one big lock that protects everything
in the iscsi_session. This lock could be broken down into smaller locks
and maybe even replaced with something that would not require a lock.

For example:

1. The session lock serializes access to the current R2T the initiator is
handling (an R2T from the target or the initialR2T if being used). libiscsi/
libiscsi_tcp will call iscsi_tcp_get_curr_r2t and grab the session lock in
the xmit path from the xmit thread and then in the recv path
libiscsi_tcp/iscsi_tcp will call iscsi_tcp_r2t_rsp (this function is called
with the session lock held). We could add a new per iscsi_task lock and
use that to guard the R2T.

2. For iscsi_tcp and cxgb*i, libiscsi uses the session->cmdqueue linked list
and the session lock to queue IO from the queuecommand function (run from
scsi softirq or kblockd context) to the iscsi xmit thread. Once the task is
sent from that thread, it is deleted from the list.

It seems we should be able to remove the linked list use here. The tasks
are all preallocated in the session->cmds array. We can access that
array and check the task->state (see fail_scsi_tasks for an example).
We just need to come up with a way to safely set the task state,
wake the xmit thread and make sure that tasks are executed in the order
that the scsi layer sent them to our queuecommand function.
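
One possible shape for the array scan is sketched below in userspace C11.
It is not libiscsi code: the types, names and the sequence counter are
hypothetical. It assumes a single xmit thread and that calls into the
queueing side are serialized with each other, and it ignores waking the
xmit thread and sequence wraparound.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NUM_TASKS 8	/* stand-in for the size of session->cmds */

enum task_state { TASK_FREE, TASK_PENDING, TASK_RUNNING };

struct task {
	_Atomic int state;
	unsigned long seq;	/* order in which the task was queued */
};

struct session {
	struct task cmds[NUM_TASKS];	/* preallocated, like session->cmds */
	atomic_ulong next_seq;
};

/* queuecommand side: stamp the task, then publish it as pending. */
static void queue_task(struct session *s, struct task *t)
{
	t->seq = atomic_fetch_add(&s->next_seq, 1);
	atomic_store_explicit(&t->state, TASK_PENDING, memory_order_release);
}

/* xmit side: claim the pending task with the lowest sequence number. */
static struct task *next_task(struct session *s)
{
	struct task *best = NULL;
	size_t i;

	for (i = 0; i < NUM_TASKS; i++) {
		struct task *t = &s->cmds[i];

		if (atomic_load_explicit(&t->state, memory_order_acquire) !=
		    TASK_PENDING)
			continue;
		if (!best || t->seq < best->seq)
			best = t;
	}
	if (best)
		atomic_store_explicit(&best->state, TASK_RUNNING,
				      memory_order_relaxed);
	return best;
}
```

The scan costs O(number of tasks) per pick, so whether this beats the list
plus session lock would have to be measured.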

A starting point on the queueing:
We might be able to create a workqueue per processor, queue the work,
which in this case is the execution of the task, from the queuecommand,
then rely on the work queue synchronization and serialization code.
Not 100% sure about this.

Alternative to changing the threading:
Can we figure out a way to just remove the xmit thread? We currently
cannot because the network may only be able to send 1000 bytes, but
to send the current command we need to send 2000. We cannot sleep
from the queuecommand context until another 1000 bytes frees up and for
iscsi_tcp we cannot sleep from the recv context (this happens because we
could have got an R2T from target and are handling it from the recv path).


Note: for iser and offload drivers like bnx2i and be2iscsi there
is no xmit thread used.

Note2: cxgb*i does not actually need the xmit thread so a side project
could be to convert that driver.


---------------------------------------------------------------------------

4. Make memory access more efficient on multi-processor machines.
We are moving towards per process queues in the block layer, so it would
be a good idea to move the iscsi structs to be allocated on a per process
basis.

---------------------------------------------------------------------------

5. Make blk_iopoll (see block/blk-iopoll.c and be2iscsi for an example)
support round robining IO across processors or completing IO on the
processor it was queued on (today it always completes the IO on the
processor the softirq was raised on), and convert bnx2i, ib_iser and
cxgb*i to it.

Not sure if it will help iscsi_tcp and cxgb, because their completion is done
from the network softirq which does polling already. With irq balancing it
can also be spread over all processors.

---------------------------------------------------------------------------

6. Replace iscsi_get_next_target_id with idr use.

iscsi_tcp and ib_iser allocate a session per host, so the target_id is
always just 0. The offload drivers allocate a host per pci resource, so they
will have multiple sessions for each host. When a session is added,
iscsi_add_session will try to find a target_id to use by looping over
all the targets on the host. We could replace that loop with idr.


* Being worked on by John Jose.
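
To illustrate the idea outside the kernel, the sketch below hands out the
lowest free target id from a per-host bitmap instead of walking the host's
targets. It is a userspace stand-in, not the idr API: the names and the
MAX_TARGETS limit are hypothetical, and kernel code would use the idr
helpers rather than open coding this.

```c
#include <limits.h>
#include <stddef.h>

#define MAX_TARGETS	256
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)
#define ID_WORDS	((MAX_TARGETS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical per-host map of target ids in use; one bit per id. */
struct target_id_map {
	unsigned long used[ID_WORDS];
};

/* Return the lowest free id and mark it used, or -1 if the map is full. */
static int target_id_alloc(struct target_id_map *m)
{
	size_t w, b;

	for (w = 0; w < ID_WORDS; w++) {
		if (m->used[w] == ~0UL)
			continue;	/* all ids in this word are taken */
		for (b = 0; b < BITS_PER_WORD; b++) {
			size_t id = w * BITS_PER_WORD + b;

			if (id >= MAX_TARGETS)
				return -1;
			if (!(m->used[w] & (1UL << b))) {
				m->used[w] |= 1UL << b;
				return (int)id;
			}
		}
	}
	return -1;
}

/* Release an id so that a later allocation can reuse it. */
static void target_id_free(struct target_id_map *m, unsigned int id)
{
	if (id < MAX_TARGETS)
		m->used[id / BITS_PER_WORD] &= ~(1UL << (id % BITS_PER_WORD));
}
```

For iscsi_tcp and ib_iser this always returns 0, which matches the current
behavior of one session per host.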

---------------------------------------------------------------------------

7. When userspace calls into the kernel using the iscsi netlink interface
to execute operations like creating/destroying a session, creating a
connection to a target, etc., the rx_queue_mutex is held the entire time
(see iscsi_if_rx for the iscsi netlink interface entry point). This means
that if the driver blocks, everything will be held up.

iscsi_tcp does not block, but some offload drivers might block for a couple
of seconds, up to 10 or 15 secs, while they figure out what is going on or
clean up. This is a major problem for things like multipath, where one
connection blocking the recovery of every other connection will delay IO
from re-flowing quickly.

We should look into breaking up the rx_queue_mutex into finer grained
locks or making it multi threaded. For the latter we could queue operations
into workqueues.

---------------------------------------------------------------------------

7. Add tracing support to iscsi modules. See the scsi layer's
trace_scsi_dispatch_cmd_start for an example.

Well, actually in general look into all the tracing stuff available
(trace_printk/ftrace, etc) and use one.

See http://lwn.net/Articles/291091/ for some details on what is out
there. We can only use something that is upstream though.

---------------------------------------------------------------------------

8. Improve the iscsi driver logging. Each driver has a different
way to control logging. We should unify them and make it manageable
by iscsiadm. So each driver would use a common format, there would
be a common kernel interface to set the logging level, etc.

---------------------------------------------------------------------------

9. Implement more features from the iSCSI RFC if they are worth it.

- Error Recovery Level (ERL) 1 support - will help tape support.
- Multi R2T support - Might improve write performance.
- OutOfOrder support - Might improve performance.

---------------------------------------------------------------------------

10. Add support for digest/CRC offload.

---------------------------------------------------------------------------

11. Finish Intel IOAT support. I started this here:
http://groups.google.com/group/open-iscsi/msg/2626b8606edbe690?dmode=source
but could only test on boxes with 1 gig interfaces which showed no
difference in performance. Intel had said they saw significant throughput
gains when using 10 gig.

---------------------------------------------------------------------------

12. Remove the preallocated login buffer. Storage drivers must be able
to make forward progress, so that they can always write out a page in case
the kernel needs to allocate the page to another process. If the connection
were to be disconnected and the initiator needed to relogin to the target at
this time, we might not be able to allocate a page for the login commands
buffer.

To work around the problem the initiator preallocates an 8K (sometimes
more depending on the page size) buffer for each session (see
iscsi_conn_setup's __get_free_pages call). This is obviously very wasteful
since it will be a rare occurrence. Can we think of a way to allow multiple
sessions to be relogged in at the same time, but not have to preallocate so
many buffers?
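
One direction would be a small pool of login buffers shared by all
sessions, so that memory is reserved for a few concurrent relogins rather
than for every session. The sketch below is a userspace illustration with
hypothetical names and sizes. It has no locking and no way to wait for a
buffer, both of which a real implementation would need; the kernel mempool
code might be a better starting point.

```c
#include <stddef.h>

#define LOGIN_BUF_SIZE	8192
#define LOGIN_POOL_BUFS	4	/* hypothetical; far fewer than the sessions */

struct login_pool {
	unsigned char bufs[LOGIN_POOL_BUFS][LOGIN_BUF_SIZE];
	unsigned char in_use[LOGIN_POOL_BUFS];
};

/* Hand out a free buffer, or NULL if every buffer is busy. */
static void *login_buf_get(struct login_pool *p)
{
	size_t i;

	for (i = 0; i < LOGIN_POOL_BUFS; i++) {
		if (!p->in_use[i]) {
			p->in_use[i] = 1;
			return p->bufs[i];
		}
	}
	return NULL;
}

/* Return a buffer obtained from login_buf_get() to the pool. */
static void login_buf_put(struct login_pool *p, void *buf)
{
	size_t i;

	for (i = 0; i < LOGIN_POOL_BUFS; i++) {
		if (buf == (void *)p->bufs[i]) {
			p->in_use[i] = 0;
			return;
		}
	}
}
```

A session that gets NULL would have to retry its relogin later, which
trades some relogin latency for the memory saved.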

---------------------------------------------------------------------------

13. Support swap over iSCSI safely.

Basically just need to hook iscsi_tcp into the patches that
were submitted here for NBD.

https://lwn.net/Articles/446831/


---------------------------------------------------------------------------





USERSPACE TODO ITEMS
--------------------
1. The iscsi tools, iscsid, iscsiadm and iscsistart, have a debug flag, -d N,
that allows the user to control the amount of output that is logged. The
argument N is an integer from 1 to 8, with 8 printing out the most output.

The problem is that the values from 1 to 8 do not really mean much. It would
be helpful if we could replace them with something that controls what exactly
the iscsi tools and kernel modules log.

For example, if we made the debug level argument a bitmap then

iscsiadm -m node --login -d LOGIN_ERRS,PDUS,FUNCTION

might print out extended iscsi login error information (LOGIN_ERRS),
the iSCSI packets that were sent/received (PDUS), and the functions
that were run (FUNCTION). Note, the use of a bitmap and the debug
levels are just an example. Feel free to do something else.
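
If the bitmap approach were taken, parsing the argument could look roughly
like the sketch below. The flag names come from the example above; the enum
values, table and function name are hypothetical and not part of iscsiadm
today.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bits for the names used in the example above. */
enum {
	DBG_LOGIN_ERRS	= 1 << 0,
	DBG_PDUS	= 1 << 1,
	DBG_FUNCTION	= 1 << 2,
};

static const struct {
	const char *name;
	unsigned int bit;
} dbg_flags[] = {
	{ "LOGIN_ERRS", DBG_LOGIN_ERRS },
	{ "PDUS", DBG_PDUS },
	{ "FUNCTION", DBG_FUNCTION },
};

#define NUM_DBG_FLAGS (sizeof(dbg_flags) / sizeof(dbg_flags[0]))

/*
 * Parse a comma separated list such as "LOGIN_ERRS,PDUS" into a bitmap.
 * Returns 0 and fills in *mask on success, or -1 on an unknown name.
 */
static int parse_debug_flags(const char *arg, unsigned int *mask)
{
	unsigned int out = 0;

	while (*arg) {
		size_t len = strcspn(arg, ",");
		size_t i;

		for (i = 0; i < NUM_DBG_FLAGS; i++) {
			if (strlen(dbg_flags[i].name) == len &&
			    !strncmp(arg, dbg_flags[i].name, len))
				break;
		}
		if (i == NUM_DBG_FLAGS)
			return -1;
		out |= dbg_flags[i].bit;

		arg += len;
		if (*arg == ',')
			arg++;
	}
	*mask = out;
	return 0;
}
```

Logging call sites would then test a bit in the mask instead of comparing
against a level from 1 to 8.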


We would want to be able to have iscsiadm control the iscsi kernel
logging as well. There are interfaces like
/sys/module/libiscsi/parameters/*debug*
/sys/module/libiscsi_tcp/parameters/*debug*
/sys/module/iscsi_tcp/parameters/*debug*
/sys/module/scsi_transport_iscsi/parameters/*debug*

but we would want to extend the debugging options to be finer grained
and we would want to make it supportable by all iscsi drivers.
(see #8 on the kernel todo).


---------------------------------------------------------------------------

2. "iscsiadm -m session -P 3" can print out a lot of information about the
session, but not all configuration values are printed.

iscsiadm should be modified to print out other settings like timeouts,
CHAP settings, the iSCSI values that were requested vs negotiated for, etc.

---------------------------------------------------------------------------

3. iscsiadm cannot update a setting of a running session. If you want
to change a timeout you have to run the iscsiadm logout command,
then update the record value, then login:

iscsiadm -m node -T target -p ip -u
iscsiadm -m node -T target -p ip -o update -n node.session.timeo.replacement_timeout -v 30
iscsiadm -m node -T target -p ip -l

iscsiadm should be modified to allow updating of a setting without having
to run the iscsiadm logout command.

Note that for some settings, like the iSCSI ones (ImmediateData,
FirstBurstLength, etc) that must be negotiated with the target, we will have
to logout the target then re-login, but we should not have to completely
destroy the session and scsi devices like is done when running the iscsiadm
logout command. We should be able to pass iscsid the new values and then have
iscsid logout and relogin.

Other settings like the abort timeout will not need a logout/login. We can
just pass those to the kernel or iscsid to use.

---------------------------------------------------------------------------

4. iscsiadm will attempt to perform logins/logouts in parallel. Running
iscsiadm -m node -L will cause iscsiadm to login to all portals with
the startup=automatic field set at the same time.

To log into a target, iscsiadm opens a socket to iscsid, sends iscsid a
request to login to a target, iscsid performs the iSCSI login operation,
then iscsid sends iscsiadm a reply.

To perform multiple logins iscsiadm will open a socket for each login
request, then wait for a reply. This is a problem because for 1000s of targets
we will have 1000s of sockets open. There is an rlimit to control how many
files a process can have open and iscsiadm currently runs setrlimit to
increase this.

With users creating lots of virtual iscsi interfaces on the target and
initiator with each having multiple paths it becomes inefficient to open
a socket for each request.

At the very least we want to handle the setrlimit RLIMIT_NOFILE limit better,
and it would be best to just stop opening a socket per login request.
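
As a sketch of handling the limit more carefully, the helper below raises
the soft RLIMIT_NOFILE limit only as far as the hard limit allows and
reports what it ended up with, so the caller can cap the number of login
requests in flight. The function name is hypothetical; getrlimit and
setrlimit are the standard POSIX calls. This is not how iscsiadm does it
today.

```c
#include <sys/resource.h>

/*
 * Try to raise the soft open-file limit to "wanted", capped at the hard
 * limit. Returns the soft limit in effect afterwards, or 0 if the current
 * limit could not be read.
 */
static rlim_t raise_nofile_limit(rlim_t wanted)
{
	struct rlimit rl;
	rlim_t old;

	if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
		return 0;
	old = rl.rlim_cur;
	if (old >= wanted)
		return old;	/* already high enough */

	if (rl.rlim_max != RLIM_INFINITY && wanted > rl.rlim_max)
		wanted = rl.rlim_max;
	rl.rlim_cur = wanted;
	if (setrlimit(RLIMIT_NOFILE, &rl) < 0)
		return old;	/* keep working with the old limit */
	return rl.rlim_cur;
}
```

A caller wanting 1000s of logins could then issue them in batches no larger
than the returned value, minus the descriptors it already has open.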

---------------------------------------------------------------------------

5. Make iSCSI log messages human readable. In many cases the iscsi tools
will log an error number value. The most well known is conn error 1011.
Users should not have to search on Google for what this means.

We should:

1. Write a simple table to convert the error values to a string and print
them out.

2. Document the values, how you commonly hit them and common solutions
in the iSCSI docs.


See session_conn_error and __check_iscsi_status_class as a start.
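
For the login side, a first step could be a helper that turns the login
response status class into text, as sketched below. The class values are
the ones defined for the Login Response in the iSCSI RFC (RFC 3720); the
function name and the wording of the strings are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: describe a Login Response Status-Class value. */
static const char *login_status_class_str(uint8_t status_class)
{
	switch (status_class) {
	case 0x00:
		return "success";
	case 0x01:
		return "redirection: the target has moved, retry at the new address";
	case 0x02:
		return "initiator error: check initiator name, CHAP and ACL settings";
	case 0x03:
		return "target error: the target cannot service the login right now";
	default:
		return "unknown status class";
	}
}
```

The status detail byte could be handled the same way with a second table
per class.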

---------------------------------------------------------------------------

6. Implement broadcast/multicast support, so the initiator can
find iSNS servers without the user having to set the iSNS server address.

See
5.6.5.14. Name Service Heartbeat (Heartbeat)
in
http://tools.ietf.org/html//rfc4171

---------------------------------------------------------------------------

7. Open-iscsi uses the open-isns iSNS library. The library might be a little
too complicated and a little too heavy for what we need. Investigate
replacing it.

Also explore merging the open-isns and linux-isns projects, so we do not have
to support multiple isns clients/servers in linux.

---------------------------------------------------------------------------

8. Implement the DHCP iSNS option support, so the initiator can
find the iSNS server without the user having to set the iSNS server address.
See:
http://www.ietf.org/rfc/rfc4174.txt

---------------------------------------------------------------------------

9. Some iscsiadm/iscsid operations that access the iscsi DB and sysfs can be
up to O(N^2). Some of the code was written when we thought 64 sessions
would be a lot and the norm would be 4 or 8. Due to virtualization, cloud use,
and targets like EqualLogic that do a target per logical unit (device) we can
see 1000s of sessions.

- We should look into making the record DB more efficient. Maybe
time to use a real DB (something small, simple and efficient since this
needs to run in places like the initramfs).

- Rewrite code to look up a running session so we do not have to loop
over every session in sysfs.
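
For the running session lookup, one option is to keep an in-memory index
keyed by session id so that a lookup touches one hash bucket instead of
every session. The sketch below shows the idea with a fixed-size chained
hash table. All names and the bucket count are hypothetical, and keeping
the index in sync with sysfs is not covered.

```c
#include <stddef.h>

#define SESSION_HASH_BUCKETS 1024	/* power of two, chosen arbitrarily */

/* Hypothetical in-memory record for a running session. */
struct session_entry {
	unsigned int sid;		/* session id */
	struct session_entry *next;	/* hash chain */
};

struct session_index {
	struct session_entry *buckets[SESSION_HASH_BUCKETS];
};

static unsigned int sid_bucket(unsigned int sid)
{
	return sid & (SESSION_HASH_BUCKETS - 1);
}

/* Add an entry; the caller owns the memory for it. */
static void session_index_add(struct session_index *idx,
			      struct session_entry *e)
{
	unsigned int b = sid_bucket(e->sid);

	e->next = idx->buckets[b];
	idx->buckets[b] = e;
}

/* Find the entry for sid, or NULL if it is not in the index. */
static struct session_entry *session_index_find(struct session_index *idx,
						unsigned int sid)
{
	struct session_entry *e;

	for (e = idx->buckets[sid_bucket(sid)]; e; e = e->next) {
		if (e->sid == sid)
			return e;
	}
	return NULL;
}
```

Lookups by target name and portal would need a second index with a string
hash, built along the same lines.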


---------------------------------------------------------------------------

10. Look into using udev's libudev for our sysfs access in iscsiadm/iscsid/
iscsistart.

---------------------------------------------------------------------------

11. iSCSI lib.

I am working on this one. Hopefully it should be done soon.

---------------------------------------------------------------------------