lib/libc/sys/syslink.2

   1 .\" Copyright (c) 2007 The DragonFly Project.  All rights reserved.
   2 .\"
   3 .\" This code is derived from software contributed to The DragonFly Project
   4 .\" by Matthew Dillon <dillon@backplane.com>
   5 .\"
   6 .\" Redistribution and use in source and binary forms, with or without
   7 .\" modification, are permitted provided that the following conditions
   8 .\" are met:
   9 .\"
  10 .\" 1. Redistributions of source code must retain the above copyright
  11 .\"    notice, this list of conditions and the following disclaimer.
  12 .\" 2. Redistributions in binary form must reproduce the above copyright
  13 .\"    notice, this list of conditions and the following disclaimer in
  14 .\"    the documentation and/or other materials provided with the
  15 .\"    distribution.
  16 .\" 3. Neither the name of The DragonFly Project nor the names of its
  17 .\"    contributors may be used to endorse or promote products derived
  18 .\"    from this software without specific, prior written permission.
  19 .\"
  20 .\" THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
  21 .\" ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  22 .\" LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  23 .\" FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE
  24 .\" COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
  25 .\" INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
  26 .\" BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
  27 .\" LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
  28 .\" AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
  29 .\" OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
  30 .\" OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  31 .\" SUCH DAMAGE.
  32 .\"
  33 .\" $DragonFly: src/lib/libc/sys/syslink.2,v 1.4 2007/04/03 03:09:28 dillon Exp $
  34 .\"
  35 .Dd March 13, 2007
  36 .Dt SYSLINK 2
  37 .Os
  38 .Sh NAME
  39 .Nm syslink
  40 .Nd low-level connection to the cluster mesh
  41 .Sh LIBRARY
  42 .Lb libc
  43 .Sh SYNOPSIS
  44 .In sys/syslink.h
  45 .Ft int
  46 .Fn syslink "int fd" "int flags" "sysid_t routenode"
  47 .Sh DESCRIPTION
  48 The
  49 .Fn syslink
  50 function establishes a link to a kernel-implemented syslink route node
  51 as specified by
  52 .Fa routenode .
  53 If a file descriptor of -1 is specified, a file descriptor representing
  54 a direct connection to the specified route node will be allocated and
  55 returned.
  56 If a file descriptor is specified, it will be connected to the specified
  57 route node via full-duplex communication and kernel threads will be
  58 created to shuttle data between the descriptor and the route node.  The
  59 kernel may optimize and shortcut this operation.
  60 .Pp
  61 It is also perfectly legal to allocate two route nodes and then connect them
  62 together by passing the file descriptor returned by the first
  63 .Fn syslink
  64 call to the second
  65 .Fn syslink
  66 call.  It is legal (and usually necessary) to obtain multiple descriptors to
  67 the same kernel-managed syslink route node.
  68 .Pp
  69 The syslink protocol revolves around 64-bit system ids using the
  70 .Ft sysid_t
  71 type.  A sysid can represent one of three entities:  A session identifier,
  72 a logical identifier, or a physical identifier.
  73 Session ids are synthesized by machine nodes and used to
  74 uniquely identify a communications session between two entities in a way
  75 that prevents any possible duplication or confusion in the face of a
  76 constantly changing mesh, migration of logical elements, and other activities.
  77 Logical ids are persistent entities which uniquely identify resources.
  78 Examples of resources include filesystems, hard drive partitions, devices,
  79 VM spaces, memory, cpus, and so forth.  The logical id migrates with the
  80 resource, meaning that you can physically move a hard drive from one part
  81 of the mesh to another and the mesh will automatically figure out the
  82 new location.  New logical identifiers are also typically synthesized
  83 entities.  Physical ids are used to route messages across the mesh and
  84 may be multi-homed.
  85 .Pp
  86 For example, a particular filesystem mount will have a persistent logical
  87 sysid, a separate session id for every entity connecting to it, and one or
  88 more dynamic (changeable) physical sysids depending on the mesh topology.
  89 .Pp
  90 The Syslink protocol is used to glue the cluster mesh together.  It is
  91 based on the concept of (mostly) reliable packets and buffered streams.
  92 Adding a new node to the mesh is as simple as obtaining a stream connection
  93 to any node already in the mesh, or tying into a packet switch which
  94 is part of the mesh using UDP.
  95 .Sh SYSLINK PROTOCOL - PHYSICAL SYSIDS
  96 Physical sysids are used to route messages across the mesh.  A physical
  97 sysid represents a relative route from source to target.  Each hop in
  98 the mesh gobbles up however many bits it needs from the low bits in the
  99 sysid and then shifts the sysid rightward by that many bits to set it up
 100 for the next hop.  For example, if a route node supporting 256 links
 101 receives a message, it would pull 8 bits off of the destination sysid
 102 and then shift the destination sysid right by 8.  0 bits are always shifted
 103 into bit 63 (an unsigned shift) in order to prevent broadcasts from looping
 104 through the cluster forever.  At the same time, each hop builds up the
 105 originating physical address field as the message passes through it.
 106 A link address of all 0's always addresses the node representing the hop
 107 and terminates the message.  A link address of all 1's always represents
 108 a broadcast.  A message addressed to a physical sysid of 0 thus always
 109 targets the immediate route node and a message addressed to a physical
 110 sysid of -1 is always broadcast to the entire cluster.  The number of hops
 111 is limited by the 64 sysid bits.  A message that does not have a sufficient
 112 number of bits effectively terminates at a route node by virtue of the
 113 target address becoming 0.  The routing path is arbitrarily controlled
 114 by the physical sysid and can include loops or alternative paths.
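.Pp
The per-hop handling described above can be sketched in C.
This is only an illustration: the function name, its signature, and in
particular the way the originating address is accumulated (pushing the
incoming link number onto the low bits of the source sysid) are assumptions
and are not taken from the syslink implementation.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the syslink API.
 * Consumes 'bits' low bits (1..32) of the destination sysid for one hop
 * and returns the selected link number.  Link 0 means "this node",
 * all 1's means broadcast.
 */
unsigned
route_hop(uint64_t *dst, uint64_t *src, unsigned bits, unsigned in_link)
{
	uint64_t mask = ((uint64_t)1 << bits) - 1;
	unsigned link = (unsigned)(*dst & mask);

	/* Unsigned shift: zeros enter at bit 63, so routes always run out. */
	*dst >>= bits;

	/*
	 * ASSUMPTION: the return path is built by pushing the link the
	 * message arrived on onto the low bits of the source address.
	 */
	*src = (*src << bits) | ((uint64_t)in_link & mask);

	return link;
}
```

With a destination of 0x0302 and two hops that each support 256 links,
the first hop selects link 0x02, the second selects link 0x03, and the
destination is then 0, which terminates the message at that node.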
 115 .Pp
 116 Certain information is always broadcast across the mesh.  Broadcasts allow
 117 individual nodes in the mesh to cache the source physical address
 118 of the originator (which again represents a relative path).  Two types of
 119 nodes in particular do regular broadcasts.  Seed nodes are responsible
 120 for managing the session and logical sysid spaces and broadcast at least
 121 once every 10 seconds so other nodes can get routes to them.  Registration
 122 nodes are responsible for keeping track of resources via their logical
 123 sysids and facilitating the establishment of direct communication paths
 124 between originator and target.
 125 .Pp
 126 Broadcasts require special treatment by route nodes to prevent excessive
 127 duplication due to loops in the mesh.  Each route node holds a cache of
 128 the last 16 broadcasts.  If the cache is full a route node will not forward
 129 any new broadcasts.  Cache entries time out after 10 seconds.  The size of
 130 the cache and the timeout period are adjustable and are distributed by seed nodes
 131 in their regular broadcasts.  In addition, switch nodes do not retransmit
 132 a broadcast over the same link it came in on.
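.Pp
A minimal sketch of such a broadcast cache follows, using the default
values given above (16 entries, 10 second timeout).
The structure layout, the function name, and the idea that each broadcast
carries a 64-bit identifier that can be compared are assumptions made for
illustration only.

```c
#include <stdint.h>

#define BCAST_SLOTS	16	/* default cache size from the text */
#define BCAST_TIMEOUT	10	/* default entry lifetime, in seconds */

struct bcast_cache {
	uint64_t id[BCAST_SLOTS];	/* hypothetical broadcast identifier */
	uint64_t expires[BCAST_SLOTS];	/* 0 means the slot is free */
};

/*
 * Returns 1 if the broadcast should be forwarded and 0 if it must be
 * dropped, either because it was already seen or because the cache is
 * full.  'now' is a monotonic time in seconds.
 */
int
bcast_should_forward(struct bcast_cache *c, uint64_t id, uint64_t now)
{
	int i;
	int free_slot = -1;

	for (i = 0; i < BCAST_SLOTS; ++i) {
		if (c->expires[i] != 0 && c->expires[i] <= now)
			c->expires[i] = 0;	/* entry timed out */
		if (c->expires[i] == 0) {
			if (free_slot < 0)
				free_slot = i;
		} else if (c->id[i] == id) {
			return 0;		/* duplicate, came around a loop */
		}
	}
	if (free_slot < 0)
		return 0;			/* cache full, do not forward */
	c->id[free_slot] = id;
	c->expires[free_slot] = now + BCAST_TIMEOUT;
	return 1;
}
```

A caller would zero the structure once and then call
bcast_should_forward() for every broadcast received, in addition to
skipping the link the broadcast arrived on.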
 133 .Sh SYSLINK PROTOCOL - SESSION SYSIDS
 134 Session sysids are used to uniquely identify a communications link between
 135 two entities in the mesh.  Session sysids are synthesized by the end
 136 points for a particular communication.  The route node immediately adjacent
 137 to an end point typically tracks sessions, handles timeouts, and synthesizes
 138 negative responses to ease the coding required on the leaf.
 139 .Pp
 140 Session sysids are 'almost' forever unique, meaning that they are unique
 141 within a period of around 500 years.  A communications session can survive
 142 migration and topological changes, even if the route node changes.  Changes
 143 in topology are detected by the protocol and cause the session to be
 144 retrained.
 145 .Pp
 146 Establishment of a new session or retraining an existing session is usually
 147 based on the logical sysid for the two entities involved.  That is, sessions
 148 are created between entities defined by a logical sysid for each entity.
 149 The logical sysid is the ultimate rendezvous; the session sysid identifies
 150 a session and transaction; and the physical sysid routes the message.
 151 .Sh SYSLINK PROTOCOL - LOGICAL SYSIDS
 152 Logical sysids are 'almost' forever unique, persistent entities which
 153 represent the ultimate rendezvous identifier within a cluster.  All
 154 resources on a system are given fully domained names.  For example,
 155 a disk label might be named 'MYDISK01@FUBAR.COM'.  When the system is
 156 associated with a cluster, each named resource will be assigned a permanent
 157 64-bit logical sysid allocated from that cluster.  This sysid must be
 158 permanently associated with the resource, either via a persistent file or
 159 in the resource itself (for example, as part of the disklabel).
 160 .Pp
 161 Resources can be broken up into smaller pieces and those pieces can
 162 also be assigned logical sysids or even have their own completely independent
 163 names.  For example, an ANVIL disk partition can have its own logical
 164 sysid and name independent of the one assigned to the label.  In many
 165 cases, the governing name you use to integrate resources into your cluster
 166 will be these smaller chunks.
 167 .Pp
 168 Systems connected to a cluster register their resource names and logical
 169 sysids with a registration node within the cluster (registration nodes
 170 broadcast their availability so finding one is always very easy).  The
 171 system linking in the resource will allocate the logical sysid if one was
 172 not previously assigned to the resource.  These registrations allow the
 173 cluster to make ends meet.
 174 .Sh SYSLINK PROTOCOL - SYNTHESIS OF LOGICAL AND SESSION SYSIDS
 175 Session ID prefixes are allocated from seed nodes.  Any given cluster will
 176 have one or more seed nodes in the mesh which periodically broadcast to
 177 give nodes a routable path to them.  Any seed node can dole out a
 178 session id.  The allocation remains valid for a set period of time, usually
 179 an hour, and entities can synthesize full session IDs from a combination
 180 of the prefix, iterator, and universal timestamp.
 181 .Pp
 182 Allocations are not typically tracked beyond the one hour period and the
 183 actual code performing the allocation can simply use a two-handed
 184 clock algorithm with a fixed number of slots representing session sysid
 185 prefix ranges.
 186 .Pp
 187 Logical sysid prefixes use the same prefix obtained when allocating a session
 188 ID.  Logical and session sysids are considered to be in separate namespaces.
 189 .Pp
 190 Prefixes are typically on the order of 20 bits, fewer or greater depending
 191 on how many entities you want to be able to interconnect within the cluster.
 192 When multiple seed nodes are used in a cluster, the top few bits identify the
 193 seed node (seed nodes do not communicate with each other and must dole out
 194 separate numerical prefix ranges).
 195 The low 44 bits are a combination of a sequence number and a universal
 196 timestamp.
 197 Timestamps operate with a 1 minute granularity and must not roll over
 198 for at least 500 years, requiring 28 bits of storage.
 199 The remaining 16 or so bits are used as an iterator.
 200 If the iterator overflows the allocating entity must wait for the next
 201 minute boundary before it can allocate more ids.
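.Pp
The layout described above can be sketched as follows.
The exact field widths (20-bit prefix, 28-bit minute timestamp, 16-bit
iterator), the ordering of the timestamp and iterator within the low
44 bits, and all names below are assumptions chosen to match the
approximate figures in the text; they do not describe the real encoding.

```c
#include <stdint.h>

/* Assumed field widths; the text only gives approximate sizes. */
#define SYSID_PREFIX_BITS	20
#define SYSID_STAMP_BITS	28
#define SYSID_ITER_BITS		16

/* Pack prefix, minute timestamp and iterator into one 64-bit sysid. */
uint64_t
sysid_make(uint32_t prefix, uint32_t minutes, uint32_t iter)
{
	return ((uint64_t)(prefix & 0xFFFFFu)
		    << (SYSID_STAMP_BITS + SYSID_ITER_BITS)) |
	       ((uint64_t)(minutes & 0xFFFFFFFu) << SYSID_ITER_BITS) |
	       (uint64_t)(iter & 0xFFFFu);
}

/* Hypothetical allocator state for one prefix. */
struct sysid_alloc {
	uint32_t prefix;
	uint32_t minute;	/* minute of the last allocation */
	uint32_t next_iter;	/* next unused iterator value */
};

/*
 * Returns 0 and stores a new id in *out, or returns -1 if the iterator
 * has overflowed and the caller must wait for the next minute boundary.
 */
int
sysid_alloc_next(struct sysid_alloc *a, uint32_t now_minute, uint64_t *out)
{
	if (now_minute != a->minute) {
		a->minute = now_minute;
		a->next_iter = 0;
	}
	if (a->next_iter > 0xFFFFu)
		return -1;
	*out = sysid_make(a->prefix, a->minute, a->next_iter++);
	return 0;
}
```

As a check on the 28-bit figure: 500 years is at most 500 * 366 * 24 * 60 =
263,520,000 minutes, which is below 2^28 = 268,435,456.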
 202 .Pp
 203 Sessions connect consumers to fairly granular resources.  For example,
 204 a filesystem rather than a file.  These session links can be cached.  A
 205 new session or logical id is not created every time you fork or issue an
 206 open() so the limited size of the iterator should not create any real
 207 limitations to system scale or performance.  A session can be thought
 208 of as a serialized link over which transactions can occur.  While the
 209 rate of new session and logical id creation may be limited, the actual
 210 number you can have operationally (each with a 500 year guaranteed
 211 uniqueness) is virtually unlimited.  It is also possible to simply allocate
 212 more than one prefix to handle certain burst issues, such as machine booting,
 213 if the limitation to the iterator would otherwise cause allocation delays.
 214 .Pp
 215 A new session id prefix must be allocated prior to the original one expiring.
 216 An expired session id prefix cannot be reused for a period of time, usually
 217 the same period of time as the expiration timer, in order to ensure that
 218 no session or logical id overlaps occur.
 219 Once you have a session prefix in hand you can allocate session and logical
 220 ids by combining your prefix with your sequence index and global timestamp
 221 to create session and logical ids that are good for 500 years.
 222 .Pp
 223 .Sh SYSLINK PROTOCOL - REGISTRATION OF LOGICAL IDS
 224 A logical sysid represents a particular resource and must be registered
 225 with a registration entity along with the fully qualified name for that
 226 resource.  The physical addresses for registration entities
 227 are distributed via mesh broadcasts.  A resource may be registered with any
 228 of the available registration entities.
 229 .Pp
 230 Because logical ids can migrate, e.g. by unplugging a device from one
 231 location and physically transporting it to a different location in the
 232 cluster, the logical id alone cannot be used to route messages.
 233 Session ids also cannot be used to route messages.
 234 A logical to physical translation is required and the
 235 session id then serves as a verifier and serialization/timeout/retry entity
 236 for the message transactions.  The translation is typically accomplished
 237 by the route node directly adjacent to the resource.
 238 .Sh SYSLINK PROTOCOL - MESSAGE ROUTING
 239 Messages are based on transactions and transactions revolve around
 240 session sysids.  Sessions are established between logical IDs and the
 241 session->logical_id translations are cached by the route nodes immediately
 242 adjacent to the source and target entities rather than stored in the
 243 message structure.  Only physical addresses are stored in the message
 244 structure itself.  If these route nodes do not recognize a session id
 245 they return a RETRAIN response to the source or target as needed to obtain
 246 the information.  The route nodes are responsible for translating the
 247 logical ids to physical ids to route the message.  The originating and
 248 terminal entities usually do not do these translations and program the
 249 physical addresses as 0 (to talk directly to the nearest route node), and
 250 the route node then reprograms the fields with the correct physical
 251 addresses.  Originating and terminal entities can bypass route node
 252 translation by programming non-zero addresses into the physical address fields
 253 of the message.
 254 .Pp
 255 Logical address translation is typically accomplished by sending a
 256 translation request to any of the logical registration nodes and then
 257 caching the response.  The registration node will gain knowledge about
 258 the route from the originator to the registration node, from the registration
 259 node back to the originator, from the registration node to the target, and
 260 the target back to the registration node.  Additional work is required
 261 to convert these addresses into a physical sysid that can be used by the
 262 originator to talk directly to the target.
 263 .Pp
 264 This may seem complex but it all comes down to a very simple messaging
 265 format and protocol.  The retraining protocol also serves to validate
 266 communications links between entities and to allow massive changes in
 267 mesh topology to occur without disrupting the cluster.  For example, if
 268 the physical sysid of a node changes it will set off a chain of events
 269 at the route nodes due to the now-mismatched physical sysid and session
 270 sysid.  A message winds up being routed to the wrong target which detects
 271 the misrouting due to the unknown session id.  The error feeds back to
 272 the route node which can then clear its physical sysid cache and look up
 273 the route again.
 274 .Pp
 275 Syslink messages are transactional in nature and it is possible for a single
 276 transaction to be made up of multiple messages... for example, to break down
 277 a large buffer into smaller pieces for the purposes of transmission over the
 278 mesh.  The syslink protocol imposes fairly severe limitations on transactional
 279 messages and sizes... syslink messages are not meant to abstract very large
 280 multi-megabyte I/O operations but instead are meant to provide a reliable
 281 communications abstraction for smaller messages and buffers.
 282 A transaction may contain no more than 32 individual messages, allowing
 283 the route node to use a simple bitmap to track messages which may arrive
 284 out of order.
 285 Any given session may only have one transaction pending at a time... parallel
 286 transactions are implemented by creating multiple sessions between the same
 287 two entities.
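.Pp
The bitmap tracking mentioned above might look like the following sketch.
The structure, the function names, and the assumption that each message
carries an index in the range 0..31 within its transaction are
illustrative and are not taken from the syslink message format.

```c
#include <stdint.h>

#define TRANS_MAX_MSGS	32	/* limit stated in the text */

struct trans_track {
	uint32_t expected;	/* messages in this transaction, 1..32 */
	uint32_t seen;		/* bit n is set once message n has arrived */
};

/*
 * Record the arrival of message 'idx'.
 * Returns 1 for a new message, 0 for a duplicate, -1 if out of range.
 */
int
trans_record(struct trans_track *t, uint32_t idx)
{
	uint32_t bit;

	if (idx >= t->expected || idx >= TRANS_MAX_MSGS)
		return -1;
	bit = (uint32_t)1 << idx;
	if (t->seen & bit)
		return 0;		/* duplicate, weed it out */
	t->seen |= bit;
	return 1;
}

/* Returns 1 once every expected message has been seen, regardless of order. */
int
trans_complete(const struct trans_track *t)
{
	uint32_t all;

	all = (t->expected >= TRANS_MAX_MSGS) ?
	    0xFFFFFFFFu : (((uint32_t)1 << t->expected) - 1);
	return t->seen == all;
}
```

Because the whole state fits in one 32-bit word, a route node can track
a transaction without retaining the messages themselves.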
 288 .Pp
 289 The messages making up a transaction can arrive out of order and will be
 290 collected by the target until all messages are present.  The originator
 291 must hold onto all messages it sends (so it can re-send if requested by
 292 the route node), until it has the complete response.
 293 The route node for a target is responsible for weeding out duplicate messages,
 294 monitoring transactions, and handling timeouts (returning a retry, retrain,
 295 or failure indication to the leaf).
 296 Route nodes are not responsible for retaining messages for incomplete
 297 transactions.  For example, a route node may indicate that a retransmission
 298 is needed but is not responsible for doing the actual retransmission.
 299 It is the leaf nodes that must collect the messages and do the actual
 300 retransmission and other related operations.
 301 The route nodes only track the transaction.
 302 .Pp
 303 Physical addresses can become invalid as the topology changes.  This does
 304 not invalidate a transaction but may cause a retrain to occur.
 305 .Pp
 306 Message transactions are uniquely identified by the (sessionid, msgid) fields
 307 in the syslink message.  Bits in the msgid field identify whether a request
 308 is being sent from the originator or target (determined by who initiated the
 309 original 'connection'), and whether the message is a command message or a
 310 reply message.
 311 Either side can initiate a transaction over an established session, which
 312 means that there may be a transaction going in both directions at the same
 313 time, each with request and reply messages.  Transactions initiated by
 314 the target are usually used for event and blocking/unblocking notifications.
 315 .Pp
 316 The SYSLINK protocol is not intended to take the place of a reliable link
 317 level protocol such as TCP and mesh links should only use UDP when packet
 318 delivery can be virtually guaranteed (such as when operating over switched
 319 ethernet).  UDP-based syslinks may still buffer multiple messages within
 320 the limitations of the UDP packet.
 321 .Pp
 322 The SYSLINK protocol is not intended to provide quorum guarantees.  Quorum
 323 protocols operate over SYSLINK, but are not implemented by SYSLINK.
 324 .Sh SYSLINK PROTOCOL - MESSAGE BUFFERING
 325 Syslinks which operate over buffered connections where messages may be
 326 sent or received in bulk must adhere to certain alignment and cross-over
 327 requirements to allow buffers to be implemented as FIFOs.  The message length
 328 field in a syslink message is not particularly aligned, but syslink messages
 329 themselves must always be 16-byte aligned, creating small amounts of dead
 330 space in the buffer (and the data stream).  Additionally, the physical
 331 sysid propagation protocol also propagates a FIFO cross-over size, which is
 332 always a power of 2.  Typical values range from 64KB to 1024KB.  Messages
 333 received on a stream can be written into a buffer in FIFO fashion.  No single
 334 message may straddle the end of the FIFO's physical buffer (that is, cross
 335 back over to the beginning).  All transmitters must adhere to the FIFO
 336 size supplied in the initial message traffic by generating a PAD message
 337 when necessary.  Larger FIFO sizes are usually better since they result
 338 in smaller PADs.  I/O transactions containing data are typically broken up
 339 into smaller messages not only to accommodate limitations in transport
 340 protocols (such as UDP), but also to reduce the dead space created by PADs.
 341 On the bright side, these requirements allow very optimal hardware and
 342 software buffering of syslink message traffic.
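.Pp
The alignment and cross-over rules can be illustrated with a short sketch.
The function names are hypothetical, the stream offset is assumed to be
already 16-byte aligned, and any minimum size of the PAD message itself
is ignored because the text does not specify it.

```c
#include <stdint.h>

#define SYSLINK_ALIGN	16	/* message alignment stated in the text */

/* Round a message length up to the next 16-byte boundary. */
uint32_t
msg_aligned_len(uint32_t len)
{
	return (len + (SYSLINK_ALIGN - 1)) & ~(uint32_t)(SYSLINK_ALIGN - 1);
}

/*
 * Returns the number of PAD bytes a transmitter must emit before writing
 * a message of 'len' bytes at stream offset 'off', so that the message
 * does not straddle the end of a FIFO of 'fifo_size' bytes.
 * 'fifo_size' must be a power of 2 and at least as large as the message.
 */
uint32_t
fifo_pad_before(uint64_t off, uint32_t len, uint32_t fifo_size)
{
	uint32_t pos = (uint32_t)(off & (uint64_t)(fifo_size - 1));
	uint32_t room = fifo_size - pos;	/* bytes left before wrap */

	return (msg_aligned_len(len) > room) ? room : 0;
}
```

For a 64KB FIFO, a 100-byte message (112 bytes after alignment) written
64 bytes before the end of the buffer needs a 64-byte PAD first, after
which the message starts at the beginning of the FIFO.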
 343 .Sh BLOCKING TRANSACTIONS
 344 Certain operations can block.  That is, the target may not be able to
 345 immediately complete the requested transaction.  When a transaction blocks
 346 the target is responsible for returning a keep-alive blocking indication
 347 to the originator to prevent the originator from retrying or aborting
 348 the transaction.  Keep-alives can be directly handled by the route node
 349 connected to the target (since it knows if the leaf disconnects),
 350 simplifying leaf operation.  A route node will very occasionally do a sanity
 351 check request to the leaf (perhaps once a minute) to verify that
 352 transactions blocked for a long time are still known to the leaf.
 353 .Pp
 354 Blocking indications are special response messages that set the
 355 blocked-operation bit in the sequence field and do not set the
 356 end-transaction bit.
 357 .Sh TRANSACTION ABORTS
 358 A transaction can be aborted.  Normally aborted transactions still
 359 require an acknowledgement (since the abort may race completion).
 360 If the target completes the transaction before receiving the abort
 361 request, it is as if the abort never occurred.
 362 .Sh ASYNCHRONOUS PUSH TRANSACTIONS
 363 Most syslink transactions require an acknowledgement to terminate the
 364 transaction.  The acknowledgement is typically a single message in the
 365 return direction with both the start and stop bits set.  Multi-message
 366 responses are of course possible, such as when the transaction is
 367 implementing an I/O read operation.
 368 .Pp
 369 Certain syslink transactions do not require an acknowledgement and do not
 370 implement the retry or timeout protocols.  Such transactions are typically
 371 cache-push operations which are used to optimize operation of the cluster
 372 by allowing a node to asynchronously push data to places where it thinks
 373 it will be needed immediately.  The most common use of this sort of
 374 operation is the read-ahead optimization.  When one node performs a read
 375 transaction with another node, and the target node is capable of read-ahead
 376 and determines that read-ahead is useful, the target node can initiate the
 377 read-ahead and push the data to the originating node in a separate
 378 asynchronous transaction.  Read-aheads are typically not directly adjacent
 379 to the read that just occurred in order to allow the originator to initiate
 380 the next synchronous transaction without it crossing paths with the
 381 asynchronous read-ahead push (resulting in the same data being returned to
 382 the originator twice).
 383 .Sh OPERATING AS A ROUTE NODE
 384 Most userland applications using syslink will operate as leaf nodes, but
 385 there is nothing preventing you from operating as a route node.  Operating
 386 as a route node requires implementing all route node requirements including
 387 the handling of logical sysid registrations and the tracking of transactions
 388 initiated by nodes that directly connect to you.  In fact, sysid seeding
 389 nodes are user processes which operate as degenerate route nodes.
 390 .Sh RETURN VALUES
 391 The value -1 is returned if an error occurs in either case.
 392 The external variable
 393 .Va errno
 394 indicates the cause of the error.
 395 If a descriptor is supplied and the system call is successful, 0 is
 396 returned.  If a descriptor is not supplied and the system call is successful,
 397 a descriptor is returned representing a direct connection to the mesh's
 398 route node.
 399 .Sh SEE ALSO
 400 .Sh HISTORY
 401 The
 402 .Fn syslink
 403 function first appeared in
 404 .Dx 1.9 .