class/System.XML/System.Xml.Schema/XmlSchemaInferenceDesign.txt

   1 * INCOMPLETE
   2
   3 * XML Schema Inference Rules
   4
   5 ** Requirements
   6
   7         XmlReader:
   8         <ul>
   9                 - that does not expose EntityReference.
  10                 - that does not contain xsd:* elements.
  11         </ul>
  12
  13         XmlSchemaSet: only one that was generated by this utility class. See
  14         the particle inference section described later.
  15
  16         In practice, the MS implementation checks this input insufficiently,
  17         so it accepts more than it expects.
  18
  19 *** Allowed schema components
  20
  21         Before inferring particles merged with the particles already present
  22         in the XmlSchemaSet, we have to know what is expected and what is not:
  23
  24         <ul>
  25                 - facets are not supported. [a014.xsd]
  26                 - xs:all is not supported. [a003.xsd]
  27                 - xs:group (ref) is not supported. [a004.xsd]
  28                 - xs:choice that does not contain xs:sequence is not
  29                   supported [a005.xsd].
  30                 - xs:any is not supported. Only xs:element are expected
  31                   to be contained in xs:sequence. [a011.xsd]
  32                 - particles that share a name but are still unambiguous
  33                   are computed into invalid particles. This looks
  34                   like an unintended MS bug. [a010.xsd]
  35                 - attributeGroup does not seem to be expected here (MS has
  36                   a bug around this). [a006.xsd]
  37                 - anyAttribute is not regarded as a valid particle, and
  38                   the output complexType definition simply drops it.
  39                   [a013.xsd]
  40                 - but substitutionGroup is not rejected and it will remain
  41                   in the output. [a001.xsd]
  42                   -> It must be rejected. It breaks choice compatibility.
  43         </ul>
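
             The restrictions listed above can be sketched as a pre-check over
             a schema document. This is only an illustration, not how the
             inference engine itself works: the function name find_unsupported
             and the wording of the reported labels are hypothetical, and the
             check flags constructs rather than reproducing MS behavior.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Schema elements that the list above marks as unsupported or unexpected.
UNSUPPORTED_ELEMENTS = {"all", "any", "anyAttribute", "attributeGroup"}

# Constraining facets; none of them are supported.
FACETS = {
    "length", "minLength", "maxLength", "pattern", "enumeration",
    "whiteSpace", "maxInclusive", "maxExclusive", "minInclusive",
    "minExclusive", "totalDigits", "fractionDigits",
}


def find_unsupported(xsd_text):
    """Return a sorted list of flagged constructs found in a schema document."""
    problems = set()
    root = ET.fromstring(xsd_text)
    for node in root.iter():
        if not node.tag.startswith("{" + XS + "}"):
            continue  # only look at elements in the XML Schema namespace
        local = node.tag.split("}", 1)[1]
        if local in UNSUPPORTED_ELEMENTS:
            problems.add("xs:" + local)
        elif local in FACETS:
            problems.add("facet xs:" + local)
        elif local == "group" and "ref" in node.attrib:
            problems.add("xs:group ref")
        elif local == "choice" and node.find("{%s}sequence" % XS) is None:
            problems.add("xs:choice without xs:sequence")
        elif local == "element" and "substitutionGroup" in node.attrib:
            problems.add("substitutionGroup")
    return sorted(problems)
```

             An empty result means none of the listed constructs were found;
             it does not prove that the schema was generated by this utility.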
  44
  45
  46
  47
  48 ** Processing model
  49
  50         First, the XmlSchemaSet parameter is compiled[*1] and interpreted into
  51         its internal schema representation that is going to be used for
  52         XmlReader input examination. The resulting XmlSchemaSet is the same
  53         as the input XmlSchemaSet.
  54
  55         [*1] FIXME: this design might change.
  56         The XmlSchemaSet is compiled because it might contain
  57         XmlSchemaInclude items, so it won't be possible to process inference
  58         inside the input schema set. However, reusing the input reduces
  59         some annoyance, for example when preserving elementFormDefault etc.
  60
  61         Second, the XmlReader is moved to content (the document element) and
  62         "element inference" starts from there (described later).
  63
  64         The resulting XmlSchemaSet keeps the original XmlSchemas in itself.
  65         For example, it keeps elementFormDefault and attributeFormDefault.
  66
  67         Basically, it processes the XmlReader against the existing XmlSchemaSet
  68         and does not "merge" two XmlSchemaSets, one of which would be newly
  69         inferred from this XmlReader, because the XmlReader will have to
  70         infer sequential nodes (siblings) anyway.
  71
  72         Once the element definition is determined (or created), any other
  73         branches in the schema are ignored.
  74
  75
  76
  77 ** Attributes
  78
  79 *** attribute component definitions and references.
  80
  81 **** ignored attributes
  82
  83         xsi:type, xsi:schemaLocation and xsi:noNamespaceSchemaLocation
  84         attributes are ignored.
  85
  86 **** special attributes
  87
  88         If xsi:nil exists, then the element's content is not handled, while
  89         its attributes are handled.
  90
  91         xml:* attributes are predetermined; that namespace has a fixed schema.
  92
  93 **** namespaced attributes
  94
  95         miscellaneous attributes that reside in a certain namespace are
  96         referenced as <attribute ref="qualified-name" />
  97
  98 **** local attributes
  99
 100         miscellaneous attributes are represented as <attribute name="blah" />
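
             The four attribute categories above can be sketched as a small
             classifier. The function name classify_attribute, the result
             labels, and the Clark-notation "{namespace}local" rendering of
             the qualified name are assumptions made for this illustration.

```python
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# xsi attributes that the inference ignores completely.
IGNORED_XSI = {"type", "schemaLocation", "noNamespaceSchemaLocation"}


def classify_attribute(namespace, local_name):
    """Return (kind, value) describing how an attribute would be handled."""
    if namespace == XSI_NS and local_name in IGNORED_XSI:
        return ("ignored", None)
    if namespace == XSI_NS and local_name == "nil":
        # Special: the element content is not handled, its attributes are.
        return ("special", "nil")
    if namespace == XML_NS:
        # xml:* attributes come from a fixed, predetermined schema.
        return ("predefined", local_name)
    if namespace:
        # Namespaced attribute: <attribute ref="qualified-name" />
        return ("ref", "{%s}%s" % (namespace, local_name))
    # Local attribute: <attribute name="blah" />
    return ("name", local_name)
```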
 101
 102
 103 *** attribute occurrence
 104
 105         when defining a complexType for a newly-created element, the attribute
 106         can be set as "required". Otherwise, it must be set as "optional".
 107
 108         For every element instance occurrence, each known attribute is tested
 109         for existence, and if it is absent, then it must be set as "optional".
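
             A minimal sketch of the occurrence rule, assuming the element
             definition is newly created from the first instance and that
             each instance is given as a set of attribute names. The function
             name infer_attribute_use is hypothetical.

```python
def infer_attribute_use(instances):
    """Map each attribute name to "required" or "optional".

    instances: one set of attribute names per element occurrence,
    in document order. The first occurrence creates the definition.
    """
    use = {}
    for position, attrs in enumerate(instances):
        for name in attrs:
            if name not in use:
                # Only attributes seen when the definition is created
                # may start out as required.
                use[name] = "required" if position == 0 else "optional"
        for name in use:
            if name not in attrs:
                # Absent in this occurrence, so it cannot be required.
                use[name] = "optional"
    return use
```

             For example, an attribute seen on every occurrence stays
             required, while one that first appears on a later occurrence is
             optional from the start.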
 110
 111 *** attribute value types
 112
 113         FIXME: need to describe the relaxation of attribute value types.
 114
 115
 116 ** Content model inference
 117
 118 *** inference processing model
 119
 120         The content model consists of two parts:
 121
 122                 - content type : empty | elementOnly | textOnly | mixed
 123                 - particle : sequence | choice | all | groupRef
 124
 125         On processing reader.Read(), the node is first "tested" against
 126         current schema content model. If the current node on the XmlReader
 127         is not acceptable, then "content model expansion" happens.
 128
 129         <ul>
 130                 - If the current node is text content, then process the
 131                   text node according to "evaluating text content".
 132                 - If the current node is an element, then process it
 133                   in accordance with "evaluating particle".
 134         </ul>
 135
 136
 137 *** evaluating element
 138
 139         When an element occurs, it must be accepted as a particle.
 140         First, the content type must be examined:
 141
 142         <ul>
 143                 - If the content type was simpleType, then it is changed
 144                   into complexType with complexContent and mixed='true'.
 145                   The inferred content particle must be optional.
 146                 - If the content type was empty, then it is changed into
 147                   complexType with complexContent (it is not mixed unlike
 148                   above). The inferred content particle must be optional.
 149                 - If the content type was elementOnly or mixed, no need
 150                   to change.
 151         </ul>
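
             The content type changes above can be written as a lookup table.
             This sketch assumes that the "simpleType" case corresponds to
             the "textOnly" content type named earlier; the names
             ELEMENT_TRANSITIONS and content_type_after_element are
             hypothetical.

```python
# current content type -> (new content type, inferred particle forced optional)
ELEMENT_TRANSITIONS = {
    # simple content becomes complexContent with mixed='true'
    "textOnly": ("mixed", True),
    # empty content becomes complexContent, not mixed
    "empty": ("elementOnly", True),
    # already accepts elements: no content type change is needed here;
    # optionality is left to the particle evaluation step
    "elementOnly": ("elementOnly", False),
    "mixed": ("mixed", False),
}


def content_type_after_element(current):
    """Return the content type after an element child has been seen."""
    return ELEMENT_TRANSITIONS[current]
```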
 152
 153         Next, the content particle must be evaluated.
 154
 155         According to the input XmlSchemaSet limitations, there will be
 156         only these patterns listed here:
 157
 158                 - empty content
 159
 160                 - simple content
 161
 162                 - sequence (of element particles)
 163
 164                 - choice of sequences
 165
 166 **** Reader progress
 167
 168         Every element is tested against current element candidates.
 169
 170         <ul>
 171                 - When the target element is a document element, then all
 172                   the global elements in XmlSchemaSet are the candidates.
 173
 174                 <ul>
 175                         - If there is a matching name, then that element
 176                           definition is used as the context element for
 177                           the node's content, and current particle is
 178                           in front of the first particle.
 179                         - If there isn't, then the inference engine creates
 180                           a new element definition, and content is none
 181                           (none != empty).
 182                 </ul>
 183
 184                 - When the target element is inferred in a new element
 185                   definition, then
 186         </ul>
 187
 188
 189 **** Particle inference
 190
 191         IMPORTANT: Here I tried to formalize the inference, but these
 192         notes are incomplete.
 193
 194         Target {particle} to add:
 195                 isNew  -> <xs:element name={name}> ... </xs:element>
 196                 !isNew -> <xs:element name={name} minOccurs="0"> ... </xs:element>
 197
 198         no definition
 199         //      define complexType and add {particle} to .Particle
 200                 toComplexType()
 201                 processContent(ct.Particle, isNew)
 202
 203         simpleType
 204                 makeComplexContent()
 205
 206         complexType
 207                 empty definition (no content model, no particle)
 208         //              -> add xs:element name={name} minOccurs="0" to .Particle
 209                         -> processContent(ct.Particle, isNew)
 210
 211                 simple content
 212                         -> makeComplexContent()
 213
 214                 complex content / extension
 215                         -> processContent(cce.Particle, isNew)
 216
 217                 complex content / restriction
 218                         -> processContent(ccr.Particle, isNew)
 219
 220                 .Particle
 221                         -> processContent(ct.Particle, isNew)
 222
 223         makeComplexContent()
 224                 change to complexType which has complex content mixed="true" and
 225                 extension. Discard simple type information. Add {particle} to
 226                 extension's .Particle.
 227
 228         processContent(Particle particle, isNew)
 229                 if particle is either empty or sequence
 230                         processSequential(particle, 0, false, isNew)
 231                 else if particle is sequence of choices
 232                         processLax(particle, 0)
 233                 else
 234                         error.
 235
 236         processSequential(Sequence particle, int index, bool consumed, bool isNew)
 237                 particle.Count <= index
 238                         -> appendSequential(particle, isNew)
 239                 sequence
 240                         if (particle[index] has the same name)
 241                              -> if (consumed) then sequence[index].maxOccurs = inf.
 242                                 InferElement (sequence[index])
 243                                 processSequential(particle, index, true, isNew)
 244                         else
 245                              -> if (!consumed)
 246                                         sequence[index].minOccurs = 0.
 247                                         processSequential(particle, index+1, false, isNew)
 248                                 else
 249                                         particle = toSequenceOfChoice(particle)
 250                                         processLax(particle, index)
 251
 252         processLax(choice, index)
 253                 foreach (element el in choice.Items)
 254                         if (el has the same name)
 255                                 InferElement (el)
 256                                 processLax(choice, index + 1)
 257                                 return;
 258                 appendLax(choice)
 259
 260         appendSequential(particle, isNew)
 261                 if (particle is empty)
 262                         make particle as sequence
 263                 sequence.Items.Add(InferElement(null))
 264
 265         appendLax(choice)
 266                 choice.Items.Add(InferElement(null))
 267
 268
 269 *** evaluating text content
 270
 271         When text content occurs, it must be accepted as simple content.
 272
 273         <ul>
 274                 - If the content type was textOnly, then "type relaxation"
 275                   happens (described later).
 276                 - If the content type was already mixed, then it is skipped.
 277                 - If the content type was elementOnly, then the content type
 278                   becomes mixed and then skipped.
 279                 - If the content type was empty, then its content type
 280                   becomes textOnly and the text is then skipped. The type is
 281                   xs:string (no type promotion will happen, since an empty value
 282                   cannot be accepted as any other type handled in this design).
 283         </ul>
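
             The four cases above can be sketched as a lookup table that
             returns the new content type and what happens to the text. The
             names TEXT_TRANSITIONS and content_type_after_text, and the
             action labels "relax" and "skip", are hypothetical.

```python
# current content type -> (new content type, action taken on the text)
TEXT_TRANSITIONS = {
    # already text: the existing simple type may need to be relaxed
    "textOnly": ("textOnly", "relax"),
    # already mixed: nothing to do
    "mixed": ("mixed", "skip"),
    # elements only so far: becomes mixed, the text itself is not typed
    "elementOnly": ("mixed", "skip"),
    # empty so far: becomes text content typed as xs:string
    "empty": ("textOnly", "skip"),
}


def content_type_after_text(current):
    """Return the content type after a text node has been seen."""
    return TEXT_TRANSITIONS[current]
```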
 284
 285         (Actually, inference works from pre-compilation schema information.)
 286
 287         Note that type relaxation happens only when the content is inferred as
 288         textOnly, and in that case it always occurs.
 289
 290
 291
 292
 293 ** Type inference
 294
 295         All data types are inferred from string value; either element content
 296         or attribute value.
 297
 298
 299 *** primitive type inference
 300
 301         When a string is evaluated as a typed value (xs:blahblah), it is
 302         tried against several types in turn.
 303
 304         <ul>
 305                 - First, it is evaluated as xs:boolean; true, false<del>, 1 or 0</del>.
 306
 307                 - Next, its integer value is computed. If that is
 308                   successful, then its value range is examined to see whether
 309                   it matches unsignedByte, byte, unsignedShort, short,
 310                   unsignedInt, int, unsignedLong, long, and integer.
 311
 312                 - If it was not an integer, then it is evaluated as a float
 313                   number, as a double number, and then as a decimal number
 314                   as well.
 315
 316                 - Next, it is examined as xs:dateTime, xs:duration and
 317                   related schema types.
 318
 319                 - If it did not match any kind of predefined types, then
 320                   xs:string is inferred. No other string-based types (such
 321                   as xs:token) are inferred.
 322         </ul>
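
             The evaluation order above can be sketched as a function that
             returns every type a string matches, in the order the list tries
             them. This is a simplified illustration: the lexical checks are
             plain regular expressions, float and double ranges are not
             checked, only xs:dateTime and xs:duration stand in for the date
             related types, and the name candidate_types is hypothetical.

```python
import re

# Integer types in the order the text lists them, with their value ranges.
INTEGER_RANGES = [
    ("xs:unsignedByte", 0, 2**8 - 1),
    ("xs:byte", -2**7, 2**7 - 1),
    ("xs:unsignedShort", 0, 2**16 - 1),
    ("xs:short", -2**15, 2**15 - 1),
    ("xs:unsignedInt", 0, 2**32 - 1),
    ("xs:int", -2**31, 2**31 - 1),
    ("xs:unsignedLong", 0, 2**64 - 1),
    ("xs:long", -2**63, 2**63 - 1),
]

INTEGER_RE = re.compile(r"[+-]?[0-9]+")
DECIMAL_RE = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")
FLOAT_RE = re.compile(
    r"([+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?|-?INF|NaN)")
DATETIME_RE = re.compile(
    r"-?[0-9]{4,}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}"
    r"(\.[0-9]+)?(Z|[+-][0-9]{2}:[0-9]{2})?")
DURATION_RE = re.compile(
    r"-?P(?=.)([0-9]+Y)?([0-9]+M)?([0-9]+D)?"
    r"(T(?=[0-9])([0-9]+H)?([0-9]+M)?([0-9]+(\.[0-9]+)?S)?)?")


def candidate_types(text):
    """Return the matching XSD type names, in evaluation order."""
    value = text.strip()
    if value in ("true", "false"):  # "1" and "0" are not treated as boolean
        return ["xs:boolean"]
    if INTEGER_RE.fullmatch(value):
        number = int(value)
        names = [n for n, low, high in INTEGER_RANGES if low <= number <= high]
        return names + ["xs:integer"]
    if FLOAT_RE.fullmatch(value):
        names = ["xs:float", "xs:double"]
        if DECIMAL_RE.fullmatch(value):  # decimal has no exponent, INF or NaN
            names.append("xs:decimal")
        return names
    if DATETIME_RE.fullmatch(value):
        return ["xs:dateTime"]
    if DURATION_RE.fullmatch(value):
        return ["xs:duration"]
    return ["xs:string"]
```

             For an integer, the first entry is the first listed type whose
             range holds the value, so "12345" starts at xs:unsignedShort.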
 323
 324
 325 *** type relaxation
 326
 327         When a string value has to be accepted by an existing type, the type
 328         might have to change to accept it.
 329
 330         For example:
 331         <ul>
 332                 - xs:int cannot accept "abc"
 333                 - <del>string with maxLength="3" cannot accept "abcd"</del>
 334                   facets are not created anyways and thus not supported
 335                   by this inference engine.
 336                 - 12345 is not acceptable for xs:unsignedByte, but acceptable
 337                   for unsignedShort
 338         </ul>
 339
 340         Here, the new string value is inferred into a simpleType, and then
 341         the processor will compute the most specific common type between
 342         the existing type and the newly inferred type.
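
             A minimal sketch of computing the common type, under two
             assumptions that the text does not confirm: integer types are
             compared by value range, in the order listed under primitive
             type inference, and any other pair of differing types falls
             back to xs:string. The name most_specific_common_type is
             hypothetical.

```python
import math

# Integer types in the order used for primitive type inference.
INTEGER_TYPES = [
    ("xs:unsignedByte", 0, 2**8 - 1),
    ("xs:byte", -2**7, 2**7 - 1),
    ("xs:unsignedShort", 0, 2**16 - 1),
    ("xs:short", -2**15, 2**15 - 1),
    ("xs:unsignedInt", 0, 2**32 - 1),
    ("xs:int", -2**31, 2**31 - 1),
    ("xs:unsignedLong", 0, 2**64 - 1),
    ("xs:long", -2**63, 2**63 - 1),
    ("xs:integer", -math.inf, math.inf),
]


def most_specific_common_type(existing, inferred):
    """Return a type that accepts the values of both given types."""
    if existing == inferred:
        return existing
    ranges = {name: (low, high) for name, low, high in INTEGER_TYPES}
    if existing in ranges and inferred in ranges:
        low = min(ranges[existing][0], ranges[inferred][0])
        high = max(ranges[existing][1], ranges[inferred][1])
        # First listed integer type whose range covers both ranges;
        # xs:integer is unbounded, so the loop always returns.
        for name, type_low, type_high in INTEGER_TYPES:
            if type_low <= low and high <= type_high:
                return name
    # Assumption: unrelated types only share xs:string.
    return "xs:string"
```

             With this sketch, an existing xs:unsignedByte that meets "12345"
             (inferred as xs:unsignedShort) relaxes to xs:unsignedShort, and
             an existing xs:int that meets "abc" relaxes to xs:string.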
 343