I am generating an XML document for which different XSDs have been provided for different parts (which is to say, definitions for some elements are in certain files, definitions for others are in others).
The XSD files do not refer to each other. The schemas are:
- http://xmlgw.companieshouse.gov.uk/v2-1/schema/Egov_ch-v2-0.xsd
- http://xmlgw.companieshouse.gov.uk/v1-1/schema/forms/FormSubmission-v1-1.xsd
- http://xmlgw.companieshouse.gov.uk/v1-1/schema/forms/CompanyIncorporation-v1-2.xsd
Is there a way to validate the document against all of the schemas using lxml?
The solution here is not simply to validate against each schema individually, because the problem I am having is that validation fails on elements not declared in the XSD being used. For example, when validating against http://xmlgw.companieshouse.gov.uk/v2-1/schema/Egov_ch-v2-0.xsd, I get the error:
File "lxml.etree.pyx", line 3006, in lxml.etree._Validator.assertValid (src/lxml/lxml.etree.c:125415)
DocumentInvalid: Element '{http://xmlgw.companieshouse.gov.uk}CompanyIncorporation': No matching global element declaration available, but demanded by the strict wildcard., line 9
This happens because the document contains a {http://xmlgw.companieshouse.gov.uk}CompanyIncorporation element, which is not declared in the XSD being validated against but in one of the other XSD files.
I believe you should only be validating against Egov_ch-v2-0.xsd, which appears to define an envelope document. (This is the document you are creating, right? You haven't shown your XML.)
This schema uses <xs:any namespace="##any" minOccurs="0"/> to define the body contents of the envelope. However, xsd:any does not mean "ignore all contents." Rather, it means "accept any element here." Whether those contents are validated or ignored is controlled by the processContents attribute, which defaults to strict. This means that any element found here must match a global element declaration available to the schema. However, Egov_ch-v2-0.xsd does not import CompanyIncorporation-v1-2.xsd, so it doesn't know about the CompanyIncorporation element, and the document does not validate.
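To see the effect of processContents in isolation, here is a minimal sketch using a made-up envelope schema. The urn:example:* namespaces, the Envelope and Body element names, and the validates helper are all assumptions for illustration and are not taken from the Companies House schemas.

```python
from lxml import etree

# Hypothetical envelope schema; MODE is substituted with a processContents value.
ENVELOPE_XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="urn:example:envelope"
           elementFormDefault="qualified">
  <xs:element name="Envelope">
    <xs:complexType>
      <xs:sequence>
        <xs:any namespace="##any" minOccurs="0" processContents="MODE"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# The Body element lives in a namespace the schema knows nothing about.
DOC = b'<Envelope xmlns="urn:example:envelope"><Body xmlns="urn:example:body"/></Envelope>'


def validates(mode):
    """Validate DOC against the envelope schema with the given processContents mode."""
    xsd = etree.fromstring(ENVELOPE_XSD.replace("MODE", mode))
    return etree.XMLSchema(xsd).validate(etree.fromstring(DOC))


print("strict:", validates("strict"))  # Body has no global declaration
print("lax:", validates("lax"))        # undeclared elements are let through
```

With strict, the undeclared Body element should be rejected with the same "demanded by the strict wildcard" complaint as in the question; with lax, the validator only checks elements it has declarations for.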
You need to add xsd:import elements to your main schema (Egov_ch-v2-0.xsd) to import all other schemas that may be used in the document. You can either do this in the xsd file itself, or you can add the elements programmatically after parsing:
import lxml.etree
xsd = lxml.etree.parse('http://xmlgw.companieshouse.gov.uk/v2-1/schema/Egov_ch-v2-0.xsd')
newimport = lxml.etree.Element('{http://www.w3.org/2001/XMLSchema}import', namespace="http://xmlgw.companieshouse.gov.uk", schemaLocation="http://xmlgw.companieshouse.gov.uk/v1-1/schema/forms/CompanyIncorporation-v1-2.xsd")
# The XSD content model puts xs:import before other top-level declarations, so insert it at the front.
xsd.getroot().insert(0, newimport)
validator = lxml.etree.XMLSchema(xsd)
You can even do this generically with a function that takes a list of schema paths and returns a list of xsd:import elements, with namespace set from each schema's targetNamespace and schemaLocation set to its path.
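A rough sketch of such a function might look like the following. The names make_imports and schema_with_imports are hypothetical, and the sketch assumes every path is absolute (or otherwise resolvable from the main schema's location) and that each imported schema has a targetNamespace different from the main schema's.

```python
from lxml import etree

XS_NS = "http://www.w3.org/2001/XMLSchema"


def make_imports(schema_paths):
    """Build one xs:import element per schema path (hypothetical helper)."""
    imports = []
    for path in schema_paths:
        # Read the namespace the schema declares for its own components.
        target_ns = etree.parse(path).getroot().get("targetNamespace")
        attrs = {"schemaLocation": path}
        if target_ns is not None:
            attrs["namespace"] = target_ns
        imports.append(etree.Element("{%s}import" % XS_NS, attrs))
    return imports


def schema_with_imports(main_path, extra_paths):
    """Parse the main schema, prepend the imports, and compile a validator."""
    xsd = etree.parse(main_path)
    root = xsd.getroot()
    # Imports go at the front, ahead of the schema's element and type declarations.
    for position, node in enumerate(make_imports(extra_paths)):
        root.insert(position, node)
    return etree.XMLSchema(xsd)
```

Usage would then be something like validator = schema_with_imports(main_xsd_path, [other_xsd_path, ...]) followed by validator.assertValid(doc), where the paths point at your local copies of the schema files.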
(As an aside, you should probably download these schema documents and reference them with filesystem paths rather than load them over the network.)