| Title (Goal) | Content and structural validation |
|---|---|
| Primary Actor | Information architect, developer |
| Scope | Component |
| Level | Summary |
| Author | Stefano Cossu |
| Story (A paragraph or two describing what happens) | Enable validation of content structure and properties |
I want to enforce input validation outside of individual client systems. This is related to the Content modeling use case.
This validation may include constraints for property domain and range, cardinality, uniqueness, etc.
Range validation should include both data types for literal properties and class constraints for in-repo resource properties.
Examples
- Restrict the “myns:created” property to xsd:dateTime
- Restrict the “myns:hasInstance” property to resources of type “myns:Instance” or its subtypes (structural validation)
- Make myns:createdDate single-valued (i.e. cardinality = 0..1)
- Make myns:uid mandatory and single-valued (i.e. cardinality = 1..1)
- Inherit property constraints from super-types
- type myns:Document has a property definition for myns:uid as mandatory and myns:content as single-valued
- type myns:TextDocument inherits these definitions
- type myns:ImageDocument inherits myns:uid but overrides myns:content to being multi-valued
- Ensure that no two resources with the same myns:uid are present in the repository (similar to a unique key constraint in a relational database)
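For illustration, the example constraints above can be sketched as plain data-driven checks. The dict-based resource representation, the function name, and the uniqueness check via a pre-fetched set of existing UIDs are all hypothetical, not part of any proposed API:

```python
import re

# Hypothetical in-memory representation of a resource's properties:
# property name -> list of values. The rules mirror the examples above.
def validate(props, existing_uids=frozenset()):
    """Return a list of human-readable constraint violations."""
    errors = []

    # myns:created must be an xsd:dateTime literal (datatype/range check).
    for v in props.get("myns:created", []):
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}.*", v):
            errors.append("myns:created is not an xsd:dateTime")

    # myns:createdDate is single-valued (cardinality 0..1).
    if len(props.get("myns:createdDate", [])) > 1:
        errors.append("myns:createdDate must have at most one value")

    # myns:uid is mandatory and single-valued (cardinality 1..1) ...
    uids = props.get("myns:uid", [])
    if len(uids) != 1:
        errors.append("myns:uid must have exactly one value")
    # ... and unique across the repository (like a relational unique key).
    elif uids[0] in existing_uids:
        errors.append("myns:uid %r is already in use" % uids[0])

    return errors
```

A conforming resource, e.g. `validate({"myns:uid": ["a1"], "myns:created": ["2016-01-01T00:00:00Z"]})`, yields an empty error list.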
Roles of the API Extension Architecture configuration
- Define HTTP methods that the Validation extension operates on
- Enable/disable and define execution priority of Validation service (e.g. pipeline)
Roles of the API Extension engine
- Forward data from user request or previous services
- Handle response from Validation extension and forward to further services
Roles of the Validation extension configuration
- Define the content models on which validation is performed
- For each model, define properties to be validated
- For each property, define:
- validation rules
- RDF type of the resource or container that should be validated (optional)
- Post-success actions (i.e. services to be called if all property values pass validation) (optional)
- Post-failure actions (i.e. services to be called if any property value fails validation) (optional)
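A hypothetical shape for such a configuration, mirroring the roles just listed; every key name here is illustrative, not a proposed schema:

```python
import json

# Hypothetical validation-extension configuration: content models map to
# properties, each with rules, an optional RDF type restriction, and
# optional post-success/post-failure actions.
CONFIG_JSON = """
{
  "myns:Document": {
    "myns:uid": {
      "rules": [{"type": "cardinality", "min": 1, "max": 1}],
      "rdf_type_restriction": null,
      "on_success": null,
      "on_failure": null
    },
    "myns:created": {
      "rules": [{"type": "datatype", "datatype": "xsd:dateTime"}],
      "rdf_type_restriction": "myns:Record",
      "on_success": "indexing-service",
      "on_failure": "notify-curator"
    }
  }
}
"""

config = json.loads(CONFIG_JSON)
```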
Roles of the Validation extension engine
- Parse input from API-X engine and determine content model(s) of resource
- Parse configuration for content model(s)
- Loop over property validation rules in config file:
- If an RDF type restriction (see point in config roles) is defined, query the repo or index to determine whether the resource or its container is of the RDF type specified in the config
- If the result is positive, apply the validation rule against the user-provided property value(s)
- If the result is negative, skip this rule
- If no RDF type restriction is defined, apply the validation rule unconditionally
- If validation passes:
- if a post-success action is defined, execute it
- if no post-success action is defined, move on to next rule
- If validation fails:
- if a post-failure action is defined, execute it
- if no post-failure action is defined, abort whole process and raise an exception
- After all rules have been processed, return control to the API-X engine
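The engine loop above can be sketched as follows; the callable rules, the `has_type` repository/index query, and the `run_service` hook are assumptions standing in for real API-X interfaces:

```python
# Sketch of the validation-engine loop: honour per-rule RDF type
# restrictions, apply rules, and run post-success/post-failure actions.
# All names are illustrative; real API-X interfaces may differ.
class ValidationError(Exception):
    pass

def run_validation(resource, model_config, has_type, run_service):
    for prop, spec in model_config.items():
        restriction = spec.get("rdf_type_restriction")
        if restriction is not None and not has_type(resource, restriction):
            continue  # negative type check: skip this rule
        values = resource.get("properties", {}).get(prop, [])
        passed = all(rule(values) for rule in spec["rules"])
        if passed:
            if spec.get("on_success"):
                run_service(spec["on_success"], resource)
        else:
            if spec.get("on_failure"):
                run_service(spec["on_failure"], resource)
            else:
                # no failure handler: abort the whole process and raise
                raise ValidationError("validation failed for %s" % prop)
    return True  # control returns to the API-X engine
```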
Note
Due to the long discussion in the comments, this use case has been split into three child pages. See links below.
Comments
Elliot Metsger
Hi Stefano Cossu, can you describe what you mean by "in-repo resource properties" above?
Stefano Cossu
I mean property values that are URIs of resources living in the same repository.
E.g. if I have
<https://myrepo.org/res1> pcdm:hasRelatedFile <https://myrepo.org/res2>
I want to be able to specify that https://myrepo.org/res2 must have a rdf:type of myns:Image.
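One way to express such a check is a SPARQL ASK query against the repository's triple index; the helper below is purely illustrative (the `rdfs:subClassOf*` path also covers subtypes):

```python
# Illustrative helper: build a SPARQL ASK query that checks whether a
# linked in-repo resource has the required rdf:type, or a subtype of it.
def ask_type_query(resource_uri, required_class):
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "ASK { <%s> rdf:type/rdfs:subClassOf* <%s> }"
        % (resource_uri, required_class)
    )
```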
Aaron Birkland
Though not stated explicitly here, I can see two related or sub-use cases in this example:
For #1 (CRUD), one could imagine the following roles and responsibilities.
For #2 (validation service), one could imagine the following roles and responsibilities:
(e.g. /path/to/object/cma:validation) to the validation service, for objects that can be validated.
Stefano Cossu: Is the above consistent with your conceptualization of how validation may be realized in the context of API extensions?
Stefano Cossu
Yes. Basically validation would be just another service that you call on CUD operations and that can determine the lifecycle and outcome of the operation.
If validation is extended to all PUT, POST, PATCH and DELETE operations, what is the reason for checking the validity of an existing resource when you are not modifying it?
It should also be PUT, PATCH and DELETE, i.e. any HTTP request meant to change the state of the resource.
You could also just define validation for arbitrary rdf types, so you don't need to assign an extra rdf type to your resources. Also just a "cma:validatable" type would not be enough to specify which validation to perform.
It may be useful to abstract the concept of "container" away from the API-X. Let's say you want to create an "image" resource: you POST an image file and some metadata and say, "this is an image", and the API-X validates your input and creates containers and binary resources for you.
+1 - We will need further research and discussion on this topic once we get to the details.
See above - I cannot see a use case for this.
Yes!
Aaron Birkland
I see two separate concerns here:
(1) Identifying which objects in the repository can be bound to a validation extension/service that may operate on them. (The assumption here is that a repository can contain objects that it cannot validate, and/or objects where validation is not wanted or desired; hence the need to bind objects to a validation service wherever validation is desired. This is analogous to the existing practice of marking objects with the predicate http://fedora.info/definitions/v4/indexing#Indexable in cases where the object is intended to be indexed.)
(2) Identifying which model an object should be validated against.
So (1) is a concern of the API extension architecture, as it needs to know when it has to invoke a particular extension (without knowing any details of how that extension works or is configured). This is a general capability. I was suggesting that "presence of a specific rdf type" could be a sufficiently general means to bind the API extension Architecture to a particular service. In particular, presence of 'cma:validatable' would bind a validation extension to objects so marked.
Item (2) is a concern of the internal operation or specification of the validation extension, and as you note is also necessary information. The API extension architecture itself does not need to know or care how this is achieved.
Stefano Cossu
Aaron,
If we suppose that:
(e.g. cma:*) or the value of a special property (e.g. fedora:contentModel, maybe easier to manage).
Then you could use a configuration that specifies validation rules or validation service bindings for all content models. If the content model of the resource you want to validate is not in the configuration, or no rules are specified, no validation happens.
Does that address both your concerns?
Aaron Birkland
So in this example, would the API extension architecture always invoke the validation extension, and the validation extension determines, based upon its configuration, the object's stated model, etc., whether it performs validation?
See my comment to Elliot. My concern is really that I'd like to explore ways to avoid enabling an extension globally for an entire repository, in cases where it makes sense to do so. Likewise, let's suppose an object has a content model that the validation extension is appropriately configured to know how to validate. Is it reasonable to have a scenario where a repository manager might still not want it validated, depending on where it is put in the repository?
I think these validation use cases are very interesting from the perspective of binding extensions to objects. Ideally (in my opinion) the API extension architecture itself should know as little as possible about a particular extension's business rules; the mechanism of binding an extension to an object needs to be simple, easy to reason about, and have a path forward to a fast and efficient implementation.
Stefano Cossu
I think this depends on what the repo manager is trying to achieve and what you mean by "where" (see my other comment).
I can see three different scenarios depending on this conversation: one where validation is mandatory across the whole repository (my case); one where it is recommended but not mandatory, at least for a temporary situation (Elliot's case here); and one where it is mandatory or recommended only in one part of the repository (your case above).
Can this summarize our discussion on this point?
Elliot Metsger
Want to +1 this statement. It would be useful to be able to validate a domain-specific representation of an object before it was mapped into LDP.
Aaron Birkland
Fedora 4's object model is, for better or for worse, a hierarchy. Creating a new object in Fedora 4 inherently involves adding it as a child of some resource that already exists in the repository. So technically I should have used the word 'resource' rather than 'container' so as not to bring LDP into the mix.
Given that clarification, my comment "container for whom this extension provides validation services" is related to my desire to be able to specify in some way when to invoke the validation extension, and when not to.
So when depositing a new object, the API extension architecture needs to answer the question "which extensions do I need to invoke in order to handle this request". I believe the answer to that question shouldn't have to be "always invoke the validation extension". Ideally, in my opinion the API extension architecture should allow the answer to be "invoke the validation extension for the subset of the repository for which I want validation".
Because Fedora is inherently hierarchical, and because depositing a new object inherently involves specifying a parent resource to create a new child underneath, I was thinking that the parent resource would be a natural place to have a marker to indicate "please validate objects put here". This marker could be a cma:validatable rdf:type.
So if a repository manager's policy is to validate all objects, then place the marker on the root node in the repository. If the manager's policy is to validate objects in /public/images, place a marker there. This is where I was going with "container for whom this extension provides validation services"
Stefano Cossu
Fedora's internal structure is hierarchical indeed; however, that is the JCR layer, which should ideally not be exposed to the client. As of late, tying functionality to the JCR hierarchy is generally being avoided (the document you quote probably pre-dates the decision to move to a full LDP-based model and abstract the JCR machinery, if I remember correctly, right Andrew Woods?). LDP has the concept of containment that would be more appropriate for your case.
Back to the main point, it seems like the main debate here is having validation depending on containment (or hierarchy, if you don't agree with the above statement) or RDF class or similar membership. I think this is a good discussion to bring forward when we start laying out an implementation plan.
Andrew Woods
Yes, the hierarchical nature of F4 should be viewed through the lens of LDP containment... which does not seem to invalidate Aaron Birkland's "containment" proposal.
Aaron Birkland
This raises an interesting point that will likely have to be addressed over the course of this work. It is a little unclear to me how much of JCR is 'accidental' (i.e. an implementation detail), and how much of it is 'essential' to the Fedora 4 model. fcrepo-kernel-* contains the public Java API to Fedora core. Many current extension modules use this API. This API exposes Fedora objects/resources explicitly in terms of JCR (see, for example, FedoraResource.java from the kernel API). The only implementation of this API is based on Modeshape, and I believe at present there isn't any way to use fcrepo-kernel without actually being deployed as part of the Fedora webapp.
So where does this leave API extension modules? To a great extent, they may rely on the HTTP+LDP API of Fedora. In my mind, though, it may be useful to look to fcrepo-camel (which provides a client API based on HTTP), or even fcrepo4-client. It might be nice to be able to recommend to extension developers a client library that exposed Fedora 4's conceptual model free of JCR or HTTP concepts, with implementations based on fcrepo-kernel or an fcrepo-camel client, depending on where a particular extension is deployed.
Elliot Metsger
I agree that we would want to recommend patterns for writing extensions, including recommending client libraries.
If the Extensions are "air-gapped" from Fedora (e.g. not running in the same JVM, or running in a separate servlet container or web-app), then it seems unlikely that they would have access to APIs in fcrepo-kernel-*. If Extensions did have access to those APIs, that would make me uncomfortable because that increases coupling between Extensions and Fedora (so the coupling, regardless of the exposure of JCR, is what makes me uncomfortable in that scenario).
So I agree that we would want another integration point as you mention above: HTTP/LDP, Camel+HTTP, or another existing library.
Elliot Metsger
Aaron Birkland, Stefano Cossu: I had made these notes for roles of this use case. I think they align with everyone's thoughts. Some notes:
API Extensions Architecture
Provides - or proxies the request to - a runtime environment for validating an instance of a model against constraints that:
supports configuration or policy that governs whether an instance is subject to validation, and when (upon ingesting an object, upon retrieving an object) validation is performed
provides native support for certain constraint languages like SPARQL Inferencing Notation (SPIN) or Shapes Constraint Language (SHACL)?
supports a plugin architecture for validating model instances on a per-model basis?
provides access to the result of validation attempts
optionally generates validation events and stores them as provenance for the object being validated?
May provide - or proxy - a service which can validate an object on request (e.g. a request to /path/to/object/svc:validate)
because an object may conform to multiple models, the requestor may be allowed to specify the type of model to validate against
Fedora
Answers requests for resources, as normal
Provides storage for objects supporting validation: policy, models, and constraints
Information architect/developer
Defines the model, and its constraints
Expresses the constraints in a manner supported by the API Extensions Architecture
by developing a custom plugin to perform validation
by using a constraint language supported by the API Extensions Architecture
Configures the API Extensions Architecture by defining a policy determining which model(s) are subject to validation, and when.
Aaron Birkland
Not to be pedantic, but aren't the tasks of "defining a policy determining which model(s) are subject to validation" and "using a constraints language" under the purview of a validation extension itself, rather than the API extension architecture? Stated another way, the API extension architecture itself does not know anything about validation or content models or constraint languages, but it does know what an object is, what an HTTP request is, and how to route requests to services.
So if the underlying issue is figuring out a better method than an rdf:type marker to determine whether the API extension architecture binds a specific extension to a request on an object, we can try to do that. To me, the intuitiveness, simplicity, and explicitness of an rdf:type marker makes it an attractive piece of data the API Extension architecture can use to bind a particular object to a particular service in response to an appropriate request. I suppose I'm having a hard time understanding where this breaks down for the validation use case(s)?
Elliot Metsger
No, the pedantry is welcome. I think I've had the role of the extension architecture and individual extensions themselves conflated in my head. So what you say makes sense.
What happens if you want to only validate myns:Image objects that are submitted to a particular collection? If I understand what you're saying, the architecture routes the request to the validation extension, and it is the extension's responsibility to determine whether or not to perform the validation?
Aaron Birkland
If the extension architecture globally involves the validation extension always for every resource under every circumstance (as I believe Stefano is suggesting it should), then yes, it would be entirely the extension's responsibility for determining whether to validate. I'm suggesting that (short of designing a validation extension as part of this work) it would also be reasonable to design a validation extension where this isn't the case. If the API Extension architecture binds a request to a particular extension only where there is a marker present (like rdf:type of cma:validatable), then the scenario you describe above can be handled by placing rdf:type of cma:validatable on the collection(s) you want to validate, and omitting it on collections you don't want validated at all.
Both validation extension scenarios may be reasonable. If there is agreement that this is the case, then they both can be used to generate requirements for the API Extension Architecture (e.g. 'it shall be possible to bind an extension so that it is always enabled globally' and 'it shall be possible to bind an extension based on the presence of a specific rdf:type marker', etc.). Then whatever approach an actual validation extension takes is irrelevant, because we will expect it to work either way.
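The two binding modes discussed here (globally enabled vs. bound by an rdf:type marker on the resource or an ancestor container) could be sketched as a single routing predicate in the extension engine; all names, including the `binding` and `marker` keys, are hypothetical:

```python
# Hypothetical routing check in the extension engine: an extension is
# invoked either globally, or when a marker rdf:type (e.g. cma:validatable)
# is present on the resource itself or on one of its ancestor containers.
def should_invoke(extension, resource_types, ancestor_types):
    if extension.get("binding") == "global":
        return True
    marker = extension.get("marker")  # e.g. "cma:validatable"
    return marker in resource_types or marker in ancestor_types
```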
Stefano Cossu
+1
In the case of a UUID, for example, I want to allow the client to set it on resource creation, but after the resource is created, that property cannot be changed.
I would be tempted to map different validation scenarios to HTTP methods, but that may not work with the example above because PUT can both create and update a resource. In that case I would need some extra logic (possibly provided by a plugin as you suggest).
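The create-vs-update distinction for an immutable property like a UUID could be sketched as follows; since PUT can both create and update, the check keys off whether the resource already exists, and all names here are illustrative:

```python
# Hypothetical check for an immutable property (e.g. myns:uid): the client
# may set it on creation, but it cannot be changed afterwards.
def uid_change_allowed(method, existing_resource, new_uid):
    if method == "POST" or existing_resource is None:
        return True  # creation: the client may choose the UID
    # update: the UID must match the stored value
    return existing_resource.get("myns:uid") == new_uid
```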
Elliot Metsger
Possible reasons you may want to expose a validation service for ad hoc invocations is because:
Aaron Birkland
Also, the ad-hoc 'validation service' supports inherently asynchronous workflows. In that case, a repository manager might not want synchronous validation on POST, PUT, PATCH, DELETE at all. Imagine a scenario where contributors deposit initial/incomplete content that is subsequently updated/refined until it reaches a publication point. In that scenario, it may be perfectly reasonable for the repository to at least persist and make accessible 'invalid' content so that it may be fixed at some future point.
Stefano Cossu
Aaron, Elliot,
I think we are talking about two different concepts.
What I mean by a "valid" resource is a resource that is eligible to be classified with a certain content model. A "work-in-progress" item may still have some hard constraints (e.g. a UID) which, if not satisfied, should prevent it from being stored in the repository as it is "bad data".
That is why I think points 1 and 2 in Elliot's scenario should not happen in a "normal" workflow. IMO, if you are relying on a content model, you would expect all stored resources to follow it at all times. That is, validation may be asynchronous, but the resource should not be persisted until validation passes.
As for point #3, that is a very normal scenario.
That said, I agree with the need for ad-hoc validation. In all cases, we need to be able to re-validate part or all of the repository.
Stefano Cossu
Your use case could maybe be resolved by using a "basic resource" content model to which a "publish-ready resource" content model is added at a later time: the former has constraints mandatory for that resource to even be ingested, the latter additional constraints to classify the resource as "completed". This implies the use of multiple content models on a resource, which may need some extra discussion.
Elliot Metsger
I had assumed, perhaps mistakenly, that a resource could participate in multiple models at once, because nothing prevents a Fedora resource from having multiple rdf:types, and presumably any one of those types could be linked to a content model.
Stefano Cossu
I assumed that possibility too, but I wanted to make it clear whether this may be the actual case or not; if it were, there may be more things to consider, such as how to deal with conflicting validation rules.