Fedora's SOAP API has historically been split into two: API-A and API-M. The REST API, being resource-oriented, naturally isn't. And the control over which endpoints can be set to require authentication is inconsistent across the two. The old "Authenticate for API-A? Authenticate for API-M?" options no longer make sense when you look at Fedora's APIs as a whole.
Here's one way the situation could be improved.
- For the SOAP API (all read-oriented and write-oriented methods), always require authentication.
- For the REST API, on a per-verb basis (POST/PUT/DELETE/GET), offer the following options at install time:
- Proactive Challenge: Always require authentication.
- Reactive Challenge: Only require authentication if an un-authenticated request failed due to AuthZ rules.
NOTE: This is being tracked as FCREPO-668
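Here's a minimal sketch of how the two challenge modes might differ at decision time. All of the names here (ChallengePolicy, Verb, Mode, Outcome) are hypothetical and not part of Fedora; defaulting unconfigured verbs to Proactive is also my assumption.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch of a per-verb challenge policy; not an existing Fedora class. */
final class ChallengePolicy {
    enum Verb { GET, POST, PUT, DELETE }
    enum Mode { PROACTIVE, REACTIVE }
    enum Outcome { ALLOW, CHALLENGE, FORBIDDEN }

    private final Map<Verb, Mode> modes = new EnumMap<>(Verb.class);

    ChallengePolicy(Map<Verb, Mode> configured) {
        // Assumption: any verb not configured at install time defaults to PROACTIVE.
        for (Verb v : Verb.values()) {
            modes.put(v, Mode.PROACTIVE);
        }
        modes.putAll(configured);
    }

    /**
     * @param authenticated whether the request carried valid credentials
     * @param authzPermits  whether the AuthZ rules permit the request for the
     *                      current (possibly anonymous) caller
     */
    Outcome decide(Verb verb, boolean authenticated, boolean authzPermits) {
        if (!authenticated) {
            if (modes.get(verb) == Mode.PROACTIVE) {
                return Outcome.CHALLENGE;          // always ask for credentials
            }
            // REACTIVE: only challenge once the anonymous request fails AuthZ
            return authzPermits ? Outcome.ALLOW : Outcome.CHALLENGE;
        }
        return authzPermits ? Outcome.ALLOW : Outcome.FORBIDDEN;
    }
}
```

With GET set to Reactive, an anonymous GET that the AuthZ rules permit goes straight through, while an anonymous POST (left at the Proactive default) is always challenged.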
This blog post has been moved to the Fedora Create Space. See EZService
Paul was describing how Mulgara Resolvers work today and Aaron pointed out that it may be possible to plug MPTStore in as a resolver. This is interesting because it would allow one to use Mulgara's higher-level query capabilities (e.g. SPARQL) on top of MPTStore. Additionally, for Fedora this would allow us to continue offering triplestore engine choice while working directly against the Mulgara interfaces vs going through Trippi.
Note, however, that when using XA1.x (Mulgara's current storage engine), getting good throughput for lots of small writes would still require some sort of buffering layer. Long-term, we believe XA2 will remove the requirement to do buffering "above" Mulgara for good write throughput. XA2 is designed to make newly-committed changesets immediately visible, then merge them in the background.
Key Mulgara Interfaces
- org.mulgara.resolver.spi.Resolver
- Transactions:
- Query Results:
- Queries:
Key MPTStore Interfaces
Questions
- If the MPTStore resolver provides an XAResource by implementing the EnlistableResource interface, does this make Mulgara transactions automatically work with the resolver?
- What are the implications of this issue?
I've recently made a few significant changes to the FCREPO Tracker and thought it would be good to share them here. Some of this info will eventually work its way into the Developer's Wiki, but for now it's a blog post.
Improvements
- Wider Access: The FCREPO Tracker now allows issues to be submitted from anyone in the community. Formerly, we experimented with having a dedicated tracker for community-submitted issues, but decided it'd be best to track everything in one place and improve the issue workflow instead (see below).
- Wider Scope: Fedora, GSearch, OAIProvider, and DirIngest are now listed as "Components" within the FCREPO tracker. These are all releasable products within the Fedora Repository Project. As part of this change, I made the JIRA "Versions" for the project product-specific (e.g. "GSearch 2.2"). This makes sense because each product has its own release schedule.
- Improved Workflow: Issues now start in the new Received state, rather than the Open state. The distinction is that Open issues have been validated and determined by the committers to be in-scope. More specifics below.
Issue Lifecycle
Received
All new issues begin in this state. If an issue is clearly in scope for the project, it will be moved to "Open" by one of the committers. If it's not in scope, is a duplicate, etc., it will be closed. If an issue remains in the Received state for longer than a couple of weeks, it means we think it deserves more discussion before making a determination.
Open
Issues in this state are in scope for a future release. The target release may not have been determined yet. Generally, higher-priority items will be worked on first. The committers will take our best guess at the initial priority of issues, but the priority may be increased or decreased based on input we receive (votes, comments) from the community.
Issues that are not yet assigned can be worked on by anyone in the community. Attaching a good quality patch to an open issue is the best way to vote.
In Progress
This state means a developer is currently working on the issue. Small issues will usually bypass the review step and be Closed (and resolved as "Fixed") once the changes are committed to trunk and pass automated tests. Larger issues will be moved to the "In Review" state at the discretion of the developer.
In Review
The assignee has asked someone to take a look at the solution before closing the issue.
Reopened
The issue was thought to be resolved, but isn't.
Closed
The issue has been resolved. Possible resolutions are:
- Fixed
- Won't Fix (out of scope)
- Incomplete (not completely described)
- Cannot Reproduce (for Bugs)
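For those who think better in code, here's a rough sketch of the lifecycle above as a state machine. The IssueState enum is purely illustrative and not how Jira stores the workflow. The transitions out of Received, Open, and In Progress come from the descriptions above; the ones out of In Review, Closed, and Reopened are my assumptions.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative model of the issue lifecycle; not the actual Jira workflow definition. */
enum IssueState {
    RECEIVED, OPEN, IN_PROGRESS, IN_REVIEW, REOPENED, CLOSED;

    /** States an issue may move to directly from this one. */
    Set<IssueState> next() {
        switch (this) {
            case RECEIVED:
                return EnumSet.of(OPEN, CLOSED);        // validated, or out of scope/duplicate
            case OPEN:
                return EnumSet.of(IN_PROGRESS);         // a developer picks it up
            case IN_PROGRESS:
                return EnumSet.of(IN_REVIEW, CLOSED);   // small issues bypass review
            case IN_REVIEW:
                return EnumSet.of(IN_PROGRESS, CLOSED); // assumption: review may send it back
            case CLOSED:
                return EnumSet.of(REOPENED);            // assumption: only closed issues reopen
            case REOPENED:
                return EnumSet.of(IN_PROGRESS, CLOSED); // assumption
            default:
                return EnumSet.noneOf(IssueState.class);
        }
    }

    boolean canMoveTo(IssueState target) {
        return next().contains(target);
    }
}
```

For example, IssueState.RECEIVED.canMoveTo(IssueState.IN_PROGRESS) is false in this model, since an issue has to be validated (moved to Open) before anyone starts on it.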
At the on-site last week, we had some additional discussion of the "Service Ladder" idea. Notes and images from the board can be found at the On-Site Nov 3-7 Meeting Notes page.
Bill and I are leaving Boston today after a good set of meetings with the DSpace folks. I'll be hitting the road shortly, but I wanted to get some thoughts out on what we're calling "The Ladder" for lack of a better term.
The concept of the ladder is this: Assuming we have Akubra, what are the next "rungs" of agreement in a shared persistence model? We played with the idea of different levels, starting with the purely structural, and moving carefully up into semantics.
At Level 0, you have Blobs with Ids. This is Akubra.
At Level 1, Relationships
At Level 2, Aggregation
At Level 3, Metadata/Semantics
I should point out that this ordering was a result of brainstorming, and we did not attempt to flesh out the details for each level. But it gave us a sort of strawman to frame the discussion.
Here's my first attempt at refinement after sleeping on it. Two key goals:
- It is possible to persist everything in Level N + 1 into N. If this is done, rebuilding everything from level 0 is possible.
- Though RDF might not be presumed, it is possible to represent each level in RDF (minus blob content)
Level 0
Blobs with Ids and readonly properties (size, minimal others).
Level 1
Entities with readonly + writable relationships and properties.
Entities optionally have content (represent bitstreams aka Blobs).
Level 2
Entities with everything in level 1, plus key inferred (read-only) relationships. A big requirement here is to support two-way links for whole-part relations. Might also include transitive relationships.
It's possible to infer such relationships from what's expressed in level 1 and a set of inference rules.
Level 3
Higher-order repository domain objects. The concept of a metadata entity relating to content entity would be represented here.
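To make the Level 2 idea a bit more concrete, here's a small sketch of inferring relationships from Level 1 assertions. The predicate names (hasPart, isPartOf), the Rel record, and the two rules (inverse links, transitive isPartOf) are stand-ins I picked for illustration; we haven't agreed on any actual vocabulary or rule set. It needs Java 16 or later for records.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative only: derives read-only Level 2 relationships from Level 1 assertions. */
final class Level2Inference {
    /** A relationship between two entity ids. */
    record Rel(String subject, String predicate, String object) {}

    // Hypothetical predicate names, chosen for this example.
    static final String HAS_PART = "hasPart";
    static final String IS_PART_OF = "isPartOf";

    /** Returns only the relationships that were inferred, not the asserted ones. */
    static Set<Rel> infer(Set<Rel> asserted) {
        Set<Rel> all = new LinkedHashSet<>(asserted);
        boolean changed = true;
        while (changed) {                       // repeat until no rule adds anything new
            Set<Rel> derived = new LinkedHashSet<>();
            for (Rel r : all) {
                if (r.predicate().equals(HAS_PART)) {
                    // Rule 1a: two-way link, whole -> part implies part -> whole
                    derived.add(new Rel(r.object(), IS_PART_OF, r.subject()));
                }
                if (r.predicate().equals(IS_PART_OF)) {
                    // Rule 1b: two-way link, part -> whole implies whole -> part
                    derived.add(new Rel(r.object(), HAS_PART, r.subject()));
                    // Rule 2: isPartOf is transitive
                    for (Rel s : all) {
                        if (s.predicate().equals(IS_PART_OF) && s.subject().equals(r.object())) {
                            derived.add(new Rel(r.subject(), IS_PART_OF, s.object()));
                        }
                    }
                }
            }
            changed = all.addAll(derived);
        }
        Set<Rel> inferredOnly = new LinkedHashSet<>(all);
        inferredOnly.removeAll(asserted);
        return inferredOnly;
    }
}
```

Since everything returned by infer() can be recomputed from the asserted relationships plus the rules, none of it needs to be persisted, which fits the goal of being able to rebuild from the lower levels.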
- We decided during our status meeting that we'd shoot for Oct 15th-17th timeframe for the 2.2.4/3.1 releases; Bill and I will be in Cambridge meeting with DSpace folks much of this week.
- I gave an update on Akubra Analysis of Existing Approaches at the architecture meeting.
- I moved the Akubra project infrastructure (wiki, svn, tracking, mailing lists) to Fedora Commons. This is the first project that will use our locally-managed Subversion installation. It's also the first to use Google Groups for the mailing lists (rather than Sourceforge.net).
- Unfortunately I didn't get a chance to review outstanding branches by Bill and Eddie...hopefully I'll be able to do that on and off while in Cambridge this week.
Just wanted to share this paper, in which the term "spanning layer" was coined.
Interoperation, Open Interfaces, and Protocol Architecture - David D. Clark
Interesting stuff.
For Akubra, I started this week by scouring the web for existing blob storage APIs. I've been keeping disorganized notes here and there, and decided it was past time to start a wiki page on the topic. As the links started piling up, I realized it would be really useful to have a "capability matrix" for the APIs as well as existing implementations. It's not finished yet, but here's where I'm keeping it: Analysis of Existing Approaches. I hope to have the matrix finished by Monday night.
This week, I also:
- Reviewed Eddie's Mulgara/SPARQL update for 2.2.4.
- Updated the subversion config in src/build to use correct line endings for .sh/.bat. Bill updated the svn properties on the existing files.
- Got an open source/nonprofit license for the Gliffy Confluence Plugin. Dan installed it. I really like the way these plugins can be installed and enabled straight through the Confluence admin interface, while it's running.
- Merged in FCREPO-253 after Bill's review
- Started reviewing Eddie's branch, FCREPO-254
- Did some top-secret firewall configuration stuff. I suppose it's not good for me to go into details publicly.
This week, I:
- Concentrated almost exclusively on Akubra (see below)
- Did some tweaks on our custom "Reviewer" workflow in our Jira installation
- Set up nightly postgres snapshots on the production box since we're now in production with Confluence and Jira
- Copied Eddie's test "Developer's Blog" wiki page and added an Atom feed. This now lives in, and is linked from the Developer's Forum space in Jira. We're now each doing weekly updates as News items in our personal space, which then get aggregated to this page.
Akubra progress
- Got initial File System (org.fedoracommons.akubra.fs) implementation done
- Gave an update on Akubra at the weekly architecture meeting on Tuesday
- Had some good discussions on the Akubra Developer's Mailing List.
- Learned about the API behind the Storage Resource Manager (SRM), which is used by the Large Hadron Collider Computing Grid (LCG).
I also had an initial exchange with Richard Rodgers regarding DSpace-Fedora Commons collaboration on low-level storage. Richard has been working on Pluggable Stores for DSpace 2.0, and from the look of it, we have a lot of the same sensibilities about this layer being thin and free of higher-level OAIS semantics. I'm looking forward to more discussions to come.
Below are the notes I used for the Akubra update in today's architecture meeting.
What's Akubra About?
- A standard Java interface for reading/writing files, but at a different level of abstraction than a filesystem
- Transactional by design (but implementations may ignore transaction semantics)
- Exploring web-based exposure
- From the Akubra wiki: Requirements and Goals
Note: The Akubra wiki is hosted at http://topazproject.org/akubra
Anyone is welcome to sign up to the dev and general mailing lists.
Filesystem vs. BlobStore
Common filesystems:
- Have directories
- Can provide system metadata about files (e.g. size, modified date)
- Allow partial reads and writes of files
An Akubra BlobStore:
- Has a collection of URI-addressable bitstreams (no "directories")
- Only provides the size of each file -- is not concerned with other system metadata (yet?).
- May allow partial reads (InputStreams can skip()...)
- Does not allow partial writes
Java API
This is in flux. We are currently testing the design with a simple filesystem implementation.
Blob (A finite, readable bitstream)
BlobStore (For getting connections)
BlobStoreConnection (For CRUD operations)
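Since the API is in flux, treat the following as a rough sketch of the shape of these three types rather than the real Akubra interfaces. The three type names come from the list above, but every method signature here is my guess for illustration, and MemoryBlobStore is a made-up in-memory stand-in (it is not FSBlobStore).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** A finite, readable bitstream. Method names are assumptions. */
interface Blob {
    URI getId();
    long getSize();                    // size is the only system metadata exposed
    InputStream openInputStream();     // callers can skip() for partial reads
}

/** For CRUD operations. Method names are assumptions. */
interface BlobStoreConnection extends AutoCloseable {
    Blob getBlob(URI id);                                          // null if absent
    void putBlob(URI id, InputStream content) throws IOException;  // whole-blob write only
    boolean deleteBlob(URI id);
    @Override void close();
}

/** For getting connections. */
interface BlobStore {
    BlobStoreConnection openConnection();
}

/** Hypothetical in-memory implementation, just to exercise the interfaces. */
final class MemoryBlobStore implements BlobStore {
    private final Map<URI, byte[]> blobs = new ConcurrentHashMap<>();

    @Override
    public BlobStoreConnection openConnection() {
        return new BlobStoreConnection() {
            @Override
            public Blob getBlob(URI id) {
                byte[] data = blobs.get(id);
                if (data == null) {
                    return null;
                }
                return new Blob() {
                    @Override public URI getId() { return id; }
                    @Override public long getSize() { return data.length; }
                    @Override public InputStream openInputStream() {
                        return new ByteArrayInputStream(data);
                    }
                };
            }

            @Override
            public void putBlob(URI id, InputStream content) throws IOException {
                // No partial writes: the whole blob is replaced in one step.
                blobs.put(id, content.readAllBytes());
            }

            @Override
            public boolean deleteBlob(URI id) {
                return blobs.remove(id) != null;
            }

            @Override
            public void close() {
                // nothing to release in this in-memory stand-in
            }
        };
    }
}
```

Note there is no way to write to the middle of a blob here: putBlob replaces the whole thing, while a partial read is just a skip() on the returned stream.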
Transactions
Level of support varies per-implementation (some can "fake it")
Why: To execute a mixed set of CRUD operations of several files as one atomic unit of work.
Observation: We can build a transactional blob store on top of a non-transactional one...with the help of a DB.
Example non-transactional BlobStore: FSBlobStore (see FSBlobStoreConnection)
Higher-level BlobStore TBD:
- Uses FSBlobStore to persist data
- Uses database to support transactions (via id mapping)
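Here's a toy sketch of the id-mapping idea, since the higher-level store is still TBD. MappedTxStore and everything in it are hypothetical: one in-memory map stands in for the non-transactional store, and another stands in for the database table that maps public ids to internal ones. In a real implementation the mapping update in commit() would be a single database transaction.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical sketch: transactions layered over a non-transactional store via id mapping. */
final class MappedTxStore {
    private final Map<String, byte[]> files = new HashMap<>(); // stand-in for the plain store
    private final Map<URI, String> idMap = new HashMap<>();    // stand-in for the DB mapping table

    final class Tx {
        // public id -> newly written internal id; a null value means "delete on commit"
        private final Map<URI, String> pending = new HashMap<>();

        void put(URI id, byte[] content) {
            // Write under a fresh internal id; nothing is visible until commit.
            String internalId = UUID.randomUUID().toString();
            synchronized (MappedTxStore.this) {
                files.put(internalId, content.clone());
            }
            stage(id, internalId);
        }

        void delete(URI id) {
            stage(id, null);
        }

        private void stage(URI id, String internalId) {
            String superseded = pending.put(id, internalId);
            if (superseded != null) {
                synchronized (MappedTxStore.this) {
                    files.remove(superseded); // earlier write in this same transaction
                }
            }
        }

        void commit() {
            synchronized (MappedTxStore.this) {
                // The atomic step: swap all mappings at once, then clean up old content.
                for (Map.Entry<URI, String> e : pending.entrySet()) {
                    String old = (e.getValue() == null)
                            ? idMap.remove(e.getKey())
                            : idMap.put(e.getKey(), e.getValue());
                    if (old != null) {
                        files.remove(old);
                    }
                }
            }
            pending.clear();
        }

        void rollback() {
            synchronized (MappedTxStore.this) {
                for (String internalId : pending.values()) {
                    if (internalId != null) {
                        files.remove(internalId); // discard uncommitted content
                    }
                }
            }
            pending.clear();
        }
    }

    Tx begin() {
        return new Tx();
    }

    synchronized byte[] read(URI id) {
        String internalId = idMap.get(id);
        return internalId == null ? null : files.get(internalId).clone();
    }

    synchronized int storedFileCount() {
        return files.size();
    }
}
```

The point is that the underlying store only ever sees plain writes and deletes under internal ids; whether a blob "exists" from the outside depends entirely on the mapping, which is the one thing that has to change atomically.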
Other possible storage Plug-Ins:
- S3 (anything based on current LLStore should be easy to port over)
- ZFS (already transactional, does not need layering)
- Centera (content-addressable...ids not available till content is written)
- Sam/QFS (hierarchical storage implies graceful handling of delays...not a use case we've factored in yet)
Web-based exposure?
- Opens up use of akubra-java impls to other (remotely-running) programs
- Allows an akubra impl that's a client to remote akubra instance
- Lots of interesting possibilities!
This week I completed the initial migration from the Sourceforge Trackers to our Jira installation. They're now under the Jira Project, FCREPO (view all issues). This process gave me a chance to learn quite a bit about Jira and how to customize things.
Bill and I talked late in the week about how to use Jira to replace our current branch log (a history of who worked on what branches, when). What we finally came up with was a new issue type, "Code Task", whose id will serve as the id for the branch (e.g. fedora/branches/fcrepo-123). We also added a "review" step to the workflow for these types of issues. Besides being easier than hand-updating the old branch-log.txt file, we now have an easy way to list the outstanding branches (view outstanding branches).
I also started reviewing many of the long-outstanding feature requests and found a lot that we need to revisit and either specify better or decide if they're still relevant. To logically separate these from those features we know are on the immediate horizon, I've given them a "Pending Feature Request" issue type.
On Thursday, I applied a patch submitted by Atos, which fixes a thread-safety issue with file uploads and improves performance when many uploads are happening at once. This has gone into my bugfix branch for 3.1, which I'll need to get reviewed soon.
On Friday, Aaron and I talked with members of the Digital Library Infrastructure team at Stanford. They're getting up and running with Fedora and are in the process of deciding how to model their objects. They have quite a large collection of digitized books, image collections, special collections, and archives, and will be using Fedora as an object registry in their architecture. One thing that came up in the call was that the out-of-box modeling language in Fedora 3.0 doesn't have ordinality/cardinality constraints. I don't know how the ds-composite-model schema will evolve in future releases, so I recommended for the time being that they use a separate datastream in the CModel object to hold such constraints.
This morning I've been looking into doing a scripted migration of our Sourceforge trackers to Jira. I wasn't able to find a working migration utility that does this, but it looks like it won't be too bad. Sourceforge allows the export of all project data (including trackers) to a big XML file, and Jira has several import options, the most palatable being the Jelly scripting facility. Right now it's a matter of doing the right mapping.