Namespaces for fun and profit

Technical Blog Post

Abstract

Namespaces for fun and profit - why you need one

Body

Background (optional reading)

I've had variations on this conversation several times over the past few months. I guess I should create a proper rant and number it. OTOH making it a blog post means I have a URI, which is way better than a numbered rant ;-)

Tom: These Linked Data/OSLC APIs all require namespaces. Where should I put client extensions?

Me: In a client-owned namespace.

Tom: But I don't know what that value would be/don't have one. Can't I just put them in mine?

Me: [1: get back up off the floor. 2: count to ten. 3: meditate quickly.] No.

Tom: Can I just use the deployed server hostname?

Me: [1: purge images of Dilbert's Alice and her Fist of Death from brain] No.

... and so on

Now before any Toms I know get all in a huff, names have been changed to protect the guilty. These are not foolish questions or people, not in the least. Anyone schooled in the last 20 years would quite naturally think (in object-oriented terms) that everything is implicitly scoped to the class, database, organization, etc. in which it occurs, and this is fine. After all, it's worked so far for them. It works in a lot of real-world situations too, but not in all of them.

So what's the big deal with namespaces anyway?

Simple answer: uniqueness in time and space, across a set of parties that can operate in parallel with zero coordination. Neat, huh? Now what does all that actually mean?

Linked Data requires namespaces because the assumption is that the data is going to be available on "the" Web (maybe the Internet, maybe an intranet, but that's an implementation detail in terms of the data), and that it's going to be consumed by code, perhaps in addition to humans. Since it's going on the Web, the assumptions are that it's going to persist (remain available for an extended period), that it can be consumed in ways that the producer never directly considered, and that the producer and consumers are loosely coupled. Therefore all identifiers must be unambiguous.

Not all integrations meet those criteria; sometimes you just want to ETL data from one place to another, transform it perhaps along the way, and it's throwaway/limited lifetime code. Don't feel obligated to carefully choose namespaces in those situations. Don't feel obligated to even use namespaces in those situations; if you're using a technology (for other reasons) that forces namespaces on you, pick whatever you like off the cuff - since it's throwaway code, no one will care and no Namespace Police are likely to darken your doorway. Use the right tool for the right job.

Of course, if it starts as throwaway and later grows into something that needs to be durable, its implementation might need to change.

Server consolidation scenario

Your business has two on-premise asset management servers, owned and managed at the department level. They both follow an internal corporate standard to use the property name assetnum, but the values (identifiers, to be precise) are assigned independently. You discover that you can save money by consolidating those servers into a single one.

It's useful that the property name assetnum is used in both, but only those aware of the corporate standard know that the property names are intentionally equal. Any code that's written must have that knowledge baked into the code in order to make use of it. Once that knowledge is baked into the code, the code is less general-purpose (so it's less re-usable for other tasks).
It's misleading (at best) that the assetnum values can match for different assets, because the values were assigned independently. Both departments can set an asset number of 100, and all their processes work. Treat two assets as a single one though, and your accountants might become vexed.
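
To make the vexed accountants concrete, here is a minimal sketch of a naive consolidation. The record layout, the descriptions, and the naive_merge helper are all made up for illustration; the only thing taken from the scenario is that both departments independently assigned asset number 100.

```python
# Hypothetical records from the two departmental servers.
dept_a = [{"assetnum": "100", "description": "forklift"}]
dept_b = [{"assetnum": "100", "description": "laser printer"}]


def naive_merge(*sources):
    """Merge records keyed only on the unqualified assetnum value."""
    merged = {}
    for records in sources:
        for record in records:
            # A later record with the same assetnum silently replaces the earlier one.
            merged[record["assetnum"]] = record
    return merged


merged = naive_merge(dept_a, dept_b)
print(len(merged))                   # 1
print(merged["100"]["description"])  # laser printer
```

Two physical assets went in and one record came out. Nothing failed or warned; the forklift is simply gone.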

I darn you to heck

(If you don't get that reference, it's another Dilbertism)

You might follow Tom's thinking and say "just treat unqualified things (names, identifiers) whose values are equal as the same thing, regardless of where the data came from".

I suspect, pointed out like that, it's pretty obvious that this could work, but it's not reliable without some outside or implicit knowledge... and when that knowledge changes, things break. That risk might be tolerable for a short-term one-off project, but not for data intended to be available over time and open to novel or unknown consumers. But let's make that assumption provisionally and see what happens.
When you see asset values, you assume "same value, same thing". Worked brilliantly for property names, not so much for independently assigned values. Those values, of course, are unique only within a department; in other words, to make them unique within the business you need to know both the assetnum value and the department value... but how does code written to consume "general" data know this?

Try some other assumptions on for size, like the ones from the sample conversation. Think about similar scenarios, like a merger: you acquire another business; your executives expect you to be able to report over all assets, not give them two sets of asset reports from separate tools. Who knows what overlaps in the identifiers exist, and which overlaps are intentional (useful) or not (or worse: wrong/misleading)?

Here I come to save the day...

Let's say you're convinced. What would a reasonable use of namespaces for this problem look like? How do I choose a namespace value?

Namespaces are identified by their URI; URIs can be long and ugly (sorry), but they are familiar to any browser user. Friendly namespaces like those used in Linked Data have HTTP URIs that will serve descriptive documents, although strictly speaking the documents are optional. They're also hierarchical, and most use the delegation of authority principle from Web architecture. Here's an example: http://open-services.net/ns/crtv#. Through the magic of the Domain Name System, someone pays ICANN (I'm simplifying a bit here) to own the open-services.net domain name. Whoever that entity (person or organization) is, they then control the allocation of all URIs starting with that hostname.
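
If you want to see the hierarchy for yourself, Python's standard urllib.parse will pull a namespace URI apart. This is only a sketch: the authority helper and the term name someTerm are made up for illustration and are not part of any real vocabulary.

```python
from urllib.parse import urlsplit


def authority(uri: str) -> str:
    """Return the authority component of a URI (here, just the hostname)."""
    return urlsplit(uri).netloc


namespace = "http://open-services.net/ns/crtv#"
term = namespace + "someTerm"  # hypothetical term name, for illustration only

print(authority(namespace))     # open-services.net
print(urlsplit(term).path)      # /ns/crtv
print(urlsplit(term).fragment)  # someTerm
```

Whoever controls the hostname decides what lives under /ns/, whoever they delegate /ns/crtv to decides which terms exist after the #, and so on down the tree.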

Assumptions:

Our hypothetical business owns http://www.example.org, for simplicity.
Internally, the business decides how identifiers get assigned to things in need of unambiguous identification. Not to other things.
Corporate Standards decides the business's namespace allocation policy, which says
- http://www.example.org/namespaces/ is a common prefix for all their namespaces
- http://www.example.org/namespaces/standards/ is a common prefix for all corporate standard terms/namespaces, whose assignments are controlled centrally by Corporate Standards (it delegates to itself, if you like)
- http://www.example.org/namespaces/departments/department-name/ is a common prefix for department-level terms/namespaces, whose assignments are controlled by the owning department. Thus, department A is assigned http://www.example.org/namespaces/departments/A/ and so on.
Department A decides that asset identifiers it assigns follow the pattern http://www.example.org/namespaces/departments/A/assetnum
Department B decides that asset identifiers it assigns follow the pattern http://www.example.org/namespaces/departments/B/asset#assetnum (if you need a reason, assume Not Invented Here syndrome).
Corporate Standards decides the internal standard name for asset number properties is
- assetnum, in contexts like relational databases that do not natively support namespace qualification of identifiers
- http://www.example.org/namespaces/standards/properties/assetNumber in contexts like Linked Data that natively support namespace qualification
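
The allocation policy above is simple enough to sketch in a few lines of code. The constant and function names here are my own invention for illustration; only the URI patterns come from the assumptions.

```python
# Common prefix for all of the business's namespaces (from the policy above).
BASE = "http://www.example.org/namespaces/"

# Corporate standard property name, for contexts that support namespaces.
ASSET_NUMBER = BASE + "standards/properties/assetNumber"


def asset_uri_a(assetnum: str) -> str:
    """Department A's pattern: .../departments/A/<assetnum>"""
    return f"{BASE}departments/A/{assetnum}"


def asset_uri_b(assetnum: str) -> str:
    """Department B's pattern: .../departments/B/asset#<assetnum>"""
    return f"{BASE}departments/B/asset#{assetnum}"


print(asset_uri_a("100"))  # http://www.example.org/namespaces/departments/A/100
print(asset_uri_b("100"))  # http://www.example.org/namespaces/departments/B/asset#100
```

Each department mints identifiers with zero coordination, and because each stays under its own delegated prefix, the results cannot collide.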

Example Resulting URIs that show up in the Linked Data:

http://www.example.org/namespaces/standards/properties/assetNumber as a property name
http://www.example.org/namespaces/departments/A/100 as department A's identifier for what it calls asset number 100
http://www.example.org/namespaces/departments/B/asset#100 as department B's identifier for what it calls asset number 100

Note what has changed, and what has not, compared to the original simple assumption (plus any other variants you tried as homework):

The useful property of a corporate standard property name has not only been preserved, it has been rendered explicit. If you merge with another organization (call it SmallFry.org), even if their namespace allocation policies were identical, their URIs would still be unique (because theirs would start with a different hostname).
The misleading property of "apparently equal" identifier values (100) has been completely fixed.
Humans (and code) still have no way to know whether or not .../A/100 is the same asset as .../B/asset#100 without adding extra knowledge.
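
Here is a minimal sketch of the consolidated data once everything is namespace-qualified. The dictionary layout, the same_asset set, and the is_same_asset helper are assumptions made for illustration; a real Linked Data store would hold triples and have its own way of stating equivalence.

```python
ASSET_NUMBER = "http://www.example.org/namespaces/standards/properties/assetNumber"
A_100 = "http://www.example.org/namespaces/departments/A/100"
B_100 = "http://www.example.org/namespaces/departments/B/asset#100"

# Each department's data: subject URI -> {property URI: value}
dept_a = {A_100: {ASSET_NUMBER: "100"}}
dept_b = {B_100: {ASSET_NUMBER: "100"}}

# The keys are full URIs, so merging cannot collapse the two assets.
merged = {**dept_a, **dept_b}
print(len(merged))  # 2

# The shared property name is explicit: both records use the same property URI.
print(all(ASSET_NUMBER in props for props in merged.values()))  # True

# Extra knowledge has to be stated explicitly. This hypothetical set holds
# pairs of URIs that someone has asserted identify the same asset.
same_asset = set()


def is_same_asset(x: str, y: str) -> bool:
    """True only if the URIs are identical or were explicitly declared equivalent."""
    return x == y or frozenset((x, y)) in same_asset


print(is_same_asset(A_100, B_100))  # False
```

Add frozenset((A_100, B_100)) to same_asset and the answer flips to True. The point is that someone has to say so; the code never infers it from the matching 100.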
