Over the years I have been designing and building software, I have noticed one recurring set of problems that keeps cropping up, regardless of company, product domain or programming language. Software developers often have a naïve understanding of identity (myself included!), and this leads to all sorts of bugs, hacks and design compromises. You’d think something as fundamental as how to identify a Thing would have been settled by now! To make matters even worse, changing how you identify a Thing after you’ve already amassed a lot of data (Thing Instances) is typically very complicated and expensive.
The philosophy of identity has a long and rich history, so software developers are in good company when it comes to struggling with these issues. What I find particularly interesting is that many of the classic identity thought experiments are very concrete issues for software developers. For example, you may lie awake at night and philosophize as to whether you are identical to your clone in a parallel universe. Programmers, however, frequently write code to clone and move objects between two systems separated by space and time. Every time you synchronize your iPod, a program, and by extension a programmer, applies some quite sophisticated identity management rules.
One of the first things we notice when we start to identify a real-life object is that the attributes we use to identify the object typically depend on the identity of the person asking, or the overall client context. For example, if someone asks me, “Who are you?” I might answer “Daniel”, “Daniel Selman”, or “Rule Studio for Java Team Lead”. I might also identify myself based on my relationship to other people, as often done in the Bible, or using biometric data such as DNA or fingerprints. This is a performance optimization, as it is not feasible to list all our identifying characteristics to all clients! It is the client that imposes their identity requirements on us: if we give the client too little information they simply ask for more; if we supply too much, they probably ignore what they don’t need or cannot interpret.
Types of Identity
Gottfried Leibniz famously stated that “x is the same as y if and only if every predicate true of x is true of y as well.” If you tug gently at this little philosophical thread you quickly become entangled in the fascinating and complex questions related to identity, many of which are still actively debated today.
There are two definitions of identity: numerical identity and qualitative identity.
Objects a and b are numerically identical if a and b are one and the same thing. It is the relation an object has with itself and nothing else, which is a circular definition because “nothing else” means “no numerically non-identical thing”. For example, I will be numerically identical to myself for as long as I exist.
Objects a and b can be said to be qualitatively identical if a and b are duplicates, that is if a and b are exactly similar in all respects. This implies that things can be more or less qualitatively identical. Twins may be qualitatively equal even though they are numerically different.
I-predicates are used to express qualitative identity relationships, taking into account the richness of a given theory or application context. For example, “having the same income as” will be an I-predicate in a theory in which persons with the same income are indistinguishable, but not in a richer theory. Similarly, within the “Selman Family Theory” I can safely use “has the same first name as” as an I-predicate to identify people. This I-predicate would be a foolish choice for the “ILOG Employees Theory”, however!
Some philosophers contend that there is no absolute identity and that identity is always relative. This position is controversial and contested, however.
Criteria of Identity
Similar to I-predicates is the concept of criteria of identity. For example, the criterion of identity for directions is parallelism of lines. The criterion of identity for numbers is equinumerosity of concepts: the number of F’s is identical with the number of G’s if and only if there are exactly as many F’s as G’s.
Identity over Time
Identity over time is particularly controversial, because time involves change. For example, Heraclitus famously argued that one could not bathe in the same river twice – as the water continuously flowing through the river changes its identity.
Take a simple statement such as “Tabby was fat on Monday.” Endurance theorists state that persisting things endure and change through time but extend only through space, not through time. That is, Things are different from events or processes. Perdurance theorists reject this and do not distinguish between Things and processes.
For an endurance theorist, Tabby being fat on Monday is a relation between Tabby and Monday. A perdurance theorist would instead state that Tabby-on-Monday is intrinsically fat.
It is very useful to also consider the questions applicable to personal identity when designing software systems. These questions are:
- Who am I? What are the attributes that make me, me?
- Personhood: What is required to be a person? What is the definition of person?
- Persistence: What events can you survive? What brings your existence to an end?
- Evidence: How do we find out who is who?
- Population: What determines how many of us there are now?
- What am I? What am I composed of?
- How could I have been? Which of my properties do I have essentially, and which only accidentally? Could I have had different parents for example?
For example, take a business rule, copy it, rename it, update some of its properties and delete its history. Is it the same business rule as the original? If I now deploy the business rule from a development server to a cloned staging server, how many business rules do I have? What if I download the business rules from both the development and the staging servers into separate projects within an Eclipse workspace on my local computer? The point is that there can be fairly complex answers to some of these questions, particularly when you have multiple software systems interacting over space and time.
The metaphysical questions below are also very useful to consider as you design software systems:
- What does it mean for an object to be the same as itself?
- If x and y are identical (are the same thing), must they always be identical? Are they necessarily identical?
- What does it mean for an object to be the same, if it changes over time? I.e. is x at time t the same as x at time t+1?
- If an object’s parts are entirely replaced over time, in what way is it the same?
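Qualitative identity in Java is expressed using the equals method, while ordering is expressed using the compareTo method (Comparable) or the compare method (Comparator). The equals method allows you to test for qualitative identity (which can include numerical identity, tested directly with the == operator on object references), whereas the compare method is used to order a list of objects using a comparison predicate.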
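As a minimal sketch of the difference between numerical and qualitative identity in Java, consider the following. The `EqualsDemo` and `Person` classes are hypothetical and exist only for illustration; the choice of first and last name as the identifying attributes is an assumption, not a recommendation.

```java
import java.util.Objects;

class EqualsDemo {
    // Hypothetical value class: identity is defined by first and last name only.
    static final class Person {
        private final String firstName;
        private final String lastName;

        Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }

        // Qualitative identity: same class and same names.
        @Override
        public boolean equals(Object other) {
            if (this == other) return true; // numerical identity implies qualitative identity
            if (other == null || getClass() != other.getClass()) return false;
            Person p = (Person) other;
            return firstName.equals(p.firstName) && lastName.equals(p.lastName);
        }

        // equals and hashCode must agree so that hash-based collections behave.
        @Override
        public int hashCode() {
            return Objects.hash(firstName, lastName);
        }
    }

    public static void main(String[] args) {
        Person a = new Person("Daniel", "Selman");
        Person b = new Person("Daniel", "Selman");
        System.out.println(a == b);      // false: two distinct objects on the heap
        System.out.println(a.equals(b)); // true: qualitatively identical under this equals
    }
}
```

Note that a class can have only one equals method, so this approach bakes a single identity criterion into the type.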
Determine your I-predicates and in Java perhaps code them as Comparators. Does your domain model require several I-predicates? If yes, you will need something other than a single equals method. In one software system I designed we had a dedicated object comparison service that could compare different types of objects using different criteria, based on the client as well as the objects. For example, you might compare an Integer with a Float (with or without rounding), two Doubles (with precision), or two EJBObject instances. Note that most equals methods also test that the classes for the two instances are identical. The JVM determines that two classes are identical if they have the same fully qualified class name and were loaded using the same ClassLoader.
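One way to read the advice above is to give each I-predicate its own Comparator and let the client choose which one applies. The sketch below assumes a hypothetical `Employee` class and a hypothetical `same` helper that treats a comparison result of zero as "identical under this predicate"; it is not the comparison service described above, only an illustration of the idea.

```java
import java.util.Comparator;

class IPredicateDemo {
    // Hypothetical domain class used only for illustration.
    static final class Employee {
        final String firstName;
        final String lastName;
        final int employeeId;

        Employee(String firstName, String lastName, int employeeId) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.employeeId = employeeId;
        }
    }

    // I-predicate that may be adequate in a small context such as one family.
    static final Comparator<Employee> BY_FIRST_NAME =
            Comparator.comparing((Employee e) -> e.firstName);

    // I-predicate for a richer context such as a whole company.
    static final Comparator<Employee> BY_EMPLOYEE_ID =
            Comparator.comparingInt((Employee e) -> e.employeeId);

    // Two objects are "the same" under an I-predicate when it cannot tell them apart.
    static boolean same(Comparator<Employee> iPredicate, Employee a, Employee b) {
        return iPredicate.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        Employee a = new Employee("Daniel", "Selman", 1);
        Employee b = new Employee("Daniel", "Smith", 2);
        System.out.println(same(BY_FIRST_NAME, a, b));  // true
        System.out.println(same(BY_EMPLOYEE_ID, a, b)); // false
    }
}
```

Keeping the predicates outside the domain class means new identity criteria can be added without touching equals.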
Think in terms of namespaces. Many identity schemes rely on namespaces. However, namespaces must be rooted and managed to prevent copying or cloning from corrupting the namespace. Internet domain names are a popular basis for namespaces precisely because they are globally managed and controlled. For example, Java, XML Schema and the Semantic Web all use variations of Internet domain name namespace identifiers.
Decide whether you need numerical identity. How will you determine numerical identity? Object references within a JVM? Generated statistically unique identifiers such as UUIDs? Automatically generated database row IDs? What about object serialization? Object cloning? Database replication?
For the reasons above, numerical identity is very difficult to apply in computer systems. It is often used as an optimization, however, where the scope of the optimization is well understood, such as within a single JVM/ClassLoader or within a single database table. It is usually hidden from end users because it is machine generated and has no inherent business sense. End users typically find opaque machine-generated identifiers difficult to work with, as they cannot understand why two artifacts that appear to be qualitatively identical are numerically different. Even a UUID, which should be universally unique, is problematic because it is trivial to create exact clones of objects in computer systems, rendering the uniqueness property useless.
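To illustrate how easily a "unique" identifier is cloned, here is a small sketch that copies an object through standard Java serialization. The `CloneDemo` and `Document` classes and the `roundTrip` helper are hypothetical names introduced for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.UUID;

class CloneDemo {
    // Hypothetical document carrying a generated UUID as its identifier.
    static final class Document implements Serializable {
        private static final long serialVersionUID = 1L;
        final UUID id = UUID.randomUUID();
        final String title;

        Document(String title) {
            this.title = title;
        }
    }

    // Serialize to bytes and read back, much like copying a file or replicating a row.
    static Document roundTrip(Document original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Document) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        Document original = new Document("Pricing rules");
        Document copy = roundTrip(original);
        System.out.println(original == copy);            // false: a second object now exists
        System.out.println(original.id.equals(copy.id)); // true: both carry the same UUID
    }
}
```

The copy is a different object by reference, yet it claims the same UUID, so the identifier alone can no longer tell the two apart.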
A common problem scenario is deciding that you are going to put a UUID in a document and identify documents by UUID. The end user then copies the document on the file system and ends up with two artifacts that have the same UUID but different file names. When the documents are loaded into your system one of four things can happen:
- The documents are both stored but they are retrieved non-deterministically. Your user interface makes it impossible for the user to understand which document they are editing (very bad!)
- The documents are both stored but the first or last document loaded is always returned. Your user interface makes it impossible for the user to understand which document they are editing (bad!)
- An error is detected as a duplicate UUID was loaded and the end user must intervene to fix the document they did not realize was broken -- because your identifier is opaque (bad!)
- The second document silently overwrites the first, typically because they are being stored in a Map using the UUID as a key (very bad!)
The scenario above happened because the application developer’s identifier conflicts with the underlying storage mechanism’s identifier, which for a file system is typically a fully qualified (case-sensitive?) file name. The file system will happily create exact clones of the developer’s supposedly numerically unique resources. Mismatches between identity criteria are a very common source of identity-related bugs, particularly across space and time, as in distributed systems.
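The silent-overwrite outcome, and one way to at least report the clash in terms the user understands, can be sketched as follows. The `Doc` class, the file names and both loader methods are hypothetical; the sketch assumes the loader knows the file name each document came from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

class DuplicateUuidDemo {
    // Hypothetical loaded document: the embedded UUID plus the file it came from.
    static final class Doc {
        final UUID id;
        final String fileName;

        Doc(UUID id, String fileName) {
            this.id = id;
            this.fileName = fileName;
        }
    }

    // Naive loader: a later document silently replaces an earlier one with the same UUID.
    static Map<UUID, Doc> loadNaively(List<Doc> docs) {
        Map<UUID, Doc> store = new HashMap<>();
        for (Doc d : docs) {
            store.put(d.id, d);
        }
        return store;
    }

    // Loader that keeps the first document and reports clashes using file names,
    // which the end user can recognize, rather than the opaque UUID alone.
    static List<String> findClashes(List<Doc> docs) {
        Map<UUID, Doc> store = new HashMap<>();
        List<String> clashes = new ArrayList<>();
        for (Doc d : docs) {
            Doc existing = store.putIfAbsent(d.id, d);
            if (existing != null) {
                clashes.add(existing.fileName + " and " + d.fileName + " share identifier " + d.id);
            }
        }
        return clashes;
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        List<Doc> docs = Arrays.asList(new Doc(id, "rules.doc"), new Doc(id, "rules-copy.doc"));
        System.out.println(loadNaively(docs).size()); // 1: one document was lost
        System.out.println(findClashes(docs).size()); // 1: the clash is reported instead
    }
}
```

Reporting the clash does not resolve it, but it moves the failure from silent data loss to a message that names both storage-level identities.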
I hope this entry has helped you understand how those pesky identity bugs keep cropping up in your products and code. I Am Not a Philosopher (IANAP!), so if I have piqued your interest I encourage you to look at some of the references below for far more detail.
What identity-related bug did you work around or fix this week?