Archive for December, 2020

Pickling, uneval, unapply

I see that I am referenced on this wiki page without a link. The purpose of this post is to explain what I remember of the reason for this reference. Chris Webber prompted me to write this up during some recent conversations about object capabilities. I came up with the ‘unapply‘ idea while I was doing some interpreter work with Chris Fry around 2002, and talked to Mark Miller about it a bit later.

Most programming language implementations use very different representations of data objects when they occur as text in a file as compared to when they occur as patterns of bits and pointers in a computer’s memory. The names (or identifiers) in the program serve a function similar to the pointers in memory, but the pointers are followed to their target much more quickly by the computer hardware.

The conversion from in-memory form to text is usually called “serialization” these days and the inverse “deserialization”. Some serialization/deserialization services work with human-readable text (e.g. as in Lisp), while others go to a harder-to-read “binary” form. The latter kind was called “pickling” in PARC’s Cedar, if I remember correctly [but see addendum below], and later was called “serialization” in Java, a term that confused me since I always thought serialization had to do with concurrency.

Many languages have features superficially similar to unparse/parse or serialize/deserialize but not invertible. The “print methods” of many Lisp systems, Python, and so on, and Java’s toString method, give neither assistance nor enforcement for invertibility since the code for the print method can emit whatever characters it wants. In Python the question never even comes up because there is no practical parsing utility in the first place.

In the case of Lisp 1.5, PRINT is an inverse of READ (at least with respect to EQUAL; structure sharing is not detected), and subsequent Lisp dialects have preserved PRINT/READ invertibility for certain data types such as symbols and lists.

The conversion from file to in-memory form is accomplished by parsers similar to Lisp READ. Of course many data objects are born in memory and are not the result of parsing, and these are ones we usually care about because we often want to write them out (pickle them) and then read back a copy (unpickle).

OK. There’s a lot to be said about pickling/unpickling and I don’t have much to contribute. I just want to mention one detail.

The central design component is the interface between the generic pickling structure traverser, and the particular data types that might need special handling. There are many reasons one might want custom pickling and unpickling, including special considerations around the “boundaries” of objects (which links to follow), redundant structure that needn’t be kept such as indexes, security considerations, and so on, so there is a lot of motivation to be able to provide object- or type-specific methods for pickling and unpickling. Because of this, some standard interface is needed.

The approach I took was to have a pickling operation that generates an expression in a data construction language (actually a subset of the programming language). Unpickling is then just evaluating the expression that’s the result of pickling. This is not arbitrary evaluation like the ‘#.‘ feature of Common Lisp; the security of unpickling is assured using object capability confinement techniques (see E, W7, etc.).

Now it might seem tempting in this framework to say that the pickler/type interface is an “uneval” method that given an object returns an expression. That is, the type-specific customization method does the pickling directly. But most objects have other objects as parts or links. A custom uneval method would need to have the ability to uneval its parts, and in the interval after receiving the unevaled expression for the part and placing that expression into the larger expression for the whole, the uneval method would have the opportunity to inspect the pickled version of the part, which feels wrong in many different ways (it is not compositional, not secure, not modular, etc.). In addition, if the custom method composes expressions, it has to know something of the syntax of the pickling language, and has many opportunities to make mistakes.

An alternative presents itself that isolates the custom method from any knowledge of syntax or of the representations of its parts. This is for the interface to the custom method be unapply rather than uneval. Unapply is an inverse of apply: it takes an object and returns a procedure and a set of arguments. The procedure and arguments have the property that the procedure applied to the arguments is the object (or a copy or clone of the object).

The pickling harness can use unapply to reduce the problem it’s presented with to that of pickling a set of simpler objects. It can use whatever syntax it likes (coordinated with the unpickler) to express the application of the procedure to the arguments.

An added benefit is that the pickling harness can detect shared structure: it can look at the sub-objects delivered by unapply and determine which ones are shared among multiple parts of the overall object-part-graph being pickled, and generate expressions using constructs similar to let and letrec that re-create the shared structure on unpickling.

The base cases for pickling would be scalars (numbers and so on) together with objects that are required or known to have incarnations in the system doing the unpickling, such as the constructors for the various types. For proper capability discipline, any call to the unpickler has to receive an environment to be used to map the names or tokens designating scalars, such as constructors, to the scalars themselves, which if they’re functions might get invoked to help build up the in-memory representation.

I don’t have a lot more to say about this – I think this is the story I told Mark Miller way back when, that led him to acknowledge me on the “safe serialization” wiki page.

Addendum 2020-12-11: I have no evidence that Cedar used the word “pickle” or had such a facility, so don’t quote me on that. I just checked one of the Cedar tech reports on line and didn’t find it. Modula-2+ had pickles by 1986, and Modula-2+ was created by the same people who created Cedar, so attributing them to Cedar is at least plausible. Certainly the idea was kicking around at Yale (Nat Mishkin, Kai Li) and Xerox (Bruce Nelson’s RPC work) in the early 1980s and probably earlier elsewhere (maybe MacLisp FASDUMP / FASLOAD ??) – it’s not as if it would be difficult to reinvent. (credit due to Paul McJones, John Ellis, and Nat Mishkin for some of this information, thanks)

Categories: Uncategorized