DSL Usability Research

In my previous post, I asserted:

...learning a new formal language can itself contribute to the difficulty of encoding an experiment.

This statement was based on assumptions, intuitions, and folk wisdom. I started digging into the DSL usability research to see if I could find explicit support for this statement. This blog post is about what I found.

Suppose I have a DSL for a task that was previously manual. I want to conduct a user study. I decide to use some previously validated instrument to measure differences in perceived difficulty of encoding/performing a task ($D$), and vary the method used to code the task ($M=\text{DSL}$ vs. $M=\text{manual}$). Suppose there is no variability in task difficulty for now: the specific task is fixed for the duration of the study, i.e., is controlled.

Ideally, I'd like to just measure the effect of $M$ on $D$; we are going to abuse plate notation¹ a bit and say that the following graph denotes "method has an effect on perceived difficulty of performing a specific task for the population of experts in the domain of that task:"

[Plate diagram: Method (M) → Perceived Difficulty of Task (D), with D inside the "domain experts" plate.]

The first obvious problem is that $D$ is a mixture of some "inherent" difference due to $M$ and the novelty of the method/context/environment/situation ($N$). We have not included $N$ in our model; let's do so now:

[Plate diagram: Method (M) and Novelty (N) → Perceived Difficulty of Task (D), with D inside the "domain experts" plate.]

Conducting a naïve study results in $(D \mid M=\text{manual}, N = 0)$ vs. $(D \mid M=\text{DSL}, N \gg 0)$. This is why we have the study participants perform a training task first: it's an attempt to lower $N$ as much as possible, i.e., to control for novelty.
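To see how novelty can swamp the effect we actually care about, here is a minimal simulation sketch of the naïve study versus one that controls for novelty with a training task. All effect sizes here are made up for illustration; the only point is the structure of the comparison.

```python
import random

random.seed(0)

# Hypothetical model: perceived difficulty D is an inherent method
# effect plus a novelty penalty N plus noise. All numbers are invented.
METHOD_EFFECT = {"manual": 5.0, "DSL": 3.0}  # assume the DSL genuinely helps

def perceived_difficulty(method, novelty):
    return METHOD_EFFECT[method] + novelty + random.gauss(0, 0.5)

mean = lambda xs: sum(xs) / len(xs)

# Naive study: participants already know the manual workflow (N = 0)
# but have never seen the DSL (N >> 0).
manual = mean([perceived_difficulty("manual", novelty=0.0) for _ in range(1000)])
dsl_naive = mean([perceived_difficulty("DSL", novelty=4.0) for _ in range(1000)])

# After a training task that drives N toward zero:
dsl_trained = mean([perceived_difficulty("DSL", novelty=0.5) for _ in range(1000)])

# In the naive comparison the DSL looks *harder* than the manual
# method, even though its inherent effect is lower.
print(f"manual: {manual:.2f}, DSL (naive): {dsl_naive:.2f}, "
      f"DSL (trained): {dsl_trained:.2f}")
```

Under these assumed numbers, the naïve comparison reverses the sign of the conclusion: the DSL appears more difficult purely because of $N$.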

Training tasks are obviously not unique to DSL research; however, there are other tactics for reducing novelty that are unique to programming systems. For example, it seems obvious that IDE features like syntax highlighting and autocomplete that are part of a "normal" programming environment would reduce the value of $N$; so would integrating the DSL into the target users' existing toolchain/workflow.

If we allow the task to vary, then our model needs to include another potential cause for $D$:

[Plate diagram: Method (M), Task Complexity (C), and Novelty (N) → Perceived Difficulty of Task (D), with D inside the "domain experts" plate.]

The details of how we represent $C$ matter: whatever scale we use, it contains a baked-in assumption that for any two tasks $t_1$ and $t_2$ where $t_1 \neq t_2$ but $C(t_1) = C(t_2)$, we can treat $t_1 \equiv t_2$. This is a big assumption! What if there are qualitative differences between tasks not captured by the complexity metric that influence $D$? In that case, we may want to use a different variable to capture $C$, perhaps a binary feature vector, or maybe we want to split $C$ into a collection of distinct variables. Maybe task complexity isn't objective but subjective, in which case we would want to include $C$ in our domain experts plate. Maybe we want to forgo $C$ altogether and instead treat tasks as a population we need to sample over, e.g.,

[Plate diagram: Method (M) and Novelty (N) → Perceived Difficulty of Task (D), with D nested inside both the "domain experts" and "tasks" plates.]
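One way to read this last design is as a hierarchical (mixed-effects) structure: each trial draws a task from a pool, and the sampled task contributes its own random effect to $D$. A minimal sketch, with an invented task pool and made-up effect sizes:

```python
import random

random.seed(1)

# Hypothetical sketch of "tasks as a population": each task in the pool
# contributes its own random effect to perceived difficulty D.
# Task names and all effect sizes are invented for illustration.
TASK_EFFECT = {t: random.gauss(0, 1.0) for t in ["t1", "t2", "t3", "t4", "t5"]}
METHOD_EFFECT = {"manual": 5.0, "DSL": 3.0}

def trial(method):
    task = random.choice(list(TASK_EFFECT))  # sample a task from the pool
    return METHOD_EFFECT[method] + TASK_EFFECT[task] + random.gauss(0, 0.5)

mean = lambda xs: sum(xs) / len(xs)
d_manual = mean([trial("manual") for _ in range(2000)])
d_dsl = mean([trial("DSL") for _ in range(2000)])

# Because both conditions sample over the same task population, the
# task effects average out and the method effect survives.
print(f"manual: {d_manual:.2f}, DSL: {d_dsl:.2f}")
```

In a real analysis one would presumably fit a mixed-effects model with task as a random effect rather than relying on raw averaging, but the sampling structure is the same.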

I have plenty more to say and would love to iterate on the design of this hypothetical user study, but I am going to stop here because the above diagram feels like something that should already be established in the literature. Like a lot of folk wisdom, it's suggested, implied, assumed, and (I think!) generally accepted, but so far I have not found any explicit validation of the above schema. That doesn't mean it isn't out there; it means that (a) there isn't a single canonical paper accepted by the community as evidence and (b) where the evidence does exist, it's embedded in work that primarily addresses some other research question.

So, for now, I am putting together a DSL usability study reading list of works that I think touch on this fundamental problem in meaningful ways. I consider Profiling Programming Language Learning and PLIERS: A Process that Integrates User-Centered Methods into Programming Language Design seed papers and have gotten recommendations from Andrew McNutt, Shriram Krishnamurthi, and Lindsey Kuper. Please feel free to add to this (or use it yourself!). I look forward to writing a follow-up post on what I find. :)

Update 14 April 2026: I came across an interesting quotation in Peter Naur's 1985 essay Programming as Theory Building that I thought was relevant to this post (emphasis added):

...a methodically satisfactory study of the efficacy of programming methods so far never seems to have been made. Such a study would have to employ the well established technique of controlled experiments (cf. [Brooks, 1980] or [Moher and Schneider, 1982]). The lack of such studies is explainable partly by the high cost that would undoubtedly be incurred in such investigations if the results were to be significant, partly by the problems of establishing in an operational fashion the concepts underlying what is called methods in the field of program development. Most published reports on such methods merely describe and recommend certain techniques and procedures, without establishing their usefulness or efficacy in any systematic way. An elaborate study of five different methods by C. Floyd and several co-workers [Floyd, 1984] concludes that the notion of methods as systems of rules that in an arbitrary context and mechanically will lead to good solutions is an illusion. This conclusion is entirely compatible with the Theory Building View of programming. Indeed, on this view the quality of the theory built by the programmer will depend to a large extent on the programmer's familiarity with model solutions of typical programs, with techniques of description and verification, and with principles of structuring systems consisting of many parts in complicated interactions. Thus many of the items of concern of methods are relevant to theory building.

By "programming methods," Naur means "a set of work rules for programmers, telling what kind of things the programmers should do, in what order, which notations or languages to use, and what kinds of documents to produce at various stages." I don't think it would be terribly controversial to argue that PL design research — both general and domain-specific — addresses all of these tasks. For years now I've referred to PL design as "a method that should be taught alongside empirical and formal methods to computational scientists as part of their methods toolkit," which has been a bit more provocative than I would have thought!

Anyway, I'm highlighting this essay because it strongly asserts a negative result: the lack of rigorous study of the sorts of features that prompted me to make my original Mastodon/blog post!


  1. While the plate notation here looks similar to the output that Helical produces for HyPL code, the specific graphs are more precise than those that Helical can currently produce. For example, only $D$ is embedded in the domain experts plate. Helical's current implementation would place both $M$ and $D$ in this plate.