Helical ORM Reflections
Well, I've finished my Diesel ORM implementation and, naturally, it took longer than expected. In this post I'll highlight the challenges, some of the changes I made to the database, and some changes I'd still like to make.
First I want to say that it's been hard to stay disciplined about just focusing on the ORM. I'm glad I've been vigilant about using timers and recording what I've actually done because I've definitely wasted time on unnecessary refactors. By writing down what I've done every hour or so, my wasted time can now be measured in hours instead of days.
This blog post will basically be a laundry list of challenges and observations.
Challenge: Limited ability to construct self joins
The first major challenge in my port was one I'd already dealt with in the Python/Pony implementation: constructing an arbitrary number of self joins.
Example problem: I have a partial database schema that looks something like this:
```mermaid
erDiagram
    populations {
        string id PK
    }
    populationGroups {
        int id PK
    }
    hvocab {
        string id PK
    }
    hscopes {
        int id PK
        string htype FK
        string pop FK
        int populationGroup FK
    }
    hscopes ||--|{ hvocab : type
    hscopes ||--o| populationGroups : in
    hscopes ||--o{ populations : over
```
The basic idea here is that the `populations` and `hvocab` tables contain the vocabularies of all the named populations I've encountered so far and the types of hypothesis relations HyPL supports (respectively). The former is allowed to grow with each program the engine encounters; the latter is frozen unless the developer (me) changes it. The `populationGroups` table is just a counter for population group identifiers. Suppose we have a HyPL code fragment corresponding to the hypothesis/relation, "A causes B with respect to the population of programs run over various machine states:"
(programid, mstate) A -> B
The Helical engine should then do the following:

1. Create new entries for `programid` and `mstate` in the `populations` table, if they do not already exist:
   `INSERT OR IGNORE INTO populations(id) VALUES ("programid"), ("mstate")`
2. Check to see if there already exists a population group for this pair of populations:
   `SELECT DISTINCT t1.populationGroup FROM hscopes AS t1, hscopes AS t2 WHERE t1.pop = "programid" AND t2.pop = "mstate" AND t1.htype = "average" AND t2.htype = "average" AND t1.populationGroup = t2.populationGroup`
3. If a population group does not exist, create a new one:
   `INSERT INTO populationGroups VALUES (NULL)`
4. Let `$gid` correspond to the population group id for the pair and add our data to `hscopes`:
   `INSERT OR IGNORE INTO hscopes(htype, pop, populationGroup) VALUES ("average", "programid", $gid), ("average", "mstate", $gid)`
While steps 1, 3, and 4 are all possible with fairly straightforward Diesel queries, we run into issues with step 2. It is possible to express the exact instantiation of step 2 in Diesel, but we need to know the exact number of tables to alias statically. If we know we always have exactly two populations in every program, we can use the `alias!` macro to execute our self join:
```rust
alias!(crate::schema::hscopes as t1, crate::schema::hscopes as t2);
```
However, the HyPL syntax permits us to have an arbitrary number of populations, which means we need to support an arbitrary number of self-joins. This is simply not possible in Diesel (nor in any other ORM I've come across).
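To make the gap concrete, here is the shape of the step-2 query for \(n\) populations. Generating it requires runtime string construction, which is exactly what a statically checked query builder rules out. This is only a sketch (the function name and structure are mine, not Diesel's), and in real code the values should be bound parameters rather than interpolated strings:

```rust
/// Build the step-2 self-join query for an arbitrary number of
/// populations. Values are inlined here purely to keep the sketch
/// short; a real version would use placeholders and bind parameters.
fn population_group_query(htype: &str, pops: &[&str]) -> String {
    // One alias of `hscopes` per population: t1, t2, ...
    let aliases: Vec<String> = (1..=pops.len())
        .map(|i| format!("hscopes AS t{i}"))
        .collect();
    // Each alias is pinned to one population and the hypothesis type.
    let mut preds: Vec<String> = pops
        .iter()
        .enumerate()
        .map(|(i, p)| format!("t{n}.pop = \"{p}\" AND t{n}.htype = \"{htype}\"", n = i + 1))
        .collect();
    // Every alias must agree on the population group.
    for i in 2..=pops.len() {
        preds.push(format!("t1.populationGroup = t{i}.populationGroup"));
    }
    format!(
        "SELECT DISTINCT t1.populationGroup FROM {} WHERE {}",
        aliases.join(", "),
        preds.join(" AND ")
    )
}
```

For two populations this reproduces the step-2 query above; for three or more it simply grows more aliases and predicates.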
There are three approaches I considered to resolve this issue:
1. Perform one database call and load the entire `hscopes` table into memory.
2. Perform multiple database calls.
3. Leveraging the fact that I rarely expect to see more than two populations in a single hypothesis, branch on the number of populations and use the `alias!` macro to perform one database call. This means we would throw an error the first time we encounter more than two populations.
I excluded options 1 and 3 for the following reasons:
- I expect the `hscopes` table to eventually grow to be quite large, so the reduced calls to the database seem short-sighted.
- If I were to load all of the `hscopes` table into memory, I'd prefer to do so only once for the entirety of the execution of the program, rather than once for each call to store the `hscope` data. This would require a refactor to my approach, which would include a mutable global data structure. This seemed like unnecessary overhead for something I'm not even sure would be an improvement.
- Branching and using `alias!` would lead to repeated code and more opportunities for errors to creep in, especially as I extend the cases to cover larger tuples of populations.
My option 2 approach implemented the following algorithm:
1. Select all of the group ids in `hscopes` associated with the first population.
2. Then, using this set of group ids, select all entries in `hscopes` that have a `populationGroup` in the set of values extracted in step 1.
3. Chunk the returned data by group id.
4. For each chunk, create a hashset of populations and test to see if that set is equal to the set of populations we are looking for. If we find an equal set, return the associated group id.
The above approach issues exactly two calls to the database. In the worst case, we would hold the entirety of the `hscopes` table in memory, but that would only occur if the first population appeared in every population group. In that case, the size of the `hscopes` table is bounded by \(|\texttt{populations}|^{m}\), where \(m\) is the max tuple size and \(|\texttt{populations}|\) is the size of the `populations` table.1 That said, I expect this to be a degenerate scenario and for the memory requirements to be fairly small.
There are several other similar relations in the schema, so coming up with a solution to this problem was important!
Observation: Additional structs turned out to be unnecessary
One of the things I did not care for in my initial exploration of Diesel was how the documentation advocates building out several additional structs. With the database in Categorical Normal Form, it turned out that most tables ended up having very few attributes. Most of what I needed to return was a chain of primary keys. Therefore, I dropped most of the auxiliary structs I'd made at the outset and, in the end, kept only a handful for the tables involved in the self-join problem above.
Observation: Testing was easy
I won't go into this much, but I remember it taking some effort to set up the in-memory database for Pony with Pytest. I had no issues with Rust: I just created a new struct for the test database that each test function handled, where all of the cleanup was handled seamlessly on drop. I also feel like I always encounter some issue with test discoverability or running only a subset of tests in Pytest; I've never had any issues with Cargo.
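For a sense of the pattern, here is a stripped-down sketch with the Diesel connection and migrations elided; the `TestDb` name is my own stand-in, and a throwaway file plays the role of the database:

```rust
use std::{fs, path::PathBuf};

/// Each test constructs its own database handle; Drop guarantees
/// cleanup runs when the test function ends, even on panic.
struct TestDb {
    path: PathBuf,
}

impl TestDb {
    fn new(name: &str) -> Self {
        let path = std::env::temp_dir().join(name);
        // In the real version, this is where the connection is opened
        // and migrations are run; here we just create the file.
        fs::write(&path, b"").expect("failed to create test db");
        TestDb { path }
    }
}

impl Drop for TestDb {
    fn drop(&mut self) {
        // Remove the database file when the handle goes out of scope.
        let _ = fs::remove_file(&self.path);
    }
}
```

The appeal is that no test needs explicit teardown code: scope exit is the cleanup.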
Minor Challenge or Major Lift?: Typing constraints
For the most part, Rust's type system and Diesel's mapping made coding far easier than in Python (which is what I was looking for!). Most of the errors I encountered occurred when I forgot the underlying column was nullable. In fact, I found that the type system allowed me to debug the underlying schema far more easily than in Python/Pony.
My biggest annoyance was the lack of proper union types. However, I found that in every single case where I used a union type in Python, it was due to a design decision I'd since changed and told myself I'd propagate later. That is, the ability to annotate types as Unions with mypy actually created more subtle bugs that I only caught during the port.
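As an illustration of the difference, the sort of value I would have annotated as a `Union` in Python becomes an enum in Rust, and the exhaustive `match` is what surfaces the stale cases (a toy example, not code from Helical):

```rust
/// A toy stand-in for a value that was `Union[str, int]` in Python.
enum PopRef {
    Name(String),
    GroupId(i32),
}

fn describe(p: &PopRef) -> String {
    // The match must be exhaustive: a forgotten variant is a compile
    // error, which is exactly the class of bug mypy Unions let through.
    match p {
        PopRef::Name(n) => format!("population {n}"),
        PopRef::GroupId(g) => format!("group #{g}"),
    }
}
```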
Concluding remarks
Diesel ended up being quite easy to use, even if it was a little verbose. Since I had some common query patterns, I wrote some macros to help clean up the core logic. All in all, it was actually a fun refactoring experience!
1. My reasoning: assume that \(pop_1\) is in every population group. Then there is one entry for the singleton group (\(1 \leq |\texttt{populations}|^{0}\)), plus at most \(|\texttt{populations}|-1\) entries for 2-tuples (\(|\texttt{populations}|-1 \leq |\texttt{populations}|^{1}\)), etc.