Helical ORM Reflections
Well, I've finished my Diesel ORM implementation and, naturally, it took longer than expected. In this post I'll highlight the challenges, some of the changes I made to the database, and some changes I'd still like to make.
First I want to say that it's been hard to stay disciplined about just focusing on the ORM. I'm glad I've been vigilant about using timers and recording what I've actually done because I've definitely wasted time on unnecessary refactors. By writing down what I've done every hour or so, my wasted time can now be measured in hours instead of days.
This blog post will basically be a laundry list of challenges and observations.
Challenge: Limited ability to construct self joins
The first major challenge in my port was one I'd already dealt with in the Python/Pony implementation: constructing an arbitrary number of self joins.
Example problem: I have a partial database schema that looks something like this:
```mermaid
erDiagram
    populations {
        string id PK
    }
    populationGroups {
        int id PK
    }
    hvocab {
        string id PK
    }
    hscopes {
        int id PK
        string htype FK
        string pop FK
        int populationGroup FK
    }
    hscopes ||--|{ hvocab : type
    hscopes ||--o| populationGroups : in
    hscopes ||--o{ populations : over
```
The basic idea here is that the `populations` and `hvocab` tables contain the vocabularies of all the named populations I've encountered so far and the types of hypothesis relations HyPL supports (respectively). The former is allowed to grow with each program the engine encounters; the latter is frozen unless the developer (me) changes it. The `populationGroups` table is just a counter for population group identifiers. Suppose we have a HyPL code fragment corresponding to the hypothesis/relation, "A causes B with respect to the population of programs run over various machine states:"
(programid, mstate) A -> B
The Helical engine should then do the following:

1. Create new entries for `programid` and `mstate` in the `populations` table, if they do not already exist:
   `INSERT OR IGNORE INTO populations(id) VALUES ("programid"), ("mstate")`
2. Check to see if there already exists a population group for this pair of populations:
   `SELECT DISTINCT t1.populationGroup FROM hscopes AS t1, hscopes AS t2 WHERE t1.pop = "programid" AND t2.pop = "mstate" AND t1.htype = "average" AND t2.htype = "average" AND t1.populationGroup = t2.populationGroup`
3. If a population group does not exist, create a new one:
   `INSERT INTO populationGroups VALUES (NULL)`
4. Let `$gid` correspond to the population group id for the pair and add our data to `hscopes`:
   `INSERT OR IGNORE INTO hscopes(htype, pop, populationGroup) VALUES ("average", "programid", $gid), ("average", "mstate", $gid)`
While steps 1, 3, and 4 are all possible with fairly straightforward Diesel queries, we run into issues with step 2. It is possible to express the exact instantiation of step 2 in Diesel, but we need to know the exact number of tables to alias statically. If we know we always have exactly two populations in every program, we can use the `alias!` macro to execute our self join:
```rust
alias!(crate::schema::hscopes as t1, crate::schema::hscopes as t2);
```
However, the HyPL syntax permits us to have an arbitrary number of populations, which means we need to support an arbitrary number of self-joins. This is simply not possible in Diesel (nor in any other ORM I've come across).
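To make the gap concrete, here is the shape of the step-2 query for \(n\) populations. Generating it requires runtime string construction, which is exactly what a statically checked query builder rules out. This is only a sketch (the function name and structure are mine, not Diesel's), and in real code the values should be bound parameters rather than interpolated strings:

```rust
/// Build the step-2 self-join query for an arbitrary number of
/// populations. Values are inlined here purely to keep the sketch
/// short; a real version would use placeholders and bind parameters.
fn population_group_query(htype: &str, pops: &[&str]) -> String {
    // One alias of `hscopes` per population: t1, t2, ...
    let aliases: Vec<String> = (1..=pops.len())
        .map(|i| format!("hscopes AS t{i}"))
        .collect();
    // Each alias is pinned to one population and the hypothesis type.
    let mut preds: Vec<String> = pops
        .iter()
        .enumerate()
        .map(|(i, p)| format!("t{n}.pop = \"{p}\" AND t{n}.htype = \"{htype}\"", n = i + 1))
        .collect();
    // Every alias must agree on the population group.
    for i in 2..=pops.len() {
        preds.push(format!("t1.populationGroup = t{i}.populationGroup"));
    }
    format!(
        "SELECT DISTINCT t1.populationGroup FROM {} WHERE {}",
        aliases.join(", "),
        preds.join(" AND ")
    )
}
```

For two populations this reproduces the step-2 query above; for three or more it simply grows more aliases and predicates.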
There are three approaches I considered to resolve this issue:
1. Perform one database call and load the entire `hscopes` table into memory.
2. Perform multiple database calls.
3. Leveraging the fact that I rarely expect to see more than two populations in a single hypothesis, branch on the number of populations and use the `alias!` macro to perform one database call. This means we would throw an error the first time we encounter more than two populations.
I excluded options 1 and 3 for the following reasons:
- I expect the `hscopes` table to eventually grow to be quite large, so the reduced calls to the database seem short-sighted.
- If I were to load all of the `hscopes` table into memory, I'd prefer to do so only once for the entirety of the execution of the program, rather than once for each call to store the `hscope` data. This would require a refactor to my approach, which would include a mutable global data structure. This seemed like unnecessary overhead for something I'm not even sure would be an improvement.
- Branching and using `alias!` would lead to repeated code and more opportunities for errors to creep in, especially as I extend the cases to cover larger tuples of populations.
My option 2 approach implemented the following algorithm:
1. Select all of the group ids in `hscopes` associated with the first population.
2. Then, using this set of group ids, select all entries in `hscopes` that have a `populationGroup` in the set of values extracted in step 1.
3. Chunk the returned data by group id.
4. For each chunk, create a hashset of populations and test to see if that set is equal to the set of populations we are looking for. If we find an equal set, return the associated group id.
The above approach issues exactly two calls to the database. In the worst case, we would hold the entirety of the `hscopes` table in memory, but that would only occur if the first population appeared in every population group. In that case, the size of the `hscopes` table is bounded by \(|\texttt{populations}|^{m}\), where \(m\) is the max tuple size and \(|\texttt{populations}|\) is the size of the `populations` table.1 That said, I expect this to be a degenerate scenario and for the memory requirements to be fairly small.
There are several other similar relations in the schema, so coming up with a solution to this problem was important!
Observation: Additional structs turned out to be unnecessary
One of the things I did not care for in my initial exploration of Diesel was how the documentation advocates building out several additional structs. With the database in Categorical Normal Form, it turned out that most tables ended up having very few attributes. Most of what I needed to return was a chain of primary keys. Therefore, I dropped most of the auxiliary structs I'd made at the outset and, in the end, kept only a handful for the tables involved in the self-join problem above.
Observation: Testing was easy
I won't go into this much, but I remember it taking some effort to set up the in-memory database for Pony with Pytest. I had no issues with Rust: I just created a new struct for the test database that each test function handled, where all of the cleanup was handled seamlessly on drop. I also feel like I always encounter some issue with test discoverability or running only a subset of tests in Pytest; I've never had any issues with Cargo.
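For a sense of the pattern, here is a stripped-down sketch with the Diesel connection and migrations elided; the `TestDb` name is my own stand-in, and a throwaway file plays the role of the database:

```rust
use std::{fs, path::PathBuf};

/// Each test constructs its own database handle; Drop guarantees
/// cleanup runs when the test function ends, even on panic.
struct TestDb {
    path: PathBuf,
}

impl TestDb {
    fn new(name: &str) -> Self {
        let path = std::env::temp_dir().join(name);
        // In the real version, this is where the connection is opened
        // and migrations are run; here we just create the file.
        fs::write(&path, b"").expect("failed to create test db");
        TestDb { path }
    }
}

impl Drop for TestDb {
    fn drop(&mut self) {
        // Remove the database file when the handle goes out of scope.
        let _ = fs::remove_file(&self.path);
    }
}
```

The appeal is that no test needs explicit teardown code: scope exit is the cleanup.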
Minor Challenge or Major Lift?: Typing constraints
For the most part, Rust's type system and Diesel's mapping made coding far easier than in Python (which is what I was looking for!). Most of the errors I encountered occurred when I forgot the underlying column was nullable. In fact, I found that the type system allowed me to debug the underlying schema far more easily than in Python/Pony.
My biggest annoyance was the lack of proper union types. However, I found that in every single case where I used a union type in Python, it was due to a design decision I'd since changed and told myself I'd propagate later. That is, the ability to annotate types as Unions with mypy actually created more subtle bugs that I only caught during the port.
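As an illustration of the difference, the sort of value I would have annotated as a `Union` in Python becomes an enum in Rust, and the exhaustive `match` is what surfaces the stale cases (a toy example, not code from Helical):

```rust
/// A toy stand-in for a value that was `Union[str, int]` in Python.
enum PopRef {
    Name(String),
    GroupId(i32),
}

fn describe(p: &PopRef) -> String {
    // The match must be exhaustive: a forgotten variant is a compile
    // error, which is exactly the class of bug mypy Unions let through.
    match p {
        PopRef::Name(n) => format!("population {n}"),
        PopRef::GroupId(g) => format!("group #{g}"),
    }
}
```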
Concluding remarks
Diesel ended up being quite easy to use, even if it was a little verbose. Since I had some common query patterns, I wrote some macros to help clean up the core logic. All in all, it was actually a fun refactoring experience!
1. My reasoning: assume that \(pop_1\) is in every population group. Then there is one entry for the singleton group (\(1 \leq |\texttt{populations}|^{0}\)), plus at most \(|\texttt{populations}|-1\) entries for 2-tuples (\(|\texttt{populations}|-1 \leq |\texttt{populations}|^{1}\)), etc.