Helical in Rust: ORM time
It's been almost a year since I started on my Helical rewrite in Rust (with a sizable dose of help and motivation from John!). I wrote the prototype in Python, and while Gwen has been a good sport about alerting me to usability issues, the truth is that updating the core code is getting unruly. I've also been re-doing some of the program analysis stuff so that it's cleaner, less ad hoc, and generally more amenable to an interesting paper about the program semantics. However, I've let myself get distracted over the past year by other things, so I'm going to try to be more disciplined about getting the Rust rewrite done. In service of that, I've decided to blog my way through the process, since like a lot of people I haaaaaaaate context switching and having to pick up where I left off.
These posts are going to be a little stream-of-consciousness-y...
ORM is missing
Ah yes, as I look at the code for HyPL, I'm reminded of the fact that we didn't implement the ORM part of the port. There was something John didn't like about Diesel, but having just pinged him about it, he says he doesn't remember.
Okay, so let's back up — you may be wondering: why does HyPL have an ORM in the first place (the Python implementation uses Pony)? Here is a disorganized list of reasons:
- Some day (i.e., after we write an associated ExPL program and compile and execute it), observations of the HyPL variables should be organized and stored somewhere, so a database makes sense.
- Causality/Statistics folks have long had some connections with the databases community (IIRC there's a Rubin paper I read where I was like, this guy is secretly a databases person).
- There's a fairly direct mapping between grammar productions in HyPL and what we might want to put into a database, so why not represent the AST in an object-oriented way?
- The database design schema that naturally falls out of this approach looks a lot like Spivak's categorical normal form. That's got to be interesting, right?
The biggest issue I had in the Python implementation was the amount of generated code and how when something went wrong, debugging was a nightmare. At least that's what I vaguely recall — I wrote that part of the code in like 2023 or 2024, so it's been a while since I was dealing with these issues.
Design pains
Okay, so reading through the Diesel documentation, I think I know why John was skeptical. Diesel requires you to commit to a schema. When writing in Python with the Pony ORM, I could just have a class that inherits from the database entity and BAM, I have a table. It looks like in Diesel I either need to define a migration (SQL data definitions) or use a Rust macro. That is, in Pony we only model the data once (i.e., as a Python object that inherits from the Pony Entity object), whereas in Diesel we need to define the struct and the table we want to link separately, and then link them. Not a dealbreaker, but the repeated work is a bummer.
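For concreteness, here's roughly what that double bookkeeping looks like in Diesel (a sketch only; the table and column names are made up for illustration, not Helical's real schema):

```rust
use diesel::prelude::*;

// First, the table definition (normally generated into schema.rs by a
// migration; names here are illustrative):
diesel::table! {
    variables (id) {
        id -> Integer,
        name -> Text,
    }
}

// Second, the Rust struct we actually program against, linked to the table:
#[derive(Queryable, Selectable)]
#[diesel(table_name = variables)]
struct Variable {
    id: i32,
    name: String,
}
```

Two definitions of the same shape, kept in sync by hand (well, by `cargo check`).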
Refactoring woes
Now I'm set up to start annotating structs! Unfortunately, not all of the entities I want to map to tables are in a form where that's possible. For example, John ported the AST code from Python to Rust and leveraged the newtype pattern via a macro. Unfortunately, we can't attach the derive annotations to a macro invocation. Maybe there's a way to do some macro magic and like...specify that this str_newtype! macro should be expanded first, so that the derive macro knows what to bind to? This is now way beyond my Rust skills, so for now I'm going to refactor to use the more verbose/non-macro version...and that works! However, it's ugly and cumbersome:
```rust
str_newtype!(Variable)
```
becomes
```rust
#[derive(Queryable, Selectable)]
#[diesel(table_name = crate::schema::variables)]
#[diesel(check_for_backend(diesel::sqlite::Sqlite))]
struct Variable(
    #[diesel(column_name = "name")]
    String
);
```
Let's now try porting some of these annotations into John's macro...
It works!
To be fair, it took me two tries: on the first try, I modified the expansion to match three arguments so we could now call:
```rust
str_newtype!(Variable, crate::schema::variables, "name")
```
...but it turns out that you can't pass in fully qualified names as identifiers. Since all of the tables live in the same crate, I modified the macro so I could pass in:
```rust
str_newtype!(Variable, variables, "name")
```
and that got this segment of code to pass cargo check (and a bunch of others to fail!). I ended up adding the single-argument version back to the macro, since we use str_newtype! in ExPL, which should not be adding to the database. That allowed the existing uses of str_newtype! to just continue as before.
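Since the final macro isn't shown here, a toy sketch of that two-arm shape (my guess at the structure, not John's actual macro):

```rust
// Toy sketch of a str_newtype!-style macro with two arms (not the real
// macro): the one-argument form stays available for ExPL, while the
// three-argument form is where the real macro would also emit the
// #[derive(...)] and #[diesel(...)] annotations for $table/$column.
macro_rules! str_newtype {
    ($name:ident) => {
        pub struct $name(pub String);
    };
    ($name:ident, $table:ident, $column:literal) => {
        pub struct $name(pub String);
    };
}

str_newtype!(Variable);                     // ExPL-style, no database
str_newtype!(NewProgram, programs, "name"); // HyPL-style, database-backed

fn main() {
    assert_eq!(Variable("x".to_string()).0, "x");
    assert_eq!(NewProgram("demo".to_string()).0, "demo");
}
```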
Lesson learned: you can't pass qualified names as macro arguments captured as identifiers — the `ident` fragment in `macro_rules!` doesn't match `::`. (As I understand it, there is a `path` fragment specifier that does, but I didn't go down that road.)
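To make the lesson concrete, a minimal illustration of the fragment-specifier behavior (hypothetical macros, not from Helical):

```rust
// `ident` matches a single identifier; `path` matches qualified names.
macro_rules! takes_ident {
    ($x:ident) => { stringify!($x) };
}
macro_rules! takes_path {
    ($x:path) => { stringify!($x) };
}

fn main() {
    assert_eq!(takes_ident!(variables), "variables");
    // takes_ident!(crate::schema::variables) would fail to match, but a
    // `path` fragment handles the `::` just fine (the path only needs to
    // parse, not resolve, since we only stringify it here):
    assert!(takes_path!(crate::schema::variables).contains("variables"));
}
```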
How do we actually add stuff to the database?
So one of the nice things about Pony was that it managed database queries for me — I could mostly just code in regular Python and everything was bundled up and executed without me having a care in the world. I even wrote the parsing/instantiation code such that if the entity already existed in the database, we'd return that; otherwise, we would create a new entry and return that reference. This kept the database a nice, manageable size.
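That get-or-create behavior is easy to state in miniature; here's a sketch with an in-memory stand-in for the database (all names hypothetical):

```rust
use std::collections::HashMap;

// Sketch of the get-or-create pattern Pony gave for free: look the entity
// up first, insert only if missing, and hand back a stable id either way.
struct Db {
    next_id: u32,
    variables: HashMap<String, u32>, // name -> primary key
}

impl Db {
    fn get_or_create_variable(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.variables.get(name) {
            return id; // already present: reuse the existing row
        }
        let id = self.next_id;
        self.next_id += 1;
        self.variables.insert(name.to_string(), id);
        id
    }
}

fn main() {
    let mut db = Db { next_id: 1, variables: HashMap::new() };
    let a = db.get_or_create_variable("x");
    let b = db.get_or_create_variable("x"); // no duplicate row created
    assert_eq!(a, b);
    assert_eq!(db.variables.len(), 1);
}
```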
Now that I'm coding against Rust's type system, I have to be a bit more intentional about when and how I interact with the database. At present I have three options for how to manage this:
- Create a function that will open a global database connection that all objects refer to (how I did it in Python).
- Thread the database connection through all instantiations, creating and dropping as needed.
- Bundle all database reads and writes into temporally colocated transactions (i.e., iterate over the AST in a function).
Option (2) would remain faithful to the spirit of my Python approach, where I wanted to ensure the uniqueness of entities and fail fast. After working with Gwen, I'm not so sure that this should be a design goal anymore. You may wonder: why was this a design goal in the first place? Well, my dream was that there would be some global or at least greater-than-this-experiment-scoped database that treats your HyPL programs almost like an ontology. I envisioned it being a way to enhance discoverability — e.g., if I'm a new student in a lab, extending someone else's work, and I write out my experiment in Helical, I can search for past work on the basis of these identifiers or relations (hypotheses).
In retrospect, my approach was a premature optimization. Practically speaking, Gwen and I (or Gwen or I? I can't remember who first found the problem) ran into the problem that if we tried to run Helical against an existing database, we'd get errors every time we changed a variable's type. This doesn't really lend itself to iteration.
As a result of these past experiences, I don't think (2) makes sense anymore. Plus, if I ever get my Helical web server up, I'll probably want to be more judicious about database connections!
This leaves me with (3). Time to write a function that traverses the AST!
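Before wiring up Diesel, the shape of option (3) can be sketched with a stand-in connection type (everything below is hypothetical; the real version would do its writes inside a single transaction):

```rust
// Option (3) in miniature: one entry point walks the AST and performs all
// writes together. `Db` stands in for a real connection/transaction handle.
struct Db { rows: Vec<String> }

struct Statement { text: String }
struct Program { name: String, statements: Vec<Statement> }

impl Program {
    fn write_to_database(&self, db: &mut Db) {
        // In the real version, everything below would be one transaction.
        db.rows.push(format!("program:{}", self.name));
        for stmt in &self.statements {
            // Each statement row would carry the program's primary key
            // as a foreign key.
            db.rows.push(format!("statement({}):{}", self.name, stmt.text));
        }
    }
}

fn main() {
    let program = Program {
        name: "demo".to_string(),
        statements: vec![Statement { text: "assign x".to_string() }],
    };
    let mut db = Db { rows: Vec::new() };
    program.write_to_database(&mut db);
    assert_eq!(db.rows.len(), 2);
}
```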
Competing desires
So the first snag I hit arises from two incompatible desires:
- I want, e.g., a `Program` struct to be effectively just a wrapper for a collection of `Statement`s, with an optional name field.
- I want the `Program` table to be just the program id and associated name; we will find associated statements by joining on the program table's primary key (an auto-generated id).
As I understand it, it's kind of an ORM's "thing" to require the object fields to be contravariant with the associated table/entity's columns/attributes (i.e., they may have at most the full set of attributes for the table). I assume that adhering to the constraints of the database is something the ORM manages as well, thus preventing us from doing things like manually updating an autoincrement column or failing to include a non-default, non-null attribute.
Changing the full code base to adhere to the database schema is a tedious refactor.
My first attempt at something less tedious was to treat the original Program struct as a wrapper for the database object. That is, where we had:
```rust
pub struct Program {
    pub statements: Vec<Statement>,
}
```
...we now have:
```rust
#[derive(Eq, Debug, PartialEq, Insertable)]
#[diesel(table_name = crate::schema::programs)]
#[diesel(check_for_backend(diesel::sqlite::Sqlite))]
struct ProgramTable {
    #[diesel(column_name = "name")]
    pub name: String,
}

pub struct Program {
    pub program_table: ProgramTable,
    pub statements: Vec<Statement>,
}
```
Implementing the Default trait on Program and updating places where we create Program objects should be easy peasy. While these updates passed `cargo check`, when I tried to move on to the next step, I ran into new problems.
AST walk & DB updates
Since many of my structs already have methods implemented on them, I decided to define a function write_to_database in each relevant impl block and do an OOP-y caller thing (visitor pattern or whatever?), rather than writing a top-level function with a bunch of match statements.
My plan was to:
- Have `Program::write_to_database(p)` create a new record in the database table and return that record.
- Extract the identifier from that record.
- Loop through the statements, which will require the program's primary key as their foreign key.
However, I quickly ran into problems! Here is the code I wrote:
```rust
pub fn write_to_database(&self, url: &String) {
    let conn = &mut SqliteConnection::establish(url).expect("Database connection error");
    diesel::insert_into(crate::schema::programs::table)
        .values(&self.program_table)
        .returning(ProgramTable::as_returning())
        .get_result(conn)
        .expect("bad times!!!");
}
```
The easy-to-fix part was that I needed to derive Selectable on ProgramTable in order to use as_returning. The more annoying issue was the next error:
```
the trait `load_dsl::private::CompatibleType<_, _>` is not implemented for `diesel::expression::select_by::SelectBy<ProgramTable, _>`
...
= note: this is a mismatch between what your query returns and what your type expects the query to return
= note: the fields in your struct need to match the fields returned by your query in count, order and type
```
Reading these error messages and looking back at the Diesel documentation, I can see that the authors define two separate structs: one that maps exactly to the database table and one that represents insertable data.
The first quick fix is having ProgramTable include the field it was missing relative to the database table: the id field. Since I'm only creating the in-memory object in order to grab the auto-incremented id, I'm going to revert to the original Program definition and create a NewProgram struct in the style of the Diesel documentation:
```rust
#[derive(Selectable, Queryable)]
#[diesel(table_name = crate::schema::programs)]
#[diesel(check_for_backend(diesel::sqlite::Sqlite))]
#[allow(unused)]
struct ProgramTable {
    #[diesel(column_name = "id")]
    pub id: i32,
    #[diesel(column_name = "name")]
    pub name: String,
}

str_newtype!(NewProgram, programs, "name");

impl FromStr for NewProgram {
    type Err = anyhow::Error;

    fn from_str(value: &str) -> Result<Self, Self::Err> {
        Ok(NewProgram(value.to_string()))
    }
}

#[derive(Eq, Debug, PartialEq)]
pub struct Program {
    pub statements: Vec<Statement>,
    pub name: String,
}

impl Default for Program {
    fn default() -> Self {
        Program {
            name: format!(
                "hypl_{}",
                SystemTime::now()
                    .duration_since(UNIX_EPOCH)
                    .expect("should work")
                    .as_millis()
            ),
            statements: Vec::new(),
        }
    }
}

impl Program {
    pub fn write_to_database(&self, url: &String) {
        let conn = &mut SqliteConnection::establish(url).expect("Database connection error");
        diesel::insert_into(crate::schema::programs::table)
            .values(NewProgram(self.name.clone()))
            .returning(ProgramTable::as_returning())
            .get_result(conn)
            // TODO: a better message, e.g. format!("Error saving program {}", self.name)
            .expect("Error saving program");
    }
}
```
After running this code with a test program, I checked out the database through the sqlite3 CLI:
```
sqlite> select * from programs
   ...> ;
1|hypl_1769794974148
```
Awesome!
Concluding thoughts
Obviously I still need to implement the rest of the database; the above is just the tracer bullet version. Right now it feels heavier, since I need to define two additional structs to handle database actions. I'll probably move the database-specific structs into the schema file.
All that said, in retrospect, this process wasn't too bad!