Simplify store: replace pair mappings with explicit groups #11

Closed
opened 2026-05-17 06:20:08 +00:00 by victor · 0 comments
Owner

Problem

The current store models duplicate relationships as a flat list of track-to-track pairs (a graph of edges). Groups are only ever reconstructed at query time by computing the transitive closure of those edges. This produces several complexities:

  • group_undecided does graph traversal in application code to reassemble groups from pairs.
  • The "reopen" invariant is maintained by a consistency SQL trigger that fires when a new pair connects to an already-decided group — subtle and hard to reason about.
  • Queries like decided_groups_with_rating_mismatch join across multiple tables and re-derive group membership.
  • The excluded flag on pairs (from split_track) is a workaround for pairs that are edges in the graph but should not be treated as same-group.

Proposed change

Replace the pairs/decisions tables with an explicit groups model:

CREATE TABLE groups (
    id          INTEGER PRIMARY KEY,
    canonical_id TEXT    -- NULL = undecided
);

CREATE TABLE group_members (
    group_id    INTEGER NOT NULL REFERENCES groups(id) ON DELETE CASCADE,
    track_id    TEXT    NOT NULL,
    match_method TEXT,          -- how this track was linked (fuzzy/mbid/manual)
    PRIMARY KEY (group_id, track_id)
);

Key behaviours that change:

  • find: instead of upserting pairs, upsert group members directly. Two tracks that match go into the same group (merge if they were in separate groups).
  • evaluate/review: set canonical_id on the group.
  • sync: upserting a track that already belongs to a group leaves the group intact; removing a track that is the canonical reopens the group (sets canonical_id = NULL).
  • split_track: remove a member from a group (and optionally form a new singleton group).
  • export: read group_members WHERE group_id IN (SELECT id FROM groups WHERE canonical_id IS NOT NULL) AND track_id != canonical_id.

Queries become direct SELECTs rather than in-memory graph traversal.

Migration

Since the user's collection is already marked, a SQLite migration must be provided:

  1. Read all existing pairs (with excluded and decided flags).
  2. Reconstruct groups via transitive closure (same logic as current group_undecided).
  3. For each group, determine canonical from the decisions table.
  4. Insert into groups + group_members.
  5. Drop old tables.

Tests

  • Integration test that seeds a realistic mix of decided/undecided/excluded pairs into the old schema, runs the migration, and then asserts:
    • decided_duplicate_ids() returns the same IDs as before.
    • undecided_pairs() (or equivalent group query) returns the same groups as before.
    • decided_groups_with_rating_mismatch() returns the same result.
  • Extend the existing Navidrome integration tests: run a full sync → find → evaluate cycle, snapshot the store state, run the migration, assert the snapshot is identical.
## Problem The current store models duplicate relationships as a flat list of track-to-track **pairs** (a graph of edges). Groups are only ever reconstructed at query time by computing the transitive closure of those edges. This produces several complexities: - `group_undecided` does graph traversal in application code to reassemble groups from pairs. - The \"reopen\" invariant is maintained by a consistency SQL trigger that fires when a new pair connects to an already-decided group — subtle and hard to reason about. - Queries like `decided_groups_with_rating_mismatch` join across multiple tables and re-derive group membership. - The `excluded` flag on pairs (from `split_track`) is a workaround for pairs that are edges in the graph but should not be treated as same-group. ## Proposed change Replace the pairs/decisions tables with an explicit **groups** model: ```sql CREATE TABLE groups ( id INTEGER PRIMARY KEY, canonical_id TEXT -- NULL = undecided ); CREATE TABLE group_members ( group_id INTEGER NOT NULL REFERENCES groups(id) ON DELETE CASCADE, track_id TEXT NOT NULL, match_method TEXT, -- how this track was linked (fuzzy/mbid/manual) PRIMARY KEY (group_id, track_id) ); ``` Key behaviours that change: - **`find`**: instead of upserting pairs, upsert group members directly. Two tracks that match go into the same group (merge if they were in separate groups). - **`evaluate`/`review`**: set `canonical_id` on the group. - **`sync`**: upserting a track that already belongs to a group leaves the group intact; removing a track that is the canonical reopens the group (sets `canonical_id = NULL`). - **`split_track`**: remove a member from a group (and optionally form a new singleton group). - **`export`**: read `group_members WHERE group_id IN (SELECT id FROM groups WHERE canonical_id IS NOT NULL) AND track_id != canonical_id`. Queries become direct SELECTs rather than in-memory graph traversal. ## Migration Since the user's collection is already marked, a **SQLite migration** must be provided: 1. Read all existing pairs (with `excluded` and `decided` flags). 2. Reconstruct groups via transitive closure (same logic as current `group_undecided`). 3. For each group, determine canonical from the decisions table. 4. Insert into `groups` + `group_members`. 5. Drop old tables. ## Tests - Integration test that seeds a realistic mix of decided/undecided/excluded pairs into the **old schema**, runs the migration, and then asserts: - `decided_duplicate_ids()` returns the same IDs as before. - `undecided_pairs()` (or equivalent group query) returns the same groups as before. - `decided_groups_with_rating_mismatch()` returns the same result. - Extend the existing Navidrome integration tests: run a full `sync → find → evaluate` cycle, snapshot the store state, run the migration, assert the snapshot is identical.
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
victor/subsonic_duplicates#11
No description provided.