I've been toying with a very promising new engine design that makes it possible to use computation-heavy effects, while also eliminating row transition noise. I don't have time to develop this into a full engine at the moment, but I thought I'd share the idea in case anybody else wants to play with it.
The design revolves around using a crude scheduler to implement basic multi-threading. Within every iteration of the sound loop, the scheduler allots a fixed portion of the available CPU time to the main thread, which updates frequency counters and calculates the output value(s) for the next iteration. The rest is allotted to a variable pool of secondary threads, which run one after another, with one secondary thread running in between each pair of main thread runs. In other words, our sound loop consists of two parts. The first part is always the same and runs all the updates that are required on each sound loop iteration. The second part runs varying tasks that do not require fast updates, for example counting length, computing some effect, or reading in note data.
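To make the two-part loop concrete, here is a rough Python model of it. This is only a sketch: the task names, the 16-bit counter, and taking the output from the counter's top bit are assumptions for illustration, not part of the actual design.

```python
def run_sound_loop(secondary_threads, iterations, freq_step=0x4000):
    """Model the sound loop: one main thread run plus one secondary thread run per iteration."""
    counter = 0
    trace = []
    for i in range(iterations):
        # Main thread: identical work on every iteration.
        counter = (counter + freq_step) & 0xFFFF   # assumed 16-bit frequency counter
        output = counter >> 15                     # assumed: top bit is the output value
        # Secondary slot: the next task from the pool takes its turn.
        task = secondary_threads[i % len(secondary_threads)]
        trace.append((output, task))
    return trace


# Hypothetical task names, matching the kinds of work listed above.
trace = run_sound_loop(["count_length", "compute_fx", "read_note_data"], 4)
for output, task in trace:
    print(output, task)
```

The point of the model is that the main thread's work never changes from one iteration to the next, while the secondary slot can hold anything that fits into the remaining time.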
The scheduler is implemented by pointing the stack pointer to the music data, and then simply RETurning after the main thread, so each entry in the music data is the address of the next secondary thread to run, followed by any arguments it needs. To keep the music data size manageable, we need to implement sequence loops and subroutines as abstract operations running as secondary threads. Ideally we'd do nested subroutines, but that's a bit expensive to compute, and in some cases one could probably get away without nesting. No nesting basically mirrors the sequence/pattern approach we normally use. The following example would play two notes over and over again:
sequence
dw set_note_ch1, some_note
dw init_loop, #3fe ; set note length
dw jump_sub, delay
dw set_note_ch1, some_other_note
dw init_loop, #3fe ; set note length
dw jump_sub, delay
dw jump_seq, sequence
delay
dw do_nothing
dw loop, delay
dw return
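As a sanity check of how this table gets walked, here is a small Python model of the dispatcher, assuming no nesting (a single return slot). It is a sketch, not engine code: Python functions stand in for thread addresses, a list index stands in for the stack pointer, labels are resolved through a dictionary, and the loop count is cut down from #3fe to 3 to keep the run short.

```python
class Player:
    """Toy model of the dispatcher: `sp` walks a flat table of handlers and arguments."""

    def __init__(self, data, labels):
        self.data = data        # stands in for the music data
        self.labels = labels    # label name -> index into data
        self.sp = 0             # stands in for the stack pointer
        self.loop_count = 0
        self.return_sp = 0      # single return slot, so subroutines cannot nest
        self.played = []        # log of notes set on channel 1

    def pop(self):
        word = self.data[self.sp]
        self.sp += 1
        return word

    def step(self):
        # One sound loop iteration: the main thread would run here,
        # then the "RET" pops the next handler and jumps to it.
        self.pop()(self)


def set_note_ch1(p):
    p.played.append(p.pop())

def init_loop(p):
    p.loop_count = p.pop()

def jump_sub(p):
    target = p.pop()
    p.return_sp = p.sp          # resume after the argument on return
    p.sp = p.labels[target]

def jump_seq(p):
    p.sp = p.labels[p.pop()]

def do_nothing(p):
    pass

def loop(p):
    target = p.pop()
    p.loop_count -= 1
    if p.loop_count > 0:
        p.sp = p.labels[target]

def ret(p):                     # `return` is a Python keyword, hence the name
    p.sp = p.return_sp


DATA = [
    set_note_ch1, "some_note",        # sequence (index 0)
    init_loop, 3,
    jump_sub, "delay",
    set_note_ch1, "some_other_note",
    init_loop, 3,
    jump_sub, "delay",
    jump_seq, "sequence",
    do_nothing,                       # delay (index 14)
    loop, "delay",
    ret,
]
LABELS = {"sequence": 0, "delay": 14}

player = Player(DATA, LABELS)
for _ in range(22):                   # 10 iterations per note, then jump_seq and the next set_note
    player.step()
print(player.played)
```

With a loop count of 3, each note occupies 10 iterations in this model: three setup entries, three passes of two entries through delay, and the return.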
Now, that may seem like a rather inefficient way to just do
seq
dw pattern
dw seq_end
dw seq
pattern
dw length, note
dw length, other_note
db ptn_end
Perhaps it makes more sense to pick a certain number of threads per pass, and think of the time they take as your refresh rate/tick length, e.g.
...
dw set_note, note
dw init_loop, #80 ; 0x80 * 0x10 + 3 = 0x803 loops/ca. 8 ticks (so we can get away with just an 8-bit length counter)
dw jump_sub, fx01
dw set_note, other_note
dw init_loop, #40
dw jump_sub, fx02
...
fx01
dw calculate_expensive_fx_part1
dw calculate_expensive_fx_part2
dw do_something_else
dw jump_sub, delay11
dw loop, fx01 ; 16 threads per pass
dw return ; back to the sequence once the loop counter runs out
fx02
dw calculate_other_expensive_fx_part1
dw calculate_other_expensive_fx_part2
dw calculate_other_expensive_fx_part3
dw calculate_other_expensive_fx_part4
dw do_something_else
dw jump_sub, delay9
dw loop, fx02 ; also 16 threads per pass
dw return ; back to the sequence once the loop counter runs out
delay11
dw do_nothing
dw do_nothing
delay9
dw do_nothing
dw do_nothing
dw do_nothing
dw do_nothing
dw do_nothing
dw do_nothing
dw do_nothing
dw do_nothing
dw return
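The thread counts in the example can be checked with a few lines of Python. This sketch assumes every table entry costs exactly one sound loop iteration, that delay11 falls through into delay9, and that the fixed cost per note is the three entries named in the example's comment (set_note, init_loop, jump_sub); any one-off entries that run after the loop ends are ignored, as in that comment. The string encoding of the entries is made up for this check.

```python
def threads_per_pass(body, sub_lengths):
    """Count the sound loop iterations consumed by one pass through an effect body."""
    total = 0
    for entry in body:
        total += 1                                  # every table entry takes one iteration
        if entry.startswith("jump_sub "):
            total += sub_lengths[entry.split()[1]]  # plus the subroutine it calls
    return total


def note_length(loop_count, per_pass, setup=3):
    """Iterations for one note; setup covers set_note, init_loop and jump_sub."""
    return loop_count * per_pass + setup


# do_nothing entries plus the final return; delay11 falls through into delay9.
SUB_LENGTHS = {"delay9": 9, "delay11": 11}
FX01 = ["part1", "part2", "do_something_else", "jump_sub delay11", "loop"]
FX02 = ["part1", "part2", "part3", "part4", "do_something_else", "jump_sub delay9", "loop"]

for name, body, count in (("fx01", FX01, 0x80), ("fx02", FX02, 0x40)):
    per_pass = threads_per_pass(body, SUB_LENGTHS)
    length = note_length(count, per_pass)
    print(name, per_pass, hex(length), length / 256)
```

Both effect bodies come out at 16 iterations per pass, which is why the two delay subroutines have different lengths: they pad each body to the same tick length.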
Still expensive in terms of data size, but essentially we can now run things that we used to run in the outer sound loop (so, every 256*inner_loop_length t-states) at a much faster rate without affecting sound. This allows us to do stuff like fast and precise slides/vibrato and all sorts of variable-rate modulation effects. Also, with recursive subroutines, you could actually have subroutines that generate subroutines... I think one could go pretty wild with this.
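To put a rough number on the faster update rate, the sketch below compares a pitch slide updated once per 256 iterations (the old outer loop) with one updated once per 16-iteration pass, as in the example above. The start and end values and the note length of 0x800 iterations are made-up figures, and a linear slide is assumed.

```python
def slide_steps(start, end, note_iterations, update_interval):
    """Return (number of updates, step size per update) for a linear slide."""
    updates = note_iterations // update_interval
    step = (end - start) // updates
    return updates, step


coarse = slide_steps(0x100, 0x200, 0x800, 256)  # one update per outer loop
fine = slide_steps(0x100, 0x200, 0x800, 16)     # one update per 16-thread pass
print("outer loop:", coarse)
print("per pass:  ", fine)
```

Under these assumptions the same slide is spread over 128 small steps instead of 8 large ones.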