Proteus that costs like $1000 is certainly an overkill for this.
For this particular application I didn't need to know exact cycle count or do assembly optimization. If I was to write AVR assembly code by hand, I sure would know how many cycles it takes - it is common practice to memorize opcode cycles (super easy on AVR), and count them in time critical parts when writing for 8/16 bit systems in assembler.
What I needed was just to see compiler generated code, to check if particular changes in C code makes resulting code longer or shorter, which is totally enough to estimate efficiency of edits - another common practice when writing for 8/16 bit systems in C. It is super simple task, I don't see why it should be overcomplicated like that with all that porting to other environments, using simulators, debuggers, and stuff like that. All it really takes is just a quick glance at the compiler output, which is normally fully exposed to the user. It is just an Arduino IDE's quirk that it hides the intermediate assembly.