Audition involves a deep temporal hierarchy of transformations supporting capacities such as speech and music perception. How are representations structured throughout the auditory system? To characterize these processing stages, we develop a computational modelling framework that allows comparisons across human perception, time-resolved brain signals, explicit models, and artificial neural networks. Specifically, we identify properties of the underlying transformations by considering the effects of computational trajectories on representational spaces. Using a combination of modelling, synthesis, psychophysics, and MEG, we explore hypotheses about artificial and human audition on common representational ground. We aim to find computational descriptions that point to missing pieces as modelling efforts in mid-level audition extend to low-level music cognition.