Audio Conditioned Symbolic Solo Guitar Generation with PerceiverAR

Dataset

The model was trained on a custom-collected dataset of roughly 17,000 pairs of symbolic guitar solos aligned with synthesized audio backing tracks.

The dataset consists predominantly of the hard rock and metal genres.

Model Architecture

The proposed pipeline is a four-stage encoder–adapter–decoder system:

Audio feature extraction

A frozen Music2Latent encoder transforms the raw waveform into a sequence of 64-dimensional continuous latents produced at ≈10 Hz.
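
A minimal sketch of this stage, assuming the public music2latent package (its EncoderDecoder interface and the exact latent rate may differ by version):

```python
# Sketch: encode a backing track into ~10 Hz, 64-dim continuous latents.
# Assumes the public music2latent package; not the authors' exact code.
import librosa
from music2latent import EncoderDecoder

encdec = EncoderDecoder()                     # pretrained encoder, kept frozen

wav, sr = librosa.load("backing_track.wav", sr=44100)
latents = encdec.encode(wav)                  # shape ~ (channels, 64, T)
print(latents.shape)                          # T grows at roughly 10 frames/sec
```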

Projection adapter

A trainable projection layer maps the continuous latents to the token-embedding space of the language model.
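
A minimal PyTorch sketch of such an adapter; the 64-dimensional input follows the text, while the target width d_model is an illustrative assumption:

```python
import torch.nn as nn

class ProjectionAdapter(nn.Module):
    """Maps frozen 64-dim audio latents into the decoder's embedding space."""

    def __init__(self, latent_dim: int = 64, d_model: int = 512):  # d_model assumed
        super().__init__()
        self.proj = nn.Linear(latent_dim, d_model)

    def forward(self, latents):         # (B, T_audio, 64)
        return self.proj(latents)       # (B, T_audio, d_model)
```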

Conditioning strategy

The projected latents are prepended to the token sequence as a prefix and attended to through cross-attention only once, keeping generation cost independent of the audio length.
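
In code, the conditioning reduces to a single concatenation before decoding; a sketch with hypothetical names:

```python
import torch

def build_decoder_input(audio_latents, token_ids, adapter, token_embedding):
    """Prepend projected audio latents as a prefix to the token embeddings."""
    prefix = adapter(audio_latents)               # (B, T_audio, d_model)
    tokens = token_embedding(token_ids)           # (B, T_tokens, d_model)
    # The decoder cross-attends over this prefix once; it is never re-encoded
    # during generation, so cost does not grow with audio length.
    return torch.cat([prefix, tokens], dim=1)     # (B, T_audio + T_tokens, d_model)
```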

Autoregressive decoder

A Perceiver AR network predicts DADA-GP tokens autoregressively, with its latent queries attending both to the audio prefix and to each other.
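
The sketch below illustrates the Perceiver AR pattern in PyTorch: the full prefix-plus-token sequence is compressed into a short window of latent queries by a single cross-attention, and only those queries pass through the causal self-attention stack. All sizes are illustrative, and the causal masking that the full model applies inside the cross-attention is omitted for brevity.

```python
import torch
import torch.nn as nn

class PerceiverARSketch(nn.Module):
    """Illustrative Perceiver AR decoder: cross-attend once, then self-attend."""

    def __init__(self, d_model=512, n_heads=8, n_layers=6, vocab_size=2048):
        super().__init__()
        # One cross-attention over the long input (audio prefix + past tokens).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Causal self-attention stack runs only on the short latent window.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.latent_stack = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, full_seq, n_queries):
        # Latent queries are the last n_queries positions of the input.
        queries = full_seq[:, -n_queries:]
        x, _ = self.cross_attn(queries, full_seq, full_seq)
        causal = nn.Transformer.generate_square_subsequent_mask(n_queries)
        x = self.latent_stack(x, mask=causal.to(full_seq.device))
        return self.lm_head(x)          # next-token logits for each query
```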

Generation Samples

Each sample pairs a conditioning audio track with the generated guitar solo, provided both as rendered audio and as a Guitar Pro (GP5) file:
solo_01.gp5
solo_02.gp5
solo_03.gp5
solo_04.gp5
solo_05.gp5
solo_06.gp5
solo_07.gp5