Audio Conditioned Symbolic Solo Guitar Generation with PerceiverAR

Dataset

The model was trained on a custom-collected dataset of roughly 17,000 pairs of symbolic guitar solos aligned with synthesized audio backing tracks.

The dataset consists predominantly of the hard rock and metal genres.

Model Architecture

The proposed pipeline is a four-stage encoder–adapter–decoder system:

Audio feature extraction

A frozen Music2Latent encoder transforms the raw waveform into a sequence of 64-dimensional continuous latents produced at ≈10 Hz.
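
A minimal sketch of this stage, assuming the public music2latent package (its EncoderDecoder interface and the exact latent rate may differ by version):

```python
# Sketch: encode a backing track into ~10 Hz, 64-dim continuous latents.
# Assumes the public music2latent package; not the authors' exact code.
import librosa
from music2latent import EncoderDecoder

encdec = EncoderDecoder()                     # pretrained encoder, kept frozen

wav, sr = librosa.load("backing_track.wav", sr=44100)
latents = encdec.encode(wav)                  # shape ~ (channels, 64, T)
print(latents.shape)                          # T grows at roughly 10 frames/sec
```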

Projection adapter

A trainable projection layer maps the continuous latents to the token-embedding space of the language model.
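
A minimal PyTorch sketch of such an adapter; the 64-dimensional input follows the text, while the target width d_model is an illustrative assumption:

```python
import torch.nn as nn

class ProjectionAdapter(nn.Module):
    """Maps frozen 64-dim audio latents into the decoder's embedding space."""

    def __init__(self, latent_dim: int = 64, d_model: int = 512):  # d_model assumed
        super().__init__()
        self.proj = nn.Linear(latent_dim, d_model)

    def forward(self, latents):         # (B, T_audio, 64)
        return self.proj(latents)       # (B, T_audio, d_model)
```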

Conditioning strategy

The projected latents are prepended to the token sequence as a prefix and attended to through cross-attention only once, keeping generation cost independent of the audio length.
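
In code, the conditioning reduces to a single concatenation before decoding; a sketch with hypothetical names:

```python
import torch

def build_decoder_input(audio_latents, token_ids, adapter, token_embedding):
    """Prepend projected audio latents as a prefix to the token embeddings."""
    prefix = adapter(audio_latents)               # (B, T_audio, d_model)
    tokens = token_embedding(token_ids)           # (B, T_tokens, d_model)
    # The decoder cross-attends over this prefix once; it is never re-encoded
    # during generation, so cost does not grow with audio length.
    return torch.cat([prefix, tokens], dim=1)     # (B, T_audio + T_tokens, d_model)
```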

Autoregressive decoder

A Perceiver AR network predicts DADA-GP tokens autoregressively, with its latent queries attending both to the audio prefix and to each other.
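
The sketch below illustrates the Perceiver AR pattern in PyTorch: the full prefix-plus-token sequence is compressed into a short window of latent queries by a single cross-attention, and only those queries pass through the causal self-attention stack. All sizes are illustrative, and the causal masking that the full model applies inside the cross-attention is omitted for brevity.

```python
import torch
import torch.nn as nn

class PerceiverARSketch(nn.Module):
    """Illustrative Perceiver AR decoder: cross-attend once, then self-attend."""

    def __init__(self, d_model=512, n_heads=8, n_layers=6, vocab_size=2048):
        super().__init__()
        # One cross-attention over the long input (audio prefix + past tokens).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Causal self-attention stack runs only on the short latent window.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.latent_stack = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, full_seq, n_queries):
        # Latent queries are the last n_queries positions of the input.
        queries = full_seq[:, -n_queries:]
        x, _ = self.cross_attn(queries, full_seq, full_seq)
        causal = nn.Transformer.generate_square_subsequent_mask(n_queries)
        x = self.latent_stack(x, mask=causal.to(full_seq.device))
        return self.lm_head(x)          # next-token logits for each query
```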

Generation Samples

Each sample pairs a conditioning audio track with the generated guitar solo, provided both as rendered audio and as a Guitar Pro (GP5) file:
solo_01.gp5
solo_02.gp5
solo_03.gp5
solo_04.gp5
solo_05.gp5
solo_06.gp5
solo_07.gp5