Harmonising Human Creativity With AI: A Deep Dive Into Diff-A-Riff
Sony CSL's Javier Nistal on AI's potential to enhance music production and creativity.
Welcome back to Musicworks, where we're all about equipping music creatives and professionals with the knowledge they need to thrive in an industry constantly in flux. In our ongoing series on Music AI, we're exploring the innovations that are reshaping the landscape of music production and composition. Today, we're delighted to bring you an enlightening conversation with Javier Nistal, an Associate Researcher at Sony Computer Science Laboratories in Paris, who's at the forefront of AI-driven music creation.
As the lines between human creativity and artificial intelligence continue to blur, tools like Diff-A-Riff are emerging as powerful allies for musicians and producers. This groundbreaking technology promises to change the way we approach music production, offering unprecedented control and creative possibilities. Javier and his team are not just developing new tools; they're reimagining the very nature of musical collaboration between humans and machines.
In this interview, we'll explore the genesis of Diff-A-Riff, the challenges faced in its development, and its potential to transform music production workflows. Javier's insights offer a unique perspective on the role of AI in music, challenging some common misconceptions and painting a picture of a future where technology enhances rather than replaces human creativity.
For those unfamiliar with Javier Nistal and his work, here's some background:
“Javier Nistal is an Associate Researcher with the Music Team at Sony Computer Science Laboratories in Paris. He studied Telecommunications Engineering at Universidad Politecnica de Madrid and received a Master’s in Sound and Music Computing from Universitat Pompeu Fabra. He completed his doctoral studies at Telecom Paris in a collaborative effort with Sony CSL, where he researched Generative Adversarial Networks for musical audio synthesis.
In the music tech industry, Javier has worked on diverse projects involving machine learning (ML) and music, including recommendation systems, instrument recognition, and automatic mixing. He contributed to the development of the Midas Heritage D, the first ML-driven audio mixing console, and created DrumGAN, the first ML-powered sound synthesizer to hit the market.
Javier’s current research interest lies at the intersection of music production and deep learning. He is dedicated to devising generative models for music co-creation, aiming to enhance artistic creativity and enable musicians to explore new realms of musical expression.”
Thanks for taking the time to talk today. Let's start off with a bit of background about yourself. What made you want to get into the Music AI space? What motivated you to develop Diff-A-Riff, and how does it differ from existing music generation models?
I got into the technology side of music because I was a musician myself and I’m interested in science. I say 'was', in the past tense, because, you know, life can get hectic and, in the end, research ended up taking over. But yeah, for a long time I was making computer music, mainly hip hop and electronic music, influenced by my older brother and cousin. On the other side, my parents are both researchers, and they instilled in me a passion for science and discovery. At some point I had to choose a career path. I wasn't particularly gifted at music, so I decided to take a middle path by studying Telecommunications Engineering and specializing in sound signal processing.
Back then, artificial intelligence wasn’t as widespread as it is today. It felt like some hermetic sci-fi secret knowledge, and this made it very interesting to me. So, after concluding my Telecommunications studies in Madrid, I moved to Barcelona to do a master’s at the Music Technology Group (MTG) at Pompeu Fabra University, a well-known research center for music technology. There, I learned the basics of Machine Learning and pursued a career along this path. During this period, Deep Learning applied to symbolic music generation was starting to take off. Jukedeck was one of the first companies in this domain, back in 2014 or so. I interned there in 2017 and discovered all this generative sorcery. Around that time, I went to a conference to present my master's thesis. There, I met this eccentric guy with meter-long dreadlocks; we immediately connected and became very good friends. He was working at CSL and, today, he's the lead researcher of the Music Team: Stefan Lattner. He mentioned that there would be an opening at the lab for a PhD student in neural audio synthesis. The next year, in 2018, I applied to Sony CSL, got the job, and moved to Paris to pursue a PhD in Generative Adversarial Networks (GANs) applied to musical audio synthesis. GANs were popular at the time due to their great generation quality and speed. As a result of this work, and through a partnership with Steinberg, we released DrumGAN, an AI-driven drum-sound synthesizer and the first of its kind to be commercialised as a VST plugin.
More recently, with the rise of Diffusion models, we saw the chance to push these technologies even further. One critique of existing AI-driven technologies for music generation (e.g., Suno, Udio) is that, while they can create very faithful and musically appealing content, they lack controllability. In my opinion, writing a text prompt to generate a five-minute piece of music isn't very useful for musicians. In fact, I think there’s a real existential threat to art and artists’ agency as a result of the extreme autonomy these technologies are granted. Today, the lens of “fairness” in AI seems to be focused on the training data while, for me, another important threat lies in that level of automation. Even if the data is legally acquired, pressing a button to generate a million tracks is still unfair to artists and art.
As musicians ourselves, the Music team at CSL noticed this lack of controllability in current generative models and wanted to push for a more fine-grained, layer-by-layer approach. We aimed to create a tool that listens to the creator’s music and adapts to the artist's input rather than simply regurgitating training data. We wanted to combine these aspects to bring generative models into the realm of music production, providing musicians with a whole new palette of means for musical expression.
I understand that creating meme music, like what Suno or Udio does, is fun. My friends now send me music with their lyrics instead of voice messages, so I guess it is. But I struggle to see how the current state of these models is applicable to serious music production. This is the main motivation behind Diff-A-Riff.
What features or capabilities must Diff-A-Riff have to perform well?
The main features for any generative AI are: the quality of the generated data, measured as its resemblance to the original training data; diversity, or, to put it simply, that the AI doesn’t always generate very similar things; and generation speed. For the first two, one of the main challenges in music concerns the availability of training data. Diff-A-Riff was trained on around 10-12,000 multitrack recordings from a collection we bought for research purposes. This is one of the prerequisites to make it work, of course. Another, and in my opinion the most important, feature concerns the expressiveness and control affordances of the tool, that is, the creative capabilities it offers to the user. For this, we designed Diff-A-Riff to operate on single instruments, which we believe aligns better with music production workflows, where music is created in a layer-by-layer fashion. We also condition Diff-A-Riff on various types of information (what we call multiple modalities in the jargon), including the user-provided music context, but also text descriptions and audio sample references for the model to emulate. Additionally, we equipped Diff-A-Riff with the capacity to generate loops, stereo sounds, or variations of user-provided samples.
In general, the model’s performance improves when it has such user-provided information as conditioning input, which relieves the model from having to learn and recreate everything from scratch; i.e., the model just needs to extract and recombine information from its inputs. We could exploit any alternative conditional information, provided we have the right dataset, e.g., sonifying a score or creating music based on dance data. As you can see, the design and nature of the conditional information guides the generative model and defines its control affordances.
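To make the idea of multi-modal conditioning a little more concrete, here is a minimal, hypothetical PyTorch sketch of a denoiser that accepts a music-context embedding, a text embedding, and an audio-reference embedding as extra inputs. The class name, the dimensions, and the way the modalities are combined are illustrative assumptions, not Diff-A-Riff's actual architecture.

```python
# Minimal, hypothetical sketch of multi-modal conditioning for a latent
# diffusion denoiser. Names and dimensions are illustrative assumptions,
# not Diff-A-Riff's actual design.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=64, cond_dim=512):
        super().__init__()
        # Project each conditioning modality into a shared space.
        self.context_proj = nn.Linear(cond_dim, cond_dim)  # encoded music context
        self.text_proj = nn.Linear(cond_dim, cond_dim)     # encoded text prompt
        self.ref_proj = nn.Linear(cond_dim, cond_dim)      # encoded audio reference
        # Toy denoiser: predicts noise from noisy latents plus conditioning.
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim, 256),
            nn.GELU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, noisy_latents, context_emb, text_emb, ref_emb):
        # Sum the projected modalities; absent ones could be passed as zero
        # tensors, which is one common way to make conditioning optional.
        cond = (self.context_proj(context_emb)
                + self.text_proj(text_emb)
                + self.ref_proj(ref_emb))
        cond = cond.unsqueeze(1).expand(-1, noisy_latents.shape[1], -1)
        return self.net(torch.cat([noisy_latents, cond], dim=-1))

# Dummy usage: a batch of 2 clips, 800 latent frames, 64-dim latents.
model = ConditionedDenoiser()
x = torch.randn(2, 800, 64)
c = torch.randn(2, 512)
print(model(x, c, c, c).shape)  # torch.Size([2, 800, 64])
```

Summing projected embeddings is just one simple way to let each conditioning signal be present or absent; cross-attention over the conditioning sequence is another common choice.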
With regard to the generation speed I mentioned earlier, the most important building block of Diff-A-Riff is the underlying Consistency Autoencoder, developed in-house at CSL by one of our most brilliant PhD students, Marco Pasini. I don't know if you're familiar with latent generative models (like Diff-A-Riff). These rely on a preliminary step that compresses the audio signal into a much more compact form. This compression is done by a so-called autoencoder. In the domain of sound, this is similar to what MP3 does, except that with MP3 you can still play back the compressed sound. A deep learning autoencoder compresses sound much further, so much so that the result is no longer an audible signal but a sequence of so-called tokens that represent the original audio in a highly compressed form. The decoder is then used to decompress the audio back into the domain of sound by filling in the information lost during compression. Diff-A-Riff relies on such an autoencoder, operating on audio signals represented in this compressed space. This enables our model to be very small and perform at unprecedented speed compared to other AIs of similar quality (e.g., Stable Audio, MusicGen).
To give you a rough idea of the compression capabilities of the Consistency Autoencoder: an audio clip of 48,000 samples (1 second of audio at a 48 kHz sample rate) gets compressed into a signal with 800 samples.
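In other words, the time axis shrinks by roughly a factor of 60. A quick back-of-the-envelope calculation shows what that means for a typical loop (illustrative only; the exact latent layout of the Consistency Autoencoder isn't described here):

```python
# Rough compression arithmetic based on the figures quoted above.
sample_rate = 48_000        # raw audio samples per second
latent_per_second = 800     # compressed "samples" per second, as quoted above

ratio = sample_rate / latent_per_second
print(f"Time-axis compression: {ratio:.0f}x")  # 60x

# For an 8-second clip (e.g., a 4-bar loop at 120 BPM), the generative model
# works on far fewer time steps than the raw waveform would require:
seconds = 8
print(f"{seconds * latent_per_second} latent samples "
      f"instead of {seconds * sample_rate} audio samples")
```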
What were some of the biggest challenges you faced during the development of Diff-A-Riff, and how did you overcome them?
Our main challenge at the moment is infrastructure limitations: basically, data storage and GPU capabilities. The current dataset takes around 20 terabytes, which is not a lot compared to the amounts of data that big players train on, but we are a small and humble team within the Sony organization and don’t have access to large computing clusters. This means that we have to break everything into (many) pieces. Training the model requires a lot of preprocessing steps because we can't do everything in one single take. We have to do various pre-extractions of the data and move it back and forth between disk and SSD to preprocess it. This process takes days, or even weeks if something goes wrong (which it usually does).
Additionally, the field of AI is continuously evolving and keeping up with the state-of-the-art is harder than ever. I'm not a mathematician myself, nor a software engineer; I'm rather a sound engineer with some self-acquired computer science and Machine Learning skills. So, for me, navigating all these complex topics, putting the pieces together and combining them into a successful tool for music production is a big challenge. Luckily, I have a very diverse team of brilliant minds with very complementary skills backing me up.
How do you envision Diff-A-Riff being integrated into existing music production workflows, and what are the potential benefits for artists, producers and the music industry as a whole?
The way I envision Diff-A-Riff, or future versions of it, is as some sort of AI daemon integrated into the digital audio workstation. By daemon, I mean a kind of continuous background process, in software engineering jargon. So, imagine having an AI daemon that works seamlessly with your DAW, Ableton for instance, and can concurrently access all the tracks in your project. Whether audio or MIDI tracks, these would be fed to Diff-A-Riff as part of the context. Based on these, and the additional controls that Diff-A-Riff offers, one could directly create new tracks for specific instruments and instantly populate them with meaningful material adapted to your music. There also wouldn’t be any need to work with MIDI any more: one could work directly on audio clips, with endless ways to generate and edit them.
Chopping, rearranging, or time-stretching audio loops are tasks producers often do manually to ensure that samples fit the rest of the tracks. With Diff-A-Riff, all this could be done effortlessly. If you didn't like a part of a guitar take that you recorded or took from a sample library, you could simply regenerate it (what we call inpainting, by analogy with the computer vision field, i.e., filling in missing information), perhaps by using a text prompt, e.g., “increase distortion level” or “guitar solo”, or perhaps by using a reference audio sample you want the regenerated part to emulate. This would create a super-fluid audio editor/generator where one can “sculpt” music and sounds at will.
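As a rough illustration of the inpainting idea, the sketch below masks a region of a clip's latent sequence and asks a stand-in generator to refill it from the surrounding context. The function `generate_fill` is purely hypothetical and stands in for a model like Diff-A-Riff; none of this is Sony's actual code.

```python
# Minimal, hypothetical sketch of latent "inpainting": blank out a region of a
# clip's latent sequence and let a generator fill it back in from context.
import numpy as np

def inpaint(latents: np.ndarray, start: int, end: int, generate_fill) -> np.ndarray:
    """Replace latents[start:end] with newly generated material.

    latents: (time, channels) latent representation of an audio clip
    generate_fill: callable taking (masked_latents, mask) and returning
                   material for the masked region (hypothetical model hook)
    """
    mask = np.zeros(len(latents), dtype=bool)
    mask[start:end] = True
    masked = latents.copy()
    masked[mask] = 0.0                    # blank out the part we dislike
    filled = generate_fill(masked, mask)  # model fills the gap from context
    out = latents.copy()
    out[mask] = filled[mask]              # keep everything else untouched
    return out

# Dummy usage with a stand-in "model" that fills the masked region with the
# per-channel mean of the masked sequence (a real model would generate music).
clip = np.random.randn(800, 64)           # roughly 1 second of latents (see above)
dummy_model = lambda x, m: np.where(m[:, None], x.mean(axis=0), x)
edited = inpaint(clip, start=300, end=500, generate_fill=dummy_model)
print(edited.shape)  # (800, 64)
```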
This approach could tremendously speed up audio production workflows whilst allowing for weird, creative, and expressive means of control. For instance, Diff-A-Riff was trained only on musical instruments, but what if you drag in non-musical sounds as input, like a barking dog, to fit your context? I've never tried it myself, but chances are it will produce unexpected yet musically interesting results: happy accidents that inspire creativity.
While using these tools, one must relinquish some control to embrace surprise. Integrating these tools into music production workflows offers an option to experiment and innovate. If desired, one can still opt for the precise control of classic tools, but having these new AI-driven capabilities would add a layer of creative exploration and spontaneity to the process.
Are there any accepted "truths" about AI's role in music creation that you disagree with based on your discussions with musicians? What's your contrasting viewpoint and evidence?
There are a couple of big ideas floating around about AI in music that seem a bit exaggerated to me, especially after my experience working in collaboration with musicians and assessing our tools at Sony CSL. One big idea out there is that AI is going to take over and replace human artists. A lot of the fear around this comes from heated online debates and dramatic media coverage. But, honestly, from what I've seen, most musicians don't seem all that worried. Sure, some companies are using AI to create royalty-free music, e.g., for ads and content creators, and this might affect some jobs and reduce revenue for rights holders. Companies like Jukedeck and Amper Music tried this quite some time ago, and now newer ones like Suno and Udio are trying similar things with more advanced technology.
But I think AI will be more about enhancing what artists can do rather than replacing them. I see AI as a tool that can speed up your workflow and let you try out new ideas without much hassle. Tools like Diff-A-Riff, for example, keep the artist in the creative process, making it more of a collaboration between humans and machines. From what I’ve gathered, many musicians see AI as a way to be more efficient and explore creative avenues they might not have considered otherwise.
While AI might take over some rather technical and repetitive tasks, I believe the human touch in music will always have an added value. Music is not just content; it is a vessel for ideas/visions through which people express themselves. Even in the AI era, musicians will still add that special something that makes their work unique. In fact, in my experience, a lot of artists see AI as an opportunity rather than a threat. Being quick to adopt these tools could open up new possibilities and help create amazing new music. It might affect some, but for many, it's a great opportunity to do innovative things with minimal technical effort and no regrets. The benefits outweigh the risks.
A good analogy is how the image and visual arts community is evolving. Tools like Adobe's Firefly speed up work in Photoshop, and artists are fine-tuning existing models on their own work to create unique models with their distinct flavour. Many of my graphic designer friends who are freelancers are thrilled. A client proposes a wild idea, they outline it, use tools like Midjourney to generate drafts quickly, present a few examples within hours, get feedback, refine the idea; it's exciting for them. These tools boost productivity like never before, making sketching and design much easier. I believe music will undergo a similar transformation.
That said, those who might be impacted most are big artists, record labels, and music rights holders in general. AI will allow anyone to create top-tier production quality music with limited resources and skills, which means more people can produce music that competes with what these big names are putting out. Additionally, the dilution of creative rights on material generated from AI's training data is already shaking the traditional models of music ownership and royalties.
Another common belief is that AI just spits out stuff it has seen before. This seems partially true for current models that generate music from limited user input, such as text prompts, since they solely rely on their training audio data to make predictions and will most likely end up recycling styles and genres. AI models can do more if trained to obey user-provided information, and I believe Diff-A-Riff is an example of this. AI may generate music that is novel if confronted with unique combinations of prompts or other inputs provided by the user. Also, AI will bring new ways of interacting with music, for example, as I mentioned before, AI could be used to convert dance movements into music; imagine dancing and creating music based on your moves, how cool would that be! To me, this shows that AI isn’t just about regurgitating old data—it’s about expanding how we create and interact with music.
What advice would you give to aspiring researchers looking to make a career in the AI audio space?
I would highly recommend starting off on your own: reading a lot of blogs and papers, trying to experiment a lot... Nowadays, with Google Colab notebooks and freely available GPU computing, at least on a small scale and for a limited amount of time, you really have the means to get started directly. There are plenty of blog posts, and ChatGPT, of course, our good ol’ friend. I would also encourage trying to find a master's program that suits your interests. For example, the one I did at the MTG, or the ones at IRCAM in Paris or Queen Mary University of London. They offer excellent music technology programs that integrate AI with computer science and music. These programs will definitely provide a broad view of the field and often have strong partnerships with industry. For instance, my internship through MTG's partnership with SoundCloud gave me a valuable opportunity. I find this very interesting for anyone starting a career in this field.
There are also numerous mentorship programs available. If you're feeling lost or uncertain, enrolling in these programs to connect with industry professionals or researchers from the field could be a good idea. A good starting point is the International Society for Music Information Retrieval (ISMIR), which offers mentorship programs such as those run by the Women in Music Information Retrieval (WiMIR) community. I'm sure there are other communities and mentorship programs that offer a gentle introduction to the field.
Is there anything else you'd like to share about Diff-A-Riff or your personal journey that we haven’t covered?
We hope to make the technology accessible to the public. We don't know yet in what way or form, but we will push for this. We are already working on subsequent versions of the model, and it will only get better from here.