On June 21, 2017, electronic musician Holly Herndon and her husband, writer/philosopher/teacher Mat Dryhurst, welcomed a new addition to their family. They named it Spawn. “She’s an inhuman child,” Herndon tells me one afternoon, while seated in the offices of her record label, 4AD.
Spawn is a nascent machine intelligence, or AI. Artificial intelligence is being deployed for self-driving 18-wheelers, Netflix user-preference predictors, customer service preferences, handwriting recognition, and cyber-security to fight hackers using AI to create malware. Machine learning’s future infiltration into music production isn’t a question of “if” but of “when,” and significant inroads are already being made. There’s AI that can replicate Bach and make up Beatles songs, gimmicky YouTube uploads of robo-pop, ambient producers that use AI to churn out new albums every week, even an algorithm signed to a major label. Engineering teams at Google, IBM, and Spotify are working tirelessly to advance AI further into the realm of music-making.
But Herndon’s 2019 album, Proto, contains the first recorded debut of an AI on a pop-music album. Here, she explains how she did it.
So much of the research into AI is trained on a very particular era of music (1850–1950 in the western canon) where pitch and note length and rhythm are the most important. It’s really dull because it ties us to this particular time that’s no longer of the moment. We wanted Spawn to reflect our community, and we wanted to use people’s voices that were specific to it.
The first six months were pretty uninteresting. With AI, you have a training canon; the AI extracts a rule set from the canon and applies it to something else. It can never go outside the canon. When that’s applied to a voice, the AI tries to understand the rule set of the voice, the logic of the voice. We started training it with my voice and Mat’s voice, both of which are in the hundreds of megabytes of Spawn’s training info. After six months, we got slightly more interesting results. That started to happen when I stopped using TensorFlow, a program mostly for visual learning. (If you wanted to have your portrait done in the style of Van Gogh, you would use this.) This involves turning sound files into spectrograms so that AI can “see” them. But in terms of timbre, it was very lo-fi and it all sounded the same. There was nothing exciting about the output. We switched to SampleRNN, which is used for voice recognition. With SampleRNN, it takes whatever is in the training canon and then it tries to understand: If this sample is happening, what would most likely come next? The one snag is, if it’s training on my voice, it tends to get stuck on vowels. When we speak, we elongate our vowels, so the program tries to guess for how long exactly, and then it gets stuck.
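To make the "what would most likely come next?" idea concrete, here is a toy sketch in Python. It is not SampleRNN, which is a neural network working on raw audio; this stand-in only counts which symbol follows which in a tiny training canon, and every name in it (`train`, `generate`, `canon`) is hypothetical. Each character stands in for one short slice of sound. Because the vowels in the canon are held for several slices, the most likely follower of a vowel is the same vowel, and a generator that always picks the most likely next slice never leaves it.

```python
from collections import Counter, defaultdict


def train(frames):
    """Count, for each frame, how often each other frame follows it."""
    counts = defaultdict(Counter)
    for current, following in zip(frames, frames[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, start, length):
    """Greedily emit the most frequent follower of the last frame."""
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # nothing in the canon ever followed this frame
        out.append(followers.most_common(1)[0][0])
    return "".join(out)


# Hypothetical training canon: "hello" with elongated vowels,
# one character per slice of sound.
canon = "heeeeellooooo"
model = train(canon)
generated = generate(model, "h", 12)
print(generated)  # the output locks onto the first vowel it reaches
```

Sampling from the counts instead of always taking the top choice would let the toy escape the vowel eventually; the greedy choice is used here only because it makes the stuck behavior easy to see.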
An early example of playing with SampleRNN, in which Spawn gets somewhere imitating Holly’s voice model:
Spawn’s first words and sounds only came when we switched to a third, voice-model method. It required way more audio. We used hours of my voice. It takes my voice speaking and singing and creates a model of what it sounds like. I made a data set where I sang random phrases within a comfortable range for me, like:
Aluminium cutlery can often be flimsy.
She wore warm, fleecy, woolen overalls.
Alfalfa is healthy for you.
Spawn would digest that information, which could take anywhere from 1 to 20 minutes. We’d all be on Slack together and we’d get updates like: “Spawn released a new track.” She would do that all the time. We’d click through and listen to it and, most of the time, our response was, eh. And then I clicked on the one used for “Birth” and went, “Yes!” That was the first time I was excited by the outcome. Because generally, Spawn has such a limited perspective. It’s both super-impressive and like … terrible. It’s like God, you’re so dumb!
Spawn has very real limitations. Reverb is really difficult. It couldn’t understand the difference between the shapes of the sounds and their echoes. It’s looking for difference, so it really likes the audience clapping or shaking keys, or tapping beer bottles, or finger-snapping; all of that sounded really cool through Spawn.
Spawn imitates an audience clapping:
She likes transients. Percussion instruments have the biggest transients of the whole instrument family, in that the beginning of the sound is big and then it quickly decays. She saw a snare, and thought, That is a bit like this bit I recall from when Holly says “T,” and tried to reproduce the snare with a “T” sound. That, to us, is new. The result is somewhat clever, logical, and most importantly, unexpected. It surprised us.
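A transient, as described above, is a sound that starts big and decays quickly. The Python sketch below illustrates that shape with a synthetic, snare-like burst and a simple energy comparison between neighboring frames. It says nothing about how Spawn works internally; the signal, the frame size, the threshold `ratio`, and the function names are all assumptions chosen for illustration.

```python
import math


def frame_energy(signal, frame_size):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in signal[i:i + frame_size]) / frame_size
        for i in range(0, len(signal) - frame_size + 1, frame_size)
    ]


def find_transients(energies, ratio=4.0, floor=1e-6):
    """Flag frames whose energy jumps well above the previous frame."""
    onsets = []
    for i in range(1, len(energies)):
        previous = max(energies[i - 1], floor)  # avoid comparing against zero
        if energies[i] > floor and energies[i] > ratio * previous:
            onsets.append(i)
    return onsets


# Synthetic example: 400 samples of silence, then a 200 Hz burst
# whose loudness decays exponentially (a crude percussion hit).
rate = 8000
silence = [0.0] * 400
hit = [
    math.exp(-t / 100.0) * math.sin(2 * math.pi * 200 * t / rate)
    for t in range(800)
]
signal = silence + hit

energies = frame_energy(signal, 100)
onsets = find_transients(energies)
print(onsets)  # frame indices where a transient begins
```

The only frame flagged is the one where the burst begins; every later frame is quieter than the one before it, which is the "big, then quickly decays" profile the paragraph describes.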
Spawn performs the rhythm section of the song “Frontier.” You can hear how it pulls different ideas from Holly’s vocal model:
That’s why when you listen to “Godmother,” it sounds like beatboxing, which is a combination of singing and speech. I wasn’t training her with beatboxing. It’s so embarrassing that this is what she spat out! I don’t know if it was a good idea, but it was an idea. It wasn’t something I told her specifically to do. I tried singing “Godmother,” and I just can’t. It’s too fast. Spawn outperforms me.
Pretty soon, we’ll have very accurate voice models of past vocalists, and that’s going to open up questions about what we do with our forefathers’ and foremothers’ voices. I used to say we’ll have infinite Michael Jackson records, but that probably won’t happen anymore. Infinite Aretha Franklin records may be the better example!