Here at Acentech, we’ve taken note of the latest viral phenomenon to capture the attention of the internet. A high school student looked up the word “laurel” on the vocabulary.com website, listened to the audio pronunciation of the word, and heard something completely different: the word “yanny.” Someone recorded a laptop playing this audio and posted it on Instagram and Twitter, and the rest is history.
So how could this happen? To understand why this occurs, we first need to look at the source audio, which was voiced by an opera singer saying “laurel” with excellent, and perhaps exaggerated, diction. Everyone has peaks in the frequency spectrum of their voice when they speak vowels, and these peaks are called formants. Opera singers often develop an unusually prominent resonance in their vocal tract at high frequencies around 3,000 Hz in order to compete with orchestras when singing, a resonance sometimes called “the singer’s formant.” This resonance can also be observed in the speaking voices of opera singers, lending a nasal aspect to the sound.
Once the source audio, containing both typical formants and the singer’s formant, was recorded, it was likely processed with a data compression encoder to save on server space and to facilitate streaming over the internet. The most common form of this compressed audio is known as MP3. MP3s, along with many other audio and video encoding techniques, use a perceptual coding algorithm. This means the encoding process compresses the file, in part, by throwing away information that isn’t perceived by most people. Unfortunately, it isn’t perfect, and as you increase the amount of compression, the perceptual coding becomes more aggressive, meaning that more people will perceive a difference from the source file. The encoding process can introduce artifacts to the audio such as high frequency distortion of the signal, time-smearing of transients, and a reduction of low frequency content. The distortion is most obvious at frequencies of around 3,000 Hz and higher and is often perceived as ringing or a kind of swirling sound. So the original audio contains not only the sound of the opera singer (with singer’s formant) but also ringing compression artifacts that fluctuate in response to the voice signal.
The viral recording was made from the output of laptop audio in a noisy room and was further compressed either in the recording or in the transfer to various social media platforms. It is possible that the frequency response of the laptop and the room noise further distorted the audio to make it more ambiguous. Playing this audio through laptop or smartphone speakers rolls off the low frequency formants, giving the high frequency portion more prominence.
It is highly probable that this ambiguous viral audio stimulus forces the brain to choose between two distinct possibilities, a process that varies among individuals based on their prior experiences and assumptions, similar to the famous “white and gold or blue and black” dress debate. Josh McDermott, Assistant Professor in the Department of Brain and Cognitive Sciences at MIT, states:
“My general sense is that the effect is the result of a crummy speech signal making the formants ambiguous, such that the percept can be pushed around by variation in the transfer function of your audio presentation system, or by variation in your priors. I heard two different things depending on whether I listen over laptop speakers or headphones, almost surely due to the poor bass response of the laptop speakers.”
So the frequency response of your playback system (with poor low frequency response for smartphones and laptops) and of your hearing (with poor high frequency response for older adults and individuals with hearing loss) can influence which word you hear. The New York Times has placed a great demo on its website that filters out either high or low frequencies, allowing everyone to perceive both laurel and yanny.
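For readers who want to experiment, here is a minimal Python sketch of the filtering idea. It is not the method used by the New York Times demo, and it does not use the actual clip. It assumes a synthetic signal made of one low tone (300 Hz) and one high tone (3,000 Hz) as stand-ins for low and high frequency speech content, an assumed sample rate of 16,000 Hz, and a very simple first-order filter with an arbitrary 1,000 Hz cutoff. All function names are our own, chosen for illustration.

```python
import math

SAMPLE_RATE = 16000  # Hz; an assumed value for this sketch


def two_tone(low_hz, high_hz, seconds=0.25):
    """Build a test signal: one low tone plus one high tone, equal amplitude."""
    n = int(SAMPLE_RATE * seconds)
    return [
        math.sin(2 * math.pi * low_hz * i / SAMPLE_RATE)
        + math.sin(2 * math.pi * high_hz * i / SAMPLE_RATE)
        for i in range(n)
    ]


def low_pass(samples, cutoff_hz):
    """First-order (one-pole) low-pass filter: attenuates high frequencies."""
    dt = 1.0 / SAMPLE_RATE
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out = []
    y = 0.0
    for x in samples:
        y += alpha * (x - y)  # move the output a fraction of the way to the input
        out.append(y)
    return out


def high_pass(samples, cutoff_hz):
    """Complementary high-pass: the input minus its low-passed version."""
    low = low_pass(samples, cutoff_hz)
    return [x - l for x, l in zip(samples, low)]


def tone_level(samples, freq_hz):
    """Estimate the amplitude of one frequency by correlating with sine and cosine."""
    n = len(samples)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / SAMPLE_RATE)
             for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
             for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / n


if __name__ == "__main__":
    signal = two_tone(300, 3000)
    for name, filtered in [("low-pass", low_pass(signal, 1000)),
                           ("high-pass", high_pass(signal, 1000))]:
        print(name,
              "300 Hz level:", round(tone_level(filtered, 300), 3),
              "3000 Hz level:", round(tone_level(filtered, 3000), 3))
```

Running the script prints the level of each tone after each filter. The low-pass output is dominated by the 300 Hz tone and the high-pass output by the 3,000 Hz tone, which loosely mimics how small speakers or a filtered demo can shift the balance between the low and high parts of the signal.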
We’ve enjoyed playing around with this auditory Rorschach test. What do you hear?