Siri, the Apple Inc. personal assistant application that gained mass popularity after the release of the iOS software and the iPhone 4S, is not only an intelligent technological advance but also a captivating voice. Users simply hold the home button of their Apple device, ask Siri a question, and she answers in her robotic tone. The voice is a form of artificial intelligence; the Apple product is programmed to recognize human vocal cues and respond using synthesized language. As a consequence, regardless of the context of the user’s question, the sound produced by Siri is often very repetitive. Because Siri is essentially a robot, her voice lacks the emotional, rhythmic, and other culturally acquired sound characteristics prevalent in the human voice. This exhibit analyzes a captivating voice that elicits a response from listeners because it deviates from the norm of a particular cultural understanding. Thus, Siri becomes vocally intriguing through her lack of emotion, lack of understanding of social contexts, and lack of some paralinguistic qualities, all of which cause her voice to differ from the norm of the human voice.
The following link shows visually how Siri is integrated with Apple products.
More specifically, the audible vocal qualities that make Siri’s sound captivating and different from the norm establish her inhuman nature; the fact that her voice is robotic intrigues users. When the software was initially released, Siri was hugely popular among iPhone owners; because Apple’s products were so widespread, a synthesized voice suddenly became a common feature of the user’s daily life. Siri has been featured giving directions to people through a GPS in television shows and movies. Autistic children have been found to develop a particular interest in her voice because she never grows frustrated with answering their questions, never stops being a personal assistant, and does not exhibit the human emotions that may cause them distress or confusion. As competing companies strive to market their products against those of Apple, they have realized that they too must integrate an electronic vocal assistant with their technology. The concept of an artificially intelligent assistant that understands audible input from a human user, responds using synthesized sounds, and provides relevant information is thus a timely topic.
Siri is not alluring merely because she is a convenient piece of technology; she exhibits certain unusual qualities that shape her robotic voice. The following five examples feature Siri responding to human questions. In each instance, different vocal characteristics shape her sound. To summarize, Siri declines to answer a question about her personal music preferences because she is not human. Her voice is unable to express musicality and rhythm when she repeats the lyrics to a popular song. She calmly provides directions to users without frustration. As Siri speaks, certain electronic noises ornament her words and make it clear that the software is processing input data. Lengthy pauses occur when the user speaks to Siri and she is unable to process her answer immediately. Therefore, each example provides insight into how Siri lacks emotion, lacks an understanding of social contexts, and lacks paralinguistic features. The differences between Siri’s voice and a human voice cause her to become captivating.
This particular sound clip involves a human iPhone user asking Siri a question about her personal music preference; she responds in her unenthusiastic monotone, leaving the user to understand that she is only a robotic assistant incapable of human emotion or artistic taste. The user first asks, “Siri, do you like music?” The speaker emphasizes the words “Siri” and “music” to denote their significance in the phrase. She also pauses between the words “Siri” and “do” to indicate that the question is aimed at Siri, and that the following words signify a continued thought. Siri’s response contrasts with the human voice; she states, “I really have no opinion.” Interestingly, every word in her sentence is produced at the same pitch, and the pauses between words are similar in length. Siri thus lacks the basic paralinguistic features that the human speaker produced. Furthermore, the content of her answer reinforces the idea that she is an unfeeling machine. Because she is a voice produced by a device, she has “no opinion.” Siri differs from a normal human voice in that she cannot possibly hold musical preferences as a human does. To express this idea, she is programmed by Apple to produce reasonable, unemotional, and robotic responses to personal questions users may ask. After the user asks Siri if she can play Beyoncé, she responds, “playing Beyoncé shuffled.” More contrasts are presented. Because the user asks a question, the end of her sentence rises to a higher pitch. Siri, however, once again confirms in the same flat tone that she will in fact play Beyoncé; she has no emotional reaction to or interest in the music itself, but can serve as a technological application. After this sequence, a Beyoncé song begins; the piece provides a stark contrast to the electronic pronunciations of Siri, as Beyoncé exemplifies human emotion, pitch, and artistic interpretation.
Therefore, this clip demonstrates that Siri differs from the norm in that she does not have a social opinion, does not speak with the same linguistic techniques as the user, and cannot express emotion.
https://www.youtube.com/watch?v=jofNR_WkoCE
When the user asks, “Siri, What does the fox say?” she responds, “Fraka-kaka-kaka-kaka-kow!” The speaker references the viral electronic dance song performed by the Norwegian comedy group Ylvis. Siri is programmed by Apple to understand this particular pop-culture reference; it would seem that Apple has included certain humorous elements in Siri’s knowledge base to entertain users. However, to analyze how Siri’s response differs from a normal human response, one must consider the context of the song. A link to the viral YouTube video of “The Fox (What Does the Fox Say?)” by Ylvis is provided above. In the original song, the performers dance around in lively costumes and enthusiastically sing, “Fraka-kaka-kaka-kaka-kow!” with a particular rhythmic element to complement the background electronic beats. In addition to the notable rhythm of the lyrical phrase, the pitch is very high, and the final syllable “kow” is highly emphasized. Siri’s cool, single-toned voice contrasts with the musicality of the song lyrics by Ylvis. She does pronounce the syllables of the phrase, “Fraka-kaka-kaka-kaka-kow!” but her pronunciation is much quicker, does not change pitch, and contains no rhythmic element. She is simply a synthesized voice that does not demonstrate a musical or cultural understanding of the song. The primary unusual feature of her voice in this particular instance, and the one that catches the listener’s attention, is her complete lack of emotional musicality.
This particular clip involves Siri giving directions: she states, “Starting route to Cameron Indoor Stadium. Head East on Pace Street, then turn right onto Oregon Street.” Siri’s voice is integrated with the Apple directions feature, so each time the user asks their GPS device how to get to a location, Siri calmly tells the user how to reach their destination as they drive. Her vocal directions are mesmerizing because they are calmly stated. If a human were to give directions, they might use vocal fillers like “ok” or “um,” or pause while speaking. Because the machine is programmed to know exactly how to tell the user where to move next, no hesitation is present. Furthermore, because Siri does not possess any emotions, she will never become frustrated with the user for not following her exact directions. The beginning of the clip is also linguistically interesting because Siri does not pronounce “Cameron” in the traditional way, CAM-ruhn. Instead she says, “CAM-er-on.” The listener may find this pronunciation odd because it clearly demonstrates that Siri’s voice is programmed to produce certain vowel and consonant sounds, and that flaws in the technology may fail to account for a particular context. This causes her voice to become much more robotic, and even otherworldly. Her electronic voice thus has certain glitches that make it obvious that the sound comes not from a human but from a machine that does not know how “Cameron” is pronounced. Therefore, in this excerpt, Siri’s voice captivates the audience both through her calm, instructive words and through the clear audible evidence that she is a robot.
In this audio clip, the iPhone user asks, “Siri, what’s the weather like?” She responds: “OK, here’s the weather through Thursday.” The sequence of sound presents an electronic button click, the user’s question, another button click, a pause, and then finally Siri’s response. This series of noises exemplifies the electronic qualities of Siri’s interactions and further emphasizes her robotic tone. The initial ding of the iPhone audibly denotes that the questioner is using a device. After the question is posed, the second button click precedes her response. These electronic noises that supplement her words would not normally occur in a conversation between two human beings. In a way, the short clicking noises ornament the conversation and add an electronic element that makes the interface easier to use. When she states the word “OK,” she places equal emphasis on both syllables and takes the same amount of time to pronounce the “O” and then the “K.” This very specific method of speaking the word differs greatly from how a typical person would employ the transitional phrase. Siri thus seems to lack the linguistic understanding that “OK” is typically a casual word rather than one pronounced so formally; nevertheless, she uses it to call the user’s attention before presenting her weather information. Additionally, the interaction between the user and Siri is very succinct. It is clear that the question can be easily and quickly answered by the iPhone technology, and thus, not much conversation is needed. In this way, Siri’s voice reflects the speed of technology and the ease of an Internet search for the weather forecast. The alluring voice of Siri is developed through her unusual use of a transitional phrase, her speed of response, and the electronic ornamentation that further emphasizes her robotic sound.
The iPhone user asks, “Siri, why is the sky blue?” and a lengthy pause ensues, after which Siri states, “I’m really sorry about this but I can’t take any requests right now. Please try again in a little while.” The clip is similar to the preceding four examples because Siri’s repetitive monotone directly contrasts with a human question in which certain words are inflected through rhythm and pitch. However, the pause in this example is important to note. After the question is asked, the pause seems to last for a very long time. If two humans were having a conversation, the pause between question and answer would be much shorter. The pause thus makes it obvious to the listener that Siri’s voice is produced by a technological device; the break occurs because the machine needs time to process the user’s vocal input before providing a response. Because the software may not understand the words the user pronounced, may not know where to look for the answer to the question, or simply may not be connected to the Internet, the pause implies that the phone needs time to load a response. In this way, Siri lacks the social understanding of the typical response time that a human speaker would observe. Furthermore, when Siri says the words “little while,” she once again places equal emphasis on both words, a pattern that would not normally occur in human conversation. The sequence of the conversation demonstrates that Siri is captivating essentially because she is inhuman; a robotic voice lacks emotion, lacks certain linguistic understanding, and lacks the social cue that it would be inappropriate to take so much time responding to a question.