
Cortana for all: Microsoft’s plan to put voice recognition behind anything

Microsoft and co. make computer vision, voice, and text processing a Web request away.

Sean Gallagher – May 15, 2015 10:00 am

Rampancy? No. All these Cortanas are coming to put artificial intelligence in all your things if Microsoft's Project Oxford pays off. Credit: Microsoft

When Microsoft introduced the Cortana digital personal assistant last year at the company’s Build developer conference, the company already left hints of its future ambitions for the technology. Cortana was built largely on Microsoft’s Bing service, and the Cortana team indicated those services would eventually be accessible to Web and application developers.

As it turns out, eventually is now. Though the most important elements are only available in a private preview, many of the machine learning capabilities behind Cortana have been released under Project Oxford, the joint effort between Microsoft Research and the Bing and Azure teams announced at Build in April. And at the conference, Ars got to dive deep on the components of Project Oxford with Ryan Galgon, the senior program manager at Microsoft Technology and Research shepherding the project to market.

The APIs make it possible to add image and speech processing to just about any application, often by using just a single Web request. “They’re all finished machine learning services in the sense that developers don’t have to create any model for them in Azure,” Galgon told Ars. “They’re very modular.” All of the services are exposed as representational state transfer (REST) Web services based on HTTP “verbs” (such as GET, PUT, and POST), and they require an Azure API subscription key. To boot, all the API requests and responses are encrypted via HTTPS to protect their content.

Currently, the Project Oxford services are free to try for anyone with an Azure account, though there are limits on the rate of usage. While their idiosyncrasies are worked out, the services can be used through software developer kits for a number of platforms as well as through Microsoft’s Azure, bringing speech-to-text, text-to-speech, computer vision, and facial recognition capabilities to virtually any application: mobile, Web, or otherwise.

For now, the missing piece is the intelligence that can take text and speech interactions for applications to the next step. That capability is wrapped in what Microsoft calls LUIS (Language Understanding Intelligent Service), a text-processing capability that will be able to determine user intent from a string of text whether it’s typed or spoken. LUIS identifies “entities” within text such as names, dates and times, actions, concepts and things, and the service can be wired into cloud applications to perform the appropriate task.

Things aren’t perfect yet, but they demonstrate that Microsoft is trying to put its tools at the center of the next wave of new applications. Cortana aims to serve the mobile world and smart devices that will have no keyboards, mice, or even screens. When combined with the rest of the capabilities and interfaces being provided to developers through Azure and Bing, the approach makes a pretty strong case for Microsoft’s continued relevance. Even if the Windows desktop moves from being the star of the show to a supporting role, it seems the many faces of Cortana are primed for the spotlight.

A quick trip through the public release of Project Oxford APIs.

Eye in the cloud

Two of the four sets of services in Project Oxford are focused on image processing. The first, the Face API, was partially demonstrated in Microsoft’s sample How-old.net application, which guesses at the age of people whose faces are within an uploaded photo. The API “provides a set of detection, verification, proofing, and identification services,” Galgon said, and it works by analyzing facial geometry and applying rules built through the machine-learning process. The Face service has been trained to guess the age and gender of subjects it identifies with fair accuracy, and it can also perform facial recognition of an individual, either by matching photos of the same person in a collection or by comparing against a pre-loaded “person” identity.

The input for the Face API is an HTTP POST request that includes the image (a JPG, GIF, PNG, or BMP file) being analyzed. Each type of processing request carries the photo either as a binary object sent as “application/octet-stream” data or as a URL, wrapped in JavaScript Object Notation (JSON), pointing to a Web-accessible image. Along with the image, the request includes a set of parameters instructing the API on which information to return.

For example, using my image from my author page on Ars to get facial geometry data, a guess at my gender and age, and analysis of my head pose (with estimated pitch, yaw, and roll away from the dead-on view), my app would send the following:

POST face/v0/detections?analyzesFaceLandmarks=true&analyzesAge=true&analyzesGender=true&analyzesHeadPose=true
Content-Type: application/json
Host: api.projectoxford.ai
Ocp-Apim-Subscription-Key: ••••••••••••••••••••••••••••••••
Content-Length: 81

{ "url":"http://meincmagazine.com/wp-content/uploads/authors/Sean-Gallagher.jpg" }

And the Web service returns the following in JSON, accurately guessing my gender and slightly underestimating my age. It also provides a Face ID that can be used to check any other images against later—an ID that Azure retains for up to 24 hours:

[{"faceId":"ade42988-fd58-4422-b207-688e6a0d417d","faceRectangle":{"top":111,"left":62,"width":137,"height":137},"faceLandmarks":{"pupilLeft":{"x":107.8,"y":144.8},"pupilRight":{"x":167.4,"y":155.0},"noseTip":{"x":121.3,"y":182.6},"mouthLeft":{"x":96.6,"y":203.8},"mouthRight":{"x":156.0,"y":215.5},"eyebrowLeftOuter":{"x":84.0,"y":134.2},"eyebrowLeftInner":{"x":120.3,"y":139.4},"eyeLeftOuter":{"x":95.3,"y":145.5},"eyeLeftTop":{"x":105.8,"y":141.4},"eyeLeftBottom":{"x":104.6,"y":150.5},"eyeLeftInner":{"x":114.3,"y":148.4},"eyebrowRightInner":{"x":146.9,"y":143.5},"eyebrowRightOuter":{"x":190.9,"y":151.1},"eyeRightInner":{"x":153.7,"y":155.2},"eyeRightTop":{"x":164.4,"y":151.9},"eyeRightBottom":{"x":163.8,"y":160.8},"eyeRightOuter":{"x":174.5,"y":158.2},"noseRootLeft":{"x":124.4,"y":151.8},"noseRootRight":{"x":137.1,"y":153.7},"noseLeftAlarTop":{"x":117.7,"y":170.8},"noseRightAlarTop":{"x":138.1,"y":174.0},"noseLeftAlarOutTip":{"x":109.3,"y":182.7},"noseRightAlarOutTip":{"x":144.1,"y":187.9},"upperLipTop":{"x":121.7,"y":207.1},"upperLipBottom":{"x":121.4,"y":211.2},"underLipTop":{"x":120.9,"y":213.3},"underLipBottom":{"x":119.4,"y":219.4}},"attributes":{"headPose":{"pitch":0.0,"roll":10.7,"yaw":-13.4},"gender":"male","age":44}}]

How Microsoft’s How-old.net presents the same data spit out by the raw API call.
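The exchange above can be sketched in a few lines of Python. This is a minimal illustration rather than Microsoft's SDK: the function names are made up for this example, the HTTPS scheme is assumed from the article's note that all requests are encrypted, and the request is only built, not sent, since sending requires a valid subscription key.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Host and path copied from the sample request; the https scheme is an assumption.
FACE_DETECT_URL = "https://api.projectoxford.ai/face/v0/detections"


def build_face_request(image_url, subscription_key):
    """Build (but do not send) a Face detection request for a hosted image."""
    query = urlencode({
        "analyzesFaceLandmarks": "true",
        "analyzesAge": "true",
        "analyzesGender": "true",
        "analyzesHeadPose": "true",
    })
    body = json.dumps({"url": image_url}).encode("utf-8")
    return Request(
        f"{FACE_DETECT_URL}?{query}",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": subscription_key,
        },
    )


def summarize_faces(response_text):
    """Reduce the JSON array of detected faces to the guessed attributes."""
    return [
        {
            "faceId": face["faceId"],
            "age": face["attributes"]["age"],
            "gender": face["attributes"]["gender"],
            "yaw": face["attributes"]["headPose"]["yaw"],
        }
        for face in json.loads(response_text)
    ]
```

Passing the built request to urllib.request.urlopen would send it, and the response body could then go straight into summarize_faces.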

The Project Oxford Face API can be used for facial matching and recognition in a number of more sophisticated ways. It can be trained on specific faces, creating facial identity profiles (which can also be zapped remotely when no longer needed with a DELETE request via REST). Using the facial geometry data associated with an image or an identity, the Face service can also do group processing. And in a fashion similar to Facebook’s automatic image tagging, Face identifies the individuals in each photo, returning face box data and identification data in JSON format.

All of this puts facial recognition within the grasp of all sorts of application developers. The capability is delivered in a form that may not exactly be real time, but it’s fast enough for some interesting applications in the mobile and Internet-of-things space. During the beta, the Face API is free to use, but it limits each account to 20 face ID transactions per minute and 5,000 transactions per month. When it’s rolled out in full, subscribers will be allowed to use the service in volume and pay as they go.
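An application that wants to stay inside those beta caps could keep count on the client side. The sketch below is one way to do that, not anything Microsoft ships: the class name is hypothetical, and it assumes the per-minute cap behaves like a sliding 60-second window, which the article does not specify.

```python
import time
from collections import deque


class BetaQuota:
    """Client-side guard for the beta caps quoted above (20/minute, 5,000/month)."""

    def __init__(self, per_minute=20, per_month=5000, clock=time.monotonic):
        self.per_minute = per_minute
        self.per_month = per_month
        self.clock = clock
        self.recent = deque()   # timestamps of calls made in the last 60 seconds
        self.month_total = 0    # the caller resets this when a new month starts

    def try_acquire(self):
        """Return True and record a call if both caps still allow one."""
        now = self.clock()
        # Forget calls that have aged out of the assumed 60-second window.
        while self.recent and now - self.recent[0] >= 60:
            self.recent.popleft()
        if len(self.recent) >= self.per_minute or self.month_total >= self.per_month:
            return False
        self.recent.append(now)
        self.month_total += 1
        return True
```

The injectable clock exists only to make the guard easy to test; a real client would leave the default in place and call try_acquire before each API request.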

The second set of image processing services is bundled under Project Oxford’s Computer Vision API, a service that can recognize objects within an image and categorize the image itself based on content. Like the Face API, it takes input either directly as an image or as a URL pointing to an image on another server.

Computer Vision leans heavily on Microsoft Research’s and Bing’s investments in semantic processing and entity extraction, which we first highlighted three years ago. Microsoft extended this heavily during the development of Cortana.

Using the context of an image based on the types of objects detected, the Project Oxford services can extract “topic identifiers” or categories from the image based on a hierarchy of 86 different concepts—including whether the picture is a cityscape, a picture of an animal, a crowd shot or a portrait, a rainbow or a church window. It can also distinguish whether an image is an animated GIF, a line drawing, or clip-art content, and it distinguishes black-and-white photos from color.

A graphical map of the Project Oxford image analysis API categories. Credit: Microsoft

Of course, the quality of this categorization depends a lot on the source image. For example, I fed Computer Vision a (pretty poor) photo of the crowd at Build to see what analysis it would return. It categorized the image as “other,” “outdoor,” and “text_signs.” It identified the large interior of the Moscone Center as an outdoor area and picked up on the signs on the walls. It also picked up on two male faces in the crowd, providing rectangles for their locations in the image by pixel and identifying the dominant colors of the image as black and grey and the hex color value 1583B6 as a highlight color.

The crowd shot I sent to the Project Oxford Computer Vision API.

Based on the “safe search” intelligence built for Bing, Computer Vision services can also determine whether an image has adult or “racy” content, which lets a site developer tap into the service for content moderation, for example. The adult scoring of an image is part of the JSON returned by the service when an image is analyzed. Here, for example, are the results from the photo above:

"adult": { "isAdultContent": false, "isRacyContent": false, "adultScore": 0.010505199432373047, "racyScore": 0.020516680553555489 },
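A moderation hook built on that fragment might look like the following sketch. The field names come from the response above; the function name and the 0.5 thresholds are illustrative assumptions, not values documented by Microsoft.

```python
import json


def needs_review(adult_block, adult_threshold=0.5, racy_threshold=0.5):
    """Flag an image when either verdict is true or either score is high.

    The thresholds are illustrative assumptions, not Microsoft's values.
    """
    return bool(
        adult_block["isAdultContent"]
        or adult_block["isRacyContent"]
        or adult_block["adultScore"] >= adult_threshold
        or adult_block["racyScore"] >= racy_threshold
    )


# The "adult" object returned for the crowd photo.
crowd_photo = json.loads("""
{ "isAdultContent": false, "isRacyContent": false,
  "adultScore": 0.010505199432373047, "racyScore": 0.020516680553555489 }
""")

print(needs_review(crowd_photo))  # False
```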

The Computer Vision API also includes optical character recognition. It can identify text in any of 21 different languages within an image, correcting for text tilted at angles of up to 40 degrees or flipped upside-down, and return the text to the program that sent the image. The text is returned as JSON content, along with bounding box and angle data. This can fairly easily be assembled into a normal text string for processing, since each line of text is returned as a separate JSON structure:

{ "language": "en", "textAngle": 7.699999999999962, "orientation": "Up", "regions": [ { "boundingBox": "466,204,1167,496", "lines": [ { "boundingBox": "613,204,619,88", "words": [ { "boundingBox": "613,221,239,71", "text": "Download" }, {"boundingBox": "871,235,30,37", "text": "+" },{"boundingBox": "919,213,164,76", "text": "Expert"}, {"boundingBox": "1093,204,139,60", "text": "Zone"} ] }, {"boundingBox": "466,633,1167,67", "words": [ { "boundingBox": "466,638,74,43", "text": "Two"}, {"boundingBox": "549,635,95,46", "text": "birds."}, {"boundingBox": "655,636,82,45", "text": "One"}, { "boundingBox": "748,640,112,42", "text": "stone."}, {"boundingBox": "872,635,72,47", "text": "Get"},{"boundingBox": "951,634,67,49","text": "the"},{"boundingBox": "1029,634,73,50", "text": "bits"}, {"boundingBox": "1116,634,79,50","text": "and"}, {"boundingBox": "1210,640,114,44", "text": "meet"},{"boundingBox": "1333,633,74,52", "text": "the"}, { "boundingBox": "1423,633,210,67", "text": "speakers."} ] } ] } ] }
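Reassembling plain text from that structure takes only a nested loop over regions, lines, and words. The sketch below uses a trimmed-down copy of the response (bounding boxes dropped); the function name is made up for this example.

```python
def ocr_lines(result):
    """Join each line's words with spaces, keeping the service's line order."""
    return [
        " ".join(word["text"] for word in line["words"])
        for region in result["regions"]
        for line in region["lines"]
    ]


# A trimmed copy of the OCR response shown above.
sample = {
    "language": "en",
    "regions": [{
        "lines": [
            {"words": [{"text": "Download"}, {"text": "+"},
                       {"text": "Expert"}, {"text": "Zone"}]},
            {"words": [{"text": "Two"}, {"text": "birds."},
                       {"text": "One"}, {"text": "stone."}]},
        ],
    }],
}

print("\n".join(ocr_lines(sample)))
# Download + Expert Zone
# Two birds. One stone.
```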

Another feature of the Computer Vision API is “smart thumbnail cropping.” When an image is sent to the API by an HTTP POST message along with the dimensions of a desired crop, the Computer Vision service analyzes the content and returns what its contextual rules determine to be the best effort at preserving the most important content for the size. “Building on some of the other Oxford APIs, it tries to keep the most important content front and center,” Galgon explained.

I’d love to unleash Face and Computer Vision on my iPhoto and Pictures libraries today to do all the image annotation I never did over the past 10 years, but restrictions on usage for the free Computer Vision API during the Project Oxford beta are the same as the Face API’s. The cap is 20 transactions per minute and 5,000 transactions per month. I’ll likely wait until Microsoft allows for bulk pay-per-use operation of the API (at the Machine Learning rate of $0.50 per 1,000 API transactions) to achieve photographic closure.
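The trade-off is easy to put numbers on. The following back-of-the-envelope sketch uses the cap and the rate quoted above; the 30,000-photo library and the assumption of one API transaction per photo are hypothetical.

```python
import math

FREE_PER_MONTH = 5000    # beta cap quoted above, in transactions per month
PRICE_PER_1000 = 0.50    # Machine Learning rate quoted above, in US dollars


def months_on_free_tier(photo_count, calls_per_photo=1):
    """Months needed to process a library without exceeding the free cap."""
    return math.ceil(photo_count * calls_per_photo / FREE_PER_MONTH)


def pay_per_use_cost(photo_count, calls_per_photo=1):
    """Cost in dollars at the quoted pay-as-you-go rate."""
    return photo_count * calls_per_photo * PRICE_PER_1000 / 1000


library = 30000  # hypothetical photo library size
print(months_on_free_tier(library))  # 6
print(pay_per_use_cost(library))     # 15.0
```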

Talk to me

Project Oxford’s Speech APIs are the third piece of what’s available today for free from Azure. They consist of two main services: speech to text and text to speech. Two additional services—speech and text intent detection—are part of the private LUIS preview, so we were unable to test them.

All of these services are based on the Bing Voice APIs used by Cortana. Right now, they support 18 different “locales”: German, Korean, Italian, Japanese, Russian, two Spanish dialects (Spain and Mexico), two French dialects (France and Quebecois), three Chinese variants (Hong Kong Cantonese, plus mainland Chinese and Taiwanese Mandarin), five flavors of English (US, UK, Australian, Canadian, and Indian), and Brazilian Portuguese.

In the case of speech to text, audio has to be encoded in WAV format using one of three codecs: Pulse Code Modulation (PCM) single channel, Polycom’s Siren, or SirenSR (a speech recognition-optimized version of the Polycom Siren codec). Audio can be streamed live via a “microphone client” application (such as the live demo on Project Oxford’s webpage, shown in the video earlier in this story) via Chunked Transfer Encoding, or sent as a recorded file from a “data client.”
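Of the three codecs, single-channel PCM is the one Python's standard library can produce and inspect directly. The sketch below writes a mono PCM WAV into memory and reads back its channel count and duration; the 16 kHz sample rate and 16-bit sample width are assumptions for illustration, not requirements stated by Microsoft.

```python
import io
import wave


def make_silent_wav(seconds, sample_rate=16000):
    """Return the bytes of a mono, 16-bit PCM WAV file containing silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)          # single channel, as the service expects
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


def describe_wav(data):
    """Return (channels, duration in seconds) for PCM WAV bytes.

    The wave module raises wave.Error if the data is not PCM-encoded.
    """
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnchannels(), wav.getnframes() / wav.getframerate()


print(describe_wav(make_silent_wav(2)))  # (1, 2.0)
```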

Either way, speech recognition starts with an HTTP POST request from the client app (or a call from one of the native software developer kits) that sends information on how the audio will be sent. This includes other parameters—locale, the device operating system, and whether the recognition is going to be in the form of an “utterance” (less than 15 seconds of speech) or in the “LongDictation” mode with up to two minutes of speech to be streamed to the server. Optionally, the request can include a parameter that requests screening for words on a profanity blacklist. Offending profanity is then masked with asterisks after the initial letter.
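Two of those rules are simple enough to mirror locally. The sketch below picks a recognition mode from a clip's length and reproduces the masking behavior described above; it is a client-side illustration, not the service's implementation, and the function names and the lowercase "utterance" label are hypothetical.

```python
import re


def pick_mode(duration_seconds):
    """Choose a recognition mode from the limits described above."""
    if duration_seconds < 15:
        return "utterance"        # hypothetical label for the short mode
    if duration_seconds <= 120:
        return "LongDictation"
    raise ValueError("clips longer than two minutes are not supported")


def mask_profanity(text, blacklist):
    """Replace all but the first letter of blacklisted words with asterisks."""
    def mask(match):
        word = match.group(0)
        if word.lower() in blacklist:
            return word[0] + "*" * (len(word) - 1)
        return word

    return re.sub(r"[A-Za-z']+", mask, text)


print(pick_mode(10))                            # utterance
print(mask_profanity("well darn it", {"darn"}))  # well d*** it
```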

As with Cortana, results of voice recognition can be returned live as they are spoken so the user can see them. Display of partial results depends on local code in the application using “partial result” event handlers. As the speech is processed, the Bing speech APIs use entity discovery to try to assemble the semantic meaning of the recognized text and correct the results, streaming back changes to previous text until the speech recognition is complete.

The final product of the speech recognition service is a JSON data structure that can include the text in several forms: the “lexical” form, which is the raw, unadulterated speech recognition result in text; or various flavors of “display text” with a best guess at capitalization, punctuation, conversion of number words to numerals, and application of common abbreviations such as “Mr.” for “mister” and “St.” for “street.” In cases where there may be multiple interpretations of the speech processed, each will be weighted with a confidence score.
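Choosing among competing interpretations then comes down to comparing confidence scores. The sketch below assumes a simplified, hypothetical result shape (a list of objects with "lexical", "display", and "confidence" keys); the actual field names in the service's JSON are not given in this article and may differ.

```python
def best_display_text(results):
    """Return the display text of the highest-confidence interpretation."""
    best = max(results, key=lambda result: result["confidence"])
    return best["display"]


# Hypothetical interpretations of one utterance; keys are illustrative only.
candidates = [
    {"lexical": "meat mister smith on main street",
     "display": "Meat Mr. Smith on Main St.", "confidence": 0.42},
    {"lexical": "meet mister smith on main street",
     "display": "Meet Mr. Smith on Main St.", "confidence": 0.91},
]

print(best_display_text(candidates))  # Meet Mr. Smith on Main St.
```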

The text-to-speech piece of Project Oxford is more straightforward. It is essentially a cloud-based version of Microsoft’s previously client-based text-to-speech API. The request is an HTTP POST to Bing’s speech service with parameters for the output format in the request header and the text to be turned to speech formatted in Speech Synthesis Markup Language (SSML). A request looks like this:

POST /synthesize HTTP/1.1
Host: speech.platform.bing.com
X-Microsoft-OutputFormat: riff-8khz-8bit-mono-mulaw
Content-Type: text/plain; charset=utf-8
Content-Length: 197

<speak version='1.0' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)'>Crush your enemies. See them driven before you. Hear the lamentations of their women.</voice></speak>

The response is sent back as a mono WAV-formatted binary file in one of four supported formats, two 8-bit and two 16-bit. You can currently get the audio in male or female voices (though Canadian French, Australian English, and Canadian English are only supported with female voices right now, and Brazilian Portuguese, Italian, Mexican Spanish, and Indian English only have male voices).
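A text-to-speech request like the one above can be assembled with Python's standard library. This is a sketch under stated assumptions: the function names are invented for the example, the HTTPS scheme is assumed, authentication is left out because the sample request shows none, and the request is built but not sent.

```python
from urllib.request import Request
from xml.sax.saxutils import escape, quoteattr

# Host and path from the sample request; the https scheme is an assumption.
TTS_URL = "https://speech.platform.bing.com/synthesize"


def build_ssml(text, locale, gender, voice_name):
    """Wrap text in the SSML envelope used in the sample, escaping XML characters."""
    return (
        f"<speak version='1.0' xml:lang={quoteattr(locale)}>"
        f"<voice xml:lang={quoteattr(locale)} xml:gender={quoteattr(gender)} "
        f"name={quoteattr(voice_name)}>{escape(text)}</voice></speak>"
    )


def build_tts_request(ssml, output_format="riff-8khz-8bit-mono-mulaw"):
    """Build (but do not send) the POST request carrying the SSML body."""
    return Request(
        TTS_URL,
        data=ssml.encode("utf-8"),
        method="POST",
        headers={
            "X-Microsoft-OutputFormat": output_format,
            "Content-Type": "text/plain; charset=utf-8",
        },
    )


ssml = build_ssml(
    "Tom & Jerry", "en-US", "Female",
    "Microsoft Server Speech Text to Speech Voice (en-US, ZiraRUS)",
)
request = build_tts_request(ssml)
```

Escaping matters here because the text sits inside XML; an unescaped ampersand or angle bracket in the spoken text would make the SSML body malformed.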

While text-to-speech is fairly straightforward and finished, the most intelligent piece of Project Oxford’s language processing services—and the part least ready for prime time—is LUIS. “Just as the speech to text will transcribe audio,” Galgon said, “LUIS can take a string of text and convert to a structured intent output.” Developers will be able to build models for cloud applications based on LUIS to process strings of text to determine what content to display or commands to execute for the user.

Galgon used the example of a news service to illustrate what LUIS will be capable of. “If I say, ‘tell me about flight delays,’” he said, “LUIS will process the text or speech and pull entities out of it to understand the intent. The topic is ‘flight delays’ and the intent is ‘find news.’” For those who have access to the private beta test of LUIS, Galgon said there’s an interactive website that helps developers build models for these sorts of applications in a fashion similar to the way Azure currently allows developers to build machine learning-based analytical services.
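Since LUIS is still in private preview, its output format is not public. The sketch below therefore uses an invented structure (an "intent" string plus an "entities" mapping) purely to show how an application might route Galgon's example once it has a structured result; every name in it is hypothetical.

```python
def find_news(topic):
    """Stand-in handler; a real app would query a news source here."""
    return f"Searching news for: {topic}"


# Map intent names to handlers. Both the intent label and the
# entity key below are hypothetical, not LUIS's actual schema.
HANDLERS = {"find news": find_news}


def dispatch(luis_output):
    """Route a structured intent result to the matching handler."""
    handler = HANDLERS.get(luis_output["intent"])
    if handler is None:
        return "Sorry, I can't help with that yet."
    return handler(luis_output["entities"]["topic"])


# What "tell me about flight delays" might reduce to.
print(dispatch({"intent": "find news", "entities": {"topic": "flight delays"}}))
# Searching news for: flight delays
```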

By identifying and tracking the sorts of entities developers have used in early tests, Galgon added, Microsoft Research has been able to identify some of the more important entities that should be streamlined in order to build these sorts of intent-driven applications. Calendar and time-related entities (“DateTime”) and people’s names have surfaced as being most important in many of them, he said—not just for doing things like managing appointments, but defining time-driven delivery of content to others.

The Internet of things without keyboards

The early targets for the Project Oxford services are clearly mobile devices. The software developer kits released by Microsoft—in addition to those supporting .NET and Windows—include speech tools for iOS and Android, and face and vision tools for Android. The REST APIs can be adapted for any platform.

But it’s also clear that Microsoft is thinking about other devices that aren’t traditional personal computers—devices that generally fall under the banner of Internet of Things. “Especially if you have a device that you’re not going to hook up a mouse or keyboard to,” Galgon said, “to have a language model behind it that can process intent and interactions is… very powerful.”

The long-term result could be that developers of all sorts of devices build speech and computer vision into their products, delivering the equivalent of Cortana on everything from televisions to assembly line equipment to household automation systems. All such implementations would be customized to specific tasks and backed by cloud-based artificial intelligence. Some of the components of projects in fields such as cloud robotics could easily find their way into the Azure and Bing clouds.

By making the Project Oxford services as accessible as possible, Microsoft is positioning Azure and Bing to become the cloud platform for this new world of smart products. And ironically in the process, Windows could become even more relevant… as a development platform in a “post-Windows” world.

