With the development and maturing of speech compression/decompression and recognition, speech processing is becoming an important form of man-machine interface. Currently, most such systems in laboratories are large and complex, and are usually implemented on computer platforms.
The primary focus of these speech recognition systems is large-vocabulary continuous speech. Their speech encoding/decoding algorithms are so complex that they have to depend on a PC's computing power. The hardware requirements of such systems make them ill-suited to portable, low-power and low-cost embedded systems.
To date, the Tsinghua/Infineon Uni-Speech SoC has shown good performance in accurate speech recognition and low-rate, high-quality speech encoding/decoding. However, the cost and power consumption of the current solution limit its potential in portable speech-recognition applications.
This opens up opportunities for Uni-Lite, an SoC that addresses cost and power consumption. The Uni-Lite chip architecture comprises a 16-bit DSP core, on-chip ROM and RAM, an embedded delta-sigma ADC/DAC with their respective I/O analog channels, and other common interfaces. Uni-Lite is an SoC embedded with speech-processing firmware that can be developed into an application without the need for secondary on-board devices.
The Uni-Lite contains not only a DSP core and a codec, but also I/O analog channels that serve the microphone and speakers. It also provides on-chip RAM and ROM, while communication with external devices is handled through the on-chip UART, GPIO lines and SPI. With the exception of the power supply, microphone and speakers, all the system hardware components are integrated into a single chip (Figure 1, below).
|Figure 1: The Uni-Lite contains not only a DSP core and codec, but also I/O analog channels that serve the microphone and speakers. Alongside the on-chip RAM and ROM, communication with external devices is handled through the on-chip UART, GPIO lines and SPI.|
The OAK DSP core is a 16-bit (data and program buses) high-performance fixed-point DSP core. All the I/O units are connected to it through intermediary digital logic.
The DSP core is designed to operate at up to 104MHz. Because the design targets portability, however, the operating speed of the DSP can be programmed to various rates, from complete hibernation to anywhere between 1MHz and 104MHz. In terms of computational capability, the DSP core provides a throughput of 1 Dhrystone MIPS per MHz of operation.
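The relationship between workload and clock rate can be sketched in a few lines of Python. This is only an illustration of the arithmetic implied by the figures above (1 MIPS per MHz, 1-104MHz); the function name `clock_for_load` and the assumption that the clock is set in whole-MHz steps are hypothetical and do not come from the chip's documentation.

```python
import math

MIN_CLOCK_MHZ = 1      # lowest programmable rate quoted in the text
MAX_CLOCK_MHZ = 104    # highest programmable rate quoted in the text
MIPS_PER_MHZ = 1.0     # throughput figure quoted in the text


def clock_for_load(required_mips):
    """Return the smallest whole-MHz clock that covers the load.

    Returns 0 when there is nothing to run (the core could hibernate)
    and None when the load exceeds what the core can deliver.
    Whole-MHz steps are an assumption made for this sketch.
    """
    if required_mips <= 0:
        return 0
    mhz = math.ceil(required_mips / MIPS_PER_MHZ)
    if mhz > MAX_CLOCK_MHZ:
        return None
    return max(mhz, MIN_CLOCK_MHZ)
```

For example, a task needing 20 MIPS would call for a 20MHz clock under these assumptions, leaving most of the 104MHz headroom unused and the power budget correspondingly lower.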
On-chip memory consists of RAM and ROM. Most of the program code and data used by the algorithms are stored in ROM to minimize both the silicon area and the cost of the SoC. The application layer resides in RAM to keep the whole system flexible.
The codec consists of a 12-bit on-chip DAC and a 12-bit ADC. The sampling rate of the ADC and DAC can be programmed to either 8kHz or 16kHz. This allows speech processing for different frequency bandwidths. The input analog channel of the ADC has a programmable gain amplifier (PGA) with a gain range of 0dB to 42dB, while the output analog channel of the DAC has a PGA with a range of 0dB to -18dB and a programmable band-pass filter of 2.5-10kHz.
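To put the codec figures in perspective, the short Python sketch below converts the quoted gain limits to voltage ratios and derives the usable bandwidth and resolution from the sampling rate and word length. The helper names are hypothetical; the formulas are the standard decibel, Nyquist and quantization relations rather than anything specific to Uni-Lite.

```python
def db_to_voltage_ratio(gain_db):
    """Convert a gain in dB to a linear voltage (amplitude) ratio."""
    return 10 ** (gain_db / 20)


def nyquist_bandwidth_hz(sample_rate_hz):
    """Highest signal frequency representable at a given sampling rate."""
    return sample_rate_hz / 2


def quantization_levels(bits):
    """Number of distinct codes of an ideal converter with this word length."""
    return 2 ** bits
```

Under these relations, the 42dB input gain corresponds to roughly a 126x amplitude boost, the -18dB output setting to roughly one-eighth amplitude, and the 8kHz and 16kHz sampling rates to 4kHz and 8kHz of audio bandwidth respectively.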
A power management unit provides general-purpose power management and clock-rate control capabilities. Because the power consumption of the system is closely related to the rate of the system clock, the unit can control the clock inputs to all major blocks of the design. It can also put the chip into a state of hibernation until the whole system is "woken up" by one of the defined sources.
A watchdog timer is also available. It enables the system to recover from unexpected endless loops and hang-ups, which enhances the chip's operational reliability. An SPI interface enables the use of external memories that use SPI. In the case of Uni-Lite, the SPI interface serves two purposes. In the first, it is used as a source for downloading firmware during system boot-up.
In the second, the SPI interface functions as a general-purpose interface to external serial-interface memories. In the first (boot) mode, the system loads the first batch of contents of the SPI EEPROM into the internal program RAM during boot-up.
This is done with the aid of either a hardware bootloader or a software bootloader. This dual-bootloader design improves the stability of the system. The GPIO provides general-purpose serial communication and control capabilities, and also serves as a "wake-up" source when the system is in sleep mode. The typical rate of operation is 10MHz.
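The watchdog mechanism mentioned above can be illustrated with a small software model. This is a generic sketch of how watchdog timers usually behave, not a description of the Uni-Lite register interface; the class name, the tick-based timing and the `kick` method are all assumptions made for illustration.

```python
class Watchdog:
    """Toy model of a watchdog: resets the system unless kicked in time."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.counter = timeout_ticks
        self.resets = 0  # how many times the watchdog has fired

    def kick(self):
        """Called by healthy firmware to restart the countdown."""
        self.counter = self.timeout_ticks

    def tick(self):
        """Advance one timer tick; return True if a reset was triggered."""
        self.counter -= 1
        if self.counter <= 0:
            self.resets += 1
            self.counter = self.timeout_ticks
            return True
        return False
```

As long as the main loop calls `kick()` regularly, `tick()` never reports a reset. If the firmware is stuck in an endless loop and stops kicking, the counter runs out and the model records a reset, which is the recovery behavior the hardware timer is there to provide.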
For an SoC designed for portable systems, size and power consumption are the two most important factors to consider in the design. The Uni-Speech SoC, by contrast, was developed specifically for use in advanced stationary speech-recognition applications.
Uni-Lite, which was jointly developed by Infineon Technologies and Tsinghua University, focuses on developing a seamless system that integrates an advanced speech-recognition algorithm and cost-efficient SoC solutions.
Power consumption is tightly correlated to the operating clock rate in any digital system. The DSP core was designed to operate at varying rates, according to the function module. In the case of the real-time G.723 coder, a high operating rate is needed because of the computational complexity of the compression algorithm.
On the other hand, the G.723 decoder requires a much lower operating rate. The algorithm for Mel-Frequency Cepstral Coefficient (MFCC) feature extraction also requires a lower rate. A higher operating rate is needed by the recognition mechanism to minimize the delay after the end of speech. The operating rate of the chip can be modified according to the delay requirement, the scale of the vocabulary and the complexity of the template (Table 1, below).
|Table 1: The operating rate of the chip can be modified according to the delay requirement, scale of vocabulary and complexity of the template.|
The ability to vary the operating rate of the semi-real-time speech application will thus constrain peripheral usage and limit computational requirements. This reduces the power drawn by some portions of the chip, which contributes to the overall power savings. With correct settings in the application software, the entire Uni-Lite SoC can be put into hibernation mode, where it consumes current only in the range of microamperes (Table 2, below).
|Table 2: With correct settings in the application software, the entire Uni-Lite SoC can be put into hibernation mode, where it consumes current only in the range of microamperes.|
A full speech interface, comprising guidance prompts, speech talk-back and speech recognition, is embedded on the SoC.
This software set (Figure 2, below) is composed of endpoint detection, MFCC feature extraction, small-vocabulary speaker-independent recognition and encoding/decoding objects. Other algorithms such as speaker-dependent recognition and speaker identification on this chip are under development.
|Figure 2: The entire software system is partitioned into three levels: application, service and driver.|
The entire software system is partitioned into three levels: application, service and driver. The driver level mainly manipulates the hardware and peripherals of the chip, and serves as a soft device to the upper levels. With this structure, only minor revisions are needed for the whole system to function when the external devices are changed.
The service level contains the basic speech functional objects. The performance of a template-based word-recognition system is very sensitive to variations in the detected endpoints. Endpoint detection is even more difficult on a speech-interface chip, since all of the processing must be done in real time on the hardware.
Two-stage endpoint detection was developed to cope with this difficulty. In the first stage, endpoint detection is based on energy and zero-crossing rate.
This process is simple enough to be time-synchronous. However, it gives only rough active-voice boundaries. Speech frames within these boundaries are then processed, with their features extracted and recorded. This saves storage space by bypassing the silence. It also lowers the DSP's average computing burden, resulting in a lower operating rate.
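A first-stage detector of this kind can be sketched as follows. The article does not give the actual decision rule or thresholds, so the code below is only a minimal illustration under assumptions: a frame is treated as active when either its mean energy or its zero-crossing rate exceeds a caller-supplied threshold, and the function names and parameters are hypothetical.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def rough_endpoints(samples, frame_len, energy_thr, zcr_thr):
    """Return (first_active_frame, last_active_frame), or None if silent.

    Assumed rule: a frame is active when energy OR zero-crossing rate
    exceeds its threshold. Trailing samples that do not fill a whole
    frame are ignored.
    """
    active = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if (frame_energy(frame) > energy_thr
                or zero_crossing_rate(frame) > zcr_thr):
            active.append(start // frame_len)
    if not active:
        return None
    return active[0], active[-1]
```

Both measures need only one pass over each frame with simple arithmetic, which is why a stage like this can keep pace with the incoming audio; only the frames between the returned boundaries would then be passed on to feature extraction.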
The second-stage endpoint detection uses more of the information generated by feature extraction, such as the energy of different frequency bands. A new feature allows the algorithm to search both forward and backward from the rough endpoints, and to update the search threshold based on the current whole word.
As a result, the second-stage endpoint detection gives a far more accurate endpoint location. This is considered a well-balanced compromise between efficiency and accuracy, and results in high-performance feature extraction.
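The second-stage refinement can be sketched in the same spirit. The exact algorithm is not described in the article, so everything below is an assumption for illustration: the per-frame band energies are taken as given, the threshold is set to a fixed fraction of the peak energy inside the rough word boundaries (one simple way of adapting it to the current whole word), and each boundary is moved outward or inward by at most `max_shift` frames.

```python
def refine_boundary(band_energy, index, outward, threshold, max_shift):
    """Refine one boundary; `outward` is -1 for the start, +1 for the end."""
    n = len(band_energy)
    i = index
    steps = 0
    # Search outward: absorb neighbouring frames that still carry speech.
    while (steps < max_shift and 0 <= i + outward < n
           and band_energy[i + outward] >= threshold):
        i += outward
        steps += 1
    if steps:
        return i
    # Otherwise search inward: drop boundary frames below the threshold.
    while (steps < max_shift and 0 <= i - outward < n
           and band_energy[i] < threshold):
        i -= outward
        steps += 1
    return i


def refine_endpoints(band_energy, start, end, ratio=0.1, max_shift=5):
    """Return refined (start, end) frame indices for one word.

    The threshold is derived from the word itself (assumed rule):
    ratio * peak band energy within the rough boundaries, 0 < ratio <= 1.
    """
    threshold = ratio * max(band_energy[start:end + 1])
    new_start = refine_boundary(band_energy, start, -1, threshold, max_shift)
    new_end = refine_boundary(band_energy, end, +1, threshold, max_shift)
    return new_start, new_end
```

With band energies of `[0, 0, 3, 8, 10, 9, 4, 0, 0]`, rough boundaries that are too tight (frames 3 to 5) and rough boundaries that are too loose (frames 1 to 7) both converge on frames 2 to 6 in this sketch, which illustrates why a word-relative threshold makes the result less sensitive to the first-stage estimate.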
The continuous-density Hidden Markov Model (HMM) method, based on both word models and subword models, is also implemented at this level to achieve text- and speaker-independent high-accuracy recognition. This means that new vocabulary can easily be entered as text on a computer and downloaded to the chip.
A multipass decoding algorithm is also embedded in the system. Using a simple template, the most likely words are selected in a short time. A much more precise template is then used to determine which one is the final result. This saves both memory and recognition time, thus enhancing the recognition performance of the SoC.
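The idea of multipass decoding can be shown with a toy two-pass recognizer. The scoring functions and templates below are invented purely for illustration and stand in for the coarse and precise HMM templates, whose details the article does not give; only the control flow (cheap shortlist first, expensive rescoring second) reflects the description above.

```python
# Toy templates, invented for illustration only.
TEMPLATES = {
    "yes": [1.0, 2.0, 3.0],
    "no": [3.0, 2.0, 1.0],
    "stop": [5.0, 5.0, 5.0],
    "go": [0.0, 0.0, 9.0],
}


def mean(values):
    return sum(values) / len(values)


def coarse_score(features, word):
    """Cheap first-pass score: compare only the average feature value."""
    return -abs(mean(features) - mean(TEMPLATES[word]))


def fine_score(features, word):
    """Costlier second-pass score: negative squared distance to the template."""
    return -sum((f - t) ** 2 for f, t in zip(features, TEMPLATES[word]))


def two_pass_recognize(features, vocabulary, coarse, fine, shortlist_size=3):
    """Shortlist candidates with `coarse`, then pick the best with `fine`."""
    ranked = sorted(vocabulary, key=lambda w: coarse(features, w), reverse=True)
    shortlist = ranked[:shortlist_size]
    return max(shortlist, key=lambda w: fine(features, w))
```

Calling `two_pass_recognize([1.1, 2.1, 2.9], list(TEMPLATES), coarse_score, fine_score)` evaluates the precise score for only three of the four words. On a real vocabulary the shortlist would be a much smaller fraction of the total, which is where the time and memory savings come from.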
Also implemented on the SoC is the ITU G.723.1 speech algorithm, which provides a low-rate, good-quality speech coding method that has been successfully applied in very narrow-band videoconferencing.
This high compression rate also allows longer speech recordings for a given data-storage space. Most of the code and data at this level are stored in the ROM. This results in a significant reduction in silicon area, which translates to lower power and lower hardware cost.
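The storage benefit can be estimated from the G.723.1 frame format, which encodes each 30ms frame in 24 bytes at 6.3kbit/s or 20 bytes at 5.3kbit/s. The sketch below uses those standard figures; the article does not say which rate Uni-Lite uses, and the helper names, the 64KB memory size in the usage note and the comparison against 16-bit PCM at 8kHz are assumptions chosen for illustration.

```python
FRAME_MS = 30                             # G.723.1 frame duration
FRAME_BYTES = {"6.3k": 24, "5.3k": 20}    # encoded frame size per rate


def seconds_of_speech(storage_bytes, mode):
    """Speech duration that fits in storage, counting whole frames only."""
    frames = storage_bytes // FRAME_BYTES[mode]
    return frames * FRAME_MS / 1000


def compression_ratio(mode, sample_rate_hz=8000, bits_per_sample=16):
    """Raw PCM bits per frame divided by encoded bits per frame.

    The 16-bit, 8kHz PCM baseline is an assumption for comparison.
    """
    samples_per_frame = sample_rate_hz * FRAME_MS // 1000
    raw_bits = samples_per_frame * bits_per_sample
    return raw_bits / (FRAME_BYTES[mode] * 8)
```

Under these assumptions a hypothetical 64KB serial memory would hold roughly 82 seconds of speech at the higher rate, against about 4 seconds of uncompressed 16-bit PCM at 8kHz.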
The application level is the most variable portion of the system. With the implemented software architecture, this can be easily changed according to specific applications. With the support of relevant service-level speech-functional objects, any new application software can be built up rapidly.
Because the software system is flexible and configurable, each service-level speech-functional object can be freely integrated into the system application firmware.
The embedded software can be built up using the provided high-performance functional objects and is designed to enable applications to be easily assembled within a very short time.
The SoC is an optimized solution for embedded applications such as toys, voice-based remote controls and speech recorders. Future developments include real-time recognition with the ability to recognize much longer speech in a short time on the SoC. Speaker-dependent recognition and speaker identification will also be developed to satisfy more complex applications.
Zhizuo Yang and Jia Liu are with the Department of Electronic Engineering at Tsinghua University, Beijing, China. Eric Chan is Staff Engineer, Lim Cheow Guan is Senior Engineer and Chen Kim Chin is IC Design Manager in the ASIC Design and Security Department at Infineon Technologies AG.