A model for speech-driven lesson summary generation in a noisy educational environment
- Authors: Blunt, Phillip John
- Date: 2024-04
- Subjects: Automatic speech recognition , Speech processing systems , Educational technology
- Language: English
- Type: Master's theses , text
- Identifier: http://hdl.handle.net/10948/64500 , vital:73741
- Description: The application of Automatic Speech Recognition (ASR) technology to generate lesson transcripts and closed captions in the classroom has been shown to improve the learning experience of students in disadvantaged groups. This dissertation proposes a concept model for applying ASR technology in the educational environment for lesson transcription or closed captioning. The model further aims to bolster students’ secondary contact with lesson content by using keyword identification and subsequent association to generate a summary of the educator’s key points with reference to known course content material. To reinforce this concept, three core theoretical areas are discussed in this work: the existing applications of ASR technology in the classroom; the prominent machine-learning solutions capable of performing ASR, whether for keyword spotting or for continuous speech recognition; and the speech enhancement techniques used to mitigate the negative effects of environmental noise in the educational space. After a groundwork investigation into these three areas, an initial model was created for incorporating an ASR system into the educational environment, using the educator’s speech to drive the generation of the lesson summary. Analysis for prototype development revealed a number of challenges in training a keyword-spotting machine-learning model on South African speech data, so a cloud-based ASR solution was deemed more appropriate for establishing proof of concept in a prototype system. The adoption of a cloud-based ASR solution also meant that a more reliable lesson transcript could be generated; as a result, the work shifted towards exploiting lesson transcription to generate a meaningful lesson summary. An initial prototype was then developed from the initial model using the cloud-based ASR approach. The final model presented in this work applies keyword identification to the transcription process, in conjunction with a course content database, to identify known, educator-defined keyword terms during a lesson and tie them to relevant course content items for that lesson (minimal sketches of the transcription and keyword-association steps appear after this record). As the model or the prototype was improved and adapted, its counterpart was modified accordingly, ensuring that each reflected both the theoretical and practical aspects of the other. After a series of improvement cycles, a final version of the model was established, supported by a performance evaluation of an acceptable prototype system. Ultimately, the prototype proved capable of generating a lesson summary, presented to students to bolster secondary contact with lesson content. This summary provides students with a lesson transcript, and also helps them to monitor educator-defined keyword terms, their prevalence in the educator’s speech during the lesson, and their associations with educator-defined sections of course content. The prototype was developed with a modular approach so that its speech recognition component was interchangeable between the CMU Sphinx and Google Cloud Speech-to-Text speech recognition systems, both accessed via a cloud-based programming library. In addition to the ASR module, noise injection, cancellation, and reduction were introduced to the prototype as a speech enhancement module to demonstrate the effects of noise on the prototype.
The prototype was tested using different configurations of speech recognition and speech enhancement techniques to demonstrate the change in the accuracy of lesson summary generation. Proof of concept was established using the Google Cloud continuous speech recognition system, which outperformed CMU Sphinx and enabled the prototype to achieve 100.00% accuracy in keyword identification and subsequent association on noise-free speech, compared with 96.93% accuracy on noise-polluted speech when noise cancellation was applied. , Thesis (MIT) -- Faculty of Engineering, the Built Environment and Technology, School of Information Technology, 2024
- Full Text:
- Date Issued: 2024-04
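
The abstract above describes a modular prototype whose speech recognition component is interchangeable between CMU Sphinx and Google Cloud Speech-to-Text, both reached through a programming library. The thesis does not name that library, so the sketch below assumes the common Python SpeechRecognition package, which exposes both backends; it is an illustration of the swappable-backend idea, not the author's actual code.

```python
# Minimal sketch of an interchangeable ASR backend, assuming the Python
# SpeechRecognition package (an assumption; the thesis's library is unnamed).
import speech_recognition as sr


def transcribe(wav_path: str, backend: str = "google_cloud") -> str:
    """Transcribe a WAV file with the selected backend."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the entire file
    if backend == "sphinx":
        # Offline CMU Sphinx (requires the pocketsphinx package).
        return recognizer.recognize_sphinx(audio)
    # Google Cloud Speech-to-Text (requires Google Cloud credentials).
    return recognizer.recognize_google_cloud(audio)
```

Swapping `backend` between "sphinx" and "google_cloud" reproduces, in miniature, the interchangeability the abstract describes; a speech enhancement step (noise injection, cancellation, or reduction) would preprocess the audio before `transcribe` is called.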
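
The abstract also describes identifying educator-defined keyword terms in the transcript, associating them with course content items, and measuring keyword-identification accuracy under different noise conditions. Here is a minimal Python sketch of that step under stated assumptions: `CourseContentDB`, `summarise_lesson`, and `keyword_accuracy` are hypothetical names for illustration, not the thesis's implementation.

```python
# Illustrative sketch only: these names are hypothetical, not the thesis's code.
from collections import Counter
from dataclasses import dataclass


@dataclass
class CourseContentDB:
    # Educator-defined keyword terms mapped to course content sections.
    keyword_to_sections: dict[str, list[str]]

    def sections_for(self, keyword: str) -> list[str]:
        return self.keyword_to_sections.get(keyword, [])


def summarise_lesson(transcript: str, db: CourseContentDB) -> dict:
    """Count educator-defined keywords in an ASR transcript and tie each
    identified keyword to its course content sections. Single-word terms
    only; multi-word terms would need phrase matching."""
    tokens = transcript.lower().split()
    counts = Counter(t for t in tokens if t in db.keyword_to_sections)
    return {
        kw: {"occurrences": n, "sections": db.sections_for(kw)}
        for kw, n in counts.most_common()
    }


def keyword_accuracy(identified: set[str], expected: set[str]) -> float:
    """Percentage of expected keyword identifications recovered."""
    return 100.0 * len(identified & expected) / len(expected) if expected else 100.0


# Toy usage:
db = CourseContentDB({"recursion": ["Unit 3.2"], "stack": ["Unit 3.1", "Unit 3.4"]})
summary = summarise_lesson("a stack supports recursion and the stack grows", db)
print(keyword_accuracy(set(summary), {"recursion", "stack"}))  # 100.0
```

Under this framing, the reported 100.00% and 96.93% figures would correspond to the proportion of expected keyword identifications (with their associations) recovered from the noise-free and noise-cancelled transcripts, respectively.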
Incorporating emotion detection in text-dependent speaker authentication
- Authors: van Rensburg, Ebenhaeser Otto Janse , Von Solms, Rossouw
- Date: 2024-04
- Subjects: Automatic speech recognition , Biometric identification , Computer networks -- Security measures , Computer networks -- Access control
- Language: English
- Type: Doctoral theses , text
- Identifier: http://hdl.handle.net/10948/64566 , vital:73767
- Description: Biometric authentication allows a person to access sensitive information using unique physical characteristics. Voice, as a biometric authentication method, is gaining popularity due to its unique characteristics and its widespread availability on smartphones and other devices. It offers a secure and user-friendly alternative to traditional password-based authentication and is less intrusive than fingerprint authentication. Furthermore, a vast amount of information is conveyed through voice, such as age, gender, health, and emotional state. Gaining illegitimate access to information becomes significantly more difficult because biometrics are difficult to steal and countermeasures to techniques such as replay attacks are constantly being improved. However, illegitimate access can still be gained by forcing a legitimate person to authenticate themselves through voice. This study investigates how the emotion(s) carried by the voice can assist in detecting whether authentication was performed under duress (a minimal sketch of such a check appears after this record). Knowledge is contributed using a three-phased approach: information gathering, experimentation, and deliberation. The experimentation phase is further divided into three sub-phases to extract data, implement findings, and assess the value of determining duress from voice. This phased approach ensures that variables change minimally between phases and keeps the conclusions drawn relevant to each phase. The first phase examines datasets and classifiers; the second explores feature enhancement techniques and their impact; and the third discusses performance measurements and their value to emotion detection. , Thesis (DPhil) -- Faculty of Engineering, the Built Environment and Technology, School of Information Technology, 2024
- Full Text:
- Date Issued: 2024-04
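
The abstract above proposes gating voice authentication on a duress signal derived from the speaker's emotional state. The thesis does not name its features or classifier, so the following Python sketch assumes MFCC features and an SVM purely for illustration; the labelled corpus, feature choice, and model are all assumptions, not the author's method.

```python
# Illustrative sketch only: MFCC features and an SVM are assumptions here;
# the thesis does not specify its features, classifier, or corpus.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def mfcc_features(path: str, sr: int = 16000) -> np.ndarray:
    # One fixed-length vector per utterance: MFCC means and std-devs.
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def train_duress_detector(paths: list[str], labels: list[int]):
    # labels: 1 = utterance recorded under duress/negative emotion, 0 = neutral,
    # taken from an emotion-labelled speech corpus (an assumption).
    X = np.stack([mfcc_features(p) for p in paths])
    clf = make_pipeline(StandardScaler(), SVC(probability=True))
    clf.fit(X, labels)
    return clf


def duress_probability(clf, path: str) -> float:
    # Probability the utterance carries duress; used to gate authentication.
    return float(clf.predict_proba(mfcc_features(path).reshape(1, -1))[0, 1])
```

In such a flow, the system would grant access only when the speaker is verified by the existing text-dependent authentication step and `duress_probability` falls below a chosen threshold.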