HAM-NET: Hierarchical Acoustic Modeling with Dilated Convolutions and Multi-Scale LSTMs for Enhanced Speech Command Recognition


Vinay RAVURI1, Kolla Bhanu PRAKASH2, Valentina Emilia BALAS3,4

Abstract. Accurate detection of spoken commands is essential for modern interactive voice systems, yet robust keyword spotting remains computationally demanding, especially under speaker and noise variability. State-of-the-art solutions require substantial resources and large training datasets, while still struggling with acoustically similar keywords. This work presents a novel keyword spotting architecture based on hierarchical modeling, enabling more efficient resource allocation and reduced computational waste. The proposed approach provides not only improved keyword recognition, but also an explicit modeling of relationships among keywords. Experimental evaluation against a standard baseline demonstrates superior accuracy. Analysis using a confusion matrix shows significantly reduced misclassification among similar-sounding keywords. These results indicate a meaningful advancement in both efficiency and reliability for keyword spotting systems.

Keywords: Keyword Spotting, Hierarchical Acoustic Modeling, Dilated Convolutions, LSTM, Speech Command Recognition, Deep Learning

DOI: 10.56082/annalsarsciinfo.2025.2.17

1Student, Mohan Babu University, Tirupati, India (e-mail: drkbp1981@gmail.com)

2Professor, Department of Computer Science & Engineering, Koneru Lakshmaiah Education Foundation, Green Fields, Vaddeswaram, A.P., India (e-mail: drkbp@kluniversity.in)

3Professor, Faculty of Engineering, “Aurel Vlaicu” University of Arad, Romania

4Corresponding member of the Academy of Romanian Scientists (e-mail: balas@drbalas.ro)

 


Published in Annals of the Academy of Romanian Scientists, Series on Science and Technology of Information, Volume 18, No. 2