A group of researchers from three British universities has devised an advanced acoustic attack that can determine, with 95% accuracy, the text typed on a keyboard by analyzing the sound of keystrokes. The sound can be recorded with a nearby smartphone or captured from the local microphone of the attacked device. The input is reconstructed by a classifier built on a machine learning model that takes into account the acoustic features and loudness of different key presses.
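To illustrate, a natural first step in such an attack is isolating the individual key presses in a recording. The sketch below is not the researchers' actual code; it segments a recording by frame energy, with hypothetical thresholds and clip lengths:

```python
import numpy as np
import librosa

def isolate_keystrokes(wav_path, threshold_db=-30.0, min_gap_s=0.1, clip_s=0.3):
    # Load the recording and compute frame-level energy in dB,
    # relative to the loudest frame.
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    hop = 512
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    energy_db = librosa.amplitude_to_db(rms, ref=np.max)

    # Frames above the threshold are candidate key-press onsets;
    # onsets closer together than min_gap_s belong to the same press.
    clips, last_onset = [], -1.0
    for frame in np.where(energy_db > threshold_db)[0]:
        t = frame * hop / sr
        if t - last_onset < min_gap_s:
            continue  # still the same key press
        last_onset = t
        start = int(t * sr)
        clips.append(y[start:start + int(clip_s * sr)])
    return clips, sr  # one short audio clip per detected key press
```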
Carrying out the attack requires preliminary training of the model, which means matching recorded keystroke sounds against information about which keys were pressed. Under ideal training conditions, this can be done by malware installed on the attacked computer that simultaneously records audio from the microphone and intercepts keystrokes. In a more realistic scenario, the necessary training data can be collected by matching text messages typed during a video conference against the recorded typing sounds. When the model is trained on input captured from Zoom and Skype video conferences, the accuracy drops only slightly, to 93% and 91.7%, respectively.
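A minimal sketch of the "ideal" data-collection step, assuming a hypothetical keylog CSV with `timestamp_s,key` rows produced alongside the audio recording:

```python
import csv
import librosa

def build_labeled_clips(wav_path, keylog_csv, clip_s=0.3):
    # Pair each logged key press with the audio around its timestamp.
    y, sr = librosa.load(wav_path, sr=None, mono=True)
    n = int(clip_s * sr)
    pairs = []
    with open(keylog_csv) as f:
        for row in csv.DictReader(f):
            start = int(float(row["timestamp_s"]) * sr)
            pairs.append((y[start:start + n], row["key"]))
    return pairs  # list of (audio clip, key label) training examples
```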
In the experiment, to train the machine learning model on sound from a Zoom conference, each of 36 keys (0-9, A-Z) was pressed 25 times in a row, with different fingers and varying force. The audio of each press was converted into a spectrogram image reflecting how the frequency and amplitude of the sound change over time. These spectrograms were used to train a classifier based on the CoAtNet model (convolution and attention network), a model commonly used for image classification in machine vision systems. In other words, during training each press's spectrogram image is paired with the label of the corresponding key, and the trained CoAtNet model, given a spectrogram, returns the most likely pressed key, much as objects are recognized from their images.
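A rough sketch of this classification step, assuming librosa for the spectrograms and the CoAtNet variants shipped with the timm library; the model name and preprocessing details here are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np
import librosa
import torch
import timm

def clip_to_spectrogram(clip, sr, size=224):
    # Render one key-press clip as a mel spectrogram "image":
    # frequency on one axis, time on the other, amplitude as intensity.
    mel = librosa.feature.melspectrogram(y=clip, sr=sr, n_mels=64)
    img = librosa.power_to_db(mel, ref=np.max)
    img = (img - img.min()) / (img.max() - img.min() + 1e-9)  # scale to [0, 1]
    t = torch.tensor(img, dtype=torch.float32)[None, None]
    t = torch.nn.functional.interpolate(t, size=(size, size), mode="bilinear")
    return t.repeat(1, 3, 1, 1)  # duplicate into 3 "RGB" channels

# 36 output classes: digits 0-9 plus letters A-Z, as in the experiment.
model = timm.create_model("coatnet_0_rw_224", pretrained=False, num_classes=36)
model.eval()

# A dummy 300 ms clip at 16 kHz stands in for a real key-press recording.
clip = np.random.randn(4800).astype(np.float32)
with torch.no_grad():
    logits = model(clip_to_spectrogram(clip, 16000))
print("most likely key index:", logits.argmax(dim=1).item())
```

Treating the spectrograms as ordinary images is what lets an off-the-shelf image classifier like CoAtNet be reused for keystroke recognition without any audio-specific architecture.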