A team of researchers from the Chinese University of Hong Kong and Sun Yat-sen University has found that AI-based code completion tools such as GitHub Copilot and Amazon CodeWhisperer can inadvertently expose confidential data. This data may include API keys, access tokens, and other hard-coded credentials.
The researchers developed a tool called Hard-coded Credential Revealer (HCR) to identify such data. In their study, 2,702 of the 8,127 code completion suggestions generated by Copilot turned out to be valid secrets. For CodeWhisperer, the figure was 129 out of 736.
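To make the idea of spotting secrets in completion suggestions concrete, here is a minimal sketch of one common approach: match well-known key formats with regular expressions, then discard low-entropy placeholders. This is not the HCR implementation, which the article does not describe. The pattern set, the `find_candidate_secrets` function, and the entropy threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Illustrative patterns only (an assumption, not HCR's rule set).
# They follow publicly documented key prefixes; real scanners use many more.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "stripe_test_secret_key": re.compile(r"\bsk_test_[0-9a-zA-Z]{24,}\b"),
    "github_personal_access_token": re.compile(r"\bghp_[0-9A-Za-z]{36}\b"),
}


def shannon_entropy(s: str) -> float:
    """Return the Shannon entropy of s in bits per character."""
    if not s:
        return 0.0
    counts = Counter(s)
    total = len(s)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_candidate_secrets(completion: str, min_entropy: float = 3.0):
    """Return (kind, value) pairs for secret-shaped strings in a completion.

    min_entropy is a hypothetical threshold used to drop obvious
    placeholders such as a prefix followed by one repeated character.
    """
    hits = []
    for kind, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(completion):
            value = match.group(0)
            if shannon_entropy(value) >= min_entropy:
                hits.append((kind, value))
    return hits


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the example key ID used in AWS documentation.
    print(find_candidate_secrets('aws_key = "AKIAIOSFODNN7EXAMPLE"'))
```

A match only shows that a string has the right shape. It says nothing about whether the credential is real or still active.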
It is important to note that these “secrets” were initially published accidentally and could have been used or revoked before they reached the model. However, the study emphasizes the risks associated with reusing data that was originally intended for training purposes.
The study also revealed that the models not only reproduce secret keys from their training data, but also suggest new ones that are not present in the data. This raises concerns about the potential disclosure of other types of data.
The research demonstrated that thousands of new unique secrets are inadvertently published on GitHub every day due to developers’ mistakes or their disregard for security practices.
For ethical reasons, the researchers did not verify keys whose use could pose serious privacy risks, such as live payment API keys. They did, however, check a set of harmless keys and found two working Stripe test keys, suggested by both Copilot and CodeWhisperer.
In response to reports of this threat, GitHub stated that since March 2023 it has used an AI-based system that blocks insecure code patterns in real time, which it says helps prevent such leaks.
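The article gives no details of how GitHub's system works beyond calling it AI-based. As a much simpler, rule-based stand-in, the sketch below shows how a suggestion could be post-processed before it reaches the user, replacing secret-shaped strings with a placeholder. The `sanitize_suggestion` function, the pattern set, and the placeholder text are all hypothetical.

```python
import re
from typing import Tuple

# Hypothetical, deliberately small pattern set based on publicly
# documented key prefixes. This is NOT GitHub's filter.
SECRET_SHAPED = re.compile(
    r"\b(?:AKIA[0-9A-Z]{16}"
    r"|sk_(?:test|live)_[0-9a-zA-Z]{24,}"
    r"|ghp_[0-9A-Za-z]{36})\b"
)

PLACEHOLDER = "<REDACTED_SECRET>"


def sanitize_suggestion(suggestion: str) -> Tuple[str, int]:
    """Replace secret-shaped substrings in a suggestion.

    Returns the sanitized text and the number of replacements made.
    """
    return SECRET_SHAPED.subn(PLACEHOLDER, suggestion)


if __name__ == "__main__":
    text, count = sanitize_suggestion('client = connect("AKIAIOSFODNN7EXAMPLE")')
    print(count, text)
```

A filter like this only catches formats it already knows, which is one reason a fixed pattern list cannot fully replace removing secrets from the training data.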
“In some cases, the model may propose what appears to be personal data, but these suggestions are fictitious information synthesized from patterns in the training data,” the company added.
In conclusion, even if the researchers’ findings are only partially confirmed, they set a precedent and underscore the need for technology companies offering such tools to develop additional methods for testing and verifying code, including code generated by artificial intelligence. In today’s world, there should be no room for leaks of secrets and other confidential data.