TY - JOUR
T1 - Gotcha! This Model Uses My Code! Evaluating Membership Leakage Risks in Code Models
AU - Yang, Zhou
AU - Zhao, Zhipeng
AU - Wang, Chenyu
AU - Shi, Jieke
AU - Kim, Dongsun
AU - Han, Dong Gyun
AU - Lo, David
N1 - Publisher Copyright:
© 1976-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns. This paper investigates a crucial but unexplored question: What is the risk of membership information leakage in code models? Membership leakage refers to the vulnerability where an attacker can infer whether a specific data point was part of the training dataset. We present Gotcha, a novel membership inference attack method designed for code models, and evaluate its effectiveness on Java-based datasets. Gotcha simultaneously considers three key factors: model input, model output, and ground truth. Our ablation study confirms that each factor significantly enhances attack performance. Our investigation reveals a troubling finding: membership leakage risk is significantly elevated. While previous methods had accuracy close to random guessing, Gotcha achieves high precision, with a true positive rate of 0.95 and a low false positive rate of 0.10. We also demonstrate that the attacker's knowledge of the victim model (e.g., model architecture and pre-training data) affects attack success. Additionally, modifying decoding strategies can help reduce membership leakage risks. This research highlights the urgent need to better understand the privacy vulnerabilities of code models and develop strong countermeasures against these threats.
AB - Leveraging large-scale datasets from open-source projects and advances in large language models, recent progress has led to sophisticated code models for key software engineering tasks, such as program repair and code completion. These models are trained on data from various sources, including public open-source projects like GitHub and private, confidential code from companies, raising significant privacy concerns. This paper investigates a crucial but unexplored question: What is the risk of membership information leakage in code models? Membership leakage refers to the vulnerability where an attacker can infer whether a specific data point was part of the training dataset. We present Gotcha, a novel membership inference attack method designed for code models, and evaluate its effectiveness on Java-based datasets. Gotcha simultaneously considers three key factors: model input, model output, and ground truth. Our ablation study confirms that each factor significantly enhances attack performance. Our investigation reveals a troubling finding: membership leakage risk is significantly elevated. While previous methods had accuracy close to random guessing, Gotcha achieves high precision, with a true positive rate of 0.95 and a low false positive rate of 0.10. We also demonstrate that the attacker's knowledge of the victim model (e.g., model architecture and pre-training data) affects attack success. Additionally, modifying decoding strategies can help reduce membership leakage risks. This research highlights the urgent need to better understand the privacy vulnerabilities of code models and develop strong countermeasures against these threats.
KW - code completion
KW - large language models for code
KW - Membership inference attack
KW - privacy
UR - http://www.scopus.com/inward/record.url?scp=85207929323&partnerID=8YFLogxK
U2 - 10.1109/TSE.2024.3482719
DO - 10.1109/TSE.2024.3482719
M3 - Article
AN - SCOPUS:85207929323
SN - 0098-5589
VL - 50
SP - 3290
EP - 3306
JO - IEEE Transactions on Software Engineering
JF - IEEE Transactions on Software Engineering
IS - 12
ER -