PassGPT and AI/ML: Cracking Passwords
With all the recent buzz about ChatGPT, it isn’t surprising that a password-guessing model, PassGPT, has surfaced based on OpenAI’s GPT-2 architecture (Lanz, 2023). While PassGPT is claimed to decode the complex features of human-generated passwords and help users choose stronger ones, it raises some eyebrows from a cybersecurity standpoint. PassGPT is built on Large Language Models (LLMs) that use progressive sampling to construct probable passwords character by character. The LLMs behind PassGPT were trained on pools of passwords leaked in various hacks and exploits. The LLM approach outperforms the previously touted Generative Adversarial Network (GAN) approach to password guessing by an estimated 20%, which offers opportunities for password protection companies and hackers alike.
Technical Background
Large Language Models (LLMs) and Generative AI, Deep Learning Techniques, Password Guessing, Password Strength Estimation, and Deep Generative Models provide the background to the emerging use of the technology implemented by PassGPT. Several Artificial Intelligence/Machine Learning (AI/ML) technologies are briefly described below (Rando and Hitaj, 2023):
1. Generative AI: Generative AI is part of the Artificial Intelligence field focused on creating computer programs and systems to generate original and creative content, such as image generation, music creation, poetry generation, etc. While more traditional AI systems rely heavily on explicit programming or rules, Generative AI uses Deep Learning techniques and neural networks to learn patterns and generate new content based on training data. Generative AI can also be misused, as it is the primary technology behind creating deep fakes.
2. Deep Learning Techniques: Deep Learning is a sub-field of AI that focuses on designing neural networks and algorithms that simulate the human brain’s structure, purpose, and function. A variety of deep learning methodologies exist, including convolutional and recurrent neural networks and, more recently, transformer architectures such as the GPT family.
3. Password Guessing: Password Guessing employs Machine Learning (ML): an algorithm is trained on a dataset of known passwords and uses pattern recognition to predict what a password might be, based on typical human behavior (encoding this behavioral expectation is part of the modeling requirement for ML).
4. Password Strength Estimation: Password Strength Estimation is a component of Password Guessing and is crucial in ensuring user password security. In the traditional sense, password strength is a function of length and complexity (e.g., the mix of upper-case letters, lower-case letters, numbers, and special symbols). In AI/ML, models can be trained on large datasets of compromised passwords, allowing them to better estimate how guessable a given password is.
5. Deep Generative Models: Deep Generative Models is another subset of AI that aims to understand and learn the data distribution of the training set of data to generate new variations of data points, and in our case, passwords. Here are a few of the more common generative models:
a. Generative Adversarial Networks (GANs): GANs are a machine learning category invented by Ian Goodfellow et al. (2014). Where an LLM learns from large pools of data, a GAN is grounded in game theory, pitting two networks against each other.
b. Variational Autoencoders (VAEs): VAEs generate data as GANs do; however, VAEs have a different learning process, grounded in the principles of probabilistic graphical models and autoencoders.
c. Autoregressive models: Autoregressive models create sequences by predicting new (generated) values based on the previously generated values.
d. Energy-Based Models (EBMs): EBMs define an energy function over the data space, assigning low energy to regions near observed data points and higher energy elsewhere; generating new samples means seeking out low-energy (more plausible) configurations.
e. Normalizing Flow Models (NFM): NFM models build complex distributions by transforming simple ones through a sequence of invertible mappings. They provide an explicit likelihood model, in contrast to GANs, which do not.
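The autoregressive approach in (c) is the one most relevant here: each character is sampled conditioned on the characters generated so far. A minimal sketch, using a toy character-bigram model in place of a full LLM (the training list is illustrative, not real leaked data):

```python
import random
from collections import defaultdict

# Toy corpus standing in for a leaked-password dataset (illustrative only).
corpus = ["password1", "pass123", "sunshine", "password!"]

START, END = "^", "$"

# Count bigram transitions to estimate P(next_char | current_char).
counts = defaultdict(lambda: defaultdict(int))
for word in corpus:
    chars = [START] + list(word) + [END]
    for a, b in zip(chars, chars[1:]):
        counts[a][b] += 1

def sample_password(max_len=16, rng=random.Random(0)):
    """Generate one string character by character from the bigram model."""
    out, cur = [], START
    while len(out) < max_len:
        nxt_chars = list(counts[cur])
        weights = [counts[cur][c] for c in nxt_chars]
        cur = rng.choices(nxt_chars, weights=weights)[0]
        if cur == END:  # the model decided the password is complete
            break
        out.append(cur)
    return "".join(out)

print(sample_password())
```

PassGPT does the same thing in spirit, but conditions each character on the entire prefix with a GPT-2-style transformer rather than on only the previous character.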
An Overview of LLM and GAN
To dig into the technology, a comparison of LLMs and GANs provides some background on PassGPT and why it is a viable tool for cybersecurity researchers and hackers. Rando and Hitaj (2023) discuss the effectiveness of using Large Language Models (LLMs) as generative AI in creating robust passwords from large pools of password data. In particular, they focus on PassGPT, an LLM trained on leaked password data for probable password generation.
In contrast to an LLM, a GAN is a system in which two neural networks contest with each other in a “zero-sum” game. A GAN comprises two primary parts: (a) a Generator, a neural network that generates new data instances, and (b) a Discriminator, a neural network that distinguishes the Generator’s fake instances from real data. For password cracking with GANs, the Generator learns to create more plausible passwords while the Discriminator judges whether those passwords are likely to be real, based on its training data.
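The Generator/Discriminator split can be sketched in a few lines. This is a deliberately simplified caricature: the “discriminator” below is a bigram-frequency heuristic rather than a neural network, and there is no gradient-based adversarial training, but the zero-sum structure — one side proposes, the other scores — is the same (all names and data are illustrative):

```python
import random

rng = random.Random(42)

# "Real" leaked passwords the discriminator has seen (illustrative data).
real_passwords = {"password", "letmein", "qwerty", "dragon"}
alphabet = "abcdefghijklmnopqrstuvwxyz"

def generator(n=5, length=7):
    """Propose candidate passwords (here random strings; a real GAN
    would use a neural network trained against the discriminator)."""
    return ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(n)]

def discriminator(candidate):
    """Score how 'real' a candidate looks: the fraction of its bigrams
    that also occur in the real set (a stand-in for a trained network)."""
    real_bigrams = {w[i:i + 2] for w in real_passwords for i in range(len(w) - 1)}
    bigrams = [candidate[i:i + 2] for i in range(len(candidate) - 1)]
    return sum(b in real_bigrams for b in bigrams) / max(len(bigrams), 1)

# One round of the adversarial loop: generate, score, keep the best.
candidates = generator(n=100)
best = max(candidates, key=discriminator)
print(best, round(discriminator(best), 2))
```

In a real GAN, the Generator’s parameters are then updated to raise its score and the Discriminator’s to lower it, and the round repeats until the two reach an equilibrium.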
A Comparison to Traditional Password Cracking
Both LLMs and GANs differ from more traditional password-cracking methods, which include (a) brute-force methods, where every possible combination is attempted; (b) rule-based attacks, where knowledge of how people create passwords is used to predict likely candidates; and (c) dictionary attacks, where a list of common passwords and their variants is tried. By contrast, an LLM learns from pools of known password data, and a GAN is grounded in game theory.
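For contrast, a dictionary attack with a few rule-based manglings needs no model at all — just a word list and some common tweaks. A minimal sketch (the word list, rules, and target hash here are illustrative):

```python
import hashlib

def sha256(s):
    return hashlib.sha256(s.encode()).hexdigest()

# Target hash the attacker wants to invert (illustrative: hash of "sunshine1").
target = sha256("sunshine1")

wordlist = ["password", "sunshine", "dragon"]

def variants(word):
    """Rule-based mangling: common tweaks people make to dictionary words."""
    yield word
    yield word.capitalize()
    for suffix in ("1", "123", "!"):
        yield word + suffix

def dictionary_attack(target_hash):
    """Try every mangled dictionary word until one matches the hash."""
    for word in wordlist:
        for candidate in variants(word):
            if sha256(candidate) == target_hash:
                return candidate
    return None

print(dictionary_attack(target))  # → sunshine1
```

The weakness is plain: the attack can only ever recover passwords reachable from its word list and rules, whereas a generative model can propose human-like passwords it has never explicitly seen.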
PassGPT
PassGPT provides unique password generation and analysis. In AI/ML terms, PassGPT is an explicit generative model: it relies on the probability distribution of its data (e.g., password data), capturing the joint probability distribution of observed data and labels. As a generative model, PassGPT creates new data instances (passwords) by sampling from its learned distribution using an LLM. It also analyzes password-strength vulnerabilities, which, as noted above, can potentially compromise account security. Finally, PassGPT learns patterns across multiple languages, which improves on dictionary-based heuristics.
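This is how a generative model doubles as a strength estimator: score a candidate by the probability the model assigns it, since passwords the model finds likely are exactly the ones an attacker’s model would guess first. A minimal sketch with a smoothed character-frequency model standing in for the LLM (corpus and alphabet are illustrative):

```python
import math
from collections import Counter

# Toy leaked-password corpus (illustrative).
leaked = ["password", "password1", "letmein", "qwerty123"]

# Character frequencies with add-one smoothing over a small alphabet.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789!@#"
counts = Counter("".join(leaked))
total = sum(counts.values()) + len(alphabet)

def log_prob(password):
    """Log-probability of the password under the character model.
    Higher (closer to 0) means the model finds it more guessable."""
    return sum(math.log((counts[c] + 1) / total) for c in password)

# A common password scores higher (is more guessable) than a random one.
print(round(log_prob("password"), 1), round(log_prob("xK#9!vQz"), 1))
```

PassGPT computes the same kind of score with a full LLM over whole-password sequences, which is what lets it flag passwords that look strong by length/complexity rules but are actually predictable.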
The Future of PassGPT
While PassGPT is in its nascent stage, it is a versatile tool with the potential to be tailored to specific software applications that support more robust password security (Lanz, 2023). Further, while the initial implementation of PassGPT was developed and trained on large pools of compromised passwords, other datasets could be used to leverage its AI/ML for, say, medical data, search data, speech analysis, or other heuristic data.
Conclusion
PassGPT is challenging well-known password creation practices. Traditional password creation methods may become less secure and, eventually, obsolete. While PassGPT demonstrates advancements in password cracking for cybersecurity researchers and hackers alike, thanks to its robust tooling and password-strength algorithms, it could also be built out to study hidden patterns that support the future security of cloud platforms, digital IDs, data storage in mobile phones, biometric data, and similar security-intensive projects. And yes, hackers will offer cybersecurity researchers additional opportunities to work with PassGPT, so get ready.
References
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014, June 10). Generative Adversarial Networks. arXiv.org. https://arxiv.org/abs/1406.2661
Lanz, J. A. (2023, June 9). Meet PASSGPT, the AI trained on millions of leaked passwords. Decrypt. https://decrypt.co/144004/meet-passgpt-ai-trained-millions-leaked-passwords
Rando, J., & Hitaj, B. (2023). PassGPT: Password Modeling and (Guided) Generation with Large Language Models. arXiv. https://arxiv.org/abs/2306.01545
About the Author
Ron McFarland, Ph.D., CISSP, is a Senior Cybersecurity Consultant at CMTC (California Manufacturing Technology Consulting) in Long Beach, CA. He received his doctorate from NSU’s School of Engineering and Computer Science, an MSc in Computer Science from Arizona State University, and a Post-Doc Graduate Research Program in Cyber Security Technologies from the University of Maryland. He taught Cisco CCNA (Cisco Certified Network Associate), CCNP (Cisco Certified Network Professional), CCDA (Design), CCNA-Security, Cisco CCNA Wireless, and other Cisco courses. He was honored with the Cisco Academy Instructor (CAI) Excellence Award in 2010, 2011, and 2012 for excellence in teaching. He also holds multiple security certifications, including the prestigious Certified Information Systems Security Professional (CISSP). He writes for Medium as a guest author to provide information to learners of cybersecurity, students, and clients.
CONTACT Dr. Ron McFarland, Ph.D.
· CMTC Email: rmcfarland@cmtc.com
· Email: highervista@gmail.com
· LinkedIn: https://www.linkedin.com/in/highervista/
· YouTube Channel: https://www.youtube.com/@RonMcFarland/featured
Interested in AI/ML Programming?
Chapter 1: Programming for Artificial Intelligence and Machine Learning: Introduction to SWI-Prolog: https://medium.com/@highervista/chapter-1-programming-for-artificial-intelligence-and-machine-learning-introduction-to-swi-prolog-d3624ab0b791