Given the rise of LLM usage in essential services and critical systems, the need for a more efficient and secure LLM has become paramount. RedShift is a project that aims to provide a more secure LLM by implementing an adversarial jailbreaking system to identify weaknesses to prompt engineering. This research project consists of three LLM categories: an attacker LLM that generates jailbreaks, a target LLM that is the system under attack, and a judge LLM to quantify the success of the current jailbreak. (GitHub Link)
Multi-agent reinforcement learning system that uses adversarial jailbreaking to identify weaknesses in a target LLM.
The attacker LLM generates jailbreaks to exploit the target LLM, while the judge LLM quantifies the success of the jailbreak.
The target LLM is the system under attack and is used to test the effectiveness of the jailbreaks generated by the attacker LLM.
Nine LLMs are used in the RedShift project: each LLM is rotated to become the attacker, target, and judge LLM.
The attacker LLM generates jailbreak prompts that shift the attention of the target LLM, masking malicious queries to generate restricted content.
The judge LLM is only given the target LLM's response and the current jailbreak prompt.

