Attack Types
AI is vulnerable to evasion, poisoning, inference, and extraction attacks. HEART currently focuses on the evaluation of computer vision models – including image classification and object detection – against evasion attacks, but will soon be extended to include the other three types of threat.
Evasion
Evasion attacks manipulate the runtime data input to a trained model, crafting perturbations that cause the model to perform poorly (sometimes in a specific, targeted way). [1]
HEART supports the following types of evasion attacks:
Hop skip jump
Laser
Query efficient Black Box
Evasion attacks may have different goals.
For object detection models, evasion attacks may aim to cause the model to hallucinate objects (appearing attack), ignore existing objects (hiding attack or invisibility) [2], or mislocate or misclassify detected objects. Variations of the attack goal for object detection can include overflow attacks that lead the model to detect an extremely large number of objects. This can result in memory overflows and/or increased latency [3], rendering the detector unusable (denial of service).
Evasion attacks can be either untargeted (any valid output other than the correct one counts as success) or targeted (the output is incorrect and matches a specific target chosen by the attacker). In the case of image classification, if the correct classification is ‘airplane’, any classification other than ‘airplane’ caused by the crafted adversarial perturbation is a successful untargeted attack. In most cases, untargeted attacks shift the adversarial point to any close neighboring class. Targeted attacks, on the other hand, aim to cause the classifier to output a specific target class. Adversarial perturbations for targeted attacks are often harder to craft than for untargeted attacks, because the perturbation must shift the point into one specific class, not just any neighboring class.
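The difference between the two goals can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-class linear classifier, its weights, the input point, and the helper names are hypothetical, and the sign-based step is a generic perturbation, not one of the attacks HEART supports. With the same perturbation budget, the untargeted attack succeeds by slipping into the nearest competing class, while the targeted attack needs a larger budget to reach its chosen class.

```python
# Toy linear classifier: score[k] = W[k] . x (weights are illustrative only).
CLASSES = ["airplane", "bird", "ship"]
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]


def scores(x):
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]


def predict(x):
    s = scores(x)
    return s.index(max(s))


def sign(v):
    return (v > 0) - (v < 0)


def perturb_toward(x, true_idx, goal_idx, eps):
    """Shift every feature by eps in the direction that raises the goal
    class score relative to the true class score."""
    direction = [g - t for g, t in zip(W[goal_idx], W[true_idx])]
    return [x_i + eps * sign(d) for x_i, d in zip(x, direction)]


def untargeted_attack(x, true_idx, eps):
    # Aim at the closest competing class: the best-scoring class
    # other than the true one.
    s = scores(x)
    runner_up = max((k for k in range(len(s)) if k != true_idx),
                    key=lambda k: s[k])
    return perturb_toward(x, true_idx, runner_up, eps)


def targeted_attack(x, true_idx, target_idx, eps):
    return perturb_toward(x, true_idx, target_idx, eps)


def untargeted_success(pred_idx, true_idx):
    return pred_idx != true_idx      # any wrong class counts


def targeted_success(pred_idx, target_idx):
    return pred_idx == target_idx    # only the chosen class counts


x = [1.0, 0.2]                       # classified as 'airplane'
true_idx = predict(x)
target_idx = CLASSES.index("ship")

for eps in (0.5, 1.0):
    pred_u = predict(untargeted_attack(x, true_idx, eps))
    pred_t = predict(targeted_attack(x, true_idx, target_idx, eps))
    print(eps,
          CLASSES[pred_u], untargeted_success(pred_u, true_idx),
          CLASSES[pred_t], targeted_success(pred_t, target_idx))
```

In this toy setting the smaller budget is enough to push the point into the neighboring class but not into the target class, which mirrors the observation that targeted perturbations are often harder to craft.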
Poisoning
In a poisoning attack, the attacker introduces specifically crafted, poisonous samples into the training data with the goal of inserting a backdoor into the trained model to influence its behavior at runtime, making it more difficult to train the model on that data, or changing its behavior in unexpected ways, such as confusing two image categories. [4]
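As a rough illustration of the backdoor variant, the sketch below stamps a small trigger patch onto a fraction of the training images and relabels them with an attacker-chosen class. It is a simplified sketch under stated assumptions: images are plain nested lists of floats, and the function names, patch size, and poisoning fraction are hypothetical choices, not part of HEART.

```python
import random


def add_trigger(image, size=2, value=1.0):
    """Return a copy of a 2D image with a size x size patch of `value`
    stamped into the bottom-right corner (the backdoor trigger)."""
    poisoned = [row[:] for row in image]
    for r in range(-size, 0):
        for c in range(-size, 0):
            poisoned[r][c] = value
    return poisoned


def poison_dataset(images, labels, target_label, fraction=0.1, seed=0):
    """Stamp the trigger on a random fraction of samples and relabel them
    as `target_label`. Returns new lists plus the poisoned indices."""
    rng = random.Random(seed)
    n_poison = int(len(images) * fraction)
    chosen = set(rng.sample(range(len(images)), n_poison))

    new_images, new_labels = [], []
    for i, (image, label) in enumerate(zip(images, labels)):
        if i in chosen:
            new_images.append(add_trigger(image))
            new_labels.append(target_label)
        else:
            new_images.append([row[:] for row in image])
            new_labels.append(label)
    return new_images, new_labels, sorted(chosen)


# Twenty blank 4x4 "images" with alternating labels 0 and 1.
images = [[[0.0] * 4 for _ in range(4)] for _ in range(20)]
labels = [i % 2 for i in range(20)]

p_images, p_labels, poisoned_idx = poison_dataset(
    images, labels, target_label=7, fraction=0.1)
print(len(poisoned_idx), "samples carry the trigger")
```

A model trained on such data may learn to associate the patch with the target label, so that any runtime input carrying the patch is steered toward that label.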
Inference
Inference attacks, also known as privacy attacks, do not directly affect model performance. Instead, these attacks infer private or sensitive information about the model's training dataset by interacting with the model (without any access to the training data itself). Attackers probe the model with specifically selected input samples and analyze the model's output to derive the targeted insights.
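One simple form of this probing is a confidence-threshold membership test: if a model is noticeably more confident on samples it was trained on, an attacker may guess that high-confidence inputs were part of the training set. The sketch below assumes a deliberately exaggerated stand-in model that memorizes its training points; the model, the threshold, and all names are hypothetical and only illustrate the probing loop.

```python
def make_overfit_model(training_points):
    """Hypothetical stand-in for an overfit classifier: it returns a very
    confident prediction for memorized points and a hesitant one otherwise."""
    memorized = set(training_points)

    def query(x):
        return [0.99, 0.01] if x in memorized else [0.60, 0.40]

    return query


def infer_membership(query, candidates, threshold=0.9):
    """Guess 'was in the training set' when the top confidence returned
    by the model reaches the threshold. Uses only black-box queries."""
    return {x: max(query(x)) >= threshold for x in candidates}


training_points = [(0, 0), (1, 1)]
query = make_overfit_model(training_points)

guesses = infer_membership(query, [(0, 0), (2, 2)])
print(guesses)
```

Real models rarely separate members from non-members this cleanly, so practical attacks typically calibrate the threshold rather than fixing it by hand.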
Extraction
In an extraction attack, the attacker attempts to decipher information about the model, which under special circumstances can include its architecture and parameters, in order to replicate its functionality and steal its intellectual property.
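One such special circumstance can be sketched with a toy example: if the victim is a linear model that returns its raw score, a handful of well-chosen queries recovers its parameters exactly. The victim model, its weights, and the helper names below are assumptions made for illustration; real models are rarely this exposed, and practical extraction usually yields only an approximate surrogate.

```python
def make_victim(weights, bias):
    """Hypothetical black-box model: callers see only the output score."""
    def predict(x):
        return sum(w * x_i for w, x_i in zip(weights, x)) + bias
    return predict


def extract_linear_model(predict, n_features):
    """Recover the bias and weights of a linear model using
    n_features + 1 queries."""
    # Querying the all-zero input returns the bias alone.
    bias = predict([0.0] * n_features)
    weights = []
    for i in range(n_features):
        # Querying the i-th unit vector returns weights[i] + bias.
        unit = [1.0 if j == i else 0.0 for j in range(n_features)]
        weights.append(predict(unit) - bias)
    return weights, bias


victim = make_victim([0.5, -2.0, 1.5], bias=0.25)
stolen_w, stolen_b = extract_linear_model(victim, n_features=3)
surrogate = make_victim(stolen_w, stolen_b)

probe = [2.0, 1.0, -1.0]
print(victim(probe), surrogate(probe))
```

The surrogate built from the recovered parameters answers the probe exactly as the victim does, which is what allows an extracted model to stand in for the original.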
Attacks can be staged in concert to increase their combined effectiveness. For example, extraction attacks can provide key information to enable stronger white-box evasion attacks, or to extract a model that could later leak private information in a strong white-box inference attack.