MDS-11 (AI-Specific)

Model Failure

Specification

Perform a risk-based evaluation of the model and model serving infrastructure for model failure. Define and implement measures to mitigate model and model serving infrastructure failures, and regularly evaluate these measures throughout the AI system's lifecycle.

Threat coverage

Model manipulation

Data poisoning

Sensitive data disclosure

Model theft

Model/Service Failure

Insecure supply chain

Insecure apps/plugins

Denial of Service

Loss of governance

Architectural relevance

Physical infrastructure

Network

Compute

Storage

Application

Data

Lifecycle

Preparation

Data storage, Resource provisioning

Development

Design, Training

Evaluation

Evaluation, Validation/Red Teaming, Re-evaluation

Deployment

Orchestration, AI applications

Delivery

Operations, Maintenance, Continuous monitoring, Continuous improvement

Retirement

Archiving, Model disposal

Ownership / SSRM

Shared Cloud Service Provider-Model Provider (Shared CSP-MP)

The CSP and MP are jointly responsible and accountable for the design, development, implementation, and enforcement of the control to mitigate security, privacy, or compliance risks associated with Large Language Model (LLM)/GenAI technologies in the context of the services or products they develop and offer.

Model

Owned by the Model Provider (MP)

The model provider (MP) designs, develops, and implements the control as part of their services or products to mitigate security, privacy, or compliance risks associated with the Large Language Model (LLM). Model Providers are entities that develop, train, and distribute foundational and fine-tuned AI models for various applications. They create the underlying AI capabilities that other actors build upon. Model Providers are responsible for model architecture, training methodologies, performance characteristics, and documentation of capabilities and limitations. They operate at the foundation layer of the AI stack and may provide direct API access to their models. Examples: OpenAI (GPT, DALL-E, Whisper), Anthropic (Claude), Google (Gemini), Meta (Llama), as well as any customized model.

Orchestrated

Shared Model Provider-Orchestrated Service Provider (Shared MP-OSP)

The MP and OSP are jointly responsible and accountable for the design, development, implementation, and enforcement of the control to mitigate security, privacy, or compliance risks associated with Large Language Model (LLM)/GenAI technologies in the context of the services or products they develop and offer.

Application

Shared Orchestrated Service Provider-Application Provider (Shared OSP-AP)

The OSP and AP are jointly responsible and accountable for the design, development, implementation, and enforcement of the control to mitigate security, privacy, or compliance risks associated with Large Language Model (LLM)/GenAI technologies in the context of the services or products they develop and offer.

Implementation guidelines

[Shared Responsibilities (Applicable to ALL actors)]
1. Define service-level agreements (SLAs) or objectives (SLOs) that include metrics reflecting overall model performance and reliability.
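As an illustration of how such objectives might be checked, the following is a minimal sketch rather than a prescribed method. The `Slo` fields, the `evaluate_slo` helper, and every target value are assumptions introduced here for the example; real targets come from the agreed SLA/SLO.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO targets; the values used in examples are illustrative only."""
    min_success_rate: float    # fraction of requests that must succeed
    max_p95_latency_ms: float  # ceiling for 95th-percentile latency
    min_mean_quality: float    # floor for the mean quality score (0.0 to 1.0)


def evaluate_slo(slo, successes, latencies_ms, quality_scores):
    """Return a dict mapping each metric name to True when its objective is met."""
    success_rate = sum(successes) / len(successes)
    # quantiles(..., n=20) returns 19 cut points; the last is the 95th percentile.
    p95_latency = quantiles(latencies_ms, n=20)[-1]
    mean_quality = sum(quality_scores) / len(quality_scores)
    return {
        "success_rate": success_rate >= slo.min_success_rate,
        "p95_latency": p95_latency <= slo.max_p95_latency_ms,
        "mean_quality": mean_quality >= slo.min_mean_quality,
    }
```

A caller would feed in one reporting window of observations, for example `evaluate_slo(Slo(0.99, 800.0, 0.8), successes, latencies_ms, quality_scores)`, and treat any `False` entry as an objective that was missed.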

[Shared Responsibilities (Applicable to AP, OSP)]
1. Redundancy Management:
• Support multiple models, each with a weighted importance
• Monitor quality and performance
• Automatically detect degraded or failed models
• Adopt quorum-based decision-making techniques
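One way to combine weighted models with a quorum rule is sketched below. It assumes the models return discrete, hashable answers (for example, classification labels); free-text outputs would need a similarity comparison instead. The function name, the model names, and the default quorum fraction are hypothetical.

```python
from collections import defaultdict


def weighted_quorum(answers, weights, quorum_fraction=0.5):
    """Return the answer backed by more than quorum_fraction of the total weight.

    answers: dict of model name -> answer, or None if the model failed or was excluded.
    weights: dict of model name -> non-negative weight for every configured model.
    Returns None when no answer reaches the quorum.
    """
    # The quorum is measured against all configured models, so failed models
    # make it harder (not easier) for the remaining ones to decide alone.
    total_weight = sum(weights.values())
    tally = defaultdict(float)
    for name, answer in answers.items():
        if answer is not None:
            tally[answer] += weights[name]
    if not tally:
        return None
    best_answer, support = max(tally.items(), key=lambda item: item[1])
    return best_answer if support > quorum_fraction * total_weight else None
```

With weights `{"a": 0.5, "b": 0.3, "c": 0.2}`, two agreeing models "a" and "b" carry the decision, while a lone answer from "b" after "a" fails does not reach the quorum and the caller has to fall back.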

2. Model Configuration:
• Set appropriate weights for each model based on reliability
• Configure quality thresholds based on your use case
• Adjust quorum size based on reliability requirements
• Implement fallback scenarios for complete model failure
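The configuration items above could be captured in a small, validated structure. This is only a sketch: the class names, field names, and validation rules are assumptions chosen for illustration, not part of the control.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    name: str           # hypothetical model identifier
    weight: float       # relative reliability; higher means more trusted
    min_quality: float  # quality score (0.0 to 1.0) below which the model counts as degraded


@dataclass(frozen=True)
class EnsembleConfig:
    models: tuple           # tuple of ModelConfig
    quorum_size: int        # minimum number of agreeing healthy models
    fallback_response: str  # served when every model has failed

    def validate(self):
        """Return a list of problems; an empty list means the config is usable."""
        problems = []
        for m in self.models:
            if m.weight <= 0:
                problems.append(f"{m.name}: weight must be positive")
            if not 0.0 <= m.min_quality <= 1.0:
                problems.append(f"{m.name}: min_quality must be between 0 and 1")
        if not 1 <= self.quorum_size <= len(self.models):
            problems.append("quorum_size must be between 1 and the number of models")
        if not self.fallback_response.strip():
            problems.append("fallback_response must not be empty")
        return problems
```

Validating at load time means a misconfigured quorum or a missing fallback is caught before it matters, rather than during an outage.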

3. Implement Failure Detection techniques such as:
• Quality score tracking
• Error rate monitoring
• Latency monitoring
• Historical metrics tracking
• Trend analysis for early warning
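The detection techniques listed above can be combined in a rolling-window tracker, sketched below under stated assumptions: the `HealthTracker` class, its status labels, and all thresholds are hypothetical, and the trend check is a deliberately simple comparison of the newer half of the window against the older half.

```python
from collections import deque
from statistics import mean


class HealthTracker:
    """Rolling-window health check for one model. All thresholds are illustrative."""

    def __init__(self, window=50, max_error_rate=0.2, max_latency_ms=2000.0,
                 min_quality=0.7, max_quality_drop=0.1):
        self.samples = deque(maxlen=window)  # each entry: (ok, latency_ms, quality)
        self.max_error_rate = max_error_rate
        self.max_latency_ms = max_latency_ms
        self.min_quality = min_quality
        self.max_quality_drop = max_quality_drop

    def record(self, ok, latency_ms, quality):
        self.samples.append((ok, latency_ms, quality))

    def error_rate(self):
        return sum(1 for ok, _, _ in self.samples if not ok) / len(self.samples)

    def quality_trend(self):
        """Mean quality of the newer half minus the older half; negative means decline."""
        scores = [quality for _, _, quality in self.samples]
        half = len(scores) // 2
        if half == 0:
            return 0.0
        return mean(scores[half:]) - mean(scores[:half])

    def status(self):
        if not self.samples:
            return "unknown"
        latency = mean([s[1] for s in self.samples])
        quality = mean([s[2] for s in self.samples])
        if (self.error_rate() > self.max_error_rate
                or latency > self.max_latency_ms
                or quality < self.min_quality):
            return "degraded"
        # Early warning: absolute thresholds still hold, but quality is falling.
        if self.quality_trend() < -self.max_quality_drop:
            return "warning"
        return "healthy"
```

The "warning" state is the early-warning signal: it fires while the model is still within its absolute limits, which gives operators time to react before exclusion becomes necessary.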

4. Failover Mechanisms:
• Automatic model exclusion when quality degrades
• Dynamic load balancing across healthy models
• Cooldown periods between recovery attempts
• Graceful degradation when insufficient models are available
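A minimal sketch of exclusion, cooldown, and round-robin balancing follows. The `FailoverRouter` class and its method names are assumptions made for this example; the injectable `clock` parameter exists only so the cooldown can be exercised without waiting.

```python
import time


class FailoverRouter:
    """Routes requests across healthy models and retries excluded ones after a cooldown."""

    def __init__(self, models, cooldown_s=30.0, clock=time.monotonic):
        self.models = list(models)
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.excluded_at = {}  # model name -> time at which it was excluded
        self._next = 0

    def exclude(self, name):
        """Take a model out of rotation; calling again restarts its cooldown."""
        self.excluded_at[name] = self.clock()

    def restore(self, name):
        """Mark a model as recovered after a successful probe."""
        self.excluded_at.pop(name, None)

    def candidates(self):
        """Healthy models, plus excluded ones whose cooldown has elapsed."""
        now = self.clock()
        return [m for m in self.models
                if m not in self.excluded_at
                or now - self.excluded_at[m] >= self.cooldown_s]

    def pick(self):
        """Round-robin over candidates; None tells the caller to degrade gracefully."""
        pool = self.candidates()
        if not pool:
            return None
        choice = pool[self._next % len(pool)]
        self._next += 1
        return choice
```

When `pick()` returns `None`, the caller is expected to serve a reduced response (for example, a cached or static answer) instead of failing outright.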

5. Monitoring Setup:
• Implement comprehensive logging
• Set up alerts for degraded performance
• Monitor recovery attempts and success rates
• Track load distribution across models
• Monitor long-term quality trends
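Part of this monitoring setup, namely load distribution, recovery tracking, and alerting, might look like the sketch below. It uses Python's standard `logging` module; the logger name, the `LoadMonitor` class, and the 80% load-share threshold are hypothetical.

```python
import logging
from collections import Counter

logger = logging.getLogger("model_monitor")  # hypothetical logger name


class LoadMonitor:
    """Tracks per-model load and recovery outcomes, and raises simple alerts."""

    def __init__(self, max_share=0.8):
        self.requests = Counter()
        self.recovery_attempts = Counter()
        self.recovery_successes = Counter()
        self.max_share = max_share  # illustrative ceiling for one model's traffic share

    def record_request(self, model):
        self.requests[model] += 1

    def record_recovery(self, model, succeeded):
        self.recovery_attempts[model] += 1
        if succeeded:
            self.recovery_successes[model] += 1
        logger.info("recovery attempt model=%s succeeded=%s", model, succeeded)

    def load_shares(self):
        total = sum(self.requests.values())
        return {m: n / total for m, n in self.requests.items()} if total else {}

    def recovery_success_rate(self, model):
        attempts = self.recovery_attempts[model]
        return self.recovery_successes[model] / attempts if attempts else None

    def check_alerts(self):
        """Log and return a warning for each model carrying too much traffic."""
        alerts = [f"{m} carries {share:.0%} of traffic"
                  for m, share in self.load_shares().items()
                  if share > self.max_share]
        for alert in alerts:
            logger.warning(alert)
        return alerts
```

A heavily skewed load share is often a side effect of failover, so alerting on it can reveal that the remaining healthy models are close to their own capacity limits.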

6. Recovery Strategy:
• Implement model-specific recovery logic
• Consider automatic scaling for heavy loads
• Perform and validate disaster recovery or chaos engineering tests
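A chaos-style test for the recovery strategy can be as small as injecting an outage into each model in turn and checking what the caller receives. The sketch below assumes models are plain callables taking a prompt; `call_with_fallback`, `inject_failure`, and `run_chaos_experiment` are hypothetical helpers, not part of any chaos engineering tool.

```python
def call_with_fallback(models, prompt, fallback="Service temporarily unavailable."):
    """Try each model callable in order; return the fallback if all of them raise."""
    for model in models:
        try:
            return model(prompt)
        except Exception:
            continue  # move on to the next model
    return fallback


def inject_failure(model):
    """Chaos wrapper: returns a callable that always raises, simulating an outage."""
    def broken(prompt):
        raise RuntimeError("injected outage")
    return broken


def run_chaos_experiment(models, prompt):
    """Knock out each model in turn, then all at once, and record what was served."""
    results = {}
    for i in range(len(models)):
        trial = [inject_failure(m) if j == i else m for j, m in enumerate(models)]
        results[f"model_{i}_down"] = call_with_fallback(trial, prompt)
    all_down = [inject_failure(m) for m in models]
    results["all_down"] = call_with_fallback(all_down, prompt)
    return results
```

Validation then consists of asserting on the recorded results: each single-model outage should still produce a model answer, and only the total outage should produce the fallback message.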

Auditing guidelines

1. Review the CSP's infrastructure resilience and high-availability measures for hosting AI models.

2. Assess failover mechanisms that ensure model availability during infrastructure failures. 

3. Verify documentation of redundancy architecture and recovery procedures. 

4. Confirm that the redundancy implementation aligns with service level agreements and business continuity requirements. Verify that redundant implementations do not increase exposure to data poisoning or model theft.

Standards mappings

ISO 42001: No Gap

ISO 42001 A.4.5 - System and computing resources
ISO 42001 B.4.5 - System and computing resources
ISO 27001 6.1.2 - Information security risk assessment
ISO 27001 A.8.13 - Information backup
ISO 27001 A.8.14 - Redundancy of information processing facilities
Reviewer comment on A.8.14: disagree. Use of the cloud is expected, so the control should ensure that redundant availability zones/regions are selected.

Addendum

N/A

EU AI Act: No Gap

Article 15 (4)

Addendum

N/A

NIST AI 600-1: Full Gap

No Mapping

Addendum

No NIST AI 600-1 controls address risk-based evaluation of the model-serving infrastructure.

BSI AIC4: No Gap

C4 BC-04
C4 SR-01
C4 SR-02
C4 SR-06
C5 BC-03
C5 BC-04
C5 OPS-05
C5 OPS-18

Addendum

N/A

AI-CAIQ questions (2)

MDS-11.1

Is a risk-based evaluation of the model and model serving infrastructure for model failure performed?

MDS-11.2

Are measures defined and implemented to mitigate model and model serving infrastructure failures, and are they regularly evaluated throughout the AI system's lifecycle?