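LiteLLM Load Balancing, Part 2: A Complete Guide to Fallback Strategies and Failure Handling

At 13:00 UTC on April 9, 2024, an Anthropic cluster suffered an outage. One team's customer-support copilot, which depended on a single Anthropic key, was completely down for an hour. It took an engineer 14 minutes just to manually switch the SDK's base_url to OpenAI and redeploy. An OpenAI outage in November of the same year cost two hours, and a Gemini outage in February 2025 cost half a day. Had this team configured LiteLLM fallbacks, all three outages could have been absorbed automatically. Fallbacks are not a "nice to have." Where the routing strategies in Part 1 distribute normal traffic, the fallbacks, retries, and cooldowns in Part 2 are the safety net that keeps the system alive during a failure.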

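This post in brief:
→ Three kinds of fallback: fallbacks (general errors), content_policy_fallbacks, context_window_fallbacks
→ default_fallbacks: a global backup for model groups that have no fallback configured
→ Fallbacks run in order; list order is priority
→ Retries: num_retries + retry_after + a per-exception retry_policy
→ RateLimitError gets automatic exponential backoff
→ Cooldown: allowed_fails + cooldown_time, counted over a sliding window with a 1-minute TTL
→ AllowedFailsPolicy: a separate cooldown threshold per error type
→ enable_pre_call_checks: catch context window overflows before the API call is made
→ enable_weighted_failover: retry deployments in the same group first, then escalate across groups
→ ⚠️ Latency trap: 3 attempts × 10-second timeout, plus backoff = up to 33 seconds at P99

LiteLLM's exception normalization: why fallbacks work regardless of provider

For a fallback system to work, it has to decide consistently whether an error is a 429, a 413, or a content policy violation. Every provider formats its errors differently. LiteLLM automatically maps each provider's exceptions onto standard, OpenAI-style exceptions.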

# LiteLLM standard exception hierarchy
litellm.AuthenticationError       # authentication failure (401)
litellm.RateLimitError            # rate limit exceeded (429)
litellm.ContextWindowExceededError # context window exceeded
litellm.ContentPolicyViolationError # content policy violation
litellm.BadRequestError           # bad request (400)
litellm.ServiceUnavailableError   # service unavailable (503)
litellm.Timeout                   # timeout
litellm.APIConnectionError        # connection failure

# Azure's "ContextLengthExceeded" = Anthropic's "too many tokens"
# = litellm.ContextWindowExceededError
# → the same fallback logic applies no matter which provider raised the error

Thanks to this normalization, you configure context_window_fallbacks once and the same fallback runs whichever provider reports the context overflow.
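To make the idea concrete, here is a minimal sketch of what normalization means. It does not reproduce LiteLLM's internal mapping: the exception classes, the `_RULES` table, and the `normalize` function are all hypothetical stand-ins, and the matching rules are assumptions chosen to mirror the two provider messages quoted above.

```python
class ContextWindowExceeded(Exception):
    """Stand-in for a normalized 'context window exceeded' error."""


class RateLimited(Exception):
    """Stand-in for a normalized 'rate limit' error."""


class UnknownProviderError(Exception):
    """Anything the rule table below does not recognize."""


# Hypothetical rules: (HTTP status, lowercase message fragment, normalized class).
# An empty fragment matches any message with that status.
_RULES = [
    (429, "", RateLimited),
    (400, "contextlengthexceeded", ContextWindowExceeded),
    (400, "too many tokens", ContextWindowExceeded),
]


def normalize(status_code: int, message: str) -> Exception:
    """Map a provider-specific error onto one shared exception type."""
    lowered = message.lower()
    for code, fragment, exc_type in _RULES:
        if status_code == code and fragment in lowered:
            return exc_type(message)
    return UnknownProviderError(message)
```

Once two differently worded provider errors collapse into one class, downstream fallback logic only needs a single `isinstance` check.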

Three kinds of fallback + default_fallbacks

Fallbacks are split into three kinds by error type. Each one is triggered by a different error.

from litellm import Router

router = Router(
    model_list=[
        # Primary high-performance model
        {
            "model_name": "primary",
            "litellm_params": {"model": "claude-opus-4-7", "api_key": "..."},
        },
        # General fallback target
        {
            "model_name": "fallback-gpt",
            "litellm_params": {"model": "gpt-5.5", "api_key": "..."},
        },
        # Model with a large context window
        {
            "model_name": "long-context",
            "litellm_params": {
                "model": "gemini-3.5-flash",
                "api_key": "...",
            },
        },
        # Model with a more lenient content policy
        {
            "model_name": "lenient-model",
            "litellm_params": {
                "model": "claude-sonnet-4-6",
                "api_key": "...",
            },
        },
        # Global last line of defense
        {
            "model_name": "last-resort",
            "litellm_params": {
                "model": "gemini-3.5-flash",
                "api_key": "...",
            },
        },
    ],

    # ① General error fallback (RateLimitError, 5xx, Timeout, and so on)
    # primary fails → try fallback-gpt → that fails too → try long-context
    fallbacks=[
        {"primary": ["fallback-gpt", "long-context"]}
    ],

    # ② Content policy violation fallback (ContentPolicyViolationError)
    # primary refuses the content → try lenient-model
    content_policy_fallbacks=[
        {"primary": ["lenient-model"]}
    ],

    # ③ Context window overflow fallback (ContextWindowExceededError)
    # primary exceeds its token limit → try long-context
    context_window_fallbacks=[
        {"primary": ["long-context"]}
    ],

    # ④ Default fallback: used when a model group without the settings above fails
    default_fallbacks=["last-resort"],
)

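How the fallback to run is chosen:

Error occurs
    ↓
Classify the error type
    ├─ ContextWindowExceededError → run context_window_fallbacks
    ├─ ContentPolicyViolationError → run content_policy_fallbacks
    └─ Any other error → run fallbacks
            ↓
    Is a fallback configured for this model_name?
        YES → run the configured fallbacks in order
        NO  → run default_fallbacks
                ↓
    default_fallbacks missing, or every fallback failed → return the error

Fallbacks run in list order. The chain stops at the first fallback that succeeds and moves to the next one on failure.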
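The decision flow above can be sketched in a few lines of plain Python. This illustrates the ordering rules only and is not LiteLLM's implementation: `pick_chain`, `call_with_fallbacks`, and the two exception classes are hypothetical names, and the fallback tables are simplified to plain dicts instead of the Router's list-of-dicts form.

```python
class ContextWindowError(Exception):
    """Stand-in for ContextWindowExceededError."""


class ContentPolicyError(Exception):
    """Stand-in for ContentPolicyViolationError."""


def pick_chain(error, model, fallbacks, context_window_fallbacks,
               content_policy_fallbacks, default_fallbacks):
    """Choose the fallback list for this error type and model group."""
    if isinstance(error, ContextWindowError):
        table = context_window_fallbacks
    elif isinstance(error, ContentPolicyError):
        table = content_policy_fallbacks
    else:
        table = fallbacks
    # No entry for this model group: use the global default list.
    return table.get(model) or list(default_fallbacks)


def call_with_fallbacks(model, call, **tables):
    """Try the model, then each fallback in list order until one succeeds."""
    try:
        return call(model)
    except Exception as error:
        last_error = error
        for candidate in pick_chain(error, model, **tables):
            try:
                return call(candidate)
            except Exception as next_error:
                last_error = next_error
        raise last_error
```

Here `call` is any function that takes a model group name and either returns a response or raises, so the walk can be exercised with a fake that fails for chosen names.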

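context_window_fallbacks in practice: pairing it with enable_pre_call_checks

For context window overflows, catching the problem up front is more efficient than handling it afterwards. Instead of sending the API call and receiving an error, the Router counts the tokens before the call and routes accordingly.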

router = Router(
    model_list=[
        {
            "model_name": "gpt-4o-mini",
            "litellm_params": {"model": "gpt-4o-mini", "api_key": "..."},
            # Context window consulted by the pre-call check. It lives in
            # model_info because max_tokens inside litellm_params is passed
            # to the provider as the output-token limit.
            "model_info": {"max_input_tokens": 128000},
        },
        {
            "model_name": "gpt-4o",
            "litellm_params": {"model": "gpt-4o", "api_key": "..."},
            "model_info": {"max_input_tokens": 128000},
        },
        {
            "model_name": "claude-long",
            "litellm_params": {"model": "claude-sonnet-4-6", "api_key": "..."},
            "model_info": {"max_input_tokens": 200000},   # 200K tokens
        },
        {
            "model_name": "gemini-long",
            "litellm_params": {"model": "gemini-3.5-flash", "api_key": "..."},
            "model_info": {"max_input_tokens": 1048576},  # about 1M tokens
        },
    ],

    # Pre-call check: count tokens before the API call → fall back at once on overflow
    # Recommended whenever context_window_fallbacks is used
    enable_pre_call_checks=True,

    context_window_fallbacks=[
        {"gpt-4o-mini": ["gpt-4o", "claude-long", "gemini-long"]},
        {"gpt-4o": ["claude-long", "gemini-long"]},
        {"claude-long": ["gemini-long"]},
    ],

    # Configure general fallbacks as well
    fallbacks=[
        {"gpt-4o-mini": ["gpt-4o", "claude-long"]},
    ],
)

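If you set context_window_fallbacks without enable_pre_call_checks=True, you only get after-the-fact handling: the request has already made a round trip to the provider, wasting latency and possibly cost. For up-front blocking, configure both together.

# How enable_pre_call_checks works
# 1. Count the tokens in the request messages (tiktoken for OpenAI-style models)
# 2. Compare the count with the model's context window limit
# 3. On overflow, run context_window_fallbacks at once, without an API call
# → no wasted API call; the only added work is counting tokens

async def ask_with_long_document(very_long_document: str):
    return await router.acompletion(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": very_long_document}],  # 150K tokens
        # → the pre-call check detects that 128K is exceeded
        # → gpt-4o is tried at once → also too small → claude-long is tried → success
    )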
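The three steps above can be sketched without any provider call. This is an illustration under stated assumptions: `estimate_tokens` uses a crude four-characters-per-token heuristic in place of a real tokenizer, and `first_model_that_fits` is a hypothetical helper, not a LiteLLM function.

```python
from typing import Optional


def estimate_tokens(text: str) -> int:
    """Crude assumption: about 4 characters per token."""
    return max(1, len(text) // 4)


def first_model_that_fits(prompt: str, candidates: list,
                          context_limits: dict) -> Optional[str]:
    """Return the first candidate whose context window holds the prompt."""
    needed = estimate_tokens(prompt)
    for name in candidates:
        if needed <= context_limits[name]:
            return name
    return None  # nothing fits: the caller should raise instead of calling out


limits = {
    "gpt-4o-mini": 128_000,
    "gpt-4o": 128_000,
    "claude-long": 200_000,
    "gemini-long": 1_048_576,
}
chain = ["gpt-4o-mini", "gpt-4o", "claude-long", "gemini-long"]

# A prompt of about 150K estimated tokens skips both 128K models.
chosen = first_model_that_fits("x" * 600_000, chain, limits)
```

The choice is made purely from local arithmetic, which is why no request is sent to a model that is certain to reject it.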

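Retry strategy: num_retries and retry_policy

Basic retries

router = Router(
    model_list=[...],
    num_retries=3,        # up to 3 retries
    retry_after=0,        # minimum wait before a retry (seconds)
    timeout=30,           # timeout for a single request (seconds)
)

RateLimitError (429) gets automatic exponential backoff, with waits growing along the lines of 1 second → 2 seconds → 4 seconds. Other errors (5xx, connection errors) wait a fixed interval set by retry_after.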
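A small sketch of the backoff schedule described above, assuming a 1-second base that doubles on every retry. The `backoff_delays` function and its `cap` parameter are hypothetical; LiteLLM's actual timings may differ and can also follow a provider's Retry-After header.

```python
def backoff_delays(num_retries: int, base: float = 1.0,
                   factor: float = 2.0, cap: float = 30.0) -> list:
    """Wait before each retry: base, base*factor, base*factor^2, ... up to cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(num_retries)]


delays = backoff_delays(3)   # [1.0, 2.0, 4.0]
```

The cap keeps a long retry budget from producing waits of minutes; production backoff usually adds random jitter as well so that many clients do not retry in lockstep.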

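Per-exception retry counts: retry_policy

You can set a different retry count for each kind of error.

from litellm.router import RetryPolicy

# Field names below follow the original post. Check them against the
# RetryPolicy definition in your installed LiteLLM version, since a field
# the model does not define may be silently ignored.
router = Router(
    model_list=[...],
    retry_policy=RetryPolicy(
        # Network and connection problems: retry quickly
        ConnectTimeoutErrorRetries=3,
        ReadTimeoutErrorRetries=3,
        APIConnectionErrorRetries=3,

        # Server errors: retrying is worthwhile
        InternalServerErrorRetries=2,
        ServiceUnavailableErrorRetries=2,

        # Rate limit: retry with exponential backoff
        RateLimitErrorRetries=3,

        # Content policy violation: retrying is pointless (same result)
        ContentPolicyViolationErrorRetries=0,

        # Context overflow: retrying is pointless (use context_window_fallbacks)
        # If left unset, the default (num_retries) applies
        ContextWindowExceededErrorRetries=0,

        # Authentication error: retrying is pointless
        AuthenticationErrorRetries=0,
    ),
    num_retries=3,  # default for error types not covered by retry_policy
)

Retrying a ContentPolicyViolationError only repeats the same error and wastes time. The right pattern for errors like this is to set retries to 0 and hand off to content_policy_fallbacks immediately.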

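⚠️ The latency trap: how retries blow up P99

Scenario: 3 attempts in total (the initial call plus retries), timeout=10 seconds, LLM tail latency = 18 seconds

Worst-case flow:
  Attempt 1: wait 10 seconds → timeout
  1-second backoff
  Attempt 2: wait 10 seconds → timeout
  2-second backoff
  Attempt 3: wait 10 seconds → timeout
  → Total elapsed: 33 seconds+

The user stares at a spinner for 33 seconds

Unlike traditional microservices, LLM calls can have a P50 around 8 seconds and a tail around 18 seconds. Applying the microservice pattern (10-second timeout × 3 attempts) to an LLM unchanged roughly doubles tail latency, from about 18 seconds to 33 seconds or more. Note also that num_retries counts retries after the initial call, so num_retries=3 allows up to four attempts and an even longer worst case.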
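The arithmetic behind the 33-second figure is worth writing down, because it makes the cost of each extra attempt obvious. `worst_case_latency` is a hypothetical helper for this calculation and assumes every attempt runs until its timeout.

```python
def worst_case_latency(attempts: int, timeout_s: float, backoffs_s: list) -> float:
    """Total wait when every attempt times out.

    One backoff happens between consecutive attempts, so only the first
    (attempts - 1) backoff values are used.
    """
    waits = backoffs_s[: max(0, attempts - 1)]
    return attempts * timeout_s + sum(waits)


three_attempts = worst_case_latency(3, 10, [1, 2, 4])   # 3*10 + 1 + 2 = 33
four_attempts = worst_case_latency(4, 10, [1, 2, 4])    # 4*10 + 1 + 2 + 4 = 47
```

Each additional attempt adds a full timeout plus a larger backoff, which is why trimming the retry count protects tail latency more than trimming the backoff does.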

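# ✅ A retry strategy that fits LLMs

router = Router(
    model_list=[...],
    # Reduce the retry count
    num_retries=1,        # 3 → 1 (avoids stacking timeouts)
    timeout=15,           # a realistic timeout, sized to your measured P95
    retry_after=0,        # retry immediately when the error is not a rate limit

    # Compensate with fast fallbacks instead
    fallbacks=[
        {"primary": ["fallback-fast", "fallback-backup"]}
    ],
)

# Or, when streaming, set the timeout separately
# With streaming, time to first token (TTFT) is what matters
async def stream_primary(messages):
    return await router.acompletion(
        model="primary",
        messages=messages,
        stream=True,
        timeout=5,   # TTFT-based timeout (measured to the start, not to full completion)
    )

Retries may also fail to fire for streaming requests. There are reports that in some versions a 429 raised before the first token does not trigger the retry logic. In a streaming pipeline it is safer to implement a retry wrapper at the application level, or to set num_retries=0 and retry manually.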
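A minimal sketch of such an application-level wrapper, under stated assumptions: `stream_with_retry` and its parameters are hypothetical, and `open_stream` stands for any zero-argument callable that returns a fresh async iterator of chunks (for example a small function around a streaming router call). It retries only while nothing has been emitted yet, because retrying after the first chunk would duplicate output.

```python
import asyncio
from typing import AsyncIterator, Callable


async def stream_with_retry(
    open_stream: Callable[[], AsyncIterator[str]],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> AsyncIterator[str]:
    """Re-open the stream if it fails before the first chunk arrives."""
    for attempt in range(max_attempts):
        started = False
        try:
            async for chunk in open_stream():
                started = True
                yield chunk
            return  # stream finished normally
        except Exception:
            # Give up if output already reached the caller or attempts ran out.
            if started or attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)
```

The caller consumes it like any other async iterator (`async for chunk in stream_with_retry(make_stream): ...`). Failures that occur mid-stream are deliberately re-raised so the application can decide how to present a partial answer.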

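Cooldown: a circuit breaker that isolates failing deployments automatically

Cooldown is LiteLLM's take on the circuit breaker pattern. When a deployment fails more than a set number of times, it is temporarily taken out of rotation.

from litellm.router import AllowedFailsPolicy

router = Router(
    model_list=[...],

    # Basic cooldown settings
    allowed_fails=2,      # more than 2 failures within 1 minute → enter cooldown
    cooldown_time=60,     # how long the cooldown lasts (seconds)

    # Customize the cooldown threshold per error type
    allowed_fails_policy=AllowedFailsPolicy(
        # Authentication errors: tolerate almost none before cooling down
        AuthenticationErrorAllowedFails=1,

        # Server errors: allow up to 5
        InternalServerErrorAllowedFails=5,

        # Rate limit: allow up to 100 (429s are transient and frequent)
        RateLimitErrorAllowedFails=100,

        # Content policy violations: allow a very large number
        # (a model refusing particular content is normal behavior)
        ContentPolicyViolationErrorAllowedFails=1000,
    ),
)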

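How cooldown works internally:

failed_calls cache (in-memory, TTL = 1 minute)
    ├─ "deploy_A": [timestamp_1, timestamp_2]
    └─ "deploy_B": [timestamp_1]

Evaluated on each request:
  1. Look up the number of failures within the current minute in failed_calls
  2. If it exceeds allowed_fails → register the deployment in CooldownCache
     (deploy_A is excluded from selection for cooldown_time)
  3. After cooldown_time elapses, the deployment recovers automatically (the entry is removed)

What the 1-minute TTL means:
  Failures older than 1 minute drop out of the count automatically
  → no overreaction to a brief spike
  → only sustained failures trigger a cooldown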
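The mechanism above can be sketched as a small in-memory class. This illustrates the sliding-window idea only and is not LiteLLM's code: `CooldownTracker` and its method names are hypothetical, and the clock is injectable so the behavior can be checked without waiting a real minute.

```python
import time
from collections import defaultdict, deque


class CooldownTracker:
    """Cool a deployment down when it fails too often within a time window."""

    def __init__(self, allowed_fails=2, window_s=60.0, cooldown_s=60.0,
                 clock=time.monotonic):
        self.allowed_fails = allowed_fails
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._failures = defaultdict(deque)   # deployment -> failure timestamps
        self._cooldown_until = {}             # deployment -> recovery time

    def record_failure(self, deployment: str) -> None:
        now = self.clock()
        recent = self._failures[deployment]
        recent.append(now)
        # Drop failures that fell out of the window (the TTL).
        while recent and now - recent[0] > self.window_s:
            recent.popleft()
        if len(recent) > self.allowed_fails:
            self._cooldown_until[deployment] = now + self.cooldown_s
            recent.clear()

    def is_available(self, deployment: str) -> bool:
        until = self._cooldown_until.get(deployment)
        if until is None:
            return True
        if self.clock() >= until:
            del self._cooldown_until[deployment]   # automatic recovery
            return True
        return False
```

With `allowed_fails=2`, the third failure inside one minute trips the breaker, while three failures spread over more than a minute do not.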

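When an entire group is in cooldown:

If every deployment in a group is in cooldown, you can have the Router fall back directly to a specific deployment ID. A fallback that names a model_info.id skips the cooldown check and forces an attempt.

# Force a fallback to a specific deployment ID when the whole group is cooling down
router = Router(
    model_list=[
        {
            "model_name": "primary-group",
            "litellm_params": {"model": "claude-opus-4-7", "api_key": "..."},
            "model_info": {"id": "opus-backup-key"},  # backup key
        },
    ],
    # Naming a model_info.id directly bypasses the cooldown check
    fallbacks=[{"primary-group": ["opus-backup-key"]}],
)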

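enable_weighted_failover: retry within the group first

By default, a fallback moves to a different model group as soon as a call fails. With enable_weighted_failover=True, the Router first retries other deployments in the same group and escalates to another group only after every deployment in the group has failed.

router = Router(
    model_list=[
        # Two regions of the same model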
        {
            "model_name": "claude-production",
            "litellm_params": {
                "model": "bedrock/anthropic.claude-sonnet-4-6-v1",
                "aws_region_name": "us-east-1",
                "rpm": 1000,
            },
        },
        {
            "model_name": "claude-production",
            "litellm_params": {
                "model": "bedrock/anthropic.claude-sonnet-4-6-v1",
                "aws_region_name": "eu-west-1",
                "rpm": 800,
            },
        },
        # A completely different provider (last-resort fallback)
        {
            "model_name": "openai-backup",
            "litellm_params": {
                "model": "gpt-5.5",
                "api_key": "...",
            },
        },
    ],

    # Retry deployments in the same group (claude-production) first
    # Escalate to openai-backup only when every deployment has failed
    enable_weighted_failover=True,

    fallbacks=[{"claude-production": ["openai-backup"]}],
)

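# Flow:
# us-east-1 fails → retry on eu-west-1 (same group)
# eu-west-1 also fails → escalate to openai-backup
# (when the model is the same and only the region differs, stay on that model first)

When to use enable_weighted_failover:

Multi-region deployments of the same model (us-east-1, eu-west-1, ap-northeast-1)
Multiple API keys for the same model (spreading rate limit pools)
Cases where you want to stay on the same model as long as possible before falling back to a different one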
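The "same group first, then escalate" ordering can be illustrated with a short sketch. It models only the behavior described in this section and makes no claim about how LiteLLM implements the flag: `attempt_order` is a hypothetical function, and using each deployment's rpm as its weight is an assumption.

```python
import random


def attempt_order(group, deployments, fallbacks, rng=random):
    """List deployment ids to try: the whole group first, then fallback groups.

    deployments maps a group name to [(deployment_id, weight), ...].
    Within a group, order is a weighted shuffle (heavier entries tend to go first).
    """
    def weighted_shuffle(items):
        pool = list(items)
        ordered = []
        while pool:
            pick = rng.choices(pool, weights=[w for _, w in pool], k=1)[0]
            pool.remove(pick)
            ordered.append(pick[0])
        return ordered

    order = weighted_shuffle(deployments[group])
    for fallback_group in fallbacks.get(group, []):
        order += weighted_shuffle(deployments[fallback_group])
    return order


deployments = {
    "claude-production": [("us-east-1", 1000), ("eu-west-1", 800)],
    "openai-backup": [("openai", 1)],
}
plan = attempt_order("claude-production", deployments,
                     {"claude-production": ["openai-backup"]},
                     rng=random.Random(0))
```

Whatever the shuffle produces, both Bedrock regions always appear before the cross-provider backup, which is the property this section is about.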

The order parameter: priority-based deployment selection

Setting an order value on a deployment makes the Router try lower numbers first. If order=1 fails, it moves on to order=2. This does not apply to ContextWindowExceededError.

router = Router(
    model_list=[
        {
            "model_name": "my-model",
            "litellm_params": {
                "model": "claude-sonnet-4-6",
                "api_key": "PRIMARY_KEY",
                # order is read from litellm_params; confirm the placement
                # against the docs for your LiteLLM version
                "order": 1,   # priority 1: always tried first
            },
            "model_info": {"id": "primary"},
        },
        {
            "model_name": "my-model",
            "litellm_params": {
                "model": "claude-sonnet-4-6",
                "api_key": "SECONDARY_KEY",
                "order": 2,   # priority 2: used when primary fails
            },
            "model_info": {"id": "secondary"},
        },
        {
            "model_name": "my-model",
            "litellm_params": {
                "model": "bedrock/anthropic.claude-sonnet-4-6-v1",
                "order": 3,   # priority 3: last resort
            },
            "model_info": {"id": "bedrock-fallback"},
        },
    ],
    routing_strategy="simple-shuffle",
    # When order is set, it takes precedence over simple-shuffle
)

order and simple-shuffle can be used together. Among deployments that share the same order value, simple-shuffle applies.
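How order and shuffling combine can be shown with a small sketch. `pick_deployment` is a hypothetical helper, the flat `{"id", "order"}` dicts are a simplification of the Router's model_list entries, and a plain uniform `choice` stands in for simple-shuffle.

```python
import random


def pick_deployment(deployments, cooled_down, rng=random):
    """Pick among the healthy deployments that share the lowest order value."""
    healthy = [d for d in deployments if d["id"] not in cooled_down]
    if not healthy:
        return None
    best = min(d["order"] for d in healthy)
    top_tier = [d for d in healthy if d["order"] == best]
    return rng.choice(top_tier)["id"]   # shuffle applies only inside the tier


pool = [
    {"id": "primary", "order": 1},
    {"id": "secondary", "order": 2},
    {"id": "secondary-b", "order": 2},
    {"id": "bedrock-fallback", "order": 3},
]

first = pick_deployment(pool, cooled_down=set())          # always "primary"
second = pick_deployment(pool, cooled_down={"primary"})   # one of the order=2 pair
```

Priority is strict between tiers and random within a tier, so two keys with the same order share load while a higher-numbered tier stays idle until it is needed.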

mock_testing_fallbacks: verifying that fallbacks work

To confirm that fallbacks really work, you need to simulate a failure. LiteLLM provides a test mode for this.

import asyncio
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "primary", "litellm_params": {"model": "claude-opus-4-7", "api_key": "..."}},
        {"model_name": "backup",  "litellm_params": {"model": "gpt-5.5", "api_key": "..."}},
    ],
    fallbacks=[{"primary": ["backup"]}],
)

async def test_fallback():
    # mock_testing_fallbacks=True → force primary to fail so the fallback triggers
    response = await router.acompletion(
        model="primary",
        messages=[{"role": "user", "content": "fallback test"}],
        mock_testing_fallbacks=True,   # forces a fallback for this request only
    )
    # Inspect the response that came back from backup
    print(response)

asyncio.run(test_fallback())

On the Proxy server, add "mock_testing_fallbacks": true to the request body.

curl -X POST http://localhost:4000/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "primary",
    "messages": [{"role": "user", "content": "fallback test"}],
    "mock_testing_fallbacks": true
  }'

The Proxy response headers show whether a fallback took place.

x-litellm-attempted-retries: 1
x-litellm-attempted-fallbacks: 1
x-litellm-model-used: backup    ← the model that actually answered
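In an automated check, those headers can be reduced to a small summary. The sketch below only parses a header mapping and uses the header names listed above; `fallback_summary` is a hypothetical helper, and treating a missing header as 0 is an assumption.

```python
def fallback_summary(headers: dict) -> dict:
    """Summarize retry and fallback counts from Proxy response headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    retries = int(lowered.get("x-litellm-attempted-retries", 0))
    fallbacks = int(lowered.get("x-litellm-attempted-fallbacks", 0))
    return {
        "retries": retries,
        "fallbacks": fallbacks,
        "used_fallback": fallbacks > 0,
    }


summary = fallback_summary({
    "X-LiteLLM-Attempted-Retries": "1",
    "x-litellm-attempted-fallbacks": "1",
})
```

A test that sends a request with mock_testing_fallbacks enabled can then assert on `summary["used_fallback"]` instead of comparing raw strings.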

Complete example: a production-grade failure handling configuration

from litellm import Router
from litellm.router import RetryPolicy, AllowedFailsPolicy

router = Router(
    model_list=[
        # Tier 1: primary, high performance
        {
            "model_name": "llm",
            "litellm_params": {
                "model": "claude-sonnet-4-6",
                "api_key": "PRIMARY_ANTHROPIC_KEY",
                "rpm": 1000, "tpm": 200000,
                "order": 1,   # priority is read from litellm_params
            },
            "model_info": {"id": "claude-primary"},
        },
        {
            "model_name": "llm",
            "litellm_params": {
                "model": "bedrock/anthropic.claude-sonnet-4-6-v1",
                "aws_region_name": "us-east-1",
                "rpm": 1000, "tpm": 200000,
                "order": 2,
            },
            "model_info": {"id": "claude-bedrock-east"},
        },
        # Tier 2: cross-provider fallback
        {
            "model_name": "llm-xprovider",
            "litellm_params": {
                "model": "gpt-5.5",
                "api_key": "OPENAI_KEY",
                "rpm": 500, "tpm": 100000,
            },
        },
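        # Long-context fallback
        {
            "model_name": "llm-long",
            "litellm_params": {
                "model": "gemini-3.5-flash",
                "api_key": "GEMINI_KEY",
            },
            # Context window for the pre-call check; max_tokens inside
            # litellm_params would be sent as the output-token limit.
            "model_info": {"max_input_tokens": 1048576},
        },
        # Content policy fallback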
        {
            "model_name": "llm-lenient",
            "litellm_params": {
                "model": "claude-sonnet-4-6",
                "api_key": "SECONDARY_ANTHROPIC_KEY",
            },
        },
    ],

    routing_strategy="simple-shuffle",
    enable_pre_call_checks=True,
    enable_weighted_failover=True,

    # Fallback chains
    fallbacks=[
        {"llm": ["llm-xprovider"]}
    ],
    context_window_fallbacks=[
        {"llm": ["llm-long"]},
        {"llm-xprovider": ["llm-long"]},
    ],
    content_policy_fallbacks=[
        {"llm": ["llm-lenient"]}
    ],
    default_fallbacks=["llm-xprovider"],

    # Retries (avoiding the latency trap)
    num_retries=1,            # retry sparingly
    timeout=20,               # a realistic timeout
    retry_policy=RetryPolicy(
        RateLimitErrorRetries=2,
        AuthenticationErrorRetries=0,
        ContentPolicyViolationErrorRetries=0,
        ContextWindowExceededErrorRetries=0,
        InternalServerErrorRetries=1,
    ),

    # Cooldown
    allowed_fails=3,
    cooldown_time=60,
    allowed_fails_policy=AllowedFailsPolicy(
        RateLimitErrorAllowedFails=100,
        ContentPolicyViolationErrorAllowedFails=1000,
        AuthenticationErrorAllowedFails=1,
        InternalServerErrorAllowedFails=5,
    ),
)

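✅ Conclusion

Situation	Setting
General failures (5xx, timeouts)	fallbacks + num_retries=1 to 2
Context overflow	context_window_fallbacks + enable_pre_call_checks=True
Content refusal	content_policy_fallbacks + ContentPolicyViolationErrorRetries=0
Protection against missing config	default_fallbacks
Same model across multiple regions	enable_weighted_failover=True
P99 latency protection	num_retries=1, timeout=a realistic value
Isolating sustained failures	allowed_fails + cooldown_time
Fallback verification	mock_testing_fallbacks=True
Different thresholds per error	AllowedFailsPolicy + RetryPolicy

When the three layers of fallback, retry, and cooldown work together, LiteLLM absorbs a single-provider outage without users noticing. The next part covers deploying all of these settings to production with Redis and the Proxy server.

LiteLLM series (complete)

✅ Router architecture and 6 routing strategies https://cell-devlog.tistory.com/273
✅ Fallback strategies and failure handling https://cell-devlog.tistory.com/274
✅ Production deployment: Redis + Proxy server https://cell-devlog.tistory.com/275
✅ Advanced routing and real-world architectures https://cell-devlog.tistory.com/276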
