AI Agent Production Cost Bomb: Why Your LLM Bill Comes Out 10x Higher Than Expected

cell-devlog 2026. 5. 21. 09:35

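In local testing, a single request cost $0.02. The monthly production bill came to $10,000. The numbers don't line up, because an agent is not a stateless API call. It loops, retries, accumulates context, and thinks, and every one of those steps is billed in tokens.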

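[Key takeaways]
→ Actual agent cost = cost of a simple API call × 5 to 50
→ 5 main waste patterns: retry loops, context accumulation, Thinking left on defaults, oversized tool results, runaway error amplification
→ A single retry loop that re-sends the same context 10 times costs 10x
→ Context accumulation: every turn re-sends the entire history, so a 50-turn conversation uses dozens of times the input tokens
→ Thinking left on defaults: Claude and Gemini models that default to medium or high reasoning burn large numbers of reasoning tokens even on short tasks
→ Diagnostics: tracing actual token usage with Langfuse, AgentOps, or claude-devtools is a must
→ Savings you can apply right away: caching (up to 90% on cached input), context compression, model routing, budget gates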

Why the bill comes out 10x higher than expected: 5 patterns

Pattern 1: Retry loops blow up the cost

# ❌ Common mistake: every retry appends to the same context
import time

import anthropic

client = anthropic.Anthropic()

# validate_result() and extract_error_reason() are application-specific
# helpers you supply; they are not part of the SDK.

def bad_agent_with_retry(task: str, max_retries: int = 10):
    messages = [{"role": "user", "content": task}]

    for attempt in range(max_retries):
        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=8096,
            messages=messages
        )

        result = response.content[0].text

        # Validate the result
        if not validate_result(result):
            # Problem: each failure appends the failed output and a retry
            # message, so the history keeps growing
            messages.append({"role": "assistant", "content": result})
            messages.append({"role": "user", "content": f"Attempt {attempt+1} failed. Try again."})
            # → every later call re-sends that whole, ever-longer history as input
            continue

        return result

# ✅ Better retry: reset the context and back off exponentially
def good_agent_with_retry(task: str, max_retries: int = 3):
    last_error = None

    for attempt in range(max_retries):
        # Fresh context on every attempt, so nothing accumulates
        messages = [
            {
                "role": "user",
                "content": task if attempt == 0 else
                           f"{task}\n\n[The previous attempt failed because: {last_error}. Keep this in mind and try again.]"
            }
        ]

        response = client.messages.create(
            model="claude-sonnet-4-6",  # a cheaper model
            max_tokens=2048,            # a smaller output cap
            messages=messages
        )

        result = response.content[0].text

        if validate_result(result):
            return result

        last_error = extract_error_reason(result)
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # exponential backoff (skipped after the final attempt)

    raise RuntimeError(f"Maximum retries exceeded: {last_error}")

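[Retry cost estimate]
Prompt: 5,000 tokens
Response: 1,000 tokens

Accumulating context, 10 attempts:
→ Each failed attempt adds about 1,000 tokens to the history (the failed response plus a short retry message)
→ Attempt 1: 5,000 input → attempt 2: ~6,000 → ... → attempt 10: ~14,000 input
→ Total input: ~95,000 tokens
→ At Claude Opus input rates: 95,000 × $5/M ≈ $0.48

Context reset, 3 attempts:
→ About 5,000 tokens per attempt × 3 = 15,000 tokens
→ Cost: 15,000 × $5/M = $0.075

→ Roughly a 6x difference in input cost alone, before counting the output tokens of the extra attempts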
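The arithmetic behind this comparison can be reproduced with a few lines of Python. This is a minimal sketch, not a measurement: the function names are made up for illustration, and the growth per retry is the assumption that drives the result. It is about 1,000 tokens if each failure appends only the response and a short message (as in the accumulating example above), and about 6,000 if the prompt is effectively re-sent along with it.

```python
OPUS_INPUT_USD_PER_MTOK = 5.0  # input rate quoted in this post


def accumulated_input_tokens(prompt: int, growth_per_retry: int, attempts: int) -> int:
    """Total input tokens when every attempt re-sends the whole growing history."""
    return sum(prompt + k * growth_per_retry for k in range(attempts))


def reset_input_tokens(prompt: int, attempts: int) -> int:
    """Total input tokens when every attempt starts from a fresh context."""
    return prompt * attempts


def input_cost_usd(tokens: int, usd_per_mtok: float = OPUS_INPUT_USD_PER_MTOK) -> float:
    return tokens * usd_per_mtok / 1_000_000


appending_only = accumulated_input_tokens(5_000, 1_000, 10)   # response appended each time
resending_all = accumulated_input_tokens(5_000, 6_000, 10)    # prompt + response appended each time
with_reset = reset_input_tokens(5_000, 3)

for label, tokens in [("append response", appending_only),
                      ("append prompt + response", resending_all),
                      ("reset, 3 attempts", with_reset)]:
    print(f"{label}: {tokens:,} input tokens, ${input_cost_usd(tokens):.3f}")
```

Plugging in your own prompt size and retry count shows quickly whether the retry loop or something else dominates the bill.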

Pattern 2: Multi-turn context accumulation

# ❌ Problem: the full history is re-sent on every turn
class BadChatAgent:
    def __init__(self):
        self.messages = []  # grows without bound

    def chat(self, user_input: str) -> str:
        self.messages.append({"role": "user", "content": user_input})

        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            messages=self.messages  # the entire history, every turn
        )

        reply = response.content[0].text
        self.messages.append({"role": "assistant", "content": reply})
        return reply

# Input tokens over a 50-turn conversation:
# Turn 1: 100 tokens
# Turn 10: ~2,000 tokens
# Turn 30: ~15,000 tokens
# Turn 50: ~40,000 tokens
# Cumulative: ~500,000 input tokens for a single conversation
# → about $1.50 at Sonnet's $3/M input rate (the model used above), or $2.50 at Opus's $5/M

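# ✅ Sliding window + summary compression
class EfficientChatAgent:
    def __init__(self, window_size: int = 10, summary_threshold: int = 20):
        self.messages = []
        self.summary = ""
        self.window_size = window_size
        self.summary_threshold = summary_threshold

    def _compress_history(self):
        """Compress older history into a summary."""
        if len(self.messages) < self.summary_threshold:
            return

        # Span to compress (everything except the most recent window_size messages)
        to_compress = self.messages[:-self.window_size]
        transcript = "\n".join(f"{m['role']}: {m['content'][:200]}"
                               for m in to_compress)
        # Carry the previous summary forward so earlier context is not lost
        previous = f"Summary so far:\n{self.summary}\n\n" if self.summary else ""

        compress_response = client.messages.create(
            model="claude-haiku-4-5",  # a cheaper model is enough for compression
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"{previous}Summarize the following conversation in 3-5 sentences, "
                           f"keeping only the key points:\n\n{transcript}"
            }]
        )

        self.summary = compress_response.content[0].text
        # Keep only the most recent window_size messages
        self.messages = self.messages[-self.window_size:]

    def _window(self) -> list:
        """Most recent messages, trimmed so the first one is a user turn
        (the API expects the conversation to start with a user message)."""
        recent = self.messages[-self.window_size:]
        while recent and recent[0]["role"] != "user":
            recent = recent[1:]
        return recent

    def chat(self, user_input: str) -> str:
        self.messages.append({"role": "user", "content": user_input})

        # Compress periodically
        if len(self.messages) > self.summary_threshold:
            self._compress_history()

        # Inject the summary through the system prompt (omitted while it is empty)
        extra = ({"system": f"Summary of the earlier conversation:\n{self.summary}"}
                 if self.summary else {})

        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=self._window(),  # send only the recent turns
            **extra
        )

        reply = response.content[0].text
        self.messages.append({"role": "assistant", "content": reply})
        return reply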
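To get a feel for how much a window saves, the cumulative input can be simulated without calling any API. This is a rough sketch under simplifying assumptions that are not in the original code: every message is the same size (200 tokens, chosen so the full-history total lands on the ~500,000 figure above), the summary costs a flat 150 tokens per call once the window is full, and the summarizer's own calls are not counted. `cumulative_input` is a name made up for illustration.

```python
from typing import Optional


def cumulative_input(turns: int, tokens_per_message: int,
                     window: Optional[int] = None, summary_tokens: int = 0) -> int:
    """Sum the input tokens sent over a whole conversation.

    On turn t the history holds 2*t - 1 messages (t user turns, t - 1 replies).
    With a window, at most `window` messages are sent, plus a fixed-size summary.
    """
    total = 0
    for t in range(1, turns + 1):
        history = 2 * t - 1
        if window is not None and history > window:
            total += window * tokens_per_message + summary_tokens
        else:
            total += history * tokens_per_message
    return total


full_history = cumulative_input(turns=50, tokens_per_message=200)
windowed = cumulative_input(turns=50, tokens_per_message=200,
                            window=10, summary_tokens=150)

print(f"full history: {full_history:,} input tokens")
print(f"window of 10: {windowed:,} input tokens")
print(f"ratio: {full_history / windowed:.1f}x")
```

Full history grows quadratically with the number of turns, while the windowed version grows linearly, so the gap widens the longer the conversation runs.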

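Pattern 3: Thinking mode left on defaults

# ❌ Problem: using the Thinking default as-is
# Defaults assumed in this post (they differ by provider and model version,
# so confirm them in the current API docs):
# Claude Opus 4.7: default thinking_level="medium"
# Gemini 3.5 Flash: default thinking_level="medium"
# → even a simple classification task spends hundreds to thousands of tokens on reasoning

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,
    # no thinking setting = the model falls back to its default reasoning level
    messages=[{"role": "user", "content": "Decide whether the sentiment of this text is positive or negative: 'I like it'"}]
)
# → thinking tokens: ~500 (unnecessary)
# → actual answer: 10 tokens
# → waste: 98%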

# ✅ Control the Thinking level per task
def classify_with_appropriate_thinking(text: str, task_type: str) -> str:
    """
    Adjust the Thinking level to the complexity of the task.
    """
    # Per-task settings
    configs = {
        # Simple classification → no Thinking
        "classification": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 100,
            "thinking": {"type": "disabled"}  # Thinking fully disabled
        },
        # Code generation → low Thinking
        "code_simple": {
            "model": "claude-sonnet-4-6",
            "max_tokens": 2048,
            "thinking": {"type": "enabled", "budget_tokens": 1024}  # capped
        },
        # Complex reasoning → high Thinking
        "complex_reasoning": {
            "model": "claude-opus-4-7",
            "max_tokens": 8096,
            "thinking": {"type": "enabled", "budget_tokens": 8000}  # must stay below max_tokens
        },
    }

    # Unknown task types fall back to the low-Thinking setting
    config = configs.get(task_type, configs["code_simple"])

    response = client.messages.create(
        model=config["model"],
        max_tokens=config["max_tokens"],
        thinking=config["thinking"],
        messages=[{"role": "user", "content": text}]
    )

    # With Thinking enabled the response starts with thinking blocks,
    # so return the text block rather than content[0]
    return next(block.text for block in response.content if block.type == "text")

# Simple classification: Thinking disabled, so no reasoning tokens are billed
result = classify_with_appropriate_thinking(
    "What is the sentiment of 'This product is really bad'?",
    task_type="classification"
)

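[Cost by Thinking level: the same classification task, 100 requests]
Sonnet rates: $3/M input, $15/M output. Thinking tokens are billed as output tokens.

Thinking disabled:
→ Input: 50 tokens × 100 = 5,000 tokens
→ Output: 5 tokens × 100 = 500 tokens
→ Cost: 5,000 × $3/M + 500 × $15/M = $0.0225

Thinking medium (default):
→ Input: 50 tokens × 100 = 5,000 tokens
→ Thinking: 800 tokens × 100 = 80,000 tokens
→ Output: 5 tokens × 100 = 500 tokens
→ Cost: 5,000 × $3/M + 80,500 × $15/M = $1.2225

→ About a 54x difference (on a simple task)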
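A short sketch for recomputing this comparison with your own token counts and rates. It assumes reasoning tokens are billed at the output rate (worth confirming on your provider's pricing page) and uses the Sonnet rates of $3/M input and $15/M output that appear in the tracking code later in this post. `batch_cost_usd` is a hypothetical helper name.

```python
def batch_cost_usd(requests: int, input_tokens: int, output_tokens: int,
                   thinking_tokens: int = 0,
                   input_usd_per_mtok: float = 3.0,
                   output_usd_per_mtok: float = 15.0) -> float:
    """Cost of a batch of identical requests; token counts are per request."""
    billed_output = output_tokens + thinking_tokens  # assumption: thinking billed as output
    per_request = (input_tokens * input_usd_per_mtok
                   + billed_output * output_usd_per_mtok)
    return requests * per_request / 1_000_000


no_thinking = batch_cost_usd(100, input_tokens=50, output_tokens=5)
medium_thinking = batch_cost_usd(100, input_tokens=50, output_tokens=5,
                                 thinking_tokens=800)

print(f"thinking disabled: ${no_thinking:.4f}")
print(f"thinking medium:   ${medium_thinking:.4f}")
print(f"ratio: {medium_thinking / no_thinking:.0f}x")
```

Because the visible answer is only a few tokens, almost the entire bill in the second case is reasoning that the task did not need.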

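Pattern 4: Oversized tool results

# ❌ Problem: the full tool output goes into the context as-is
# TOOLS and execute_tool() stand in for your own tool schema and dispatcher.
def bad_tool_agent(task: str):
    messages = [{"role": "user", "content": task}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=TOOLS,
            messages=messages
        )

        # Append the tool result unchanged
        if response.stop_reason == "tool_use":
            tool_use = next(b for b in response.content
                            if b.type == "tool_use")
            tool_result = execute_tool(tool_use.name, tool_use.input)
            # Problem: a 10,000-row DB query result, an entire log file,
            # and so on pile up in the context untouched
            messages.append({"role": "assistant", "content": response.content})
            messages.append({
                "role": "user",
                "content": [{"type": "tool_result",
                             "tool_use_id": tool_use.id,
                             "content": str(tool_result)}]  # unbounded size
            })
        else:
            return response.content[0].text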

# ✅ Trim tool results and keep only the relevant parts
from typing import Any

def smart_tool_agent(task: str, max_tool_result_tokens: int = 2000,
                     max_steps: int = 20):
    messages = [{"role": "user", "content": task}]

    for _ in range(max_steps):  # bounded loop instead of while True
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            tools=TOOLS,
            messages=messages
        )

        if response.stop_reason != "tool_use":
            return response.content[0].text

        # One response can hold several tool_use blocks; each needs a tool_result
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            raw_result = execute_tool(block.name, block.input)

            # Preprocess the tool result before it enters the context
            processed_result = preprocess_tool_result(
                tool_name=block.name,
                result=raw_result,
                max_tokens=max_tool_result_tokens
            )
            tool_results.append({"type": "tool_result",
                                 "tool_use_id": block.id,
                                 "content": processed_result})

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    raise RuntimeError("Step limit reached without a final answer")

def preprocess_tool_result(tool_name: str, result: Any,
                           max_tokens: int) -> str:
    """Preprocess a tool result so it uses the context efficiently."""

    if tool_name == "query_database":
        # DB results: keep the first rows and report how many were dropped
        # (format_rows() is an application-specific helper)
        rows = result.get("rows", [])
        if len(rows) > 50:
            sample = rows[:20]
            return (f"First 20 of {len(rows)} rows:\n"
                    f"{format_rows(sample)}\n"
                    f"Remaining {len(rows)-20} rows omitted")

    elif tool_name == "read_file":
        # Files: keep the first and last 500 characters, note the omitted middle
        content = result.get("content", "")
        if len(content) > 5000:
            return (f"[Start of file]\n{content[:500]}\n\n"
                    f"...[{len(content)-1000} characters omitted]...\n\n"
                    f"[End of file]\n{content[-500:]}")

    elif tool_name == "web_search":
        # Search results: title + snippet only (no full page bodies)
        results = result.get("results", [])
        return "\n".join([
            f"- {r['title']}: {r['snippet']}"
            for r in results[:5]  # top 5 only
        ])

    return str(result)[:max_tokens * 4]  # last-resort cap (~4 characters per token)
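The last-resort cap above keeps only the beginning of an oversized result. For outputs such as logs, where the end often matters as much as the start, one alternative is to keep both ends and drop the middle. The helper below is a minimal sketch with a made-up name (`truncate_middle`), working in characters rather than tokens.

```python
def truncate_middle(text: str, max_chars: int) -> str:
    """Keep the head and tail of `text`, replacing the middle with a marker.

    The marker itself is extra, so the result can be slightly longer
    than max_chars.
    """
    if len(text) <= max_chars:
        return text
    keep = max_chars // 2
    head = text[:keep]
    tail = text[-keep:] if keep else ""  # text[-0:] would return the whole string
    omitted = len(text) - len(head) - len(tail)
    return f"{head}\n...[{omitted} chars omitted]...\n{tail}"


log = "\n".join(f"line {i}: ok" for i in range(1000)) + "\nline 1000: FATAL disk full"
trimmed = truncate_middle(log, max_chars=200)
print(trimmed)
```

In this example the fatal error on the final line survives the trim, which a head-only cut would have discarded.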

Pattern 5: Runaway error amplification

# ❌ Problem: error messages pollute the context while the loop keeps going
# execute_code() is an application-specific helper (for example, a sandbox runner).
def error_amplifying_agent(code: str):
    messages = [{"role": "user", "content": f"Run this code:\n{code}"}]
    errors_seen = []

    for i in range(20):  # up to 20 attempts
        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=8096,
            messages=messages
        )

        exec_result = execute_code(response.content[0].text)

        if exec_result.success:
            return exec_result.output

        # Problem: the full stack trace is appended on every iteration
        messages.append({"role": "assistant",
                         "content": response.content[0].text})
        messages.append({"role": "user",
                         "content": f"An error occurred:\n{exec_result.full_traceback}"})
        # → 20 iterations × long stack traces = tens of thousands of tokens

# ✅ Normalize errors and add loop exit conditions
# extract_code(), execute_code(), and escalate_to_stronger_model() are
# application-specific helpers assumed to exist.
def robust_code_agent(code: str, max_attempts: int = 3):
    attempt = 0
    previous_errors = set()  # detects repeated errors
    normalized_error = ""    # filled in after the first failed attempt

    while attempt < max_attempts:
        # Clean context on every attempt
        error_hint = ""
        if attempt > 0:
            error_hint = f"\n\n[Note: the previous attempt failed with {normalized_error}. This part needs fixing.]"

        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            messages=[{
                "role": "user",
                "content": f"Fix the bug in this code:\n{code}{error_hint}"
            }]
        )

        fixed_code = extract_code(response.content[0].text)
        exec_result = execute_code(fixed_code)

        if exec_result.success:
            return exec_result.output

        # Normalize the error (stack trace → core message)
        normalized_error = normalize_error(exec_result.full_traceback)

        # Same error again → escalate to a stronger model
        if normalized_error in previous_errors:
            return escalate_to_stronger_model(code, normalized_error)

        previous_errors.add(normalized_error)
        attempt += 1

    return None  # every attempt failed

def normalize_error(traceback: str) -> str:
    """Compress a long stack trace into its core error message."""
    lines = traceback.strip().split("\n")
    # Take only the last non-empty line, which carries the exception message
    error_line = next(
        (l for l in reversed(lines) if l.strip()),
        "Unknown error"
    )
    return error_line[:200]  # at most 200 characters
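A quick way to check the normalization step is to run it on a real traceback. The sketch below repeats `normalize_error` so it runs on its own; `capture_traceback` is a helper made up for this example, and the set lookup mirrors the repeated-error check in the agent above.

```python
import traceback


def normalize_error(tb: str) -> str:
    """Compress a long stack trace into its last non-empty line."""
    lines = tb.strip().split("\n")
    error_line = next((l for l in reversed(lines) if l.strip()), "Unknown error")
    return error_line[:200]


def capture_traceback(fn) -> str:
    """Run fn and return the formatted traceback if it raises, else ''."""
    try:
        fn()
    except Exception:
        return traceback.format_exc()
    return ""


full = capture_traceback(lambda: 1 / 0)
short = normalize_error(full)
print(f"{len(full)} characters of traceback -> {short!r}")

# Repeated-error detection: the same normalized message on a second failure
previous_errors = {short}
second = normalize_error(capture_traceback(lambda: 2 / 0))
print("repeat detected:", second in previous_errors)
```

Only the final exception line goes back to the model, so each retry adds a few dozen characters instead of a full stack trace.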

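In practice: setting up cost tracking

# Track real agent token usage with Langfuse
# pip install langfuse anthropic
# Note: langfuse.decorators is the import path of the v2 Python SDK. Newer SDK
# versions expose observe() differently, so match this to your installed version.

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
import anthropic

# The @observe decorator typically reads its credentials from the
# LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST environment
# variables, so set those as well.
langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com"
)
client = anthropic.Anthropic()

@observe()  # automatic tracing
def agent_loop(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    total_input = 0
    total_output = 0
    steps = []  # per-step usage; updating flat metadata would overwrite earlier steps

    for step in range(10):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            messages=messages
        )

        # Accumulate token usage on every step
        total_input += response.usage.input_tokens
        total_output += response.usage.output_tokens
        steps.append({
            "step": step,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
        })

        # Record per-step usage and running cost in Langfuse
        langfuse_context.update_current_observation(
            metadata={
                "steps": steps,
                "cumulative_cost_usd": (
                    total_input * 3 + total_output * 15
                ) / 1_000_000  # Sonnet rates: $3/M input, $15/M output
            }
        )

        if response.stop_reason == "end_turn":
            break

        # Simplified: a real agent would run tools and append their results here
        messages.append({"role": "assistant",
                        "content": response.content[0].text})

    return response.content[0].text

# After a run, check the Langfuse dashboard for:
# → which step the tokens blew up in
# → how much tool results polluted the context
# → the gap between actual and expected cost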

Cost budget gate

from dataclasses import dataclass, field

@dataclass
class BudgetGate:
    """Enforce a token budget on an agent run."""
    max_total_tokens: int = 50_000
    max_input_per_call: int = 20_000
    warn_threshold: float = 0.8  # warn at 80% of the budget

    input_used: int = field(default=0, init=False)
    output_used: int = field(default=0, init=False)

    @property
    def total_used(self) -> int:
        return self.input_used + self.output_used

    @property
    def budget_remaining(self) -> int:
        return self.max_total_tokens - self.total_used

    def check_and_record(self, response) -> None:
        # Runs after the call, so the call that crosses the limit is still billed;
        # the gate stops the ones that would follow
        usage = response.usage
        self.input_used += usage.input_tokens
        self.output_used += usage.output_tokens

        ratio = self.total_used / self.max_total_tokens

        if ratio >= 1.0:
            raise RuntimeError(
                f"Token budget exceeded: {self.total_used:,}/{self.max_total_tokens:,}"
            )

        if ratio >= self.warn_threshold:
            print(f"⚠️ {ratio*100:.0f}% of budget used "
                  f"({self.total_used:,}/{self.max_total_tokens:,})")

    def check_input_size(self, messages: list) -> None:
        """Check the input size before calling the API."""
        # Rough estimate at ~4 characters per token. That ratio fits English text
        # and can undercount other languages such as Korean; use the count_tokens
        # API when you need real numbers
        estimated = sum(len(str(m["content"])) // 4 for m in messages)
        if estimated > self.max_input_per_call:
            raise ValueError(
                f"Estimated input exceeds the per-call limit: ~{estimated:,} tokens"
            )

# Usage
def budgeted_agent(task: str) -> str:
    gate = BudgetGate(max_total_tokens=30_000)  # one gate per run, so budgets don't leak across tasks
    messages = [{"role": "user", "content": task}]

    for _ in range(20):
        gate.check_input_size(messages)  # pre-flight check before the call

        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=messages
        )

        gate.check_and_record(response)  # record usage and enforce the budget

        if response.stop_reason == "end_turn":
            print(f"✅ Done: {gate.total_used:,} tokens used")
            return response.content[0].text

        messages.append({"role": "assistant",
                        "content": response.content[0].text})

    raise RuntimeError("Step limit reached without a final answer")
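Quick savings checklist

[Apply immediately: no code changes]
☐ Check the Thinking level on Claude/Gemini → minimal or disabled for simple tasks
☐ Review max_tokens → don't set it more than about 2x your actual output length
☐ Cap the number of agent loop iterations (with no cap, the loop can run forever)

[Apply this week]
☐ Reset the context in retry logic (the biggest win)
☐ Sliding window over multi-turn history (send only the last N turns)
☐ Trim tool results (upper bounds on DB results and file contents)
☐ Connect Langfuse or AgentOps → see the real consumption pattern

[By next month]
☐ Apply prompt caching (a cached system prompt saves up to 90% on the cached input tokens)
☐ Route models per task (classification → Haiku, coding → Sonnet, design → Opus)
☐ Implement a budget gate (automatic cutoff when spend exceeds the estimate)

[Cost formula]
Actual agent cost = API cost × retry multiplier × context accumulation multiplier
→ 5 retries + context accumulation = 10 to 25x the base cost
→ Target after optimization: within 2 to 3x the base cost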
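The caching and routing items from the checklist can be combined in one small request builder. This is a sketch, not the author's implementation: `MODEL_BY_TASK` and `build_request` are hypothetical names, the model IDs are simply the ones used elsewhere in this post, and the `cache_control` marker follows Anthropic's prompt caching format for system blocks. Providers apply a minimum prompt length for caching, so a very short system prompt like the one in the usage line may not be cached at all.

```python
from typing import Any

# Task → model mapping taken from the checklist above (model IDs as used in this post)
MODEL_BY_TASK = {
    "classification": "claude-haiku-4-5",
    "coding": "claude-sonnet-4-6",
    "design": "claude-opus-4-7",
}
DEFAULT_MODEL = "claude-sonnet-4-6"  # assumption: mid-tier model for unknown task types


def build_request(task_type: str, system_prompt: str, user_text: str,
                  max_tokens: int = 1024) -> dict[str, Any]:
    """Build keyword arguments for client.messages.create()."""
    return {
        "model": MODEL_BY_TASK.get(task_type, DEFAULT_MODEL),
        "max_tokens": max_tokens,
        # Mark the (stable) system prompt as a cacheable prefix
        "system": [{
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{"role": "user", "content": user_text}],
    }


request = build_request("classification",
                        "You are a sentiment classifier.",
                        "This product is really bad")
print(request["model"])
# response = client.messages.create(**request)  # requires an API key
```

Keeping the builder free of network calls makes the routing table easy to unit-test, and the cached prefix only pays off when the same system prompt is reused across many calls.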
