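Everything About Serving with SGLang: A Complete Guide from Installation to Production

As of 2026, SGLang is one of the de facto standard open-source LLM inference engines.

xAI (Grok), NVIDIA, AMD, LinkedIn, Cursor, Oracle Cloud, Google Cloud, and AWS run it in production, and it processes trillions of tokens per day on more than 400,000 GPUs worldwide.

So why pick SGLang when vLLM already exists? It comes down to one idea.

"vLLM treats each request as an independent unit. SGLang treats requests as a program."

That design philosophy drives most of the performance difference.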

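Why SGLang Is Fast: 3 Core Principles

1. RadixAttention: Automatic KV Cache Reuse

Start with the problem in an inference engine that has no prefix caching.

Engine without prefix caching:
Request A: [system prompt] + [user message 1] → build KV cache → discard after responding
Request B: [system prompt] + [user message 2] → rebuild KV cache from scratch
Request C: [system prompt] + [user message 3] → rebuild from scratch again

→ The system prompt's KV cache is recomputed every time, which is wasted work.

SGLang's RadixAttention stores the KV cache of shared prefixes in a radix tree and reuses it across requests. (vLLM also offers automatic prefix caching, but it matches fixed-size hashed blocks rather than walking a radix tree.)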

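SGLang (RadixAttention):
Request A: [system prompt] → KV cache stored
         └─ [user message 1] → only the new part is computed

Request B: [system prompt] → ✅ cache hit, skipped
         └─ [user message 2] → only the new part is computed

Request C: [system prompt] → ✅ cache hit, skipped
         └─ [user message 3] → only the new part is computed

→ Up to roughly 6x less GPU compute in the best case; the saving grows with the share of each prompt that is a common prefix

This pays off most in workloads with shared prefixes, such as chatbots, RAG pipelines, and few-shot prompts.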
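To make the prefix-reuse idea concrete, here is a minimal sketch. It is not SGLang's implementation: it uses a plain token-level trie (a real radix tree compresses runs of tokens into single edges and evicts old entries), and the class name PrefixCache is made up for this illustration. It only counts how many leading tokens of a request would already have KV entries.

```python
class PrefixCache:
    """Toy token-level trie that records which prefixes already have KV entries."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        hits = 0
        matching = True
        for tok in tokens:
            if matching and tok in node:
                hits += 1          # this token's KV entry can be reused
            else:
                matching = False   # first miss: everything after is new work
                node.setdefault(tok, {})
            node = node[tok]
        return hits


system = ["You", "are", "a", "helpful", "assistant", "."]
cache = PrefixCache()
reused_a = cache.match_and_insert(system + ["Hi"])
reused_b = cache.match_and_insert(system + ["Explain", "GIL"])
print(reused_a, reused_b)  # 0 6
```

The first request finds nothing to reuse, while the second skips all six system-prompt tokens and only computes its own two.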

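2. Zero-Overhead CPU Scheduler

In a conventional engine, the CPU scheduler prepares the next batch only after the GPU finishes the current one, so the GPU sits idle while scheduling runs. Since v0.4, SGLang overlaps CPU scheduling with GPU computation, which brings the effective scheduling overhead close to zero.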
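A quick back-of-the-envelope model shows why overlapping helps. This is only a sketch: the per-batch costs are made-up numbers, and it assumes the scheduler can prepare exactly one batch ahead.

```python
def serial_time(cpu_ms, gpu_ms):
    """The GPU waits for the scheduler before every batch."""
    return sum(c + g for c, g in zip(cpu_ms, gpu_ms))


def overlapped_time(cpu_ms, gpu_ms):
    """The scheduler prepares batch i+1 while the GPU runs batch i."""
    total = cpu_ms[0]  # the first batch cannot be hidden behind GPU work
    for i, g in enumerate(gpu_ms):
        next_cpu = cpu_ms[i + 1] if i + 1 < len(cpu_ms) else 0
        total += max(g, next_cpu)
    return total


cpu = [2, 2, 2, 2]      # hypothetical scheduling cost per batch (ms)
gpu = [10, 10, 10, 10]  # hypothetical GPU forward time per batch (ms)
print(serial_time(cpu, gpu), overlapped_time(cpu, gpu))  # 48 42
```

As long as scheduling a batch takes less time than running one, only the very first scheduling step shows up in the total.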

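3. Fast Decoding for Structured Output

When the output must follow a structure such as JSON or a regular expression, the conventional approach checks validity token by token. SGLang uses a compressed finite state machine (FSM): wherever the grammar allows only one continuation, it emits several tokens at once deterministically instead of decoding them one by one. That makes it up to about 3x faster than ordinary JSON decoding.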
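The "jump forward over forced states" idea can be sketched in a few lines. This is an illustration only: it works on characters rather than tokens, the FSM is hand-built for one tiny pattern, and compress_chains is a hypothetical helper, not an SGLang function.

```python
def compress_chains(transitions, start):
    """Follow states that have exactly one outgoing edge and return the forced text."""
    forced = []
    state = start
    while len(transitions.get(state, {})) == 1:
        (char, nxt), = transitions[state].items()
        forced.append(char)
        state = nxt
    return "".join(forced), state


# Character-level FSM for the pattern  {"ok": (true|false)}
text = '{"ok": '
fsm = {i: {ch: i + 1} for i, ch in enumerate(text)}
branch = len(text)  # the only state where the model has a real choice
fsm[branch] = {"t": "t1", "f": "f1"}
fsm.update({"t1": {"r": "t2"}, "t2": {"u": "t3"}, "t3": {"e": "end"}})
fsm.update({"f1": {"a": "f2"}, "f2": {"l": "f3"}, "f3": {"s": "f4"}, "f4": {"e": "end"}})
fsm["end"] = {"}": "done"}

prefix, state = compress_chains(fsm, 0)       # emitted without calling the model
tail, final = compress_chains(fsm, "t1")      # after the model picks "t"
print(repr(prefix), state, repr(tail), final)
```

In this toy case the model is consulted once (to choose between "t" and "f"); every other character is forced by the grammar and can be appended in bulk.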

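Installation

Install with pip (simplest)

# Assumes CUDA 12.1 + PyTorch 2.4
pip install "sglang[all]"

# Pin a specific version (latest as of January 2026)
pip install sglang==0.5.8

# Check the CUDA version
python -c "import torch; print(torch.version.cuda)"

Install with Docker (recommended)

# For NVIDIA GPUs
docker pull lmsysorg/sglang:latest

# --shm-size and --ipc=host give the workers enough shared memory;
# HF_TOKEN is needed for gated models such as Llama
docker run --gpus all \
  --shm-size 32g \
  --ipc=host \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000

# For AMD GPUs (ROCm)
docker pull lmsysorg/sglang:latest-rocm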

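Running a Basic Server

# Single GPU, default settings
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000

# Main options
#   --tp 4                       tensor parallelism across 4 GPUs
#   --dp 2                       2 data-parallel replicas (tp 4 x dp 2 = 8 GPUs)
#   --mem-fraction-static 0.85   use 85% of GPU memory for weights and the KV cache pool
#   --max-running-requests 2048  maximum number of concurrent requests
#   --dtype bfloat16             data type
#   --context-length 32768       maximum context length
#   --enable-torch-compile       enable torch.compile (up to about 1.5x faster, mainly at small batch sizes)
#   --disable-radix-cache        turn RadixAttention off (only if needed)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --port 30000 \
  --tp 4 \
  --dp 2 \
  --mem-fraction-static 0.85 \
  --max-running-requests 2048 \
  --dtype bfloat16 \
  --context-length 32768 \
  --enable-torch-compile \
  --disable-radix-cache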

Using the OpenAI-Compatible API

SGLang exposes an OpenAI-compatible API, so existing client code usually works after changing only base_url.

from openai import OpenAI

# Point the OpenAI client at the SGLang server
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"  # local server, so no real API key is needed
)

# Regular chat completion
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Implement the Fibonacci sequence in Python."}
    ],
    temperature=0.7,
    max_tokens=512
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain a simple sorting algorithm."}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Structured Output

This is SGLang's killer feature. It enforces a JSON schema so the LLM's output always parses into the expected format.

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Define the output schema with Pydantic models
class UserProfile(BaseModel):
    name: str
    age: int
    skills: list[str]
    experience_years: int

class JobCandidate(BaseModel):
    candidate: UserProfile
    suitable_for_position: bool
    reason: str

# Force structured JSON output
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{
        "role": "user",
        "content": "Return JSON describing this developer: Lee Cheolsu, age 28, Python/FastAPI/React, 5 years of experience."
    }],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "job_candidate",
            "schema": JobCandidate.model_json_schema()
        }
    }
)

# Parse and validate the JSON string in one step (Pydantic v2)
candidate = JobCandidate.model_validate_json(response.choices[0].message.content)
print(f"Name: {candidate.candidate.name}")
print(f"Skills: {candidate.candidate.skills}")

Model-Specific Tuning

Llama 3.x family

# --context-length 131072 enables the full 128K context
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --tp 4 \
  --mem-fraction-static 0.88 \
  --enable-torch-compile \
  --dtype bfloat16 \
  --context-length 131072 \
  --port 30000

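DeepSeek-R1 / V3 (MoE models)

SGLang provides specially optimized MLA (Multi-head Latent Attention) support for DeepSeek models.

# DeepSeek-V3 (671B parameters): too large for a typical single node,
# so combine this with the multi-node flags when one node is not enough
#   --tp 8                  tensor parallelism 8
#   --dp 4                  data parallelism 4
#   --trust-remote-code     allow the model's custom code
#   --enable-dp-attention   enable data-parallel attention
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --dp 4 \
  --trust-remote-code \
  --enable-dp-attention \
  --dtype bfloat16 \
  --port 30000

# DeepSeek-R1-Distill (small, fits on a single GPU)
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --tp 1 \
  --dtype bfloat16 \
  --port 30000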

Qwen3 family

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-32B \
  --tp 2 \
  --trust-remote-code \
  --dtype bfloat16 \
  --port 30000

Quantization Settings

When GPU memory is tight, quantization shrinks the model.

# FP8 quantization (minimal quality loss, H100 recommended)
# With FP8, a 70B model fits on 2 GPUs
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --dtype float16 \
  --quantization fp8 \
  --tp 2 \
  --port 30000

# INT4 quantization with AWQ (about 75% less weight memory, 70B on a single GPU)
# --quantization awq expects a checkpoint that is already AWQ-quantized;
# the path below is a placeholder for such a checkpoint
python -m sglang.launch_server \
  --model-path /path/to/Llama-3.3-70B-Instruct-AWQ \
  --quantization awq \
  --tp 1 \
  --port 30000

# GPTQ quantization
python -m sglang.launch_server \
  --model-path TheBloke/Llama-2-70B-GPTQ \
  --quantization gptq \
  --tp 2 \
  --port 30000

Quantization Methods Compared

Method	Memory savings	Quality loss	GPU requirement
BF16 (default)	None	None	High
FP8	50%	Minimal	H100-class (native FP8 support)
AWQ INT4	75%	Low	Any GPU
GPTQ INT4	75%	Low	Any GPU
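The savings in the table follow directly from bytes per parameter. Here is a rough sketch of that arithmetic for a 70B model; it counts weights only and ignores quantization scales, the KV cache, and activations, so real usage will be higher.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, fmt):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


def savings_vs_bf16(fmt):
    """Fraction of weight memory saved relative to BF16."""
    return 1 - BYTES_PER_PARAM[fmt] / BYTES_PER_PARAM["bf16"]


for fmt in BYTES_PER_PARAM:
    gb = weight_memory_gb(70e9, fmt)
    print(f"{fmt}: {gb:.0f} GB weights, {savings_vs_bf16(fmt):.0%} saved")
```

This lines up with the commands above: about 70 GB of FP8 weights needs two 80 GB GPUs once the KV cache is included, while about 35 GB of INT4 weights leaves room on a single one.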

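Speculative Decoding: Faster Decoding

A small draft model predicts several tokens ahead, and the large target model verifies them. When the draft's guesses are accepted often, this can cut latency by about 2 to 3x.

# EAGLE speculative decoding (the approach SGLang recommends)
#   --speculative-num-draft-tokens 5   predict 5 tokens at a time
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --speculative-algorithm EAGLE \
  --speculative-draft-model-path lmzheng/sglang-EAGLE-llama3.1-Instruct-8B \
  --speculative-num-draft-tokens 5 \
  --port 30000

# EAGLE-3 on a larger model
# EAGLE-3 also needs a draft model trained for the target; the path below is a placeholder
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path /path/to/eagle3-draft-for-llama-3.1-70b \
  --tp 4 \
  --port 30000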
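The draft-then-verify loop is easier to see in code. Below is a minimal greedy sketch with toy stand-in "models" (plain functions that return the next word). It is not EAGLE and not SGLang's implementation; a real engine verifies all drafted tokens in a single target forward pass, which this sketch only imitates with a loop.

```python
def speculative_step(target_next, draft_next, context, k):
    """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    accepted = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the first miss
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # bonus token after full acceptance
    return accepted


SENTENCE = ["the", "quick", "brown", "fox", "jumps", "over"]


def target_next(ctx):
    """Toy target model: always continues SENTENCE."""
    return SENTENCE[len(ctx)] if len(ctx) < len(SENTENCE) else "<eos>"


def draft_next(ctx):
    """Toy draft model: right most of the time, wrong about 'fox'."""
    guess = target_next(ctx)
    return "dog" if guess == "fox" else guess


out = speculative_step(target_next, draft_next, [], k=5)
print(out)  # ['the', 'quick', 'brown', 'fox']
```

One verification step produced four tokens here, and the output is exactly what the target alone would have generated, which is the point of the technique.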

Multi-GPU / Multi-Node Setup

Tensor parallelism (single node, multiple GPUs)

# Serve a 70B model on 4 GPUs (--tp 4 = tensor parallelism across 4 GPUs)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --tp 4 \
  --port 30000

# Serve a 405B model on 8 GPUs
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-405B-Instruct-FP8 \
  --tp 8 \
  --dtype float16 \
  --port 30000

Multi-node (2 or more servers)

# Node 1 (master)
#   --tp 16            16 GPUs in total (8 per node)
#   --dist-init-addr   IP and port of the master node
#   --nnodes 2         2 nodes in total
#   --node-rank 0      rank of this node
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-405B-Instruct \
  --tp 16 \
  --dist-init-addr 10.0.0.1:20000 \
  --nnodes 2 \
  --node-rank 0 \
  --port 30000

# Node 2 (worker): same command, only --node-rank changes
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-405B-Instruct \
  --tp 16 \
  --dist-init-addr 10.0.0.1:20000 \
  --nnodes 2 \
  --node-rank 1 \
  --port 30000
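Since the per-node commands differ only in --node-rank, it can be convenient to generate them instead of copy-pasting. A small sketch follows; node_commands is a hypothetical helper written for this post, and it only uses the flags shown above.

```python
import shlex


def node_commands(model_path, tp, nnodes, master_addr, port=30000):
    """Build one launch command per node; only --node-rank differs between them."""
    commands = []
    for rank in range(nnodes):
        args = [
            "python", "-m", "sglang.launch_server",
            "--model-path", model_path,
            "--tp", str(tp),
            "--dist-init-addr", master_addr,
            "--nnodes", str(nnodes),
            "--node-rank", str(rank),
            "--port", str(port),
        ]
        commands.append(shlex.join(args))
    return commands


cmds = node_commands("meta-llama/Llama-3.1-405B-Instruct", 16, 2, "10.0.0.1:20000")
for cmd in cmds:
    print(cmd)
```

Run the first printed command on the master node and the second on the worker.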

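SGL-Router: Cache-Aware Load Balancing

When you put a load balancer in front of several SGLang servers, SGL-Router routes each request to maximize the KV cache hit rate instead of using plain round robin.

# Start the router (installed separately: pip install sglang-router)
#   --cache-threshold 0.5   route to the best-matching server when at least 50% of the prompt matches its cache
#   --policy cache_aware    cache-aware routing policy
python -m sglang_router.launch_router \
  --worker-urls http://10.0.0.1:30000 http://10.0.0.2:30000 http://10.0.0.3:30000 \
  --port 8080 \
  --cache-threshold 0.5 \
  --policy cache_aware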
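The routing rule can be sketched roughly like this. It is a simplification under stated assumptions, not the router's actual code: prefixes are compared character by character instead of token by token, each worker's cache is approximated by the prompts it has served, and pick_worker is a made-up name.

```python
def shared_prefix_len(a, b):
    """Length of the common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(prompt, history, load, threshold=0.5):
    """history: worker -> prompts it has served; load: worker -> running requests."""
    best, best_len = None, 0
    for worker, prompts in history.items():
        match = max((shared_prefix_len(prompt, p) for p in prompts), default=0)
        if match > best_len:
            best, best_len = worker, match
    if best is not None and best_len / max(len(prompt), 1) >= threshold:
        return best                    # likely cache hit: go where the prefix lives
    return min(load, key=load.get)     # otherwise balance by load


history = {
    "w1": ["SYSTEM: be terse. USER: hi"],
    "w2": ["Translate to French: cat"],
}
load = {"w1": 5, "w2": 1}

print(pick_worker("SYSTEM: be terse. USER: what is a GIL?", history, load))  # w1
print(pick_worker("Summarize this article", history, load))                  # w2
```

The first prompt shares most of its text with what w1 has already served, so it goes there even though w1 is busier; the second has no useful match and falls back to the least-loaded worker.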

Prefill/Decode Disaggregation (PD Disaggregation)

This is the current state of the art for large-scale serving. Prefill (processing the prompt) is compute-bound, while decode (generating tokens) is memory-bound. PD disaggregation runs the two phases on separate hardware.

# Prefill server (high-performance GPUs, fast compute)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 4 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend mooncake \
  --port 30000

# Decode server (GPUs with large memory)
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-70B-Instruct \
  --tp 4 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --port 30001

For DeepSeek models, PD-disaggregated deployments have been reported to reach up to 3.8x prefill and 4.8x decode throughput. The exact gain depends heavily on hardware and configuration.
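Why the two phases behave so differently comes down to arithmetic intensity. Here is a rough sketch: it assumes a dense model where one forward pass over T tokens costs about 2·P·T FLOPs and reads about P·bytes of weights, ignores attention and KV cache traffic, and uses made-up accelerator numbers.

```python
def arithmetic_intensity(tokens_per_pass, bytes_per_param=2.0):
    """FLOPs per byte of weights read: about 2*P*T FLOPs over P*bytes of traffic."""
    return 2.0 * tokens_per_pass / bytes_per_param


def bound(tokens_per_pass, peak_flops, mem_bandwidth):
    """Compare intensity against the hardware's FLOPs-per-byte ridge point."""
    ridge = peak_flops / mem_bandwidth
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute-bound"
    return "memory-bound"


PEAK_FLOPS = 1.0e15     # hypothetical accelerator: 1000 TFLOP/s
MEM_BANDWIDTH = 3.0e12  # hypothetical accelerator: 3 TB/s

print(bound(2048, PEAK_FLOPS, MEM_BANDWIDTH))  # prefill of a 2048-token prompt
print(bound(1, PEAK_FLOPS, MEM_BANDWIDTH))     # decode, one token per pass
```

Prefill pushes thousands of tokens through each weight read, while decode pushes one per sequence, so the two phases want different batch sizes and often different hardware.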

Using the Python API Directly

You can run the SGLang engine straight from Python code, without a server.

import sglang as sgl


def main():
    # Offline batch inference
    llm = sgl.Engine(
        model_path="meta-llama/Llama-3.1-8B-Instruct",
        tp_size=1,
    )

    # Single request (sampling parameters are passed as a plain dict)
    output = llm.generate(
        "Explain what Python's GIL is.",
        {"temperature": 0.7, "max_new_tokens": 512, "top_p": 0.9},
    )
    print(output["text"])

    # Batch request (the prompts are processed in parallel)
    prompts = [
        "What is the Python GIL?",
        "How do FastAPI and Django differ?",
        "How do Docker and VMs differ?",
        "What is a REST API?",
    ]

    outputs = llm.generate(prompts, {"temperature": 0.7, "max_new_tokens": 256})

    for prompt, output in zip(prompts, outputs):
        print(f"Q: {prompt}")
        print(f"A: {output['text'][:100]}...")
        print()

    # Clean up when done
    llm.shutdown()


# The engine spawns worker processes, so keep the entry point guarded
if __name__ == "__main__":
    main()

Production Deployment: Docker Compose

# docker-compose.yml
version: "3.8"

services:
  sglang-server:
    image: lmsysorg/sglang:latest
    runtime: nvidia
    ipc: host  # share host IPC so multi-GPU workers get enough shared memory
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,1,2,3
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "30000:30000"
    command: >
      python -m sglang.launch_server
      --model-path meta-llama/Llama-3.3-70B-Instruct
      --tp 4
      --mem-fraction-static 0.85
      --max-running-requests 1024
      --dtype bfloat16
      --enable-torch-compile
      --host 0.0.0.0
      --port 30000
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:30000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s  # wait for the model to load
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - sglang-server

# nginx.conf (the timeout settings matter)
events {}

http {
    upstream sglang {
        server sglang-server:30000;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://sglang;
            proxy_read_timeout 300s;   # wait up to 5 minutes for an LLM response
            proxy_send_timeout 300s;
            proxy_connect_timeout 10s;

            # Settings for streaming responses
            proxy_http_version 1.1;
            proxy_buffering off;
            proxy_cache off;
            proxy_set_header Connection "";
            chunked_transfer_encoding on;
        }
    }
}

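Production Monitoring

SGLang exports Prometheus metrics once the server is started with --enable-metrics.

# Fetch the metrics
curl http://localhost:30000/metrics

# Key metrics (names can differ between versions, so check them against your own /metrics output):
# sglang:num_running_reqs      requests currently running
# sglang:num_queue_reqs        requests waiting in the queue
# sglang:token_usage           fraction of the KV cache token pool in use
# sglang:gen_throughput        generation throughput (tokens/s)
# sglang:prompt_tokens_total   prefill tokens processed so far (counter)
# sglang:cache_hit_rate        RadixAttention cache hit rate

# Check server health
curl http://localhost:30000/health

# Get model info
curl http://localhost:30000/get_model_info

# Read the metrics from Python (for dashboards, have Prometheus scrape /metrics and plot it in Grafana)
import requests

def get_sglang_metrics(url="http://localhost:30000/metrics"):
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    metrics = {}
    for line in response.text.splitlines():
        if not line.startswith("sglang:"):
            continue  # skip "# HELP" / "# TYPE" comment lines
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]  # drop the {label="..."} part
        metrics[name] = float(value)
    return metrics

# Watch the key indicators
metrics = get_sglang_metrics()
print(f"Running: {metrics.get('sglang:num_running_reqs', 0)}")
print(f"Queued: {metrics.get('sglang:num_queue_reqs', 0)}")
print(f"Cache hit rate: {metrics.get('sglang:cache_hit_rate', 0):.1%}")
print(f"Generation throughput: {metrics.get('sglang:gen_throughput', 0):.0f} tok/s")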
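Since model loading can take minutes, deployment scripts usually need to wait for /health before sending traffic. Here is a small standard-library sketch; http_ok and wait_until_healthy are hypothetical helper names, and the attempt count and interval are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(check, attempts=30, interval=2.0, sleep=time.sleep):
    """Call check() until it succeeds; return the attempt number, or None on give-up."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        sleep(interval)
    return None


# Example with a fake check that succeeds on the third try (no server needed)
answers = iter([False, False, True])
tries = wait_until_healthy(lambda: next(answers), attempts=5, sleep=lambda s: None)
print(tries)  # 3
```

Against a real server you would pass something like `lambda: http_ok("http://localhost:30000/health")` as the check.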

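SGLang vs vLLM vs TensorRT-LLM

A qualitative comparison, based on Llama 3.3 70B in FP8 on an H100 80GB.

Item	SGLang	vLLM	TensorRT-LLM
Throughput (tok/s)	High	Medium	Highest
TTFT (time to first token)	Low	Medium	Lowest
Shared-prefix caching	✅ RadixAttention	Block-level (automatic prefix caching)	Block-level (KV cache reuse)
Structured output speed	Highest	Low	Medium
Install/setup difficulty	Medium	Low	High
Compile time	None	None	28+ minutes
Model coverage	Broad	Broadest	NVIDIA only
Multi-node	✅	✅	✅
PD disaggregation	✅	Experimental	Experimental

Conclusion

Lots of shared prompts, as in chatbots or RAG → SGLang
Fast prototyping across many models → vLLM
Maximum throughput for a single model, NVIDIA only → TensorRT-LLM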

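Common Problems and Fixes

# Problem 1: CUDA out of memory
# Fix: lower mem-fraction-static
--mem-fraction-static 0.80  # down from, e.g., 0.88

# Problem 2: Model loading is too slow
# Fix: check the local cache and load model shards in parallel
export HF_HOME=/fast-ssd/models  # keep the Hugging Face cache on a fast SSD

# Problem 3: Responses get cut off midway
# Fix: raise max_tokens in the request; if requests are being dropped under
# load, lower the concurrency cap as well
--max-running-requests 512  # fewer concurrent requests

# Problem 4: torch.compile errors
# Fix: disable it
# Remove --enable-torch-compile (compile errors can occur on the first run)

# Problem 5: Multi-GPU communication errors
# Fix: adjust NCCL settings
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1  # disable InfiniBand if you do not have it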

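Production Checklist

✅ Pick a model and quantization that fit your GPU memory
✅ Set tp to match the number of GPUs
✅ Set mem-fraction-static to 0.85-0.90
✅ Tune max-running-requests to your traffic
✅ Set Nginx timeouts to 300 seconds or more
✅ Configure health checks against the /health endpoint
✅ Collect Prometheus metrics and build a Grafana dashboard
✅ Monitor the cache hit rate (if it is low, move shared content to the start of your prompts)
✅ Store model files on a fast SSD
✅ Manage the HuggingFace token through environment variables

Wrapping Up

Here is SGLang in one line.

"If vLLM is an engine that processes independent requests quickly, SGLang is a runtime that optimizes the whole LLM workload as a single program."

It reuses the KV cache with RadixAttention, decodes structured output at high speed, and handles large-scale traffic with PD disaggregation. Those three things are why places like xAI and NVIDIA choose SGLang. 😄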
