[Paper] TiCo: Time‑Controllable Training for Spoken Dialogue Models

Published: March 24, 2026, 02:51 AM GMT+9

Source: arXiv - 2603.22267v1

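Overview

This paper introduces TiCo (Time‑Controllable Training), a lightweight post‑training technique that gives spoken dialogue models (SDMs) the ability to follow explicit time constraints, for example "respond in about 15 seconds." By making the model aware of elapsed speaking time, TiCo helps voice assistants and interactive agents sound more natural and user‑friendly in real, time‑sensitive situations.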

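Key Contributions

Time‑aware generation: Introduces Spoken Time Markers (STM), which insert the current elapsed time directly into the model's token stream.
Simple post‑training pipeline: Requires only a few thousand self‑generated examples and no additional human‑annotated QA pairs.
Reinforcement‑learning fine‑tuning: Uses a reward that balances duration accuracy against language quality, preserving the model's fluency.
Empirical validation: Shows that both open‑source (e.g., Whisper‑based) and commercial SDMs adhere to duration constraints far better after TiCo is applied.
Minimal overhead: Adds almost no inference latency and can be applied to any pretrained spoken language model.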

Methodology

Self‑generation of training data
- The base SDM generates a batch of utterances without any time constraint.
- For each utterance, the system computes the actual speaking duration (using a TTS front‑end or a simple phoneme‑duration model).
Insertion of Spoken Time Markers (STM)
- Tokens of the form <X.Y seconds> are interleaved into the transcript at regular intervals (e.g., every 2 seconds).
- These markers act as “internal clocks,” letting the model see how much time has already been spoken.
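The marker-insertion step can be illustrated with a minimal sketch. It assumes word-level end times are already available (for instance from the duration estimate in the previous step); the function name, the (word, end_time) input format, and the rule of placing a marker after the first word that crosses each interval boundary are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple


def insert_time_markers(words: List[Tuple[str, float]], interval: float = 2.0) -> str:
    """Interleave <X.Y seconds> markers into a transcript.

    words: (word, end_time_in_seconds) pairs in spoken order (hypothetical format).
    interval: spacing between markers; 2 seconds follows the example in the text.
    """
    pieces = []
    next_mark = interval
    for word, end_time in words:
        pieces.append(word)
        if end_time >= next_mark:
            # Report the elapsed time at the point where the marker is placed.
            pieces.append(f"<{end_time:.1f} seconds>")
            # Skip any boundaries already passed (e.g. after a long word or pause).
            while next_mark <= end_time:
                next_mark += interval
    return " ".join(pieces)


example = [
    ("Sure,", 0.4), ("the", 0.6), ("forecast", 1.2), ("for", 1.4),
    ("today", 2.1), ("is", 2.3), ("sunny", 3.0), ("and", 3.2), ("warm", 4.0),
]
print(insert_time_markers(example))
# Sure, the forecast for today <2.1 seconds> is sunny and warm <4.0 seconds>
```

Placing the marker after a whole word keeps the marker from splitting a word, at the cost of markers landing slightly after the exact boundary.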
Reinforcement Learning (RL) fine‑tuning
- Reward function = α · (1 – |target – actual|/target) + β · quality_score.
- The first term pushes the model toward the desired duration; the second term (e.g., a BERTScore‑like metric) penalizes degradation in fluency or relevance.
- Proximal Policy Optimization (PPO) is used to update the model weights while keeping the original language knowledge stable.
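The reward formula above translates directly into a few lines of code. This is a sketch only: the function name and the default weights for α and β are placeholders, since the summary does not give their values, and quality_score is assumed to be a number already computed by some external metric.

```python
def duration_quality_reward(
    target_s: float,
    actual_s: float,
    quality_score: float,
    alpha: float = 1.0,  # placeholder weight, not from the paper
    beta: float = 1.0,   # placeholder weight, not from the paper
) -> float:
    """Reward = alpha * (1 - |target - actual| / target) + beta * quality_score."""
    if target_s <= 0:
        raise ValueError("target_s must be positive")
    duration_term = 1.0 - abs(target_s - actual_s) / target_s
    return alpha * duration_term + beta * quality_score


# A response that hits a 15 s target exactly gets the full duration term.
print(duration_quality_reward(15.0, 15.0, quality_score=0.8, beta=0.5))
```

Note that the duration term as written is not bounded below: it becomes negative once the actual duration exceeds twice the target, so very long responses are penalized more and more strongly.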
Inference with time control
- At generation time, the user supplies a target duration.
- The model starts emitting tokens and STM updates; when the predicted remaining time falls below a threshold, it truncates or pads the output to hit the target.
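As a rough illustration of the stopping rule described above, the sketch below tracks elapsed time over a stream of tokens and stops once the remaining budget falls below a threshold. It covers only the truncation side, not padding. The function name, the (token, duration) input format, and the 0.5 s default threshold are assumptions made for the example; in the actual method the model is trained to plan its response length rather than relying on a hard cutoff alone.

```python
from typing import Iterable, List, Tuple


def generate_with_time_budget(
    tokens: Iterable[Tuple[str, float]],
    target_s: float,
    stop_threshold_s: float = 0.5,  # hypothetical threshold
) -> Tuple[List[str], float]:
    """Emit tokens until the remaining time budget is nearly used up.

    tokens: (token, estimated_duration_in_seconds) pairs (hypothetical format).
    Returns the emitted tokens and the total elapsed time.
    """
    emitted: List[str] = []
    elapsed = 0.0
    for token, duration in tokens:
        remaining = target_s - elapsed
        # Stop when little time is left or the next token would overshoot.
        if remaining < stop_threshold_s or duration > remaining:
            break
        emitted.append(token)
        elapsed += duration
    return emitted, elapsed


stream = [(f"w{i}", 1.0) for i in range(10)]
print(generate_with_time_budget(stream, target_s=3.2))
```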

Results and Findings

Model	Target 10 s: share within ±1 s	Target 15 s: share within ±1 s
Baseline Open‑source SDM	12 %	9 %
Baseline Commercial SDM	15 %	11 %
TiCo‑fine‑tuned Open‑source	68 %	71 %
TiCo‑fine‑tuned Commercial	62 %	66 %

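Duration adherence improved by roughly 5x on average.
Response quality (BLEU, METEOR, human evaluation) dropped by less than 0.2 points, indicating that time control does not sacrifice naturalness.
Inference latency increased by < 5 ms per utterance, comfortably within the requirements of real‑time voice assistants.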
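The improvement factor can be checked against the table with simple arithmetic. The short script below divides each TiCo‑fine‑tuned rate by its baseline; the numbers are copied from the table, while the variable and function names are only for illustration. The per-cell ratios come out between about 4.1x and 7.9x, with a mean of about 5.9x, so the "roughly 5x" figure reads as a rounded, conservative summary.

```python
# (baseline %, TiCo-fine-tuned %) pairs taken from the results table.
RATES = {
    "Open-source, 10 s": (12, 68),
    "Open-source, 15 s": (9, 71),
    "Commercial, 10 s": (15, 62),
    "Commercial, 15 s": (11, 66),
}


def improvement_factors(rates):
    """Return the after/before ratio for each model and target."""
    return {name: after / before for name, (before, after) in rates.items()}


factors = improvement_factors(RATES)
for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x")
print(f"mean: {sum(factors.values()) / len(factors):.1f}x")
```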

Practical Implications

Voice assistants can now tailor reply length to the context (e.g., brief confirmations while driving, longer explanations when the user is idle).
Customer‑service bots can respect call‑center timing policies, ensuring that agents are not kept waiting too long for a system‑generated hand‑off.
Multimodal agents (e.g., robot companions) can synchronize speech with gestures or visual cues by aligning spoken duration with other timed actions.
Accessibility tools (screen readers, language learning apps) can produce speech that fits predefined lesson slots, improving curriculum pacing.
Developer workflow: TiCo can be added as a post‑training step to any existing SDM, requiring only a modest compute budget (a few GPU hours) and no new annotation pipeline.

Limitations & Future Work

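Reliance on accurate duration estimation: The current pipeline uses a TTS front‑end for timing; errors in the phoneme‑duration model can propagate.
Scope of control: TiCo focuses on overall utterance length and does not address finer‑grained control (e.g., pause placement, intonation).
Generalization to extreme durations: Effectiveness drops for very short (< 2 s) or very long (> 30 s) targets, suggesting that curriculum‑style training is needed.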
Future directions proposed by the authors include extending STM to encode prosodic cues, integrating user‑feedback loops for online adaptation, and evaluating TiCo on multilingual SDMs.

Authors

Kai‑Wei Chang
Wei‑Chih Chen
En‑Pei Hu
Hung‑yi Lee
James Glass

Paper Information

arXiv ID: 2603.22267v1
Categories: cs.CL, cs.AI, eess.AS
Published: March 23, 2026
