I Let AI Write My Blog Posts, Then Scored the Quality: The Results Were Brutal

Published: 2 days ago (March 8, 2026, 1:32 PM GMT+9)

7 min read

Source: Dev.to

I write a lot: blog posts, docs, READMEs, roughly 2,000 words a week. Last month I had AI generate a few blog paragraphs. They looked professional, but something felt off. So I ran them through a readability scorer, and the numbers were brutal.

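The Experiment

I prepared four AI-generated blog paragraphs (the kind ChatGPT or Claude produces when you ask it to "write a blog intro about web development") and four paragraphs I wrote myself. Then I scored all eight with textlens, an open-source text analysis library.

Scoring code (JavaScript):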

import { readability, sentiment } from 'textlens';

const aiText = `In today's rapidly evolving technological landscape,
developers are constantly seeking innovative solutions to streamline
their workflows and enhance productivity. The emergence of artificial
intelligence has fundamentally transformed the way we approach
software development, offering unprecedented opportunities for
automation and optimization.`;

const humanText = `I write a lot. Blog posts, docs, READMEs — probably
2,000 words a week. Last month I got lazy and let AI write three posts
for me. They looked fine. Professional, even. But something felt off.
So I ran them through a readability scorer. The numbers were bad.`;

console.log('AI:', readability(aiText));
console.log('Human:', readability(humanText));

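The Scores

A lower grade level means easier reading. A higher Flesch score means better readability.

Metric	AI-written (avg)	Human-written (avg)	Winner
Flesch Reading Ease	-4.7	73.8	Human
FK Grade Level	19.9	5.1	Human
Gunning Fog Index	24.9	7.5	Human

The AI text scored a negative Flesch Reading Ease, which means it is harder to read than a medical research paper. A grade level of 19.9 means you would need to be a PhD candidate to read a blog post intro comfortably. The human-written text, by contrast, averaged a 5th-grade level, which any teenager can read.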

Why AI Text Scores Poorly

Readability formulas measure sentence length and syllable count. AI defaults to long, compound sentences packed with multi‑syllable jargon.

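AI version:

"The implementation of sentiment analysis algorithms represents a fascinating intersection of natural language processing and machine learning technologies."

Human version:

"Sentiment analysis sounds complex, but the code is simple."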

The AI sentence scores a Gunning Fog index of 28.4 (post‑graduate level) while the human sentence scores 7.5 (7th grade).
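For reference, the three metrics in the table are simple arithmetic over a handful of counts. The sketch below uses the standard published formulas and takes the counts as inputs, since counting syllables is the part where libraries differ. The function names and the sample counts are made up for illustration and are not taken from textlens. Note that nothing clamps Flesch Reading Ease at zero, which is why long sentences full of long words can push it negative.

```javascript
// Standard readability formulas computed from raw counts.
// Assumption: the caller has already counted words, sentences, and syllables.

function fleschReadingEase({ words, sentences, syllables }) {
  return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words);
}

function fleschKincaidGrade({ words, sentences, syllables }) {
  return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59;
}

// "Complex" words are words with three or more syllables.
function gunningFog({ words, sentences, complexWords }) {
  return 0.4 * (words / sentences + 100 * (complexWords / words));
}

// Hypothetical counts for two 40-word passages (not measured from the article).
const denseCounts = { words: 40, sentences: 2, syllables: 90, complexWords: 16 };
const plainCounts = { words: 40, sentences: 6, syllables: 52, complexWords: 2 };

for (const [label, counts] of [['dense', denseCounts], ['plain', plainCounts]]) {
  console.log(
    label,
    fleschReadingEase(counts).toFixed(1),
    fleschKincaidGrade(counts).toFixed(1),
    gunningFog(counts).toFixed(1)
  );
}
```

Both passages have the same word count. Only the sentence length and the syllables per word differ, and those two ratios drive every score.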

AI also loves filler words—leverage, innovative, comprehensive, unprecedented—which add syllables without adding meaning. Real developers tend to say use, new, full.
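One way to catch these words before scoring is a small word-list check. This is a minimal sketch, not part of textlens: the function name is hypothetical, and the list contains only the four filler words named above.

```javascript
// Filler words named in the article; extend the list for your own writing.
const FILLER_WORDS = ['leverage', 'innovative', 'comprehensive', 'unprecedented'];

// Returns an object mapping each filler word found to its number of occurrences.
// Matching is exact and case-insensitive, so "leveraging" is not counted.
function findFillerWords(text, fillers = FILLER_WORDS) {
  const counts = {};
  const tokens = text.toLowerCase().match(/[a-z']+/g) || [];
  for (const token of tokens) {
    if (fillers.includes(token)) {
      counts[token] = (counts[token] || 0) + 1;
    }
  }
  return counts;
}

const sample = 'We leverage innovative tools to deliver unprecedented, innovative results.';
console.log(findFillerWords(sample));
```

Each hit is a candidate for a shorter word, such as use, new, or full.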

The Sentiment Surprise

The AI text consistently scored more positive in sentiment analysis, while my writing landed close to neutral.

import { sentiment } from 'textlens';

const aiResult = sentiment(aiText);
// { score: 4, comparative: 0.074, positive: ['innovative', ...], ... }

const humanResult = sentiment(humanText);
// { score: -1, comparative: -0.034, positive: [], negative: ['lazy', 'bad'] }

AI is relentlessly upbeat, sprinkling its text with words like exciting, powerful, exceptional, and revolutionary. My prose included honest words like lazy and bad, which come across as more authentic to readers.
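To show why upbeat vocabulary moves the score, here is a minimal sketch of lexicon-based sentiment scoring. It is not textlens's implementation. The word weights are illustrative values, and treating comparative as the score divided by the token count is an assumption modeled on common word-list scorers. The returned fields mirror the ones shown in the output comments above.

```javascript
// Illustrative word weights only; a real lexicon has thousands of entries.
const LEXICON = {
  innovative: 2,
  exciting: 3,
  powerful: 2,
  revolutionary: 2,
  lazy: -1,
  bad: -3,
};

// Sums the weight of every known word; unknown words contribute nothing.
function scoreSentiment(text, lexicon = LEXICON) {
  const tokens = text.toLowerCase().match(/[a-z']+/g) || [];
  const positive = [];
  const negative = [];
  let score = 0;
  for (const token of tokens) {
    if (!Object.prototype.hasOwnProperty.call(lexicon, token)) continue;
    const weight = lexicon[token];
    score += weight;
    if (weight > 0) positive.push(token);
    if (weight < 0) negative.push(token);
  }
  // Assumed definition: normalize by length so long texts are comparable.
  const comparative = tokens.length > 0 ? score / tokens.length : 0;
  return { score, comparative, positive, negative };
}

console.log(scoreSentiment('An exciting, powerful tool.'));
console.log(scoreSentiment('I got lazy. The numbers were bad.'));
```

A scorer like this only sees vocabulary, so a text can rate as positive without saying anything useful.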

What I Actually Do Now

I still use AI for first drafts, but I added a scoring step to my workflow:

import { analyze } from 'textlens';

function checkDraft(text) {
  const result = analyze(text);
  const { fleschReadingEase, fleschKincaidGrade } = result.readability;

  if (fleschKincaidGrade.score > 10) {
    console.warn(`⚠️ Grade level ${fleschKincaidGrade.score} — too complex`);
    console.warn('Simplify sentences and reduce jargon.');
  }

  if (fleschReadingEase.score < 50) {
    console.warn(`⚠️ Flesch score ${fleschReadingEase.score} — hard to read`);
  }

  console.log(`✅ Grade: ${fleschKincaidGrade.score} | Flesch: ${fleschReadingEase.score}`);
}

My rule: never ship anything above an 8th-grade level. If AI produces a grade-16 paragraph, I rewrite it until the score drops. It usually takes about 30 seconds: shorten the sentences and swap out the hard words.
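The rule can also be written as a pass/fail gate instead of a warning. The sketch below assumes the grade and Flesch values have already been pulled out of whatever scorer you use. The function names are hypothetical, and the defaults combine the grade-8 ceiling from the rule with the Flesch floor of 50 from the checker above.

```javascript
// Returns a list of human-readable problems; an empty list means the draft passes.
function findReadabilityProblems({ grade, flesch }, { maxGrade = 8, minFlesch = 50 } = {}) {
  const problems = [];
  if (grade > maxGrade) {
    problems.push(`Grade level ${grade} exceeds ${maxGrade}`);
  }
  if (flesch < minFlesch) {
    problems.push(`Flesch score ${flesch} is below ${minFlesch}`);
  }
  return problems;
}

// Convenience wrapper for a yes/no publishing decision.
function shouldPublish(scores, limits) {
  return findReadabilityProblems(scores, limits).length === 0;
}

// Average scores from the table above.
console.log(findReadabilityProblems({ grade: 19.9, flesch: -4.7 }));
console.log(shouldPublish({ grade: 5.1, flesch: 73.8 }));
```

In a pre-publish script, setting process.exitCode = 1 when shouldPublish returns false makes the build fail rather than just print a warning.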

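Summary

AI is great at producing volume but struggles with readability. The result is text that sounds impressive but performs poorly: high bounce rates, low engagement, and readers who skim and leave.

The fix is not to avoid AI but to measure what you publish. Readability is not subjective, it is math: sentence length, syllable count, word frequency. These are numbers you can check before you publish.

Tool used: textlens, a zero-dependency text analysis library for Node.js. Install it with npm install textlens and try it with npx textlens "your text here".

What has your experience been with the quality of AI-generated content? Have you measured it, or only eyeballed it?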
