
Detect LLM: Paragraph-level Classification of AI-Generated Text


πŸ“ Overview

This repository presents our solution for the Detect AI-generated Text Competition, a national-level data science competition hosted by Dacon in 2025.

The goal of the challenge was to detect whether a paragraph was written by a Large Language Model (LLM), with only document-level labels provided. This required re-engineering the data, designing weak supervision strategies, and building robust classifiers in a highly imbalanced setting.

Competition Link


πŸ“ Competition Information

  • Host: Dacon
  • Track: Detect AI-generated Text
  • Evaluation Metric: ROC-AUC
  • Input: Document-level 'full_text' (Provided as train.csv)
  • Label: 'generated' (0 for human-written, 1 for AI-generated)
  • Goal: Classify each paragraph as human- or AI-generated

βœ”οΈ train.csv

| title | full_text | generated |
| --- | --- | --- |
| 카호올라웨섬 | 카호올라웨섬은 하와이 제도를 구성하는 (중략...) | 0 |
| 청색거성 | 천문학에서 청색거성은 광도 분류 (중략...) | 0 |
| 수난곡 | 수난곡은 배우의 연기 없이 무대에 (중략...) | 1 |

βœ”οΈ test.csv

| ID | title | paragraph_index | paragraph_text |
| --- | --- | --- | --- |
| TEST_0000 | 공중 도덕의 의의와 필요성 | 0 | 도덕이란 원래 개인의 자각... |
| TEST_0001 | 공중 도덕의 의의와 필요성 | 1 | 도덕은 단순히 개인의 문제... |
| TEST_0002 | 공중 도덕의 의의와 필요성 | 2 | 여기에 이른바 공중도덕은... |

βœ”οΈ sample_submission.csv

| ID | generated |
| --- | --- |
| TEST_0000 | 0 |
| TEST_0001 | 0 |

πŸ“ Data Analysis & Preprocessing

Our core challenge was to reconstruct paragraph-level labels from document-level data in an extremely imbalanced and noisy setting. To overcome this, we developed multiple weak supervision and filtering strategies, focusing on data-centric approaches.



➑️ 1) Data Restructuring

  • Each full_text (up to 9 paragraphs) was split into multiple paragraph_text units
  • For long paragraphs, we applied sliding window chunking with overlapping stride
  • Input format:
    "제λͺ©: {title} λ³Έλ¬Έ: {paragraph_text}"
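
A minimal sketch of this restructuring, assuming whitespace tokenization and illustrative window/stride sizes (the pipeline's actual tokenizer and chunk sizes are not specified here; `restructure` is a hypothetical helper name):

```python
def restructure(title: str, full_text: str, window: int = 256, stride: int = 128):
    """Split a document into paragraph units; long paragraphs are chunked
    with an overlapping sliding window, then formatted as model input."""
    samples = []
    for para in filter(None, (p.strip() for p in full_text.split("\n"))):
        tokens = para.split()  # stand-in for a real subword tokenizer
        if len(tokens) <= window:
            chunks = [para]
        else:
            chunks = [" ".join(tokens[i:i + window])
                      for i in range(0, len(tokens) - stride, stride)]
        for chunk in chunks:
            samples.append(f"제목: {title} 본문: {chunk}")
    return samples
```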

➑️ 2) Class Imbalance Handling

The dataset exhibited severe class imbalance:
approximately 10:1 ratio between generated=0 and generated=1 labels.

To address this, we performed both data augmentation and filtering for high-confidence positive samples.

  • Oversampling + Label Propagation
  • Filtering using Perplexity and Semantic Similarity

πŸ” 3) KANANA-based Positive Data Augmentation

We used a pretrained Korean generative model ("Kanana") to generate synthetic paragraphs mimicking the generated=1 style.
These augmented paragraphs were added to the training set to enrich the positive class.

  • Model: kakaocorp/kanana-1.5-2.1b-instruct-2505 (instruction-tuned Korean LLM)
  • Prompt-based generation: Input includes preceding ([BEFORE]) and following ([AFTER]) paragraphs. The model generates a coherent middle ([TARGET]) paragraph mimicking generated=1 style.
  • Post-processing: used kss for sentence segmentation; removed incomplete trailing sentences via regex-based ending-pattern matching (다, 요, 습니다, etc.); applied manual spot-checking to remove low-quality generations.
  • Labeling & Integration: all generated paragraphs labeled generated=1 and added to the training set to enrich the positive class distribution.
  • Scale: Generated 11,317 synthetic paragraphs for augmentation.
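
The ending-pattern cleanup step might look like the sketch below, with a naive punctuation split standing in for kss to stay self-contained (`trim_incomplete` and the exact regex are illustrative assumptions, not the project's code):

```python
import re

# Complete Korean sentence endings (다 / 요 / 습니다), optionally punctuated.
COMPLETE_ENDING = re.compile(r"(다|요|습니다)[.!?]?$")

def trim_incomplete(paragraph: str) -> str:
    """Drop trailing sentences that lack a complete sentence-final ending,
    as produced by truncated LLM generations."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    while sentences and not COMPLETE_ENDING.search(sentences[-1]):
        sentences.pop()
    return " ".join(sentences)
```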

πŸ” 4) Weak Labeling & Filtering Strategies learn more->

We experimented with various heuristic and unsupervised labeling techniques:

1. Perplexity-based Filtering

  • Used GPT-like models to compute perplexity scores
  • Sentences or paragraphs with low perplexity were considered likely AI-generated
  • Applied thresholding to extract generated=1 candidates
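
In sketch form, assuming per-token log-probabilities are already available from a GPT-style LM (the threshold and helper names are illustrative, not the competition values):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean token log-likelihood)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def filter_candidates(paragraphs, logprobs, threshold=10.0):
    """Keep low-perplexity paragraphs as generated=1 candidates:
    text an LM finds unusually predictable is treated as AI-like."""
    return [p for p, lp in zip(paragraphs, logprobs)
            if perplexity(lp) < threshold]
```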

2. Style Feature + PPL Clustering

  • Extracted syntactic style features (e.g., average sentence length, token diversity)
  • Combined with perplexity
  • Performed KMeans or HDBSCAN clustering to isolate AI-like patterns
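
A simplified version of the feature vector fed to the clustering step (whitespace tokenization and a period split stand in for real tokenizers; `style_features` is a hypothetical name):

```python
def style_features(paragraph: str, ppl: float):
    """Stylometric features combined with perplexity: average sentence
    length, token diversity (type-token ratio), and PPL."""
    sentences = [s for s in paragraph.split(".") if s.strip()]
    tokens = paragraph.split()
    avg_sent_len = len(tokens) / max(len(sentences), 1)
    token_diversity = len(set(tokens)) / max(len(tokens), 1)
    return [avg_sent_len, token_diversity, ppl]
```

These vectors would then be standardized and passed to KMeans or HDBSCAN to isolate AI-like clusters.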

3. HDBSCAN on Paragraph-level PPL

  • Applied sentence-wise perplexity estimation
  • Clustered using HDBSCAN to extract dense positive clusters
  • Label assigned to entire paragraph if one or more sentences clustered as AI
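
The paragraph-level label assignment reduces to a simple rule over the sentence cluster IDs (HDBSCAN marks noise points with -1):

```python
def paragraph_label(sentence_clusters, ai_clusters):
    """Label a paragraph generated=1 if any non-noise sentence fell
    into a cluster identified as AI-like."""
    return int(any(c in ai_clusters for c in sentence_clusters if c != -1))
```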

4. Perturbation-based Confidence Scoring

  • Modified (perturbed) the sentence input
  • Measured change in log-likelihood (LL) or model confidence
  • Used as a proxy for model β€œsurprise” or fluency robustness
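
A sketch of this scoring with word-dropout as the perturbation and a pluggable `loglik` function (in the project this would be an LM's log-likelihood; the drop rate and helper names are assumptions):

```python
import random

def perturb(sentence: str, drop_rate: float = 0.15, seed: int = 0) -> str:
    """Create a perturbed variant by randomly dropping words."""
    rng = random.Random(seed)
    kept = [w for w in sentence.split() if rng.random() > drop_rate]
    return " ".join(kept) or sentence

def surprise_score(sentence: str, loglik, n_perturb: int = 5) -> float:
    """Mean drop in log-likelihood under perturbation, a proxy for the
    model's 'surprise' / fluency robustness."""
    base = loglik(sentence)
    deltas = [base - loglik(perturb(sentence, seed=i)) for i in range(n_perturb)]
    return sum(deltas) / n_perturb
```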

πŸ“ Modeling Strategy

➑️ Base Models:

Our first baseline was KLUE-RoBERTa (available on Hugging Face), the model with the best STS performance among the candidates below.

| Model | STS | TC |
| --- | --- | --- |
| mBERT-base | 84.66 | 81.55 |
| XLM-R-base | 89.16 | 83.52 |
| KLUE-RoBERTa-base | 92.50 | 85.07 |


We then experimented with a range of additional models.


➑️ Key Techniques:

  1. Sliding window over tokens with max pooling
  2. Sentence-level Perturbation + Perplexity filtering
  3. KLUE-RoBERTa embedding + cosine similarity for pseudo-labeling
  4. Model ensemble using ranking + voting
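
As one concrete example, technique 1 (sliding window over tokens with max pooling) could be sketched as follows, where `score_fn` stands in for the fine-tuned classifier's P(generated=1) and the window/stride values are illustrative:

```python
def window_max_pool(token_ids, score_fn, window: int = 512, stride: int = 256):
    """Slide an overlapping window over the token sequence, score each
    chunk with the classifier, and max-pool the chunk probabilities so
    that one AI-like chunk flags the whole input."""
    if len(token_ids) <= window:
        return score_fn(token_ids)
    scores = [score_fn(token_ids[i:i + window])
              for i in range(0, len(token_ids) - stride, stride)]
    return max(scores)
```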

⭐ Final Model:

Our final submission used an ensemble of three models, trained on a paragraph-level relabeled dataset.

➑️ Models

1. KLUE-RoBERTa-large (fine-tuned)

  • Chosen for its strong semantic representation capability, which is crucial for detecting subtle contextual inconsistencies in AI-generated text
  • Check training details here

2. KLUE-RoBERTa-base (fine-tuned)

  • Selected as a lighter and more generalizable counterpart to the large model, reducing overfitting risk on imbalanced data
  • Check training details here

3. CatBoost Classifier

  • Focused on stylometric anomalies often observed in AI-generated text, complementing semantic models
  • Features: unique_ratio, verb_ratio, entropy, polynomial interaction terms (degree=2), Perplexity (PPL) estimated via skt/kogpt2-base-v2
  • Check training details here
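
A simplified sketch of this feature construction (verb_ratio is omitted here because it needs a Korean POS tagger; the helper name and exact interaction layout are assumptions):

```python
import math
from itertools import combinations_with_replacement

def stylometric_features(tokens, ppl: float):
    """Build unique_ratio, token entropy, and PPL, then append degree-2
    polynomial interaction terms for the gradient-boosted classifier."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    n = len(tokens)
    unique_ratio = len(counts) / n
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    base = [unique_ratio, entropy, ppl]
    # degree-2 interactions, analogous to sklearn's PolynomialFeatures
    inter = [a * b for a, b in combinations_with_replacement(base, 2)]
    return base + inter
```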

➑️ Ensemble Strategy

A custom Extreme Voting method was adopted to maximize ROC-AUC:

  • If at least two of the three models predict ≥ 0.5 → max(probabilities) (optimistic consensus)
  • Otherwise → min(probabilities) (conservative fallback)
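
The rule above is a few lines of code (`extreme_vote` is an illustrative name):

```python
def extreme_vote(probs):
    """Extreme Voting: if at least two models predict >= 0.5, take the
    max probability (optimistic consensus); otherwise take the min
    (conservative fallback)."""
    if sum(p >= 0.5 for p in probs) >= 2:
        return max(probs)
    return min(probs)
```

Pushing scores toward the extremes sharpens the ranking, which directly benefits ROC-AUC.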

πŸ“ Experiments & Results


πŸ“ Project Structure


πŸ“ Team & Contributions

Our team of 5 members divided responsibilities as follows:

Team Name: PokSSak(폭싹)

μ†Œμ† 이름 μ—­ν• 
μˆ™λͺ…μ—¬μžλŒ€ν•™κ΅ μ»΄ν“¨ν„°μ‚¬μ΄μ–ΈμŠ€μ „κ³΅(22) κΉ€μ†Œμ˜ Data Engineering, Relabeling & Model Development soyoung2118
μˆ™λͺ…μ—¬μžλŒ€ν•™κ΅ λ°μ΄ν„°μ‚¬μ΄μ–ΈμŠ€μ „κ³΅(23) κΉ€μˆ˜λΉˆ
μˆ™λͺ…μ—¬μžλŒ€ν•™κ΅ λ°μ΄ν„°μ‚¬μ΄μ–ΈμŠ€μ „κ³΅(23) μ˜€ν˜„μ„œ Data Engineering & Experiment Execution
μˆ™λͺ…μ—¬μžλŒ€ν•™κ΅ λ°μ΄ν„°μ‚¬μ΄μ–ΈμŠ€μ „κ³΅(23) μ›μ§€μš° Data Engineering & Experiment Execution
μˆ™λͺ…μ—¬μžλŒ€ν•™κ΅ μ†Œν”„νŠΈμ›¨μ–΄μœ΅ν•©μ „κ³΅(22) μž„μ†Œμ • Data Engineering & Model Development sophia

We collaborated for 4 weeks.


πŸ“ Stacks

Environment

Development

Communication


πŸ“ Key Takeaways


πŸ“ Appendix
