This repository presents our solution for the Detect AI-generated Text Competition, a national-level data science competition hosted by Dacon in 2025.
The goal of the challenge was to detect whether a paragraph was written by a Large Language Model (LLM), with only document-level labels provided. This required re-engineering the data, designing weak supervision strategies, and building robust classifiers in a highly imbalanced setting.
- Host: Dacon
- Track: Detect AI-generated Text
- Evaluation Metric: ROC-AUC
- Input: Document-level `full_text` (provided as `train.csv`)
- Label: `generated` (0 = human-written, 1 = AI-generated)
- Goal: Classify each paragraph as human- or AI-generated
`train.csv` (document-level training data; Korean excerpts translated):

| title | full_text | generated |
|---|---|---|
| … | … (truncated Korean document) | 0 |
| 청색거성 (blue giant) | "In astronomy, the luminosity classification of blue giants… (truncated)" | 0 |
| … | "…without the actors' acting, on the stage… (truncated)" | 1 |
`test.csv` (paragraph-level test data; Korean excerpts translated):

| ID | title | paragraph_index | paragraph_text |
|---|---|---|---|
| TEST_0000 | The need for public moral awareness | 0 | Morality is originally an individual's thought… |
| TEST_0001 | The need for public moral awareness | 1 | Morality is simply an individual's matter… |
| TEST_0002 | The need for public moral awareness | 2 | Here, the so-called public morality… |
Submission format (paragraph-level predictions):

| ID | generated |
|---|---|
| TEST_0000 | 0 |
| TEST_0001 | 0 |
Our core challenge was to reconstruct paragraph-level labels from document-level data in an extremely imbalanced and noisy setting. To overcome this, we developed multiple weak supervision and filtering strategies, focusing on data-centric approaches.
- Each `full_text` (up to 9 paragraphs) was split into multiple `paragraph_text` units
- For long paragraphs, we applied sliding-window chunking with an overlapping stride (see the sketch below)
- Input format: `"제목: {title} 본문: {paragraph_text}"` (i.e., "Title: {title} Body: {paragraph_text}")
The dataset exhibited severe class imbalance: approximately a 10:1 ratio between `generated=0` and `generated=1` labels. To address this, we performed both data augmentation and filtering for high-confidence positive samples.
- Oversampling + Label Propagation
- Filtering using Perplexity and Semantic Similarity
We used a pretrained Korean generative model, Kanana, to generate synthetic paragraphs mimicking the `generated=1` style.
These augmented paragraphs were added to the training set to enrich the positive class.
- Model: `kakaocorp/kanana-1.5-2.1b-instruct-2505` (an instruction-tuned model from Kakao's Kanana family)
- Prompt-based generation: the input includes the preceding (`[BEFORE]`) and following (`[AFTER]`) paragraphs, and the model generates a coherent middle (`[TARGET]`) paragraph mimicking the `generated=1` style
- Post-processing: used `kss` for sentence segmentation; removed incomplete trailing sentences with regex ending-pattern matching (다, 요, 습니다, etc.); manually spot-checked to drop low-quality generations
- Labeling & integration: all generated paragraphs were labeled `generated=1` and added to the training set to enrich the positive class distribution
- Scale: generated 11,317 synthetic paragraphs for augmentation (see the sketch below)
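A hedged sketch of the generation loop: the model id, the `[BEFORE]`/`[AFTER]`/`[TARGET]` framing, `kss` segmentation, and the ending-pattern filter follow the description above, while the prompt wording, sampling parameters, and the exact regex are illustrative assumptions:

```python
import re

import kss
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "kakaocorp/kanana-1.5-2.1b-instruct-2505"
gen_tok = AutoTokenizer.from_pretrained(MODEL_ID)
gen_lm = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Complete Korean sentence endings (다 / 요 / 습니다 etc.).
ENDING = re.compile(r"(습니다|다|요)[.!?]?\s*$")

def generate_middle(before: str, after: str, max_new_tokens: int = 200) -> str:
    prompt = f"[BEFORE] {before}\n[AFTER] {after}\n[TARGET]"
    inputs = gen_tok(prompt, return_tensors="pt")
    out = gen_lm.generate(**inputs, max_new_tokens=max_new_tokens,
                          do_sample=True, top_p=0.9)
    text = gen_tok.decode(out[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
    sentences = kss.split_sentences(text)
    # Drop incomplete trailing sentences that do not match an ending pattern.
    while sentences and not ENDING.search(sentences[-1].strip()):
        sentences.pop()
    return " ".join(sentences)
```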
4) Weak Labeling & Filtering Strategies (learn more →)
We experimented with various heuristic and unsupervised labeling techniques:
- Used GPT-like models to compute perplexity scores
- Sentences or paragraphs with low perplexity were considered likely AI-generated
- Applied thresholding to extract `generated=1` candidates (see the sketch below)
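A sketch of this filter, assuming `skt/kogpt2-base-v2` as the scoring LM (the same model used for the CatBoost PPL feature later); the threshold value is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ppl_tok = AutoTokenizer.from_pretrained("skt/kogpt2-base-v2")
ppl_lm = AutoModelForCausalLM.from_pretrained("skt/kogpt2-base-v2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = ppl_tok(text, return_tensors="pt")["input_ids"]
    loss = ppl_lm(ids, labels=ids).loss  # mean negative log-likelihood per token
    return torch.exp(loss).item()

def perplexity_candidates(paragraphs: list[str], threshold: float = 20.0) -> list[int]:
    # Low perplexity -> fluent, LLM-typical text -> generated=1 candidate.
    return [1 if perplexity(p) < threshold else 0 for p in paragraphs]
```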
- Extracted syntactic style features (e.g., average sentence length, token diversity)
- Combined with perplexity
- Performed KMeans or HDBSCAN clustering to isolate AI-like patterns (sketched below)
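A sketch of this clustering step, shown with scikit-learn KMeans (HDBSCAN was the alternative); the features follow the bullets above, and taking the lowest-perplexity cluster as the AI-like one is our illustrative assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def style_features(paragraph: str, ppl: float) -> list[float]:
    tokens = paragraph.split()
    sents = [s for s in paragraph.split(".") if s.strip()]
    avg_sent_len = float(np.mean([len(s.split()) for s in sents])) if sents else 0.0
    token_diversity = len(set(tokens)) / max(len(tokens), 1)
    return [avg_sent_len, token_diversity, ppl]

def ai_like_mask(paragraphs: list[str], ppls: list[float]) -> np.ndarray:
    X = StandardScaler().fit_transform(
        np.array([style_features(p, q) for p, q in zip(paragraphs, ppls)]))
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    # Treat the cluster with the lower mean (standardized) perplexity as AI-like.
    ai_cluster = min((0, 1), key=lambda c: X[labels == c, 2].mean())
    return labels == ai_cluster
```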
- Applied sentence-wise perplexity estimation
- Clustered using HDBSCAN to extract dense positive clusters
- The label was assigned to the entire paragraph if one or more sentences clustered as AI (see the sketch below)
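A sketch of the sentence-level variant, assuming the `hdbscan` package and the `perplexity()` helper above; calling a dense, low-PPL cluster "AI-like" (with an illustrative cutoff) is our assumption:

```python
import hdbscan
import kss
import numpy as np

def paragraph_weak_label(paragraph: str, ppl_cutoff: float = 20.0) -> int:
    sentences = kss.split_sentences(paragraph)
    if len(sentences) < 3:
        return 0  # too few sentences to cluster meaningfully
    ppls = np.array([[perplexity(s)] for s in sentences])
    labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(ppls)
    for c in set(labels) - {-1}:  # -1 is HDBSCAN's noise label
        if ppls[labels == c].mean() < ppl_cutoff:
            return 1  # one or more sentences in a dense low-PPL cluster
    return 0
```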
- Modified (perturbed) the sentence input
- Measured change in log-likelihood (LL) or model confidence
- Used as a proxy for model "surprise" or fluency robustness (see the sketch below)
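A sketch of this probe, reusing `ppl_tok`/`ppl_lm` from the perplexity snippet; the word-dropout perturbation is an illustrative stand-in, since the exact perturbation scheme is not pinned down above:

```python
import random

import torch

@torch.no_grad()
def mean_log_likelihood(text: str) -> float:
    ids = ppl_tok(text, return_tensors="pt")["input_ids"]
    return -ppl_lm(ids, labels=ids).loss.item()

def perturb(text: str, drop_prob: float = 0.1) -> str:
    words = text.split()
    kept = [w for w in words if random.random() > drop_prob]
    return " ".join(kept) if kept else text

def surprise_score(text: str, n: int = 10) -> float:
    # LLM text tends to sit near a likelihood peak, so perturbing it lowers
    # the log-likelihood more than perturbing human text does.
    base = mean_log_likelihood(text)
    avg_perturbed = sum(mean_log_likelihood(perturb(text)) for _ in range(n)) / n
    return base - avg_perturbed
```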
The first baseline model was KLUE-RoBERTa (available on Hugging Face), the best-performing model on STS among the candidates below:
| Model | STS | TC |
|---|---|---|
| mBERT-base | 84.66 | 81.55 |
| XLM-R-base | 89.16 | 83.52 |
| KLUE-RoBERTa-base | 92.50 | 85.07 |
After that, we made many attempts with a variety of models (learn more →):
- Sliding window over tokens with max pooling
- Sentence-level perturbation + perplexity filtering
- KLUE-RoBERTa embedding + cosine similarity for pseudo-labeling (sketched after this list)
- Model ensemble using ranking + voting
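As an example of these attempts, a sketch of the cosine-similarity pseudo-labeling, assuming masked mean pooling over `klue/roberta-base` token embeddings and an illustrative 0.85 threshold:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

enc_tok = AutoTokenizer.from_pretrained("klue/roberta-base")
enc = AutoModel.from_pretrained("klue/roberta-base").eval()

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    batch = enc_tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = enc(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)     # masked mean pooling
    return F.normalize(pooled, dim=-1)

def pseudo_label(unlabeled: list[str], positives: list[str], thr: float = 0.85):
    sims = embed(unlabeled) @ embed(positives).T      # cosine similarity matrix
    # Close enough to any known generated=1 example -> pseudo-label 1.
    return (sims.max(dim=1).values >= thr).long()
```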
Our final submission used an ensemble of three models, trained on a paragraph-level relabeled dataset:
1. KLUE-RoBERTa-large (fine-tuned)
- Chosen for its strong semantic representation capability, which is crucial for detecting subtle contextual inconsistencies in AI-generated text
- Check training details here
2. KLUE-RoBERTa-base (fine-tuned)
- Selected as a lighter and more generalizable counterpart to the large model, reducing overfitting risk on imbalanced data
- Check training details here (fine-tuning and feature sketches follow this list)
3. CatBoost Classifier
- Focused on stylometric anomalies often observed in AI-generated text, complementing semantic models
- Features: `unique_ratio`, `verb_ratio`, `entropy`, polynomial interaction terms (degree=2), and perplexity (PPL) estimated via `skt/kogpt2-base-v2`
- Check training details here
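For the two RoBERTa classifiers, a minimal fine-tuning sketch with the standard Hugging Face `Trainer`; the hyper-parameters are illustrative (the real settings live in the linked training details), and `train_ds`/`eval_ds` are assumed to be already-tokenized datasets:

```python
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

def finetune(model_name: str, train_ds, eval_ds) -> Trainer:
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    args = TrainingArguments(output_dir="outputs", num_train_epochs=3,
                             learning_rate=2e-5, per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()
    return trainer

# finetune("klue/roberta-large", train_ds, eval_ds)
# finetune("klue/roberta-base", train_ds, eval_ds)
```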
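And for the CatBoost model, a sketch of the feature pipeline; `verb_count` is assumed to come from a Korean POS tagger (the write-up does not name one), and `perplexity()` is the `skt/kogpt2-base-v2` helper from earlier:

```python
import math
from collections import Counter

import numpy as np
from catboost import CatBoostClassifier
from sklearn.preprocessing import PolynomialFeatures

def base_features(text: str, verb_count: int) -> list[float]:
    tokens = text.split()
    unique_ratio = len(set(tokens)) / max(len(tokens), 1)
    verb_ratio = verb_count / max(len(tokens), 1)
    counts = Counter(tokens)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values()) if total else 0.0
    return [unique_ratio, verb_ratio, entropy, perplexity(text)]

def build_matrix(texts: list[str], verb_counts: list[int]) -> np.ndarray:
    X = np.array([base_features(t, v) for t, v in zip(texts, verb_counts)])
    # Degree-2 polynomial interaction terms over the base features.
    return PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

catboost_clf = CatBoostClassifier(iterations=500, verbose=False)
# catboost_clf.fit(build_matrix(train_texts, train_verb_counts), y_train)
```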
A custom Extreme Voting method was adopted to maximize ROC-AUC:
- ≥2 models predict ≥0.5 → max(probabilities) (optimistic consensus)
- Otherwise → min(probabilities) (conservative fallback)
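A minimal implementation of the rule as stated, for an (n_samples, n_models) array of positive-class probabilities:

```python
import numpy as np

def extreme_vote(probs: np.ndarray) -> np.ndarray:
    """probs: (n_samples, n_models) array of P(generated=1) per model."""
    positive_votes = (probs >= 0.5).sum(axis=1)
    # >=2 models agree it is positive -> most confident score (max);
    # otherwise -> most conservative score (min).
    return np.where(positive_votes >= 2, probs.max(axis=1), probs.min(axis=1))
```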
Our team of 5 members divided responsibilities as follows:
Team Name: PokSSak (폭싹)
| Affiliation | Name | Role | GitHub |
|---|---|---|---|
| Sookmyung Women's University, Computer Science ('22) | … | Data Engineering, Relabeling & Model Development | soyoung2118 |
| Sookmyung Women's University, Data Science ('23) | … | | |
| Sookmyung Women's University, Data Science ('23) | … | Data Engineering & Experiment Execution | |
| Sookmyung Women's University, Data Science ('23) | … | Data Engineering & Experiment Execution | |
| Sookmyung Women's University, Software Convergence ('22) | … | Data Engineering & Model Development | sophia |
We collaborated for 4 weeks.
- Final submission score: ROC-AUC 0.889
- Full competition leaderboard: https://dacon.io/competitions/official/236473/leaderboard
- PDF summary: `docs/project_summary.pdf`
![leaderboard screenshot](인증샷.jpg)