Conversation

twaugh (Owner) commented Nov 24, 2025

No description provided.

Page-level chunks (__PAGE__) are synthetic entries in the RAG index
used for semantic search by page name/title/frontmatter. They don't
correspond to real blocks in the file structure, so they can't be
used as integration targets for actions like add_under or replace.

When the LLM selects a __PAGE__ chunk as a target, it means "this
knowledge belongs on this page" without specifying a particular block.
The correct interpretation is add_section (new top-level section).
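
A minimal sketch of the normalization step, applied after LLM ID
translation (function and variable names here are illustrative, not
the repo's actual code):

```python
import logging

logger = logging.getLogger(__name__)

PAGE_CHUNK_SUFFIX = "::__PAGE__"  # marker for synthetic page-level chunks

def normalize_page_level_target(action: str, target_block_id):
    """Reinterpret a __PAGE__ target as add_section: the knowledge
    belongs on the page, not under any particular block."""
    if target_block_id and target_block_id.endswith(PAGE_CHUNK_SUFFIX):
        logger.debug(
            "Normalizing page-level target %s: %s -> add_section",
            target_block_id,
            action,
        )
        # add_section creates a new top-level section, so no target id
        return "add_section", None
    return action, target_block_id
```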

Changes:
- Detect targets ending with "::__PAGE__" after LLM ID translation
- Normalize action to "add_section" regardless of LLM suggestion
- Clear target_block_id (add_section has no specific target)
- Add debug logging for normalization events

This eliminates the "Target block not found: page::__PAGE__" errors
and lets the LLM suggest integration into pages that have no blocks
yet (only page-level metadata exists in the RAG index).

Impact:
- Before: Integration fails with "target block not found" error
- After: Page-level chunks correctly interpreted as add_section
- Enables: Adding knowledge to empty but relevant pages

Tests:
- Added test_plan_integration_for_block_normalizes_page_level_chunks
- Added test_plan_integration_for_block_preserves_regular_block_targets
- All existing llm_wrappers tests pass

Assisted-by: Claude Code

Page-level chunks (__PAGE__) store frontmatter in their context for
semantic search quality during embedding. However, when formatting
these chunks for LLM prompts, the frontmatter was duplicated:
- Once in the <properties> section (parsed from page outline)
- Again in the <block> content (from stored RAG chunk context)

This wasted ~50-200 tokens per page depending on property count.

Solution: reuse the existing _clean_context_for_llm() function from
page_indexer.py to strip frontmatter from page-level chunks during
prompt formatting, relying on tested code instead of duplicating the
stripping logic.
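
A sketch of how this might look inside format_chunks_for_llm() in
llm_helpers.py; _clean_context_for_llm() and the "::__PAGE__" suffix
come from this commit, while the chunk structure and loop are
illustrative assumptions:

```python
from page_indexer import _clean_context_for_llm  # existing, tested helper

def format_chunks_for_llm(chunks: list[dict]) -> str:
    """Render RAG chunks as <block> entries for the LLM prompt."""
    parts = []
    for chunk in chunks:
        content = chunk["content"]
        if chunk["id"].endswith("::__PAGE__"):
            # Page-level chunks keep frontmatter for embedding quality;
            # strip it here so it appears only in <properties>.
            content = _clean_context_for_llm(content)
        parts.append(f'<block id="{chunk["id"]}">\n{content}\n</block>')
    return "\n".join(parts)
```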

Changes:
- Import _clean_context_for_llm() in llm_helpers.py
- Detect page-level chunks (::__PAGE__) in format_chunks_for_llm()
- Apply _clean_context_for_llm() to page-level chunks only
- Regular blocks unchanged (already cleaned during indexing)
- Frontmatter remains in <properties> section for LLM context

Impact:
- Before: "tags:: foo, bar" appears twice in the prompt (properties + block); see the excerpt after this list
- After: "tags:: foo, bar" appears once (properties only)
- Token savings: ~50-200 per page with properties
- No impact on semantic search quality (frontmatter still embedded)
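
For illustration only (the page body and property values are made
up), the duplication in the rendered prompt looked roughly like:

```
Before:
  <properties>
  tags:: foo, bar
  </properties>
  <block id="page::__PAGE__">
  tags:: foo, bar        <- duplicated frontmatter
  Page body text...
  </block>

After:
  <properties>
  tags:: foo, bar
  </properties>
  <block id="page::__PAGE__">
  Page body text...
  </block>
```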

Tests:
- Added test_llm_helpers.py with 3 integration tests
- Tests cover page-level chunk stripping and regular block preservation
- All existing llm_wrappers tests pass (no regressions)

Assisted-by: Claude Code
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.67%. Comparing base (f2f7048) to head (da0af47).

Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main      #40      +/-   ##
==========================================
+ Coverage   84.63%   84.67%   +0.04%
==========================================
  Files          48       48
  Lines        5128     5136       +8
==========================================
+ Hits         4340     4349       +9
+ Misses        788      787       -1
```
☔ View full report in Codecov by Sentry.

twaugh merged commit f9def7f into main Nov 24, 2025
1 check passed
twaugh deleted the fix/page-level-chunk-integration branch November 24, 2025 12:43