URL and case rules fix #29

pan93412 · 2025-10-31T14:52:30Z

This pull request introduces enhancements to the spelling and case rule detection logic, as well as improvements to documentation and database interfaces. The main focus is to support alternative case forms for terms, refine the exclusion logic for spellchecking, and clarify setup instructions.

Documentation improvements:

Updated README.md to use correct repository URLs, add shell/env block formatting, and clarify installation/startup steps. Redundant setup instructions were removed for clarity. [1] [2] [3]
Added documentation for "alternative case forms" in the case rule section, explaining that terms like App can also be written as app.

Database and model changes:

The CaseRule interface and schema now support an optional alternatives field (array of strings) for alternative case forms.
Updated the ICaseDatabase interface and its implementations (CaseDatabase, MockCaseDatabase) so addRule accepts an optional alternatives parameter. [1] [2] [3]

Spellchecking logic improvements:

Refactored the Determiner class in determiner.ts to modularize and improve exclusion logic for URLs, paths, code blocks, and mentions. Added helper methods for building/merging exclusion ranges and for checking matches outside excluded ranges.
Enhanced case rule checking to support alternative forms, ensuring that only incorrect case usages outside excluded ranges are flagged.

I also resolved the case-checking issues evident in the CI. There should be no spelling or casing problems in this codebase now.

Minor dependency and import updates:

Adjusted imports in determiner.ts to include the new CaseRule type.

Fixed #26
Fixed https://discord.com/channels/1259032762422136902/1259032897373999144/1433821827364425758

Summary by CodeRabbit

New Features
- Added support for alternative terms/casings when defining case rules.
- Improved handling of special content areas—rules now exclude URLs, code blocks, inline code, and Markdown links from checks.
Documentation
- Updated setup instructions with corrected repository references and better formatting.
Tests
- Added comprehensive test coverage for edge cases including URLs, code blocks, and Markdown scenarios.

coderabbitai · 2025-10-31T14:52:40Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This PR extends the case-rule system to support alternative term variations by adding an alternatives field to the CaseRule model and database layer. It simultaneously refactors the Determiner's spelling-check logic to use a range-based exclusion system that skips URLs, code blocks, and other contexts where style rules should not apply.

Changes

Cohort / File(s)	Summary
Configuration `.gitignore`	Added `package-lock.json` to ignore list with note on pnpm usage.
Documentation `README.md`	Updated repository URLs (Edit-Mr/WinstonBot), adjusted clone and directory references, added language hints to code blocks, documented CaseRule casing alternatives, and removed trailing setup content.
Data Model `models.ts`	Extended `CaseRule` interface and schema with optional `alternatives: string[] \| null` field and corresponding validator.
Database Interface `database-interfaces.ts`	Updated `ICaseDatabase.addRule()` signature to accept optional `alternatives?: string[] \| null` parameter alongside `term`.
Database Implementation `database.ts`, `database-test-utils.ts`	Updated `CaseDatabase.addRule()` and `MockCaseDatabase.addRule()` to accept and persist the optional `alternatives` parameter; inserted documents now include `{ term, alternatives, _id }` structure.
Spelling Checker Refactoring `determiner.ts`	Introduced URL_REGEX field and range-based exclusion system via new private methods (`#buildExcludedRanges`, `#addMatchedRanges`, `#mergeRanges`, `#createIsExcludedChecker`, `#getSanitizedContent`); refactored `checkSpelling`, `#checkGeneralRules`, and `#checkCaseRules` to skip URLs, code blocks, paths, and other excluded ranges; imported `CaseRule` type from models.
Test Coverage `determiner.test.ts`	Added comprehensive tests covering HTTP-rule exclusions in code blocks, URL/Markdown handling, inline code blocks, mixed URL/Markdown scenarios, and edge cases around casing and alternatives.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant Determiner
    participant RangeBuilder
    participant RuleChecker
    participant Validator

    Caller->>Determiner: checkSpelling(content)
    activate Determiner
    Determiner->>RangeBuilder: `#buildExcludedRanges`(content)
    activate RangeBuilder
    RangeBuilder->>RangeBuilder: `#addMatchedRanges` (URLs)
    RangeBuilder->>RangeBuilder: `#addPathRanges`
    RangeBuilder->>RangeBuilder: `#addCodeBlockRanges`
    RangeBuilder->>RangeBuilder: `#mergeRanges`
    RangeBuilder-->>Determiner: excludedRanges[]
    deactivate RangeBuilder
    
    Determiner->>Determiner: `#getSanitizedContent` (remove excluded)
    Determiner->>Determiner: `#createIsExcludedChecker`
    
    rect rgb(220, 240, 255)
    Note over Determiner,RuleChecker: Check General Rules
    Determiner->>RuleChecker: `#checkGeneralRules`(sanitized, isExcluded)
    RuleChecker->>Validator: validate matches not in excluded ranges
    Validator-->>RuleChecker: mistakes[]
    RuleChecker-->>Determiner: generalMistakes[]
    end
    
    rect rgb(220, 240, 255)
    Note over Determiner,RuleChecker: Check Case Rules
    Determiner->>RuleChecker: `#checkCaseRules`(sanitized, isExcluded)
    RuleChecker->>Validator: validate term + alternatives in excluded ranges
    Validator-->>RuleChecker: mistakes[]
    RuleChecker-->>Determiner: caseMistakes[]
    end
    
    Determiner->>Determiner: merge all mistakes
    Determiner-->>Caller: mistakes[]
    deactivate Determiner

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

determiner.ts: Core refactoring introduces five new private methods with range-based exclusion logic; requires careful verification of merge logic, boundary condition handling, and integration with existing rule-checking routines.
determiner.test.ts: Extensive test additions demand validation that all code-block, URL, and Markdown scenarios are correctly handled; edge cases around alternatives and casing merit specific attention.
Database layer changes (database-interfaces.ts, database.ts, database-test-utils.ts, models.ts): While more straightforward, the cascading signature changes across multiple layers and the schema validator logic should be verified for consistency.

Possibly related PRs

Use MongoDB and new database structure #8: Directly extends the same CaseRule model and CaseDatabase.addRule method, building on the database foundations to add alternatives support.
Filtering raw Markdown syntax #28: Builds on the same database-interface and case-rule changes, extending ICaseDatabase and MockCaseDatabase signatures.

Suggested reviewers

Edit-Mr

Poem

🐰 Hop hop, the rules now know their way,
Through URLs and code they'll gently stray,
Alternative terms in harmony rest,
Ranges excluded—the speller's test—
No more false alarms where they don't belong,
The checker now dances to a smarter song! 🎵

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'URL and case rules fix' directly addresses the main changes: URL exclusion logic and case rule enhancements with alternatives support.
Linked Issues check	✅ Passed	PR addresses bug #26 by refactoring case rule detection to support alternatives and improved exclusion logic for URLs and code blocks.
Out of Scope Changes check	✅ Passed	Changes include README updates (repository URLs, documentation), database interface refactoring for alternatives support, and improved exclusion logic—all aligned with supporting alternative case forms and fixing URL/code block handling.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch alternatives-for-caserules

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0842106 and 190cb33.

📒 Files selected for processing (8)

.gitignore (1 hunks)
README.md (3 hunks)
database-interfaces.ts (1 hunks)
database-test-utils.ts (1 hunks)
database.ts (1 hunks)
determiner.test.ts (2 hunks)
determiner.ts (3 hunks)
models.ts (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-10-31T12:45:58.617Z

Learnt from: pan93412
Repo: Edit-Mr/WinstonBot PR: 28
File: determiner.ts:167-178
Timestamp: 2025-10-31T12:45:58.617Z
Learning: In the WinstonBot project's determiner.ts file, the `searchWord` variable in spelling rules intentionally supports regex patterns. The rules are stored in MongoDB as controlled input, so constructing RegExp from these patterns is by design and not a security concern.

Applied to files:

database-test-utils.ts
determiner.test.ts
models.ts
determiner.ts

🧬 Code graph analysis (2)

determiner.test.ts (2)

database-test-utils.ts (2)

MockSpellingDatabase (16-56)

MockCaseDatabase (61-91)

determiner.ts (1)

Determiner (15-275)

determiner.ts (2)

zhhanttw.ts (1)

detectChineseType (9-23)

models.ts (2)

SpellingRuleType (20-20)

CaseRule (47-51)

🪛 ast-grep (0.39.7)

determiner.ts

[warning] 56-56: Regular expression constructed from variable input detected. This can lead to Regular Expression Denial of Service (ReDoS) attacks if the variable contains malicious patterns. Use libraries like 'recheck' to validate regex safety or use static patterns.
Context: new RegExp(this.#URL_REGEX, "g")
Note: [CWE-1333] Inefficient Regular Expression Complexity [REFERENCES]
- https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
- https://cwe.mitre.org/data/definitions/1333.html

(regexp-from-variable)

[warning] 211-211: Regular expression constructed from variable input detected. This can lead to Regular Expression Denial of Service (ReDoS) attacks if the variable contains malicious patterns. Use libraries like 'recheck' to validate regex safety or use static patterns.
Context: new RegExp(pattern, "i")
Note: [CWE-1333] Inefficient Regular Expression Complexity [REFERENCES]
- https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
- https://cwe.mitre.org/data/definitions/1333.html

(regexp-from-variable)

[warning] 216-216: Regular expression constructed from variable input detected. This can lead to Regular Expression Denial of Service (ReDoS) attacks if the variable contains malicious patterns. Use libraries like 'recheck' to validate regex safety or use static patterns.
Context: new RegExp(pattern, "gi")
Note: [CWE-1333] Inefficient Regular Expression Complexity [REFERENCES]
- https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
- https://cwe.mitre.org/data/definitions/1333.html

(regexp-from-variable)

[warning] 247-247: Regular expression constructed from variable input detected. This can lead to Regular Expression Denial of Service (ReDoS) attacks if the variable contains malicious patterns. Use libraries like 'recheck' to validate regex safety or use static patterns.
Context: new RegExp(lowerTerm, "gi")
Note: [CWE-1333] Inefficient Regular Expression Complexity [REFERENCES]
- https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS
- https://cwe.mitre.org/data/definitions/1333.html

(regexp-from-variable)

🔇 Additional comments (21)

.gitignore (1)

54-55: LGTM!

The addition of package-lock.json to .gitignore is appropriate for projects using pnpm as their package manager, since pnpm generates its own lockfile (pnpm-lock.yaml).

README.md (2)

27-52: LGTM!

The addition of language identifiers (shell, env) to code blocks improves syntax highlighting and readability in rendered Markdown.

86-86: LGTM!

The documentation now clearly mentions the alternative case forms feature (替代大小寫形式), which aligns well with the alternatives field added to the CaseRule model.

determiner.test.ts (4)

265-310: LGTM!

Excellent test coverage for HTTP-related case rules within code blocks and URLs. The tests properly validate that:

HTTP terms inside inline and multi-line code blocks are excluded from case checking

HTTP terms outside code blocks are still checked

URLs containing HTTP-like terms are properly excluded

417-518: LGTM!

Comprehensive test coverage for URL and path exclusion logic, covering multiple URL schemes (http, https, rtmp) and different path formats (relative, parent directory, absolute). The test validating that text outside URLs is still checked (lines 484-495) is particularly important for preventing false negatives.

520-622: LGTM!

Thorough test coverage for @ mention exclusion logic. The tests properly validate that @mentions with alphanumeric characters and underscores are excluded while text outside @mentions is still checked.

624-802: LGTM!

Excellent test coverage for the new alternatives feature. The tests comprehensively validate:

Valid alternatives are accepted (lines 625-639)

Invalid forms are flagged (lines 641-658)

Edge cases with null, undefined, and empty arrays (lines 660-711)

Multiple alternatives (lines 713-745)

Interaction with code block exclusions (lines 747-779)

Multiple rules with different alternatives (lines 781-801)

This thorough coverage gives confidence that the alternatives feature works correctly.

models.ts (2)

50-50: Type definition accepts both null and undefined.

The field accepts both null and undefined for optional values. While this was noted in a previous review, the maintainer appears to have chosen to keep both. This works correctly across the codebase (database layer, mocks, and tests all handle both values).

59-70: LGTM!

The validator correctly allows undefined, null, or an array of strings. The validation logic is sound and aligns with the TypeScript interface definition.

database-test-utils.ts (1)

76-77: LGTM!

The mock implementation correctly accepts and stores the alternatives parameter, maintaining consistency with the updated interface and the real database implementation.

database-interfaces.ts (1)

83-85: LGTM!

The interface extension to support alternatives is clean and well-documented. The optional parameter maintains backward compatibility.

database.ts (1)

140-144: LGTM!

The implementation correctly extends the database layer to persist alternatives alongside the term. The cache invalidation logic appropriately follows the insertion.
determiner.ts (9)
19-21: LGTM!

The URL_REGEX is now defined without the global flag, and a new instance with the 'g' flag is created when needed (line 57). This addresses potential statefulness issues with shared regex instances.

48-67: LGTM!

The range-based exclusion system is well-architected. Breaking down the logic into building, merging, and checking exclusion ranges makes the code modular and testable. The comprehensive test coverage validates this approach.

69-81: LGTM!

The #addMatchedRanges helper method is clean and reusable. The optional overlap checker parameter provides flexibility for different matching strategies.

83-114: Consider improving URL-in-path detection logic.

The current logic tests beforePath substring for URL patterns, but this may not correctly detect if the path is actually part of a URL (as noted in a previous review). The :// delimiter might span the boundary between beforePath and the current match.

However, since you previously indicated this approach is acceptable ("應該是還好"), and comprehensive tests are passing, the current implementation appears to work for the project's use cases.

If you'd like to make the detection more robust, consider checking if there's an overlapping URL match that contains the path range, rather than testing a substring:
// Alternative approach: check for overlapping URL ranges
let isPartOfURL = false;
const urlRegex = new RegExp(this.#URL_REGEX.source, 'g');
let urlMatch: RegExpExecArray | null;
while ((urlMatch = urlRegex.exec(content)) !== null) {
	const urlStart = urlMatch.index;
	const urlEnd = urlStart + urlMatch[0].length;
	if (start < urlEnd && end > urlStart) {
		isPartOfURL = true;
		break;
	}
}
if (isPartOfURL) {
	continue;
}
116-137: LGTM!

The range merging logic is correct and efficient. Sorting ranges by start position and merging overlapping/adjacent ranges in a single pass is the standard algorithm for this operation.

139-171: LGTM!

Excellent optimization! Automatically selecting between linear search (for ≤10 ranges) and binary search (for >10 ranges) provides good performance characteristics for both small and large exclusion lists. The binary search implementation is correct.

173-188: LGTM!

The sanitization logic correctly removes excluded ranges while preserving the remaining content.

190-232: LGTM!

The refactored general rules checking is well-structured. First checking the sanitized content for efficiency, then validating matches in the original content while skipping excluded positions is a sound approach.

Based on learnings, the RegExp construction from rule.wrong is intentional and safe since rules come from controlled MongoDB storage.

234-274: LGTM!

The case rules checking correctly implements the alternatives feature. Line 242 properly handles null/undefined alternatives with the nullish coalescing operator (??), and the validation logic correctly checks if a matched form is in the valid set before flagging it as an error.

Based on learnings, the RegExp construction from lowerTerm is safe as it comes from the controlled CaseRule database.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull Request Overview

This PR adds support for alternative case forms in case rules and refactors the spell-checking logic. The main changes include:

Adding an alternatives field to CaseRule to allow multiple valid case forms (e.g., "JavaScript" can also be "javascript")
Refactoring the Determiner class to use a more modular approach with separate methods for building excluded ranges, checking general rules, and checking case rules
Improving URL detection to use a more comprehensive regex pattern that covers various URL schemes (not just http/https)
Adding comprehensive test coverage for the new alternatives feature and URL/path exclusion logic

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
models.ts	Adds `alternatives` field to `CaseRule` interface and schema with validation
determiner.ts	Refactors spell-checking logic into modular methods and improves URL regex to support multiple schemes
determiner.test.ts	Adds comprehensive test coverage for alternatives feature and URL/path exclusion scenarios
database.ts	Updates `addRule` method to accept `alternatives` parameter
database-test-utils.ts	Updates mock database to support `alternatives` parameter
database-interfaces.ts	Updates interface definition to include `alternatives` parameter
README.md	Updates documentation with correct repository URLs, adds shell syntax highlighting, documents alternatives feature, removes duplicate setup instructions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-10-31T15:20:08Z

determiner.ts


+	// https://datatracker.ietf.org/doc/html/rfc1738#section-2.1
+	// https://datatracker.ietf.org/doc/html/rfc1738#section-3.1
+	#URL_REGEX = /[a-zA-Z0-9+.-]*:\/\/[^\s]+/g;


The URL regex is defined as a class property with the g flag, which makes it stateful. When reused multiple times (as in lines 57 and 98), the regex's lastIndex property won't reset between different strings, causing unpredictable matching behavior. This is a common bug with shared regex instances. Either remove the g flag and create new regex instances when needed, or reset lastIndex before each use.

@copilot open a new pull request to apply changes based on this feedback

determiner.ts

models.ts

determiner.ts

Copilot · 2025-11-05T11:40:30Z

@pan93412 I've opened a new pull request, #30, to work on those changes. Once the pull request is ready, I'll request review from you.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Yi-Jyun Pan <me@pan93.com>

Copilot

Pull Request Overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

models.ts

pan93412 added 5 commits October 31, 2025 22:34

feat(determiner): optimize URL and mention logics

f540b45

refactor(determiner): improve readability

4eddf94

feat(models): allow specifying alternatives of case rules

bc6c199

refactor(determiner): don't determine URL without protocols

903a4c3

feat(determiner): implement alternatives of caserules

9ef58af

pan93412 self-assigned this Oct 31, 2025

pan93412 added 2 commits October 31, 2025 22:56

docs: clean up README

24c880a

fix(determiner): sanitize for case rules and correct URL impl

35626d0

pan93412 requested a review from Copilot October 31, 2025 15:17

Copilot AI reviewed Oct 31, 2025

View reviewed changes

Copilot AI mentioned this pull request Nov 5, 2025

Fix stateful regex bug in URL pattern matching #30

Merged

Copilot AI and others added 2 commits November 5, 2025 20:02

fix: stateful regex in URL pattern matching

8f60e7c

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: Yi-Jyun Pan <me@pan93.com>

refactor(determiner): simplify initialization of validForm

190cb33

pan93412 requested a review from Copilot November 5, 2025 12:13

Copilot AI reviewed Nov 5, 2025

View reviewed changes

models.ts Show resolved Hide resolved

pan93412 marked this pull request as ready for review November 5, 2025 12:15

pan93412 requested a review from Edit-Mr November 5, 2025 12:15

Edit-Mr approved these changes Nov 5, 2025

View reviewed changes

Edit-Mr merged commit 2356530 into main Nov 5, 2025
10 of 13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

URL and case rules fix #29

URL and case rules fix #29

Uh oh!

pan93412 commented Oct 31, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Oct 31, 2025 •

edited

Loading

Other AI code review bot(s) detected

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Oct 31, 2025

Uh oh!

pan93412 Nov 5, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI commented Nov 5, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

URL and case rules fix #29

URL and case rules fix #29

Uh oh!

Conversation

pan93412 commented Oct 31, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Oct 31, 2025

Choose a reason for hiding this comment

Uh oh!

pan93412 Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI commented Nov 5, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pan93412 commented Oct 31, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 31, 2025 •

edited

Loading