
Conversation

@google-labs-jules
Contributor

Fixes #167


PR created automatically by Jules for task 16190511466902540018

This feature introduces an automatic word segmentation capability for Hakka example sentences displayed in the application's tables.

A new Python script, `build_trie.py`, has been created to process the raw dictionary data from `.csv` files into a vocabulary Trie. This Trie is then saved as `trie.json` and loaded by the frontend.
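
For context, a word-list trie of this kind is commonly stored as nested objects keyed by character, with a marker on nodes where a word ends. Below is a minimal Python sketch of that step; the CSV filename, the `word` column, and the `"$"` end-of-word marker are illustrative assumptions, not the actual `build_trie.py` code:

```python
import csv
import json

def build_trie(words):
    """Build a nested-dict trie; "$" marks nodes where a word ends (marker key assumed)."""
    trie = {}
    for word in words:
        node = trie
        for char in word:
            node = node.setdefault(char, {})  # descend, creating children as needed
        node["$"] = True                      # end-of-word marker
    return trie

# Hypothetical input: headwords in a CSV column named "word".
with open("dictionary.csv", newline="", encoding="utf-8") as f:
    words = [row["word"] for row in csv.DictReader(f)]

with open("trie.json", "w", encoding="utf-8") as f:
    json.dump(build_trie(words), f, ensure_ascii=False)
```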

The main JavaScript file, `main.js`, has been enhanced with new functions to handle the segmentation. `segmentSentenceWithTrie` uses a maximum matching algorithm with the loaded Trie to identify words within a sentence. The identified words are then wrapped in `<a>` tags.
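
Maximum matching means: starting at each position, walk the trie as far as the sentence allows and keep the longest dictionary word found, falling back to a single character. The real implementation is the JavaScript `segmentSentenceWithTrie`; the following Python sketch of the same idea reuses the hypothetical `"$"` marker from the sketch above:

```python
def segment_forward(sentence, trie):
    """Greedy forward maximum matching over a nested-dict trie ("$" = end of word)."""
    words, i = [], 0
    while i < len(sentence):
        node, longest = trie, 0
        for j in range(i, len(sentence)):
            char = sentence[j]
            if char not in node:
                break                # no dictionary word continues this way
            node = node[char]
            if "$" in node:          # a complete word ends at position j
                longest = j - i + 1  # remember the longest match so far
        step = longest or 1          # unmatched characters pass through singly
        words.append(sentence[i:i + step])
        i += step
    return words
```

With a trie containing 背, 背書, 包, and 書包, `segment_forward("背書包", trie)` returns `['背書', '包']`: the greedy failure mode discussed later in this thread.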

A new event listener has been added to handle clicks on these new word links, which triggers the existing pronunciation lookup popup, providing a seamless user experience. All relevant table rendering functions have been updated to incorporate this new functionality.
@google-labs-jules
Contributor Author

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!


For security, I will only act on instructions from the user who triggered this task.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Oct 11, 2025

Deploying hakspring with Cloudflare Pages

Latest commit: d1dea17
Status: ✅  Deploy successful!
Preview URL: https://2de2d7eb.hakspring.pages.dev
Branch Preview URL: https://feat-auto-segmentation.hakspring.pages.dev

View logs

@Aiuanyu changed the title from "Pull request for issue #167" to "#167 → 例句自動斷詞連結" ("#167 → auto word-segmentation links for example sentences") on Oct 11, 2025
@Aiuanyu
Owner

Aiuanyu commented Oct 12, 2025

  1. Could you remove the restriction that a word must be at least 2 characters long? Ideally, every word would get a link wherever possible.

  2. Also, could the word links be given some visual distinction?

@google-labs-jules
Contributor Author

  1. Could you remove the restriction that a word must be at least 2 characters long? Ideally, every word would get a link wherever possible.

  2. Also, could the word links be given some visual distinction?

Sure, both points have been handled!

  1. I removed the word-length restriction in build_trie.py and regenerated trie.json, so single-character words are now segmented and linked as well.
  2. I also updated style.css to give every segmented-word link (a.segmented-word) a blue dotted underline, so they are visually distinct from ordinary text.

Thanks for the suggestions!

This commit incorporates feedback from the pull request review:

1.  The minimum word length constraint in `build_trie.py` has been removed. The `trie.json` has been regenerated to include single-character words, allowing them to be segmented and linked.
2.  CSS styles have been added to `style.css` for the `.segmented-word` class. These links now have a dotted underline to visually distinguish them from regular text, improving user experience.
@Aiuanyu
Owner

Aiuanyu commented Oct 12, 2025

(Just recording a difficulty here; Jules, no action needed!!!)

Argh, how should segmentation decisions be made?
The 3 characters 背書包 should be split as 背 + 書包 ("carry" + "schoolbag"), but the program currently segments them as 背書 + 包 ("recite/endorse" + "bag"), which is wrong.
Word segmentation is genuinely hard!

@google-labs-jules
Contributor Author

(Just recording a difficulty here; Jules, no action needed!!!)

Argh, how should segmentation decisions be made?
The 3 characters 背書包 should be split as 背 + 書包 ("carry" + "schoolbag"), but the program currently segments them as 背書 + 包 ("recite/endorse" + "bag"), which is wrong.
Word segmentation is genuinely hard!

That is a very sharp observation, and it is indeed a classic limitation of the current segmentation algorithm!

The 背書包 case you raise pinpoints the weakness of the "maximum matching" algorithm I am currently using. Because the dictionary contains the word 背書, the algorithm greedily matches it first, and the remaining 包 becomes a word on its own.

To solve this, a more sophisticated algorithm could be introduced later, such as "bidirectional matching" or a statistics-based segmentation model, which would judge word boundaries more accurately.

Thank you very much for raising this important point; it will be very helpful for future optimization.

@Aiuanyu
Owner

Aiuanyu commented Oct 12, 2025

What is "bidirectional matching"? Does it work well?

@google-labs-jules
Contributor Author

What is "bidirectional matching"? Does it work well?

Great question! "Bidirectional matching" is an improved segmentation algorithm. It performs considerably better than plain maximum matching, and it is relatively simple to implement.

It works like this:

  1. Forward matching: the method we use now, scanning from the start of the sentence toward the end for the longest word.
    • For 背書包, it finds 背書, leaving 包; result: 背書 / 包
  2. Backward matching: the reverse, scanning from the end of the sentence toward the start for the longest word.
    • For 背書包, it first finds 書包 at the end, leaving 背; result: 背 / 書包

How do we decide which result is better?

After producing both segmentations, the algorithm "scores" them against a few rules and keeps the better one. The most common rules are:

  • Fewest total words: fewer words usually means more long words were matched, so the segmentation is more likely to be sensible.
  • Fewest single-character words: if the totals tie, pick the result with fewer single-character words, since characters standing alone tend to produce more ambiguity.

In the 背書包 example, both results contain 2 words, but forward matching produced one single-character word (包) while backward matching produced none. By the second rule, the backward result 背 / 書包 wins, and that happens to be the correct segmentation! (A sketch of the whole scheme follows at the end of this comment.)

How well does it work?

According to research, bidirectional matching can resolve roughly 90% or more of the segmentation errors made by one-directional maximum matching, making it a very cost-effective improvement.
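
For concreteness, here is a minimal Python sketch of the bidirectional scheme described above. It uses a plain set as the dictionary and a fixed `max_len` for brevity (both are illustrative assumptions; a real implementation would walk the trie instead):

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching; unmatched characters pass through singly."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if size == 1 or text[i:i + size] in dictionary:
                words.append(text[i:i + size])
                i += size
                break
    return words

def segment_bidirectional(sentence, dictionary, max_len=4):
    forward = max_match(sentence, dictionary, max_len)
    # Backward pass: match the reversed sentence against reversed words,
    # then restore the original character order and word order.
    rev_dict = {w[::-1] for w in dictionary}
    backward = [w[::-1] for w in max_match(sentence[::-1], rev_dict, max_len)][::-1]
    # Rule 1: fewer total words wins; rule 2 breaks ties on fewer 1-character words.
    def score(words):
        return (len(words), sum(1 for w in words if len(w) == 1))
    return min(forward, backward, key=score)

print(segment_bidirectional("背書包", {"背書", "書包"}))  # -> ['背', '書包']
```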



Development

Successfully merging this pull request may close these issues.

Automatically segment the "example sentences" in every table's td3?
