Google Docs BlobTL Raw Cleaner Add-On

Introduction

This is a simple Google Docs add-on that will highlight all lines containing non-English text.

This add-on uses a whitelist/blacklist strategy. Each character in a line is converted to its UTF-16 code unit and checked against both lists. If a character is on the blacklist, the entire line is marked as "Raw Text". If a character is on the whitelist, it is protected from the blacklist: the whitelist always overrides the blacklist.
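As a rough illustration of the strategy above, here is a minimal sketch in JavaScript. The function names (inRanges, isRawLine) and the toy lists are made up for this example and are not taken from the add-on's source code.

```javascript
// Hypothetical helper: is this UTF-16 code inside any [low, high] range?
function inRanges(code, ranges) {
  return ranges.some(([low, high]) => code >= low && code <= high);
}

// Hypothetical helper: returns true if the line should be marked as "Raw Text".
function isRawLine(line, blacklist, whitelist) {
  for (let i = 0; i < line.length; i++) {
    const code = line.charCodeAt(i); // UTF-16 code unit at position i
    if (inRanges(code, whitelist)) continue; // the whitelist overrides the blacklist
    if (inRanges(code, blacklist)) return true; // one blacklisted character marks the line
  }
  return false;
}

// Toy lists for illustration only: blacklist hiragana, whitelist ASCII and "づ".
const zu = "づ".charCodeAt(0);
const blacklist = [[0x3040, 0x309f]];
const whitelist = [[0x0000, 0x007f], [zu, zu]];

console.log(isRawLine("Hello, world!", blacklist, whitelist)); // false
console.log(isRawLine("こんにちは", blacklist, whitelist)); // true
console.log(isRawLine("(づ ^_^)づ", blacklist, whitelist)); // false, "づ" is whitelisted
```

Checking the whitelist first is what makes it override the blacklist, even when a character appears on both lists.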

Currently, the blacklist contains the Unicode blocks for hiragana, katakana, and hangul syllables, plus the Unicode block holding roughly 20,000 of the most common ideographs used in Chinese and Japanese. The Unicode blocks containing East Asian punctuation and symbols are NOT on the blacklist by default.
Currently, the whitelist contains the Unicode blocks for ASCII, Latin-1 Supplement, and General Punctuation.
To suggest additions to the whitelist/blacklist, you can contact me (@yuzuki on NUF). Alternatively, you can get the source code (which is open source) and run your own version.
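For reference, the default lists described above could be written down as code unit ranges like this. The block boundaries are the standard Unicode ones. The assumption that the "20,000 most common ideographs" means the CJK Unified Ideographs block (U+4E00 to U+9FFF), as well as the object layout and the findBlock helper, are mine and may not match the actual source code.

```javascript
// Inclusive [first, last] ranges for each Unicode block.
// Layout and names are illustrative assumptions, not the add-on's real data.
const DEFAULT_BLACKLIST = {
  "Hiragana": [0x3040, 0x309f],
  "Katakana": [0x30a0, 0x30ff],
  "Hangul Syllables": [0xac00, 0xd7af],
  "CJK Unified Ideographs": [0x4e00, 0x9fff], // assumed to be the "common ideographs" block
};

const DEFAULT_WHITELIST = {
  "Basic Latin (ASCII)": [0x0000, 0x007f],
  "Latin-1 Supplement": [0x0080, 0x00ff],
  "General Punctuation": [0x2000, 0x206f],
};

// Hypothetical helper: name of the block in `blocks` containing the
// first UTF-16 code unit of `ch`, or null if none matches.
function findBlock(ch, blocks) {
  const code = ch.charCodeAt(0);
  for (const [name, [first, last]] of Object.entries(blocks)) {
    if (code >= first && code <= last) return name;
  }
  return null;
}

console.log(findBlock("あ", DEFAULT_BLACKLIST)); // "Hiragana"
console.log(findBlock("한", DEFAULT_BLACKLIST)); // "Hangul Syllables"
console.log(findBlock("é", DEFAULT_WHITELIST)); // "Latin-1 Supplement"
console.log(findBlock("。", DEFAULT_BLACKLIST)); // null, East Asian punctuation is not blacklisted
```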

Suggestions and bug reports are always very welcome.


Installation:

Go to the BlobTL Raw Cleaner page in the Google Docs add-on store (it's free).

Click the install button and follow the prompts.

The first time I installed it, an empty Google Doc popped up. If the installation worked properly, you should see something like this:



You can close this empty doc.

Usage:

Go to the Google document you are working on.

Go to the Add-ons section and select the BlobTL Raw Cleaner.

I recommend choosing Highlight Raw Lines to check what the script will actually delete.








If you are satisfied with the selection, you can delete the selection with the [Backspace] or [Delete] key on your keyboard.

If you accidentally delete something you did not want to, you can press Ctrl+Z (undo) to restore it.

Ignoring lines:

Suppose you have a line that you don't want the script to delete. For instance, a paragraph might contain a kaomoji or some other unusual character that isn't in the typical Western character set.

You can tell the script to ignore a line by putting two hash signs (##) at the beginning of that line.




Marking lines for the highlighter:

You may occasionally want to mark lines for highlighting manually. For example, some lines in the raw may contain only punctuation, which the blacklist does not catch by default.

You can manually mark a line for highlighting (and deletion) by placing two percent signs (%%) at the start of the line.
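The two line markers (## to ignore a line, %% to force a line to be highlighted) can be pictured as a check that runs before the normal blacklist test. This is only a sketch: the names shouldHighlight and looksRaw are hypothetical, and it assumes the marker must be the very first characters of the line.

```javascript
const IGNORE_MARKER = "##"; // line is never highlighted
const FORCE_MARKER = "%%"; // line is always highlighted

// `detector` is any function (line) => boolean that performs the normal
// whitelist/blacklist test. It is passed in so this sketch stays self-contained.
function shouldHighlight(line, detector) {
  if (line.startsWith(IGNORE_MARKER)) return false;
  if (line.startsWith(FORCE_MARKER)) return true;
  return detector(line);
}

// Stand-in detector for the example: any hiragana or katakana marks the line.
const looksRaw = (line) => /[\u3040-\u30ff]/.test(line);

console.log(shouldHighlight("(づ ^_^)づ Hello!", looksRaw)); // true
console.log(shouldHighlight("## (づ ^_^)づ Hello!", looksRaw)); // false, ignored
console.log(shouldHighlight("%% ......", looksRaw)); // true, forced
console.log(shouldHighlight("......", looksRaw)); // false
```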




Changing the settings:

If you would like to change the behavior of the add-on, you can go to: BlobTL Raw Cleaner > Settings.


Here, you can modify the rules that the highlighter uses to highlight lines.

If the highlighter is not working the way you want, you can go to: BlobTL Raw Cleaner > Highlight Raw Characters to see which characters in the text are triggering the add-on.

For example, in this instance, the づ character in the kaomoji is causing the entire line to be blacklisted. To prevent this line from being blacklisted, there are four possible strategies:
  1. Add a "##" at the beginning of the line so the entire line is ignored.
  2. Add the "づ" character to the custom whitelist.
  3. Lower the blacklist sensitivity from 100%.
  4. Uncheck Hiragana/Katakana from the blacklist settings (if you're not a Japanese translator)

Option (1) is described under "Ignoring lines" above.

Option (2) involves opening the Settings menu and adding the offending character to the Whitelisted Characters text box. Each character that you put into the text box will be added to the whitelist. Any extra spaces and commas are ignored.
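To make the "extra spaces and commas are ignored" rule concrete, here is one way such a text box could be parsed. The function name parseWhitelist is hypothetical, and the real add-on may parse the box differently.

```javascript
// Turns the contents of the text box into a set of whitelisted characters.
// Commas and any whitespace are skipped; every other character is kept.
function parseWhitelist(text) {
  const chars = new Set();
  for (const ch of text) {
    if (ch === "," || /\s/.test(ch)) continue;
    chars.add(ch);
  }
  return chars;
}

const custom = parseWhitelist("づ, ツ ,  ♥");
console.log(custom.size); // 3
console.log(custom.has("づ")); // true
console.log(custom.has(",")); // false
```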

Option (3) involves opening the Settings menu and changing the slider on the Blacklist Sensitivity. 100% sensitivity is the strictest option (default). This means that if there is a single character in a line that is blacklisted, the entire line will be marked for highlighting.

Sometimes, this may be too strict for some people's purposes. Perhaps you would only like a line to be highlighted if at least 50% of its characters are Chinese/Japanese/Korean. If this is the case, you should change the slider to 50% sensitivity.
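One way to read the slider that is consistent with both examples above is: a line is highlighted when it contains at least one blacklisted character and the share of blacklisted characters is at least (100% minus the sensitivity). This formula is my own guess at the behavior, not something taken from the source code. The function name and the choice to leave whitespace out of the count are also assumptions.

```javascript
// `isBlacklisted` is any function (char) => boolean.
// `sensitivityPercent` is the slider value from 0 to 100.
function meetsSensitivity(line, sensitivityPercent, isBlacklisted) {
  // Assumption: whitespace does not count toward the total.
  const chars = Array.from(line).filter((ch) => !/\s/.test(ch));
  const hits = chars.filter(isBlacklisted).length;
  if (hits === 0) return false; // nothing blacklisted, never highlight
  const requiredShare = 1 - sensitivityPercent / 100;
  return hits / chars.length >= requiredShare;
}

// Stand-in blacklist for the example: hiragana and katakana only.
const isKana = (ch) => /[\u3040-\u30ff]/.test(ch);

console.log(meetsSensitivity("Hello (づ ^_^)づ", 100, isKana)); // true, one hit is enough
console.log(meetsSensitivity("Hello (づ ^_^)づ", 50, isKana)); // false, only 2 of 12 characters
console.log(meetsSensitivity("こんにちは world", 50, isKana)); // true, 5 of 10 characters
```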

Please remember to press the save button after modifying any settings. Keep in mind that your User Settings are shared across all of your Google documents. Saving the settings in one document will carry over to all the other documents you use.

Uninstalling:

Go to Add-ons > Manage add-ons...

Then click Manage > Remove





Troubleshooting and bugs:

If you encounter any difficulties, you can report them in any of the following places. Please include the text that you tried. Just ask for @yuzuki (or #matcha) and someone will probably help you find me.
  • Blob Translations - Contact Us Page (https://www.blobtranslations.com/contact-us/)
  • Blob Translations - Discord Group (https://discord.gg/mM6KFz9)
  • Novel Updates Forum - @yuzuki (https://forum.novelupdates.com/members/yuzuki.4673/)

One issue that I anticipate some people might experience (although I have never actually tested this) is a Google Doc containing characters that do not fit in a single UTF-16 code unit. UTF-16 encodes such characters as a pair of code units (a surrogate pair). The reason why this is a problem is that JavaScript strings are indexed by UTF-16 code unit by default, so the add-on would see each half of the pair as a separate character, and it's more hassle than it's worth to handle these characters specially.

This issue would probably only affect you if you use "super rare" characters in your Google Doc. Examples of "super rare" characters include:
  • Fancy emojis like: 😌  😥  🏡
  • Characters from rare or ancient languages that aren't used anymore.
    • This includes ancient Chinese scripts (e.g. CJK Unified Ideographs Extensions B-F)
    • Examples: 𠀧, 𠥹, 𠥤
  • Anything that looks like a box (which means your browser can't even display it): 𠀀

The easy way to tell whether a character falls into this category is to look up its Unicode code point and check the hex value (e.g. ☯ = U+262F). If the hex value has more than 4 digits, the character lies outside the Basic Multilingual Plane and takes two UTF-16 code units (a surrogate pair) instead of one.
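If you would rather check a character in code, the browser console can do the lookup for you. This sketch uses only standard JavaScript string methods. The helper name describeChar is made up for the example.

```javascript
// Reports the Unicode code point of the first character in `ch` and
// how many UTF-16 code units it occupies.
function describeChar(ch) {
  const codePoint = ch.codePointAt(0);
  return {
    codePoint: "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0"),
    utf16Units: String.fromCodePoint(codePoint).length,
    needsSurrogatePair: codePoint > 0xffff,
  };
}

const yinYang = describeChar("☯");
console.log(yinYang.codePoint, yinYang.utf16Units, yinYang.needsSurrogatePair);
// U+262F 1 false

const emoji = describeChar("😌");
console.log(emoji.codePoint, emoji.utf16Units, emoji.needsSurrogatePair);
// U+1F60C 2 true

// charCodeAt only sees the first half of the pair, not the real code point.
console.log("😌".charCodeAt(0).toString(16)); // d83d
```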

Change log:

Version  Description                     Saved by               Date
6        Added sidebar menu              matcha.anko@gmail.com  2018-01-18 15:17
5        Fixed bug with shift+enter      matcha.anko@gmail.com  2018-01-14 14:49
4        Added a line repairing utility  matcha.anko@gmail.com  2018-01-14 01:21
3        Fixed a bug in the selector     matcha.anko@gmail.com  2018-01-13 23:32
2        Added a raw selector            matcha.anko@gmail.com  2018-01-13 22:50
1        Initial functional version      matcha.anko@gmail.com  2018-01-10 13:54

Credits, Source Code, and License:

The source code for this add-on is freely available here.

MIT License. Please credit Blob Translations if you modify or redistribute it.

Special thanks to @Tony for reviewing the add-on and giving comments, and @BlancFrost for testing. Also, this wouldn't exist if @Action didn't request the feature in the first place.

Finally, you can find similar tools at the following places:

