When we build software or a Web service, we utilize many static analysis tools such as Lint, Checkstyle, or FindBugs. These static code analysis tools help us write simple code and improve productivity.
Unfortunately, no inspection tool for technical writing is widely used, because the existing tools have deficiencies. Most of them suffer from the following problems.
- Customization is not provided: A text inspection tool should provide sufficient customization capabilities, since writing standards of different writing groups or individuals vary.
- Semi-structured text formats are not supported: The production of manuals or large technical documents requires tags in semi-structured text formats, such as headings or tables.
- They cannot be run by using a simple command: It should be possible to run a text inspection by using a simple command that is easily integrated in other systems or services, such as CI.
We have been building a simple text proofreading tool, RedPen. RedPen supports Markdown and Textile as input formats and provides sufficient customization to follow widely varying writing standards. RedPen runs as a simple command, so it can easily be used with CI tools or in environments such as Travis.
We now describe the usage and the configuration and then run RedPen.
RedPen provides a simple command, redpen. The redpen command has the following parameters.
redpen -c <CONF FILE> <INPUT FILE> [<INPUT FILE>]
 -c,--conf <CONF FILE>                Configuration file
 -f,--format <FORMAT>                 Input format
 -h,--help                            Help message
 -r,--result-format <RESULT FORMAT>   Output format
 -v,--version                         Version information
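When RedPen is driven from a script rather than typed by hand, the command line can be assembled programmatically. The following is a minimal sketch in Python, assuming a `redpen` executable is available on the PATH. The helper names `build_redpen_command` and `run_redpen` are hypothetical and not part of RedPen; only the options listed above are used.

```python
import subprocess
from typing import List, Optional


def build_redpen_command(
    conf_file: str,
    input_files: List[str],
    input_format: Optional[str] = None,
    result_format: Optional[str] = None,
    executable: str = "redpen",  # assumption: redpen is on the PATH
) -> List[str]:
    """Assemble an argument list for the redpen command (hypothetical helper)."""
    if not input_files:
        raise ValueError("at least one input file is required")
    command = [executable, "-c", conf_file]
    if input_format is not None:
        command += ["-f", input_format]   # -f,--format: input format
    if result_format is not None:
        command += ["-r", result_format]  # -r,--result-format: output format
    return command + list(input_files)


def run_redpen(command: List[str]) -> subprocess.CompletedProcess:
    """Run the assembled command and capture its output as text."""
    return subprocess.run(command, capture_output=True, text=True)


command = build_redpen_command(
    "redpen-conf.xml", ["sampledoc-en.md"], input_format="markdown", result_format="xml"
)
print(" ".join(command))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems when file names contain spaces.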
RedPen supports various functions (validators). RedPen users add the validators that are required according to their writing standards. The language is specified in the lang attribute in redpen-conf. RedPen then loads the default character settings for the specified language.
A sample RedPen configuration follows. For more details, please see the RedPen documentation.
<redpen-conf lang="en">
  <validators>
    <validator name="SentenceLength">
      <property name="max_len" value="200"/>
    </validator>
    <validator name="InvalidSymbol"/>
    <validator name="SymbolWithSpace"/>
    <validator name="SectionLength"/>
    <validator name="ParagraphNumber"/>
    <validator name="Spelling"/>
    <validator name="Contraction" />
    <validator name="DoubledWord" />
    <validator name="SuccessiveWord" />
    <validator name="EndOfSentence" />
    <validator name="SpaceBeginningOfSentence" />
  </validators>
</redpen-conf>
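To check which validators a configuration enables before running RedPen, the file can be read with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened copy of the sample configuration. The function name `summarize_conf` is hypothetical, and the sketch only illustrates the structure of the file; it does not reflect how RedPen itself loads its configuration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Tuple

# Shortened copy of the sample configuration, used only for illustration.
SAMPLE_CONF = """
<redpen-conf lang="en">
  <validators>
    <validator name="SentenceLength">
      <property name="max_len" value="200"/>
    </validator>
    <validator name="InvalidSymbol"/>
    <validator name="DoubledWord" />
  </validators>
</redpen-conf>
"""


def summarize_conf(xml_text: str) -> Tuple[Optional[str], Dict[str, Dict[str, str]]]:
    """Return the lang attribute and a mapping of validator name -> properties."""
    root = ET.fromstring(xml_text.strip())
    lang = root.get("lang")
    validators: Dict[str, Dict[str, str]] = {}
    for validator in root.iter("validator"):
        # Each <property> child contributes one name/value pair.
        properties = {
            prop.get("name"): prop.get("value")
            for prop in validator.findall("property")
        }
        validators[validator.get("name")] = properties
    return lang, validators


lang, validators = summarize_conf(SAMPLE_CONF)
print("language:", lang)
for name, properties in validators.items():
    print(name, properties)
```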
Now let us run RedPen for an English text.
First, download the RedPen package, and then unpack it using the following command.
$ tar xvf redpen-cli-1.0-assembled.tar.gz
Now you can run RedPen using the following commands.
$ cd redpen-cli-1.0
$ bin/redpen -c conf/redpen-conf-en.xml sample-doc/en/sampledoc-en.txt
14:32:37.639 [main] INFO org.unigram.docvalidator.Main - loading character table file: sample/conf/symbol-conf-en.xml
14:32:37.652 [main] INFO o.u.docvalidator.util.CharacterTable - Succeeded to load character table
14:32:37.654 [main] INFO o.unigram.docvalidator.parser.Parser - comma is set to ","
14:32:37.655 [main] INFO o.unigram.docvalidator.parser.Parser - full stop is set to "."
14:32:37.663 [main] INFO o.u.d.v.s.ParagraphStartWithValidator - Using the default value of paragraph_start_with.
CheckError[sample/doc/txt/en/sampledoc-en.txt: 0] = The length of the line exceeds the maximum 265 in line: ln bibliometrics and link analysis studies many attempts have been made to analyze the relationship among scientific papers, authors, and joumals and recently, these research results have been found to be effective for analyzing the link structure ofweb pages as well.
As can be seen, the redpen command produces a warning (CheckError).
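If the warnings need to be processed further, for example to count them in a script, the plain-text output can be parsed line by line. The sketch below assumes that every warning follows the `CheckError[<file>: <number>] = <message>` shape seen in the single sample above; this pattern is inferred from that one line and may not hold for other RedPen versions or output formats. The names `Finding` and `parse_check_errors` are hypothetical, and the sample message is abbreviated.

```python
import re
from typing import List, NamedTuple

# Pattern inferred from the sample output; an assumption, not a documented format.
CHECK_ERROR = re.compile(
    r"^CheckError\[(?P<file>.+?):\s*(?P<position>\d+)\]\s*=\s*(?P<message>.*)$"
)


class Finding(NamedTuple):
    file: str
    position: int
    message: str


def parse_check_errors(output: str) -> List[Finding]:
    """Collect the CheckError lines from redpen's text output, skipping log lines."""
    findings = []
    for line in output.splitlines():
        match = CHECK_ERROR.match(line.strip())
        if match:
            findings.append(
                Finding(match["file"], int(match["position"]), match["message"])
            )
    return findings


SAMPLE_OUTPUT = """\
14:32:37.652 [main] INFO o.u.docvalidator.util.CharacterTable - Succeeded to load character table
CheckError[sample/doc/txt/en/sampledoc-en.txt: 0] = The length of the line exceeds the maximum 265 in line: ln bibliometrics and link analysis studies
"""

findings = parse_check_errors(SAMPLE_OUTPUT)
for finding in findings:
    print(finding.file, finding.position, finding.message)
```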
Apply RedPen to CI services
Now, we describe the construction of a CI environment for a text project with a CI service (Travis). The settings are simple. We just download the RedPen package and then run the redpen command for the target file. A sample of the Travis settings for RedPen follows.
Sample GitHub project
We wrote a sample GitHub project for running RedPen in a CI environment.
- sampledoc-en.md: target article (Markdown syntax).
- redpen-conf-ja.xml: RedPen settings file.
- .travis.yml: Travis settings file that describes how to deploy and run RedPen to inspect the target article.
The following is .travis.yml stored in the repository.
language: text
jdk:
  - oraclejdk8
env:
  - REDPEN_VERSION="1.0"
install:
  - wget https://github.com/recruit-tech/redpen/releases/download/v1.0-experimental-1/redpen-cli-$REDPEN_VERSION-assembled.tar.gz
  - tar xvf redpen-cli-$REDPEN_VERSION-assembled.tar.gz
script:
  - redpen-cli-$REDPEN_VERSION/bin/redpen -c redpen-conf.xml -f markdown -r xml sampledoc-en.md
As can be seen, RedPen is downloaded and run for the target file.
The following is a sample article that contains several text mistakes and formatting errors.
# Distributed system
Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources.
In this article, we'll call a computer server that works as a member of a cluster an "instance". for example, each instance in distributed search engines stores the the fractions of data.
Such distriubuted systems need a component to merge the preliminary results from member instnaces.
When we committed the initial version of the article, we received a warning from Travis by mail.
As can be seen, several errors are reported.
The second line contains a sentence that exceeds the specified maximum length.
The same sentence also has a spacing error: no space follows the right parenthesis.
In the third line, RedPen reports a contraction error. Contractions should not be used in formal writing. In addition, the third line contains two more errors: the word “the” is used twice in succession, and the end of the sentence is invalid.
Another error is reported in the second sentence of the third line: the sentence begins with a lowercase letter. In the fourth line, two misspelled words (distriubuted and instnaces) are reported.
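To make the reported errors concrete, the following Python sketch re-implements a few of these checks with regular expressions. It is a deliberately simplified illustration and not RedPen's implementation; the function name `check_line`, the message strings, and the sentence-splitting rule are all assumptions made for this example.

```python
import re
from typing import List

# Simplified patterns; real validators need to handle many more cases.
CONTRACTION = re.compile(r"\b\w+(?:n't|'(?:ll|re|ve|d|m))\b", re.IGNORECASE)
DOUBLED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)
# Split after ., ! or ?, optionally followed by a closing double quote.
SENTENCE_BREAK = re.compile(r'(?<=[.!?])\s+|(?<=[.!?]")\s+')


def check_line(line: str, max_len: int = 200) -> List[str]:
    """Return a list of warning messages for one line of text (toy checks)."""
    messages = []
    if re.search(r"\)\w", line):
        messages.append("missing space after right parenthesis")
    for sentence in SENTENCE_BREAK.split(line.strip()):
        if not sentence:
            continue
        if len(sentence) > max_len:
            messages.append("sentence too long")
        if sentence[0].islower():
            messages.append("sentence starts with a lower-case letter")
        if CONTRACTION.search(sentence):
            messages.append("contraction")
        if DOUBLED_WORD.search(sentence):
            messages.append("doubled word")
    return messages


third_line = (
    'In this article, we\'ll call a computer server that works as a member of '
    'a cluster an "instance". for example, each instance in distributed search '
    'engines stores the the fractions of data.'
)
for message in check_line(third_line):
    print(message)
```

Spelling and end-of-sentence checks are left out because they need a dictionary and language-specific punctuation rules.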
First, we divided the sentence in the second line into two separate sentences:
Some software tools work in more than one machine. Such distributed (cluster) systems can handle huge data or tasks, because such software tools utilize large amount of computer resources.
Then, we corrected the remaining reported errors, including the misspelled words, end of sentence, contraction, etc. The following is the corrected article.
# Distributed system
Some software tools work in more than one machine. Such distributed (cluster) systems can handle huge data or tasks, because such software tools utilize large amount of computer resources.
In this article, we call a computer server that works as a member of a cluster an "instance." For example, each instance in distributed search engines stores the fractions of data.
Such distributed systems need a component to merge the preliminary results from member instances.
Now let us commit the corrected article to GitHub.
Then, we received the second report from Travis. The report stated that the validation had succeeded as expected.
This article described the application of RedPen to CI environments. In our opinion, however, there is room for improvement. In future work, we will focus on the following features:
- Enhanced server: Currently, RedPen provides a sample server, but it does not provide a sufficient number of configurations. We will enhance the server so that it will be more configurable.
- Rich functions: We will provide more linguistic features using part-of-speech or parsing information.
- Language support: We will cover more languages such as Chinese or French.