Document Writing with Continuous Integration

When we build software or a Web service, we utilize many static analysis tools such as Lint, Checkstyle, or FindBugs. These static code analysis tools help us write simple code and improve productivity.

Many inspection tools are compatible with CI tools or services such as Travis or Wercker. CI services are useful in that they ensure that users follow the coding standard.

However, no inspection tool for technical writing is widely used, because the existing tools have several deficiencies. Most of them suffer from the following problems.

  • Customization is not provided: A text inspection tool should provide sufficient customization capabilities, since writing standards of different writing groups or individuals vary.
  • Semi-structured text formats are not supported: The production of manuals or large technical documents requires tags in semi-structured text formats, such as headings or tables.
  • They cannot be run by using a simple command: It should be possible to run a text inspection by using a simple command that is easily integrated in other systems or services, such as CI.

RedPen


We have been building a simple text proofreading tool, RedPen. RedPen supports Markdown and Textile as input formats and provides sufficient customization to follow widely varying writing standards. RedPen runs as a simple command and can therefore be used easily with CI tools and services such as Travis.

RedPen usage

We now describe the usage and the configuration and then run RedPen.

Command

RedPen provides a simple command, redpen. The redpen command has the following parameters.

    redpen -c <CONF FILE> <INPUT FILE> [<INPUT FILE>]
        -c,--conf <CONF FILE>                Configuration file
        -f,--format <FORMAT>                 Input format
        -h,--help                            Help message
        -r,--result-format <RESULT FORMAT>   Output format
        -v,--version                         Version information
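
To make the parameters concrete, the following minimal Python sketch assembles a redpen command line from the options listed above. The helper name build_redpen_command is hypothetical and not part of RedPen; it only uses the -c, -f, and -r options shown in the usage text, and it assumes that options are placed before the input files.

```python
from typing import List, Optional


def build_redpen_command(conf_file: str,
                         input_files: List[str],
                         input_format: Optional[str] = None,
                         result_format: Optional[str] = None,
                         executable: str = "redpen") -> List[str]:
    """Assemble an argument list for the redpen command (hypothetical helper)."""
    if not input_files:
        raise ValueError("at least one input file is required")
    # -c is always passed, as in the usage line above.
    command = [executable, "-c", conf_file]
    # -f and -r are added only when the caller specifies them.
    if input_format is not None:
        command += ["-f", input_format]
    if result_format is not None:
        command += ["-r", result_format]
    # Input files come last.
    return command + list(input_files)


if __name__ == "__main__":
    args = build_redpen_command("redpen-conf.xml", ["sampledoc-en.md"],
                                input_format="markdown", result_format="xml")
    print(" ".join(args))
```

The resulting list can be passed to subprocess.run as-is, which avoids shell quoting problems when file names contain spaces.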

Configuration

RedPen supports various functions (validators). RedPen users add the validators that are required according to their writing standards. The language is specified in the lang attribute in redpen-conf. RedPen then loads the default character settings for the specified language.

A sample of RedPen configuration follows. For more details, please see the RedPen Document.

    <redpen-conf lang="en">
        <validators>
            <validator name="SentenceLength">
                 <property name="max_len" value="200"/>
            </validator>
            <validator name="InvalidSymbol"/>
            <validator name="SymbolWithSpace"/>
            <validator name="SectionLength"/>
            <validator name="ParagraphNumber"/>
            <validator name="Spelling"/>
            <validator name="Contraction" />
            <validator name="DoubledWord" />
            <validator name="SuccessiveWord" />
            <validator name="EndOfSentence" />
            <validator name="SpaceBeginningOfSentence" />
        </validators>
    </redpen-conf>
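
Because the configuration is plain XML, it can also be generated from a list of validator names, for example when several documents share most of their settings. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name make_redpen_conf is hypothetical; the element and attribute names are copied from the sample above, and nothing beyond that sample is assumed about the configuration format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def make_redpen_conf(lang: str,
                     validators: Dict[str, Optional[Dict[str, str]]]) -> str:
    """Build a redpen-conf XML string (hypothetical helper).

    `validators` maps a validator name to its properties, or to None
    when the validator takes no properties.
    """
    root = ET.Element("redpen-conf", {"lang": lang})
    validators_elem = ET.SubElement(root, "validators")
    for name, properties in validators.items():
        validator = ET.SubElement(validators_elem, "validator", {"name": name})
        # Each property becomes a <property name="..." value="..."/> child.
        for key, value in (properties or {}).items():
            ET.SubElement(validator, "property", {"name": key, "value": value})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(make_redpen_conf("en", {
        "SentenceLength": {"max_len": "200"},
        "Spelling": None,
        "DoubledWord": None,
    }))
```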

Run RedPen

Now let us run RedPen for an English text.

First, please download the RedPen package, and then, unpack the package using the following command.

    $tar xvf redpen-cli-1.0-assembled.tar.gz

Now, you can run RedPen using the following commands.

    $cd redpen-cli-1.0
    $bin/redpen -c conf/redpen-conf-en.xml sample-doc/en/sampledoc-en.txt
    14:32:37.639 [main] INFO  org.unigram.docvalidator.Main - loading character table file: sample/conf/symbol-conf-en.xml
    14:32:37.652 [main] INFO  o.u.docvalidator.util.CharacterTable - Succeeded to load character table
    14:32:37.654 [main] INFO  o.unigram.docvalidator.parser.Parser - comma is set to ","
    14:32:37.655 [main] INFO  o.unigram.docvalidator.parser.Parser - full stop is set to "."
    14:32:37.663 [main] INFO  o.u.d.v.s.ParagraphStartWithValidator - Using the default value of paragraph_start_with.
    CheckError[sample/doc/txt/en/sampledoc-en.txt: 0] = The length of the line exceeds the maximum 265 in line: ln bibliometrics and link analysis studies many attempts have been made to analyze the 
    relationship among scientific papers, authors, and joumals and recently, these research results have been found to be effective for analyzing the link structure ofweb pages as well.

As can be seen, the redpen command produces a warning (CheckError).
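
When the plain-text output is post-processed, for example to count warnings in a build script, the CheckError lines can be picked out with a regular expression. The sketch below is only an illustration: the line format is inferred from the single CheckError line in the output above and may differ in other RedPen versions, and the names CheckError and parse_check_errors are hypothetical.

```python
import re
from typing import List, NamedTuple


class CheckError(NamedTuple):
    file: str
    line: int
    message: str


# Assumed format, inferred from the sample output:
#   CheckError[<file>: <line number>] = <message>
_CHECK_ERROR = re.compile(
    r"^\s*CheckError\[(?P<file>.+?): (?P<line>\d+)\] = (?P<message>.*)$"
)


def parse_check_errors(output: str) -> List[CheckError]:
    """Return every CheckError found in the given redpen output text."""
    errors = []
    for raw_line in output.splitlines():
        match = _CHECK_ERROR.match(raw_line)
        if match:
            errors.append(CheckError(match["file"],
                                     int(match["line"]),
                                     match["message"]))
    return errors
```

A build script could then treat a non-empty result as a failure, for instance by exiting with a non-zero status.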

Apply RedPen to CI services

Now, we describe the construction of a CI environment for a text project with a CI service (Travis). The settings are simple. We just download the RedPen package and then run the redpen command for the target file. A sample of the Travis settings for RedPen follows.

Sample GitHub project

We wrote a sample GitHub project for running RedPen in a CI environment.

  • sampledoc-en.md: target article (Markdown syntax).
  • redpen-conf.xml: RedPen settings file.
  • .travis.yml: Travis settings file that describes how to deploy and run RedPen to inspect the target article.

Travis setting

The following is .travis.yml stored in the repository.

    language: text
    jdk:
      - oraclejdk8
    env:
      - REDPEN_VERSION="1.0"
    install:
       - wget https://github.com/recruit-tech/redpen/releases/download/v1.0-experimental-1/redpen-cli-$REDPEN_VERSION-assembled.tar.gz
       - tar xvf redpen-cli-$REDPEN_VERSION-assembled.tar.gz
    script:
      - redpen-cli-$REDPEN_VERSION/bin/redpen -c redpen-conf.xml -f markdown -r xml sampledoc-en.md

As can be seen, RedPen is downloaded and run for the target file.

Sample document

The following is a sample article that contains several text mistakes and formatting errors.

# Distributed system
Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources.
In this article, we'll call a computer server that works as a member of a cluster an "instance". for example, each instance in distributed search engines stores the the fractions of data.
Such distriubuted systems need a component to merge the preliminary results from member instnaces.

Initial commit

When we committed the initial version of the article, we received a warning from Travis by mail.

Initial Result

As can be seen, several errors are reported.

Error analysis

In the second line, one sentence exceeds the configured maximum length. The same sentence also lacks a space after the right parenthesis.
In the third line, RedPen reports a contraction error, since contractions should not be used in formal writing. The third line also contains the word “the” twice in succession and a sentence with an invalid ending.

Another error is reported for the second sentence of the third line: the sentence begins with a lowercase letter. In the fourth line, two misspelled words (distriubuted and instnaces) are reported.
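
To give a feel for what two of these checks look for, the following Python sketch flags repeated words and contractions in a line of text. It is a simplified illustration under our own assumptions, not RedPen's implementation; the function names are hypothetical and the contraction pattern covers only a few common English endings.

```python
import re
from typing import List


def find_successive_words(text: str) -> List[str]:
    """Return each word that immediately repeats the previous word."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return [current for previous, current in zip(words, words[1:])
            if previous == current]


def find_contractions(text: str) -> List[str]:
    """Return contractions such as "we'll" or "don't".

    Possessive forms ending in 's are deliberately not matched.
    """
    return re.findall(r"\b[A-Za-z]+'(?:ll|re|ve|d|t|m)\b", text)


if __name__ == "__main__":
    third_line = ("In this article, we'll call a computer server that works "
                  "as a member of a cluster an \"instance\". for example, each "
                  "instance in distributed search engines stores the the "
                  "fractions of data.")
    print(find_successive_words(third_line))
    print(find_contractions(third_line))
```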

Correcting errors

First, we divided the sentence in the second line into two separate sentences:

Some software tools work in more than one machine.
Such distributed (cluster) systems can handle huge data or tasks, because such software tools utilize large amount of computer resources.

Then, we corrected the remaining reported errors, including the misspelled words, end of sentence, contraction, etc. The following is the corrected article.

# Distributed system
Some software tools work in more than one machine. Such distributed (cluster) systems can handle huge data or tasks, because such software tools utilize large amount of computer resources.
In this article, we call a computer server that works as a member of a cluster an "instance." For example, each instance in distributed search engines stores the fractions of data.
Such distributed systems need a component to merge the preliminary results from member instances.

Now let us commit the corrected article to GitHub.

Second Result

Then, we received the second report from Travis. The report stated that the validation had succeeded, as expected.

Future work

In this article, we described the application of RedPen in a CI environment. However, in our opinion there is still room for improvement. In future work, we will focus on the following features:

  • Enhanced server: Currently, RedPen provides a sample server, but it does not offer enough configuration options. We will enhance the server so that it is more configurable.
  • Rich functions: We will provide more linguistic features using part-of-speech or parsing information.
  • Language support: We will cover more languages such as Chinese or French.