Released RedPen v1.8

We released RedPen v1.8 this week. This article introduces the new features in this version. You can download the latest version from the release page or install it with Homebrew.

Specify the input sentence

The redpen command now supports the --sentence (-s) option. With this option, users can pass an input sentence directly on the command line without creating an input file. The following sample checks a single input sentence.

$ redpen -s "This is is a pen"
[2017-03-20 22:59:02.487][INFO ] cc.redpen.Main - Configuration file: /usr/local/Cellar/redpen/1.8.0/libexec/conf/redpen-conf-en.xml
[2017-03-20 22:59:02.495][INFO ] cc.redpen.config.ConfigurationLoader - Loading config from specified config file: "/usr/local/Cellar/redpen/1.8.0/libexec/conf/redpen-conf-en.xml"
[2017-03-20 22:59:02.505][INFO ] cc.redpen.config.ConfigurationLoader - Succeeded to load configuration file
[2017-03-20 22:59:02.506][INFO ] cc.redpen.config.ConfigurationLoader - Language is set to "en"
[2017-03-20 22:59:02.506][WARN ] cc.redpen.config.ConfigurationLoader - No variant configuration...
[2017-03-20 22:59:02.507][INFO ] cc.redpen.config.ConfigurationLoader - No "symbols" block found in the configuration
[2017-03-20 22:59:02.511][INFO ] cc.redpen.config.SymbolTable - Default symbol settings are loaded
[2017-03-20 22:59:02.570][INFO ] cc.redpen.parser.SentenceExtractor - "[., ?, !]" are added as a end of sentence characters
[2017-03-20 22:59:02.570][INFO ] cc.redpen.parser.SentenceExtractor - "[', "]" are added as a right quotation characters
[2017-03-20 22:59:02.914][INFO ] org.reflections.Reflections - Reflections took 58 ms to scan 1 urls, producing 5 keys and 53 values
[2017-03-20 22:59:02.969][WARN ] cc.redpen.validator.ValidatorFactory - cc.redpen.validator.section.VoidSectionValidator is deprecated
[2017-03-20 22:59:02.976][WARN ] cc.redpen.validator.ValidatorFactory - cc.redpen.validator.sentence.SpaceBeginningOfSentenceValidator is deprecated
[2017-03-20 22:59:02.987][INFO ] org.reflections.Reflections - Reflections took 3 ms to scan 1 urls, producing 160 keys and 163 values
[2017-03-20 22:59:03.086][INFO ] cc.redpen.util.DictionaryLoader - Succeeded to load UnexpandedAcronymValidator default dictionary.
[2017-03-20 22:59:03.092][INFO ] cc.redpen.util.DictionaryLoader - Succeeded to load weak expressions.
[2017-03-20 22:59:03.099][INFO ] cc.redpen.util.DictionaryLoader - Succeeded to load word frequencies.
[2017-03-20 22:59:03.101][INFO ] cc.redpen.validator.JavaScriptValidator - JavaScript validators directory: js
1: ValidationError[SuccessiveWord], Found word "is" repeated twice in succession. at line: This is is a pen
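
When the output above is consumed by another tool, the error lines have to be separated from the log lines. The following Python sketch shows one way to do that. The pattern is inferred only from the single error line shown above, and the names ERROR_LINE, RedPenError and parse_error_line are made up for this illustration; they are not part of RedPen.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the single sample line above (an assumption);
# other RedPen versions or result formats may print errors differently.
ERROR_LINE = re.compile(
    r"^(?P<line>\d+): ValidationError\[(?P<validator>\w+)\], "
    r"(?P<message>.*) at line: (?P<sentence>.*)$"
)


class RedPenError(NamedTuple):
    line: int
    validator: str
    message: str
    sentence: str


def parse_error_line(text: str) -> Optional[RedPenError]:
    """Return the parsed error, or None for log lines and other output."""
    match = ERROR_LINE.match(text.strip())
    if match is None:
        return None
    return RedPenError(
        line=int(match["line"]),
        validator=match["validator"],
        message=match["message"],
        sentence=match["sentence"],
    )


sample = ('1: ValidationError[SuccessiveWord], Found word "is" repeated '
          'twice in succession. at line: This is is a pen')
print(parse_error_line(sample))
```

Lines that start with a timestamp in brackets do not match the pattern, so the function returns None for them.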

Introduce function deprecation

Users have reported that the names of several functions are misleading. For example, the name VoidSection does not describe what the function does. In this version, we deprecated VoidSection and added EmptySection, which has the same behavior.

Fix bugs

Plan for next version

We will support reStructuredText in the next release and make the language of error messages selectable with a simple command line option.

Continuous Checking Markup Documents with Travis, Asciidoctor, IntelliJ IDEA and RedPen

I introduced the idea of CI integration for document writing in a previous post. This spring, we finally built a continuous checking environment with Travis for the RedPen user’s manual. The following is an image of the RedPen manual.

Screen Shot 2016-07-07 at 16.50.27

The manual is written in AsciiDoc and the source file is maintained in GitHub.

What do we check?

In the CI system, we check two aspects of the document.

  1. Document build
    The RedPen user’s manual is written in AsciiDoc. The AsciiDoc files are converted to HTML with Asciidoctor. In the CI system, we check whether the conversion succeeds.

  2. Document quality
    The quality of the document is checked with RedPen, a linting tool for markup text. We check the document with the following settings. For details of the RedPen configuration, please refer to the RedPen user’s manual.

<redpen-conf lang="en">
    <validators>
        <validator name="ParagraphNumber">
            <property name="max_num" value="5"/>
        </validator>
        <validator name="ParagraphStartWith">
            <property name="start_from" value=""/>
        </validator>
        <validator name="SectionLength">
            <property name="max_num" value="1000"/>
        </validator>
        <validator name="WordFrequency">
            <property name="deviation_factor" value="5.0"/>
            <property name="min_word_count" value="2000"/>
        </validator>
        <validator name="CommaNumber">
            <property name="max_num" value="3"/>
        </validator>
        <validator name="DoubleNegative"/>
        <validator name="EndOfSentence"/>
        <validator name="Hyphenation">
            <property name="list" value=""/>
            <property name="dict" value=""/>
        </validator>
        <validator name="InvalidExpression">
            <property name="list" value=""/>
            <property name="dict" value=""/>
        </validator>
        <validator name="InvalidSymbol"/>
        <validator name="InvalidWord">
            <property name="list" value=""/>
            <property name="dict" value=""/>
        </validator>
        <validator name="NumberFormat">
            <property name="decimal_delimiter_is_comma" value="false"/>
            <property name="ignore_years" value="true"/>
        </validator>
        <validator name="Quotation">
            <property name="use_ascii" value="false"/>
        </validator>
        <validator name="SentenceLength">
            <property name="max_len" value="150"/>
        </validator>
        <validator name="SuccessiveWord"/>
        <validator name="SuggestExpression">
            <property name="dict" value=""/>
        </validator>
        <validator name="SymbolWithSpace"/>
        <validator name="WeakExpression"/>
        <validator name="WordNumber">
            <property name="max_num" value="30"/>
        </validator>
        <validator name="JavaScript">
            <property name="script-path" value="js"/>
        </validator>
    </validators>
</redpen-conf>

As we can see, checks such as sentence length are primitive, but these points are important for readability. I will enhance the tests in the near future.
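
To review which validators a configuration enables, the XML can be read with a few lines of Python. The sketch below uses only the standard library and a shortened copy of the configuration above. The function name list_validators is made up for this example and is not part of RedPen.

```python
import xml.etree.ElementTree as ET


def list_validators(config_xml: str) -> dict:
    """Map each validator name to its {property name: value} dict."""
    root = ET.fromstring(config_xml)
    validators = {}
    for validator in root.iter("validator"):
        props = {p.get("name"): p.get("value")
                 for p in validator.findall("property")}
        validators[validator.get("name")] = props
    return validators


# Shortened copy of the configuration shown above.
CONFIG = """
<redpen-conf lang="en">
    <validators>
        <validator name="SentenceLength">
            <property name="max_len" value="150"/>
        </validator>
        <validator name="CommaNumber">
            <property name="max_num" value="3"/>
        </validator>
        <validator name="SuccessiveWord"/>
    </validators>
</redpen-conf>
"""

for name, props in list_validators(CONFIG).items():
    print(name, props)
```

Validators without properties, such as SuccessiveWord, appear with an empty dictionary.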

Setting CI

For CI testing of the RedPen manual, we use Travis CI, a popular continuous integration service. The following is the Travis CI setting for document checking.

language: ruby

rvm:
- 2.0.0-p598

jdk:
- oraclejdk8

env:
- URL=https://github.com/redpen-cc/redpen/releases/download/redpen-1.6.1

install:
- wget $URL/redpen-1.6.1.tar.gz
- tar xvf redpen-1.6.1.tar.gz
- export PATH=$PATH:$PWD/redpen-distribution-1.6.1/bin
- gem install asciidoctor
- gem install coderay
- gem install --pre asciidoctor-pdf
- sudo apt-get update && sudo apt-get install oracle-java8-installer

script:
- make check # Apply RedPen
- make html # Generate HTML document

The install block installs the tools for building and checking the document. The script block checks the quality of the RedPen document and tries to build the HTML files with Asciidoctor.

As we can see, the checks are run with make commands. The following is the Makefile.

# Makefile for RedPen documentation
#

# You can set these variables from the command line.
BUILDDIR = build

ASCIIDOCTOR = asciidoctor
.PHONY: help clean check html

help:
	@echo "Please use \`make <target>' where <target> is one of"
	@echo " html to make standalone HTML files"

clean:
	-rm -rf $(BUILDDIR)/*

check:
	redpen -f asciidoc source/*.adoc

html:
	mkdir -p $(BUILDDIR)/html
	cp source/*.jpg source/*.png $(BUILDDIR)/html/
	cp -r source/styles/redpen $(BUILDDIR)/html/
	$(ASCIIDOCTOR) -a source-highlighter=coderay -a stylesdir=styles -a target-version=1.6 -d book -b html5 source/index.adoc -D$(BUILDDIR)/html
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html"
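
To try the same two steps locally before pushing, a small runner can call the make targets in order. The following Python sketch is only an illustration and is not part of the RedPen manual repository. The names STEPS and run_steps are made up here, and stopping at the first failing step is a choice of this sketch. The demonstration uses a stand-in runner so that the example runs without make installed.

```python
import subprocess
from typing import Callable, Sequence

# The two make targets come from the script block of the Travis setting.
STEPS = (("make", "check"), ("make", "html"))


def run_steps(steps: Sequence[Sequence[str]] = STEPS,
              runner: Callable[[Sequence[str]], int] = subprocess.call) -> int:
    """Run each step in order; stop at the first non-zero exit code."""
    for step in steps:
        code = runner(list(step))
        if code != 0:
            print("step failed:", " ".join(step))
            return code
    return 0


# Demonstration with a stand-in runner that pretends "make check" fails.
calls = []


def fake_runner(step: Sequence[str]) -> int:
    calls.append(list(step))
    return 1 if step[1] == "check" else 0


print(run_steps(runner=fake_runner))
print(calls)
```

Calling run_steps() without arguments runs the real make commands through subprocess.call.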

Check documents in an editor

I have been writing the document with IntelliJ IDEA, with plugins for document writing. IntelliJ IDEA is a popular Java IDE but also useful for document writing. In addition, to check documents we can use the IntelliJ IDEA plugin for RedPen reusing the configuration file for the CI. For details of the IntelliJ IDEA plugin for RedPen, please see this blog post.

The following image shows the plugin in IntelliJ IDEA.

Screen Shot 2016-07-07 at 15.51.11

Results

We got the green badge from Travis, as shown in the following image.

travis-result
Travis Badge

Summary and Future Work

This article showed how to implement continuous checking of markup documents with CI services and editors. The system checks the build and the quality of the document with Asciidoctor and RedPen. We will enhance the checks by adding more validators.

Annotation for Suppressing Errors Reported from RedPen, a Linting tool for Markup text

RedPen version 1.6 supports error suppression with text annotations. This feature helps us balance quality and productivity in document writing.

Sometimes we do not want to fix errors reported by RedPen, a text linting tool. In most cases, the reason is that the cost of removing the error is high, or the writer has broken the writing standard on purpose at a particular point. For such cases, error suppression by annotation is useful. The annotations are added just before the sections containing the errors.

Currently, error suppression is supported for four formats (AsciiDoc, Markdown, Re:VIEW, and LaTeX). The following section shows a sample text containing the error suppression annotation.

The sample uses AsciiDoc, a popular format that is adopted by GitBook.

Sample: error suppression in AsciiDoc text

For AsciiDoc text, writers add the suppress annotation in an attribute block. The annotation is [suppress]. For example, the following AsciiDoc text suppresses all the errors in the section.

[suppress]
= Instances
Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources, such as CPU, Disk, and Memory.

When we apply RedPen to the AsciiDoc file, we get the following messages.

$ redpen sample.asciidoc
[2016-06-14 16:10:43.850][INFO ] cc.redpen.Main - Configuration file: /usr/local/Cellar/redpen/1.6.1/libexec/conf/redpen-conf-en.xml
[2016-06-14 16:10:43.856][INFO ] cc.redpen.config.ConfigurationLoader - Loading config from specified config file: "/usr/local/Cellar/redpen/1.6.1/libexec/conf/redpen-conf-en.xml"
[2016-06-14 16:10:43.867][INFO ] cc.redpen.config.ConfigurationLoader - Succeeded to load configuration file
[2016-06-14 16:10:43.867][INFO ] cc.redpen.config.ConfigurationLoader - Language is set to "en"
[2016-06-14 16:10:43.867][WARN ] cc.redpen.config.ConfigurationLoader - No variant configuration...
[2016-06-14 16:10:43.868][INFO ] cc.redpen.config.ConfigurationLoader - No "symbols" block found in the configuration
[2016-06-14 16:10:43.872][INFO ] cc.redpen.config.SymbolTable - Default symbol settings are loaded
[2016-06-14 16:10:43.923][INFO ] cc.redpen.parser.SentenceExtractor - "[., ?, !]" are added as a end of sentence characters
[2016-06-14 16:10:43.924][INFO ] cc.redpen.parser.SentenceExtractor - "[', "]" are added as a right quotation characters
[2016-06-14 16:10:44.064][INFO ] org.reflections.Reflections - Reflections took 71 ms to scan 1 urls, producing 4 keys and 46 values
[2016-06-14 16:10:44.231][INFO ] cc.redpen.util.DictionaryLoader - Succeeded to load UnexpandedAcronymValidator default dictionary.
[2016-06-14 16:10:44.237][INFO ] cc.redpen.util.DictionaryLoader - Succeeded to load weak expressions.
[2016-06-14 16:10:44.243][INFO ] cc.redpen.util.DictionaryLoader - Succeeded to load word frequencies.
[2016-06-14 16:10:44.245][INFO ] cc.redpen.validator.JavaScriptValidator - JavaScript validators directory: js

We can see that there are no errors in the output.

When we want to suppress only specific errors, we add the Validator names after suppress. The following example suppresses only two types of errors (Contraction and WeakExpression) in the section.

[suppress='Contraction WeakExpression']
= Instances
Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources, such as CPU, Disk, and Memory.

When we apply RedPen to the AsciiDoc file, we get the following messages.

$ redpen sample2.asciidoc
[2016-06-14 16:13:38.005][INFO ] cc.redpen.Main - Configuration file: /usr/local/Cellar/redpen/1.6.1/libexec/conf/redpen-conf-en.xml
[2016-06-14 16:13:38.010][INFO ] cc.redpen.config.ConfigurationLoader - Loading config from specified config file: "/usr/local/Cellar/redpen/1.6.1/libexec/conf/redpen-conf-en.xml"
[2016-06-14 16:13:38.019][INFO ] cc.redpen.config.ConfigurationLoader - Succeeded to load configuration file
[2016-06-14 16:13:38.019][INFO ] cc.redpen.config.ConfigurationLoader - Language is set to "en"
[2016-06-14 16:13:38.019][WARN ] cc.redpen.config.ConfigurationLoader - No variant configuration...
[2016-06-14 16:13:38.020][INFO ] cc.redpen.config.ConfigurationLoader - No "symbols" block found in the configuration
[2016-06-14 16:13:38.023][INFO ] cc.redpen.config.SymbolTable - Default symbol settings are loaded
[2016-06-14 16:13:38.082][INFO ] cc.redpen.parser.SentenceExtractor - "[., ?, !]" are added as a end of sentence characters
[2016-06-14 16:13:38.083][INFO ] cc.redpen.parser.SentenceExtractor - "[', "]" are added as a right quotation characters
[2016-06-14 16:13:38.200][INFO ] org.reflections.Reflections - Reflections took 63 ms to scan 1 urls, producing 4 keys and 46 values
[2016-06-14 16:13:38.349][INFO ] cc.redpen.util.DictionaryLoader - Succeeded to load UnexpandedAcronymValidator default dictionary.
[2016-06-14 16:13:38.353][INFO ] cc.redpen.util.DictionaryLoader - Succeeded to load weak expressions.
[2016-06-14 16:13:38.361][INFO ] cc.redpen.util.DictionaryLoader - Succeeded to load word frequencies.
[2016-06-14 16:13:38.363][INFO ] cc.redpen.validator.JavaScriptValidator - JavaScript validators directory: js
redpen-suppress-2.asciidoc:3: ValidationError[SentenceLength], The length of the sentence (226) exceeds the maximum of 120. at line: Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources, such as CPU, Disk, and Memory.
redpen-suppress-2.asciidoc:3: ValidationError[CommaNumber], The number of commas (6) exceeds the maximum of 3. at line: Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources, such as CPU, Disk, and Memory.
redpen-suppress-2.asciidoc:3: ValidationError[SymbolWithSpace], Need whitespace after symbol ")". at line: Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources, such as CPU, Disk, and Memory.
[2016-06-14 16:13:38.411][ERROR] cc.redpen.Main - The number of errors "3" is larger than specified (limit is "1").

We can see that only the errors not specified in the annotation are reported.
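
The behavior of the two annotation forms can be summarized in a short Python sketch. This is only a model of the semantics described in this article, not RedPen's implementation. The names SUPPRESS, parse_suppress and filter_errors are hypothetical, and the error messages in the demonstration are placeholders.

```python
import re
from typing import List, Optional, Set, Tuple

# Matches "[suppress]" and "[suppress='Name1 Name2']" as shown in this article.
SUPPRESS = re.compile(r"^\[suppress(?:='(?P<names>[^']*)')?\]$")


def parse_suppress(line: str) -> Optional[Set[str]]:
    """Return None if the line is not a suppress annotation,
    an empty set for a bare [suppress] (suppress everything),
    or the set of validator names listed in the annotation."""
    match = SUPPRESS.match(line.strip())
    if match is None:
        return None
    names = match.group("names")
    return set(names.split()) if names else set()


def filter_errors(annotation: str,
                  errors: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop the (validator, message) pairs that the annotation suppresses."""
    suppressed = parse_suppress(annotation)
    if suppressed is None:
        return errors   # no annotation: report everything
    if not suppressed:
        return []       # bare [suppress]: report nothing
    return [e for e in errors if e[0] not in suppressed]


errors = [
    ("SentenceLength", "placeholder message"),
    ("CommaNumber", "placeholder message"),
    ("WeakExpression", "placeholder message"),
]
print(filter_errors("[suppress]", errors))
print(filter_errors("[suppress='Contraction WeakExpression']", errors))
```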

Summary and Future Work

This article demonstrated error suppression by text annotation. The next release of the RedPen IntelliJ plugin is going to support quick fixes that insert the suppress annotation.

Released RedPen v1.4 (LaTeX support)

We released RedPen v1.4. We hope that you will download it from the following URL and try using it.

https://github.com/redpen-cc/redpen/releases/tag/v1.4.0

The centerpiece of release v1.4 is support for the LaTeX format.

LaTeX support

In this release, we provided experimental support for LaTeX as an input format. Many people have requested LaTeX support, starting from the initial development of RedPen. Since the v0.6 release, we took one year to support LaTeX, and we finally succeeded.

Unfortunately, the LaTeX support is limited in the following ways:

  1. The RedPen LaTeX parser does not work well when macros are used to add your own tags
  2. It does not support a complete check of sentences in lists and tables

Although these are big constraints, we believe that the LaTeX support is still useful for inspecting papers and documents.

Enhancement of functions (Validators)

In v1.4, we also concentrated on enhancing the functions (Validators). The added functions fall into three groups by language support: functions supported in both Japanese and English, functions supported in English only, and functions supported in Japanese only.

Functions supported in both Japanese and English

  • DoubleNegative In both Japanese and English, double negative statements are difficult to understand. If a double negative is present in the text, an error is output.

Functions supported in English only

  • FrequentSentenceStart When writing a document in English, many sentences may start with "We". Even when there is nothing wrong with the content, the repetition looks bad, so it is good to rewrite such sentences promptly. Consider the following example.

We propose a novel method. We demonstrate the effectiveness of the method.

In the above example, "We" is used twice in a row. Without changing the meaning, we can edit the sentences to avoid starting consecutive sentences with the same subject.

We propose a novel method. The effectiveness of the method is demonstrated in the experiments.

  • UnexpandedAcronym This function checks documents for the presence of acronyms and also for the original words that they represent.
  • WordFrequency If the word frequency within the document differs from the usual, an error is output.
  • Hyphenation If hyphen usage is not correct, an error is output.
  • NumberFormat If number formats differ from correct usage in English, an error is output.
  • ParenthesizedSentence This function inspects for usage of parentheses. If there are nested parentheses or more parentheses than specified, an error is output.
  • WeakExpression If the text has an ambiguous English expression, an error is output. For example, words such as “completely” and “huge” should be replaced with more accurate representations.
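
As a rough illustration of the FrequentSentenceStart idea described in the list above, the following Python sketch reports neighbouring sentences that begin with the same word. It is a simplified model for this article and not RedPen's implementation. The function name repeated_sentence_starts and the naive sentence splitter are assumptions of this sketch.

```python
import re
from typing import List, Tuple


def repeated_sentence_starts(text: str) -> List[Tuple[str, str, str]]:
    """Return (word, first_sentence, second_sentence) for neighbouring
    sentences that begin with the same word (case-insensitive)."""
    # Naive splitter: break after '.', '?' or '!' followed by whitespace.
    parts = re.split(r"(?<=[.?!])\s+", text)
    sentences = [s.strip() for s in parts if s.strip()]
    hits = []
    for prev, curr in zip(sentences, sentences[1:]):
        first_prev = prev.split()[0].lower()
        first_curr = curr.split()[0].lower()
        if first_prev == first_curr:
            hits.append((first_curr, prev, curr))
    return hits


before = ("We propose a novel method. "
          "We demonstrate the effectiveness of the method.")
after = ("We propose a novel method. "
         "The effectiveness of the method is demonstrated in the experiments.")

print(repeated_sentence_starts(before))
print(repeated_sentence_starts(after))
```

The first text yields one hit for the repeated start "we", while the edited text yields none.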

Functions supported in Japanese only

  • Okurigana If Japanese okurigana word endings are used incorrectly, an error is output.
  • DoubledJoshi If a particle is used more than once in a sentence, the sentence might be difficult to read.

Outlook for version 1.5

We will continue development toward RedPen v1.5 and beyond. We have not yet set the priorities, but v1.5 is expected to include a mechanism for easily testing functions written in JavaScript.

RedPen v.1.3 Released

We released RedPen version 1.3 a few days ago. RedPen 1.3 is available from the releases page. If you are using a Mac, you can install the latest version via Homebrew ($ brew install redpen).

This article introduces the three features added in RedPen version 1.3 (AsciiDoc support, server commands, and JavaScript-based extensions).

Support for AsciiDoc format

Ever since RedPen was released, I have received requests asking for AsciiDoc support.
AsciiDoc is a markup language just like Markdown, LaTeX, and others. I was satisfied with Markdown at the time of RedPen’s initial release, so I did not fully understand why there was so much demand for a markup language with similar syntax to Markdown. Below, I will touch on the benefits of AsciiDoc.

Are there shortcomings in Markdown?

Recently, I worked on a job where I was writing a long and formal document. At first, I tried to write the document in Markdown, but it just did not go as I had intended. It was then that I started pulling my hair out over the fact that the Markdown format’s expressive capacity was lacking. For example, links to clauses and segmenting of source files are necessary when writing documents with a certain degree of content, yet such features are not (formally) supported in Markdown.

The benefits of AsciiDoc

While AsciiDoc has a light syntax similar to Markdown, it is rich in features. I feel that it has all the features needed to write formal and content-rich documents. A tool called Asciidoctor also makes it easy to convert documents to PDF and to change document styles. If you need to write formal documents, it might be worth your while to give AsciiDoc a try.

Server Inclusion

The RedPen server was introduced in version 1.0 and has been extended from time to time since then. In version 1.2, the REST API was strengthened, providing functionality closer to that of the command line.

The advantage of using RedPen via the REST API is its fast response. Each time the redpen command is executed, the program takes time to start up. A constantly running server, on the other hand, can return the results of an inspection immediately.

Along with requests for feature improvements, I received requests to be able to shut down the server from the command line. Until now, a separate file called redpen.war had to be downloaded, but in version 1.3, all of the server files (including redpen.war) are included in the package (redpen-1.3.tar.gz).

I also provided a simple redpen-server command in RedPen version 1.3. To start the server, just run the following command:

$redpen-server start

Similarly, shutting down the server can be done simply as follows:

$redpen-server stop

Unfortunately, the command is not supported on Windows. I plan to add Windows support in the next version.

Extensions up to 1.2

One of RedPen’s distinguishing features is its extensibility. RedPen does all of the cumbersome processing automatically, such as extracting sentences and clauses from markup languages. As such, when a user wants to extend features on their own, all they have to do is write the processing to be done on a sentence. Plug-in features are also supported, so users do not have to build RedPen in its entirety.

However, what I discovered during development was that the current plug-in mechanism leaves something to be desired. Creating a Java-based plug-in requires a certain degree of familiarity with Java. A plug-in also needs to be compiled, so it cannot be tried out easily.

Support for JavaScript extensions

Therefore, JavaScript-based extensions are supported starting from version 1.3. With extensions written in JavaScript, users do not need to compile anything. Just put the JavaScript file for the extension(s) in $REDPEN_HOME/js. I plan to introduce more detailed instructions separately.

REST API for Document Validation

RedPen is an open source command line tool for proofreading documents, but RedPen also provides a server. This article introduces the functions of the RedPen server.

The RedPen server provides not only a Web UI but also a REST API, which enables users to check their documents without installing RedPen on their computers.

One of the features of the RedPen server API is its configurability. Users can validate their documents according to their own configuration settings. In addition, the server can be deployed on Heroku with a few clicks.

RedPen Server Web UI

A RedPen server is available at the following URL:

http://redpen.herokuapp.com/

The image below shows the current RedPen Web UI page.

redpen-ui

When a user visits the RedPen server at the URL above, the top left box is automatically preloaded with sample text that contains many mistakes. RedPen shows the errors in the top left box as red bars. The bottom left window provides detailed error information.

When users paste their documents in the box, any validation errors are displayed in the bottom left box.

We can configure the settings in the right box. Specifically, we can configure validation items and character (symbol) settings. For detailed configuration information, please see the RedPen configuration document page.

RedPen Server API

The RedPen server provides a REST API that enables users to apply RedPen validation without installing RedPen.

Currently the RedPen server API provides three types of validation.

  • /rest/config/redpens

This function returns validation errors using preconfigured redpens.

  • /document/validate

This function validates a document with the user’s configuration and then returns the errors.

  • /document/validate/json

This function is similar to the /document/validate function, but the configuration is written in JSON format.

Sample: REST API (/document/validate)

The /document/validate function has several parameters.

  • document contains the text of the document RedPen should validate
  • documentParser specifies the input document format. Valid options are PLAIN, MARKDOWN, and WIKI.
  • lang specifies the language used to tokenize the document. Currently, values of ja (Japanese) and en (English/Whitespace) are supported.
  • format determines the format for the results. This can be either json (the default), json2, plain, plain2 or xml.
  • config contains the contents of a RedPen XML configuration file.

Now let us try the REST function with both configuration and text. The following is a sample RedPen configuration file (redpen-conf-en.xml) bundled with the RedPen package.

<redpen-conf lang="en">
  <validators>
    <validator name="SentenceLength">
      <property name="max_len" value="100"/>
    </validator>
    <validator name="InvalidSymbol"/>
    <validator name="SymbolWithSpace"/>
    <validator name="SectionLength">
      <property name="max_char_num" value="2000"/>
    </validator>
    <validator name="ParagraphNumber"/>
    <validator name="Spelling"/>
    <validator name="Contraction" />
    <validator name="DoubledWord" />
    <validator name="SuccessiveWord" />
    <validator name="EndOfSentence" />
    <validator name="SpaceBeginningOfSentence" />
  </validators>
</redpen-conf>

Next we validate a short input sentence with the RedPen server. The following command sends the document and configuration.

curl --data document="Twas brillig and the slithy toves did gyre and gimble in the wabe" \
  --data lang=en --data format=PLAIN2 \
  --data config="`cat redpen-conf-en.xml`" \
  redpen.herokuapp.com/rest/document/validate/
  Line: 1, Offset: 0
    Sentence: Twas brillig and the slithy toves did gyre and gimble in the wabe
      Spelling: Found possibly misspelled word "brillig".
      Spelling: Found possibly misspelled word "slithy".
      Spelling: Found possibly misspelled word "toves".
      Spelling: Found possibly misspelled word "gyre".
      Spelling: Found possibly misspelled word "gimble".
      "and".querying the input

As we see, the RedPen server returns several errors in the input sentence.
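
The same request can also be prepared from a program. The following Python sketch builds a form-encoded POST with the parameters used in the curl command above, using only the standard library. The function name build_validate_request is made up for this example, and the URL is taken from this article, so the public server may not be reachable. The sketch only builds the request and does not send it.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request

# URL taken from the curl example above (an assumption that it is still served).
VALIDATE_URL = "http://redpen.herokuapp.com/rest/document/validate/"


def build_validate_request(document: str, config_xml: str,
                           lang: str = "en", result_format: str = "PLAIN2",
                           url: str = VALIDATE_URL) -> Request:
    """Build, but do not send, a form-encoded POST for /document/validate."""
    fields = {
        "document": document,
        "lang": lang,
        "format": result_format,
        "config": config_xml,
    }
    body = urlencode(fields).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")


request = build_validate_request(
    "Twas brillig and the slithy toves did gyre and gimble in the wabe",
    '<redpen-conf lang="en"><validators><validator name="Spelling"/></validators></redpen-conf>',
)
# Decode the body again to inspect the fields that would be sent.
print(parse_qs(request.data.decode("utf-8")))
```

Passing the returned object to urllib.request.urlopen would send the request to the server.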

Deploying your RedPen Server with the Heroku Button

In the previous sample, I was using a server already deployed in Heroku, but this server is not powerful enough if many users send their validation requests.

If you need short response times, you can of course deploy the RedPen server in your own environment, but this can be tiresome.

For users that would like to deploy their own server easily, RedPen provides a Heroku Button. Users can deploy the RedPen server with just a few clicks.

The Heroku Button can be found in the README of the RedPen source.

heroku-button

When we click the button, the following page is shown.

heroku-deploy

When the user clicks the Deploy for free button, the RedPen server is deployed in a few minutes.

Document Writing with Continuous Integration

When we build software or a Web service, we utilize many static analysis tools such as Lint, Checkstyle, or FindBugs. These static code analysis tools help us write simple code and improve productivity.

Many inspection tools are compatible with CI tools or services such as Travis or Wercker. CI services are useful in that they ensure that users follow the coding standard.

Unfortunately, there are no widely used inspection tools for technical writing, because the existing tools have several deficiencies. Most of them suffer from the following problems.

  • Customization is not provided: A text inspection tool should provide sufficient customization capabilities, since writing standards of different writing groups or individuals vary.
  • Semi-structured text formats are not supported: The production of manuals or large technical documents requires tags in semi-structured text formats, such as headings or tables.
  • They cannot be run by using a simple command: It should be possible to run a text inspection by using a simple command that is easily integrated in other systems or services, such as CI.

RedPen

logo-004

We have been building a simple text proofreading tool, RedPen. RedPen supports Markdown and Textile as input formats and provides sufficient customization to follow widely varying writing standards. RedPen can be run with a simple command, and can therefore easily be integrated with CI tools and services such as Travis.

RedPen usage

We now describe the usage and the configuration and then run RedPen.

Command

RedPen provides a simple command, redpen. The redpen command has the following parameters.

    redpen -c <CONF FILE> <INPUT FILE> [<INPUT FILE>]
        -c,--conf <CONF FILE>                Configuration file
        -f,--format <FORMAT>                 Input format
        -h,--help                            Help message
        -r,--result-format <RESULT FORMAT>   Output format
        -v,--version                         Version information
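
When redpen is called from a script, the argument list can be assembled from the options listed above. The following Python sketch uses only the long option names shown in the usage text. The function redpen_command is a hypothetical helper for this article and is not shipped with RedPen.

```python
from typing import List, Optional, Sequence


def redpen_command(inputs: Sequence[str], conf: Optional[str] = None,
                   input_format: Optional[str] = None,
                   result_format: Optional[str] = None) -> List[str]:
    """Assemble an argument list using the options from the usage text."""
    if not inputs:
        raise ValueError("at least one input file is required")
    args = ["redpen"]
    if conf is not None:
        args += ["--conf", conf]
    if input_format is not None:
        args += ["--format", input_format]
    if result_format is not None:
        args += ["--result-format", result_format]
    return args + list(inputs)


print(redpen_command(["sample-doc/en/sampledoc-en.txt"],
                     conf="conf/redpen-conf-en.xml"))
```

The resulting list can be passed to subprocess.run on a machine where the redpen command is installed.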

Configuration

RedPen supports various functions (validators). RedPen users add the validators that are required according to their writing standards. The language is specified in the lang attribute in redpen-conf. RedPen then loads the default character settings for the specified language.

A sample of RedPen configuration follows. For more details, please see the RedPen Document.

    <redpen-conf lang="en">
        <validators>
            <validator name="SentenceLength">
                <property name="max_len" value="200"/>
            </validator>
            <validator name="InvalidSymbol"/>
            <validator name="SymbolWithSpace"/>
            <validator name="SectionLength"/>
            <validator name="ParagraphNumber"/>
            <validator name="Spelling"/>
            <validator name="Contraction" />
            <validator name="DoubledWord" />
            <validator name="SuccessiveWord" />
            <validator name="EndOfSentence" />
            <validator name="SpaceBeginningOfSentence" />
        </validators>
    </redpen-conf>
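
If a team maintains several writing standards, a configuration of this shape can also be generated instead of edited by hand. The following is a minimal sketch that uses only the Python standard library. The function name `make_redpen_conf` is an assumption for illustration, and the element and attribute names are copied from the sample above rather than from a schema.

```python
import xml.etree.ElementTree as ET
from typing import Mapping, Optional


def make_redpen_conf(
    lang: str,
    validators: Mapping[str, Optional[Mapping[str, str]]],
) -> str:
    """Return a redpen-conf XML string (hypothetical helper).

    `validators` maps a validator name to its properties, or to None
    when the validator takes no properties.
    """
    root = ET.Element("redpen-conf", {"lang": lang})
    validators_el = ET.SubElement(root, "validators")
    for name, properties in validators.items():
        validator_el = ET.SubElement(validators_el, "validator", {"name": name})
        for key, value in (properties or {}).items():
            ET.SubElement(validator_el, "property", {"name": key, "value": value})
    return ET.tostring(root, encoding="unicode")
```

For example, `make_redpen_conf("en", {"SentenceLength": {"max_len": "200"}, "DoubledWord": None})` yields a document with two validators, the first of which carries the `max_len` property.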

Run RedPen

Now let us run RedPen for an English text.

First, download the RedPen package and unpack it using the following command.

    $ tar xvf redpen-cli-1.0-assembled.tar.gz

Now, you can run RedPen using the following commands.

    $ cd redpen-cli-1.0
    $ bin/redpen -c conf/redpen-conf-en.xml sample-doc/en/sampledoc-en.txt
    14:32:37.639 [main] INFO  org.unigram.docvalidator.Main - loading character table file: sample/conf/symbol-conf-en.xml
    14:32:37.652 [main] INFO  o.u.docvalidator.util.CharacterTable - Succeeded to load character table
    14:32:37.654 [main] INFO  o.unigram.docvalidator.parser.Parser - comma is set to ","
    14:32:37.655 [main] INFO  o.unigram.docvalidator.parser.Parser - full stop is set to "."
    14:32:37.663 [main] INFO  o.u.d.v.s.ParagraphStartWithValidator - Using the default value of paragraph_start_with.
    CheckError[sample/doc/txt/en/sampledoc-en.txt: 0] = The length of the line exceeds the maximum 265 in line: ln bibliometrics and link analysis studies many attempts have been made to analyze the 
    relationship among scientific papers, authors, and joumals and recently, these research results have been found to be effective for analyzing the link structure ofweb pages as well.

As can be seen, the redpen command produces a warning (CheckError).
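
To process such warnings in a script, the CheckError line can be split into its parts. The following Python sketch infers the line format from the single example above, so the pattern is an assumption and not a documented output format. In particular, the number in brackets is assumed to be a position in the file, and the names `CheckError` and `parse_check_error` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one CheckError line shown in the sample run.
CHECK_ERROR = re.compile(
    r"^\s*CheckError\[(?P<path>.+?):\s*(?P<position>\d+)\]\s*=\s*(?P<message>.*)$"
)


class CheckError(NamedTuple):
    path: str
    position: int
    message: str


def parse_check_error(line: str) -> Optional[CheckError]:
    """Parse one output line; return None for log lines and other text."""
    match = CHECK_ERROR.match(line)
    if match is None:
        return None
    return CheckError(
        path=match.group("path"),
        position=int(match.group("position")),
        message=match.group("message").strip(),
    )
```

Lines that do not start with `CheckError[`, such as the INFO log lines, return `None` and can simply be skipped.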

Apply RedPen to CI services

Now, we describe how to set up a CI environment for a text project with a CI service (Travis). The settings are simple: we download the RedPen package and then run the redpen command on the target file. A sample of the Travis settings for RedPen follows.

Sample GitHub project

We wrote a sample GitHub project for running RedPen in a CI environment.

  • sampledoc-en.md: target article (Markdown syntax).
  • redpen-conf-ja.xml: RedPen settings file.
  • .travis.yml: Travis settings file that describes how to deploy and run RedPen to inspect the target article.

Travis setting

The following is .travis.yml stored in the repository.

    language: text
    jdk:
      - oraclejdk8
    env:
      - REDPEN_VERSION="1.0"
    install:
      - wget https://github.com/recruit-tech/redpen/releases/download/v1.0-experimental-1/redpen-cli-$REDPEN_VERSION-assembled.tar.gz
      - tar xvf redpen-cli-$REDPEN_VERSION-assembled.tar.gz
    script:
      - redpen-cli-$REDPEN_VERSION/bin/redpen -c redpen-conf.xml -f markdown -r xml sampledoc-en.md

As can be seen, RedPen is downloaded and run for the target file.
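
Travis marks a build as failed when a command in the script step exits with a non-zero status. The article does not state which exit status the redpen command returns when it reports errors, so the following Python sketch makes the decision explicit in a wrapper. It assumes the plain-text result format, in which each error is printed on a line starting with `CheckError[`, and all function names are hypothetical.

```python
import subprocess
from typing import Sequence


def count_check_errors(output: str) -> int:
    """Count the lines that look like RedPen CheckError reports."""
    return sum(
        1 for line in output.splitlines()
        if line.lstrip().startswith("CheckError[")
    )


def exit_status(output: str, max_errors: int = 0) -> int:
    """Return 1 (fail the build) when more than max_errors errors are reported."""
    return 1 if count_check_errors(output) > max_errors else 0


def run_and_gate(command: Sequence[str], max_errors: int = 0) -> int:
    """Run the given command, echo its output, and compute the exit status."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # log lines and errors in one stream
        text=True,
    )
    print(completed.stdout, end="")
    return exit_status(completed.stdout, max_errors)
```

A wrapper script would pass the result of `run_and_gate` to `sys.exit`. The `max_errors` parameter allows a project with many existing warnings to adopt the check gradually.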

Sample document

The following is a sample article that contains several text mistakes and formatting errors.

# Distributed system
Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources.
In this article, we'll call a computer server that works as a member of a cluster an "instance". for example, each instance in distributed search engines stores the the fractions of data.
Such distriubuted systems need a component to merge the preliminary results from member instnaces.

Initial commit

When we committed the initial version of the article, we received a warning from Travis by mail.

Initial Result

As can be seen, several errors are reported.

Error analysis

In the second line, one sentence is longer than the specified maximum length.
The sentence also contains a spacing error: there is no space after the right parenthesis.
In the third line, RedPen reports a contraction error, since contractions should not be used in formal writing. In addition, the third line contains two more errors: the word “the” is used twice in succession, and the end of a sentence is invalid.

Another error is reported for the second sentence of the third line: it begins with a lowercase letter. In the fourth line, two misspelled words (distriubuted and instnaces) are reported.
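
To make these error types concrete, the following Python sketch approximates four of the checks with regular expressions. It is a toy illustration under simplifying assumptions and not how RedPen's validators are implemented. All function names are hypothetical, and the contraction pattern also flags possessives such as "user's".

```python
import re
from typing import List


def find_doubled_words(text: str) -> List[str]:
    """Words that appear twice in succession, such as 'the the'."""
    return [m.group(1)
            for m in re.finditer(r"\b(\w+)\s+\1\b", text, flags=re.IGNORECASE)]


def find_contractions(text: str) -> List[str]:
    """Common English contractions such as we'll or don't."""
    return re.findall(r"\b\w+'(?:ll|re|ve|s|d|m|t)\b", text)


def find_missing_space_after_paren(text: str) -> List[str]:
    """A right parenthesis followed directly by a word character."""
    return re.findall(r"\)\w", text)


def find_lowercase_sentence_starts(text: str) -> List[str]:
    """Words that follow a sentence end but begin with a lowercase letter."""
    return re.findall(r'[.?!]"?\s+([a-z]\w*)', text)
```

Applied to the third line of the sample article, these functions report the doubled word "the", the contraction "we'll", and the lowercase sentence start "for". Applied to the second line, the parenthesis check reports the missing space in "(cluster)systems".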

Correcting errors

First, we divided the sentence in the second line into two separate sentences:

Some software tools work in more than one machine.
Such distributed (cluster) systems can handle huge data or tasks, because such software tools utilize large amount of computer resources.

Then, we corrected the remaining reported errors: the misspelled words, the invalid end of sentence, the contraction, and so on. The following is the corrected article.

# Distributed system
Some software tools work in more than one machine. Such distributed (cluster) systems can handle huge data or tasks, because such software tools utilize large amount of computer resources.
In this article, we call a computer server that works as a member of a cluster an "instance." For example, each instance in distributed search engines stores the fractions of data.
Such distributed systems need a component to merge the preliminary results from member instances.

Next, we committed the corrected article to GitHub.

Second Result

We then received the second report from Travis. The report stated that the validation had succeeded, as expected.

Future work

In this article, we described how to apply RedPen in a CI environment. However, there is still room for improvement. In future work, we will focus on the following features:

  • Enhanced server: Currently, RedPen provides a sample server, but it offers too few configuration options. We will enhance the server so that it is more configurable.
  • Rich functions: We will provide more linguistic features using part-of-speech or parsing information.
  • Language support: We will cover more languages, such as Chinese or French.
Document Writing with Continuous Integration