Released RedPen v1.4 (LaTeX support)

RedPen v1.4 has been released. Please download it from the following URL and give it a try.

https://github.com/redpen-cc/redpen/releases/tag/v1.4.0

The centerpiece of release v1.4 is support for the LaTeX format.

LaTeX support

In this release, we provide experimental support for LaTeX as an input format. Many people have requested LaTeX support since the initial development of RedPen. It took a year from the v0.6 release, but we finally succeeded.

Unfortunately, the LaTeX support is limited in the following ways:

  1. The RedPen LaTeX parser does not work well when macros are used to add your own tags.
  2. Sentences in lists and tables are not fully checked.

Although these are significant constraints, we believe that the LaTeX support is already useful for inspecting papers and other documents.

Enhancement of functions (Validators)

In v1.4, we also concentrated on enhancing the functions (Validators). The added functions fall into three groups by language support: Japanese and English, English only, and Japanese only.

Functions supported in both Japanese and English

  • DoubleNegative In both Japanese and English, double negative statements are difficult to understand. If a double negative is present in the text, an error is output.

Functions supported in English only

  • FrequentSentenceStart In English documents, many sentences tend to start with the same word, such as We. Even when there is no problem with the content, the repetition reads poorly, so it is worth rewriting such sentences. Consider the following example.


We propose a novel method. We demonstrate the effectiveness of the method.

In the above example, We starts two sentences in a row. Without changing the meaning, we can rewrite the second sentence to avoid repeating the same subject.


We propose a novel method. The effectiveness of the method is demonstrated in the experiments.

  • UnexpandedAcronym This function checks whether each acronym in the document is accompanied by the original words that it represents.
  • WordFrequency If the word frequency within the document differs from the usual, an error is output.
  • Hyphenation If hyphen usage is not correct, an error is output.
  • NumberFormat If number formats differ from correct usage in English, an error is output.
  • ParenthesizedSentence This function inspects for usage of parentheses. If there are nested parentheses or more parentheses than specified, an error is output.
  • WeakExpression If the text has an ambiguous English expression, an error is output. For example, words such as “completely” and “huge” should be replaced with more accurate representations.
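
The idea behind FrequentSentenceStart can be sketched in a few lines of Java. This is only an illustration and not RedPen's implementation: the class name SentenceStartSketch, the method repeatedStarts, and the naive sentence splitting after ., ! and ? are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceStartSketch {
    // Returns the first words that open two or more sentences in a row.
    // Sentence splitting is deliberately naive (hypothetical helper, not a RedPen API).
    public static List<String> repeatedStarts(String text) {
        List<String> repeated = new ArrayList<>();
        String previous = null;
        // Split after sentence-ending punctuation that is followed by whitespace.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String first = trimmed.split("\\s+")[0];
            if (first.equals(previous) && !repeated.contains(first)) {
                repeated.add(first);
            }
            previous = first;
        }
        return repeated;
    }

    public static void main(String[] args) {
        // The two example texts from this section.
        System.out.println(repeatedStarts(
            "We propose a novel method. We demonstrate the effectiveness of the method."));
        System.out.println(repeatedStarts(
            "We propose a novel method. The effectiveness of the method is demonstrated in the experiments."));
    }
}
```

For the first example text the sketch reports We, and for the rewritten text it reports nothing.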

Functions supported in Japanese only

  • Okurigana If Japanese okurigana word endings are used incorrectly, an error is output.
  • DoubledJoshi If a particle (joshi) is used more than once in a sentence, the sentence might be difficult to read.

Prospect of version 1.5

We will continue development toward RedPen v1.5 and beyond. We have not yet set the priorities, but v1.5 is expected to include a mechanism for easily testing functions written in JavaScript.


Localizing Validators

If you are familiar with more than one natural language, it’s good to consider localizing your custom validators in RedPen.

To localize validation error messages, you need to avoid hard-coding them in your Validator implementation. To achieve that, use addLocalizedError(Sentence sentenceWithError, Object... args) instead of addError(String message, Sentence sentenceWithError) as follows:
NumberOfCharactersLocalizedValidator

package cc.redpen.validator.sentence;

import cc.redpen.model.Sentence;
import cc.redpen.validator.Validator;

public class NumberOfCharactersLocalizedValidator extends Validator {
    private final int MIN_LENGTH = 100;
    private final int MAX_LENGTH = 1000;
    @Override
    public void validate(Sentence sentence) {
        if (sentence.getContent().length() < MIN_LENGTH) {
            // actual error message is in NumberOfCharactersLocalizedValidator.properties
            addLocalizedError(sentence, MIN_LENGTH);
        }
        if (sentence.getContent().length() > MAX_LENGTH) {
            // You can specify a message key when you have multiple error message variations
            addLocalizedError("toolong", sentence, MAX_LENGTH);
        }
    }
}

Then you can place error messages in YourValidatorName[_LANGUAGE].properties:
NumberOfCharactersLocalizedValidator.properties (default messages)

NumberOfCharactersLocalizedValidator=Sentence is shorter than {0} characters long.
NumberOfCharactersLocalizedValidator.toolong=Sentence is longer than {0} characters long.

NumberOfCharactersLocalizedValidator_ja.properties (Japanese messages)

NumberOfCharactersLocalizedValidator=文が{0}文字より短いです。
NumberOfCharactersLocalizedValidator.toolong=文が{0}文字より長いです。

Error messages are rendered by java.text.MessageFormat, so you can put placeholders such as {0} in the error message text.
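
To see how such placeholders are filled in, here is a minimal, self-contained sketch that uses java.text.MessageFormat directly. The class MessageFormatSketch and its render method are hypothetical names for this illustration, and the sketch does not show how RedPen itself loads the .properties files.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class MessageFormatSketch {
    // Fills the {0}, {1}, ... placeholders of a message pattern (hypothetical helper).
    public static String render(String pattern, Locale locale, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        // Same patterns as in the .properties files above.
        System.out.println(render("Sentence is shorter than {0} characters long.", Locale.ENGLISH, 100));
        System.out.println(render("文が{0}文字より短いです。", Locale.JAPANESE, 100));
        // Numbers are formatted for the locale, so 1000 is printed with a grouping separator.
        System.out.println(render("Sentence is longer than {0} characters long.", Locale.ENGLISH, 1000));
    }
}
```

Note that MessageFormat applies locale-aware number formatting to numeric arguments, so with an English locale the value 1000 is rendered as 1,000. If that is not wanted, one option is to pass the number as a string.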

With the above implementation, you’ll get a Japanese validation error message in a Japanese environment:

$ ./bin/redpen -c conf/redpen-conf-en.xml ~/mytext.txt
mytext.txt:1: ValidationError[NumberOfCharacters], 文が100文字より短いです。 at line: Short paragraph.

and an English validation error message in an English environment:

$ ./bin/redpen -c conf/redpen-conf-en.xml ~/mytext.txt
mytext.txt:1: ValidationError[NumberOfCharacters], Sentence is shorter than 100 characters long. at line: Short paragraph.

Writing RedPen extension with JavaScript

Extending RedPen in JavaScript

We introduced the basics of writing custom validators in the last post. For those who are unfamiliar with Java, RedPen v1.3 introduced JavaScriptValidator. JavaScriptValidator is a special validator that loads Validator implementations written in JavaScript.

Enabling JavaScriptValidator

To enable JavaScriptValidator, simply add <validator name="JavaScript" /> to your redpen-conf.xml as follows:

<redpen-conf lang="en">
  <validators>
       ...snip...
    <validator name="JavaScript" />
  </validators>
</redpen-conf>

Write your own validator in JavaScript

Here is a JavaScript version of NumberOfCharacterValidator:

var MIN_LENGTH = 100;
var MAX_LENGTH = 1000;

function validateSentence(sentence) {
  if (sentence.getContent().length() < MIN_LENGTH) {
    addError("Sentence is shorter than "
      + MIN_LENGTH + " characters long.", sentence);
  }
  if (sentence.getContent().length() > MAX_LENGTH) {
    addError("Sentence is longer than " + MAX_LENGTH
      + " characters long.", sentence);
  }
}

The code looks much like the Java version. However, due to the difference in the type system, the callback method “validate(Sentence sentence)” is named “validateSentence(sentence)” in the JavaScript version.

Run

Because it is JavaScript, there is no need to compile or package your validator. (In fact, your JavaScript code is compiled into Java bytecode by Nashorn, and it runs faster than you might expect.) JavaScriptValidator picks up any *.js file located in the $REDPEN_HOME/js directory.
You can simply run the redpen command to have your file validated by the JavaScript validator.

$ ./bin/redpen -c myredpen-conf.xml 2be-validated.txt
2be-validated.txt:1: ValidationError[JavaScript], [NumberOfCharacter.js] Sentence is shorter than 100 characters long. at line: very short sentence.

Write your own RedPen Validator in Java

RedPen is extensible

RedPen is a Java-based, extensible document validation framework. You can implement your own extension (i.e., validation logic) in Java.

Write your own validator

You can write your own Validator by simply overriding Validator#validate(Sentence). This is a callback method that receives each sentence in a document. To report a validation error in a sentence, use Validator#addError(String message, Sentence sentenceWithError).
Make sure that the class name ends with “Validator” and that your custom validation class is placed in one of the cc.redpen.validator, cc.redpen.validator.sentence, or cc.redpen.validator.section packages.

Here is a very simple Validator implementation which checks sentence length:

package cc.redpen.validator.sentence;

import cc.redpen.model.Sentence;
import cc.redpen.validator.Validator;

public class NumberOfCharacterValidator extends Validator {
  private final int MIN_LENGTH = 100;
  private final int MAX_LENGTH = 1000;

  @Override
  public void validate(Sentence sentence) {
    if (sentence.getContent().length() < MIN_LENGTH) {
      // formerly addValidationError(message,sentence)
      addError("Sentence is shorter than "
        + MIN_LENGTH + " characters long.", sentence);
    }
    if (sentence.getContent().length() > MAX_LENGTH) {
      addError("Sentence is longer than " + MAX_LENGTH
        + " characters long.", sentence);
    }
  }
}

Please also consult these existing validators for your reference.

Enable your custom Validator

Once you package the validator class in jar file format and put it in $REDPEN_HOME/lib, it’s ready to validate.
Make sure that you specify a validator section in your redpen-conf.xml as follows:

<redpen-conf lang="en">
  <validators>
    <validator name="NumberOfCharacter" />
  </validators>
</redpen-conf>

In the name attribute, specify the class name without “Validator”. For instance, name="NumberOfCharacter" loads NumberOfCharacterValidator. There is no need to specify the package name.
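
The naming convention can be illustrated with a short sketch that expands a configuration name into the class names it could refer to. This is an assumption about how such a lookup might work, not RedPen's actual loader: ValidatorNameSketch and candidateClassNames are hypothetical names, and only the three packages mentioned above are considered.

```java
import java.util.ArrayList;
import java.util.List;

public class ValidatorNameSketch {
    // Packages named in this post as valid locations for custom validators.
    private static final List<String> PACKAGES = List.of(
        "cc.redpen.validator",
        "cc.redpen.validator.sentence",
        "cc.redpen.validator.section");

    // Expands a configuration name into the fully qualified class names it could map to.
    public static List<String> candidateClassNames(String name) {
        List<String> candidates = new ArrayList<>();
        for (String pkg : PACKAGES) {
            candidates.add(pkg + "." + name + "Validator");
        }
        return candidates;
    }

    public static void main(String[] args) {
        // Prints one candidate per package for name="NumberOfCharacter".
        candidateClassNames("NumberOfCharacter").forEach(System.out::println);
    }
}
```

A loader following this convention could try each candidate in turn, which is why the package name never has to appear in redpen-conf.xml.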


RedPen v.1.3 Released

We released RedPen version 1.3 a few days ago. RedPen 1.3 is available from the releases page. If you are using a Mac, you can install the latest version via Homebrew ($brew install redpen).

This article introduces the three features added in RedPen version 1.3: AsciiDoc support, server commands, and JavaScript-based extensions.

Support for AsciiDoc format

Ever since RedPen was released, I have received requests asking for AsciiDoc support.
AsciiDoc is a markup language just like Markdown, LaTeX, and others. I was satisfied with Markdown at the time of RedPen’s initial release, so I did not fully understand why there was so much demand for a markup language with similar syntax to Markdown. Below, I will touch on the benefits of AsciiDoc.

Are there shortcomings in Markdown?

Recently, I worked on a job where I was writing a long and formal document. At first, I tried to write the document in Markdown, but it just did not go as I had intended. It was then that I started pulling my hair out over Markdown’s lack of expressive capacity. For example, links to sections and splitting of source files are necessary when writing documents of a certain length, yet such features are not (formally) supported in Markdown.

The benefits of AsciiDoc

While AsciiDoc has a light syntax similar to Markdown, it is rich in features. I feel that it has all the features needed to write formal, content-rich documents. A tool called Asciidoctor also makes it easy to convert documents to PDF and to change document styles. For those of you who need to write formal documents, it might be worth giving AsciiDoc a try.

Server Inclusion

The RedPen server was introduced in version 1.0 and has been extended from time to time since. In version 1.2, the REST API was strengthened, providing functionality closer to that of the command line.

The advantage of using RedPen via the REST API is its fast response. The redpen command takes time to start up each time it is executed. A constantly running server, on the other hand, can return the results of an inspection immediately.

Along with requests for feature improvements, I also received requests to be able to shut down the server from the command line. Until now, a separate file called redpen.war had to be downloaded, but in version 1.3, all of the server files (redpen.war) are included in the package (redpen-1.3.tar.gz).

I also provided a simple redpen-server command in RedPen version 1.3. To start the server, just run the following command:

$redpen-server start

Similarly, shutting down the server can be done simply as follows:

$redpen-server stop

Unfortunately, the command is not yet supported on Windows. I plan to add Windows support in the next version.

Extensions up to 1.2

One of RedPen’s distinguishing features is its extensibility. RedPen does all of the cumbersome processing automatically, such as extracting sentences and clauses from markup languages. As such, when a user wants to extend features on their own, all they have to do is write the processing to be done on a sentence. Plug-in features are also supported, so users do not have to build RedPen in its entirety.

However, I discovered during development that the current plug-in mechanism still falls short. Only people with a certain degree of familiarity with Java can create a Java-based plug-in.
A plug-in also needs to be compiled, so it cannot be tried out easily.

Support for JavaScript extensions

Therefore, JavaScript-based extensions are supported starting from version 1.3. Extensions written in JavaScript do not need to be compiled. Just put the JavaScript file for the extension(s) in $REDPEN_HOME/js. I plan to introduce more detailed instructions separately.


REST API for Document Validation

RedPen is an open source command line tool for proofreading documents, but RedPen also provides a server. This article introduces the functions of the RedPen server.

The RedPen server provides not only a Web UI but also a REST API, which enables users to check their documents without installing RedPen on their computers.

One of the features of the RedPen server API is its configurability. Users can validate their documents according to their own configuration settings. In addition, the server can be deployed on Heroku with a few clicks.

RedPen Server Web UI

A RedPen server is available at the following URL:

http://redpen.herokuapp.com/

The image below shows the current RedPen Web UI page.

redpen-ui

When a user visits the RedPen server at the URL above, the top left box is automatically preloaded with sample text that contains many mistakes. RedPen shows the errors in the top left box as red bars. The bottom left window provides detailed error information.

When users paste their documents into the top left box, any validation errors are displayed in the bottom left box.

We can configure the settings in the right box.
Specifically we can configure validation items and character (symbol) settings. For detailed configuration information, please see the RedPen configuration document page.

RedPen Server API

The RedPen server provides a REST API that enables users to apply RedPen validation without installing RedPen.

Currently the RedPen server API provides three types of validation.

  • /rest/config/redpens

This function returns validation errors using preconfigured redpens.

  • /document/validate

This function validates a document with the user’s configuration and then returns the errors.

  • /document/validate/json

This function is similar to the /document/validate function, but the configuration is written in JSON format.

Sample: REST API (/document/validate)

The /document/validate function has several parameters.

  • document contains the text of the document RedPen should validate
  • documentParser specifies the input document format. Valid options are PLAIN, MARKDOWN, and WIKI.
  • lang specifies the language used to tokenize the document. Currently, values of ja (Japanese) and en (English/Whitespace) are supported.
  • format determines the format for the results. This can be either json (the default), json2, plain, plain2 or xml.
  • config contains the contents of a RedPen XML configuration file.
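
These parameters are sent as form fields. As a rough sketch of how a client could assemble such a request body, the following Java example URL-encodes each field. The class FormBodySketch and its encode method are hypothetical names for this illustration, the example needs Java 10 or later, and it only builds the body without contacting any server.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FormBodySketch {
    // Encodes parameters as an application/x-www-form-urlencoded body (hypothetical helper).
    public static String encode(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            body.add(URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("document", "Twas brillig and the slithy toves");
        params.put("documentParser", "PLAIN");
        params.put("lang", "en");
        params.put("format", "plain2");
        // A tiny stand-in configuration; a real request would carry a full redpen-conf file.
        params.put("config", "<redpen-conf lang=\"en\"/>");
        System.out.println(encode(params));
    }
}
```

Encoding matters mostly for the config field, because an XML configuration contains characters such as <, " and = that would otherwise be misread as part of the form syntax.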

Now let us try the REST function with both configuration and text. The following is a sample RedPen configuration file (redpen-conf-en.xml) bundled with the RedPen package.

<redpen-conf lang="en">
  <validators>
    <validator name="SentenceLength">
      <property name="max_len" value="100"/>
    </validator>
    <validator name="InvalidSymbol"/>
    <validator name="SymbolWithSpace"/>
    <validator name="SectionLength">
      <property name="max_char_num" value="2000"/>
    </validator>
    <validator name="ParagraphNumber"/>
    <validator name="Spelling"/>
    <validator name="Contraction" />
    <validator name="DoubledWord" />
    <validator name="SuccessiveWord" />
    <validator name="EndOfSentence" />
    <validator name="SpaceBeginningOfSentence" />
  </validators>
</redpen-conf>

Next we validate a short input sentence with the RedPen server. The following command sends the document and configuration.

curl --data document="Twas brillig and the slithy toves did gyre and gimble in the wabe" \
  --data lang=en --data format=PLAIN2 \
  --data config="`cat redpen-conf-en.xml`" \
  redpen.herokuapp.com/rest/document/validate/
  Line: 1, Offset: 0
    Sentence: Twas brillig and the slithy toves did gyre and gimble in the wabe
      Spelling: Found possibly misspelled word "brillig".
      Spelling: Found possibly misspelled word "slithy".
      Spelling: Found possibly misspelled word "toves".
      Spelling: Found possibly misspelled word "gyre".
      Spelling: Found possibly misspelled word "gimble".
      "and".querying the input

As we see, the RedPen server returns several errors in the input sentence.

Deploying your RedPen Server with the Heroku Button

In the previous sample, I was using a server already deployed in Heroku, but this server is not powerful enough if many users send their validation requests.

If you need a short response time, you can of course deploy the RedPen server in your own environment, but this can be tiresome.

For users that would like to deploy their own server easily, RedPen provides a Heroku Button. Users can deploy the RedPen server with just a few clicks.

The Heroku Button can be found in the README of the RedPen source.

heroku-button

When we click the button, the following page is shown.

heroku-deploy

When the user clicks the Deploy for free button, the RedPen server is deployed in a few minutes.


Document Writing with Continuous Integration

When we build software or a Web service, we utilize many static analysis tools such as Lint, Checkstyle, or FindBugs. These static code analysis tools help us write simple code and improve productivity.

Many inspection tools are compatible with CI tools or services such as Travis or Wercker. CI services are useful in that they ensure that users follow the coding standard.

Unfortunately, however, there are no widely used inspection tools for technical writing, because the existing tools have several deficiencies. Most tools suffer from the following problems.

  • Customization is not provided: A text inspection tool should provide sufficient customization capabilities, since writing standards of different writing groups or individuals vary.
  • Semi-structured text formats are not supported: The production of manuals or large technical documents requires tags in semi-structured text formats, such as headings or tables.
  • They cannot be run by using a simple command: It should be possible to run a text inspection by using a simple command that is easily integrated in other systems or services, such as CI.

RedPen

logo-004

We have been building a simple text proofreading tool, RedPen. RedPen supports Markdown and Textile as input formats and provides sufficient customization to follow widely varying writing standards. RedPen can be run with a simple command and can therefore easily be used with CI tools and services such as Travis.

RedPen usage

We now describe the usage and the configuration and then run RedPen.

Command

RedPen provides a simple command, redpen. The redpen command has the following parameters.

    redpen -c <CONF FILE> <INPUT FILE> [<INPUT FILE>]
        -c,--conf <CONF FILE>                Configuration file
        -f,--format <FORMAT>                 Input format
        -h,--help                            Help message
        -r,--result-format <RESULT FORMAT>   Output format
        -v,--version                         Version information

Configuration

RedPen supports various functions (validators). RedPen users add the validators that are required according to their writing standards. The language is specified in the lang attribute in redpen-conf. RedPen then loads the default character settings for the specified language.

A sample of RedPen configuration follows. For more details, please see the RedPen Document.

    <redpen-conf lang="en">
        <validators>
            <validator name="SentenceLength">
                 <property name="max_len" value="200"/>
            </validator>
            <validator name="InvalidSymbol"/>
            <validator name="SymbolWithSpace"/>
            <validator name="SectionLength"/>
            <validator name="ParagraphNumber"/>
            <validator name="Spelling"/>
            <validator name="Contraction" />
            <validator name="DoubledWord" />
            <validator name="SuccessiveWord" />
            <validator name="EndOfSentence" />
            <validator name="SpaceBeginningOfSentence" />
        </validators>
    </redpen-conf>

Run RedPen

Now let us run RedPen for an English text.

First, please download the RedPen package, and then, unpack the package using the following command.

$tar xvf redpen-cli-1.0-assembled.tar.gz

Now, you can run RedPen using the following commands.

    $cd redpen-cli-1.0
    $bin/redpen -c conf/redpen-conf-en.xml sample-doc/en/sampledoc-en.txt
    14:32:37.639 [main] INFO  org.unigram.docvalidator.Main - loading character table file: sample/conf/symbol-conf-en.xml
    14:32:37.652 [main] INFO  o.u.docvalidator.util.CharacterTable - Succeeded to load character table
    14:32:37.654 [main] INFO  o.unigram.docvalidator.parser.Parser - comma is set to ","
    14:32:37.655 [main] INFO  o.unigram.docvalidator.parser.Parser - full stop is set to "."
    14:32:37.663 [main] INFO  o.u.d.v.s.ParagraphStartWithValidator - Using the default value of paragraph_start_with.
    CheckError[sample/doc/txt/en/sampledoc-en.txt: 0] = The length of the line exceeds the maximum 265 in line: ln bibliometrics and link analysis studies many attempts have been made to analyze the 
    relationship among scientific papers, authors, and joumals and recently, these research results have been found to be effective for analyzing the link structure ofweb pages as well.

As can be seen, the redpen command produces a warning (CheckError).

Apply RedPen to CI services

Now, we describe the construction of a CI environment for a text project with a CI service (Travis). The settings are simple. We just download the RedPen package and then run the redpen command for the target file. A sample of the Travis settings for RedPen follows.

Sample GitHub project

We wrote a sample GitHub project for running RedPen in a CI environment.

  • sampledoc-en.md: target article (Markdown syntax).
  • redpen-conf-ja.xml: RedPen settings file.
  • .travis.yml: Travis settings file that describes how to deploy and run RedPen to inspect the target article.

Travis setting

The following is .travis.yml stored in the repository.

    language: text
    jdk:
      - oraclejdk8
    env:
      - REDPEN_VERSION="1.0"
    install:
       - wget https://github.com/recruit-tech/redpen/releases/download/v1.0-experimental-1/redpen-cli-$REDPEN_VERSION-assembled.tar.gz
       - tar xvf redpen-cli-$REDPEN_VERSION-assembled.tar.gz
    script:
      - redpen-cli-$REDPEN_VERSION/bin/redpen -c redpen-conf.xml -f markdown -r xml sampledoc-en.md

As can be seen, RedPen is downloaded and run for the target file.

Sample document

The following is a sample article that contains several text mistakes and formatting errors.

# Distributed system
Some software tools work in more than one machine, and such distributed (cluster)systems can handle huge data or tasks, because such software tools make use of large amount of computer resources.
In this article, we'll call a computer server that works as a member of a cluster an "instance". for example, each instance in distributed search engines stores the the fractions of data.
Such distriubuted systems need a component to merge the preliminary results from member instnaces.

Initial commit

When we committed the initial version of the article, we received a warning from Travis by mail.

Initial Result

As can be seen, several errors are reported.

Error analysis

The second line contains a sentence that is longer than the specified maximum length.
The sentence also contains a spacing error after the right parenthesis.
In the third line, RedPen reports a contraction error. Contractions should not be used in formal writing. In addition, the third line contains two more errors: the word “the” is used twice in succession, and the end of the sentence is invalid.

Another error in the second sentence of the third line is also reported: the sentence begins with a lower case letter. In the fourth line, two misspelled words (distriubuted, and instnaces) are reported.

Correcting errors

First, we divided the sentence in the second line into two separate sentences:

Some software tools work in more than one machine.
Such distributed (cluster) systems can handle huge data or tasks, because such software tools utilize large amount of computer resources.

Then, we corrected the remaining reported errors, including the misspelled words, end of sentence, contraction, etc. The following is the corrected article.

# Distributed system
Some software tools work in more than one machine. Such distributed (cluster) systems can handle huge data or tasks, because such software tools utilize large amount of computer resources.
In this article, we call a computer server that works as a member of a cluster an "instance." For example, each instance in distributed search engines stores the fractions of data.
Such distributed systems need a component to merge the preliminary results from member instances.

Now let us commit the corrected article to GitHub.

Second Result

Then, we received the second report from Travis. The report stated that the validation had succeeded, as expected.

Future work

In this article, we described the application of RedPen to CI environments. However, in our opinion there is still room for improvement. In particular, our future work will focus on the following features:

  • Enhanced server: Currently, RedPen provides a sample server, but it does not provide sufficient configuration options. We will enhance the server so that it will be more configurable.
  • Rich functions: We will provide more linguistic features using part-of-speech or parsing information.
  • Language support: We will cover more languages such as Chinese or French.