Converting RedPen documentation to AsciiDoc

We recently converted the RedPen documentation from RST (reStructuredText) to AsciiDoc, primarily to provide a working example of AsciiDoc markup to use as a test case for RedPen.

Our initial attempt to convert the RST documents used pandoc v1.13.2, an automated markup translator.

We processed the resulting .adoc files using AsciiDoctor.

Pandoc correctly converted in-line formatting, and automatically provided anchors for all headings. It quickly enabled us to get basic AsciiDoc versions of the existing files.

However, although pandoc is a very useful and powerful tool, it encountered a few problems converting our RST files to AsciiDoc. This was not totally unexpected, since there are markup options that cannot be directly translated between reStructuredText and AsciiDoc. Still, some of the problems meant that a significant amount of text had to be reconverted by hand.

The issues we encountered were:

Missing tables

Several of the tables in the source documents were totally absent in the AsciiDoc files pandoc created. For example, this RST text:

SentenceLength validator checks the length of sentences in the input document. If the length of the sentence is greater than the specified maximum length, the validator generates a warning.

.. table::

  ============== ============= ============================
  Property       Default Value Description
  ============== ============= ============================
  ``"max_len"``  50            Maximum length of sentence.
  ============== ============= ============================

was translated by pandoc to:

 
SentenceLength validator checks the length of sentences in the input
document. If the length of the sentence is greater than the specified
maximum length, the validator generates a warning.

The table is completely absent. The correct AsciiDoc table is as follows:

[options="header"]
|====
|Property        |Default Value  |Description
|``max_len``     |50             |Maximum length of sentence.
|====

Source blocks

Source blocks in AsciiDoc should be formatted as follows:

[source,xml]
----
<validators>
    <validator name="SentenceLength">
        <property name="max_len" value="200"/>
    </validator>
    <validator name="InvalidSymbol" />
    <validator name="SpaceWithSymbol" />
    <validator name="SectionLength">
        <property name="max_num" value="2000"/>
    </validator>
    <validator name="ParagraphNumber" />
 </validators>
---- 

However, pandoc produced the following:

code,sourceCode,xml------------------------------------------------------------------------------------------------
code,sourceCode,xml
<validators>
    <validator name="SentenceLength">
        <property name="max_len" value="200"/>
    </validator>
    <validator name="InvalidSymbol" />
    <validator name="SpaceWithSymbol" />
    <validator name="SectionLength">
        <property name="max_num" value="2000"/>
    </validator>
    <validator name="ParagraphNumber" />
 </validators>
------------------------------------------------------------------------------------------------

The format did not render properly when processed with AsciiDoctor.

Unsupported :ref: and :doc:

The RST source files included :ref: and :doc: references. For example, this RST text:

RedPen supports default symbols for "en" and "ja", which are 
described in :ref:`en-default-symbol-setting` 
and :ref:`ja-default-symbol-setting`.

was converted to:

RedPen supports default symbols for ``en'' and ``ja'', which are
described in en-default-symbol-setting 
and ja-default-symbol-setting.

We were hoping for something more like the following. It is not 100% correct, but it would preserve the notion that a link was involved:

RedPen supports default symbols for "en" and "ja", which are 
described in <<en-default-symbol-setting>> 
and <<ja-default-symbol-setting>>.
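
The mapping we had in mind can be sketched as a plain text substitution. The following is a minimal illustration, not something pandoc or RedPen provides: the class and method names are hypothetical, and it assumes only the simple :ref:`label` form (no custom link text).

```java
import java.util.regex.Pattern;

class RefRoleRewriter {
    // Matches the simple form :ref:`label`; labels containing '<' or '>'
    // (the "title <label>" form) are deliberately left untouched.
    private static final Pattern REF_ROLE = Pattern.compile(":ref:`([^`<>]+)`");

    // Replaces every :ref:`label` with an AsciiDoc cross reference <<label>>.
    static String rewrite(String text) {
        return REF_ROLE.matcher(text).replaceAll("<<$1>>");
    }

    public static void main(String[] args) {
        String line = "described in :ref:`en-default-symbol-setting`";
        // prints: described in <<en-default-symbol-setting>>
        System.out.println(rewrite(line));
    }
}
```

The anchors that pandoc generates for headings would still need to match the labels used in the references, so the result has to be checked by hand.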

Heading levels

Although pandoc converted our headings perfectly correctly, it used the AsciiDoc heading style which is arguably the least flexible. Pandoc chose the following (correct) translation:

Heading Level 1
---------------
Heading Level 2
~~~~~~~~~~~~~~~
Heading Level 3
^^^^^^^^^^^^^^^

However, we would have preferred the easier-to-edit alternate format:

= Heading Level 1
== Heading Level 2
=== Heading Level 3
==== Heading Level 4
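
If you prefer the one-line style, the two-line titles can be rewritten mechanically. The sketch below is a hypothetical helper, not part of pandoc or AsciiDoctor. It relies on AsciiDoc's underline characters for levels 0 to 4 (=, -, ~, ^, +), so a title underlined with - becomes a == title. It assumes that an underline is a run of one of those characters whose length is within one character of the title's length, and it does not try to skip listing blocks, so the result should be reviewed by hand.

```java
import java.util.ArrayList;
import java.util.List;

class HeadingStyleConverter {
    // Underline characters of AsciiDoc two-line titles, level 0 to level 4.
    private static final String UNDERLINE_CHARS = "=-~^+";

    // Rewrites two-line titles as one-line titles; other lines pass through.
    static List<String> convert(List<String> lines) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < lines.size()) {
            String line = lines.get(i);
            if (i + 1 < lines.size() && isUnderlineFor(line, lines.get(i + 1))) {
                int level = UNDERLINE_CHARS.indexOf(lines.get(i + 1).charAt(0));
                out.add("=".repeat(level + 1) + " " + line.trim());
                i += 2; // consume the title and its underline
            } else {
                out.add(line);
                i++;
            }
        }
        return out;
    }

    // Heuristic (an assumption of this sketch): a run of a single underline
    // character whose length differs from the title's by at most one.
    private static boolean isUnderlineFor(String title, String underline) {
        if (title.isBlank() || underline.length() < 2) {
            return false;
        }
        char c = underline.charAt(0);
        if (UNDERLINE_CHARS.indexOf(c) < 0) {
            return false;
        }
        for (int k = 1; k < underline.length(); k++) {
            if (underline.charAt(k) != c) {
                return false;
            }
        }
        return Math.abs(underline.length() - title.length()) <= 1;
    }

    public static void main(String[] args) {
        List<String> input = List.of("Heading Level 1", "---------------", "Some text.");
        convert(input).forEach(System.out::println);
    }
}
```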

Other considerations

AsciiDoctor does not currently support the creation of a table of contents that spans multiple source documents. Given this limitation, we decided to combine all RedPen documents into a single HTML page. Although the page is larger, it is easier to navigate on some devices, and does not require any additional tools to build a surrounding multi-document index or menu.

Summary

Converting documents between different markup formats is reasonably straightforward, and tools such as pandoc are excellent choices for making headway quickly and easily. However, such tools do not always support all markup formats equally, and there may still be plenty of manual validation and editing to get your documents in good order.

Although pandoc v1.13.2 is not the current version, it was the version available at the time on Fedora 23. However, at the time of writing, the latest version available via http://pandoc.org/try/ still appears to produce the same translation as we encountered.


Released RedPen v1.4 (LaTeX support)

We released RedPen v1.4. We hope that you will download it from the following URL and try using it.

https://github.com/redpen-cc/redpen/releases/tag/v1.4.0

The centerpiece of release v1.4 is support for the LaTeX format.

LaTeX support

In this release, we added experimental support for LaTeX as an input format. Many people have requested LaTeX support ever since the initial development of RedPen. It took us a year of work after the v0.6 release, but LaTeX is finally supported.

Unfortunately, the LaTeX support is limited in the following ways:

  1. The RedPen LaTeX parser does not work well when macros are used to add your own tags.
  2. It does not fully check sentences in lists and tables.

Although these are significant constraints, we believe the LaTeX support is already useful for checking papers and other documents.

Enhancement of functions (Validators)

In v1.4, we also concentrated on enhancing the validator functions. The added functions fall into three groups by language support: functions supported in both Japanese and English, in English only, and in Japanese only.

Functions supported in both Japanese and English

  • DoubleNegative: Double negative statements are difficult to understand in both Japanese and English. If a double negative is present in the text, an error is output.

Functions supported in English only

  • FrequentSentenceStart: When writing a document in English, it is easy to start many sentences with "We". Even when there is no problem with the content, the repetition reads badly, so it is worth rewriting such sentences. Consider the following example.


We propose a novel method. We demonstrate the effectiveness of the method.

In the example above, "We" starts two sentences in a row. Without changing the meaning, we can edit the sentences so that the same subject is not used repeatedly.


We propose a novel method. The effectiveness of the method is demonstrated in the experiments.

  • UnexpandedAcronym: Checks that each acronym used in the document is accompanied by the original words that it represents.
  • WordFrequency: If the frequency of a word within the document differs from the usual frequency, an error is output.
  • Hyphenation: If hyphen usage is not correct, an error is output.
  • NumberFormat: If number formats differ from correct usage in English, an error is output.
  • ParenthesizedSentence: Inspects the usage of parentheses. If there are nested parentheses or more parentheses than specified, an error is output.
  • WeakExpression: If the text contains a weak or vague English expression, an error is output. For example, words such as “completely” and “huge” should be replaced with more accurate expressions.

Functions supported in Japanese only

  • Okurigana: If Japanese okurigana word endings are used incorrectly, an error is output.
  • DoubledJoshi: If a particle (joshi) is used more than once in a sentence, the sentence might be difficult to read.

Prospect of version 1.5

We will continue developing RedPen toward v1.5 and beyond. We have not yet set the priorities, but v1.5 should include a mechanism for easily testing functions written in JavaScript.


Localizing Validators

If you are familiar with more than one natural language, consider localizing your custom validators in RedPen.

To localize validation error messages, you need to avoid hard-coding them in your Validator implementation. To do so, use addLocalizedError(Sentence sentenceWithError, Object... args) instead of addError(String message, Sentence sentenceWithError), as follows:
NumberOfCharactersLocalizedValidator

package cc.redpen.validator.sentence;

import cc.redpen.model.Sentence;
import cc.redpen.validator.Validator;

public class NumberOfCharactersLocalizedValidator extends Validator {
    private final int MIN_LENGTH = 100;
    private final int MAX_LENGTH = 1000;
    @Override
    public void validate(Sentence sentence) {
        if (sentence.getContent().length() < MIN_LENGTH) {
            // actual error message is in NumberOfCharactersLocalizedValidator.properties
            addLocalizedError(sentence, MIN_LENGTH);
        }
        if (sentence.getContent().length() > MAX_LENGTH) {
            // You can specify a message key when you have multiple error message variations
            addLocalizedError("toolong", sentence, MAX_LENGTH);
        }
    }
}

Then you can place error messages in YourValidatorName[_LANGUAGE].properties:
NumberOfCharactersLocalizedValidator.properties (default messages)

NumberOfCharactersLocalizedValidator=Sentence is shorter than {0} characters long.
NumberOfCharactersLocalizedValidator.toolong=Sentence is longer than {0} characters long.

NumberOfCharactersLocalizedValidator_ja.properties (Japanese messages)

NumberOfCharactersLocalizedValidator=文が{0}文字より短いです。
NumberOfCharactersLocalizedValidator.toolong=文が{0}文字より長いです。

Error messages are rendered by java.text.MessageFormat, so you can put placeholders such as {0} in the error message text.
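
To see how such a placeholder is filled in, here is a minimal, self-contained sketch that uses java.text.MessageFormat directly. It does not go through RedPen: the class and method names are hypothetical, and the locale is passed explicitly only to make the output predictable.

```java
import java.text.MessageFormat;
import java.util.Locale;

class MessageFormatDemo {
    // Fills the {0}, {1}, ... placeholders of a pattern for the given locale.
    static String render(Locale locale, String pattern, Object... args) {
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        String tooShort = "Sentence is shorter than {0} characters long.";
        // prints: Sentence is shorter than 100 characters long.
        System.out.println(render(Locale.ENGLISH, tooShort, 100));

        String tooLong = "Sentence is longer than {0} characters long.";
        // prints: Sentence is longer than 1,000 characters long.
        System.out.println(render(Locale.ENGLISH, tooLong, 1000));
    }
}
```

Note that MessageFormat formats numeric arguments for the locale, so 1000 is rendered as 1,000 in English. A placeholder such as {0,number,#} avoids the grouping separator.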

With the above implementation, you will get a Japanese validation error message in a Japanese environment:

$ ./bin/redpen -c conf/redpen-conf-en.xml ~/mytext.txt
mytext.txt:1: ValidationError[NumberOfCharacters], 文が100文字より短いです。 at line: Short paragraph.

and an English validation error message in an English environment:

$ ./bin/redpen -c conf/redpen-conf-en.xml ~/mytext.txt
mytext.txt:1: ValidationError[NumberOfCharacters], Sentence is shorter than 100 characters long. at line: Short paragraph.

Writing RedPen extension with JavaScript

Extending RedPen in JavaScript

We introduced the basics of writing custom validators in the last post. For those who are unfamiliar with Java, we introduced JavaScriptValidator in RedPen v1.3. JavaScriptValidator is a special validator that loads Validator implementations written in JavaScript.

Enabling JavaScriptValidator

To enable JavaScriptValidator, simply add <validator name="JavaScript" /> to your redpen-conf.xml as follows:

<redpen-conf lang="en">
  <validators>
       ...snip...
    <validator name="JavaScript" />
  </validators>
</redpen-conf>

Write your own validator in JavaScript

Here is a JavaScript version of NumberOfCharacterValidator:

var MIN_LENGTH = 100;
var MAX_LENGTH = 1000;

function validateSentence(sentence) {
  if (sentence.getContent().length() < MIN_LENGTH) {
    addError("Sentence is shorter than "
      + MIN_LENGTH + " characters long.", sentence);
  }
  if (sentence.getContent().length() > MAX_LENGTH) {
    addError("Sentence is longer than " + MAX_LENGTH
      + " characters long.", sentence);
  }
}

The code looks much like the Java version. However, because of the difference in the type systems, the callback method “validate(Sentence sentence)” is named “validateSentence(sentence)” in the JavaScript version.

Run

Because it is JavaScript, there is no need to compile or package your validator. (Actually, your JavaScript code is compiled into Java bytecode by Nashorn, and it runs faster than you might expect.) JavaScriptValidator picks up any *.js file located in the $REDPEN_HOME/js directory.
You can simply run the redpen command to have your file validated by the JavaScript validator.

$ ./bin/redpen -c myredpen-conf.xml 2be-validated.txt
2be-validated.txt:1: ValidationError[JavaScript], [NumberOfCharacter.js] Sentence is shorter than 100 characters long. at line: very short sentence.

Write your own RedPen Validator in Java

RedPen is extensible

RedPen is a Java-based, extensible document validation framework. You can implement your own extension (i.e. validation logic) in Java.

Write your own validator

You can write your own Validator by simply overriding Validator#validate(Sentence). This is a callback method where you can receive each sentence in a document. To report a validation error in a sentence, use Validator#addError(String message, Sentence sentenceWithError).
Make sure that the class name ends with “Validator” and that your custom validation class is placed in one of the cc.redpen.validator, cc.redpen.validator.sentence, or cc.redpen.validator.section packages.

Here is a very simple Validator implementation which checks sentence length:

package cc.redpen.validator.sentence;

import cc.redpen.model.Sentence;
import cc.redpen.validator.Validator;

public class NumberOfCharacterValidator extends Validator {
  private final int MIN_LENGTH = 100;
  private final int MAX_LENGTH = 1000;

  @Override
  public void validate(Sentence sentence) {
    if (sentence.getContent().length() < MIN_LENGTH) {
      // formerly addValidationError(message,sentence)
      addError("Sentence is shorter than "
        + MIN_LENGTH + " characters long.", sentence);
    }
    if (sentence.getContent().length() > MAX_LENGTH) {
      addError("Sentence is longer than " + MAX_LENGTH
        + " characters long.", sentence);
    }
  }
}

Please also consult these existing validators for your reference.

Enable your custom Validator

Once you package the validator class in jar file format and put it in $REDPEN_HOME/lib, it’s ready to validate.
Make sure that you specify a validator section in your redpen-conf.xml as follows:

<redpen-conf lang="en">
  <validators>
    <validator name="NumberOfCharacter" />
  </validators>
</redpen-conf>

In the name attribute, specify the class name without “Validator”. For instance, name="NumberOfCharacter" loads NumberOfCharacterValidator. There is no need to specify the package name.


RedPen v.1.3 Released

We released RedPen version 1.3 a few days ago. RedPen 1.3 is available from the releases page. If you are using a Mac, you can install the latest version via Homebrew ($ brew install redpen).

This article introduces the three features added in RedPen version 1.3: AsciiDoc support, server commands, and JavaScript-based extensions.

Support for AsciiDoc format

Ever since RedPen was released, I have received requests asking for AsciiDoc support.
AsciiDoc is a markup language just like Markdown, LaTeX, and others. I was satisfied with Markdown at the time of RedPen’s initial release, so I did not fully understand why there was so much demand for a markup language with similar syntax to Markdown. Below, I will touch on the benefits of AsciiDoc.

Are there shortcomings in Markdown?

Recently, I had to write a long, formal document. At first I tried to write it in Markdown, but it did not go as I had intended, and I ended up pulling my hair out over Markdown's limited expressive capacity. For example, cross-references to sections and splitting a document across several source files are necessary for documents of any real size, yet Markdown does not (formally) support them.

The benefits of AsciiDoc

While AsciiDoc has a light syntax similar to Markdown, it is rich in features. I feel that it has all the features needed to write formal, content-rich documents. A tool called AsciiDoctor also makes it easy to convert documents to PDF and to change document styles to influence the presentation of the document. If you need to write formal documents, it might be worth your while to give AsciiDoc a try.

Server Inclusion

The RedPen server was introduced in version 1.0 and has been extended from time to time since. In version 1.2, the REST API was strengthened, providing functionality closer to that of the command line.

The advantage of using RedPen via the REST API is its fast response. Each time you run the redpen command, the program takes time to start up. A server that is already running, on the other hand, can return the results of an inspection immediately.

Along with the feature improvements, I received requests to be able to shut down the server from the command line. Until now, a separate file called redpen.war had to be downloaded, but in version 1.3 the server file (redpen.war) is included in the package (redpen-1.3.tar.gz).

I also provided a simple redpen-server command in RedPen version 1.3. To start the server, just run the following command:

$ redpen-server start

Similarly, shutting down the server can be done simply as follows:

$ redpen-server stop

Unfortunately, however, the command is not supported on Windows. I plan to add Windows support in the next version.

Extensions up to 1.2

One of RedPen’s distinguishing features is its extensibility. RedPen does all of the cumbersome processing automatically, such as extracting sentences and clauses from markup languages. As such, when a user wants to extend features on their own, all they have to do is write the processing to be done on a sentence. Plug-in features are also supported, so users do not have to build RedPen in its entirety.

However, what I discovered during development was that the current plug-in mechanism leaves something to be desired. Only people with a certain degree of familiarity with Java can create a Java-based plug-in.
A plug-in also needs to be compiled, so it cannot be tried out easily…

Support for JavaScript extensions

Therefore, JavaScript-based extensions are supported starting from version 1.3. With JavaScript-based extensions, users have no need to compile anything. Just put the JavaScript file for each extension in $REDPEN_HOME/js. I plan to provide more detailed instructions separately.


REST API for Document Validation

RedPen is an open source command line tool for proofreading documents, but RedPen also provides a server. This article introduces the functions of the RedPen server.

The RedPen server provides not only a Web UI but also a REST API, which enables users to check their documents without installing RedPen on their computers.

One of the features of the RedPen server API is its configurability. Users can validate their documents according to their own configuration settings. In addition, the server can be deployed on Heroku with a few clicks.

RedPen Server Web UI

A RedPen server is available at the following URL:

http://redpen.herokuapp.com/

The image below shows the current RedPen Web UI page.

redpen-ui

When a user visits the RedPen server at the URL above, the top left box is automatically preloaded with sample text that contains many mistakes. RedPen shows the errors in the top left box as red bars. The bottom left window provides detailed error information.

When users paste their documents into the box, any validation errors are displayed in the bottom left box.

We can configure the settings in the right box.
Specifically we can configure validation items and character (symbol) settings. For detailed configuration information, please see the RedPen configuration document page.

RedPen Server API

The RedPen server provides a REST API that enables users to apply RedPen validation without installing RedPen.

Currently the RedPen server API provides three types of validation.

  • /rest/config/redpens

This function returns validation errors using preconfigured redpens.

  • /document/validate

This function validates a document with the user’s configuration and then returns the errors.

  • /document/validate/json

This function is similar to the /document/validate function, but the configuration is written in JSON format.

Sample: REST API (/document/validate)

The /document/validate function has several parameters.

  • document contains the text of the document RedPen should validate
  • documentParser specifies the input document format. Valid options are PLAIN,
    MARKDOWN, and WIKI.
  • lang specifies the language used to tokenize the document. Currently, values of ja (Japanese) and en (English/Whitespace) are supported.
  • format determines the format for the results. This can be either json (the default), json2, plain, plain2 or xml.
  • config contains the contents of a RedPen XML configuration file.

Now let us try the REST function with both configuration and text. The following is a sample RedPen configuration file (redpen-conf-en.xml) bundled with the RedPen package.

<redpen-conf lang="en">
  <validators>
    <validator name="SentenceLength">
      <property name="max_len" value="100"/>
    </validator>
    <validator name="InvalidSymbol"/>
    <validator name="SymbolWithSpace"/>
    <validator name="SectionLength">
      <property name="max_char_num" value="2000"/>
    </validator>
    <validator name="ParagraphNumber"/>
    <validator name="Spelling"/>
    <validator name="Contraction" />
    <validator name="DoubledWord" />
    <validator name="SuccessiveWord" />
    <validator name="EndOfSentence" />
    <validator name="SpaceBeginningOfSentence" />
  </validators>
</redpen-conf>

Next we validate a short input sentence with the RedPen server. The following command sends the document and configuration.

curl --data document="Twas brillig and the slithy toves did gyre and gimble in the wabe" \
  --data lang=en --data format=PLAIN2 \
  --data config="`cat redpen-conf-en.xml`" \
  redpen.herokuapp.com/rest/document/validate/
  Line: 1, Offset: 0
    Sentence: Twas brillig and the slithy toves did gyre and gimble in the wabe
      Spelling: Found possibly misspelled word "brillig".
      Spelling: Found possibly misspelled word "slithy".
      Spelling: Found possibly misspelled word "toves".
      Spelling: Found possibly misspelled word "gyre".
      Spelling: Found possibly misspelled word "gimble".
      "and".querying the input

As we see, the RedPen server returns several errors in the input sentence.
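
The same request can be prepared from Java. The sketch below is a hypothetical client, not part of RedPen: it only builds the form-encoded body and the POST request with the JDK's java.net.http API (Java 11 or later), using the parameter names and the URL from the curl example above. The config parameter is left out for brevity, and whether the server is still reachable at that address is an assumption.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class RedPenValidateRequest {
    // Encodes the parameters as an application/x-www-form-urlencoded body.
    static String formBody(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Builds (but does not send) the POST request for the given endpoint.
    static HttpRequest buildRequest(String endpoint, Map<String, String> params) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(params)))
                .build();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("document", "Twas brillig and the slithy toves");
        params.put("lang", "en");
        params.put("format", "PLAIN2");

        HttpRequest request = buildRequest(
                "http://redpen.herokuapp.com/rest/document/validate/", params);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(formBody(params));
        // To send it:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    }
}
```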

Deploying your RedPen Server with the Heroku Button

In the previous sample, I was using a server already deployed in Heroku, but this server is not powerful enough if many users send their validation requests.

If you need a short response time, you can of course deploy the RedPen server in your own environment, but this can be tiresome.

For users that would like to deploy their own server easily, RedPen provides a Heroku Button. Users can deploy the RedPen server with just a few clicks.

The Heroku Button can be found in the README of the RedPen source.

heroku-button

When we click the button, the following page is shown.

heroku-deploy

When the user clicks the Deploy for free button, the RedPen server is deployed in a few minutes.
