Custom Personality Tutorial

From DocDataFlow
Revision as of 01:08, 16 January 2014 by Kris

First, we'll build a very simple personality, and we'll gradually extend it to better demonstrate how Crawler works.

The basis of nearly all document conversion personalities is the ViewExporter adapter.

The ViewExporter is a complex adapter. Basically, it connects two 'main' sub-adapters: a disassembler (which breaks a document granule into smaller granules) and an assembler (which takes the granules coming out of the disassembler, and builds the desired end-result).

The disassembler is part of the default Crawler setup. When running Crawler, the ViewExporter will ask the currently active application to provide it with an appropriate disassembler, and it will then use that disassembler in the ViewExporter.

The disassembler gets further configuration through the configuration files.

Adjusting The Top-Level config.ini

First, we'll enhance the top-level configuration file so it knows about the new personality we're going to build.

Let's call the personality 'tutorial'.

Change the top-level config.ini (i.e. the config.ini which resides next to Export.jsxbin). Initially, it looks similar to this (I've omitted most comments for brevity):

[conditionals]

selectors = text

[main]

personalityConfig?text = "./Personalities/Text/config.ini"

# ********************************************************************************

[debug]

debugMonitoring = false
logLevel = 0

Change it so it becomes like this:

[conditionals]

selectors = tutorial

[main]

personalityConfig?tutorial = "./Personalities/Tutorial/config.ini"
personalityConfig?text = "./Personalities/Text/config.ini"

# ********************************************************************************

[debug]

debugMonitoring = true
monitorAdapters = inputSplitter

logLevel = 5
logFileName = Crawler.log

This tells Crawler that we want to select 'tutorial', and it also says that the personalityConfig entry needs to be the lower-level config.ini inside the Tutorial folder inside the Personalities folder.

We also switch on debug monitoring, and hook a Debug Monitor into the inputSplitter adapter inside the ViewExporter.
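The selector mechanism can be pictured with a small Python sketch. This is only a model of how the configuration reads, not Crawler's actual implementation: the resolve function is hypothetical, and it assumes that an entry of the form key?selector applies when that selector appears in the selectors entry, and that multiple selectors would be comma-separated.

```python
import configparser

# The [conditionals] and [main] sections from the modified top-level config.ini.
TOP_LEVEL_INI = """
[conditionals]
selectors = tutorial

[main]
personalityConfig?tutorial = "./Personalities/Tutorial/config.ini"
personalityConfig?text = "./Personalities/Text/config.ini"
"""


def resolve(ini_text, section, key):
    """Return the value of `key`, preferring the variant for an active selector.

    Hypothetical helper: the precedence rules are an assumption, not taken
    from Crawler itself.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the original case of the keys
    parser.read_string(ini_text)

    raw = parser.get("conditionals", "selectors", fallback="")
    selectors = [s.strip() for s in raw.split(",") if s.strip()]

    for selector in selectors:
        candidate = f"{key}?{selector}"
        if parser.has_option(section, candidate):
            return parser.get(section, candidate).strip('"')

    # No conditional variant matched: fall back to the plain key, if any.
    if parser.has_option(section, key):
        return parser.get(section, key).strip('"')
    return None


print(resolve(TOP_LEVEL_INI, "main", "personalityConfig"))
# ./Personalities/Tutorial/config.ini
```

Under these assumptions, switching selectors back to text would make the same lookup return the Text personality's config.ini, which is why both personalityConfig entries can live side by side in the file.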

Creating A Tutorial Personality

Now that Crawler 'knows' about the new personality, the next step is to start building it.

Open the Personalities folder, and create a new subfolder called Tutorial. Inside that subfolder, create a text file called config.ini.

[Image: TutorialInitialPersonality.png]

Put the following text in this personality-level config.ini file:

[main]

views = tutorialView

nesting = document/text.story

[main:tutorialView]

fileSplitLevel = document
xmlEncode = 0
accepted = document, text.story

[flush:tutorialView]

document = text.story

View

With this config file, we tell the ViewExporter that we only need a single view, named tutorialView.

Later on, we'll build personalities with multiple views. Views are a way to concurrently build separate, but related files.

For example, when converting to XHTML, we need to build a CSS structure as well as an XHTML structure, and keep track of how they relate to one another.

This kind of 'interrelated' file building is handled through views in Crawler.

For this first tutorial, we don't need to build multiple views concurrently, so we can make do with just the single tutorialView.

Disassembly Hierarchy

We'll be processing InDesign documents, which have a 'natural' hierarchy: documents contain stories, stories contain paragraphs, paragraphs contain text runs, text runs contain words.

(Remark: this is not the only hierarchy we could use in InDesign documents. An alternate hierarchy would be documents contain spreads, spreads contain text frames, text frames contain text runs, text runs contain words. This alternate hierarchy does not 'map' onto the first hierarchy: text frames do not map cleanly onto paragraph boundaries or vice versa.)

In this case, we're initially interested in getting the text, and we don't care too much about the lower level granules, so all we tell the disassembler is:

nesting = document/text.story

This tells the disassembler: if you see a document granule, please disassemble it into text.story granules.

When presented with a document granule, the disassembler will spit out a few story granules, followed by the original document granule.

Later on, we'll tell the disassembler to dig deeper than that.
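The order in which granules come out (stories first, then the document that contains them) can be illustrated with a short Python model. The Granule class and the disassemble function below are hypothetical stand-ins invented for this sketch; they are not Crawler's classes, and they ignore everything except the nesting string.

```python
from dataclasses import dataclass, field


@dataclass
class Granule:
    """Hypothetical stand-in for a Crawler granule (not the real class)."""
    kind: str    # shortened class identifier, e.g. "text.story"
    label: str
    children: list = field(default_factory=list)


def disassemble(granule, nesting):
    """Yield the sub-granules named by the nesting string, then the granule itself."""
    levels = nesting.split("/")
    if len(levels) > 1 and granule.kind == levels[0]:
        remaining = "/".join(levels[1:])
        for child in granule.children:
            if child.kind == levels[1]:
                yield from disassemble(child, remaining)
    # The containing granule is emitted after its parts.
    yield granule


doc = Granule("document", "TutorialTest.indd", [
    Granule("text.story", "Facesti quo tet"),
    Granule("text.story", "Pudipsunt alitionet"),
])

for g in disassemble(doc, "document/text.story"):
    print(g.kind, g.label)
# text.story Facesti quo tet
# text.story Pudipsunt alitionet
# document TutorialTest.indd
```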

Class Identifier Shorthand

The nesting entry is a slash-separated list of class identifiers. The expression

[main]
...
nesting = document/text.story
...

is actually shorthand for:

[main]
...
nesting = com.rorohiko.granule.document/com.rorohiko.granule.text.story
...

because Crawler allows granule class identifiers to be shortened by dropping the com.rorohiko.granule. prefix.
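As an illustration, the expansion can be written as a few lines of Python. The function names are made up for this sketch; the only thing taken from Crawler is the com.rorohiko.granule. prefix itself.

```python
GRANULE_PREFIX = "com.rorohiko.granule."


def expand_class_id(class_id):
    """Return the fully qualified identifier for a possibly shortened one."""
    if class_id.startswith(GRANULE_PREFIX):
        return class_id
    return GRANULE_PREFIX + class_id


def expand_nesting(nesting):
    """Expand every slash-separated element of a nesting string."""
    return "/".join(expand_class_id(part.strip()) for part in nesting.split("/"))


print(expand_nesting("document/text.story"))
# com.rorohiko.granule.document/com.rorohiko.granule.text.story
```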

File Split Level

The [main:tutorialView] section gives the ViewExporter some info on the desired view.

It tells the ViewExporter that we want a fileSplitLevel = document.

As with the nesting, this is shorthand for fileSplitLevel = com.rorohiko.granule.document.

This instructs the ViewExporter that we want a separate output file for each input document.

If we were to use our tutorial personality to process an InDesign book file, we'd end up with a separate output file for each of the InDesign files in the book file.

By tweaking the fileSplitLevel, we could ask for a single output file for the whole book, or we could ask for a separate output file per individual story.

Log Output

Our rudimentary personality is not complete yet, but we can already try to run it. We won't get any meaningful output just yet, but we've configured a debug monitor, and a log file, so we'll see some useful information there.

If you run the Export.jsxbin script on an InDesign document, you'll get something like this in the Crawler.log file that should appear next to the Export.jsxbin file:

Mon Dec 30 2013 18:02:18 GMT+1300: Error: OutputTextFile.prototype.dumpData_protected_: needs file
Mon Dec 30 2013 18:02:18 GMT+1300: Note : ViewSplitter 'inputSplitter' input log:
**************************
InDesignStoryGranule [Facesti quo tet, offictation reculpa ritiand icatum doles maios re ...]

InDesignStoryGranule [Pudipsunt alitionet que labo. Liquo cor rerundelento eos et apiet  ...]

InDesignDocumentGranule [TutorialTest.indd]

**************************

Mon Dec 30 2013 18:02:18 GMT+1300: Error: OutputTextFile.prototype.dumpData_protected_: needs file

We can see the OutputTextFile adapter is unhappy (Error: OutputTextFile.prototype.dumpData_protected_: needs file) because it does not know where it needs to send its output. We'll fix that soon.

The area of interest is the list of granules that ran through the adapter network: we can see that we had two story granules, followed by a document granule.

Getting The Text Into A File

We'll now modify the personality-level config.ini a little bit so the text gets dumped into a file.

The sub-adapters used in the ViewExporter in Crawler look at a number of predefined context variables to determine what to do. One of these many predefined variables is called FILEPATH.

The OutputTextFile adapter will query the context for a variable called FILEPATH. If this variable is defined and contains a path to a file, then the OutputTextFile adapter will dump its output into that file.

We can set the FILEPATH variable in the 'global' app context for the application by means of the personality-level config.ini file.

Contexts are arranged in a hierarchy; the 'topmost' context is the app context which we can influence from the config.ini file.
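A rough way to picture such a context hierarchy is Python's ChainMap. The sketch below assumes that a lookup in a lower-level context falls through to the app context when the variable is not defined lower down; that behaviour, the lower-level contexts, and the DOCUMENT_NAME and STORY_INDEX variables are all assumptions made for illustration. Only FILEPATH comes from Crawler.

```python
from collections import ChainMap

# Topmost context: the app context, as it would be filled from config.ini.
app_context = {"FILEPATH": "~/Desktop/output.txt"}

# Hypothetical lower-level contexts layered on top of the app context.
document_context = ChainMap({"DOCUMENT_NAME": "TutorialTest.indd"}, app_context)
story_context = document_context.new_child({"STORY_INDEX": 0})


def resolve_output_path(context):
    """Return FILEPATH from the nearest context that defines it."""
    path = context.get("FILEPATH")
    if not path:
        # Mirrors the 'needs file' complaint seen in the log when no path is known.
        raise LookupError("needs file")
    return path


print(resolve_output_path(story_context))
# ~/Desktop/output.txt
```

In this model, a variable set in the app context is visible everywhere below it unless a lower-level context overrides it.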

Change the config.ini in the Tutorial directory so it becomes:

[main]

views = tutorialView

nesting = document/text.story

[appContextData]

FILEPATH = ~/Desktop/output.txt

[main:tutorialView]

fileSplitLevel = document
xmlEncode = 0
accepted = document, text.story

[flush:tutorialView]

document = text.story

The added entry FILEPATH = ... in the [appContextData] section defines a context variable FILEPATH.

This value is picked up by the OutputTextFile adapter inside the ViewExporter.

Run the Export.jsxbin again.

You should now end up with a file called output.txt on the desktop. This file will contain all the text extracted from the InDesign document.

Book Files

Now close all documents, and open just an InDesign book file.

Run the Export.jsxbin again. This time, you should end up with the collated text from all the book's documents.

You might ask how this could work given that the nesting in the config file does not mention book files - i.e. you might expect that you would have needed to change the config.ini to read:

[main]
...
nesting = book/document/text.story

This nesting string would read as:

  • start from a book
  • descend into the documents
  • descend into the text stories in each document

The nesting string book/document/text.story would also work for processing books, but it would not work for single documents. When processing a single document, the starting granule is a document, not a book, and the nesting string would require a book object before it starts 'digging in'.

The nesting string document/text.story works for both books and documents.

The underlying mechanism works as follows: when we run Export.jsxbin, a granule comes down the line, and it is either a book or a document. When it is a document, we get a match with the top-level nesting element, and the disassembly starts.

If it is a book, there is no matching granule type in the nesting string.

The ViewExporter will then try to automatically pick a default approach to disassemble.

For books, it will disassemble the book into its component documents, and process these documents one by one.

Each of these documents will then again be presented to the ViewExporter. These granules then match the starting entry of the nesting string document/text.story and the 'normal' document disassembly will take its course.
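The matching and defaulting steps described above can be modelled in a few lines of Python. This is a simplified sketch, not the ViewExporter's real logic: the Granule class and both functions are hypothetical, the default approach is reduced to 'break the granule into its children and present each one again', and the unmatched book granule itself is simply not emitted.

```python
from dataclasses import dataclass, field


@dataclass
class Granule:
    """Hypothetical stand-in for a Crawler granule (not the real class)."""
    kind: str
    label: str
    children: list = field(default_factory=list)


def descend(granule, levels):
    """Normal disassembly: follow the nesting levels, parts before the whole."""
    if len(levels) > 1:
        for child in granule.children:
            if child.kind == levels[1]:
                yield from descend(child, levels[1:])
    yield granule


def process(granule, nesting):
    """Present a granule to the (modelled) ViewExporter."""
    levels = nesting.split("/")
    if granule.kind == levels[0]:
        # The granule matches the start of the nesting string.
        yield from descend(granule, levels)
    else:
        # No match: default to breaking the granule apart and
        # presenting each component again.
        for child in granule.children:
            yield from process(child, nesting)


book = Granule("book", "Tutorial.indb", [
    Granule("document", "chapter1.indd", [Granule("text.story", "story A")]),
    Granule("document", "chapter2.indd", [Granule("text.story", "story B")]),
])

print([g.label for g in process(book, "document/text.story")])
# ['story A', 'chapter1.indd', 'story B', 'chapter2.indd']
```

In the same model, presenting a single document with the nesting string book/document/text.story yields nothing, because no granule ever matches the leading book element.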

Nesting String

Change the config.ini to look as follows:

[main]

views = tutorialView

nesting = document/text.word

[appContextData]

FILEPATH = ~/Desktop/output.txt

[main:tutorialView]

fileSplitLevel = document
xmlEncode = 0
accepted = document, text.word

[flush:tutorialView]

document = text.story

There are two changes: the nesting string now mentions text.word instead of text.story, and the accepted entry now lists text.word instead of text.story.

This setup will grab the input granule (either a document or a book), and peel it apart, layer by layer, until it reaches the individual words.

The 'defaulting' mechanism mentioned in the previous section also occurs each time we have unmatched elements in the nesting string.

When a document is disassembled, the ViewExporter will try to match the required granule type. For this example the granule type is 'text.word'; documents don't disassemble readily into words.

So, the defaulting mechanism kicks in, and the document is disassembled into frame granules. Some of these frames are text frames.

ViewExporter will now try to disassemble the frame granules to the text.word granule type. For image frames, that won't work. For text frames, that will work: ViewExporter knows how to grab the contents of a text frame and disassemble it into words.

This satisfies the nesting string which called for text.word granules.

As these granules come 'down the pipe', we need to reassemble them into some kind of output. That's where the accepted entry comes in: the tutorialView accepts document and text.word granules.

Furthermore, the flushing settings

Changing The Source Disassembly

Currently, the source document is disassembled in story order: the disassembler goes into the document, then enumerates all the stories:

[main]
...
nesting = document/text.story