Overview
Crawler is designed along principles similar to those of the [Data Flow Programming] paradigm.
Personality
One of the basic components in a Crawler-based system is called the 'Personality'.
A Personality is a Crawler component that takes input data in one form and processes it into output data in another form.
A few examples:
- InDesign-to-XHTML/CSS: takes in InDesign documents or books and outputs XHTML/CSS files.
- InDesign-to-EPUB: takes in InDesign documents or books, outputs EPUB.
- InDesign-to-Database input: takes in InDesign documents or books, and updates a database with information extracted from the document(s).
Personalities are constructed out of simpler elements.
A personality is composed of:
- a workflow network of interconnected processing units called 'Adapters'
- a set of configuration files
- a set of template files
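The composition listed above can be pictured as a small data structure. The following is only a minimal sketch in Python, not Crawler's actual API: the names Personality, Adapter, config_files, template_files and process are all hypothetical, and the adapter network is simplified to a linear chain for brevity.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Iterable, List

# Hypothetical: an adapter is modelled as a function from a stream of
# items to a stream of items. Crawler's real adapter interface may differ.
Adapter = Callable[[Iterable[str]], Iterable[str]]


@dataclass
class Personality:
    """Illustrative container for the three parts of a personality."""
    name: str
    adapters: List[Adapter] = field(default_factory=list)      # the workflow, simplified to a chain
    config_files: List[Path] = field(default_factory=list)     # configuration files
    template_files: List[Path] = field(default_factory=list)   # template files

    def process(self, data: Iterable[str]) -> List[str]:
        """Push the input through each adapter in turn."""
        for adapter in self.adapters:
            data = adapter(data)
        return list(data)


# Example: a toy personality that trims lines and drops empty ones.
toy = Personality(
    name="toy-cleanup",
    adapters=[
        lambda items: (s.strip() for s in items),
        lambda items: (s for s in items if s),
    ],
)
cleaned = toy.process(["  a ", "", "b"])
```

A real personality connects its adapters as a network rather than a single chain; the chain is used here only to keep the sketch short.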
Adapters
The input document(s) are pushed through the network of adapters provided by the personality.
These adapters then process the document, and take it apart into ever smaller chunks of data, or collate smaller chunks back into larger chunks. These chunks of data are called 'Granules'.
Some adapters take in larger granules and split them up into smaller granules (e.g. they might take in a paragraph granule and split it into individual word granules).
Some adapters collate smaller granules back into larger granules (e.g. they might take a number of word granules and concatenate them back into a paragraph granule). Other adapters perform some kind of processing on the granules they receive: they might change them in some way, discard them, create new granules based on previous granules, and so on.
Some adapters construct new granules based on a template snippet. For example, an adapter could take in raw text and combine it with a template snippet into an XML-formatted granule.
The general idea is that the input data is broken apart into smaller entities, and then these smaller entities are put back together again in a different shape.
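The split, process and collate pattern described above can be sketched as a chain of Python generators. This is an illustration under stated assumptions, not Crawler's implementation: the Granule class, the "paragraph-end" marker granule and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Granule:
    """Hypothetical shape of a granule: a kind tag plus some text."""
    kind: str   # e.g. "paragraph" or "word"
    text: str


def split_paragraphs(granules: Iterable[Granule]) -> Iterator[Granule]:
    """Splitting adapter: turn each paragraph granule into word granules."""
    for g in granules:
        if g.kind == "paragraph":
            for word in g.text.split():
                yield Granule("word", word)
            # Assumed marker so a later adapter knows where the paragraph ended.
            yield Granule("paragraph-end", "")
        else:
            yield g


def uppercase_words(granules: Iterable[Granule]) -> Iterator[Granule]:
    """Processing adapter: change word granules, pass everything else through."""
    for g in granules:
        yield Granule("word", g.text.upper()) if g.kind == "word" else g


def collate_words(granules: Iterable[Granule]) -> Iterator[Granule]:
    """Collating adapter: join word granules back into paragraph granules."""
    words: List[str] = []
    for g in granules:
        if g.kind == "word":
            words.append(g.text)
        elif g.kind == "paragraph-end":
            yield Granule("paragraph", " ".join(words))
            words = []
        else:
            yield g
    if words:  # flush words that were not followed by an end marker
        yield Granule("paragraph", " ".join(words))


def run_pipeline(text: str) -> List[Granule]:
    """Connect the three adapters and push one paragraph through them."""
    source = [Granule("paragraph", text)]
    return list(collate_words(uppercase_words(split_paragraphs(source))))
```

Calling run_pipeline("hello small world") yields a single paragraph granule whose text has been reassembled from the processed word granules.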
Template Files
Template files in Crawler come in many types. One of the most common types is the 'snippet template'. These are template files that generate a snippet of text.
The template file is simply a text file with a .snippet file name extension.
Inside the template file, there is a mix of boilerplate text and placeholders. For example, a template named maindoc.xhtml.snippet could contain:
<html>
$HEAD$
$BODY$
</html>
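To illustrate how placeholders such as $HEAD$ and $BODY$ might be filled in, here is a minimal Python sketch. It assumes a placeholder is a name wrapped in dollar signs, as in the sample above; the function name fill_snippet, the regular expression and the handling of unknown placeholders are assumptions for illustration, not Crawler's documented behaviour.

```python
import re
from typing import Dict

# Assumed placeholder syntax: a name between two dollar signs, e.g. $HEAD$.
PLACEHOLDER = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)\$")


def fill_snippet(template: str, values: Dict[str, str]) -> str:
    """Replace each $NAME$ placeholder with values["NAME"] in a single pass."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        # Assumption: unknown placeholders are left untouched.
        return values.get(name, match.group(0))

    return PLACEHOLDER.sub(substitute, template)


# The sample template from the text, held in a string instead of a file.
MAINDOC = "<html>\n$HEAD$\n$BODY$\n</html>\n"

result = fill_snippet(
    MAINDOC,
    {
        "HEAD": "<head><title>Demo</title></head>",
        "BODY": "<body><p>Hello</p></body>",
    },
)
```

Because the substitution is a single pass, placeholder-like text inside an inserted value is not expanded again.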