Assembler

From DocDataFlow

An assembler is an atomic adapter.

Assemblers accept granules via their input connection.

They then use some of these input granules to construct collated granules.

Typically, assemblers rely on the presence of certain 'trigger granules' in the input stream to decide when they have all the data needed to finish a constructed granule.

When the constructed granule is ready, it is released via the assembler's output connection.

Assemblers will often drop the smaller granules they used from the data flow, and only emit the newly constructed granules.

For example, an assembler could collect 'word granules' and string them together into a new 'word group' granule.

As time goes on, the assembler needs to know when the 'word group' under construction is complete. The presence in the input stream of some other type of granule (e.g. a 'text frame' granule or a 'paragraph' granule) will typically be the trigger to release the newly constructed 'word group' granule and get ready to construct the next one.
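
The 'word group' behaviour described above can be sketched in a few lines of Python. This is only an illustration and not the Crawler API: the (kind, value) tuple representation of a granule, the kind names and the function name assemble_word_groups are all assumptions made for this example.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical granule representation: a (kind, value) pair.
Granule = Tuple[str, str]

# Assumed set of granule kinds that act as triggers.
TRIGGER_KINDS = {"Para", "TextFrame"}


def assemble_word_groups(granules: Iterable[Granule]) -> Iterator[Granule]:
    """Collect Word granules and release a WordGroup when a trigger arrives."""
    pending: List[str] = []
    for kind, value in granules:
        if kind == "Word":
            # Word granules are collected and dropped from the data flow.
            pending.append(value)
        elif kind in TRIGGER_KINDS:
            if pending:
                # Release the constructed granule, then start a new one.
                yield ("WordGroup", " ".join(pending))
                pending = []
            # The trigger granule itself passes through unchanged.
            yield (kind, value)
        else:
            # Any other granule is passed through untouched.
            yield (kind, value)


flow = [("Word", "hello"), ("Word", "world"), ("Para", "hello world")]
groups = list(assemble_word_groups(flow))
```

In this sketch a trigger that arrives while no words are pending releases nothing; a real assembler might choose differently.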

In a typical Crawler workflow, a disassembler will normally only add to the data flow. It won't take granules away. In other words: the larger granules that are broken apart by disassemblers are not stripped away and remain part of the data flow.

Assemblers, on the other hand, do take granules away.

For example, when a disassembler breaks apart a 'paragraph' granule into a series of 'word' granules, the output of the disassembler will typically consist of a stream of word granules, followed by the original paragraph granule from which the word granules were extracted.
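
As a rough sketch of this disassembler behaviour, the following Python generator emits one word granule per word and then re-emits the original paragraph granule. The tuple representation and the name disassemble_paragraph are assumptions for illustration only, and splitting on whitespace is a simplification.

```python
from typing import Iterator, Tuple

# Hypothetical granule representation: a (kind, value) pair.
Granule = Tuple[str, str]


def disassemble_paragraph(paragraph: Granule) -> Iterator[Granule]:
    """Emit the words of a paragraph, followed by the paragraph itself."""
    _kind, text = paragraph
    for word in text.split():
        yield ("Word", word)
    # The larger granule is not stripped away: it stays in the data flow.
    yield paragraph


emitted = list(disassemble_paragraph(("Para", "this is a paragraph")))
```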

An assembler further down the data flow will ignore the content of such paragraph granules. Instead it will collect the word granules, and wait for the paragraph granule as a terminating trigger to signify the series of word granules is complete.

An example: consider the following data flow emitted by a disassembler further up the data flow:

Word: this
Word: is
Word: a
Word: paragraph
Para: this is a paragraph
Word: this
Word: is
Word: another
Word: paragraph
Para: this is another paragraph
TextFrame: pos (10, 20), width 20, height 80

An assembler might be set up to count the word granules, and emit a word count granule each time it is triggered by a paragraph granule.

This example assembler might convert the data flow into the following:

WordCount: 4
Para: this is a paragraph
WordCount: 4
Para: this is another paragraph
TextFrame: pos (10, 20), width 20, height 80

That is, it has dropped the word granules, and emits a new 'word count' granule each time it 'sees' a paragraph granule pass by. The paragraph and text frame granules pass through unchanged.
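
A minimal Python sketch of such a word-counting assembler follows. It is not the Crawler API: the (kind, value) tuples and the name count_words are hypothetical. The input is the data flow listed earlier, in which each of the two paragraphs is preceded by four word granules.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical granule representation: a (kind, value) pair.
Granule = Tuple[str, object]


def count_words(granules: Iterable[Granule]) -> Iterator[Granule]:
    """Drop Word granules; emit a WordCount before each Para trigger."""
    count = 0
    for kind, value in granules:
        if kind == "Word":
            count += 1  # counted, then dropped from the data flow
        elif kind == "Para":
            yield ("WordCount", count)  # release the constructed granule
            count = 0                   # get ready for the next paragraph
            yield (kind, value)         # the trigger itself passes through
        else:
            yield (kind, value)         # e.g. TextFrame: passed through


flow = [
    ("Word", "this"), ("Word", "is"), ("Word", "a"), ("Word", "paragraph"),
    ("Para", "this is a paragraph"),
    ("Word", "this"), ("Word", "is"), ("Word", "another"), ("Word", "paragraph"),
    ("Para", "this is another paragraph"),
    ("TextFrame", "pos (10, 20), width 20, height 80"),
]
result = list(count_words(flow))
for kind, value in result:
    print(f"{kind}: {value}")
```

Only the paragraph granule acts as a trigger here, so the text frame granule at the end passes through without producing a word count.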