Assembler
Latest revision as of 19:07, 29 December 2013
An assembler is an atomic adapter.
Assemblers accept granules via their input connection.
They then use some of these input granules to construct collated granules.
Typically, assemblers rely on the presence of certain 'trigger granules' in the input stream to decide when they have all the data needed to finish a constructed granule.
When the constructed granule is ready, it is released via the assembler's output connection.
Assemblers will often drop the smaller granules they used from the data flow, and only emit the newly constructed granules.
For example, an assembler could be collecting 'word granules' and stringing them together into a new 'word group' granule.
As time goes on, the assembler needs to know when the 'word group' under construction is complete. The presence in the input stream of some other type of granule (e.g. a 'text frame' granule or a 'paragraph' granule) will typically be the trigger to release the newly constructed 'word group' granule and get ready to construct the next one.
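The 'word group' behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the Crawler API: the `(kind, payload)` tuple representation of a granule, the function name `group_words`, and the kind names `"Word"`, `"WordGroup"`, `"Para"` and `"TextFrame"` are all assumptions made for the example. The sketch also assumes that the trigger granule itself is passed on after the constructed granule is released.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical granule representation: (kind, payload).
Granule = Tuple[str, str]


def group_words(
    granules: Iterable[Granule],
    triggers: Tuple[str, ...] = ("Para", "TextFrame"),
) -> Iterator[Granule]:
    """Collect 'Word' granules and release a 'WordGroup' on each trigger."""
    pending: List[str] = []
    for kind, payload in granules:
        if kind == "Word":
            # Word granules are consumed: they are not passed downstream.
            pending.append(payload)
        elif kind in triggers:
            # A trigger granule completes the group under construction.
            if pending:
                yield ("WordGroup", " ".join(pending))
                pending = []
            # Assumption: the trigger granule stays in the data flow.
            yield (kind, payload)
        else:
            # Any other granule passes through untouched.
            yield (kind, payload)
```

In this sketch, words that are never followed by a trigger granule are simply never released; a real assembler would need a policy for the end of the stream.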
In a typical Crawler workflow, a disassembler will normally only add to the data flow. It won't take granules away. In other words: the larger granules that are broken apart by disassemblers are not stripped away and remain part of the data flow.
Assemblers, on the other hand, do take granules away.
For example, when a disassembler breaks apart a 'paragraph' granule into a series of 'word' granules, the output of the disassembler will typically consist of a stream of word granules, followed by the original paragraph granule from which the word granules were extracted.
An assembler further down the data flow will ignore the content of such paragraph granules. Instead it will collect the word granules, and wait for the paragraph granule as a terminating trigger to signify the series of word granules is complete.
An example: consider the following data flow emitted by a disassembler further up the data flow:
Word: this
Word: is
Word: a
Word: paragraph
Para: this is a paragraph
Word: this
Word: is
Word: another
Word: paragraph
Para: this is another paragraph
TextFrame: pos (10, 20), width 20, height 80
An assembler might be set up to count the word granules, and emit a word count granule each time it is triggered by a paragraph granule.
This example assembler might convert the data flow into the following:
WordCount: 4
Para: this is a paragraph
WordCount: 4
Para: this is another paragraph
TextFrame: pos (10, 20), width 20, height 80
That is, the assembler has dropped the word granules, and it emits a new 'word count' granule each time it 'sees' a paragraph granule pass by.
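A word-counting assembler of this kind could look roughly like the following Python sketch. It is a minimal illustration under stated assumptions rather than actual Crawler code: granules are modelled as hypothetical `(kind, payload)` tuples, and `count_words` and the kind names are invented for the example.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical granule representation: (kind, payload).
Granule = Tuple[str, str]


def count_words(granules: Iterable[Granule]) -> Iterator[Granule]:
    """Drop 'Word' granules; emit a 'WordCount' before each 'Para'."""
    count = 0
    for kind, payload in granules:
        if kind == "Word":
            # Consume the word granule and remember that we saw it.
            count += 1
        elif kind == "Para":
            # The paragraph granule is the trigger: release the count,
            # reset for the next paragraph, then pass the trigger on.
            yield ("WordCount", str(count))
            count = 0
            yield (kind, payload)
        else:
            # Granules the assembler does not care about pass through.
            yield (kind, payload)


if __name__ == "__main__":
    flow = [
        ("Word", "this"), ("Word", "is"), ("Word", "a"),
        ("Word", "paragraph"), ("Para", "this is a paragraph"),
        ("TextFrame", "pos (10, 20), width 20, height 80"),
    ]
    for kind, payload in count_words(flow):
        print(f"{kind}: {payload}")
```

Writing the assembler as a generator keeps it streaming: it holds only the running count, not the word granules themselves, which matches the idea that the smaller granules are dropped from the data flow.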