As the name suggests, metadata extraction is the extraction of metadata, but we first need to understand what metadata is and how it is defined in a Content Management System (CMS).
The purpose of content is to hold data. In a typical Content Management System we attach properties to content that describe both the type of data the content holds and the content itself. We call these properties metadata, and they are an essential part of any CMS.
Metadata extraction is the process of extracting the basic properties and information about the data stored in the content and mapping them to a set of properties that the CMS recognizes. These properties are then indexed to support search. It is a significant process because it helps convert unstructured data into structured content, which is a main objective of any CMS.
When a Word document is uploaded to Alfresco, properties such as title, description, author, and size are extracted automatically and mapped to the Alfresco content model, which makes the document easily searchable by that metadata.
Similarly, for an image there may be a few extra properties we would like to extract, such as height, width, and resolution.
Alfresco ships with a set of built-in metadata extractors that do this job.
As the examples above show, different types of content require different sets of properties to be extracted, which is why Alfresco has various extractors.
What do metadata extractors consist of?
They work by leveraging existing open-source Java libraries such as Apache Tika, along with external OS processes such as ImageMagick and OpenOffice.org.
What are the steps involved in Metadata Extraction?
- Extraction is triggered when content is created or updated.
- The best available extractor is selected from the MetadataExtracterRegistry.
- The selected extractor pulls the metadata from the content.
- The extracted data is mapped to the Alfresco content model.
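The four steps above can be sketched in plain Java. This is only an illustration of the flow, not Alfresco's code: the `SketchExtractor` interface, the list used as a registry, and the toy `key=value` format are all hypothetical, and "best available" is reduced here to "first extractor that supports the MIME type". Only the target property names `cm:author` and `cm:title` come from the Alfresco content model.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: none of these types are Alfresco classes.
class ExtractionPipelineSketch {

    // Hypothetical stand-in for an extractor.
    interface SketchExtractor {
        boolean supports(String mimetype);
        Map<String, String> extractRaw(String content);
    }

    // Toy extractor that reads "key=value" lines from plain text.
    static class KeyValueExtractor implements SketchExtractor {
        @Override
        public boolean supports(String mimetype) {
            return "text/plain".equals(mimetype);
        }

        @Override
        public Map<String, String> extractRaw(String content) {
            Map<String, String> raw = new LinkedHashMap<>();
            for (String line : content.split("\n")) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    raw.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
                }
            }
            return raw;
        }
    }

    // A plain list plays the role of the registry in this sketch.
    static final List<SketchExtractor> REGISTRY =
            Collections.<SketchExtractor>singletonList(new KeyValueExtractor());

    // Mapping from raw keys to content model properties.
    static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("author", "cm:author");
        MAPPING.put("title", "cm:title");
    }

    // Step 1: called when content is created or updated.
    static Map<String, String> onContentChanged(String mimetype, String content) {
        for (SketchExtractor extractor : REGISTRY) {
            // Step 2 (simplified): take the first extractor that supports the type.
            if (!extractor.supports(mimetype)) {
                continue;
            }
            // Step 3: pull the raw metadata out of the content.
            Map<String, String> raw = extractor.extractRaw(content);
            // Step 4: keep only the keys that map to a model property.
            Map<String, String> mapped = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : raw.entrySet()) {
                String target = MAPPING.get(entry.getKey());
                if (target != null) {
                    mapped.put(target, entry.getValue());
                }
            }
            return mapped;
        }
        // No extractor supports this MIME type, so nothing is extracted.
        return Collections.emptyMap();
    }

    public static void main(String[] args) {
        String content = "author=Jane Doe\ntitle=Quarterly Report\npages=12";
        // prints {cm:author=Jane Doe, cm:title=Quarterly Report}
        System.out.println(onContentChanged("text/plain", content));
    }
}
```

Note that the unmapped `pages` value is simply dropped: in this sketch, only raw values with a mapping reach the content model.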
How does it decide which metadata extractor is best?
It considers various factors such as CPU usage, past performance, and in-memory processing.
Can we add new metadata extractors? If yes, how?
Yes, we can add custom metadata extractors based on our requirements, and we can even customize the existing extractors. The steps to achieve this are as follows:
- Customisation of existing extractors
  - Define new mappings, to an existing or a new content model.
- Adding new extractors
  - Identify a third-party library that can read the binary file, or write your own code to do this.
  - Extend AbstractMappingMetadataExtracter, or write a Tika plugin.
  - Define the metadata mappings.
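To make the "adding new extractors" route more concrete, here is a minimal sketch of the pattern in plain Java. It does not use the Alfresco API: `MappingExtractorBase` is a hypothetical, much simplified stand-in for the role AbstractMappingMetadataExtracter plays, the MIME type and the `Key: Value` header format are made up, and content is passed as a `String` rather than read from a binary file. The real class and its method signatures should be checked against the Alfresco Javadoc.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative only: a simplified model of the pattern, not Alfresco code.
class CustomExtractorSketch {

    // Hypothetical base class: owns the supported types and the mapping,
    // and leaves only the raw extraction to subclasses.
    abstract static class MappingExtractorBase {
        private final Set<String> supportedMimetypes;
        private final Map<String, String> mapping;

        MappingExtractorBase(Set<String> supportedMimetypes, Map<String, String> mapping) {
            this.supportedMimetypes = supportedMimetypes;
            this.mapping = mapping;
        }

        boolean isSupported(String mimetype) {
            return supportedMimetypes.contains(mimetype);
        }

        // The part a custom extractor writes: read the format, return raw values.
        protected abstract Map<String, String> extractRaw(String content);

        // The shared part: apply the mapping to whatever extractRaw returned.
        final Map<String, String> extract(String content) {
            Map<String, String> mapped = new LinkedHashMap<>();
            for (Map.Entry<String, String> entry : extractRaw(content).entrySet()) {
                String target = mapping.get(entry.getKey());
                if (target != null) {
                    mapped.put(target, entry.getValue());
                }
            }
            return mapped;
        }
    }

    // Custom extractor for a made-up format: "Key: Value" header lines,
    // terminated by the first blank line.
    static class NoteHeaderExtractor extends MappingExtractorBase {
        static final String MIMETYPE = "application/x-example-note"; // hypothetical type

        NoteHeaderExtractor(Map<String, String> mapping) {
            super(Collections.singleton(MIMETYPE), mapping);
        }

        @Override
        protected Map<String, String> extractRaw(String content) {
            Map<String, String> raw = new LinkedHashMap<>();
            for (String line : content.split("\n")) {
                if (line.trim().isEmpty()) {
                    break; // headers end at the first blank line
                }
                int colon = line.indexOf(':');
                if (colon > 0) {
                    raw.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
                }
            }
            return raw;
        }
    }

    static Map<String, String> demo() {
        // The mapping is supplied separately from the extraction code.
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("Author", "cm:author");
        mapping.put("Subject", "cm:description");
        NoteHeaderExtractor extractor = new NoteHeaderExtractor(mapping);
        return extractor.extract(
                "Author: Jane Doe\nSubject: Budget review\nRevision: 7\n\nBody: not a header");
    }

    public static void main(String[] args) {
        // prints {cm:author=Jane Doe, cm:description=Budget review}
        System.out.println(demo());
    }
}
```

Because the mapping is passed in rather than hard-coded, the "customisation" route from the list above amounts to constructing the same extractor with a different mapping, for example one that points `Revision` at a property in a new content model.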
Hope this gives you a clear idea of metadata extraction in Alfresco. Please feel free to ask any questions.