For any Content Management System (CMS), managing content and its metadata is very important. Metadata is the set of properties associated with a piece of content that gives a particular document its identity and makes it more relevant. It also makes content easier to manage, because documents can be searched within the repository based on their metadata. Rules can be created in the repository that run actions on incoming documents, such as moving them to a relevant space or transforming them to another content type. In short, metadata is one of the most significant parts of any CMS.
The next question that comes to mind is: where does this metadata come from? Most types of documents already have some metadata associated with them. Common examples are title, description, author and mimetype, which are set by the editor in which the document was produced, such as MS Word or Excel.
How does Alfresco handle this basic metadata when a document is uploaded?
Alfresco ships with various metadata extractor classes that do this job. During extraction, they automatically read metadata from inbound and/or updated content and update the corresponding node's properties with the extracted values.
The default set of extractors is defined in the <configRoot>/alfresco/content-services-context.xml file, that is, <WEB-INF>/classes/alfresco/content-services-context.xml.
These are the extractors defined in that file:
<bean id="extracter.PDFBox" class="org.alfresco.repo.content.metadata.PdfBoxMetadataExtracter" parent="baseMetadataExtracter" />
<bean id="extracter.Office" class="org.alfresco.repo.content.metadata.OfficeMetadataExtracter" parent="baseMetadataExtracter" />
<bean id="extracter.Mail" class="org.alfresco.repo.content.metadata.MailMetadataExtracter" parent="baseMetadataExtracter" />
<bean id="extracter.Html" class="org.alfresco.repo.content.metadata.HtmlMetadataExtracter" parent="baseMetadataExtracter" />
<bean id="extracter.MP3" class="org.alfresco.repo.content.metadata.MP3MetadataExtracter" parent="baseMetadataExtracter" />
<bean id="extracter.OpenDocument" class="org.alfresco.repo.content.metadata.OpenDocumentMetadataExtracter" parent="baseMetadataExtracter" />
<bean id="extracter.OpenOffice" class="org.alfresco.repo.content.metadata.OpenOfficeMetadataExtracter" parent="baseMetadataExtracter">
    <property name="connection">
        <ref bean="openOfficeConnection" />
    </property>
</bean>
Most of the above extractors internally use the Apache Tika library to extract the meta-information. Once the metadata is extracted, it is associated with properties defined in Alfresco, based on the mapping provided with each metadata extractor class. We can also create our own metadata extractor class for document types that are not supported out of the box. We will learn about that in upcoming articles. Keep following.
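To make the mapping mentioned above a little more concrete, here is a minimal sketch of how one extractor bean's mapping might be overridden in Spring configuration. It assumes the extractor exposes the properties inheritDefaultMapping and mappingProperties, and that the raw keys author, title and subject are what the Office extractor produces. These names are assumptions, not taken from the bean list above, so verify them against your Alfresco version before relying on them.

```xml
<!-- Sketch only: property names and raw metadata keys are assumptions to verify -->
<bean id="extracter.Office"
      class="org.alfresco.repo.content.metadata.OfficeMetadataExtracter"
      parent="baseMetadataExtracter">
    <!-- Assumed setting: false replaces the built-in mapping instead of extending it -->
    <property name="inheritDefaultMapping">
        <value>false</value>
    </property>
    <!-- Assumed format: a namespace prefix declaration, then rawKey -> prefixed property -->
    <property name="mappingProperties">
        <props>
            <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
            <prop key="author">cm:author</prop>
            <prop key="title">cm:title</prop>
            <prop key="subject">cm:description</prop>
        </props>
    </property>
</bean>
```

With a mapping like this, a Word document whose author field is set would, under the assumptions above, end up with that value in the node's cm:author property. An override of this kind is usually better kept in a separate custom context file than made by editing the shipped content-services-context.xml.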