What is Metadata Embedding?
Extraction of metadata from binary files is a critical task for enterprise content and digital asset management systems. The information contained in those files can aid in searching, workflows, and user interface visualizations.
Alfresco does a fantastic job of handling metadata extraction through its concept of MetadataExtracters registering themselves in the MetadataExtracterRegistry, and its use of the Apache Tika project to power many of those extractors means a huge number of file formats and metadata standards are supported.
We ingest a binary file, metadata is extracted and mapped to Alfresco data model properties, and we can view and edit those properties in an interface like Alfresco Share.
In some cases it’s important to get those property changes or other required fields back into the binary file as metadata. You might, for example, want to set the author metadata in a document or set copyright info in images before sending them outside of your organization.
In 4.2.c we introduced the concept of metadata embedders, which are essentially the inverse of MetadataExtracters, and are responsible for writing properties into content.
How Does it Work?
The MetadataEmbedder interface has just two methods.
Rather than create an entirely separate registry for embedders, the MetadataExtracterRegistry was extended with a getEmbedder(String sourceMimetype) method. Note that currently only embedders which are also extractors can be registered, but in the future support may be added for explicitly registering embedders. You’d usually implement both in the same class anyway. Speaking of…
AbstractMappingMetadataExtracter now implements the MetadataEmbedder interface and contains:
supportedEmbedMimetypes collection that’s used in the
embedMapping that defines the mapping from Alfresco properties to metadata fields
embedInternal method to be overridden by extending classes
Again, just the reverse of the extraction pattern.
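To make the pattern concrete, here is a minimal sketch of how a registry lookup, a supported-mimetype check, an embed mapping and an overridable embedInternal could fit together. Every type below (Extracter, Embedder, MappingBase, Registry, ToyTextHandler) is a simplified stand-in invented for illustration, with strings and byte arrays in place of Alfresco's own types; it is not the actual Alfresco API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Illustrative stand-ins only: none of these are the real Alfresco classes.
final class EmbedderPatternSketch {

    /** Marker for anything that can be registered (assumption: only extracters can). */
    interface Extracter { }

    /** Stand-in for an embedder: writes properties into content. */
    interface Embedder {
        boolean isEmbeddingSupported(String mimetype);
        byte[] embed(Map<String, String> properties, byte[] content);
    }

    /** Stand-in for a mapping extracter base class that also embeds. */
    abstract static class MappingBase implements Extracter, Embedder {
        private final Set<String> supportedEmbedMimetypes;
        private final Map<String, String> embedMapping; // property name -> metadata field

        MappingBase(Set<String> supportedEmbedMimetypes, Map<String, String> embedMapping) {
            this.supportedEmbedMimetypes = supportedEmbedMimetypes;
            this.embedMapping = embedMapping;
        }

        @Override
        public boolean isEmbeddingSupported(String mimetype) {
            return supportedEmbedMimetypes.contains(mimetype);
        }

        @Override
        public final byte[] embed(Map<String, String> properties, byte[] content) {
            // Translate property names to metadata field names, dropping unmapped ones.
            Map<String, String> fields = new LinkedHashMap<>();
            for (Map.Entry<String, String> property : properties.entrySet()) {
                String field = embedMapping.get(property.getKey());
                if (field != null) {
                    fields.put(field, property.getValue());
                }
            }
            return embedInternal(fields, content);
        }

        /** Format-specific writing is left to subclasses. */
        protected abstract byte[] embedInternal(Map<String, String> metadataFields, byte[] content);
    }

    /** Registry that finds embedders among its registered extracters. */
    static final class Registry {
        private final List<Extracter> extracters = new ArrayList<>();

        void register(Extracter extracter) {
            extracters.add(extracter);
        }

        Optional<Embedder> getEmbedder(String sourceMimetype) {
            for (Extracter candidate : extracters) {
                if (candidate instanceof Embedder
                        && ((Embedder) candidate).isEmbeddingSupported(sourceMimetype)) {
                    return Optional.of((Embedder) candidate);
                }
            }
            return Optional.empty();
        }
    }

    /** Toy subclass that "embeds" by appending the fields to plain text. */
    static final class ToyTextHandler extends MappingBase {
        ToyTextHandler() {
            super(Collections.singleton("text/plain"),
                  Collections.singletonMap("cm:author", "author"));
        }

        @Override
        protected byte[] embedInternal(Map<String, String> metadataFields, byte[] content) {
            String text = new String(content, StandardCharsets.UTF_8) + "\n" + metadataFields;
            return text.getBytes(StandardCharsets.UTF_8);
        }
    }
}
```

The sketch returns an Optional from getEmbedder purely as a convenience for the example; how the real registry signals "no embedder for this mimetype" is not something the sketch tries to reproduce.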
For classes extending AbstractMappingMetadataExtracter, the embed mapping can be defined in a properties file in the same location as the extract mapping properties but with an embed suffix, e.g. classpath:/x/y/z/MyExtracter.embed.properties (note that the preferred location for mapping files for extractors and embedders has changed after 4.2.c, see ALF-17891). If no embed properties are found, a reverse mapping of the extract mapping is used by default. Cool, right?
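The reverse-mapping fallback can be sketched in a few lines. This is only an illustration of the idea, assuming the extract mapping goes from a metadata field name to a set of property names and that a missing embed mapping is represented by null; the class and method names are hypothetical and plain strings stand in for Alfresco's property names.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Alfresco: shows the fallback logic only.
final class EmbedMappingSketch {

    /** Inverts "metadata field -> properties" into "property -> metadata fields". */
    static Map<String, Set<String>> reverse(Map<String, Set<String>> extractMapping) {
        Map<String, Set<String>> embedMapping = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : extractMapping.entrySet()) {
            String metadataField = entry.getKey();
            for (String property : entry.getValue()) {
                embedMapping
                        .computeIfAbsent(property, key -> new LinkedHashSet<>())
                        .add(metadataField);
            }
        }
        return embedMapping;
    }

    /** Uses the explicit embed mapping when one was found, else the reversed extract mapping. */
    static Map<String, Set<String>> resolveEmbedMapping(
            Map<String, Set<String>> explicitEmbedMapping,
            Map<String, Set<String>> extractMapping) {
        return explicitEmbedMapping != null ? explicitEmbedMapping : reverse(extractMapping);
    }
}
```

One consequence worth noticing: if a single metadata field is extracted into several properties, the reversed mapping sends each of those properties back to the same field, so an explicit embed mapping is the safer choice when that ambiguity matters.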
What About Tika?
“But that’s still sooooo… abstract. How are we going to leverage Tika? It doesn’t support embedding, does it?”
Well, as a matter of fact, it does, as of version 1.3 (TIKA-775).
The same notion of writing metadata into a binary has been outlined with an interface and basic implementation in Tika, so of course our TikaPoweredMetadataExtracter builds on that. It overrides the embedInternal method defined in its parent AbstractMappingMetadataExtracter to convert Alfresco properties to Tika metadata fields and passes them on to a Tika embed method, which then passes back the new binary with the metadata embedded.
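As a rough sketch of that hand-off, the example below flattens already-mapped values into string lists and streams the original bytes through a delegate that writes the new binary. StreamEmbedder is a hypothetical stand-in for a library embedder, not Tika's actual interface, and byte arrays replace Alfresco's content readers and writers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the delegate interface is an assumption, not a real library type.
final class DelegatingEmbedSketch {

    /** Stand-in for a library embedder: reads the original, writes the new binary. */
    interface StreamEmbedder {
        void embed(Map<String, List<String>> metadata, InputStream original, OutputStream target)
                throws IOException;
    }

    /** Converts mapped values to string lists, skipping nulls and handling multi-valued fields. */
    static Map<String, List<String>> toStringMetadata(Map<String, Serializable> mapped) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, Serializable> entry : mapped.entrySet()) {
            Serializable value = entry.getValue();
            if (value == null) {
                continue;
            }
            List<String> values = new ArrayList<>();
            if (value instanceof Collection) {
                for (Object item : (Collection<?>) value) {
                    if (item != null) {
                        values.add(String.valueOf(item));
                    }
                }
            } else {
                values.add(String.valueOf(value));
            }
            if (!values.isEmpty()) {
                metadata.put(entry.getKey(), values);
            }
        }
        return metadata;
    }

    /** Converts the values, delegates the write, and returns the new binary. */
    static byte[] embedInternal(Map<String, Serializable> mapped, byte[] original,
                                StreamEmbedder delegate) {
        try (InputStream in = new ByteArrayInputStream(original);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            delegate.embed(toStringMetadata(mapped), in, out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the value conversion separate from the delegate call means the same conversion can be reused with any embedder that accepts string metadata.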
How Can We Use Embedding?
Our shiny new Alfresco metadata embedder’s embed method isn’t very useful if we don’t have an easy way to call it, so we’ve added a ContentMetadataEmbedder action executor which shows up as a standard ‘Embed properties as metadata in content’ action that can be used in a rule on a folder or executed in a workflow. (After 4.2.c you can find this in
So what kinds of files and metadata does Tika have embed support for? Truth be told, not many at the moment, but the tika-exiftool project does!
The Media Management module contains an example which brings all of this together with an extension of TikaPoweredMetadataExtracter that uses the Tika Embedder defined in the tika-exiftool project to enable IPTC embedding in image files.
We can add an embed rule to a folder that fires on content update such that when we edit our caption field through Share, the new value is embedded in the file and can be seen using standard image metadata tools, like Photoshop’s file info.
Sit down and stop clapping, everyone is staring at you. Aw, who cares, go ahead.
We’ll be adding embed support for more file and metadata types to Tika and Alfresco in the future including, of course, documents, but in the meantime, what other formats are you anxious to start embedding?