This visual guide has been created for learning and teaching the underlying concepts of the cloudooo conversion server.
It is distributed under Creative Commons Share Alike Non Commercial.
Note: the examples cited in this document are of academic interest only and imply no relation of any kind between Cloudooo and the entities or organisations mentioned herein.
The current structure of cloudooo is partly inspired by oood, an openoffice conversion daemon originally authored by Bartek Gorny as part of the ERP5 document management system. Oood has been used successfully by Nexedi to convert large-scale document databases in an international organization. Through this experience, it gained reliability by allocating a pool of single-threaded openoffice conversion daemons and monitoring the activity of running daemons.
The Federal Fluminense Institute (IFF) and Nexedi decided to refactor oood in order to use it not only with ERP5 but also with Plone. Another purpose of the refactoring was to use modern standards such as WSGI and to introduce a component architecture so that conversion tools other than openoffice.org could be used. This has led to cloudooo, which is now the default conversion server of ERP5 and is designed to be used by the community as part of any CMS/ECM. A demo site, www.cloudooo.org, was created to show the conversion capabilities of the cloudooo daemon.
Cloudooo receives incoming conversion requests through XML-RPC or REST. Based on the mime type, a conversion handler is selected by the Mime Mapper and invoked. The converted result is returned to the requester. Current handlers are based on openoffice.org, ffmpeg and imagemagick. They support many office and multimedia file formats. The openoffice.org handlers use the monitoring facility so that a pool of headless openoffice servers can be allocated. Dead or faulty openoffice.org processes are killed automatically. Openoffice.org processes are also killed after a defined number of conversions.
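The handler selection step described above can be sketched as a lookup from mime type to handler. This is only an illustration under stated assumptions: the MimeMapper class shown here, its register and select methods, and the handler names "ooo", "ffmpeg" and "imagemagick" are hypothetical and are not taken from the cloudooo source code.

```python
class MimeMapper:
    """Hypothetical registry mapping a source mime type to a handler name."""

    def __init__(self):
        self._handlers = {}

    def register(self, mime_type, handler_name):
        # Accept either an exact mime type or a prefix ending in "/",
        # such as "video/", which matches every video subtype.
        self._handlers[mime_type] = handler_name

    def select(self, mime_type):
        # Prefer an exact match, then fall back to the longest matching prefix.
        if mime_type in self._handlers:
            return self._handlers[mime_type]
        candidates = [
            prefix for prefix in self._handlers
            if prefix.endswith("/") and mime_type.startswith(prefix)
        ]
        if not candidates:
            raise LookupError("no handler for %s" % mime_type)
        return self._handlers[max(candidates, key=len)]


mapper = MimeMapper()
mapper.register("application/vnd.oasis.opendocument.text", "ooo")
mapper.register("video/", "ffmpeg")
mapper.register("image/", "imagemagick")

print(mapper.select("video/ogg"))
print(mapper.select("image/png"))
```

A request for an unregistered type raises LookupError here; how the real server reports an unsupported type is not covered by this guide.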
Other handlers are simple wrappers for command line tools. Cloudooo manages the generation of temporary files for conversion and garbage collection of unused temporary files.
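A command line wrapper with temporary file management could look roughly like the sketch below. The function name convert_with_command and the build_command callback are assumptions made for illustration. A small Python one-liner stands in for a real tool such as ffmpeg or imagemagick so that the example runs anywhere.

```python
import os
import subprocess
import sys
import tempfile


def convert_with_command(data, source_ext, target_ext, build_command):
    """Run a command line tool on temporary files and return the result.

    build_command is a hypothetical callable taking (input_path, output_path)
    and returning the argument list to execute.
    """
    # The temporary directory and its content are removed when the block
    # exits, even if the command fails, so no unused file is left behind.
    with tempfile.TemporaryDirectory(prefix="conversion-") as workdir:
        input_path = os.path.join(workdir, "input." + source_ext)
        output_path = os.path.join(workdir, "output." + target_ext)
        with open(input_path, "wb") as f:
            f.write(data)
        subprocess.run(build_command(input_path, output_path), check=True)
        with open(output_path, "rb") as f:
            return f.read()


def uppercase_command(input_path, output_path):
    # Stand-in for a real conversion tool: upper-case the input file.
    script = (
        "import sys; "
        "open(sys.argv[2], 'wb').write(open(sys.argv[1], 'rb').read().upper())"
    )
    return [sys.executable, "-c", script, input_path, output_path]


result = convert_with_command(b"hello", "txt", "txt", uppercase_command)
print(result)
```

Using a dedicated directory per conversion keeps cleanup simple: removing the directory removes every intermediate file the tool may have created.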
All parameters of cloudooo are configured through a simple text-based configuration file.
Multiple cloudooo servers can be run in parallel and aggregated as if they were a single server by using standard load balancing tools such as HAProxy. Cloudooo as such does not handle load balancing. If load balancers such as HAProxy are used, sophisticated load balancing policies can be implemented. For example, a given group of servers can be selected based on HTTP request parameters.
Cloudooo defines interfaces for the service it provides as well as for the components it can support. The main interface is named IManager. It allows a client to convert a file (convertFile), extract metadata from a file (getFileMetadataItemList), update the metadata of a file (updateFileMetadata) and request the list of extensions which a file can be converted to (ex. .doc, .pdf, etc.).
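The shape of this interface can be sketched as an abstract base class. The three method names convertFile, getFileMetadataItemList and updateFileMetadata come from the text above. Everything else is an assumption: the argument lists, the name getAllowedExtensionList for the fourth operation, and the toy PlainTextManager implementation are only there to show how such an interface could be called.

```python
import html
from abc import ABC, abstractmethod


class IManager(ABC):
    """Sketch of the service interface. All signatures are assumptions."""

    @abstractmethod
    def convertFile(self, data, source_format, destination_format):
        """Return the file content converted to destination_format."""

    @abstractmethod
    def getFileMetadataItemList(self, data, source_format):
        """Return the explicit metadata of the file as (key, value) pairs."""

    @abstractmethod
    def updateFileMetadata(self, data, source_format, metadata):
        """Return the file content with the given metadata written into it."""

    @abstractmethod
    def getAllowedExtensionList(self, source_format):
        """Return the extensions the file can be converted to (assumed name)."""


class PlainTextManager(IManager):
    """Toy implementation that only knows how to turn text into HTML."""

    def convertFile(self, data, source_format, destination_format):
        if (source_format, destination_format) != ("txt", "html"):
            raise ValueError("unsupported conversion")
        text = data.decode("utf-8")
        return ("<pre>%s</pre>" % html.escape(text)).encode("utf-8")

    def getFileMetadataItemList(self, data, source_format):
        return []  # plain text carries no explicit metadata

    def updateFileMetadata(self, data, source_format, metadata):
        return data  # nothing to update in plain text

    def getAllowedExtensionList(self, source_format):
        return [".html"] if source_format == "txt" else []


manager = PlainTextManager()
converted = manager.convertFile(b"1 < 2", "txt", "html")
print(converted)
```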
Currently, the conversion interface only takes specified parameters. In the future, it should support more conversion parameters, so that it is possible to change the resolution of an image or access the nth image in a video. Such features are already supported in ERP5 through built-in conversion APIs which are meant to be moved out from ERP5 to Cloudooo.
Cloudooo defines an interface for handlers. This interface implements part of the IManager interface and delegates implementation either to command line invocations or to external web services. The interface consists of convert, getMetadata and setMetadata methods.
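As an illustration of the three handler methods, the sketch below implements convert, getMetadata and setMetadata for a made-up format: a JSON object holding a "meta" dictionary and a "body" string. The format, the class name JsonNoteHandler and the constructor arguments are assumptions. A real handler would delegate to a command line tool or a web service instead of working in memory.

```python
import json


class JsonNoteHandler:
    """Toy handler for a made-up format: a JSON object with 'meta' and 'body'."""

    def __init__(self, data, source_format):
        self.document = json.loads(data.decode("utf-8"))
        self.source_format = source_format

    def convert(self, destination_format):
        # Only one target is supported in this sketch: the plain text body.
        if destination_format != "txt":
            raise ValueError("unsupported destination: %s" % destination_format)
        return self.document["body"].encode("utf-8")

    def getMetadata(self):
        # Return a copy so callers cannot modify the document by accident.
        return dict(self.document["meta"])

    def setMetadata(self, metadata):
        # Merge the new values and return the updated file content.
        self.document["meta"].update(metadata)
        return json.dumps(self.document, sort_keys=True).encode("utf-8")


source = json.dumps({"meta": {"title": "Draft"}, "body": "Hello"}).encode("utf-8")
handler = JsonNoteHandler(source, "jsonnote")
text = handler.convert("txt")
updated = JsonNoteHandler(handler.setMetadata({"title": "Final"}), "jsonnote")
print(text, updated.getMetadata())
```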
The notion of metadata here relates to the title, creation date and other metadata properties which are explicitly defined inside certain file formats. ODF, for example, allows an arbitrary number of user-defined metadata properties. Exif defines various metadata properties including geolocation.
Setting explicit metadata on files can be useful for file interchange purposes.
The future architecture of Cloudooo will introduce three new components: granulator, classifier and normalizer.
The purpose of a granulator is to extract sub-content from a larger content. For example, a PDF file contains paragraphs, images and tables. Some contents are explicitly defined in the PDF through typesetting instructions. It is possible to extract a lot of structured content from a PDF file. Images may be converted to text through OCR (this is the role of the conversion Handler). Sophisticated OCRs can extract tables from the text content of an image. This is useful, for example, to extract the invoice lines from a scanned invoice.
The purpose of a classifier is to extract implicit metadata from a file. Implicit metadata can be the language (if not explicitly defined), the type of content (ex. poetry, marketing, etc.), the emotion displayed in a picture (ex. happy, sad), etc.
The purpose of a normalizer is to find a common language for metadata which is extracted from file content. For example, normalizing column names can help find equivalent columns from one table to another.
All three components may be used in relation with the conversion Handler. A preliminary conversion to a base format may be done automatically before calling the appropriate granulator component. Also, certain output of granulation (ex. images) may need additional conversion and resolution change. The same applies to classifier and normalizer in terms of input format.
The IGranulator interface provides APIs to extract tables from a document, to extract images from a document and to extract paragraphs from a document.
Implementation is normally based on an analysis of a base format (ex. ODT, HTML). Initial conversion to that base format may thus be required. Output of granulation is provided as standard XML-RPC output. For images, it is provided in any image format and can then be converted by a conversion handler.
Paragraphs can be extracted from a text document. Each paragraph is identified by an ID, which possibly should also exist as an HTML anchor in the HTML conversion of the document. Each paragraph has a class, which relates either to a CSS class or to a specific item such as the “Table Of Contents” (TOC) or the “Table of Images” (TOI). Paragraphs which play the role of chapters, sections and subsections can be listed.
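Paragraph extraction from an HTML base format can be sketched with the standard library HTML parser. The function name get_paragraph_list and the returned dictionary keys are assumptions, and so is the convention that heading paragraphs carry a class starting with "Heading". The future IGranulator API may look different.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the id, class and text of each <p> element of a document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            attributes = dict(attrs)
            self._current = {
                "id": attributes.get("id"),
                "class": attributes.get("class"),
                "text": "",
            }

    def handle_data(self, data):
        # Text inside nested inline tags such as <b> is kept as well.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "p" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.paragraphs.append(self._current)
            self._current = None


def get_paragraph_list(html_text):
    parser = ParagraphExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.paragraphs


sample = (
    '<p id="p1" class="Heading_1">Introduction</p>'
    '<p id="p2" class="Text_body">Cloudooo converts <b>files</b>.</p>'
)
paragraphs = get_paragraph_list(sample)
# List the paragraphs which play the role of chapters or sections.
chapters = [p["id"] for p in paragraphs if (p["class"] or "").startswith("Heading")]
print(paragraphs, chapters)
```

Because each paragraph keeps the id found in the HTML, the same id can be used as an anchor to link back to the paragraph in the converted document.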
The IClassifier interface provides a classification of text content, image content, etc. by returning a list of key-value pairs. This list defines implicit metadata which is derived from an analysis of the file content.
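The return shape of a classifier, a list of key-value pairs, can be illustrated with a deliberately naive language guesser. The function name classify_text, the keys "word_count" and "language", and the tiny stop word lists are assumptions. Counting stop words stands in for the machine learning a real classifier would use.

```python
# Tiny stop word lists, used only for this illustration.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to"},
    "fr": {"le", "la", "et", "est", "les"},
}


def classify_text(text):
    """Return implicit metadata for a text as a list of (key, value) pairs."""
    words = text.lower().split()
    # Score each language by the number of its stop words found in the text.
    scores = {
        language: sum(word in vocabulary for word in words)
        for language, vocabulary in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    result = [("word_count", len(words))]
    # Report a language only when at least one stop word was recognized.
    if scores[best] > 0:
        result.append(("language", best))
    return result


print(classify_text("The cat is on the roof"))
print(classify_text("Le chat est sur le toit"))
```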
A URL pointing to sample data, which defines a learning sample for machine learning software, may be provided as an option. This URL represents a file which can be downloaded. This file may contain either sample data or the connection information needed to connect to a Web Service and retrieve sample data.
It is up to the classifier to implement persistence so that sample data only needs to be downloaded from time to time. In between, the learning model should remain unchanged and persistent in RAM or on the filesystem.
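One possible way to implement this persistence is a small filesystem cache that downloads the sample again only when the cached copy is too old. The class name SampleCache, its parameters and the one hour maximum age are assumptions. The fake_fetch function stands in for an actual download of the sample URL so that the sketch runs without network access.

```python
import os
import tempfile
import time


class SampleCache:
    """Keep downloaded sample data on disk and refresh it only when stale."""

    def __init__(self, cache_path, fetch, max_age_seconds):
        self.cache_path = cache_path
        self.fetch = fetch  # callable returning the sample data as bytes
        self.max_age_seconds = max_age_seconds

    def load(self):
        if self._is_fresh():
            with open(self.cache_path, "rb") as f:
                return f.read()
        data = self.fetch()
        # Write under a temporary name first, then rename, so that a reader
        # never sees a partially written cache file.
        temporary_path = self.cache_path + ".tmp"
        with open(temporary_path, "wb") as f:
            f.write(data)
        os.replace(temporary_path, self.cache_path)
        return data

    def _is_fresh(self):
        try:
            age = time.time() - os.path.getmtime(self.cache_path)
        except OSError:
            return False  # no cached copy yet
        return age < self.max_age_seconds


calls = []


def fake_fetch():
    # Stand-in for downloading the sample URL.
    calls.append(1)
    return b"label,text\npoetry,roses are red\n"


cache_dir = tempfile.mkdtemp()
cache = SampleCache(os.path.join(cache_dir, "sample.csv"), fake_fetch, 3600)
first = cache.load()
second = cache.load()
print(len(calls))
```

The second call to load reads the file from disk instead of fetching again, which is the behaviour the paragraph above asks of a classifier.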
The INormalizer interface is similar to the IClassifier interface. It is initialized for machine learning in the same way. It returns a list of normalized column names or a list of normalized tag values. The concept of “normalized tag” refers to the idea that synonyms should be unified into the same value.
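The idea of unifying synonyms into the same value can be sketched with a fixed lookup table. The function name normalize_column_names and the content of the SYNONYMS table are assumptions. In the architecture described above, the table would be replaced by a model learned from sample data.

```python
# Hypothetical synonym table; a learned model would replace it in practice.
SYNONYMS = {
    "qty": "quantity",
    "nb": "quantity",
    "unit price": "price",
    "unit cost": "price",
    "desc": "description",
}


def normalize_column_names(column_names, synonyms=SYNONYMS):
    """Map each column name to its normalized form."""
    normalized = []
    for name in column_names:
        # Ignore case, underscores and repeated spaces before the lookup.
        key = " ".join(name.lower().replace("_", " ").split())
        # Unknown names are kept in their cleaned-up form.
        normalized.append(synonyms.get(key, key))
    return normalized


invoice_a = normalize_column_names(["Qty", "Unit_Price", "Description"])
invoice_b = normalize_column_names(["NB", "unit  cost", "desc"])
print(invoice_a, invoice_b)
```

After normalization, both tables expose the same column names, so equivalent columns can be matched from one table to the other.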