Files and formats

Do: check which is the central data system.

Before you start work on a digitization project, you should clarify which system is to be used for data storage. Is there a library catalogue, an archive database or something comparable? Should the data be enriched in this system before digitization, or should any metadata be extracted and recorded solely within the workflow?

Do: conduct pre-checks on data received from partners.

If you plan to work with a partner, you should ideally request and conduct pre-checks on a relatively large body of data before commencing the project. Actual data provide the only reliable indication of whether all the project partners have the same understanding of the delivery format. The later you receive and pre-check the data, the greater the risk that a large volume of data could be produced in the wrong format and therefore be unusable.
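A pre-check of this kind can be automated in a few lines. The following Python sketch is only an illustration: the delivery convention it assumes (TIFF files named with eight-digit page numbers) and the function name `precheck_delivery` are hypothetical and would need to be replaced with whatever you have agreed with your partner.

```python
import re
from pathlib import PurePath

# Hypothetical delivery convention (an assumption for this example):
# eight-digit page numbers and TIFF masters, e.g. "00000001.tif".
NAME_PATTERN = re.compile(r"(\d{8})\.tif")

def precheck_delivery(filenames):
    """Return a list of human-readable problems found in a delivered batch."""
    problems = []
    page_numbers = []
    for name in filenames:
        match = NAME_PATTERN.fullmatch(PurePath(name).name)
        if match is None:
            problems.append(f"unexpected file name: {name}")
        else:
            page_numbers.append(int(match.group(1)))
    # Duplicates and gaps in the page sequence often point to scanning errors.
    page_numbers.sort()
    for previous, current in zip(page_numbers, page_numbers[1:]):
        if current == previous:
            problems.append(f"duplicate page number: {current}")
        elif current != previous + 1:
            problems.append(f"gap between pages {previous} and {current}")
    return problems
```

Running such a check on the first sample delivery, rather than on the final one, is what keeps the cost of a misunderstanding low.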

Do: make judicious use of the available options for extracting and recording metadata.

Your approach to extracting and recording metadata should be judicious and pragmatic. Researchers using your digital collection will appreciate being able to locate and open a particular chapter on the basis of descriptive metadata. By contrast, few users will be interested in the distinction between a title, formal title, uniform title, heading, other titles, title proper and an equivalence. Equally, most users will be happy with the author’s given name and family name, so there may well be no need to include additional name information such as titles, prefixes, numbers, nicknames, or generic names.

Do: publish your results.

Ideally, digitization results should be made publicly available. OAI-PMH and SRU are established interface standards that allow other users, harvesters and portals to make use of your data. External partners can also work with the data that you supply or enrich it using crowdsourcing platforms.
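To give an idea of what a harvester sends to an OAI-PMH interface, the sketch below builds a `ListRecords` request URL. The base URL and the set name are placeholders, and the helper `oai_list_records_url` is hypothetical; `verb`, `metadataPrefix` and `set` are request arguments defined by the protocol, and `oai_dc` is the metadata format that every OAI-PMH repository has to support.

```python
from urllib.parse import urlencode

def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access here)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional: restrict harvesting to one set of the repository.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"

# Placeholder endpoint, not a real repository.
print(oai_list_records_url("https://example.org/oai"))
```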

Do: store raw data whenever possible.

The raw data produced in the course of individual workflow steps should be stored whenever it is feasible and useful to do so. By way of example, the original OCR raw data format can be used at a later stage to provide further information to supplement that available from the converted data, e.g. information about the layout, fonts, languages, recognition accuracy rates, and statistics. Even the master digitized images can be used later on to extract further information.

Do: use appropriate data formats.

All data formats used in digitization projects should be chosen carefully. Wherever possible, digitized output and metadata should be stored using standardized file formats that are well-established and suitable for machine processing. In exceptional cases, if you need to use your own proprietary data format, you should ensure that your files are nevertheless machine-readable, e.g. by using an XML file structure.
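As a minimal sketch of the exceptional case, the following Python snippet writes an in-house record as XML using the standard library. The element names (`record`, `identifier`, `pages`) are invented for this example and are not part of any standard; the field names passed in must be valid XML element names.

```python
import xml.etree.ElementTree as ET

def record_to_xml(record):
    """Serialize a flat dictionary as a simple XML record.

    The keys of `record` are used as element names, so they must be
    valid XML names (no spaces, not starting with a digit).
    """
    root = ET.Element("record")
    for field, value in record.items():
        child = ET.SubElement(root, field)
        child.text = str(value)  # special characters are escaped on output
    return ET.tostring(root, encoding="unicode")

xml_text = record_to_xml({"identifier": "obj-001", "pages": 12})
# Any XML parser can read the result back without custom code.
parsed = ET.fromstring(xml_text)
```

The point of the design is that other systems can process the file with a standard XML parser, even though the vocabulary itself is your own.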

Do: validate your metadata.

As a rule, all metadata should undergo an automated validation process. This should be performed as early as possible in the workflow to avoid potential problems at a later stage when working with other technical systems. Many validation rules are simple to configure using regular expressions.
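A regular-expression check of this kind might look as follows in Python. The field names and the patterns are assumptions chosen for illustration (an identifier scheme, a year or ISO-style date, and a three-letter language code); your own rules will depend on your metadata schema.

```python
import re

# Hypothetical validation rules: one regular expression per metadata field.
RULES = {
    "identifier": re.compile(r"[a-z]{3}-\d{6}"),             # e.g. "abc-000123"
    "date_issued": re.compile(r"\d{4}(-\d{2}(-\d{2})?)?"),   # YYYY, YYYY-MM or YYYY-MM-DD
    "language": re.compile(r"[a-z]{3}"),                     # three-letter code, e.g. "ger"
}

def validate_record(record):
    """Return a list of validation errors; an empty list means the record passed."""
    errors = []
    for field, pattern in RULES.items():
        value = record.get(field)
        if value is None:
            errors.append(f"{field}: missing")
        elif not pattern.fullmatch(value):
            errors.append(f"{field}: invalid value {value!r}")
    return errors
```

Note that a pattern only checks the form of a value. The date pattern above would accept "1887-13", so rules of this kind complement, rather than replace, checks against the schema of the target system.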

Do: wait until your images have been cropped before performing OCR.

As a general rule, OCR should not be performed until all image modifications are complete: cropping, removal of black borders, trimming of the book-fold area, and deskewing or scaling. Otherwise there is a risk that the full text and the corresponding word coordinates (e.g. in the output ALTO files) will no longer match the modified image coordinates.
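The following sketch illustrates the mismatch for the simplest case, a plain crop. The numbers are invented; `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` are the position attributes that ALTO uses for a recognized word, and the function name is hypothetical. Deskewing or scaling would require a more complex transformation than the simple shift shown here.

```python
def word_box_after_crop(hpos, vpos, width, height, crop_left, crop_top):
    """Return the position of a word box after pixels were cropped
    from the left and top edges of the image."""
    return (hpos - crop_left, vpos - crop_top, width, height)

# OCR was run on the uncropped scan and reported a word at HPOS=400, VPOS=250.
ocr_box = (400, 250, 60, 20)

# The image was then cropped by 120 px on the left and 80 px at the top.
actual_box = word_box_after_crop(*ocr_box, crop_left=120, crop_top=80)

# The ALTO file still says (400, 250), but in the cropped image the word
# now sits at (280, 170): highlighting a search hit would mark the wrong area.
```

Running OCR after the image has reached its final form avoids having to correct coordinates in this way.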

Don’t: expect your end users to be metadata experts.

Don’t assume that the end users of your digitized material will have anything more than a superficial knowledge of metadata. The contents of your online digital collections do not need to meet the stringent requirements of a library catalogue. For most users, it is more than enough to provide a link from the digitized object to the full catalogue entry. This streamlined approach will save a great deal of workflow time otherwise spent manually extracting and recording metadata.

Don’t: make up new metadata standards.

Only existing and open standard formats should be used for descriptive metadata. Avoid creating your own new document types and metadata definitions. The MARC and MODS standards of the Library of Congress are generally adequate for most purposes. Under no circumstances should the end product of your digitization workflow be in a proprietary format. Wherever possible, you should work with established formats such as METS/MODS, LIDO and EAD for descriptive metadata and ALTO or TEI for full text.
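As an illustration of how little is needed for a usable descriptive record, the sketch below generates a minimal MODS fragment containing only a title and the author's given and family names. The function name `build_mods` and the sample values are hypothetical; the namespace and the elements `titleInfo`, `title`, `name` and `namePart` are taken from the MODS schema. A real record would normally carry more fields and should be validated against the schema.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

def build_mods(title, given_name, family_name):
    """Return a minimal MODS record (title and one personal name) as a string."""
    def tag(local_name):
        # ElementTree expects namespaced tags in the form "{namespace}name".
        return f"{{{MODS_NS}}}{local_name}"

    mods = ET.Element(tag("mods"))
    title_info = ET.SubElement(mods, tag("titleInfo"))
    ET.SubElement(title_info, tag("title")).text = title
    name = ET.SubElement(mods, tag("name"), {"type": "personal"})
    ET.SubElement(name, tag("namePart"), {"type": "given"}).text = given_name
    ET.SubElement(name, tag("namePart"), {"type": "family"}).text = family_name
    return ET.tostring(mods, encoding="unicode")

# Sample values for illustration only.
mods_xml = build_mods("Faust", "Johann Wolfgang", "Goethe")
```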

Dos and Don'ts for digitisation workflows
