Digitization Process
Edited by Kaarina Mikalson
New to the digitization process? Here is a basic rundown of the process, including important tips about scanning, image size and format, and file naming and management.
Step 1: Find an archival copy of your text
Perhaps you already have a manuscript, first edition, or bundle of artefacts you wish to preserve. If not, you will have to find the physical text itself. Depending on your text, this may be very easy or very difficult, very cheap or very expensive. Beware of facsimiles: considering the time and effort you will put into digitizing the text, it should be the real thing.
- Check libraries, special collections and archives that may contain the text you wish to preserve. Striking up a friendship with a librarian or archivist is always a good idea. If they are reluctant to give you access to the text, explain to them how your digital project will benefit them and the rare text: digitizing a text preserves it and makes it more accessible.
- Search websites like AbeBooks, Amazon, and eBay regularly to see if someone is selling the text. Your desired artefact could be anywhere, even a shed in rural Quebec.
Step 2: Find scanning equipment
Depending on your equipment, scanning can be quick and easy or a painful, drawn-out process.
- Digital cameras are easy to find. However, it may be difficult to set up the camera and get a good image. Try to get hold of a tripod to steady your camera. You can lay a sheet of glass or clear plastic over your text to hold it open, but this may also reflect light and spoil your photograph.
- Scanners are the preferred equipment, and there is a huge variety to choose from. That basic flatbed scanner attached to your friend’s computer might work just fine, but it might not give you the quality you are looking for. Check local libraries to see what kind of equipment they have. High-quality scanners are incredibly expensive, but if you can get access to one, your scanning process will be efficient and effective.
- Keep in mind the size of your artefact. If your text is larger than 8.5 x 11 inches, you will want a scanner that can accommodate its size. If you choose to scan your page in sections, you will have to reassemble the scanned images using an image manipulation program.
Step 3: Scan your text (please read Step 4: Name your files before you scan, as the two steps are done together)
Before you scan your images, you must decide what format and size the images will be. You should only have to scan once, so make sure that you are scanning the best possible way.
- Scan your images as TIFF (.tif or .tiff) files. These will be easy to manipulate in programs like Photoshop, easy to copy and back up, easy to upload to the web, and easy to convert to other formats, such as PDF.
- Scan your files in good quality, at least 300 dpi (dots per inch), though 600 dpi is preferable. These will be large files, but you can always make them smaller later. If you scan at a low resolution, enlarging the images later cannot recover detail that was never captured. If you do resize your images later, make sure you keep a copy of the original 300 dpi or 600 dpi images. Even if you are not planning on working with images this size, you will want them on hand for future use or emergencies (Oops, I just deleted all my 72 dpi files! Good thing I kept those larger ones, now I can just resize those and create a whole new batch of 72 dpi files).
- Scan your files as consistently as possible, so that each page of your digital edition has the same colouring, quality, etc. It’s a good idea to write down your scanner settings before you start. If the settings get changed or your scanning takes more than one session, you can ensure you are scanning with the same settings.
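To get a feel for how large these scans will be, the arithmetic can be sketched as follows. This is only a rough estimate: it assumes a letter-size page (8.5 x 11 inches), 24-bit colour (3 bytes per pixel) and no compression, none of which is prescribed by this guide, and the function name scan_size is made up for the illustration.

```python
def scan_size(width_in, height_in, dpi, bytes_per_pixel=3):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page.

    bytes_per_pixel=3 assumes 24-bit RGB colour; this is an assumption,
    not a setting taken from the guide.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px * bytes_per_pixel


# Compare the two resolutions recommended above for a letter-size page.
for dpi in (300, 600):
    w, h, size = scan_size(8.5, 11, dpi)
    print(f"{dpi} dpi: {w} x {h} pixels, about {size / 1_000_000:.0f} MB uncompressed")
```

Doubling the resolution quadruples the number of pixels, so under these assumptions a 600 dpi page takes roughly four times the space of a 300 dpi page. Plan your storage accordingly before you start.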
Step 4: Name your files
Before you scan, carefully think through a system for naming your files. Think ahead: when you are months into your project, and have four versions of every scanned image, you do not want to go through the tedious process of renaming all your files. So don’t be hasty.
- Come up with a consistent and intuitive root for your file names. If you are scanning a novel by Mordecai Richler, it might be safe to name your files Richler_1 and so on. If you are scanning multiple novels by Richler, you will have to be more specific: Richler_Apprenticeship_1. If you are scanning multiple versions, editions, or manuscripts of the same novel, you will have to be even more specific: Richler_Apprenticeship1959_1.
- Be careful about numbers. When your files are sitting in your folders, they will most likely be organized by name. This is also the order in which automatic processes such as resizing, OCR (optical character recognition), uploading, or ingesting into the EMiC Commons will most likely handle them. You will want them ordered consecutively, 1, 2, 3, 4, 5, 6, so that they process consecutively.
Note: If you are scanning the pages in sets of two, that is, with the book or magazine open on the scanner, you will probably be splitting your images during image manipulation and saving them by individual page. When you split the images, do not save them as .5, _5, or any variation. This will scramble the order of your files, as the computer will organize them as 1.5, 1, 2.5, 2, 3.5, 3. If you know you will be splitting your images, number the two-page scans with odd numbers only (1, 3, 5, 7). When you split the files, one page keeps the odd number and the newly created page takes the even number that follows it.
- Think about how many pages/files you will have. If you are scanning 60 pages, ensure that you have enough digits in your file name: Richler_01 rather than Richler_1. This keeps your file names consistent and keeps them in order when sorted by name; without the leading zero, Richler_10 sorts ahead of Richler_2. If you are scanning 60 images which will then be split into 120 pages, ensure that you include 3 digits in your file name: Richler_001.
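The naming rules above can be sketched in a few lines of Python. The helper names (digits_needed, file_name, spread_names) and the .tif extension are illustrative choices rather than part of any EMiC tool; the sketch simply shows why zero-padding matters and how odd-numbered two-page scans map onto individual pages.

```python
def digits_needed(total_pages):
    """Number of digits required so every page number has the same width."""
    return len(str(total_pages))


def file_name(root, number, total_pages, ext="tif"):
    """Build a zero-padded name such as Richler_007.tif."""
    width = digits_needed(total_pages)
    return f"{root}_{number:0{width}d}.{ext}"


def spread_names(root, spread_index, total_pages):
    """Names for the two pages cut from the spread_index-th two-page scan.

    Spreads are counted from 1. The scan itself takes the odd number,
    which the first page keeps; the second page takes the next even number.
    """
    odd = 2 * spread_index - 1
    return (file_name(root, odd, total_pages),
            file_name(root, odd + 1, total_pages))


# Without padding, sorting by name puts page 10 ahead of page 2.
unpadded = sorted(f"Richler_{n}.tif" for n in (1, 2, 10))
padded = sorted(file_name("Richler", n, 120) for n in (1, 2, 10))
print(unpadded)
print(padded)
print(spread_names("Richler", 3, 120))
```

Passing the final page count (120 here, for 60 two-page scans) rather than the number of scans is what guarantees enough digits once the images are split.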
Step 5: Back up your files
You now have your raw scanned files, and you don’t want to lose them. But no matter how careful you are, bad things happen to good files. Make sure you back them up.
- Copy them onto a USB key (if they fit), an external hard drive, into Dropbox (a free file hosting service that offers cloud storage and file synchronization), or anywhere you like, so long as a copy exists outside of the dangerous, vulnerable world of your computer.
- Copy your files into a new folder on your computer. This will be your working folder, full of images you can open up and mark up, resize, crop, convert. The original folder of scanned images should remain completely raw and untouched, waiting in pristine condition in case of some awful emergency. Accidentally copied and pasted a cat into Mordecai Richler’s manuscript and saved it? Thank goodness you have that raw, cat-free file to go back to. Just copy that raw file to create a brand new working file, and you can stop thinking up ways of incorporating Lolcats into your critical edition of Richler’s life and work.
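If you are comfortable running a script, the working copy can be created and checked automatically. The following is a minimal sketch using only the Python standard library; the function names are made up for the example, and the checksum comparison is an extra precaution that this guide does not require.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(raw_dir, working_dir):
    """Copy raw_dir to a new working_dir and return files that failed to verify.

    copytree refuses to overwrite an existing working_dir, which protects
    edits you have already made there.
    """
    raw_dir, working_dir = Path(raw_dir), Path(working_dir)
    shutil.copytree(raw_dir, working_dir)
    mismatched = []
    for original in sorted(raw_dir.rglob("*")):
        if original.is_file():
            copy = working_dir / original.relative_to(raw_dir)
            # A differing checksum means the copy is not identical to the raw scan.
            if sha256_of(original) != sha256_of(copy):
                mismatched.append(str(original))
    return mismatched
```

Calling make_working_copy("scans_raw", "scans_working") (both folder names are hypothetical) should return an empty list when every file was copied intact. The raw folder is only ever read, never modified.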
Step 6: Image manipulation
You tried to scan nice, consistent, good quality images. But they aren’t perfect. Image manipulation will give you a chance to prepare them for the next stages of the process.
- Find an image manipulation program. Adobe Photoshop is a great program if you can get hold of it (say, on your university library’s computers), or you can commit to buying it yourself; students get a major discount. Alternatively, GIMP is a free image manipulation program. Download it from the GIMP website: http://www.gimp.org/
- Open your working files in the image manipulation program and decide what you want to fix. Crop off those black spaces around the text, brighten the images so you can read them, rotate them, or split them into separate pages (but make sure you name them carefully; see Step 4). If you are going to be running them through OCR (optical character recognition), which is a step in the ingesting process of the EMiC Commons, you will want to make sure the text is as straight as possible. You can ensure this by rotating the images by a few degrees clockwise or counterclockwise until the text is nice and straight.
- Once you have edited the images, you may want to resize them, especially if you have scanned them at 600 dpi. Photoshop allows you to automate the resizing process, so you can run through a whole batch of images quickly rather than resizing them all individually.
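Before running a batch resize, it helps to know what pixel dimensions to expect. The sketch below works them out from the source and target resolutions; the function name is hypothetical, and the example figures assume a letter-size page scanned at 600 dpi. The actual resampling would still be done in Photoshop, GIMP, or another image tool.

```python
def resized_dimensions(width_px, height_px, source_dpi, target_dpi):
    """Pixel dimensions after resampling a scan from source_dpi to target_dpi."""
    if target_dpi > source_dpi:
        # Enlarging cannot add detail that the scan never captured.
        raise ValueError("target_dpi is higher than source_dpi; rescan instead")
    # Multiply before dividing to keep the arithmetic exact for whole numbers.
    new_width = round(width_px * target_dpi / source_dpi)
    new_height = round(height_px * target_dpi / source_dpi)
    return new_width, new_height


# A letter-size page scanned at 600 dpi is 5100 x 6600 pixels.
for target in (300, 72):
    print(target, "dpi:", resized_dimensions(5100, 6600, 600, target))
```

Always run a resize like this on the working copies, never on the raw scans, so that the full-resolution originals remain available.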