OCR Data « Pennsylvania Newspaper Archive

OCR Data

What is OCR?

Optical character recognition (OCR) is a fully automated process that converts images of letters and numbers into computer-readable text. Software can then search the OCR-generated text for words, phrases, numbers, or other characters. However, OCR is not 100 percent accurate. Particularly when the original item has extraneous markings on the page, unusual text styles, or very small fonts, the searchable text that OCR generates will contain errors that cannot be corrected by automated means.

Although errors in the process are unavoidable, OCR is still a powerful tool for making text-based items searchable. For example, important concept words often appear more than once within an article. Therefore, if OCR misreads one instance of a key word in a passage but correctly reads a second instance, the passage will still be found in a full-text search.

To enable research and external services, Open ONI provides bulk access to its OCR data. The table below lists the data files available for download. Each file decompresses into a directory structure that lets you easily map each OCR file to the URL identifier for its page. For example, a file such as sn830030214/1903/05/01/ed-1/seq-1/ocr.txt maps to the URL https://panewsarchive.psu.edu/lccn/sn830030214/1903-05-01/ed-1/seq-1/.
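The path-to-URL mapping described above can be sketched in a few lines of Python. This is only an illustration based on the single example given here: the function name ocr_path_to_url and the BASE_URL constant are hypothetical, and the sketch assumes every extracted path has exactly the seven components shown (LCCN, year, month, day, edition, sequence, file name).

```python
from pathlib import PurePosixPath

# Assumption: the site root taken from the example URL in the text above.
BASE_URL = "https://panewsarchive.psu.edu"


def ocr_path_to_url(ocr_path: str, base_url: str = BASE_URL) -> str:
    """Map an extracted OCR file path to the URL of the page it belongs to.

    Hypothetical helper: assumes the layout
    <lccn>/<year>/<month>/<day>/<edition>/<sequence>/<file name>.
    """
    parts = PurePosixPath(ocr_path).parts
    if len(parts) != 7:
        raise ValueError(f"unexpected OCR path layout: {ocr_path!r}")

    lccn, year, month, day, edition, sequence, _filename = parts

    # The date directories are joined with hyphens in the URL.
    return f"{base_url}/lccn/{lccn}/{year}-{month}-{day}/{edition}/{sequence}/"


if __name__ == "__main__":
    print(ocr_path_to_url("sn830030214/1903/05/01/ed-1/seq-1/ocr.txt"))
```

Paths that do not match the assumed layout raise a ValueError rather than producing a guessed URL, so unexpected files in a decompressed batch are easy to spot.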

If you are interested in automated access to this data, you may want to use the Atom and JSON versions of this table.

Filename	Batch	Created	Size	SHA-1 Checksum