About Pennsylvania Newspaper Archive

Information about Pennsylvania's participation in the National Digital Newspaper Program

Phase I, 2008-2010

The Penn State University Libraries was awarded a $393,650 grant from the National Endowment for the Humanities to digitize historical Pennsylvania newspapers on microfilm, under the National Digital Newspaper Program (NDNP). The two-year grant covered the digitization of 103,409 pages of Pennsylvania newspapers published between 1880 and 1922, which were entered into the Library of Congress's historical newspaper database, Chronicling America: Historic American Newspapers (http://chroniclingamerica.loc.gov/). The database is accessible to all, and readers can search content and read, download, save and print articles and advertisements from the available publications.

Librarian L. Suzanne Kellerman, the Judith O. Sieg Chair for Preservation, said that digitizing selected newspapers that currently exist on microfilm will greatly widen access to these rich historical resources. Providing good quality images from historically significant Pennsylvania titles along with keyword searching of the content will open previously unknown research and scholarship opportunities for a publication type that has been under-utilized by researchers for years. Only within the last three to five years has technology allowed us to tap into published newspaper content. Providing access to the Commonwealth's rich newspaper heritage via the Chronicling America database will give researchers, school students and everyone else the opportunity to search newspaper content as never before.

Kellerman said Penn State worked with the State Library of Pennsylvania and the Free Library of Philadelphia to identify titles in their collections for the project. The search used U.S. census data to locate the 12 cities in Pennsylvania with the largest populations from 1880 to 1922. They were Allegheny, Allentown, Altoona, Erie, Harrisburg, Johnstown, Lancaster, Philadelphia, Pittsburgh, Reading, Scranton and Wilkes-Barre. From these cities, 48 publications were reviewed for initial consideration, with the final selection made by an advisory board of researchers, scholars, librarians and historians.

The selection was based on intellectual content (research value, geographic representation, and temporal coverage as evidenced by a long continuous run that includes the targeted time frame) and on the quality of the microfilm. After review and evaluation, four titles were selected for digitization: the Scranton Tribune, the Lancaster Daily Intelligencer, the Pittsburg Dispatch, and the Evening Public Ledger (Philadelphia).

Kellerman noted that this program was considerably different from the many digitization projects her department had completed in the past, as it required adhering to specific Library of Congress requirements and timelines. Staff from Penn State Libraries' Digitization and Preservation Department, Cataloging and Metadata Services, and Digital Library Technologies were involved in the project.

The NDNP is a long-term effort that aims to widen access to historical U.S. newspapers by providing content in digital format through the Library of Congress. Once completed, http://www.loc.gov/chroniclingamerica/ will contain historical newspaper resources from every U.S. state and territory.

Phase II, 2010-2012

Local history researchers can immerse themselves in the stories of workers, unions and businesses of the anthracite coal industry in the northeast, now that the Evening Herald, a Shenandoah, Pennsylvania newspaper, is available in digital format online. Eight years of the Evening Herald (1891 to 1899) have been uploaded to the online newspaper repository Chronicling America. All articles, advertisements and images can be viewed online, downloaded, saved and printed from any computer, offering researchers unparalleled access to this historic publication.

The Chronicling America website is produced by the National Digital Newspaper Program (NDNP), a partnership between the National Endowment for the Humanities (NEH) and the Library of Congress. An NEH award program funds digitization projects at the state level. The site is free for all to use, whether for school history assignments, genealogy projects or university research. Of added value to researchers is the rich level of metadata provided by the Library of Congress for every title, which makes it easier to find specific information.

Penn State's involvement in the project began in 2008, when the University was awarded a grant to digitize four newspapers from the time period 1880 to 1922—the Scranton Tribune, Pittsburg Dispatch, Lancaster Daily Intelligencer and Philadelphia’s Evening Public Ledger. An additional $393,489 was awarded under phase two, which covers the years 1836 to 1922. Project Manager for the Pennsylvania Digital Newspaper Project Karen Morrow said title selection depends on many factors, including geographical location, completeness of coverage and condition of the microfilm.

As part of the initial Phase II work, an analysis of the temporal, geographic, and historic event coverage of digitized titles revealed that over 400 Pennsylvania titles have been digitized and made accessible online at either free access or paid subscription sites by either public institutions or commercial vendors, underscoring the fact that Pennsylvania does not have a central repository for newspapers. The distribution of the digitized titles, however, is not even across the Commonwealth, which has 67 counties.

As a result, Phase II (2010-2012) title selection efforts focused on 17 Pennsylvania counties without any known digitized newspapers. In addition, several titles representing Columbia, Clearfield, and Venango counties and the city of Philadelphia were added upon the recommendation of the project’s collaborating institutions and the statewide Advisory Board. By the conclusion of the Phase II grant cycle on August 31, 2012, PaDNP had digitized 45 more titles amounting to 151,968 newspaper pages representing 20 of the Commonwealth’s 67 counties.

Collaborations with partner institutions in the state (the State Library of Pennsylvania, the Free Library of Philadelphia, Bloomsburg University Library and the Pennsylvania Historical and Museum Commission) facilitated access to the titles that were ultimately loaded into Chronicling America.

Phase III, 2012-2014

Penn State University Libraries' Digitization and Preservation Department received a $321,526 grant from the National Endowment for the Humanities (NEH) for the Pennsylvania Digital Newspaper Project, Phase III (PaDNP3), 2012-2014. As part of the National Digital Newspaper Program (NDNP), PaDNP3 digitized from microfilm 109,035 pages from 39 historic newspaper titles published between 1836 and 1922 and uploaded the items to Chronicling America, a free online repository of digitized newspapers from across the country maintained by the Library of Congress.

The PaDNP Advisory Board approved three categories of titles for digitization as follows:

  • Titles from four counties with very little or no digitization
  • Titles that represent the Commonwealth's German and Italian ethnic heritage
  • Titles that cover World War I and the Spanish influenza epidemic

Of Pennsylvania’s 67 counties, just three remained uncovered by any online free or paid-subscription digitized newspaper collections. Since Pennsylvania does not have a single, centralized newspaper repository, microfilmed newspaper collections reside with a variety of public, private, and commercial entities, as do the Commonwealth’s digitized collections. In 2010, a market survey by PaDNP staff identified all the digitized titles in both free and paid-subscription online collections with Pennsylvania content. As a result, project staff found that 20 counties did not have any digitized newspapers. The PaDNP Advisory Board recommended that we use Chronicling America to fill in the gaps so that every county has at least one digitized title wherever it may reside, whether on a free access or a paid subscription site.

During PaDNP, Phase II (2010-2012), project staff digitized titles from 17 of those counties. During Phase III (2012-2014), project staff ensured that at least one title was scanned from each of four counties: the three remaining counties without any digitization (Cameron, Sullivan, and Montour) plus one county with very little (Butler).

With Phase III grant funding, the Library of Congress made it possible for this cycle to consider non-English titles that reflect the Commonwealth’s ethnic heritage for inclusion in Chronicling America, which is now accepting German, Italian, French, and Spanish in addition to English language content. Early Colonial and subsequent pre-Civil War German immigration and the later 19th century Italian immigration added to the rich ethnic diversity of Pennsylvania’s newspaper publishing history. The PaDNP Advisory Board approved the digitization of six German language titles published from 1838 to 1918 in three different cities: Reading, Allentown, and Scranton. Three Italian titles from Philadelphia and one from Indiana that were published between 1914 and 1922 were also selected for digitization.

In preparation for the 100th anniversary of the beginning of World War I, our IFLA colleagues in Europe began digitizing newspapers from that era. The PaDNP Advisory Board decided to follow their lead and address the gaps on Chronicling America in Pennsylvania newspaper coverage of the war and the Spanish influenza epidemic (1914-1919).

A grant of $393,489 funded Phase II, 2010–2012, which digitized from microfilm 151,968 pages from 45 historic newspaper titles published between 1836 and 1922 representing 20 of Pennsylvania’s 67 counties. With a grant of $393,650, the earlier Phase I of the project digitized 103,409 pages from four titles published between 1880 and 1922 in four counties.

Since 1985, the Penn State University Libraries have participated in numerous external grant-supported and consortia-wide preservation activities for microfilming, preservation planning, digitization, cataloging, and access. With this most recent grant from NEH for PaDNP3, the University Libraries have received more than $2.2 million in funding for the preservation and digitization of Pennsylvania newspapers over the past 21 years.

Introduction to Open ONI

The Open Online Newspaper Initiative (Open ONI) provides access to information about digitized newspaper pages. To encourage a wide range of potential uses, we designed several different views of the data we provide, all of which are publicly visible. Each uses common Web protocols, and access is not restricted in any way. You do not need to apply for a special key to use them. Together they make up an extensive application programming interface (API) which you can use to explore all of our data in many ways.

Details about these interfaces are below. In case you want to dive right in, though, we use HTML link conventions to advertise the availability of these views. If you are a software developer or researcher or anyone else who might be interested in programmatic access to the data in Open ONI, we encourage you to look around the site, "view source" often, and follow where the different links take you to get started.

For more information about the open source Open ONI software please see the Open ONI/openoni GitHub site. Also, please consider subscribing to the chronam-users discussion list if you want to discuss how to use or extend the software or data from its APIs.

The API

Searching newspaper pages is possible via OpenSearch. This is advertised in a LINK header element of the site's HTML template as "Open ONI Page Search", using this OpenSearch Description document.

  • andtext: the search query
  • format: 'html' (default), or 'json', or 'atom' (optional)
  • page: for paging results (optional)

Examples:
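
As a minimal sketch of how the three parameters above combine into a request URL, the Python snippet below builds a search link with the standard library. The host name and the search path are assumptions made for illustration only; the real URL template is the one given in the site's OpenSearch Description document.

```python
from urllib.parse import urlencode

# Assumption: the host and the search path are placeholders, not confirmed
# values. Take the real template from the OpenSearch Description document.
BASE_URL = "https://newspapers.example.org"
SEARCH_PATH = "/search/pages/results/"


def build_search_url(andtext, fmt="html", page=None):
    """Return a page-search URL using the andtext, format and page parameters."""
    if fmt not in ("html", "json", "atom"):
        raise ValueError(f"unsupported format: {fmt!r}")
    params = {"andtext": andtext, "format": fmt}
    if page is not None:
        params["page"] = page  # optional: which page of results to fetch
    return f"{BASE_URL}{SEARCH_PATH}?{urlencode(params)}"


print(build_search_url("coal strike", fmt="json", page=2))
# https://newspapers.example.org/search/pages/results/?andtext=coal+strike&format=json&page=2
```

Leaving out `fmt` falls back to the documented default of 'html', and leaving out `page` omits that parameter from the query string.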

Open ONI uses links that follow a straightforward pattern. You can use this pattern to construct links into specific newspaper titles, to any of its available issues and their editions, and even to specific pages. These links can be readily bookmarked and shared on other sites.

We are committed to supporting this link pattern over time, so even if we change how the site works, we will redirect any requests to the system using this specific pattern.

The link pattern uses LCCNs, dates, issue numbers, edition numbers, and page sequence numbers.

Examples:
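
The following Python sketch shows one way such links could be assembled. The `/lccn/<LCCN>/` segment matches the title URLs that appear in the sample responses later on this page; the date, `ed-N` and `seq-N` segments are assumptions modeled on the Chronicling America convention, and the issue date used here is made up for illustration.

```python
from datetime import date


def title_url(base, lccn):
    """Link to a newspaper title, identified by its LCCN."""
    return f"{base.rstrip('/')}/lccn/{lccn}/"


def issue_url(base, lccn, issued, edition=1):
    """Link to one edition of one issue (assumed pattern: YYYY-MM-DD/ed-N/)."""
    return f"{title_url(base, lccn)}{issued.isoformat()}/ed-{edition}/"


def page_url(base, lccn, issued, edition, sequence):
    """Link to a single page by its sequence number (assumed pattern: seq-N/)."""
    return f"{issue_url(base, lccn, issued, edition)}seq-{sequence}/"


base = "http://chroniclingamerica.loc.gov"
print(title_url(base, "sn97063690"))
print(page_url(base, "sn97063690", date(1927, 5, 6), 1, 3))  # hypothetical issue date
```

Building each level from the one above it keeps the three link types consistent with one another.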

IIIF Views

In addition to the use of JSON in OpenSearch results, there are also IIIF Presentation API and Image API JSON views available for various resources. These IIIF views are typically linked from their HTML representation using the <link> element. For example:
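
As a rough sketch of what a request against an IIIF Image API service looks like, the Python helper below assembles a URL from the path segments defined by the IIIF specification (identifier, region, size, rotation, quality, format). The service base URL and the image identifier are hypothetical; in practice both come from the IIIF JSON that a page links to.

```python
from urllib.parse import quote


def iiif_image_url(service_base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}."""
    # Reserved characters in the identifier (including "/") must be percent-encoded.
    ident = quote(identifier, safe="")
    return (f"{service_base.rstrip('/')}/{ident}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


def iiif_info_url(service_base, identifier):
    """URL of the info.json document that describes the image."""
    return f"{service_base.rstrip('/')}/{quote(identifier, safe='')}/info.json"


# Hypothetical service and identifier, for illustration only.
print(iiif_image_url("https://iiif.example.org/images", "batch/page 0001.jp2"))
# https://iiif.example.org/images/batch%2Fpage%200001.jp2/full/max/0/default.jpg
```

The size keyword "max" is valid in Image API 2.1 and 3.0; a server that only implements an older version may expect "full" instead.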

Linked Data

Linked Data allows us to explicitly connect the information in Open ONI to related data on the Web. Open ONI provides several Linked Data views to make it easy to connect with other information resources and to process and analyze newspaper information with conceptual precision.

We use concepts like Title (defined in DCMI Metadata Terms) and Issue (defined in the Bibliographic Ontology) to describe newspaper titles and issues available in the data. Using these concepts, defined in existing ontologies, can help to ensure that what we mean by "title" and "issue" is consistent with the intent of other publishers of linked data.

These elements are used in RDF views of several types of pages, ranging from a list of the newspaper titles available on the site and information about each, to enumerations of all the pages that make up each issue and all of the files available for each page.

Examples:

Comparing the RDF versions of the links above with their HTML counterpart links, you might notice that the URI pattern we follow for these views is to remove the final slash, replacing it with ".rdf". We follow this pattern to comply with best practices for publishing linked data, and also to keep the URIs easy to understand and use.
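
The rule described above (drop the final slash, append ".rdf") is small enough to sketch in a few lines of Python. The function name is our own, and the example URL is a title link taken from the sample responses elsewhere on this page.

```python
def rdf_url(html_url):
    """Turn an HTML view URL into its RDF counterpart: final "/" becomes ".rdf"."""
    if not html_url.endswith("/"):
        raise ValueError("expected an HTML view URL ending in '/'")
    return html_url[:-1] + ".rdf"


print(rdf_url("http://chroniclingamerica.loc.gov/lccn/sn97063690/"))
# http://chroniclingamerica.loc.gov/lccn/sn97063690.rdf
```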

For each of the HTML pages with a linked data counterpart in RDF, we provide links to those alternate views from the HTML page using the LINK header element. This can support automating the process of using the RDF data in tools like bookmarklets, plugins, and scripts, and it also helps us to advertise the availability of the additional views. In many views, such as newspaper page images, we also provide LINK elements pointing to the various available files (image, text, OCR coordinate XML) for each available page or other potentially useful information. We encourage you to explore the entire site and to look for and use these LINK elements. Just follow your nose, and view the source.
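
To illustrate how a script might "follow its nose", the Python sketch below collects the LINK elements from a page and filters them by media type, using only the standard library. The sample markup is hypothetical and much simpler than a real page; the rel and type values a given page actually uses should be checked by viewing its source.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the attributes of every <link> element in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag == "link":
            self.links.append(dict(attrs))


def find_alternates(html, mime_type=None):
    """Return rel="alternate" links, optionally restricted to one media type."""
    parser = LinkCollector()
    parser.feed(html)
    return [
        link for link in parser.links
        if "alternate" in (link.get("rel") or "").split()
        and (mime_type is None or link.get("type") == mime_type)
    ]


# Hypothetical page head, for illustration only.
SAMPLE_HTML = """<html><head>
<link rel="stylesheet" href="/static/site.css">
<link rel="alternate" type="application/rdf+xml" href="/lccn/sn97063690.rdf">
<link rel="alternate" type="application/json" href="/lccn/sn97063690.json" />
</head><body></body></html>"""

for link in find_alternates(SAMPLE_HTML, "application/rdf+xml"):
    print(link["href"])  # /lccn/sn97063690.rdf
```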

In addition to the concepts described above, we use concepts from several other vocabularies in describing materials and also in linking to related data available on other sites. These additional vocabularies and external sites include:

We are grateful to all of these providers and we hope we can follow their lead in encouraging additional connections between data and vocabulary providers. Please be aware that how we use these vocabularies will likely change over time, as they continue to develop, and as new vocabularies are introduced.

Bulk Data

In certain situations the granular access provided by the API may be somewhat constraining. For example, perhaps you are a researcher who would like to try out new indexing techniques on the millions of pages of OCR data. Or perhaps you are a service provider and anticipate needing to support a high volume of full-text searches across the corpus, and do not want the Open ONI API as an external dependency.

To support these and other potential use cases we are beginning to provide bulk access to the underlying data sets. The initial bulk data sets include:

  • Batches: each batch of digitized content is made available via the Batches HTML, Atom and JSON views. These views provide links to where the files comprising the batch can be fetched with a web crawling tool like wget.
  • OCR Bulk Data: the complete set of OCR XML and text files that make up the newspaper collection are made available as compressed archive files. These files are listed in the OCR report, and are also made available via Atom and JSON feeds that will allow you to build automated workflows for updating your local collection.
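
An automated update workflow built on the Atom feeds could start from something like the Python sketch below, which lists the files in a feed that changed after the last local sync. It relies only on standard Atom elements (entry, link, updated); the sample feed, its file names and its URLs are hypothetical, and the real feeds may carry additional fields.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed content, for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>OCR bulk data (sample)</title>
  <entry>
    <title>part-000001.tar.bz2</title>
    <link href="https://newspapers.example.org/data/ocr/part-000001.tar.bz2"/>
    <updated>2012-09-01T00:00:00+00:00</updated>
  </entry>
  <entry>
    <title>part-000002.tar.bz2</title>
    <link href="https://newspapers.example.org/data/ocr/part-000002.tar.bz2"/>
    <updated>2012-11-15T00:00:00+00:00</updated>
  </entry>
</feed>"""


def entries_updated_since(feed_xml, since):
    """Return the link URLs of feed entries updated after `since` (timezone-aware)."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.findall(f"{ATOM}entry"):
        updated = datetime.fromisoformat(entry.findtext(f"{ATOM}updated"))
        link = entry.find(f"{ATOM}link")
        if link is not None and updated > since:
            fresh.append(link.get("href"))
    return fresh


last_sync = datetime(2012, 10, 1, tzinfo=timezone.utc)
print(entries_updated_since(SAMPLE_FEED, last_sync))
# ['https://newspapers.example.org/data/ocr/part-000002.tar.bz2']
```

The returned URLs can then be handed to a download tool such as wget, and the sync timestamp advanced once the files are stored.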

CORS and JSONP Support

To help you integrate Open ONI into your JavaScript applications, the OpenSearch and AutoSuggest JSON responses support both Cross-Origin Resource Sharing (CORS) and JSON with Padding (JSONP). CORS and JSONP allow your JavaScript applications to talk to services without the need to proxy the requests yourself.

CORS Example


curl -i 'http://chroniclingamerica.loc.gov/suggest/titles/?q=manh'

HTTP/1.1 200 OK
Date: Mon, 28 Mar 2011 19:45:34 GMT
Expires: Tue, 29 Mar 2011 19:45:37 GMT
ETag: "7d786bec2ca003d86009f8ccdfd72912"
Cache-Control: max-age=86400
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: X-Requested-With
Content-Length: 7045
Last-Modified: Mon, 28 Mar 2011 19:45:37 GMT
Content-Type: application/x-suggestions+json

[
  "manh",
    [   
      "Manhasset life. (Manhasset, N.Y.) 19??-19??",
      "Manhasset mail. (Manhasset, N.Y.) 1927-1986"
    ],
    [
      "sn97063690",
      "sn95071148"
    ],
    [
      "http://chroniclingamerica.loc.gov/lccn/sn97063690/",
      "http://chroniclingamerica.loc.gov/lccn/sn95071148/"
    ]
]

JSONP Example


curl -i 'http://chroniclingamerica.loc.gov/suggest/titles/?q=manh&callback=suggest'

HTTP/1.1 200 OK
Date: Mon, 28 Mar 2011 19:45:34 GMT
Expires: Tue, 29 Mar 2011 19:45:37 GMT
ETag: "7d786bec2ca003d86009f8ccdfd72912"
Cache-Control: max-age=86400
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: X-Requested-With
Content-Length: 7045
Last-Modified: Mon, 28 Mar 2011 19:45:37 GMT
Content-Type: application/x-suggestions+json

suggest([
  "manh",
    [   
      "Manhasset life. (Manhasset, N.Y.) 19??-19??",
      "Manhasset mail. (Manhasset, N.Y.) 1927-1986"
    ],
    [
      "sn97063690",
      "sn95071148"
    ],
    [
      "http://chroniclingamerica.loc.gov/lccn/sn97063690/",
      "http://chroniclingamerica.loc.gov/lccn/sn95071148/"
    ]
]);

CORS is arguably a more elegant solution, and is supported by most modern browsers. However, JSONP might be a better option if your application needs legacy browser support.
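
Outside the browser, the same responses can be consumed by any client. As a minimal sketch, the Python snippet below turns a suggestion response like the ones shown above into a list of records, and tolerates a JSONP wrapper if one is present. The function names are our own, and the four-part array layout (query, titles, LCCNs, URLs) is inferred from the sample output rather than from a formal specification.

```python
import json
import re


def strip_jsonp(body):
    """Remove a JSONP wrapper such as suggest(...); and return the bare JSON text."""
    match = re.fullmatch(r"\s*[A-Za-z_$][\w$.]*\((.*)\)\s*;?\s*", body, flags=re.DOTALL)
    return match.group(1) if match else body


def parse_suggestions(body):
    """Split a suggestion response into the query and a list of title records."""
    query, names, lccns, urls = json.loads(strip_jsonp(body))
    records = [
        {"title": name, "lccn": lccn, "url": url}
        for name, lccn, url in zip(names, lccns, urls)
    ]
    return query, records


sample = """suggest([
  "manh",
  ["Manhasset life. (Manhasset, N.Y.) 19??-19??"],
  ["sn97063690"],
  ["http://chroniclingamerica.loc.gov/lccn/sn97063690/"]
]);"""

query, records = parse_suggestions(sample)
print(query, records[0]["lccn"])  # manh sn97063690
```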
