Examining Data Loss in Digitization Through Early Databases

The United States Army 6th Cavalry carded medical records were created in the pre-digital era; computers, digitization, and online transcription were not options available to the federal government, which could only transfer the data from one paper organizational structure to another. Digitization is the transfer of data to a digital organizational structure. Digitization and the process of creating the carded medical records share the fundamental characteristics of data translation. Both processes involve a change of medium, recontextualization of data, and application of the data to a new purpose. The success of digitization and other data transferal projects is also largely dependent upon the quality and thoroughness of the original dataset. If the process of compiling the original dataset resulted in the exclusion of some crucial data, then all subsequent projects involving that data will lack context and information. For these reasons, the kinds of information that are lost through digitization can be explored by examining what was lost in the process of creating the United States Army 6th Cavalry’s carded medical records.

There are many different reasons why any amount or kind of information may be left out of a dataset through the digitization process. Access to resources and concerns about the utility of certain details to a specialized project can both affect what gets included or excluded from a transcribed dataset.

The archival data explored here is a partial collection of United States Army 6th Cavalry carded medical records. According to the U.S. National Archives and Records Administration, the U.S. War Department’s Record and Pension Office created carded military medical records to aid in “the verification and approval process for pension applications.”^[1] The information was collected verbatim from military hospital registers, reports, and other sources transferred from the Office of the Surgeon General and then transcribed onto the cards.^[2] This dataset includes medical records for soldiers with last names beginning with “McCo” to “McCr.” The cards are made of thick, rectangular paper with a vertically oriented template printed on each one. The template starts with three blank spaces for the first two letters of the patient’s name, their cavalry, and the country of the hospital they were admitted to. Next come the patient’s name, rank, company, and regiment. Below are their diagnosis and the date of their admittance to the hospital. There is space for the hospital name, which was filled in with a stamp more often than not. The template lists several routes of discharge from the hospital, from which the person recording the information would choose based on the patient’s situation. The template also includes the first two digits of the year (“18”) printed wherever a year is required, with room to fill in the rest by hand. There is also a lined space for any additional marks to be written down, though it was usually left blank. The medical records end with the hospital’s number, register number, the card’s number in the set, and the signature of the copyist who transcribed it. The back of each card is stamped with a serial number and a date that likely denotes the card’s creation.
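The card template described above can be read as a fixed record structure. The following is only a minimal sketch of how one card might be modeled for transcription; every field name, the `complete_year` helper, and the `blank_fields` helper are assumptions made for illustration and are not taken from the National Archives or from any existing catalog schema. Values are kept as plain text because the cards were copied verbatim.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

# The template pre-prints the first two digits of every year.
PRINTED_YEAR_PREFIX = "18"


@dataclass
class CardedMedicalRecord:
    """One card. All field names are hypothetical labels for this sketch."""

    # Three blanks at the top of the template
    name_prefix: str
    cavalry: str
    hospital_country: str
    # Patient identification
    name: str
    rank: str
    company: str
    regiment: str
    # Medical entry
    diagnosis: str
    admitted: str
    hospital_name: str
    discharge_route: str
    # Lined space for additional marks, usually left blank
    additional_marks: Optional[str] = None
    # Closing lines of the card
    hospital_number: Optional[str] = None
    register_number: Optional[str] = None
    card_number: Optional[str] = None
    copyist: Optional[str] = None
    # Stamps on the back of the card
    serial_number: Optional[str] = None
    stamped_date: Optional[str] = None


def complete_year(handwritten_digits: str) -> int:
    """Join the printed "18" with the two digits filled in by hand."""
    digits = handwritten_digits.strip()
    if len(digits) != 2 or not digits.isdigit():
        raise ValueError(f"expected two handwritten digits, got {digits!r}")
    return int(PRINTED_YEAR_PREFIX + digits)


def blank_fields(record: CardedMedicalRecord) -> List[str]:
    """List the fields left empty, so gaps on a card stay visible."""
    return [
        f.name
        for f in fields(record)
        if getattr(record, f.name) in (None, "")
    ]
```

Under these assumptions, `complete_year("83")` returns `1883`, and `blank_fields` reports which parts of a card were never filled in, which keeps the omissions discussed in this essay explicit instead of silently dropping them.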

A lack of access to resources like complete records, specialized knowledge, and material support can affect what information is excluded from a dataset. This is made clear by the historical context of the carded medical records and the original hospital records they were transcribed from. Details like a soldier’s age, height, weight, and even the treatments administered to him and their effects are almost completely left out of the original hospital records. As a result, that information was also excluded from the carded records and is now unavailable. This omission suggests that the hospital staff creating the initial records did not consider those details important. It is likely due to the underdeveloped state of hospital administrative operations before the late 19th and early 20th centuries. In places like New York and Massachusetts, medical records were becoming more detailed by the late 1800s and early 1900s,^[3] but the cards in this dataset were mostly compiled from military hospitals in more rural states and American territories like Idaho, Utah, Arizona, and Texas. With many of the entries dating from the Civil War era and immediately after, hospital staff wouldn’t have had the resources or knowledge needed to create and update more detailed records or to understand the future utility of the information they left out. Even the latest entries are by and large from the 1880s, when innovations in medical records were still not widely implemented.^[4] Because that information was excluded from the beginning, the Pension Office could not have included it even if it had deemed the information relevant to its work. The omission of crucial information is not exclusive to past efforts of data collection and organization. The process of digitization likewise requires one to make decisions regarding the relevance and utility of the information held in a dataset. Digitizing all of the information in a dataset can be a good idea if historical preservation is the goal of the project. If it is not, though, including unnecessary data can be costly and inefficient.

Unlike the U.S. Census Bureau (the only federal agency primarily dedicated to the collection of data), most federal agencies collect data in service of their primary functions and sometimes fail to preserve data that other agencies and organizations could find useful.^[5] The medical treatment information could have been a useful primary source for researchers studying the differences in the diagnosis of illnesses, medicinal techniques, and the administration of specific treatments across regions. But the original records were not created with those possible future uses in mind. Given how little was recorded about the actual health of their patients, it is likely that hospitals started compiling patient medical records in the first place to assist with their basic administrative duties. Future or present scientific research, as well as the needs of other government agencies, would have been a distant concern compared to addressing present data organization issues within the American hospital system. The loss of detailed physical and treatment information will affect future digitization plans by limiting the scope of useful information and context provided to the public.

Another example of this type of omission can be found in New York state marriage licenses from the early 1900s. As seen in Franklin and Eleanor Roosevelt’s marriage license, the template only included a space for the groom’s employment information and not the bride’s. The missing category suggests that even if a woman was working before she got married, that work was not relevant to her marriage in the eyes of the state. While that information would have cost no more to obtain than the groom’s employment status, it was likely excluded from the license template because it was deemed to be of low utility. The preservation of historical data on women in the workforce could have provided additional context to the established employment patterns of that era as well as information about the intersection between women’s work at home and in public. In this case, digitizing the set of New York marriage licenses would not result in much loss. Unlike the carded medical records, the original licenses are still available. As long as they stay that way, digitization will preserve the original intention and historical context behind the dataset. Contrasting these two datasets emphasizes the importance of preserving original records.

Whether it be because of the limitations of a developing medical system or sexist attitudes toward women, crucial information can be lost or obscured by institutional shortcomings and the biases of those creating the datasets and collecting information for them. But because that information is left out, the broader patterns of inclusion and exclusion of certain kinds of data are illuminated. When the dataset is examined through the digitization process, which requires recontextualizing it into a modern framework and refitting it to a new purpose, the modern viewer gains a holistic perspective on the data in question.

When thinking about the digitization of these records, it’s important to note that the cards from the referenced dataset are only one portion of the full collection. The dataset is a part of the National Archives Record Group 94: Records of the Adjutant General’s Office, 1762-1984.^[6] It is the only part of the set available online; the rest of the carded medical records are contained within 896 letter archives boxes^[7] and have not been added to the online database. To complete the digitization, one would need to contact or visit the National Archives in Washington, D.C. to find the medical records for soldiers with last names not included in the “McCo” to “McCr” set.

Before the digitization process can begin in full force, a team of workers and resources must be allocated to the project. One would then need to sort through the cards for any potential duplicates, entries too damaged to read, and misfiled records. The medical records would also need to be sorted by last name in accordance with the Pension Office’s system if they aren’t in order already. The cards should then be digitally scanned in preparation for transcription. In order to preserve as much of the dataset’s integrity as possible, the digitized project should follow the same template as the cards, and the transcription should include both the exact wording of the cards and notes that explain the more confusing textual elements for lay viewers. Thus, the project would need to take the form of a clear scan of each card with an accompanying transcription of the text. The scanned records, transcripts, and their metadata (information about the data) may then be uploaded to the National Archives online catalog for public use. A search function should be included to make the dataset as accessible as possible. There are about 71 cards in the portion of the dataset available online, but they still need to be transcribed. Given the relatively small amount of information found on each card and the approximate size of the complete dataset, the whole process of digitization should take about a month. If the work is done according to a typical Monday-to-Friday, 9 a.m. to 5 p.m. schedule with a modest number of experienced employees, then four weeks (160 hours) should be sufficient. More time could be allotted to account for bureaucratic lag and technological issues. In the interest of providing a holistic view of the dataset, the finished product should also include some of the contextualizing information discussed above.
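The preparation steps and the time estimate above can be sketched in a few lines. This is an illustrative outline only, not a description of how the National Archives catalog works: the `CardEntry` fields, the rule that a duplicate is a card whose fields all match an earlier card, the simple substring search, and the function names are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class CardEntry:
    """A reduced, hypothetical view of one card for sorting and search."""

    surname: str
    given_name: str
    register_number: str
    transcription: str
    legible: bool = True


def prepare_for_scanning(
    cards: Iterable[CardEntry],
) -> Tuple[List[CardEntry], List[CardEntry]]:
    """Drop exact duplicates, set aside unreadable cards, sort the rest by last name."""
    seen = set()
    ready: List[CardEntry] = []
    set_aside: List[CardEntry] = []
    for card in cards:
        if card in seen:  # assumption: identical fields mean a duplicate card
            continue
        seen.add(card)
        (ready if card.legible else set_aside).append(card)
    ready.sort(key=lambda c: (c.surname.lower(), c.given_name.lower()))
    return ready, set_aside


def search(cards: Iterable[CardEntry], term: str) -> List[CardEntry]:
    """Case-insensitive match against the surname or the transcribed text."""
    needle = term.lower()
    return [
        c
        for c in cards
        if needle in c.surname.lower() or needle in c.transcription.lower()
    ]


def working_hours(weeks: int, days_per_week: int = 5, hours_per_day: int = 8) -> int:
    """Hours available on a Monday-to-Friday, 9 a.m. to 5 p.m. schedule."""
    return weeks * days_per_week * hours_per_day
```

With the default schedule, `working_hours(4)` gives the 160 hours cited above (4 weeks × 5 days × 8 hours). Cards that are set aside as unreadable are returned separately so they can be reviewed by hand instead of being discarded.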

The issue of data loss is inherent to digitization. It is also representative of fundamental questions about data transcription and translation across mediums. There are always potential losses through translation, be it of original intention, historical context, or real data. The conflict here lies in how one can reconcile the loss of data with the convenience and accessibility of digitization. Digitization can cause a loss of original context and intention. But it also creates a new frame of reference through which viewers gain insight into social and institutional effects on methods of organization, classification, and knowledge building. This project should preserve the integrity of the original archival dataset while also embracing the transformational process of digitization and the new perspectives it brings.

[1] “Carded Medical Records for Soldiers in the U.S. Army, 1821–1912.” National Archives and Records Administration, https://www.archives.gov/research/military/army/carded-medical-records.

[2] See footnote 1.

[3] Gillum, Richard F. “From Papyrus to the Electronic Tablet: A Brief History of the Clinical Medical Record with Lessons for the Digital Age.” The American Journal of Medicine 126, no. 10 (2013): 854. https://doi.org/10.1016/j.amjmed.2013.03.024.

[4] See footnote 3.

[5] Miller, Arthur R. The Assault on Privacy: Computers, Data Banks and Dossiers, 55. New York: New American Library, 1971.

[6] See footnote 1.

[7] See footnote 1.