Bad Data Online: Problems and Solutions
By Donn Devine, CG, CGI
There’s no question that some of the genealogical information available on the Internet is either incorrect or unreliable. But whether it’s a problem for individual researchers depends entirely on how they use it, whether they use it at all, and how they make that decision.
It remains, however, a problem for the genealogical community because the bad data continues to proliferate. Recent threads on several mailing lists suggest that there’s some genealogical equivalent to Gresham’s law of economics at work: bad data drives out good data just as debased coinage drives good money from circulation. No matter how long ago a correction for a particular error may have appeared in print or online, it never seems to catch up with the ever-widening distribution of the error.
It’s not a problem caused by digital technology, or limited to it. The same thing happened in print long before the Internet. Multi-volume series like Frederick Virkus’s Compendium of American Genealogy and John S. Wurts’s Magna Charta were standard fare in many public libraries. They published lineages submitted by their subscribers that were as unreliable as any found online. But who would suspect an expensive reference selected by their trusted librarian?
I was taken in myself years ago when I found a line in Virkus’s Compendium, submitted by my grandfather’s cousin, that extended our common ancestry back to the seventeenth century. I spent six months trying to find out more about the people named in her undocumented list before I realized that there were no reliable sources to support a number of her claims. Some of the same misinformation had also appeared in anonymous contributions to the Boston Transcript “Genealogical Column,” which from 1906 to 1941 played a role in the genealogy of its day like that of today’s Web. The columns have been published in microform and are indexed in the American Genealogical and Biographical Index, popularly called the Rider Index and available online. I still don’t know whether my cousin was the uncritical contributor or the gullible victim, but access to the misinformation continues to proliferate, thanks to the marvels of modern technology.
Much of the unreliable data available, both online and on CD-ROM, is in compiled pedigrees or genealogies contributed by enthusiastic but uncritical volunteers, eager to help others and proud of how many spaces they have filled on their charts. The errors often appear in many different pedigrees, carefully copied from one unreliable source to another. Correcting those errors published on disc isn’t possible, and the large number of websites propagating the same errors makes their correction unfeasible. However, we can easily identify information that isn’t soundly based, and still use it for whatever help it might offer in suggesting new directions for our own research.
Weeding Out Worthless Data
The first indicator of bad data is the lack of source information. Without knowing where it came from, we can’t be confident in either its original accuracy or the faithfulness with which it has been passed along before reaching us.
If a source is given, but we don’t know the quality (e.g., an author or pedigree contributor whose work and reputation aren’t known to us), we will also need to look further. If we find that the source is a thoroughly researched and well-documented compiled genealogy, we may be inclined to accept it uncritically, but even in this case we need to be cautious. One online pedigree cites its source as a genealogy I had written and published in a respected research journal, but the Web pedigree didn’t include corrections I made to it that were published over the following ten years in the same journal. The online pedigree continues to perpetuate the errors I made originally, even though the corrections are as widely available as my original article.
We can accept any information or data item with some confidence when 1) the source is given for it, 2) further investigation shows the source is reliable, and 3) the date of a record is close to the event, or the date of a compiled source is recent enough to include the latest corrections and critical comment. As with any genealogical finding, however, if new evidence comes along, we will have to reconsider our earlier acceptance no matter how convincing that available evidence may have been.
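The three conditions above can be read as a simple checklist. The Python sketch below is only one illustrative way to express it: the DataItem fields, the is_acceptable function, and the five-year limit for "close to the event" are all assumptions made for the example, not part of any genealogical standard.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold for "close to the event"; a researcher would choose
# a value to suit the kind of record involved.
MAX_RECORD_GAP_YEARS = 5


@dataclass
class DataItem:
    source: Optional[str]                 # citation, or None when no source is given
    source_verified: bool                 # did further investigation show the source reliable?
    is_original_record: bool              # True for a record, False for a compiled source
    years_from_event: Optional[int] = None     # records: gap between event and record date
    includes_latest_corrections: bool = False  # compiled sources: recent enough?


def is_acceptable(item: DataItem) -> bool:
    """Return True when the item meets all three conditions."""
    # 1) A source must be given.
    if not item.source:
        return False
    # 2) Further investigation must show the source is reliable.
    if not item.source_verified:
        return False
    # 3) A record must be dated close to the event; a compiled source
    #    must be recent enough to include the latest corrections.
    if item.is_original_record:
        return (item.years_from_event is not None
                and item.years_from_event <= MAX_RECORD_GAP_YEARS)
    return item.includes_latest_corrections
```

Even an item that passes this check is accepted only provisionally, since new evidence can reopen the question at any time.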
Causes of Data Errors
We usually find that errors in compiled genealogies, or in the lineage-linked databases produced by genealogy software programs and often shared through GEDCOM files, result from one of four causes:
1. An event was reported with errors regarding date, location, participants, or circumstances.
2. A name or event was attributed to the wrong individual.
3. A relationship or other status was erroneously reported or concluded.
4. A record was misread or misinterpreted.
To verify that none of these causes have affected a particular item of information, we must determine how the original source learned of it—by direct observation or by deduction from other knowledge. If we find the original source believable, then we must judge whether the information has been reliably transmitted through whatever derivative sources it passed (e.g., other persons, records, books, transcriptions, abstracts, copies) before it got to us. Information or data that passes these two tests is of high quality and likely to be reliable.
Quality of Purchased Data
Up to this point, we have been considering unreliable information given ever-wider distribution by well-meaning but uncritical volunteers, and how we can avoid being misled by their failings. It’s another matter when there are errors or omissions in material that has been indexed, transcribed, abstracted, or imaged by commercial publishers or database services for sale to their customers or subscribers. This has been the most serious concern in the recent mailing list threads mentioned earlier. The complaint is not with the errors themselves (everyone recognizes that they will occur in any human endeavor) but with the low priority given to quality control in the initial production of publications and databases and, in online services, to correcting problems identified by subscribers.
With publications in fixed formats such as books and CD-ROMs, little can be done until a new edition is released, but often new editions appear without corrections. With online services, technology allows instant correction, but often the priority is quantity and expansion over quality and improvement—for corporations and customers. Some websites allow users to attach “sticky notes” to entries, which helps in warning of problem entries, but where an omission has been made there’s nothing to attach the warning to.
At a minimum, database services should provide organized errata pages, with numerous references and links to them. The pages would display subscribers’ notifications of errors, omissions, and corrections organized by database title and page or other location designator. The subscriber notices would remain until the appropriate correction could be made in the database itself.
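As a rough illustration of how such an errata page might be organized, the Python sketch below groups open subscriber notices by database title and then by location. The ErrataNotice fields and the build_errata_page function are hypothetical names invented for this example; no actual database service is being described.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ErrataNotice:
    database_title: str
    location: str          # page number or other location designator
    kind: str              # "error", "omission", or "correction"
    note: str              # the subscriber's description of the problem
    resolved: bool = False # True once the database itself has been corrected


def build_errata_page(notices):
    """Group unresolved notices by database title, then by location.

    Notices stay on the page until the correction has been made in the
    database itself, so resolved notices are left out.
    """
    page = defaultdict(lambda: defaultdict(list))
    for notice in notices:
        if not notice.resolved:
            page[notice.database_title][notice.location].append(notice)
    # Sort titles and locations so the page reads in a predictable order.
    # Locations are sorted as plain text, which is only an approximation
    # of page order.
    return {
        title: dict(sorted(locations.items()))
        for title, locations in sorted(page.items())
    }
```

Because omissions are filed under a location rather than attached to an existing entry, this layout also covers the case where there is no entry to hang a warning on.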
Not long ago, few people would have predicted that there would be a market for the huge quantity of high-quality genealogical source material that is now available (like images of original records), or that electronic searching would make it so accessible. When we find that an indexer has misread a handwritten census entry and placed the name under the wrong initial letter, or a text-recognition program has missed a name in a newspaper image, we should return, in those individual cases, to the tedious way we used to search: by scrolling through entire enumeration districts or minor civil divisions, or by reading entire newspaper files covering some period of time.
Looking Ahead
For the future, we can expect genealogical garbage to continue to proliferate online, but we can also expect technology to bring even greater improvements, making it easier, faster, and more convenient to access high-quality data, including images of original documents on a scale hardly dreamed of. The benefits of digital technology, and particularly the Web, more than make up for whatever efforts we must expend to cope with the problems of bad or missing data.
Meanwhile, as users we can avoid being misled by being very critical of any information that isn’t attributed to a source we can consider reliable. As contributors of data to websites and digital databases, we can assure the quality of our own input by always including a reference to the source of each information item.
Finally, when confronted by a database shortcoming, we can always resurrect those useful old strategies we used in the days before technology simplified our search procedures. For example, if a name that should be in a census index doesn’t appear, try to identify a close neighbor from city directories, deeds, tax assessments, or cadastral maps that show names of occupants. Then search for the neighbor’s name in the defective census index. If found, you should be able to find your desired listing on the same or adjacent pages without having to go through all the pages for the area in which the person lived.
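The neighbor strategy amounts to narrowing a page-by-page search to a small window around a known entry. A minimal Python sketch follows, assuming the census index can be treated as a mapping from an indexed name to the page numbers where it appears; the pages_to_read function, the window parameter, and the sample name and page number are all hypothetical.

```python
def pages_to_read(census_index, neighbor_names, window=1):
    """Return the sorted page numbers worth reading by hand.

    census_index   -- dict mapping an indexed name to a list of page numbers
    neighbor_names -- names of close neighbors found in city directories,
                      deeds, tax assessments, or cadastral maps
    window         -- how many adjacent pages to include on each side
    """
    pages = set()
    for name in neighbor_names:
        # A neighbor who is also missing from the index contributes nothing.
        for page in census_index.get(name, []):
            first = max(1, page - window)
            pages.update(range(first, page + window + 1))
    return sorted(pages)


# Hypothetical example: the desired name is missing from the index,
# but a neighbor identified from a city directory is listed on page 212.
index = {"Murphy, Patrick": [212]}
print(pages_to_read(index, ["Murphy, Patrick"]))  # [211, 212, 213]
```

Reading three pages instead of an entire enumeration district is the whole point of the technique; widening the window trades more reading for a better chance of finding the missing household.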
Those of us who remember the pre-Web days can dredge our memories for some of these useful but now seldom-used techniques. When the shortcomings of a particular database or digital publication leave us frustrated, applying some old-fashioned research practices may help us control both temper and blood pressure. We can also share these techniques with new researchers so they, too, can deal effectively with digital data problems that arise. Then, disappointment over incorrect or missing data won’t dull their enjoyment of the benefits of technology.
Donn Devine, CG℠, CGI℠, a genealogical consultant from Wilmington, Delaware, is an attorney for the city and archivist of the Catholic Diocese of Wilmington. He is a former National Genealogical Society board member, currently chairs its Standards Committee, and is a trustee of the Board for Certification of Genealogists®.