Link Rot in Your GEDCOMs and Your Databases

Link rot, broken links, link death, link decay, link breaking, reference rot: there are various names for it, but it all comes down to this. There is a decent chance that at some point in the future, a page on a website you as a genealogist use as a source or as a jumping-off point may change, may stop working, may disappear completely, may go behind a paywall, or, if you are lucky, may only partly stop working, with images disappearing or “decaying” (hence “rot”). If you don’t have a copy of that page, information on that page could be lost forever.

This lost information could be hosted on something as simple as a small family blog run by a distant cousin with some photos and personal stories that exist nowhere else, or even worse, a massive genealogy-focused website/service such as RootsWeb that is now a shadow of itself (R.I.P.), or a social media network such as MySpace (R.I.P.) or an entire web hosting service (GeoCities, we hardly knew ya, but you were the launching pad for many a family history website). I have come across pages and websites created by people who are no longer with us, and the relatives that I could contact about the sites weren’t sure where that information came from. And if the relatives don’t pay for the domain name, don’t do what is needed to keep it running, the site will disappear (but maybe not forever, depending on if it was archived).

Why am I writing about this?

The Internet Archive (aka Archive.org) was offline for quite a few days, and it caused some problems for me. I recently broke through a brick wall and so I’m reevaluating some things and am in the process of a “genealogy do-over” (thanks Thomas MacEntee). I’m rebuilding some family trees with a stricter mindset about certain things, and I’m going through old notes and databases/GEDCOMs that stretch back to the early 2000s, and I’m trying to pull up dead or broken links in Archive.org.

For more about the Internet Archive and its 916 billion saved pages that date back to the 1990s, see this article at Ars Technica. Link rot was also recently discussed on the genealogy subreddit (link), but Archive.org being down has been driving me crazy. You don’t realize how much you use some internet tools until they are down! I filed a lot of links to websites in my family trees/GEDCOMs or notes. I originally started with Yahoo Notes and then migrated to Evernote, and then some have been migrated to Apple Notes, so I have notes that stretch back over 20 years and multiple note-taking apps/services.

What does this have to do with genealogy software?

A lot, actually. Many of us are constantly adding URLs (Uniform Resource Locators, aka website names/links that start with “http://” or “https://”) to the databases in our genealogy programs or to our note-taking software. These days it’s common for us to include not just a text citation, but a link to the original source. Or maybe we are sent a GEDCOM file or pull an older GEDCOM file up in our software and see reference URLs that may contain information that never made it into the GEDCOM file or the genealogy program database (which is what happened to me as well).

We copy and paste those URLs to our web browser and hit “ENTER” (or “RETURN”), and maybe we see the page that the original author of the GEDCOM file saw (or the page that we saw when we added the URL to our database), but there’s also a decent chance we see a “server not found” message or we just get a blank screen or Ye Olde 404 “Not Found” message indicating that the particular page in question no longer exists on the server at that location. If we are lucky and we get a 404, we may be able to search the site and find the original record. The worst-case scenario is that we discover the domain name is now a part of a spam site (or repurposed for other similar endeavors that don’t need to be named here).
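Those outcomes can be told apart by a script as well as by eye. Here is a minimal Python sketch, using only the standard library, that sorts a link into rough buckets. The function names and the bucket labels are my own invention for illustration, not part of any genealogy program.

```python
import urllib.error
import urllib.request

def classify_status(status):
    """Map an HTTP status code to a rough link-health label."""
    if 200 <= status < 300:
        return "ok"
    if status in (404, 410):
        return "page gone"  # the server answered, but the page is missing
    if status in (401, 402, 403):
        return "blocked or paywalled"
    if 500 <= status < 600:
        return "server error (maybe temporary)"
    return "other"

def check_link(url, timeout=10):
    """Fetch a URL and describe what came back. Needs network access."""
    request = urllib.request.Request(url, headers={"User-Agent": "link-check-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # The server replied with an error status such as 404.
        return classify_status(err.code)
    except OSError:
        # DNS failure, refused connection, timeout: the "server not found" case.
        return "server not found"
```

One caveat: a domain that has been repurposed by a spam site will usually answer with a normal page, so “ok” here only means that something responded. You still have to look at it yourself.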

How widespread is this?

It’s very widespread. Many of us have seen it ourselves. ahrefs is a search engine optimization/marketing service that deals with these issues daily and not too long ago, they published a study they conducted (link):

Since January 2013, 66.5% of the links pointing to the 2,062,173 websites we sampled have rotted. We found another 6.45% with temporary errors. We don’t know if they’re still there or not.

And it’s not just genealogists that have an issue with it as the ahrefs blog mentions:

Often, the links that no longer work are important. Check out this example of a website that was referenced in a U.S. Supreme Court case. Someone bought the domain and used it to make a statement.

And Pew Research (which many of you may have heard of) published an even more current article (link) with some shocking statistics of their own:

A quarter of all webpages that existed at one point between 2013 and 2023 are no longer accessible, as of October 2023. In most cases, this is because an individual page was deleted or removed on an otherwise functional website.

……Some 38% of webpages that existed in 2013 are not available today, compared with 8% of pages that existed in 2023.

The scary number to me is the 8% that have broken since 2023. Those numbers above are not surprising to me at all, but they’re still jarring to see. Granted, I have far more links to RootsWeb, GeoCities and a few other genealogy-related services/hosts in my notes/databases than many so I feel it a bit more than others, but it’s a problem no matter who you are.

A side note about the WWW tag

WWW has been a part of the GEDCOM specifications since GEDCOM 5.5.1, which was at first a draft spec in 1999. Even though it wasn’t finalized until 2019, it was the default industry standard for many programs (and EMAIL was also a tag added in 5.5.1). I’m not taking a dig at GEDCOM standards either – we may all have our problems with this or that aspect of the standards, but many of us need to store URLs and email addresses with our genealogy data, so it’s going to be in there regardless of whether we use GEDCOMs or an SQLite database in RootsMagic or Family Tree Maker.

I came across this in some FamilySearch GEDCOM v7 documentation (link) about the WWW tag when importing and exporting data:

If an invalid or no longer existing web address is present upon import, it should be preserved as-is on export.

That’s extremely important for reasons I’ll discuss below, but if you have the original link, even if it’s broken, there is a chance you can get an archived copy of the page from Archive.org (or other services).
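If you want a list of every link sitting in a GEDCOM file so you can check them, something like the following Python sketch could be a starting point. It assumes the usual GEDCOM line layout (level, optional cross-reference ID, tag, optional value), collects WWW values exactly as written, and also picks up any http/https addresses tucked into other tags such as NOTE. The sample records are made up, and a real file may need more care (CONT/CONC continuation lines, character encodings).

```python
import re

# A GEDCOM line is: level, optional @XREF@, tag, optional value.
LINE = re.compile(r"^\s*(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?: (.*))?$")
URL = re.compile(r"https?://\S+")

def find_urls(gedcom_text):
    """Return (line_number, tag, url) for every web address found."""
    found = []
    for number, line in enumerate(gedcom_text.splitlines(), start=1):
        match = LINE.match(line)
        if not match:
            continue
        tag, value = match.group(3), match.group(4) or ""
        if tag == "WWW" and value.strip():
            # Keep the WWW value as-is, even if it looks broken.
            found.append((number, tag, value.strip()))
        else:
            for url in URL.findall(value):
                found.append((number, tag, url))
    return found

# Made-up sample records for illustration only.
sample = """0 @R1@ REPO
1 NAME Example County Archive
1 WWW http://www.example.org/cemetery/index.html
0 @S1@ SOUR
1 TITL Cemetery transcription
1 NOTE Copied from https://example.com/smith/reunion.htm in 2004
"""

for entry in find_urls(sample):
    print(entry)
```

The line numbers make it easy to jump back to the record that holds each link.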

Back to the regularly scheduled show! What can lead to link rot?

I like the term “link rot” because it conjures up images of a decaying website, and it’s actually closer to the truth than you might think. When you look at a web page, you may think that all of the images and text and code on that page is contained on a single server and everything is right there on that page, but that’s not always the case, especially on large servers.

With larger servers and services (think Ancestry.com, FamilySearch, MyHeritage, etc.), the text you see on the screen resides in a database that could be scattered across multiple servers (and most likely is), and the images may reside on another server. The code that generates what you see on the screen – say a census record in a record/spreadsheet format, with text of the census, images of the census, and links to individuals or families, could be the only thing on the server that you directly connect to. The code generates the “look” of the page you are on and links to the text and images on other servers and displays those for you. All of these resources are divided amongst multiple servers for redundancy, backup, and speed reasons.

If hundreds or thousands of people are hitting a single page on a website, and everything on that page is loaded from a single server, it could get slow fairly fast depending on the content and how much is happening on that page. With larger services, not only are all of the things you see on a single page divided among various servers, but they may be duplicated between servers located in different geographic regions. A visitor in the US may see the exact same page on MyHeritage that a visitor in Scotland sees, but the person in the US is being sent images from a server in the US while the Scottish visitor is being sent images from a server in the UK. This is done to speed things up for the visitors while also saving money for the company hosting the information. Spreading all of these types of data/content out among many servers is also a recipe for the problems of link rot.
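You can see this spread for yourself. The Python sketch below reads a page’s HTML and lists the other servers its images, scripts and stylesheets are loaded from. The class and function names are hypothetical, the sample HTML is invented, and it only looks at a few common tags, so treat it as a rough illustration rather than a complete audit.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class ResourceHosts(HTMLParser):
    """Collect the host behind every src= attribute and every <link href=>."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("img", "script", "iframe", "source") and attributes.get("src"):
            link = attributes["src"]
        elif tag == "link" and attributes.get("href"):
            link = attributes["href"]
        else:
            return
        # Relative paths resolve against the page's own address.
        self.hosts.add(urlparse(urljoin(self.page_url, link)).netloc)

def external_hosts(page_url, html):
    """Hosts the page depends on, other than the one serving the page itself."""
    parser = ResourceHosts(page_url)
    parser.feed(html)
    return sorted(parser.hosts - {urlparse(page_url).netloc})

# Invented example page: one local image, two resources on other servers.
html = (
    '<img src="/logo.png">'
    '<img src="https://images.example.net/census/1870.jpg">'
    '<link rel="stylesheet" href="https://static.example.org/site.css">'
)
print(external_hosts("https://records.example.com/view/1", html))
```

Every host in that list is one more place where something can move or vanish while the page itself stays up.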

Why and how does it happen?

Link rot happens for a lot of reasons:

  • Something happens to the person maintaining the website or page, and over time, if nobody maintains it and changes to the underlying website occur, things can break.
  • Somebody deletes it for whatever reason. Maybe they think it means nothing or don’t think anybody else cares about it (especially true for small blogs). Maybe it’s a cost issue – some department in charge of a section of a website or internet service is told to cut costs by X amount, so they delete X amount of data and shut down X amount of servers, and they are going to pick the data/servers that have the least amount of users or impact.
  • Ownership changes – lots of reasons why pages or entire sites can be discarded, whether it be the domain name being sold and repurposed, or the owners changing the focus of the overall site.
  • The hosting provider goes down or changes (GeoCities, MySpace).
  • Domain names are not renewed for some reason. See #1: maybe the person who owns the domain name no longer cares, or something has happened to them, or they forgot to renew the domain. If you own a domain name and don’t make plans with relatives/friends, and something happens to you (or you miss the email warning that it’s expiring), it expires and somebody else can purchase it and put something else in its place.
  • Website server changes – information is moved between servers or to new servers, and maybe somebody didn’t update links properly, so the web page is looking for images or other data on a server that it can’t connect to because the network address is wrong. This happens a lot as servers are upgraded over the years.
  • Perhaps the infrastructure/software of the website is changed and along with it, how things are organized or retrieved. This also happens a lot over the years.

RootsWeb is a great example of several of the above reasons (probably all of the above reasons minus the domain name expiring), along with all of the various web pages that are (or were) hosted on it that covered everything from mailing list archives with useful information to family websites to county websites with cemetery transcriptions (that maybe didn’t make it to FindAGrave) or state or province websites.

I’m not going to rehash what happened to RootsWeb (you can see I’m still bitter about it) but if you’ve been around genealogy for several years you have either experienced problems as a result (or seen links to defunct RW pages) or heard other genealogists complaining. Other great examples are past social media sites such as MySpace, which some genealogists used, and GeoCities, where a lot of people hosted family or genealogy websites.

I don’t use many websites, just the main genealogy sites!

Okay, but find me a statement from any of the major genealogy services where they guarantee that their URLs are permanent and will never change. I’m not going to name names, but in recent months, we’ve seen some URL structures expanded (that’s the best word I can come up with), thereby breaking some browser extensions, or at least requiring changes to accommodate them.

We’ve also seen things like URL shorteners break or stop being used – URL shorteners are basically domain services that can change a lengthy URL with a lot of variables/words down to something that is short. This is a minor example (some genealogy services can generate extremely long URLs) but imagine adding a shortener to this:

  • https://genealogysoftware.net/tips/link-rot-in-your-gedcoms-and-your-databases/

which changes it to this:

  • https://genealogysw.com/link-rot

Pretty nice, right? Well, when you plug “https://genealogysw.com/link-rot” into your web browser, if the genealogysw.com service has changed, you will not get back to the long article on GenealogySoftware.net; you will get a 404 error, or worse, be redirected to something you didn’t want to be redirected to. Also, some shorteners use randomly generated letters and numbers in place of words, so if they re-use those links, it may take you to a different site entirely.

Many people will see an extremely long URL and want to shorten it for their notes or their genealogy database and will run it through a third-party shortener (you can find plenty on Google or Bing). There is no guarantee that the shortened URL will last beyond a certain point or that the service will be around a year or 5 or 10 years later.
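If you suspect shortened links are already hiding in your notes or database, a quick check like this Python sketch can flag them so you can swap in the full address while the shortener still works. The list of shortener hosts is only an assumed starting point; add whichever services you have actually used.

```python
from urllib.parse import urlparse

# Assumed starting list; extend it with the shorteners that appear in your notes.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "t.co", "goo.gl", "ow.ly", "is.gd"}

def is_shortened(url):
    """True if the link's host is on the shortener list."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENER_HOSTS

links = [
    "https://bit.ly/3abcXYZ",
    "https://genealogysoftware.net/tips/link-rot-in-your-gedcoms-and-your-databases/",
]
for link in links:
    print(link, "-> shortened" if is_shortened(link) else "-> full address")
```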

How do you deal with this with your genealogy program’s database?

How do you prevent link rot from affecting you in the future?

First, always assume it’s going to happen, and when you come across important or unique information, act then and there to preserve it. Because it’s on the internet, we have a tendency to assume it’ll always be there, just a few mouse clicks away. But that’s not always true.

Here’s a few examples of what you can do (and maybe it’s worth revisiting this as its own article later on):

  • Follow the guidelines when it comes to source citations, so that you can find those records or information again. Granted, this is not always possible though, since we are sometimes talking about websites with near-proprietary information compiled by one person, that might go away.
  • Try to link to primary sources/records of the larger services if these are common records. For the most part (minus RootsWeb) they are relatively stable. Thankfully they aren’t like a streaming service (YET) where content moves from one service to another every few months, so that one month you can access the 1870 census on one service, and the next month it’s no longer on that service and is instead on a second service.
  • When you come across a page on the internet that is useful to your family tree (say a hand-drawn map of a cemetery, or a list of people at a family reunion), if you can, copy the relevant information into the entries in your family tree software right then and there, and then copy the textual information into your note-taking software (you do have note-taking software, right?). Regardless of whether you can or can’t extract the data and put it in your database right away, take a screenshot and carefully name it, and at the very least, store it somewhere on your computer (preferably backed up or in the cloud). You may even want to attach that screenshot as an image to one or more records in your genealogy database. Don’t just link to that page, with the idea that you will come back someday and properly extract all useful information from that page. This happened to me and in some cases I didn’t go back to that page for a decade or more.
  • As an alternative, modern OSes and browsers will let you save to a PDF file (it may be under the “Print” feature, but you are saving) and you can save that file to your drive or cloud storage, and/or attach it to your tree (be careful of file sizes though). Also, be careful of image or website formatting problems – you may want to put the browser into some kind of “Reader” mode if that’s possible, so it’s mostly text.
  • Save an archived copy of the page. When you save it, don’t choose “HTML-only”; save it using “complete” or “archive”. Be careful, though: some images are still on the online server of the website so you may not get all images, and, yep, you guessed it, if link rot happens, you will only have the text.
  • If you are on an iPhone or iPad, and you take a screenshot of a web page, you can touch the small thumbnail after you take the screenshot, expand it, and at the top you will see “Screen” or “Full Page”. If you select “Full Page” and then “Done”, it will give you the option to save the PDF to Files and then ask you where to save it. It provides a very nice PDF file, usually quite compact size-wise with images (note: the images may be reduced in size though).
  • If the website or page in question has some important images or photos, your best bet is to download or save those photos separately when you save the page if the function exists (think of a genealogy site that gives you the download full-size image option). If they don’t give you the option or actively block you from downloading an image….that’s between you and them and whatever you find on Google.
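On “carefully name it”: one approach is to build the file name from the capture date and the page’s address, so that years later the file itself tells you where and when it came from. This is just one naming scheme I’m assuming for illustration; the function name is hypothetical.

```python
import re
from datetime import date
from urllib.parse import urlparse

def capture_name(url, captured_on, extension="pdf"):
    """Build a sortable file name from the capture date, host and page path."""
    parts = urlparse(url)
    # Keep letters, digits and dots; collapse every other run of characters to one hyphen.
    slug = re.sub(r"[^A-Za-z0-9.]+", "-", parts.path).strip("-.")
    host = parts.hostname or "unknown-host"
    stem = f"{captured_on.isoformat()}_{host}"
    if slug:
        stem += f"_{slug[:80]}"  # keep very long paths from producing unwieldy names
    return f"{stem}.{extension}"

print(capture_name(
    "https://genealogysoftware.net/tips/link-rot-in-your-gedcoms-and-your-databases/",
    date(2024, 10, 20),
))
```

Putting the date first means the files sort chronologically in any folder view, and the same name works for a screenshot by passing a different extension.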

I have dead links, what can I do?

Visit https://Archive.org and plug them into the “Wayback Machine” and if you are lucky, the Internet Archive will have a copy of that web page, either full or at least partial, and you can then save the information. There are a few other services, and I will probably write an article just on this topic in the near future.
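If you have more than a handful of dead links, the Internet Archive also offers a small “availability” API that answers with the closest archived capture for an address. Below is a hedged Python sketch of how a lookup might work. The function names are mine, the JSON shape is as I understand it from the Archive’s public documentation, and the service may rate-limit or be unavailable, so check the current docs before relying on it.

```python
import json
import urllib.request
from urllib.parse import urlencode

def availability_query(url, timestamp=None):
    """Build the Wayback Machine availability API address for a link."""
    params = {"url": url}
    if timestamp:  # e.g. "20050101" asks for the capture closest to that date
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)

def closest_snapshot(api_response):
    """Pull the archived copy's address out of the API's JSON, or None."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None

def lookup(url, timestamp=None):
    """Ask the Internet Archive for a capture. Needs network access."""
    with urllib.request.urlopen(availability_query(url, timestamp), timeout=15) as response:
        return closest_snapshot(json.load(response))
```

Passing a timestamp close to when you originally added the link improves the odds of getting the version of the page you actually saw.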

What about GeoCities or RootsWeb?

Surprisingly, the Internet Archive does have a hefty chunk of GeoCities sites archived, but there are other sites rebuilding the GeoCities archives. As far as RootsWeb, Archive.org has a lot of it, but there are dynamically generated parts of RootsWeb that were never properly archived because of how they were generated (particularly forums and mailing lists). We will just have to hope that someday somebody can recover that data.

How can you prevent this on your own site?

If it’s not your server and your website, you can’t, but if it is, there’s some things you can do.

Without going too deep into the weeds, if it’s your family tree server/website, you can be mindful of how the site is structured and what the organizational categories and page slugs are. Take the link of this manifesto article for instance:

https://genealogysoftware.net/tips/link-rot-in-your-gedcoms-and-your-databases/

  • https://genealogysoftware.net = this site’s URL. Here genealogysoftware.net is the domain name, and .net is the TLD (Top Level Domain).
  • /tips/ = category
  • /link-rot-in-your-gedcoms-and-your-databases/ = page slug
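The same breakdown can be done in code, which is handy if you want to sort a pile of saved links by site or category. A small Python sketch follows; the “category” and “slug” labels simply follow the breakdown above, and the TLD split is naive (it would get multi-part suffixes such as .co.uk wrong).

```python
from urllib.parse import urlparse

def describe(url):
    """Split a link into the pieces discussed above."""
    parts = urlparse(url)
    host = parts.hostname or ""
    segments = [s for s in parts.path.split("/") if s]
    return {
        "scheme": parts.scheme,
        "host": host,
        "tld": host.rsplit(".", 1)[-1],  # naive: last label of the host only
        "category": segments[0] if len(segments) > 1 else None,
        "slug": segments[-1] if segments else None,
    }

print(describe("https://genealogysoftware.net/tips/link-rot-in-your-gedcoms-and-your-databases/"))
```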

If you rebuild your site, and you rename “tips” to “guides”, make sure to redirect incoming visitors from “tips” to “guides”. Avoid renaming your page slugs; some website software and search engines do a good enough job of figuring out where things moved if you re-organize, but you shouldn’t count on it. Oh, and have an informative 404 page, along with a nice sitemap so that visitors can easily figure things out.
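In practice a redirect like this is usually set up in your web server’s configuration or through a plugin in your site software, but the logic behind it is simple. Here it is as a Python sketch, with a hypothetical rename table, just to show what “redirect tips to guides” means for an incoming link:

```python
# Hypothetical rename table: old category prefix -> new category prefix.
RENAMED = {"/tips/": "/guides/"}

def redirect_target(path):
    """Return the new path for a permanent redirect, or None if nothing moved."""
    for old, new in RENAMED.items():
        if path.startswith(old):
            # Swap only the category; the page slug stays exactly as it was.
            return new + path[len(old):]
    return None

print(redirect_target("/tips/link-rot-in-your-gedcoms-and-your-databases/"))
print(redirect_target("/about/"))
```

Because the slug is carried over untouched, every old link that somebody saved in a GEDCOM or a note still lands on the right article.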

I host my family trees online!

This is a unique problem. If you are concerned about link rot and are running online genealogy software (The Next Generation of Genealogy Sitebuilding, Humo-Genealogy, webTrees, etc.) it can be a lot more difficult since the links are dependent upon the data in the tree, and the names/IDs of the trees, etc. and those links can change over time. When you load that GEDCOM file in, or start entering data, links start being generated based on some information you’ve provided.

Link rot can easily happen to such sites if something like your photos become unlinked to a tree, so that when people pull up individual entries, they don’t see the photos. And “Link rot” may not be the best term here, because it’s not that the site is decaying (data going away), it’s that the links are simply changing.

You as a genealogist accessing such a site may simply want to link to the top-level site/page.

My advice in this instance is to plan your site very carefully from the beginning and do some trial runs for a few weeks before publicly launching it. Find a link structure that you’re happy with.

Have a root or front-end to the family history site that does not change and have the family trees in a sub-directory (www.myveryownfamilyhistory.com/smith-tree/, etc.). Have the root of that domain (myveryownfamilyhistory.com) be a landing page or home page that can help people find what they need. If you change data (maybe change an ID or name of a tree or something) you could alter the organizational structure of the tree itself, but as long as people are able to hit your top-level page and they can find your data, you will be okay.

I want my site or another site preserved for many years to come!

If you want your site to live on after you are gone, or you don’t plan on hosting it forever, or you don’t mind it being archived and made available to other people from Archive.org, well when the “Wayback Machine” is available once again at https://archive.org, you can submit your website’s URL/links there, and the Internet Archive will back up your website.

Note: The Internet Archive, while it may have a robust (and automated) mechanism for backing up your website or blog, should not be your primary backup. You should have procedures in place to do so on your own, in case of things like what’s happened over the past week.

If you come across a useful site or blog, if they follow certain guidelines, they are most likely being backed up at the Internet Archive, and you can check that by simply plugging a web page link into Archive.org and seeing if it’s archived, and if so, how far back. If they are not, most likely they have either enabled certain settings to prevent the automatic backup, or they have directly contacted the Internet Archive to prevent it from backing them up. If they aren’t backed up, you can contact the owner and ask if they mind if you submit some links. Or you can just submit those links. That’s up to you.
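For a list of pages, it can save some typing to generate the Wayback Machine addresses rather than pasting each link in by hand. The Python sketch below builds two addresses per page using URL patterns the Wayback Machine has supported for a long time: one that shows the capture history and one that requests a new capture (“Save Page Now”). I’m assuming those patterns are still in place; the function name is hypothetical.

```python
def wayback_links(url):
    """Addresses for viewing a page's capture history and requesting a new capture."""
    return {
        "history": "https://web.archive.org/web/*/" + url,
        "save": "https://web.archive.org/save/" + url,
    }

for page in ["http://example.com/smith/reunion.htm"]:
    links = wayback_links(page)
    print(links["history"])
    print(links["save"])
```

Opening the “save” address in a browser asks the Archive to capture the page, which is the step to skip if the site’s owner has asked you not to.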

I can speak from personal experience, if the owner doesn’t want it backed up, they have a mechanism for keeping their links out of the Internet Archive. Just try and find previous versions of Personal Ancestral File through an archived version of the FamilySearch website. You won’t. Because I’ve tried. A lot.

2 thoughts on “Link Rot in Your GEDCOMs and Your Databases”

  1. Advice for genealogists: when an archive provides permalinks, use those instead of the URL in the address bar of your browser. Permalinks (short for permanent links) come with the promise of the archive that the permalink is … permanent, so it won’t disappear or change. Archives usually use systems like Handle or ARK for these links. You might recognize ARK from FamilySearch, which uses this system for its permanent identifiers.

    As an example: https://hdl.handle.net/21.12124/09BB7D2B09D14D9DA5BCC6CC25B388A6 leads to a record at the archive of The Hague/NL. This internet address “resolves” (at this moment) to https://haagsgemeentearchief.nl/archieven-mais/overzicht?mizig=782&miadt=59&miaet=54&micode=0335-01.3280&minr=20799132&miview=ldt, and that’s the URL you’ll see in the address bar. If this archive changes its system or domain name, that address will change, but https://hdl.handle.net/21.12124/09BB7D2B09D14D9DA5BCC6CC25B388A6 won’t (and will then redirect to the new URL).
