Data Portability in Paleoanthropology

I realize what I will be addressing is several months late and not the freshest anthropology news, but with the recent news of Google and Facebook joining the data portability initiative, I think this is an opportune time to have an open discussion about paleoanthropological data and data sharing. You’re probably asking yourself, “Google and Facebook have very little to do with paleoanthropology, so why is Kambiz bringing this up?”

What Google and Facebook, as well as a lot of other companies in the ‘social networking’ business, are planning to do is allow a limited layer of people’s data to be shared across different services. This is to benefit users, sparing them from having to maintain dozens of profiles and re-add the same contacts on different services. If that confuses you, imagine a hypothetical scenario like this: when you update your Myspace profile, the changes are also reflected in your Facebook profile. Likewise, say you add a contact in Flickr and you also authorize them to add you to their YouTube account. Not only does this eliminate the chore it has become to maintain so many profiles, but it also expands our ability to use these services to really network and explore new information. But this isn’t all dandy and good… there’s a whole slew of privacy issues that come about with this initiative, which I believe should soon be hashed out. In the long run, I imagine a certain dataset that a user chooses to release will be opened up for use in other services.

A similar initiative has been brewing in the paleoanthropological field. In April of 2007, lots of anthropologists gathered in New York City for a discussion on “databases, data access, and data sharing in paleoanthropology,” and a summary of what was discussed came out in the October 2007 issue of Evolutionary Anthropology. I just read the article and it got me thinking about where I can go with my Hominin Database project, which I started up last year, and how paleoanthropologists can use the headway made in the corporate/web 2.0 sphere to their advantage.

Before I get into it all, let me first outline the problems faced in the anthropology world. Paleoanthropologists specialize in finding fossils and analyzing them. With a bit of luck and a lot more experience and professionalism in their science, they will find a fossil primate such as a hominid. After recovering the fossil from the field, they will take it to a lab to study it and compare it with other specimens. Ultimately, they draw conclusions and produce a publication based upon their research.

But what happens after this point isn’t so cut and dry.

Depending on where in the world the fossils were found, there’s a whole slew of bureaucratic and ethical constraints that prevent the fossil from being shared physically. Most institutions and countries where fossils are found have laws and regulations that do not allow the fossil to leave the country of origin. To complicate this, since a whole lot of time, money, sweat and sacrifice goes into finding and analyzing hominid fossils, paleoanthropologists often become possessive of their find and do everything they can to prevent their fossil from being seen by others. I don’t blame them. Some worry about how they will be given credit for their find and how they can control what is concluded about the fossil in the future.

Other paleoanthropologists aren’t so possessive. They see great potential in sharing their data and have turned to the web to share. As we all know, the web is a great way to put out information and have it be used by people. There are already some motivated folk who have mobilized and created databases such as the RHOI Specimen Database and Primo, the NYCEP Primate Morphology database. But these projects have run into a major snag that Google, Facebook, Plaxo and other social networking sites are also facing… with all these different approaches to producing databases and sharing data such as fossils or people’s profiles, these projects are becoming more and more specialized, or divergent in evolutionary terms. Already they have grown apart, and their data cannot be shared across different networks. What the big tech companies and these paleoanthropological database projects are both confronting is the challenge of creating a uniform way to share a predetermined layer of data across other networks.

I can think of one, albeit rather archaic, way of doing this, and that is to agree upon and use a standard database structure with uniform field names and tables. Ideally this would have been done before people created their own databases, because as databases grow into their own structures and organizations it becomes a monumental, and sometimes tricky, effort to migrate data into a new structure. I’ve had the uncomfortable pleasure of banging my head against the table on more occasions than I’d like to admit as I upgraded various databases into new structures. Sometimes, I’ve even lost data, which is often priceless! Furthermore, with people using different database software and different web development languages, it becomes an even bigger mess!

I really don’t think it’s anyone’s business to say, “Hey, if you wanna be part of the new Paleoanthropology 2.0, then you have to make your database like this or you have to use this database/language.” I really don’t think that will make even the most enthusiastic data sharers wanna join, and since people have already made and organized their data, I think it would be too much to ask them to completely rewrite their databases and import the old data into the new structure.

Instead, I think that by using standardized technologies and formats such as XML and RSS, the same tools already used to read blogs through news readers such as Google Reader, we can accomplish data sharing and still allow people to maintain their own database design. How? Currently, there are a lot of different software platforms that drive blogs… from Blogger to WordPress to Movable Type. They all have different database structures, but how they export the data into RSS feeds is pretty much standard, uniform and, most importantly, an established technology. How do I know? I read my blogs and news in my news reader and I’ve used all the different blogging platforms. I know how WordPress uses dynamic PHP, usually on a SQL database, how Blogger generates static HTML files, and how Movable Type functions dynamically, similar to WordPress but with Perl instead of PHP. All of these platforms produce RSS feeds with a title and content layer of the data, regardless of the computer language or database format they use. This technology can be used to import and export various forms and structures of paleoanthropological data.
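To make the point about news readers concrete, here is a minimal sketch in Python. It assumes two made-up RSS 2.0 feeds standing in for blogs on different platforms; the feed contents and the function names read_items and aggregate are hypothetical, chosen just for illustration. The reader needs to know nothing about the software or database behind each feed, only the shared title and description layer.

```python
import xml.etree.ElementTree as ET

# Two hypothetical feeds, standing in for blogs run on different platforms.
FEED_A = """<rss version="2.0"><channel>
  <title>Blog A</title>
  <item><title>First post</title><description>Hello from platform A</description></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel>
  <title>Blog B</title>
  <item><title>Other post</title><description>Hello from platform B</description></item>
  <item><title>Second post</title><description>More from platform B</description></item>
</channel></rss>"""


def read_items(feed_xml):
    """Return (title, description) pairs for every <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", default=""),
         item.findtext("description", default=""))
        for item in root.iter("item")
    ]


def aggregate(feeds):
    """Combine the items of several feeds into one list, in feed order."""
    items = []
    for feed_xml in feeds:
        items.extend(read_items(feed_xml))
    return items
```

Calling aggregate([FEED_A, FEED_B]) returns one flat list of title and description pairs, which is roughly what a news reader does before showing you your subscriptions.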

Now, I don’t know exactly what the conference on databasing in paleoanthropology concluded about sharing data. I wasn’t there. The Evolutionary Anthropology article indicates people agreed to make a standard database structure and use a portal site, Paleoanthrportal.org, to display the shared datasets. Again, I don’t think that’s the best way to go about doing things. People will still make databases however the hell they want. Asking them to conform to a uniform structure isn’t as easily accomplished as it seems. Rather, having a way for them to publish their data using RSS with a uniform structure is the way to go. It won’t alter the original database; rather… during the export process, scripts will map the field names in the database to the uniform standard of the RSS file… usually in XML format. At most this will be an add-on script, run periodically on the server. Furthermore, the databases are still maintained by individual projects and no one will lose control. The RSS feeds are published regularly and aggregated with other services; in this case, a network of participating paleoanthropological databases that will accurately reflect the newest datasets. This will make the data usable on other sites.
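As a rough sketch of what such an export script could look like, here is a small Python example. Everything specific in it is an assumption made up for illustration: the table name fossils, the column names, the shared field names in FIELD_MAP, and the placeholder row. It uses SQLite only because it ships with Python. A real feed would likely need a namespace or its own agreed-upon schema for the custom fields.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping from one project's column names to shared field names.
FIELD_MAP = {"spec_no": "specimen_id", "tax": "taxon", "loc": "locality"}


def export_feed(conn, table, field_map, channel_title):
    """Read rows from a project's own table and return an RSS-style XML string.

    The database is only read, never modified. Table and column names come
    from trusted project configuration, not from user input.
    """
    columns = list(field_map)
    query = "SELECT {} FROM {}".format(", ".join(columns), table)
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    for row in conn.execute(query):
        item = ET.SubElement(channel, "item")
        # Rename each local column to its shared field name on the way out.
        for column, value in zip(columns, row):
            ET.SubElement(item, field_map[column]).text = str(value)
    return ET.tostring(rss, encoding="unicode")


def make_demo_db():
    """Build a throwaway in-memory database holding one placeholder row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fossils (spec_no TEXT, tax TEXT, loc TEXT)")
    conn.execute(
        "INSERT INTO fossils VALUES ('X-001', 'Example taxon', 'Example locality')"
    )
    return conn
```

Running export_feed(make_demo_db(), "fossils", FIELD_MAP, "Demo project") gives an XML string whose item uses the shared names, while the table keeps its original columns. Another project with a completely different structure would only need its own FIELD_MAP.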

I understand this may have been way above most people’s heads, and I was just brainstorming too. I admit I have no foolproof way to do this. Also, I don’t imagine most of Anthropology.net’s readers are database administrators or web developers, but as anthropologists… as scientists, I think we should all be wondering how we can begin to share our data. The biologists are leaving us in the dust, with the massive GenBank and SwissProt databases (which, by the way, sync data even though they are run by two different governments). Lastly, these problems don’t just exist in the paleoanthropological subset of anthropology. Archaeologists and linguists also face similar challenges.

We should all acknowledge that these fossils and artifacts have been locked away for a long time, in the dirt and in the sediment, and we need to do our best to share them with the rest of the world rather than curate and ignore them. So please let me know what you think about my argument for sharing paleoanthropological data using RSS, or if you have a better idea on how to go about creating sharable data, please tell us!

5 thoughts on “Data Portability in Paleoanthropology”

  1. I’ve been struggling with the same problems w/ archaeological datasets, and think that you might be onto something. I’m not sure about using RSS per se, but the idea of using a common interchange format (and especially an XML one like Atom or RSS) makes a lot of sense. Such solutions, in my experience, are rarely discussed, but are probably, for the reasons you state, the path of least resistance.

  2. You are on the correct path– everyone can store their data in whatever database formats they want, and then make that information available to colleagues in XML.

    Unfortunately, the definitions (schema) for RSS and Atom are far too simplistic and restricted to adequately represent all of the information paleoanthropologists would desire, so a custom XML schema should be developed.

    Simple tools would also be needed to dump the contents of each specific database into the common formal XML files, but this is not a show-stopper. It could be managed as an open-source software effort.

    There are a lot of anthropologists who are self-taught programmers, and a lot of programmers who are self-taught anthropologists… I’m sure there will be people available to do this.

  3. lugal, yeah I think making some sort of XML method, published and aggregated similar to RSS is the way to go. Aside from that, are you facing similar problems with archaeological data sets? Are people willing to have their data be shared and even hosted on other databases?

  4. Hey Paul,

    Yeah, I think I didn’t clarify in the post that I’m not saying RSS is the format to do this. Rather, I think the way RSS is distributed and aggregated is the way paleoanthropological data needs to be shared. This has to be done with a new XML schema. The paleoanthropological community needs to sit down and agree upon a predetermined set of fields. Then people will just need to have their databases exported and published. Anyways, thanks for catching that!

    I hope that paleoanthropologists will think about this way of executing their data sharing, because having to reformat databases is a big pain in my opinion. Or, if the big data portability initiative in the technology sector finds a better way, which I have faith it will, I hope they’ll use that model as a way of sharing paleoanthropological data.

    Kambiz

  5. Since archaeological data sets are so frequently tied to national surveys, CRM, and other such projects with legal implications, there has been a lot of ink spilled discussing data interchange between projects. But the underlying problem, that each project has its own highly customized database schema, is the same in archy as it seems to be in paleoanthro. An increasing number of projects are now putting their databases online, which facilitates data dissemination, but there aren’t any good systems that I know of for broadly applicable database schemas or other methods of data sharing between projects.

    One site that’s been getting a lot of buzz lately, hosting multiple high-profile projects’ databases, is Open Context (http://www.opencontext.org/), so the answer to your second question is “yes.” But I don’t know whether anyone is using them yet for actual research.

    Also, there’s been at least one serious attempt at a general XML schema for archaeology, ArchaeoML (http://ochre.lib.uchicago.edu/index_files/ArchaeoML_Schema.htm). I haven’t had the chance to dig into it, and I don’t know if it’s still being developed, but it would be a good place for an interested archaeologist to start.

    And as for Atom/RSS, in concert with a specialized XML interchange format, it might make a lot of sense for announcing changes. That is, let each project use its own database but create the tools (as Paul suggests) for exporting the data into a common XML schema, and use RSS to announce the availability of new data to all interested parties. That way, we could use best-of-breed solutions for data storage, data dissemination, and data sharing, without diluting the quality of any single component by mashing it all together.

    Unfortunately, I doubt that a schema for the interchange of paleoanthro data would be useful for archy, or vice-versa.
