 
Duke, M. and Ball, A. (2012) How to Cite Datasets and Link to Publications: A Report of the Digital Curation Centre. In: 23rd International CODATA Conference, 2012-10-27 - 2012-10-31, Taipei.

Link to official URL (if available): http://codata2012.tw/
Opus: University of Bath Online Publication Store http://opus.bath.ac.uk/

How to Cite Datasets and Link to Publications
A Report of the Digital Curation Centre

Monica Duke
Alex Ball

30 October 2012

Hello, my name is Alex Ball and I work for the Digital Curation Centre in the UK with my colleague Monica Duke. The Digital Curation Centre, or DCC, is a centre of expertise in digital curation and research data management funded by JISC, which is an agency that helps to develop and maintain information systems in the higher and further education sectors. For about five years now JISC has been pushing to improve research data management in the UK, and as part of that, we at the DCC are publishing a series of guidance documents based on themes set by JISC. One of the themes is data citation, and at about this time last year we published both a Briefing Paper and a How-to Guide on the subject (slide). We have some copies here to give away and you can also download a copy (Figure 1) from our website or read it online.

http://www.dcc.ac.uk/resources/how-guides/cite-datasets

Figure 1: How-to Guide on data citation, on the Web

As I only have twenty minutes, I’m not going to be able to go through the whole document. Instead, I’ll pick out some of the more interesting issues we came across when putting the guide together.

1 Motivation

I guess I don’t need to convince anyone here about the need for data publication and citation, but to understand it we have to think about scholarly communications more generally. Journals are the big success story in this area, but what made them so popular (Figure 2)?

• Awareness raising
• Protection from plagiarism
• Verification of results
• Basis for future research
• Reward models
• Permanent access

Figure 2: What’s great about journal papers?

They provided a way of communicating research results such that others could verify the results and build on them, while also ensuring authors received due credit, and in time rewards, for their work. Formal publication also meant formal archiving could take place.

But as the process of conducting research has become more specialist and complicated, your average scientific journal paper can no longer contain all the information it needs to make the research reproducible (transition); we also need the underlying data. But we won’t get data routinely shared until all these things apply to data as well as to journal papers. I would argue (Figure 3) that, given time, data citations are what will make it happen, because the citation model is well understood and trusted.

• Visibility for data
• Protection from plagiarism
• Possibility for verification of results
• Data on which to base future research
• Possibility for reward models
• Access

Figure 3: What data citations provide

What should data citations look like? Well, every journal has its own idea of what a citation should look like, so the important point is what a citation should include (slide).

2 Elements of a data citation

Here are four standard citation styles I found in the literature: see the Guide for the full references. Which elements do they use? Author, Publication date, Title, Version, Feature, Resource type, Publisher, Identifier, Location, Universal Numeric Fingerprint.

Altman and King (2007): Dataverse
• Sidney Verba. 1998. “U.S. and Russian Social and Political Participation Data,” hdl:1902.4/00754 UNF:3:ZNQRI14053UZq389x0Bffg?== NORC [Producer]; data set [Type (DC)] ICPSR [Distributor].

Lawrence et al. (2008): BADC
• Iwi, A. and B. N. Lawrence (2004). A 500 year control run of HadCM3. [GridSeries, http://ndg.nerc.ac.uk/csml2/GridSeries] Version 1. BADC. urn:badc.nerc.ac.uk _coapec500yr [Available from http://badc.nerc.ac.uk/data/coapec500yr].

Green (2010): OECD
• OECD (2009), “Key short-term indicators”, Main Economic Indicators (database). doi: 10.1787/data-00039-en http://dx.doi.org/10.1787/data-00039-en (Accessed on 14 September 2009)

Starr and Gastl (2011): DataCite
• Irino, T; Tada, R (2009): Chemical and mineral compositions of sediments from ODP Site 127-797. V.2. Geological Institute, University of Tokyo. Dataset. doi:10.1594/PANGAEA.726855. http://dx.doi.org/10.1594/PANGAEA.726855

There are five elements that occur in all four styles, four of which have a long pedigree in scholarly citation:

• Author
• Publication date
• Title
• Location (= identifier)

Despite the fact that we’ve had ISBNs since 1970 and online catalogues in widespread use since the mid-1980s, identifiers didn’t really start to catch on in citations until the introduction of DOIs in the last five to ten years. I’d guess this is because with things like ISBNs there is no central register that allows you to look up the item; booksellers and libraries have had to build them up for themselves. So identifiers have tended to be used, if at all, more like checksums for making sure you had the right item, rather than as a way of accessing resources. But the Web is changing all that. We now have ways of making locations persistent enough to be used as an identifier (transition). While it’s possible to do this by carefully managing URLs, it’s more usual to achieve it by using a fake location, made up of a resolver service and the identifier, that redirects to the real location.
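
To make the ‘fake location’ idea concrete, here is a minimal sketch in Python of gluing a resolver service to an identifier. It assumes identifiers are written with a scheme prefix, as in the citations above (doi:, hdl:). The function name and the lookup table are hypothetical, for illustration only. The DOI resolver address is the one that appears in the citations above; the Handle proxy address is an assumption about which resolver you would use.

```python
# Resolver base URLs, keyed by identifier scheme.
# "doi" uses the resolver shown in the citations above; the "hdl" entry
# (the public Handle proxy) is an assumption made for this sketch.
RESOLVERS = {
    "doi": "http://dx.doi.org/",
    "hdl": "http://hdl.handle.net/",
}


def to_resolver_url(identifier: str) -> str:
    """Build a resolver URL from a scheme-prefixed identifier.

    Hypothetical helper: expects input such as 'doi:10.1594/PANGAEA.726855'.
    """
    scheme, sep, value = identifier.partition(":")
    scheme = scheme.strip().lower()
    value = value.strip()  # tolerate 'doi: 10.1787/...' with a space
    if not sep or scheme not in RESOLVERS or not value:
        raise ValueError(f"unsupported identifier: {identifier!r}")
    # The 'fake location' is simply resolver + identifier; the resolver
    # service is what redirects a reader to the real location.
    return RESOLVERS[scheme] + value
```

Called with "doi:10.1594/PANGAEA.726855", this gives http://dx.doi.org/10.1594/PANGAEA.726855, the URL form shown in the DataCite example. The sketch does no percent-encoding, so identifiers containing characters that are unsafe in URLs would need extra handling.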
DOIs are getting the most traction for datasets that are considered ‘published’, with Handles and ARKs being used more for ephemeral datasets.

• Publisher

Another change wrought by the Web is that we are now used to getting scholarly content direct from there rather than from library shelves (transition). This makes the publisher more important than ever as both the host of information and the guarantor of its quality.

That might seem straightforward enough, but of course it’s never as simple as that.

3 Issues and challenges

Take the author, for example. Authorship is a strange concept in the context of a dataset. More natural roles might be a compiler, or a principal investigator, or a corporate owner. Furthermore, it is far easier to rack up a silly number of contributors with datasets than with textual publications. In such cases, a simple citation like this isn’t going to cut the mustard. Most likely you’ll need some sort of microattribution approach (slide). This spreadsheet was submitted as part of the supplementary data for an article published in Nature Genetics last year. You’ll see it attributes each genetic variation in the dataset to its contributor, as identified by a Thomson Reuters ResearcherID (other contributor ID schemes are available). This was very much a proof of concept. In future we might hope for this sort of information to be made available as linked data, preferably somewhere more accessible than supplementary data, like DataCite’s metadata store.

• Data points
• Data tables
• Data files
• Datasets
• Data collections

Figure 4: Granularity

Granularity can also be an issue. Just as you might cite only a sentence or a page of an article, with data you might find yourself citing only a single data point, or a table, or a file containing several tables, or a dataset made up of many files.
You might want to cite a more abstract subset of data such as one of the Features I mentioned earlier, or you might want to cite a whole collection of datasets. The practical answer is:

• Cite datasets at the finest level that is appropriate and for which an identifier is provided.
• If that is not fine enough, provide details of the subset of data you are using at the point in the text where you make the citation.

So, now you have an in-text citation and a bibliographic reference. Where should that reference go?

• Special data resources section?
• Acknowledgements? These are already mined for funder information, so could be mined for data citations as well.
• Accession codes? In 2011, Nature published a data DOI for the first time (see http://dx.doi.org/10.1038/nbt.1992 – an article on the genomes of rhesus macaques), and later, in a paper on the recent outbreak of E. coli in Germany, published the DOI for a dataset held by the Beijing Genomics Institute for the first time. In both cases the editors decided to put the citation in with accession codes rather than the reference list as the datasets hadn’t been peer-reviewed.
• Reference list? This is something that’s still being worked out by the movers and shakers, but if data is to be thought of as a first-class research output, it really should be in the reference list.

While we’re on this topic, there’s a related issue in the case of data reuse: if the data citation is in the reference list, should it appear alongside or independently of a reference to the related article? The data might well be useless without the kind of context that a journal article provides, but in print journals with a limit on the number of references, one could consider it a waste of a slot to include citations to both the paper and the data. This is an area where pervasive forward linking would solve a lot of problems.
If publishers can be sure that when a reader follows a link to a dataset, the landing page would forward them on to the data collection paper and any other papers using it, or even other high quality documentation, they might be more open to accepting a lone data citation where it is appropriate. That is why we are recommending the following:

• Include the citation in the reference list – some reference management packages now include support for datasets, which should make this easier.
• When your data collection paper is published, notify the repository holding the dataset.
• When you publish a paper in which you reuse a prior dataset, notify the repository holding that dataset.

The other issue I want to talk about is dataset identifiers, and how they should be applied to dynamic datasets. There are two ways a dataset can be dynamic (Figure 5). The first (animate) is where the dataset is fairly stable in its extent, but points are revised every so often. A table of the masses of subatomic particles would fall under that category. The other, more common case (animate) is where a dataset is continually expanded with new data, such as with sensor data.

• Revised datasets
• Expanding datasets

Figure 5: Types of dynamic datasets (Click on illustrations to animate them)

There are three ways of making such datasets citable.

1. Differentiate versions by access date rather than ID
2. Take time slices
3. Take snapshots

The first option I know is adopted by the National Snow and Ice Data Center in the US, because first, in the disciplines they serve the dataset itself is more important than the version, and second, the Federation of Earth Science Information Partners of which they are a part believe that the identifiers they assign aren’t identifiers at all but locations, because you can resolve them to addresses.[1] It’s not a view I share, and so I’m not keen on this option.
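
The second and third options in the list above can be sketched in a few lines of Python. This is only a minimal sketch, assuming an expanding dataset held as one append-only master sequence in which each record carries both the time it was observed and the time it was added. The Record type and both function names are hypothetical, and revised datasets (where existing points change) are not covered.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Sequence


@dataclass(frozen=True)
class Record:
    """One entry in the master sequence (hypothetical structure)."""
    observed_at: datetime  # when the observation was made
    added_at: datetime     # when it entered the dataset
    value: float


def time_slice(master: Sequence[Record],
               start: datetime, end: datetime) -> List[Record]:
    """Option 2: the records observed in the half-open period [start, end)."""
    return [r for r in master if start <= r.observed_at < end]


def snapshot(master: Sequence[Record], as_of: datetime) -> List[Record]:
    """Option 3: the whole dataset as it stood at the moment `as_of`."""
    return [r for r in master if r.added_at <= as_of]
```

Under these assumptions a time slice is fixed by a date range and a snapshot by a single cut-off time, so either could be given its own identifier while both are read from the same master sequence.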

The second approach really only makes sense with expanding datasets, and even then works best if the researchers tend to use one slice of the set at a time. Even so, it is possible to combine it with the first approach, or the third one which is the one I reckon is most generally suitable; if the rate of change is particularly high, it would probably be best to take these snapshots on demand rather than at predefined intervals. The apparent downside of the third option is that it seems to involve massive duplication of data, but there’s nothing to stop the data backend generating these snapshots on the fly from a single master sequence.

There’s plenty more I could go on to talk about, but time is pressing so instead I’ll flash the headlines before your eyes.

4 Guidance for researchers

When publishing a paper...

• Deposit any data you have collected and used as evidence.
• Ask for a persistent ID/URL for your deposited data.
• When your data collection paper is published, notify the repository holding the dataset.

When citing a prior dataset...

• Use the data citation style required by the editor/publisher.
• If no style is specified, use a standard data citation style, adapted to match the style for textual publications.
• Default to writing IDs in the form of URLs if possible.
• Include the citation in the reference list – some reference management packages now include support for datasets, which should make this easier.
• Cite datasets at the finest level that is appropriate and for which an identifier is provided.
• If that is not fine enough, provide details of the subset of data you are using at the point in the text where you make the citation.
• Cite the exact version of the dataset you need.
• When your paper is published, notify the repository holding the dataset you used.

[1] http://wiki.esipfed.org/index.php/Interagency_Data_Stewardship/Citations/provider_guidelines#Note_on_Versioning_and_Locators

Figure 6: The How-to Guide has been cited in the literature (image: opening pages of Sneddon et al., GigaScience 2012, 1:11; Edmunds et al., BMC Research Notes 2012, 5:223; and the Bulletin of the American Society for Information Science and Technology, June/July 2012, Volume 38, Number 5)

5 Guidance for data repositories

• Provide persistent IDs for the datasets you host.
  – The ID should remain unique.
  – The ID should always point to the same version.
  – The ID should resolve to a URL.
  – The URL should locate the dataset’s landing page.
This URL should belong to a landing page that contains descriptive information about the dataset, as well as links or instructions for accessing it.
• The explanatory metadata should not change for a dataset with a persistent ID.
• IDs should only be assigned once no further changes are expected.
• With dynamic datasets, provide IDs for snapshots or time slices.
• Provide sample citations on dataset landing pages.
• Link from landing pages to publications citing the dataset. This may require collaboration with authors and publishers.

6 Putting it into practice

In the year since we published this guidance it has made quite an impression (Figure 6). It was mentioned earlier this year by Matthew Mayernik in the Bulletin of the American Society for Information Science and Technology[2] in the same breath as the guidelines put out some months earlier by the Federation of Earth Science Information Partners (ESIP). Once they saw them, ESIP themselves called our guide ‘the most useful guide’ on data citation (transition).[3] The correspondence paper ‘Adventures in Data Citation’ by Edmunds, Pollard, Hole, and Basford uses the guide as shorthand for best practice in data citation,[4] as does the editorial in the inaugural issue of the data journal GigaScience.[5] Most recently (transition), Thomson Reuters refer to it in their essay on the selection policy for their new Data Citation Index.[6]

It is good to see the guidelines being used in practice, but the landscape is developing all the time. So we’re keeping a watchful eye on the evolution of data citation practices, and hope to bring out an updated version of the guide in the first half of next year.

Monica Duke, Alex Ball. DCC/UKOLN, University of Bath. http://www.ukoln.ac.uk/ukoln/staff/

Except where otherwise stated, this work is licensed under Creative Commons Attribution 2.5 Scotland: http://creativecommons.org/licenses/by/2.5/scotland/

The DCC is funded by JISC. For more information, please visit http://www.dcc.ac.uk/

[2] http://www.asis.org/Bulletin/Jun-12/JunJul12_MayernikDataCitation.html
[3] http://wiki.esipfed.org/index.php/Interagency_Data_Stewardship/Citations/provider_guidelines#Introduction_and_Summary
[4] http://dx.doi.org/10.1186/1756-0500-5-223
[5] http://dx.doi.org/10.1186/2047-217X-1-11
[6] http://wokinfo.com/media/pdf/DCI_selection_essay.pdf