Production Manual

Project Acronym: OpenUp!
Grant Agreement No: 270890
Project Title: Opening up the Natural History Heritage for Europeana

C2.4.3 - OAI-PMH Interface final version - Production manual

Revision: Version 1

Authors:
Astrid Höller, AIT Forschungsgesellschaft mbH
Gerda Koch, AIT Forschungsgesellschaft mbH
Odo Benda, AIT Forschungsgesellschaft mbH

Project co-funded by the European Commission within the ICT Policy Support Programme

Dissemination Level
P    Public
C    Confidential, only for members of the consortium and the Commission Services

Revision History
Revision   Date         Author                  Organisation   Description
Draft      17.6.2013    A. Höller               AIT            First Version (Draft)
Draft      19.08.2013   A. Höller               AIT            Adding "The harvest does not start and/or there is no progress in the console window"
Draft      20.08.2013   A. Höller               AIT            Adding “Quick Guide”
Draft      28.11.2013   A. Höller               AIT            Adding "The gbif log file can not be opened (permission denied)."
Draft      12.12.2013   A. Höller, O. Benda     AIT            Revision of Parameters and EDM
Version 1  28.02.2014   A. Höller, G. Koch      AIT            final version

Statement of Originality
This deliverable contains original unpublished work except where clearly indicated otherwise. Acknowledgement of previously published material and of the work of others has been made through appropriate citation, quotation or both.

Distribution
Recipient   Date   Version   Accepted YES/NO

Table of Contents
Description of Work
The GBIF Harvesting and Indexing Toolkit (HIT)
    User Interface of HIT
    Adding a new bioDatasource and harvesting it
Pentaho Kettle (Data Transformation)
    Databases
    Creating a folder structure
    01-transform
    02-validate
    03-oai-import
The OAI-Provider
    Logging in
    Adding a new collection
    Advanced search
    Browse
Error handling
    Harvesting
        The bio datasource can not be saved
        After harvesting the metadata updater no operator is created
        The harvest does not start and/or there is no progress in the console window
        No inventoried list is created
        No name ranges file is created
        Why less than 100 % of the target records are harvested?
        Why more than 100 % of the target records are harvested?
        Error in the response document
        Installing the Biocase Provider on IIS Server
        The gbif log file can not be opened (permission denied).
        An error concerning the title of the datasource occurs
    Transforming
        An error occurs when trying to execute a transformation
    OAI-Import
        The Transformation stops during the import
        The imported collection can not be found on the OAI-Provider platform
I. Quick Guide
    GBIF-HIT Harvester
        Existing data source
        New data source
    Pentaho
        Existing data source
        New data source
    OAI-Provider
        Existing data source
        New data source
    Conclusion
II. List of Figures

Description of Work

This document illustrates the complete procedure of harvesting, transforming and uploading data during the OpenUp! project.
This includes harvesting datasources from the data provider BioCASe with the GBIF Harvesting and Indexing Toolkit (HIT), transforming the harvested ABCD records with Pentaho Kettle and finally uploading the created ESE records to the OAI-Provider platform with the Zebra information management system. Figure 1 gives an overview of the whole process. As can be seen, the data has to pass six steps before it is finally delivered to Europeana.

1. Message from Data Provider that a new datasource is available
2. Harvesting the datasource with the GBIF-HIT Harvester
3. Transforming the ABCD files into ESE records with Pentaho Kettle (Data Transformation)
4. Informing the OpenUp! Meta Data Management that the data is transformed
   (Informing us whether the data was correct)
5. Uploading the records to the OAI-Provider platform
6. Delivering the data to Europeana

Figure 1 Diagram showing the infrastructure of the OpenUp! project with its main steps

In the next chapters an example datasource will be processed step by step. This document concentrates on the three Action steps (compare Figure 1): the HIT Harvester (step 2), Pentaho Kettle (step 3) and the OAI-PMH service (step 5).

AIT, 2014 C2.4.3 p. 1

The GBIF Harvesting and Indexing Toolkit (HIT)

User Interface of HIT [1]

When going to http://ait117:8080/hit/ the following window can be seen (see Figure 2).

Figure 2 Logging in the GBIF Harvesting and Indexing Toolkit (HIT)

After logging in (in the upper right corner) the interface of the HIT Harvester can be seen (see Figure 3).

[1] http://code.google.com/p/gbif-indexingtoolkit/wiki/UserManual (17th July, 2013)

Figure 3 The HIT user interface

There are five main sections: Datasources, Jobs, Console, Registry and Report. The tab currently in use is always green (like Datasources in Figure 3), the others are grey. In Datasources all available datasources can be seen.
The orange datasources are metadata updaters, the green ones are operators. For both, one of the following protocols can be chosen: DIGIR, BioCASE, TAPIR or DwC Archive.

PLEASE NOTE: In this project ONLY the BioCASE protocol is used.

An operator is only created once a datasource has been created and its metadata updater has been harvested successfully. This case is described in the next chapter.

When clicking on the Jobs tab, all jobs that have been started or are waiting for execution can be seen. The jobs are listed with their Job ID, name, description, creation date and starting date (see Figure 4). To stop one or more jobs, a Job ID can be filled in and the “kill” button pressed. It is also possible to check the “all” option or to reschedule a job.

Figure 4 The Jobs section

When a Job has been started its progress can be watched in the Console section. Every few seconds the log messages of the application are refreshed with date and time (see Figure 5).

Figure 5 The Console section with the Log Event List

The Registry tab is used to synchronise with the GBIF Registry. Before clicking on “schedule” the datasources can be filtered by endorsing node or organisation name (see Figure 6).

Figure 6 The Registry tab

Finally, a report can be written or generated in the Report section (see Figure 7). Again there are different options for filtering the result.

Figure 7 Writing or generating a report

Adding a new bioDatasource and harvesting it

First of all click on “add bioDatasource” in the lower right corner (see Figure 8).

Figure 8 Clicking on “add bioDatasource”

After doing this the datasource has to be configured (see Figure 9). The name of the bioDatasource, the name of the provider, the URL and the factory class need to be filled in. It is very important to choose BioCASe in the drop-down menu “Factory class”. Typing in the name of the country is optional.
In this example it is done. When everything has been filled in correctly, click on “save”; the datasource should now appear in orange in the datasource list (see Figure 10).

Figure 9 Adding a new datasource

Figure 10 The newly added datasource “Sahlberg”

PLEASE NOTE: It is easier to find the newly created datasource by clicking on “Recently Added” in the top line.

Now tick the box in front of the datasource “Sahlberg” to select this metadata updater. Then click on “schedule”. When switching to the Jobs tab, two Jobs can be seen waiting to be executed: “issueMetadata” and “scheduleSynchronisation” (see Figure 11).

Figure 11 Job list after scheduling the metadata updater for “Sahlberg”

When switching to the Console tab (see Figure 12), not only the progress of the Jobs can be seen but also error messages if something is missing (marked red). Once the Jobs have finished they no longer appear in the Job list. When going to Datasources again you can see that a “Sahlberg” operator has been created (see Figure 13).

Figure 12 The Log Event List after scheduling the metadata updater “Sahlberg”

Figure 13 The newly created operator “Sahlberg – Sahlberg”

Now it is time to gather records from the data provider. To achieve this, select the (green) operator “Sahlberg – Sahlberg” (tick the box) and click on “schedule”. Right after this there should be six Jobs in the list: Inventory, processInventoried, search, processHarvested, synchronise and extract (see Figure 14). The order of these operations is essential for a correct harvesting process.

Figure 14 The Job list after scheduling the operator “Sahlberg – Sahlberg”

During the Inventory operation a list of all scientific names occurring in the datasource is generated. One can follow this process in the Console section (see Figure 15).

Figure 15 Console section during the Inventory operation

During this operation an inventory_request (see Figure 16) and an inventory_response (see Figure 17) are created. Both are saved in /opt/hit (…), the harvest directory determined during the HIT installation.

Figure 16 The inventory_request of the Inventory operator

Figure 17 The inventory_response of the Inventory operator

Figure 18 shows the result of the processInventoried operation: the text document inventoried.txt with an alphabetical list of all scientific names.

Figure 18 Alphabetical list of all scientific names

Another document containing all the name ranges that were constructed is created too: nameRanges.txt (see Figure 19).

Figure 19 The nameRanges.txt document

After this it is time for the search operation. In this phase the ABCD records are created that are later transformed with Pentaho Kettle. There is always a search_request and a search_response (see Figure 20).

Figure 20 The search operation creates search_requests and search_responses

If the response was encoded using ABCD, there is one core file after the processHarvested operation: unit_records.txt (see Figure 21). It contains a header line with column names, with each further line representing a single Unit (record) element.

Figure 21 The unit_records.txt file

In addition, six files that all relate back to the core file are created during this process:

image_records.txt - a text file containing a header line with column names, with each line representing a multimedia record relating to a given Unit (record) element.
identifier_records.txt - a text file containing a header line with column names, with each line representing an identifier record (i.e. GUID) relating to a given Unit (record).
identification_records.txt - a text file containing a header line with column names, with each line representing an Identification element relating to a given Unit (record) element.
higher_taxon_records.txt - a text file containing a header line with column names, with each line representing higher taxon elements relating to some Unit (record) element.
link_records.txt - a text file containing a header line with column names, with each line representing a link record (i.e. URL) relating to a given Unit (record) element.
typification_records.txt - a text file containing a header line with column names, with each line representing a typification record (i.e. type status) relating to a given Unit (record) element.

Finally there are the synchronisation and the extraction operations. During the synchronisation the data is updated and old data is deleted (see Figure 22).

Figure 22 The synchronisation and the extractions operations in the Console section

The extraction operation creates the ABCD records as consecutively numbered search_responses in .gz format (see Figure 23).

Figure 23 The result of the extraction process

When there are no more Jobs in the Job list, the harvesting process with HIT is finished. A folder structure with the root directory /opt/hit/ containing the search_responses should have been created (compare Figure 23).

Pentaho Kettle (Data Transformation)

Pentaho Data Integration (PDI, also called Kettle) is the component of Pentaho responsible for the Extract, Transform and Load (ETL) processes. [2] Pentaho Kettle is used to transform the ABCD records into correct ESE files. The complete process in Pentaho is divided into three steps:

1. transform
2. validate
3. oai-import

This structure is also represented in the Pentaho repository (see Figure 24).

Figure 24 Repository structure in Pentaho

Before starting Jobs and Transformations it is useful to understand the database structure behind Pentaho.

Databases

Figure 25 shows the database “etl” with the four tables “Biocase_Harvest_to_ESE”, “Biocase_Harvest_to_ESE_result”, “Biocase_Harvest_to_ESE_tasks” and “BGBM_Media_URLS”.
The fields for each table are listed in Figure 25. All the Jobs in Pentaho are based on the table “Biocase_Harvest_to_ESE”. The Job parameters need to be adapted before starting to transform the data. These parameters are all saved in “Biocase_Harvest_to_ESE”. All correctly finished ESE records are saved in the table “Biocase_Harvest_to_ESE_result”, which contains the transformation results. In the table “Biocase_Harvest_to_ESE_tasks” all tasks per Job (transform, validate, oai-import) are saved. It also shows the error messages if something goes wrong during the transformation. Finally there is the table “BGBM_Media_URLs”, where all media data sources (images) are saved. It serves as a lookup table.

[2] http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+%28Kettle%29+Tutorial (17th July, 2013)

Figure 25 Structure of the “etl” database with four tables

Creating a folder structure

The three main folders have sub-folders representing the countries of the content providers. When a new datasource has been harvested, a folder has to be created at the correct level of the hierarchy in each of the three main folders. Remember the example datasource harvested before: “Sahlberg” from the University of Finland. First of all create a new country folder named “Finland” in the main folders “transform”, “validate” and “oaiimport”, if there is not already one (see Figure 26).

Figure 26 Creating a “Finland” folder in every category

As can be seen, there are already a few countries in every category. The Transformations and Jobs never change, no matter which datasource is processed, so an existing Job can be copied and saved as a new one. The only thing that must be adapted before starting the Jobs are the Job parameters. Before that, three Jobs must be created, one for every category. The Jobs are named consistently. The “transform” Job is named after the collection name (see Figure 27 for our example “Sahlberg”).

Figure 27 Two Finnish Jobs in the transform category

In the “validate” folder the Jobs are named after the collection plus the word “validate”. Between the collection name and “validate” the symbol “#” has to be typed (see Figure 28).

Important: The three Jobs for one datasource MUST have the same name (the collection name). Everything behind the “#” symbol is ignored by the system.

Figure 28 The Finnish Jobs in the validate category

Now one Job is missing for the oai-import. Again the name of the Job consists of the same collection name followed by “# oai import” (see Figure 29).

Figure 29 The Finnish Jobs in the oaiimport category

01-transform

In the “Sahlberg” example the Job in the “transform” directory looks as shown in Figure 30.

Figure 30 The Job “Sahlberg” in the transform category

The Job parameters can be opened by double-clicking on the orange Job icon in the middle and switching to the last tab called “Parameters” (see Figure 31). Figure 31 shows the parameters for a Transformation in ESE.

Figure 31 Parameters for Transformation of “Sahlberg” in ESE

In Figure 31 the parameters are already filled in correctly. First of all it has to be defined whether the collection is “Restricted” or “Unrestricted” (Parameter number 4). This is done by typing Y (for Yes, it is restricted) or N (for No, it is not restricted = unrestricted) in the corresponding value field.

Parameter number 5 is the collection identifier. It always has the same pattern:

COLLECTION_NAME:CONTENT_PROVIDER:COUNTRY

PLEASE NOTE: For the collection identifier only capital letters are used.

As can be seen in Figure 31, the collection identifier of the example collection “Sahlberg” is SAHLBERG:UH:FINLAND. This collection identifier is also added on the OAI-Provider platform (see Adding a new collection).

Parameter number 6 shows the base directory /opt/hit defined in the installation process of the HIT harvester. The “dataset_name” and the “dataset_uddi_key” (Parameters 7 and 8) are taken from the table “Biocase_Harvest_to_ESE” in the SQL database (see Figure 32, compare Figure 25).

Figure 32 The columns “dataset_name” and “dataset_uddi_key” in “Biocase_Harvest_to_ESE”

Parameter 12 is the variable ${Internal.Job.Name}. Therefore it is important that the three Jobs for one collection have the same name. The parameter “idzebra_dir” is the Zebra directory. Parameter 20 indicates whether the Transformation is done in ESE or EDM (compare Figure 33).

Figure 33 Parameters for Transformation of “Sahlberg” in EDM

In Figure 33 the differences to the ESE parameters have been highlighted. First the idzebra directory is adapted to the “oai-provider-edm”. For “vocabulary_service_uri” the URL http://ait117:8080/Vocabulary/rest/~Mapping/NHMW_common_name/perform is filled in instead of “no”. Finally the “EDM” parameter is switched to “Y” for yes.

When everything has been filled in correctly click “OK”. The Job is started by clicking on the “Play” symbol (see Figure 34) and then on “Launch” (see Figure 35).

Figure 34 Starting the Job “Sahlberg”

Figure 35 Launching a Job

The “validate” Job must not be started before the “transform” Job is finished. It is very important to keep the order 01-transform, 02-validate, 03-oaiimport. The result of this first Job is a set of XML files in ABCD format in the folder “extracted” (see Figure 36).

Figure 36 ABCD records in the folder “extracted” after running the “Sahlberg” Job

02-validate

When the first Job is finished the Job “Sahlberg # validate” can be opened and started (see Figure 37).

Figure 37 The Job “Sahlberg # validate”

This Job simulates ESE validation by copying the records into the “ESEvalidated” directory.

PLEASE NOTE: The “validate” function is not used at the moment.
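
The naming conventions described so far (capital-letter collection identifier, one Job name per category, everything behind “#” ignored) can be sketched in a few lines of Python. This is only an illustration and not part of the Pentaho setup: the function names are hypothetical, and the way the system actually discards the part behind “#” is not documented here, so stripping at the first “#” is an assumption.

```python
def collection_identifier(collection, provider, country):
    """Build COLLECTION_NAME:CONTENT_PROVIDER:COUNTRY in capital letters."""
    parts = [collection, provider, country]
    return ":".join(part.strip().upper() for part in parts)


def job_names(collection):
    """Return the Job name for each of the three repository categories."""
    return {
        "transform": collection,
        "validate": f"{collection} # validate",
        "oaiimport": f"{collection} # oai import",
    }


def base_job_name(job_name):
    """Drop everything behind the first '#' (assumed behaviour, see lead-in)."""
    return job_name.split("#", 1)[0].strip()


if __name__ == "__main__":
    print(collection_identifier("Sahlberg", "UH", "Finland"))
    for category, name in job_names("Sahlberg").items():
        print(category, "->", name, "| base name:", base_job_name(name))
```

For the example collection all three Job names reduce to the same base name “Sahlberg”, which is the property the ${Internal.Job.Name} parameter relies on.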

Before the oai-import is started, which can be quite time-consuming with link validation, the transformed records can be checked under http://ait117/analyse/tasks.php. This application shows the executed Pentaho Jobs (see Figure 38).

Figure 38 Analysis of Pentaho Jobs

When clicking on the “EA” link in the column “Job_ID” an error analysis is performed (see Figure 39).

Figure 39 Error analysis of the “Sahlberg” transformation

Another way to check the data is phpMyAdmin. In the table “Biocase_Harvest_to_ESE_result” every transformed record is listed (see Figure 40).

Figure 40 Controlling the transformed data with phpMyAdmin

03-oai-import

Finally open the Job “Sahlberg # oai import” in the 03-oaiimport directory (see Figure 41). Start this Job after the “validate” Job is finished.

Figure 41 The Job “Sahlberg # oai import”

When this is done the work with Pentaho Kettle is finished. Correct ESE records should now have been created that can be checked on the OAI-Provider platform.

The OAI-Provider

The OAI-Provider platform can be reached by typing http://ait117/oai-provider/index.php into the internet browser. The ESE records that have been uploaded with the Pentaho Job “Sahlberg # oai import” can be seen there.

PLEASE NOTE: The same applies to the OAI-Provider-EDM, which can be opened by typing http://ait117/oai-provider-edm/index.php in the browser.

Logging in

To log in click on one of the “Login” options shown in Figure 42.

Figure 42 Logging in

When clicking on one of these links the following window appears (see Figure 43).

Figure 43 Login window

When “User account” and “Password” have been typed in, the option “Remember me” can be checked to avoid logging in every time the OAI-Provider platform is opened. After finally clicking on “Login”, the following message indicates that the login was successful (see Figure 44).

Figure 44 “You are now logged in as admin”

Adding a new collection

The OAI-Provider platform has an “Admin Area” (see Figure 45).

Figure 45 Entering the Admin Area

When clicking on “Admin Area” the following window opens (see Figure 46).

Figure 46 The Admin Area

To add a new collection select the second icon “Collections” (compare Figure 46). The option “Edit Collections” appears (see Figure 47).

Figure 47 “Edit Collections”

When clicking on “Edit Collections” an alphabetical list of all collections can be seen (see Figure 48). Every collection consists of three parts and every part has to be created separately.

Figure 48 List Collections

In the top right corner (compare Figure 48) there is an icon for adding a new collection (the left one). When clicking on this symbol the following form can be seen (see Figure 49).

Figure 49 Adding a new collection

The only field used is “Collection Identifier” at the top. For the example SAHLBERG:UH:FINLAND three “new” collections are created: the first with the collection identifier SAHLBERG, the second with SAHLBERG:UH and the third with SAHLBERG:UH:FINLAND. Every collection is saved separately with the disc symbol in the top right corner (compare Figure 49). When this is done a hierarchical structure has been created (see Figure 50).

Figure 50 Newly added collection Sahlberg

To have a look at the newly added records (via Pentaho) the “Advanced search” or the “Browse” function can be used.

Advanced search

The “Advanced search” is started by clicking on the link at the top of the page (see Figure 51).

Figure 51 Starting the “Advanced Search”

The query can simply be typed into the search box and started by clicking on “Go” (see Figure 52). Furthermore it can be defined in which field the search term should appear (see Figure 53).

Figure 52 Searching for “Sahlberg”

Figure 53 Using the “in the field” search option

If help is needed during the search the “Lookup” function can be used (see Figure 54).

Figure 54 Looking up titles of the collection “Sahlberg”

When the query is complete the “Go” button can be clicked and the results are listed (see Figure 55).

Figure 55 Result of “Advanced Search”

When clicking on one of the result records the ESE record with its different fields can be seen (see Figure 56). When switching to the “Info” tab the collection information can be checked (see Figure 57).

Figure 56 Displaying the ESE record

Figure 57 Displaying the collection information

Browse

The “Browse” function can be used to find records as well. As can be seen in Figure 58, the records can be browsed by Europeana Data Provider, Collection, Europeana Type, Europeana Rights and OAI published. The number of records is shown in brackets.

Figure 58 Browsing the records

To check which records of “Sahlberg” are valid, two queries are combined in the “Advanced Search”. First the name of the collection is chosen. Additionally the records must be “OAI published” (see Figure 59).

Figure 59 Looking for valid records

Error handling

The aim of this chapter is to show possible error scenarios that can occur during the OpenUp! process (see Figure 1 for an overview of the process). In this section the process is divided into the three parts Harvesting, Transforming and OAI-import. On the next pages potential errors and solutions are illustrated.

Harvesting

The bio datasource can not be saved

When trying to edit an already existing datasource and save it afterwards, the error message “Invalid field value for field "bioDatasource.lastHarvested".” could appear (see Figure 60).

Figure 60 Error message when editing a bio datasource

In this case the bio datasource needs to be deleted and a new one has to be created.

After harvesting the metadata updater no operator is created

After creating a new bio datasource the metadata updater (orange) has to be harvested. Only then is an operator (green) created (see Figure 61, first and second line for example).

Figure 61 Metadata updaters and operators

First try clicking on “Recently Added” to check whether the bio datasource has been created (see Figure 62).

Figure 62 Clicking on “Recently Added”

When the datasource is not in the list there is most likely an error concerning the access point URL or the BioCASE protocol. The access URL should be checked again to exclude spelling mistakes. When copying and pasting the URL into a browser a BioCASE protocol should appear (see Figure 63).

Figure 63 BioCASE protocol

The BioCASE protocol should be checked again. If there is a mistake the data provider needs to be contacted. Furthermore the HIT log file can be checked to find out why the operator has not been created.

The harvest does not start and/or there is no progress in the console window

When the console window does not change even though a Job has been started, first make sure the dynamic view is active. If it is not, click on “switch to Dynamic View” (see Figure 64).

Figure 64 Switching to Dynamic View

If this does not help, the Tomcat server may need a restart. This is done by typing “sudo service tomcat6 restart” into a terminal window (see Figure 65). This may take a few seconds.

Figure 65 Restarting the tomcat server

No inventoried list is created [3]

Often the reason no inventoried list could be generated is that the inventory response was empty. From the "Console" tab, the XML requests and responses can be checked directly in the browser. The integrity of the inventoried list is paramount to the success of subsequent harvesting operations.
Ideally the list of scientific names in the inventoried file contains no duplicates and is arranged alphabetically. If the list does not have these characteristics, double check the inventory response(s) to ensure that the names are in fact returned in order.

No name ranges file is created [4]

The only reason that the name ranges file could not be generated is that the inventoried list of scientific names was empty, or that all scientific names were invalid. Note that scientific names containing SQL breaking characters such as "&" are still included, but the breaking characters are replaced automatically. Therefore at this level there is a data quality check on scientific names, and any errors are output as log messages to the "Console".

Often the reason a harvest does not retrieve 100% of a dataset/resource's records is that not all records are covered by the name ranges that have been generated. From the expanded BioDatasource in the BioDatasources list in the "Datasources" tab, you can view the name ranges file directly from within the browser. Compare this file against the inventoried.txt file for any inconsistencies.

Why less than 100 % of the target records are harvested? [5]

There are different reasons why records are dropped (see Figure 66).

[3] https://code.google.com/p/gbif-indexingtoolkit/wiki/UserManual#5.5.1.1_Inventory (20th August, 2013)
[4] https://code.google.com/p/gbif-indexingtoolkit/wiki/UserManual#5.5.1.2_Process_inventoried (20th August, 2013)
[5] http://code.google.com/p/gbif-indexingtoolkit/wiki/FAQ#Why_do_I_harvest_LESS_than_100%_of_the_target_records? (19th July, 2013)

Figure 66 Dropped records are shown in red

It could be because of a problem constructing the name ranges file. To ensure that the proper name ranges were constructed, the name ranges file should be examined for any peculiarities. There could also have been a problem parsing some of the XML responses.
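
The two properties expected of the inventoried list (no duplicates, alphabetical order) can be checked with a short script. The following Python sketch is an illustration only, not part of the HIT. It assumes that inventoried.txt holds one scientific name per line and that "alphabetical" means case-insensitive ordering; the HIT may sort differently, so reported pairs are hints to look at, not proof of an error. The function names are hypothetical.

```python
from collections import Counter
from pathlib import Path


def check_inventoried(names):
    """Return (duplicates, out_of_order) for a list of scientific names.

    duplicates   -- names that occur more than once, sorted
    out_of_order -- (previous, current) pairs where current sorts before previous
    """
    cleaned = [name.strip() for name in names if name.strip()]
    counts = Counter(cleaned)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    out_of_order = [
        (prev, cur)
        for prev, cur in zip(cleaned, cleaned[1:])
        if prev.casefold() > cur.casefold()
    ]
    return duplicates, out_of_order


def check_inventoried_file(path):
    """Run the same checks on a file with one name per line (assumed layout)."""
    text = Path(path).read_text(encoding="utf-8")
    return check_inventoried(text.splitlines())
```

A call such as check_inventoried_file("inventoried.txt"), with the path adjusted to the harvest directory of the datasource, returns two empty lists when the file has both properties.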
The logs can be examined to see whether there were any parsing errors. Also check the actual search response(s) to see that they in fact contain records and that these correspond to the appropriate name range.

Why more than 100 % of the target records are harvested? [6]

One source of an inflated record count in Darwin Core archives can be illegal line-terminating characters (see Figure 67). A record containing such a character breaks in two and appears to the parser as two lines with an insufficient number of columns. These two lines are then replaced by blank lines but are still included in the record count, turning a single record into two. To resolve this, search for lines containing line-terminating characters “inside” the records and remove these characters.

Figure 67 Additionally harvested records are shown in purple

[6] http://code.google.com/p/gbif-indexingtoolkit/wiki/FAQ#Why_do_I_harvest_MORE_than_100%_of_the_target_records? 19th July, 2013

Error in the response document

This can have a number of reasons: one of the table or column names that were set up in the configuration does not exist (because it was renamed or removed), or the credentials used by the BPS do not have sufficient privileges. Simply copy the SQL statement and execute it manually on the database with a regular database client, which will show the detailed error message returned by the DBMS. [7]

For each database you want to publish, you need to set up a BioCASe data source (not to be confused with ODBC data sources on Windows machines). The resulting BioCASe web service is uniquely identified by its URL, which is a combination of the BioCASe installation’s URL and the name of the data source.
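How the two parts combine can be written down in a few lines. The sketch below follows the pattern of the example URL given in this section (pywrapper.cgi with a dsa query parameter); the function name is hypothetical, and the host in the usage note is the placeholder from that example, not a real service.

```python
from urllib.parse import urlencode


def biocase_service_url(installation_url, datasource_name):
    """Combine the installation URL and the data source name into the web service URL."""
    base = installation_url.rstrip("/")  # tolerate a trailing slash
    # urlencode escapes characters that are not allowed in a query string.
    return base + "/pywrapper.cgi?" + urlencode({"dsa": datasource_name})
```

Called with "http://www.foobar.org/biocase" and "Herbar", the function returns the URL used in the example. Because data source names can be case sensitive, the name should be passed exactly as it was set up.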
So if you made your installation available at http://www.foobar.org/biocase during the installation process and set up a data source named Herbar, the URL of the BioCASe web service would be http://www.foobar.org/biocase/pywrapper.cgi?dsa=Herbar (data source names can be case sensitive, depending on your server’s operating system). [8]

Figure 68 shows a possible error message:

Figure 68 SQL error in a response document

The most common reason for an invalid search response is that it contains an XML-breaking character. When a name range representing 500 records fails, for example, it could be due to a single invalid record, and as a result the other 499 records do not get harvested. In an effort to harvest as many records as possible, and to help the user identify where the breaking characters are found, the system breaks a failed request into several smaller requests. Keep a careful eye on the log messages to see which responses are invalid, and provide feedback to the data publisher, which will help them improve the quality of their dataset. [9]

Installing the Biocase Provider on IIS Server [10]

It is possible to run BioCASe on an IIS server (Microsoft HTTP server), but there are some important points to remember:

1) On IIS version 7.5, the maximum length of the query string in an HTTP GET request is limited to 2048 bytes, while the length of the URL is limited to 4096 bytes. This may prevent the harvesting of a dataset. The problem is easily solved by updating the IIS configuration, but it is hard to identify.
2) Some versions of IIS by default disallow the submission of accented characters in HTTP GET queries, especially if IIS is associated with one of the following services:

- Microsoft Exchange
- ISA server
- Microsoft Forefront Threat Management Gateway

This setting may prevent the harvesting of your dataset, because IIS generates an error message instead of publishing the data when a scientific name with accented characters is harvested. If the BPS provider is placed behind a Microsoft Exchange server, an ISA server or a Microsoft Forefront Threat Management Gateway, the problem can be solved by changing the following setting:

1. Start the ISA Server or Microsoft Forefront Threat Management Gateway, Medium Business Edition Management tool.
2. Expand ServerName, where ServerName is the name of your ISA Server or Microsoft Forefront Threat Management Gateway, Medium Business Edition computer.
3. Click Firewall Policy, click the Web publishing rule that you created to publish the Exchange Server computer for access by OWA users, and then click Edit Selected Rule.
4. Click the Traffic tab, click Filtering, and then click Configure HTTP.
5. Click to clear the Block high-bit characters check box, and then click OK two times.
6. Click Apply to update the firewall policy, and then click OK.

[7] http://wiki.bgbm.org/bps/index.php/Debugging 19th July 2013
[8] http://wiki.bgbm.org/bps/index.php/DatasourceSetup 19th July 2013
[9] https://code.google.com/p/gbif-indexingtoolkit/wiki/UserManual#5.5.1.3_Harvest 20th August, 2013
[10] http://open-up.cybertaxonomy.africamuseum.be/forum_topic/issues_when_installing_biocase_provider_iis_server 19th July 2013

The gbif log file can not be opened (permission denied).

The error message can be seen in Figure 69.

Figure 69 Could not open gbif log even file

When this error occurs, the access rights for the directory concerned need to be changed.

An error concerning the title of the datasource occurs

Figure 70 shows the error message in the HIT log.
Figure 70 Error message concerning the datasource title

One possible cause of the problem is the use of the “&” symbol in DataSets/DataSet/Metadata/Description/Representation/Title. It can be replaced by “and” or, if a shorter version is needed, by “+”.

Transforming

An error occurs when trying to execute a transformation

During the execution of a transformation in Pentaho, the output can be seen by clicking on “Logging” (see Figure 71).

Figure 71 Logging section with output

Marked in Figure 71 is the error message “Check if archive exists and is harvested (result = [false])”. The most likely reason is that the value of the transformation parameter dataset_name or dataset_uddi_key is not correct (see Figure 72).

Figure 72 Adapting the parameters

The bio datasource table should be checked again to make sure the values contain no mistakes (see Figure 73).

Figure 73 Checking the table “biodatasource”

OAI-Import

The Transformation stops during the import

After a large number of records have been imported, the zebra index can run into problems. From time to time it is therefore necessary to rebuild the zebra index. This is done in Pentaho with the Transformation “rebuild-idzebra-index” (see Figure 74), which can be found under OpenUp>>programs>>tools.

Figure 74 Opening the Transformation “rebuild-idzebra-index”

This Transformation is started like any other, with the “Play” symbol and the “Launch” button (see Figure 75).

Figure 75 Rebuilding the zebra index

Rebuilding the zebra index takes at least a few hours.

The imported collection can not be found on the OAI-Provider platform

One reason could be that the unique identifier created on the platform differs from the one entered in the parameters in Pentaho. When creating a new collection, it is very important to use the same ID in Pentaho (compare 01-transform and Adding a new collection).
When creating the ID on the OAI-Provider platform, each part of it has to be created separately (see Figure 76).

Figure 76 Creating the ID for a collection

That means, for example, that the collections “BATS”, “BATS:ETI” and “BATS:ETI:NETHERLANDS” are created and saved separately (compare Figure 76). In Pentaho the complete ID has to be entered (BATS:ETI:NETHERLANDS, see Figure 77).

Figure 77 Parameter “collection_name” with the complete ID

I. Quick Guide

GBIF-HIT Harvester

Existing data source
1) Harvesting Metadata Updater (orange colour)
2) Harvesting the Operator (green colour)
estimated time: 5 minutes

New data source
1) Adding a new bioDatasource
2) Harvesting Metadata Updater (orange colour)
3) Harvesting the Operator (green colour)
estimated time: 10 minutes

Pentaho

Existing data source
1) Execute Pentaho transform-Job
2) Execute Pentaho validate-Job
3) Execute Pentaho import-Job
estimated time: 5 minutes

New data source
1) Creating a new directory
2) Creating Pentaho transform-Job
3) Fill in the parameters
4) Execute the transform-Job
5) Creating Pentaho validate-Job
6) Executing Pentaho validate-Job
7) Creating Pentaho import-Job
8) Executing Pentaho import-Job
estimated time: 15 minutes

OAI-Provider

Existing data source
1) Using the Advanced Search or Browse function to check the data
estimated time: 5 minutes

New data source
1) Creating a new collection in the Admin Area
2) Using the Advanced Search or Browse function to check the data
estimated time: 10 minutes

Conclusion
total time existing data source: 15 minutes
total time new data source: 35 minutes

II. List of Figures

Figure 1 Diagram showing the infrastructure of the OpenUp! project with its main steps
Figure 2 Logging in the GBIF Harvesting and Indexing Toolkit (HIT)
Figure 3 The HIT user interface
Figure 4 The Jobs section
Figure 5 The Console section with the Log Event List
Figure 6 The Registry tab
Figure 7 Writing or generating a report
Figure 8 Clicking on “add bioDatasource”
Figure 9 Adding a new datasource
Figure 10 The newly added datasource “Sahlberg”
Figure 11 Job list after scheduling the metadata updater for “Sahlberg”
Figure 12 The Log Event List after scheduling the metadata updater “Sahlberg”
Figure 13 The newly created operator “Sahlberg – Sahlberg”
Figure 14 The Job list after scheduling the operator “Sahlberg – Sahlberg”
Figure 15 Console section during the Inventory operation
Figure 16 The inventory_request of the Inventory operator
Figure 17 The inventory_response of the Inventory operator
Figure 18 Alphabetical list of all scientific names
Figure 19 The nameRanges.txt document
Figure 20 The search operation creates search_requests and search_responses
Figure 21 The unit_records.txt file
Figure 22 The synchronisation and the extractions operations in the Console section
Figure 23 The result of the extraction process
Figure 24 Repository structure in Pentaho
Figure 25 Structure of the “etl” database with four tables
Figure 26 Creating a “Finland” folder in every category
Figure 27 Two Finnish Jobs in the transform category
Figure 28 The Finnish Jobs in the validate category
Figure 29 The Finnish Jobs in the oaiimport category
Figure 30 The Job “Sahlberg” in the transform category
Figure 31 Parameters for Transformation of “Sahlberg” in ESE
Figure 32 The columns “dataset_name” and “dataset_uddi_key” in “Biocase_Harvest_to_ESE”
Figure 33 Parameters for Transformation of “Sahlberg” in EDM
Figure 34 Starting the Job “Sahlberg”
Figure 35 Launching a Job
Figure 36 ABCD records in the folder “extracted” after running the “Sahlberg” Job
Figure 37 The Job “Sahlberg # validate”
Figure 38 Analysis of Pentaho Jobs
Figure 39 Error analysis of the “Sahlberg” transformation
Figure 40 Controlling the transformed data with phpMyAdmin
Figure 41 The Job “Sahlberg # oai import”
Figure 42 Logging in
Figure 43 Login window
Figure 44 “You are now logged in as admin”
Figure 45 Entering the Admin Area
Figure 46 The Admin Area
Figure 47 “Edit Collections”
Figure 48 List Collections
Figure 49 Adding a new collection
Figure 50 Newly added collection Sahlberg
Figure 51 Starting the “Advanced Search”
Figure 52 Searching for “Sahlberg”
Figure 53 Using the “in the field” search option
Figure 54 Looking up titles of the collection “Sahlberg”
Figure 55 Result of “Advanced Search”
Figure 56 Displaying the ESE record
Figure 57 Displaying the collection information
Figure 58 Browsing the records
Figure 59 Looking for valid records
Figure 60 Error message when editing a bio datasource
Figure 61 Metadata updaters and operators
Figure 62 Clicking on “Recently Added”
Figure 63 BioCASE protocol
Figure 64 Switching to Dynamic View
Figure 65 Restarting the tomcat server
Figure 66 Dropped records are shown in red
Figure 67 Additionally harvested records are shown in purple
Figure 68 SQL error in a response document
Figure 69 Could not open gbif log even file
Figure 70 Error message concerning the datasource title
Figure 71 Logging section with output
Figure 72 Adapting the parameters
Figure 73 Checking the table “biodatasource”
Figure 74 Opening the Transformation “rebuild-idzebra-index”
Figure 75 Rebuilding the zebra index
Figure 76 Creating the ID for a collection
Figure 77 Parameter “collection_name” with the complete ID