The proposal “linked.swissbib.ch” was accepted by SUC P-2 (“Scientific information: access, processing and safeguarding”) as one of the projects for building the future infrastructure for access to scientific digital information.
The application was submitted by the Haute école spécialisée de Suisse occidentale (HES-SO), Haute Ecole de Gestion, Genève, in cooperation with HTW Chur and the Universitätsbibliothek Basel (project swissbib).
One of the aims of the project is the transformation of the current swissbib content into a semantic data format (RDF) to enable and simplify the linking of swissbib data with external information.
Finally, the services created by linked.swissbib.ch should run on the infrastructure provided by swissbib.ch, giving scientists and general library users easy access to the linked information. Additionally, the infrastructure should be available and usable for any other project or service that wants to work on similar questions (catchphrase: “swissbib as a workbench”).
I. General requirements for future tools and the infrastructure to be created by linked.swissbib.ch (and swissbib classic)
a) The software has to transform any kind of meta-data structure into another (meta-data) structure.
People working with meta-data transformations in the library world (scripting knowledge is helpful but not necessary) should be able to define workflows for their own purposes. Recently I heard that the phrase “biblio information scientist” is sometimes used as an expression for such a group of people.
The mechanisms should be available on a desktop (for example, for a person developing a workflow for meta-data transformations), and at the same time it should be possible to use the same artifact on a cluster to process millions of items in reasonable time. Currently the classic swissbib content consists of around 20 million (merged) records. We are charged with including licensed article meta-data in the swissbib data-hub; this content (potentially between 30 and 150 million descriptions) should also be transformed and serialized into a semantic RDF format.
II. Possible software alternatives to reach the defined goals
A. Infrastructure developed by the Culturegraph project (German National library)
1) How are the main components of Culturegraph related to the general requirements defined above?
The framework makes the solution re-usable and expandable. It defines general classes and interfaces which can be used by others (swissbib too) to extend the solution in the way their aims require.
As an example: the metafacture-core component provides commands which are combined into FLUX workflows (we will see a description of this later). HBZ extended these commands with some of their own to serve their needs.
Besides new commands that become part of a workflow, you can define new types as functions within the Metamorph domain-specific language.
The same mechanism, writing specialized functionality as dedicated commands alongside the metafacture-core commands, is used by the Culturegraph project itself. Connectors already exist for SQL databases, for NoSQL stores like the widely used MongoDB, and for including Mediawiki content in meta-data transformations. All these repositories are available on GitHub.
b) (Meta)-Morph transformation language
My guess (at the moment, I’m still in the evaluation phase): Metamorph enables you to define at least most of the transformations required for our needs. For most of the work, knowledge of programming languages isn’t necessary; you have to know your meta-data structures and how you want to transform them.
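To give a flavor of what such a declarative definition looks like, here is a minimal Metamorph sketch. The source pattern and the output name are purely illustrative (not taken from an actual swissbib workflow), and the exact field addressing depends on the decoder used upstream:

```xml
<!-- Illustrative Metamorph definition: map a MARC title subfield to a
     literal named "title". Source pattern and name are hypothetical. -->
<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
  <rules>
    <data source="245??.a" name="title"/>
  </rules>
</metamorph>
```

The point is that such a file is pure declaration: the transformation rules live outside the Java implementation and can be written and maintained by people who know the meta-data, not the programming language.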
In the mid-term perspective I have in mind using this kind of solution not only for transformations into RDF but also for already existing processes in the classic swissbib service (CBS FCV transformations and search-engine document processing for SOLR/Elasticsearch, for example).
I would describe FLUX as a “macro language” (similar to macros in office programs) for the definition of processing pipelines. Here you can find an example.
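As a rough illustration of the idea (file names and the exact chain of commands are placeholders, not a tested swissbib workflow), a FLUX definition chains commands with the pipe character:

```
// Illustrative FLUX sketch: read a MARC-XML file, apply a Metamorph
// definition and serialize the result. File names are placeholders.
"records.marcxml" |
open-file |
decode-xml |
handle-marcxml |
morph("title-morph.xml") |
encode-formeta |
write("records.formeta");
```

Each stage is a self-contained module; swapping the encoder or the morph file changes the output without touching the rest of the pipeline.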
The following picture gives you an overview and an idea of how heterogeneous article meta-data (provided by publishers) could be normalized in processes based on Metafacture.
To support the creation of FLUX workflows (especially for non-developers), HBZ implemented an Eclipse extension. This extension gives you suggestions for commands (together with their arguments) as part of whole workflows. Additionally, you can start a FLUX process under the control of the extension. Fabian Steeg (HBZ, creator of the extension) wrote a helpful summary of metafacture-ide and metafacture-core from his point of view.
2) Additional relations to the defined goals
As I tried to describe, the main parts of Metafacture (and extended components based on the core framework) address the requirements of goals a), b) and f)
(meta-data transformation / reusable, expandable and future-proof / not only for software developers).
But what about the other defined goals:
– The software is open source, originally developed by the Deutsche Nationalbibliothek, and available via GitHub and Maven.
– The software (Metafacture) isn’t in its phase of infancy, and a community has been established around it. Currently deployed as version 2.0, it is the heart of the linked data service provided by the DNB. It is already re-used by other institutions for their services (especially HBZ for the lobid.org service and SLUB Dresden for their EU-financed “Data Management Platform for the Automatic Linking of Library Data”).
– The infrastructure should have the potential to be very scalable.
Transformation processes for meta-data are currently often based on traditional techniques like relational databases, which are proven and mature. For example, a lot of sophisticated meta-data transformation processes were created by swissbib classic in the last five years.
We don’t have in mind throwing away these possibilities and the experience collected in the past. And even for the remarkably larger amount of meta-data for licensed articles, we think it makes sense, and is possible, to use them by increasing our (virtual) hardware.
But (probably soon) there will come a point in time when newer possibilities should be used (at least in parallel). These possibilities are often labeled with the catchphrase “Big Data”, and behind this catchphrase are mostly Hadoop for distributed processing and distributed (NoSQL) storages like HBase.
Because Metafacture workflows are stream-based (combining several modules into a chain), they are well prepared to run on a Hadoop cluster. Culturegraph has already done this with their component called metafacture-cluster.
This technique made it possible for them to bring together the bibliographic resources of all the library networks in Germany and to build clusters of equal or similar data (which is a foundation of their Linked Data service) really fast. Swissbib classic actually does this within our current data-hub based on traditional technologies (see above). Here you can find a more detailed view of the architecture of the current swissbib.
We have to keep this aspect (processing of really large amounts of data) in mind. Perhaps not immediately when it comes to article data, but with greater probability if research data should be incorporated into a data-hub like swissbib.
B. Toolset developed by LibreCat (with Catmandu as the Backbone)
I don’t have any experience with LibreCat.
One sentence has stuck in my mind:
“Because Culturegraph is based on Java, they (Culturegraph) use a lot of XML, while LibreCat (implemented in Perl) can do the same much faster, directly in the code.”
I guess it’s true as long as it’s true… there are some misconceptions from my point of view:
a) XML is not used because the core of Metafacture is implemented in Java; Java and XML aren’t correlated in any way. Metafacture uses XML because it is the means (nothing more) to express the DSL (Domain-Specific Language, Metamorph) used to define meta-data transformations. This makes it possible for people without advanced programming skills to define such transformations on a higher level.
b) A clear separation between the implementation and the definitions of meta-data transformations (the rules) is important to make the system better maintainable in the long term, although it’s probably faster and easier “for the developer himself” to implement transformations in Perl.
– maybe a very subjective point of view:
Today I wouldn’t implement a system in Perl. If a dynamic scripting language were called for, my recommendation would be something like Python or Ruby. Although things change pretty fast nowadays, my impression is that Perl isn’t often used by younger people. It might be different in the “current” library world, but my assumption is that this is going to change in the future.
– Especially when it comes to the processing of large amounts of data, I wouldn’t use dynamic scripting languages. Execution performance is one thing; the other is that integration with clusters for large amounts of data is less well supported.
Nevertheless: we should keep an eye on the development of the project. A noticeable group of people (able to contribute) is interested in the project, and as I tried to depict in my picture above, I guess we could integrate it (if reasonable) into an infrastructure like Metafacture.
Another aspect: if it is possible to re-use the work on normalization of licensed article meta-data done by GBV, we have to be “Java compatible”. If I’m not mistaken, that work is done on the basis of this programming language.