SoftwareKG_Social is a knowledge graph that contains information about software mention statements from more than 51,000 scientific articles from the social sciences. It enables analyses of the provenance of research results, the attribution of developers, and software citation in general. In addition, information about whether and how the software and its source code are available allows a general assessment of the state and role of open source software in science.
We employed a neural-network-based classifier to identify software usage statements automatically by sequence tagging. For training, we used a three-step approach based on transfer learning: first, we employed publicly available word embeddings that were trained on scientific articles; second, we used suggestive labels in a silver standard corpus (SSC); and third, we used a gold standard corpus (GSC) created by manual annotation. A handcrafted set of rules based on prior knowledge from the literature was implemented in the Snorkel data programming framework and subsequently used to create the suggestive labels for the entire corpus. A bi-LSTM-CRF was trained on both the SSC and the GSC and then used to identify software mentions in the entire corpus. Spelling variations of software names were linked based on DBpedia and handcrafted rules. Additional information about the availability of the software and its source code was collected manually.
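To make the idea of suggestive labels more concrete, the following sketch applies a single hypothetical rule to a tokenized sentence and emits BIO-style tags. The rule (a capitalized token directly followed by a trigger word such as "version"), the trigger words, the tag name `B-software`, and the function name are assumptions for illustration only; they are not the rules used for SoftwareKG_Social, and the sketch does not use the Snorkel API.

```python
# Hypothetical trigger words; the actual rule set is not reproduced here.
CONTEXT_WORDS = {"software", "version", "package"}


def suggest_labels(tokens):
    """Assign a suggestive tag to every token using one toy rule.

    A token is tagged "B-software" if it starts with an uppercase letter
    and the next token is one of the trigger words; all others get "O".
    """
    labels = ["O"] * len(tokens)
    for i in range(len(tokens) - 1):
        starts_upper = tokens[i][:1].isupper()
        followed_by_trigger = tokens[i + 1].lower() in CONTEXT_WORDS
        if starts_upper and followed_by_trigger:
            labels[i] = "B-software"
    return labels


tokens = "We analysed the data with SPSS version 22 .".split()
print(list(zip(tokens, suggest_labels(tokens))))
# Only "SPSS" is tagged "B-software"; every other token is tagged "O".
```

Labels produced this way are noisy by design, which is why they serve only as a silver standard before training on the manually annotated gold standard.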
In the following figure, the data model of SoftwareKG_Social is illustrated.
Property | Value |
---|---|
Triples | 3,998,194 |
Resources | 1,013,216 |
Distinct types | 5 |
Distinct properties | 25 |
Publications | 51,165 |
Software | 20,227 |
Authors | 334,944 |
Organizations | 473,229 |
Software mentions | 133,651 |
Free software (free/not free) | 133 (75/58) |
Source code available (available/not available) | 133 (52/81) |
The latest release of SoftwareKG_Social can be downloaded at https://doi.org/10.5281/zenodo.3571350.
A SPARQL endpoint is available to send SPARQL queries and retrieve results from SoftwareKG.
https://data.gesis.org/softwarekg/sparql
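A minimal sketch of querying the endpoint from Python with the standard library only. It assumes that the endpoint follows the SPARQL 1.1 Protocol (a `query` parameter sent via HTTP GET) and can return results in the SPARQL JSON results format when asked through the `Accept` header; this has not been verified here. The helper names `build_request`, `rows`, and `run_query` are illustrative and not part of SoftwareKG.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://data.gesis.org/softwarekg/sparql"  # URL given on this page


def build_request(query, endpoint=ENDPOINT):
    """Build an HTTP GET request carrying the query (SPARQL 1.1 Protocol assumed)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows(result_json):
    """Flatten a SPARQL JSON result document into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in result_json["results"]["bindings"]
    ]


def run_query(query, endpoint=ENDPOINT, timeout=60):
    """Send the query and return the flattened rows (needs network access)."""
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return rows(json.load(response))
```

Calling `run_query(...)` with any of the example queries below as a string should then yield one dictionary per result row, keyed by the variable names in the `SELECT` clause.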
Example 1: Requesting all mentions per software per year. (Result)
SELECT ?n ?y (count(?n) as ?count) WHERE {
?s rdf:type <http://schema.org/SoftwareApplication> .
?s <http://schema.org/name> ?n .
?m <http://data.gesis.org/softwarekg/software> ?s .
?p <http://schema.org/mentions> ?m .
?p <http://purl.org/dc/elements/1.1/date> ?y .
}
GROUP BY ?n ?y
HAVING (count(?n) > 1)
ORDER by DESC(?count)
Example 2: Requesting all mentions of freely available software per year. (Result)
SELECT (count(?n) as ?count) ?free ?y WHERE {
?s rdf:type <http://schema.org/SoftwareApplication> .
?s <http://data.gesis.org/softwarekg/freeAvailable> ?free .
?s <http://schema.org/name> ?n .
?m <http://data.gesis.org/softwarekg/software> ?s .
?p <http://schema.org/mentions> ?m .
?p <http://purl.org/dc/elements/1.1/date> ?y .
}
GROUP BY ?y ?free
Example 3: Requesting all mentions of software whose source code is available per year. (Result)
SELECT (count(?n) as ?count) ?source ?y WHERE {
?s rdf:type <http://schema.org/SoftwareApplication> .
?s <http://data.gesis.org/softwarekg/sourceAvailable> ?source .
?s <http://schema.org/name> ?n .
?m <http://data.gesis.org/softwarekg/software> ?s .
?p <http://schema.org/mentions> ?m .
?p <http://purl.org/dc/elements/1.1/date> ?y .
}
GROUP BY ?y ?source
Example 4: Requesting all mentions of the software OpenBUGS per year. (Result)
SELECT ?y (count(?y) as ?count) WHERE {
?s <http://schema.org/name> "OpenBUGS" .
?m <http://data.gesis.org/softwarekg/software> ?s .
?p <http://schema.org/mentions> ?m .
?p <http://purl.org/dc/elements/1.1/date> ?y .
}
GROUP BY ?y
Example 5: Requesting all mentions of the software WinBUGS per year. (Result)
SELECT ?y (count(?y) as ?count) WHERE {
?s <http://schema.org/name> "WinBUGS" .
?m <http://data.gesis.org/softwarekg/software> ?s .
?p <http://schema.org/mentions> ?m .
?p <http://purl.org/dc/elements/1.1/date> ?y .
}
GROUP BY ?y
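Examples 4 and 5 differ only in the software name, so the query text can be generated for any name. The sketch below builds that query string in Python; the function name `mentions_per_year_query` is hypothetical, and the predicates are copied from the examples above. It only produces the query text and does not contact the endpoint.

```python
def mentions_per_year_query(software_name):
    """Return the query from Examples 4 and 5 for an arbitrary software name."""
    # Escape backslashes and double quotes so that the name remains a valid
    # SPARQL string literal.
    literal = software_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?y (count(?y) as ?count) WHERE {\n"
        f'?s <http://schema.org/name> "{literal}" .\n'
        "?m <http://data.gesis.org/softwarekg/software> ?s .\n"
        "?p <http://schema.org/mentions> ?m .\n"
        "?p <http://purl.org/dc/elements/1.1/date> ?y .\n"
        "}\n"
        "GROUP BY ?y"
    )


print(mentions_per_year_query("OpenBUGS"))
```

Because the name is matched as an exact literal, spelling variants that were not linked to the same software entity in the graph would not be counted.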
Example 6: Requesting the number of software mentions per year for free software in Wikidata that has been replaced by another software. (Results)
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wds: <http://www.wikidata.org/entity/statement/>
PREFIX wdv: <http://www.wikidata.org/value/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX bd: <http://www.bigdata.com/rdf#>
SELECT DISTINCT ?software ?y (count(?y) AS ?count) ?itemLabel ?item_s_1Label WHERE
{
{
?s rdf:type <http://schema.org/SoftwareApplication> .
?s <http://schema.org/name> ?software .
?s <http://schema.org/sameAs> ?link .
?m <http://data.gesis.org/softwarekg/software> ?s .
?p <http://schema.org/mentions> ?m .
?p <http://purl.org/dc/elements/1.1/date> ?y .
FILTER (regex(str(?link), "wikidata" ))
}
{
SERVICE <https://query.wikidata.org/bigdata/namespace/wdq/sparql>
{
?item p:P31/wdt:P279* ?item_s_0Statement .
?item_s_0Statement ps:P31/wdt:P279* wd:Q178285 .
?item p:P1366 ?item_s_1Statement .
?item_s_1Statement ps:P1366 ?item_s_1.
?item rdfs:label ?itemLabel.
FILTER(LANG(?itemLabel) = "en").
?item_s_1 rdfs:label ?item_s_1Label.
FILTER(LANG(?item_s_1Label) = "en").
OPTIONAL {?item wdt:P487 ?itemUnicodecharacter .}
OPTIONAL {?item wdt:P31 ?iteminstanceof .}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
}
BIND (STRAFTER(str(?link), "https://www.wikidata.org/wiki/") as ?l1)
BIND (STRAFTER(str(?item), "http://www.wikidata.org/entity/") as ?l2)
BIND (STRAFTER(str(?item_s_1), "http://www.wikidata.org/entity/") as ?l3)
FILTER (STR(?l1) = STR(?l2) || STR(?l1) = STR(?l3))
}
GROUP BY ?software ?y ?itemLabel ?item_s_1Label
ORDER BY DESC(?y)
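The final `BIND`/`FILTER` step of Example 6 joins the two sides by comparing the identifier at the end of the `sameAs` link with the identifiers of the Wikidata item and its successor. The following Python sketch mirrors that comparison outside of SPARQL; the function names and the identifiers `Q1`, `Q2`, and `Q3` are placeholders chosen for illustration, not data from the graph.

```python
WIKI_PREFIX = "https://www.wikidata.org/wiki/"
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def strafter(text, marker):
    """Mimic SPARQL STRAFTER: text after the first occurrence of marker, else ''."""
    if marker == "":
        return text
    _, found, rest = text.partition(marker)
    return rest if found else ""


def link_matches(link, item, successor):
    """Return True if the link refers to the item or to its successor."""
    l1 = strafter(link, WIKI_PREFIX)
    l2 = strafter(item, ENTITY_PREFIX)
    l3 = strafter(successor, ENTITY_PREFIX)
    return l1 == l2 or l1 == l3


# Placeholder identifiers: the link points at the replaced item.
print(link_matches(WIKI_PREFIX + "Q1", ENTITY_PREFIX + "Q1", ENTITY_PREFIX + "Q2"))
```

As in the query, a mention is kept when the linked software is either the replaced item or the software that replaced it.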
The following figure shows the absolute and relative frequencies of the 10 most common software applications per year.
Absolute frequency of the usage of commercial, free, and open source software is shown in the figure below.
The following figure shows the absolute usages of OpenBUGS and WinBUGS.
The source code of all components of SoftwareKG_Social is available on GitHub at https://github.com/dave-s477/softwareKG/.
The dataset is published under a Creative Commons Attribution 4.0 license.
Schindler D., Zapilko B., Krüger F. (2020) Investigating Software Usage in the Social Sciences: A Knowledge Graph Approach. In: Harth A. et al. (eds) The Semantic Web. ESWC 2020. Lecture Notes in Computer Science, vol 12123. Springer, Cham. https://doi.org/10.1007/978-3-030-49461-2_16
Please provide your feedback and any comments by sending an email to softwarekg (at) gesis (dot) org
David Schindler, ORCID ID: 0000-0003-4203-8851, University of Rostock, https://www.int.uni-rostock.de/
Benjamin Zapilko, ORCID ID: 0000-0001-9495-040X, GESIS - Leibniz Institute for the Social Sciences (Germany), https://www.gesis.org/
Frank Krüger, ORCID ID: 0000-0002-7925-3363, University of Rostock, https://www.int.uni-rostock.de/