SoftwareKG - Social

Investigating Software Usage in the Social Sciences - A Knowledge Graph Approach


Description

SoftwareKG_Social is a knowledge graph that contains information about software mention statements from more than 51,000 scientific articles from the social sciences. It enables analysis of the provenance of research results, the attribution of developers, and software citation in general. Additionally, information about whether and how the software and its source code are available allows a general assessment of the state and role of open source software in science.

Software mention extraction

We employed a neural-network-based classifier to identify software usage statements automatically by sequence tagging. For training, we used a three-step approach based on transfer learning. First, we employed publicly available word embeddings that were trained on scientific articles. Second, we used suggestive labels in a silver standard corpus (SSC). Finally, we used a gold standard corpus (GSC) created by manual annotation. The suggestive labels for the entire corpus were created with a handcrafted set of rules, based on prior knowledge from the literature and implemented in the Snorkel data programming framework. A bi-LSTM-CRF was trained on both the SSC and the GSC and then used to identify software mentions in the entire corpus. Spelling variations of software names were linked based on DBpedia and handcrafted rules. Additional information about the availability of the software and its source code was collected manually.
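
To make the silver-standard step more concrete, the following Python sketch shows how a few hand-written rules can vote on whether a token is a software mention. It is an illustration only: the three rules, the tiny name list `KNOWN_SOFTWARE`, and the majority vote are assumptions made for this example. They are not the rules used for SoftwareKG_Social, and Snorkel combines rule outputs with a learned label model rather than a simple vote.

```python
import re

# Hypothetical vote values: 1 = software, 0 = not software, None = abstain.
SOFTWARE, OTHER = 1, 0

# Hypothetical dictionary; not the list used for SoftwareKG_Social.
KNOWN_SOFTWARE = {"spss", "stata", "winbugs", "openbugs"}


def lf_known_name(tokens, i):
    """Vote SOFTWARE if the token is in the small name dictionary."""
    return SOFTWARE if tokens[i].lower() in KNOWN_SOFTWARE else None


def lf_version_follows(tokens, i):
    """Vote SOFTWARE if the token is followed by 'version <number>'."""
    if i + 2 < len(tokens) and tokens[i + 1].lower() == "version":
        if re.fullmatch(r"\d+(\.\d+)*", tokens[i + 2]):
            return SOFTWARE
    return None


def lf_lowercase_word(tokens, i):
    """Vote OTHER for purely lowercase alphabetic tokens."""
    return OTHER if tokens[i].isalpha() and tokens[i].islower() else None


LABELING_FUNCTIONS = [lf_known_name, lf_version_follows, lf_lowercase_word]


def silver_labels(tokens):
    """Label each token by majority vote; ties and all-abstain give OTHER."""
    labels = []
    for i in range(len(tokens)):
        votes = [lf(tokens, i) for lf in LABELING_FUNCTIONS]
        votes = [v for v in votes if v is not None]
        positive = sum(votes)
        labels.append(SOFTWARE if positive * 2 > len(votes) else OTHER)
    return labels


sentence = "Models were estimated using WinBUGS version 1.4".split()
print(list(zip(sentence, silver_labels(sentence))))
```

In this toy sentence only "WinBUGS" receives the software label, because two rules vote for it and none against. Labels produced this way are noisy by design, which is why a manually annotated gold standard corpus is still needed for the final training step.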

Data model

The following figure illustrates the data model of SoftwareKG_Social.

Statistics

Property                                         Value
Triples                                          3,998,194
Resources                                        1,013,216
Distinct types                                   5
Distinct properties                              25
Publications                                     51,165
Software                                         20,227
Authors                                          334,944
Organizations                                    473,229
Software mentions                                133,651
Free software (free/not free)                    133 (75/58)
Source code available (available/not available)  133 (52/81)
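
A few derived figures help put these counts in context. The short Python sketch below only recombines the numbers from the statistics table above; the variable names are chosen for this example.

```python
# Counts copied from the statistics table above.
publications = 51_165
software_mentions = 133_651
free, not_free = 75, 58
source_available, source_unavailable = 52, 81

# Both availability breakdowns cover the same 133 software entries.
assert free + not_free == 133
assert source_available + source_unavailable == 133

mentions_per_publication = software_mentions / publications
share_free = free / (free + not_free)
share_source_available = source_available / (source_available + source_unavailable)

print(f"Mentions per publication: {mentions_per_publication:.2f}")
print(f"Share of free software: {share_free:.1%}")
print(f"Share with source code available: {share_source_available:.1%}")
```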

Dataset

Dataset files

The latest release of SoftwareKG_Social can be downloaded at https://doi.org/10.5281/zenodo.3571350.

SPARQL endpoint

A SPARQL endpoint is available to send SPARQL queries and retrieve results from SoftwareKG.

https://data.gesis.org/softwarekg/sparql
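
The endpoint can also be queried programmatically. The following Python sketch, using only the standard library, builds an HTTP GET request as described by the SPARQL 1.1 Protocol: the query goes into the `query` parameter, and the `Accept` header asks for JSON results. The helper names `build_request` and `run_query` are chosen for this example, and whether the endpoint honours this particular `Accept` header has not been verified here.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://data.gesis.org/softwarekg/sparql"


def build_request(query, endpoint=ENDPOINT):
    """Return a GET request that carries the query in the 'query' parameter."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the query and return the decoded JSON document (needs network)."""
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does not contact the server; run_query() would.
request = build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(request.full_url)
```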

Example 1: Requesting the number of mentions per software per year (only combinations with more than one mention).

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?n ?y (count(?n) as ?count) WHERE {
 ?s rdf:type <http://schema.org/SoftwareApplication> .
 ?s <http://schema.org/name> ?n .
 ?m <http://data.gesis.org/softwarekg/software> ?s .
 ?p <http://schema.org/mentions> ?m .
 ?p <http://purl.org/dc/elements/1.1/date> ?y .
}
GROUP BY ?n ?y
HAVING (count(?n) > 1)
ORDER by DESC(?count)
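
Results of such SELECT queries are commonly returned in the SPARQL 1.1 Query Results JSON Format, where each row is a set of variable bindings. The Python sketch below flattens such a document into plain dictionaries. The payload `sample` is invented for illustration and does not reproduce actual results from SoftwareKG_Social; the helper name `rows` is likewise an assumption.

```python
def rows(result):
    """Turn a SPARQL JSON result document into a list of {variable: value} dicts."""
    variables = result["head"]["vars"]
    table = []
    for binding in result["results"]["bindings"]:
        # Unbound variables are simply missing from a binding, so use None.
        table.append(
            {v: binding[v]["value"] if v in binding else None for v in variables}
        )
    return table


# Made-up payload in the standard layout; the names and counts are placeholders.
sample = {
    "head": {"vars": ["n", "y", "count"]},
    "results": {
        "bindings": [
            {
                "n": {"type": "literal", "value": "ExampleTool"},
                "y": {"type": "literal", "value": "2015"},
                "count": {
                    "type": "literal",
                    "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                    "value": "12",
                },
            },
            {"n": {"type": "literal", "value": "OtherTool"}},
        ]
    },
}

for row in rows(sample):
    print(row)
```

All values arrive as strings in this format, so counts need an explicit `int(...)` conversion before plotting or sorting numerically.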

Example 2: Requesting the number of mentions per year, grouped by whether the software is freely available.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT (count(?n) as ?count) ?free ?y WHERE {
 ?s rdf:type <http://schema.org/SoftwareApplication> .
 ?s <http://data.gesis.org/softwarekg/freeAvailable> ?free .
 ?s <http://schema.org/name> ?n .
 ?m <http://data.gesis.org/softwarekg/software> ?s .
 ?p <http://schema.org/mentions> ?m .
 ?p <http://purl.org/dc/elements/1.1/date> ?y .
}
GROUP BY ?y ?free

Example 3: Requesting the number of mentions per year, grouped by whether the software's source code is available.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT (count(?n) as ?count) ?source ?y WHERE {
 ?s rdf:type <http://schema.org/SoftwareApplication> .
 ?s <http://data.gesis.org/softwarekg/sourceAvailable> ?source .
 ?s <http://schema.org/name> ?n .
 ?m <http://data.gesis.org/softwarekg/software> ?s .
 ?p <http://schema.org/mentions> ?m .
 ?p <http://purl.org/dc/elements/1.1/date> ?y .
}
GROUP BY ?y ?source

Example 4: Requesting the number of mentions of the software OpenBUGS per year.

SELECT ?y (count(?y) as ?count) WHERE {
 ?s <http://schema.org/name> "OpenBUGS" .
 ?m <http://data.gesis.org/softwarekg/software> ?s .
 ?p <http://schema.org/mentions> ?m .
 ?p <http://purl.org/dc/elements/1.1/date> ?y .
}
GROUP BY ?y

Example 5: Requesting the number of mentions of the software WinBUGS per year.

SELECT ?y (count(?y) as ?count) WHERE {
 ?s <http://schema.org/name> "WinBUGS" .
 ?m <http://data.gesis.org/softwarekg/software> ?s .
 ?p <http://schema.org/mentions> ?m .
 ?p <http://purl.org/dc/elements/1.1/date> ?y .
}
GROUP BY ?y
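
Examples 4 and 5 differ only in the software name, so the query text can be generated from a template. The Python sketch below does this; the function name `mentions_per_year_query` and the escaping helper `escape_literal` are assumptions made for this example, and the helper only handles backslashes and double quotes.

```python
# Query text of Examples 4 and 5 with the software name as a placeholder.
# Doubled braces are literal braces for str.format().
QUERY_TEMPLATE = """SELECT ?y (count(?y) as ?count) WHERE {{
 ?s <http://schema.org/name> "{name}" .
 ?m <http://data.gesis.org/softwarekg/software> ?s .
 ?p <http://schema.org/mentions> ?m .
 ?p <http://purl.org/dc/elements/1.1/date> ?y .
}}
GROUP BY ?y"""


def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def mentions_per_year_query(name):
    """Return the per-year mention query for one software name."""
    return QUERY_TEMPLATE.format(name=escape_literal(name))


for software in ("OpenBUGS", "WinBUGS"):
    print(mentions_per_year_query(software))
    print()
```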

Example 6: Requesting the number of software mentions per year for free software in Wikidata that has been replaced by another software.

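PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>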
PREFIX wds: <http://www.wikidata.org/entity/statement/>
PREFIX wdv: <http://www.wikidata.org/value/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX bd: <http://www.bigdata.com/rdf#>
SELECT DISTINCT ?software ?y (count(?y) AS ?count) ?itemLabel ?item_s_1Label WHERE {{
 ?s rdf:type <http://schema.org/SoftwareApplication> .
 ?s <http://schema.org/name> ?software .
 ?s <http://schema.org/sameAs> ?link .
 ?m <http://data.gesis.org/softwarekg/software> ?s .
 ?p <http://schema.org/mentions> ?m .
 ?p <http://purl.org/dc/elements/1.1/date> ?y .
 FILTER (regex(str(?link), "wikidata" ))}
 {
  SERVICE <https://query.wikidata.org/bigdata/namespace/wdq/sparql> {
  ?item p:P31/wdt:P279* ?item_s_0Statement .
  ?item_s_0Statement ps:P31/wdt:P279* wd:Q178285 .
  ?item p:P1366 ?item_s_1Statement .
  ?item_s_1Statement ps:P1366 ?item_s_1.
  ?item rdfs:label ?itemLabel.
  FILTER(LANG(?itemLabel) = "en").
  ?item_s_1 rdfs:label ?item_s_1Label.
  FILTER(LANG(?item_s_1Label) = "en").
  OPTIONAL {?item wdt:P487 ?itemUnicodecharacter .}
  OPTIONAL {?item wdt:P31 ?iteminstanceof .}
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . } }
 }
 BIND (STRAFTER(str(?link), "https://www.wikidata.org/wiki/") as ?l1)
 BIND (STRAFTER(str(?item), "http://www.wikidata.org/entity/") as ?l2)
 BIND (STRAFTER(str(?item_s_1), "http://www.wikidata.org/entity/") as ?l3)
 FILTER (STR(?l1) = STR(?l2) || STR(?l1) = STR(?l3))
}
GROUP BY ?software ?y ?itemLabel ?item_s_1Label
ORDER BY DESC(?y)
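
The BIND and FILTER lines at the end of Example 6 join the two data sources on the Wikidata item identifier: the part after `https://www.wikidata.org/wiki/` in the local `sameAs` link is compared with the part after `http://www.wikidata.org/entity/` in the Wikidata item and in the item that replaced it. The Python sketch below mirrors that comparison; the function names and the identifiers `Q1`, `Q2`, and `Q3` are placeholders, not data from the graph.

```python
WIKI_PREFIX = "https://www.wikidata.org/wiki/"
ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def str_after(text, marker):
    """Mimic SPARQL STRAFTER: text after the first marker, or '' if absent."""
    _, found, tail = text.partition(marker)
    return tail if found else ""


def links_match(link, item, successor):
    """True if the sameAs link names the item or the item that replaced it."""
    l1 = str_after(link, WIKI_PREFIX)
    l2 = str_after(item, ENTITY_PREFIX)
    l3 = str_after(successor, ENTITY_PREFIX)
    return l1 == l2 or l1 == l3


# The link points at the replaced item, so the rows are joined.
print(links_match(WIKI_PREFIX + "Q1", ENTITY_PREFIX + "Q1", ENTITY_PREFIX + "Q2"))
# The link points at an unrelated item, so the rows are not joined.
print(links_match(WIKI_PREFIX + "Q3", ENTITY_PREFIX + "Q1", ENTITY_PREFIX + "Q2"))
```

The string comparison is needed because the two sources use different URL schemes and paths for the same item, so the full URIs never compare equal.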

Analysis results

The following figure shows the absolute and relative number of mentions per year for the 10 most frequently mentioned software.

The absolute frequency of the usage of commercial, free, and open source software is shown in the figure below.

The following figure shows the absolute number of usages of OpenBUGS and WinBUGS.

Source code

The source code of all components of SoftwareKG_Social is available on GitHub at https://github.com/dave-s477/softwareKG/.

License

The dataset is published under a Creative Commons Attribution 4.0 license.

Publications

Schindler D., Zapilko B., Krüger F. (2020) Investigating Software Usage in the Social Sciences: A Knowledge Graph Approach. In: Harth A. et al. (eds) The Semantic Web. ESWC 2020. Lecture Notes in Computer Science, vol 12123. Springer, Cham. https://doi.org/10.1007/978-3-030-49461-2_16

Contact

Please provide your feedback and any comments by sending an email to softwarekg (at) gesis (dot) org

About Us

David Schindler, ORCID ID: 0000-0003-4203-8851, University of Rostock, https://www.int.uni-rostock.de/
Benjamin Zapilko, ORCID ID: 0000-0001-9495-040X, GESIS - Leibniz Institute for the Social Sciences (Germany), https://www.gesis.org/
Frank Krüger, ORCID ID: 0000-0002-7925-3363, University of Rostock, https://www.int.uni-rostock.de/