Revealed relatedness between individual surnames

A focus on the spatial co-occurrence of surnames makes this paper distinct from previous studies. The bulk of the literature typically concerns pairwise comparisons between spatially defined populations based on the (dis)similarity of their respective surname compositions [5], [6], [14], [15]. Such approaches seek to establish the extent to which two or more geographic areas share the same pool of surnames and therefore offer comparisons between spatial units rather than between the surnames themselves. Here, we apply two modifications of a measure of pairwise relatedness used in the very different context of the analysis of international trade [16]. These measures are novel in the context of surname analysis and we have found them to work better for our purposes than the traditional “genetic distance” measures such as the Lasker or Nei's indices (outlined in [17]). The approach adopted here is a departure from previous research in the sense that the spatial distributions of individual surnames are the key input; regional patterns emerge as groupings in the surname space.

With traditional methods, broad surname regions can be reliably produced but at the risk of subsuming some of the smaller groups of surnames with non-contiguous spatial patterning. Migrant surnames may, for example, be well represented in these smaller groups and are therefore more easily isolated here than when using a traditional measure to produce more aggregate results. Improved granularity comes at the expense of increased computing overheads and a far more complex result (due to its larger number of comparisons), but we feel that the capability to handle and interpret such outputs is increasing all the time and, as such, the methodology will become more widely applicable.

The first step in defining a surname spatial similarity measure is the selection of an appropriate form of input data for describing the occurrence of individual surnames in particular regions. 
A simple consideration of the absolute numbers of bearers would be inappropriate in the present context because the size of subpopulations of individual surnames varies immensely. A better metric that accounts for both the spatial concentration and the ubiquity of individual surnames is the location quotient (LQ). For individual surnames ($i$) and regions ($r$), respectively, it can be expressed as:

$$LQ_{i,r} = \frac{F_{i,r} \big/ \sum_{i} F_{i,r}}{\sum_{r} F_{i,r} \big/ \sum_{i}\sum_{r} F_{i,r}} \tag{1}$$

where $F_{i,r}$ stands for the absolute number of bearers of the surname $i$ in the region $r$. The $LQ_{i,r}$ compares the share of people with the surname $i$ in the population of the region $r$ with the share of this surname in the whole population at a more aggregate level. An $LQ_{i,r} > 1$ indicates that the surname in question is more prevalent in the region $r$ than in the whole population (below we simply say that the surname concentrates in the region $r$).

In the second step, the LQ is used for the expression of the pairwise measures of revealed relatedness between surnames. For this paper the Jaccard and Dice similarity measures were examined. Here the Jaccard measure establishes the number of regions where both of the two analyzed surnames are concentrated relative to the number of regions where at least one of them concentrates. The Jaccard measure of the revealed relatedness between the two surnames $i$ and $j$, when focusing on their co-occurrence over the regions $r$, is defined as:

$$J_{i,j} = \frac{\left|\{r : LQ_{i,r} > 1 \wedge LQ_{j,r} > 1\}\right|}{\left|\{r : LQ_{i,r} > 1 \vee LQ_{j,r} > 1\}\right|} \tag{2}$$

where the numerator accounts for the number of regions that satisfy both $LQ_{i,r} > 1$ and $LQ_{j,r} > 1$, while the denominator refers to the number of regions satisfying at least one of these inequalities. The measure falls between 0 and 1, with the upper bound signifying that the two surnames in question are concentrated solely in identical regions. 
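The location quotient and the Jaccard measure described above can be illustrated with a minimal Python sketch. The toy counts, the surname and region labels, and the function names (`location_quotients`, `concentration_sets`, `jaccard`) are assumptions made for illustration only; they are not the data or the code used in the paper.

```python
from collections import defaultdict

def location_quotients(counts):
    """counts maps (surname, region) -> number of bearers F[i, r].
    Returns a dict mapping (surname, region) -> LQ[i, r]."""
    region_total = defaultdict(int)   # sum of F[i, r] over surnames i
    surname_total = defaultdict(int)  # sum of F[i, r] over regions r
    grand_total = 0
    for (surname, region), bearers in counts.items():
        region_total[region] += bearers
        surname_total[surname] += bearers
        grand_total += bearers
    lq = {}
    for (surname, region), bearers in counts.items():
        local_share = bearers / region_total[region]
        overall_share = surname_total[surname] / grand_total
        lq[(surname, region)] = local_share / overall_share
    return lq

def concentration_sets(lq):
    """Map each surname to the set of regions where its LQ exceeds 1."""
    regions = defaultdict(set)
    for (surname, region), value in lq.items():
        if value > 1:
            regions[surname].add(region)
    return dict(regions)

def jaccard(regions_i, regions_j):
    """Regions where both surnames concentrate / regions where at least one does."""
    union = regions_i | regions_j
    if not union:
        return 0.0
    return len(regions_i & regions_j) / len(union)

# Hypothetical counts for three surnames in four regions (A-D).
counts = {
    ("Novak", "A"): 40, ("Novak", "B"): 30, ("Novak", "C"): 5, ("Novak", "D"): 5,
    ("Novotny", "A"): 30, ("Novotny", "B"): 10, ("Novotny", "C"): 10, ("Novotny", "D"): 10,
    ("Dvorak", "A"): 10, ("Dvorak", "B"): 10, ("Dvorak", "C"): 40, ("Dvorak", "D"): 20,
}
lq = location_quotients(counts)
regions = concentration_sets(lq)
j_novak_novotny = jaccard(regions["Novak"], regions["Novotny"])
j_novak_dvorak = jaccard(regions["Novak"], regions["Dvorak"])
```

In this toy example the first two surnames share one region of concentration out of three in which at least one of them concentrates, while the first and third share none.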
In this context, the first asymmetric Dice measure captures the probability that surname $i$ concentrates in the region $r$ conditional on the concentration of surname $j$ in the same region:

$$D_{i|j} = P\left(LQ_{i,r} > 1 \mid LQ_{j,r} > 1\right) \tag{3}$$

$$D_{i|j} = \frac{\left|\{r : LQ_{i,r} > 1 \wedge LQ_{j,r} > 1\}\right|}{\left|\{r : LQ_{j,r} > 1\}\right|} \tag{4}$$

Similarly, the second Dice measure calculates the probability that surname $j$ concentrates in the region $r$ conditional on the concentration of surname $i$ in the same region:

$$D_{j|i} = P\left(LQ_{j,r} > 1 \mid LQ_{i,r} > 1\right) \tag{5}$$

$$D_{j|i} = \frac{\left|\{r : LQ_{i,r} > 1 \wedge LQ_{j,r} > 1\}\right|}{\left|\{r : LQ_{i,r} > 1\}\right|} \tag{6}$$

For the present purpose we need a symmetric measure of relatedness and thus consider the smaller of the two asymmetric Dice measures presented above. As such, we define the symmetric Dice measure of revealed relatedness between the surnames $i$ and $j$ as:

$$D_{i,j} = \min\left\{D_{i|j},\, D_{j|i}\right\} \tag{7}$$

The appropriateness of the Jaccard and Dice measures defined above has not previously been tested with surname data. We therefore sought to establish the possible impacts of differing population sizes of individual surnames. We undertook a number of Monte Carlo simulation tests to establish the properties of the indices in this respect (Text S1). We found that the Dice coefficient is slightly less sensitive to the size differences and more stable in terms of smaller fluctuations in results obtained from repeatedly generated pseudorandom data. In general we have noted that both of the measures can serve well for our purposes and we undertook all of our calculations for both indices. However, because of space limitations, the graphical results presented and their associated analysis use the Dice coefficient only. Given the specification of our analysis described below, the sets of surnames linked by the 50,000 highest pairwise observations of the Jaccard and Dice measures, respectively, calculated at the more detailed level of municipalities (that is, where a higher discrepancy may be expected) are 80% identical. 
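As a small illustration of the asymmetric and symmetric Dice measures described above, the following Python sketch works directly on sets of regions in which each surname concentrates (that is, where its LQ exceeds 1). The function names and the example sets are hypothetical, and the conditional probability is estimated here simply as a ratio of region counts, which is an assumption of this sketch.

```python
def asymmetric_dice(regions_i, regions_j):
    """Share of the regions where j concentrates in which i concentrates too,
    i.e. a count-based estimate of P(i concentrates | j concentrates)."""
    if not regions_j:
        return 0.0
    return len(regions_i & regions_j) / len(regions_j)

def symmetric_dice(regions_i, regions_j):
    """The smaller of the two asymmetric measures."""
    return min(asymmetric_dice(regions_i, regions_j),
               asymmetric_dice(regions_j, regions_i))

# Hypothetical sets of regions of concentration for two surnames.
regions_i = {"A", "B", "C"}
regions_j = {"B", "C"}

d_i_given_j = asymmetric_dice(regions_i, regions_j)  # 2 shared / 2 regions of j
d_j_given_i = asymmetric_dice(regions_j, regions_i)  # 2 shared / 3 regions of i
d_ij = symmetric_dice(regions_i, regions_j)
```

Taking the minimum means the symmetric value is driven by the surname that concentrates in more regions, so a widespread surname cannot score highly against a localized one merely because it covers all of the latter's regions.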
Given the large sample of surnames analyzed it was necessary to run a series of computationally intensive calculations to obtain an extensive matrix of surname-surname proximity observations (nearly 200 million in the first stage of our analysis as described below). Such a matrix tends to be very sparse, with a large number of zero or negligible observations and very few more significant observations. It is therefore conducive to data mining through network analysis (the matrix can also be referred to as the weighted adjacency matrix as in [18]). We thus consider the network of surnames in terms of an undirected graph where nodes (or vertices) correspond to individual surnames and links (or edges) between them refer to the most significant measures of revealed relatedness ($D_{i,j}$ has been applied for the results presented below). As stated above, we consider this network as an appealing representation of the Czech surname space. It can be examined both globally in terms of its aggregate patterns, its shape, or the number of communities, and locally through extracting the positions of individual surnames or their groups. Both of these aspects are important with respect to our inductive analysis, which is driven by an expectation of detectable clusters or communities of surnames with strong internal and relatively weak external relatedness.

For the network visualization we used Cytoscape, open-source software suitable for handling large complex networks [19]. A force-directed algorithm with consideration of weights linearly proportional to our measure of revealed relatedness appeared to produce the most effective network layout (for a description of the force-directed layout used in Cytoscape software see http://cytoscapeweb.cytoscape.org/documentation/layout). With this the network can be conceptualised as a physical system where nodes (surnames) influence each other via attracting forces with strengths proportional to their revealed relatedness. 
The algorithm minimizes the energy of the physical system and assigns the nodes positions in two-dimensional space accordingly. For the network visualisation to be interpretable, the majority of negligible links should be removed. A threshold of $D_{i,j}$ (denoted as $d$) determined by, for example, inspecting the frequency distribution of the proximity observations provides a logical criterion. Considering a certain $d$, a surname space visualisation consists of $n$ surnames and $m$ surname-surname relatedness links, when:

$$n = \left|\{i : \exists\, j \neq i,\ D_{i,j} \geq d\}\right| \tag{8}$$

$$m = \left|\{(i,j) : i < j,\ D_{i,j} \geq d\}\right| \tag{9}$$

This provides the basis for defining some simple local and global characteristics of the surname network, similar to the basic measures used in network analysis [18]. An important local parameter pertaining to each node is the node degree. It is the number of links that connect the node in question to other nodes in the network. Here the degree of a surname $i$ is denoted as $k_i$ and it corresponds to the number of its revealed relatedness links to other surnames equal to or above the chosen $d$:

$$k_i = \left|\{j : j \neq i,\ D_{i,j} \geq d\}\right| \tag{10}$$

This measure is particularly interesting in the present context because it can be considered as a simple measure of the node centrality. A high $k_i$ implies that the surname in question co-occurs (concentrates in similar regions) with many other surnames within a given surname space or its sub-space. In other words, a high $k_i$ indicates that a surname $i$ is highly embedded in the surname space or its sub-space (which is understood here as any contiguous part of the surname network, defined for example by a selection of adjacent nodes or links) and that it can be considered an exemplar of a local population. 
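The thresholding step and the node degree described above can be sketched in a few lines of Python. The relatedness values, the threshold, and the function name `threshold_network` are hypothetical; the sketch assumes each unordered surname pair is stored once, as a tuple.

```python
from collections import defaultdict

def threshold_network(relatedness, d):
    """relatedness maps an unordered surname pair (i, j), stored once,
    to its relatedness value. Keeps links with value >= d and returns
    (links, degree), where degree maps each linked surname to k_i."""
    links = {pair: value for pair, value in relatedness.items() if value >= d}
    degree = defaultdict(int)
    for i, j in links:
        degree[i] += 1
        degree[j] += 1
    return links, dict(degree)

# Hypothetical pairwise relatedness values.
relatedness = {
    ("Dvorak", "Novak"): 0.05,
    ("Dvorak", "Novotny"): 0.40,
    ("Novak", "Novotny"): 0.60,
    ("Novak", "Svoboda"): 0.35,
}

links, degree = threshold_network(relatedness, d=0.3)
n = len(degree)  # surnames with at least one retained link
m = len(links)   # retained surname-surname links
```

Raising `d` removes links and can drop surnames from the network entirely, since a surname appears in `degree` only if at least one of its links survives the threshold.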
In addition, two basic global parameters of a surname space can be introduced in terms of the mean surname degree ($c$) expressed as:

$$c = \frac{1}{n}\sum_{i=1}^{n} k_i = \frac{2m}{n} \tag{11}$$

and the surname space density ($\rho$), that is, the proportion of the actual number of links in the surname space relative to the maximum possible number of links:

$$\rho = \frac{2m}{n(n-1)} \tag{12}$$

Both $c$ and $\rho$ are valuable metrics measuring the extent of aggregate relatedness among surnames within a given surname space (or its sub-space). As such, they provide interesting information about the extent of internal population homogeneity.

This paper draws on a unique dataset containing the occurrence of individual surnames in each of the 6,253 Czech municipalities derived from the 2009 Central Population Register (produced by the Czech Ministry of the Interior). The data cover all those with permanent residence, that is, Czech nationals and foreigners staying on a long-term basis. The 10,705,763 individuals listed share 362,125 unique surnames. It is conventional in Czechia to have male and female variants of the same surname. Both exhibit almost identical spatial distributions, negating the need to include both forms, and so the female variants were omitted. This dramatically reduced the volume of data. Fortunately, nearly all Czech feminine derivatives are easily distinguishable by the suffix “–á”. The exceptions are comparatively more frequent for certain surnames typical of eastern Moravia and Silesia [20] and among rare surnames (see Table 1). Although a few of these exceptional cases have been included in the analyzed sample, this does not have any significant effect on the results because the location quotient (as described above) compares relative population shares.

Table data removed from full text. Table identifier and caption: 10.1371/journal.pone.0048568.t001 Frequency distribution of all surnames and feminine derivatives with suffix “–á”. 
For the analysis of co-occurrence we decided to work with male surnames with a frequency exceeding 49 bearers in the whole country. With this filter applied the data comprised the 15,487 most frequent male surnames and 4,347,283 individuals, corresponding to 83% of the total male population. The cut-off was chosen in the light of the following: firstly, the size distribution of surnames is heavily right-skewed and the inclusion of less frequent names would make our analysis excessively computationally intensive (as described below); secondly, and more importantly, the consideration of less frequent surnames would considerably increase the risk of contamination of results by random co-occurrences of rare surnames; thirdly, we also noted that the spatial distribution of rare surnames in Czechia is quite uneven, with significantly higher shares of such surnames in peripheral areas and especially in the region of Silesia (basic information about the regional division of the country and the main migratory processes that have shaped its current ethnic structure is provided in Text S2 and Figure S1). Therefore, the inclusion of rare surnames would disproportionately enlarge the parts of the surname space that depict surnames concentrated in these regions.

As previously noted, the scope of the proposed study has been constrained by the computational intensity of the analysis and the nature of Czech administrative geography in this context. Initially, we attempted the analysis directly at the finest spatial level of municipalities. However, these spatial units were too fragmented and differed in population size to the extent that small numbers became an issue. Instead, we opted for a two-stage procedure (Table 2). In the first stage we undertook the analysis using a set of larger spatial units corresponding to 206 micro-regions (so-called municipalities with extended competence). 
Importantly, the delineation of these units coincides relatively well with historical and socio-economic processes and they can be considered as functional socio-geographical micro-regions. The first stage of our analysis highlighted a smaller sub-sample of “important” surnames in terms of those most frequently co-occurring over these regions. As described in Table 2 and discussed in more detail below, in this way 5,660 of the potentially most interesting surnames (that is, 36% of the original sample, equivalent to nearly half of the male population) linked by the most significant pairwise measures of relatedness were identified. This set of surnames was then analyzed in the second stage of our analysis focusing on co-occurrence in 6,244 municipalities (the original set of municipalities contained 6,253 units but in nine of them none of the 5,660 surnames identified in the first stage of our analysis is concentrated). This approach is based on the assumption that the pairs of surnames with high co-occurrence in larger regions will also have a higher probability of being found together in smaller regions. For the second stage, the three largest municipalities (in terms of population size), Praha, Brno, and Ostrava, were excluded from the analysis as we expect many “random” co-occurrences to be found there, thus adding noise to the results.

Table data removed from full text. Table identifier and caption: 10.1371/journal.pone.0048568.t002 Description of samples of surnames and spatial units in the first and second stage of analysis. *Refers to individuals bearing surnames included in the analysed samples of surnames.

We expect that the consideration of co-occurrence indices in more aggregate spatial units can provide us with a “global” picture, whilst analysis at the level of municipalities will lead to a more fragmented network identifying more accurately the pairs and communities of individual surnames with the highest probability of being factually related.
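The logic of the two-stage procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the selection rule used here (keep every surname that appears in one of the `top_n` highest-scoring pairs from the first stage) and all names and values are hypothetical, and the actual selection criterion is the one documented in Table 2.

```python
import heapq

def surnames_in_top_pairs(relatedness, top_n):
    """Return the surnames linked by the top_n highest relatedness values.
    relatedness maps a surname pair (i, j) to its first-stage value."""
    top = heapq.nlargest(top_n, relatedness.items(), key=lambda item: item[1])
    selected = set()
    for (i, j), _value in top:
        selected.update((i, j))
    return selected

def restrict_counts(counts, selected):
    """Keep only fine-level (surname, unit) counts for the selected surnames."""
    return {key: bearers for key, bearers in counts.items() if key[0] in selected}

# Hypothetical first-stage relatedness computed over the larger micro-regions.
stage_one = {
    ("Novak", "Novotny"): 0.60,
    ("Dvorak", "Novotny"): 0.40,
    ("Novak", "Svoboda"): 0.35,
    ("Dvorak", "Novak"): 0.05,
}

# Hypothetical municipality-level counts for the second stage.
municipal_counts = {
    ("Novak", "m1"): 12, ("Novotny", "m1"): 7,
    ("Svoboda", "m2"): 9, ("Dvorak", "m2"): 4,
}

selected = surnames_in_top_pairs(stage_one, top_n=2)
stage_two_counts = restrict_counts(municipal_counts, selected)
```

The second-stage relatedness measures would then be recomputed from `stage_two_counts` alone, which is what keeps the municipality-level calculation tractable.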