nif:isString
|
-
We had access to CDRs for one Indian operator for the period from January 1 to March 31, 2013. Only summary statistics from the CDRs were provided to us: social network information and daily customer counts at various cell towers located at the Kumbh. Caller IDs were anonymized, and no individual-level characteristics were provided to us aside from billing area codes and whether or not a prepaid or postpaid plan was used. This dataset contains records of 146 million (145, 736, 764) texts and 245 million (245, 252, 102) calls for a total of 390 million (390, 988, 866) communication events. Given the logistical impossibility of collecting demographic, linguistic, or cultural attributes of Kumbh participants at scale, we based our investigation of homophily on a marker that acts as a proxy for these covariates, namely, cell phone area codes. The area codes correspond to different states of India, and as a result of India’s States Reorganization Act of 1956 these divisions summarize demographic variability along linguistic origin, ethnic agglomeration, and preexisting social bonds and boundaries. Though officially India has more than 23 states, we adhere instead to the 23 functional state divisions used by the service provider. While CDRs readily lend themselves to studying social networks and social homophily, to investigate spatial homophily we additionally acquired access to the cell tower IDs at the Kumbh venue. Combined with the latitude and longitude of each of the 207 towers at the site, many of which were brought into the venue prior to the start of the festival in anticipation of the large influx of people, we were able to infer the caller’s location (at the time of phone-based communication) with relatively high spatial resolution. The grid that divides the Kumbh site into regions around each cell tower, called the Voronoi tessellation, groups all points on the map closest to each cell tower. The birds-eye view of Allahabad in Fig 1 shows the estimated attendance on one of the busiest and most favorable days for ritual bathing in the Ganges river.
Figure data removed from full text. Figure identifier and caption: 10.1371/journal.pone.0156794.g001 Cell phone usage around the cell towers at the Kumbh during its busiest day.The heat map polygons represent the Voronoi tessellation around the cell towers that occupied the site of the Kumbh Mela event in Allahabad, India. Cell towers with no activity are removed from the analysis and their Voronoi cells are merged into neighboring active cell towers. Map data used to produce the river traces: Google, DigitalGlobe.
Extrapolating population measures from CDRs has become feasible in recent years due to the rapid increase in the prevalence of cell phones. While CDRs provide raw counts of cell phone users, to estimate attendance, these numbers need to be adjusted by (i) overall prevalence of cell phones in India, (ii) the state-specific market shares of our provider, (iii) the probability of daily use for a person known to be present at the venue, and (iv) the probability of phone non-use during a person’s entire stay at the venue. First, regarding overall phone prevalence, 71.3% of people in India had a wireless subscription in 2013 [25]. Second, regarding market share, the number of unique handsets are counted on a daily basis for each of 23 distinct states of India (See S1 Text), as defined by the service provider, and each count is extrapolated from the service provider’s market share in the given states. The service provider’s market share varies widely state by state (range 13.7%, 42.6%). It is important to use state-specific market share, because if average market share is used instead, the state-specific attendance counts can be off by more than a factor of 2. These handset counts are added together for each day before extrapolating to the general population. Third, regarding daily use, it is likely that many Kumbh attendees who use their phone at least once do not use their phone every day while at the festival. If not addressed, this would bias our population estimate downwards. By tracking phone activity, length of stay can be estimated based on the time period a person’s phone is active while at the Kumbh. Based on this, we estimate the percentage of customers who use their phone on any given day during their stay conditional on them using their phone at least once during their stay to be 40.4%. (Note that this quantity applies to daily estimates, not to cumulative estimates. See S1 Text). Fourth, regarding non-use, the probability of a person not using his or her phone during the entire stay at the venue is difficult to account for; these individuals are not visible in the observed data, and yet the proportion of non-users could potentially be substantial given that many visitors from outside regions would have to pay roaming fees, which likely leads them to minimize their phone use. To overcome this difficulty, we first examine four available daily population projections [26], each for a different day, and calibrate the proportion of non-users such that our resulting daily estimate for that same day is most consistent with the four daily projections. We obtain an estimate of 40.6% for non-use (coincidentally similar to 40.4% obtained above for daily use) and we use this estimate to adjust both cumulative and daily attendance.
A social network is constructed between customers who used their phone at the Kumbh. A network edge is assumed between any two people who communicated with one another at any point over the course of the Kumbh. To study how a state’s extent of social homophily is related to its level of representation, defined as the number of people present from the state divided by the total Kumbh attendance, we select a measure that results in consistent estimates of homophily regardless of state representation. The measure of social homophily considered in refs. [13, 27] applied to our setting would define homophily for any given state as the proportion of ties that involve two participants from that state, but due to measuring absolute differences instead of relative differences, the homophily for states with small representation would be biased downwards due to their small proportions. A standard stochastic block model (SBM) approach [28] applied to our setting would assume an equal likelihood of forming network ties between any two participants from the same state. However, if this model is misspecified and there exist additional social structure within each state (within each block), as is almost certainly the case, then this approach is likely biased in the opposite direction and overestimates the social homophily in states with lower representation. To see the reason for this, consider the case where state A sends only a single group of friends to the Kumbh, whereas state B sends 100 different groups of friends. A random pair selected from state A will have a much higher likelihood of being friends than will a random pair from state B, even if social homophily is equally strong within the friendship groups of the two states. The biases of both these methods are discussed in further detail in S1 Text. To circumvent these problems, we shift our focus from dyads to same-state connected triples, sets of three nodes from the same state that are connected either by two edges, resulting in an open triple, or three edges, resulting in a closed triple. The rationale behind this choice is that the three nodes in a connected triple can be assumed to belong to the same social group whether the triple is open or closed. By considering the propensity for same-state connected triples to be closed, we can gain insight into how densely connected the social groups are in which these triples are embedded. This approach is a way of sampling pairs of nodes from the same social group even when the social groups themselves are unobserved. The proportion of triples that are closed provides a natural measure of social homophily (see Fig 2). This measure is commonly referred to as the global clustering coefficient or the transitivity index [29] calculated over each state-specific network. When studying social homophily we ignore the attendees from the local state where the Kumbh is held, eastern Uttar Pradesh, because the social behavior of the locals is likely not comparable to those from the other 22 states. While visitors from other states are all present for the same purpose of participation in the Kumbh, this is not true for the locals, many of whom were employed to help run the Kumbh in various roles. Outsider phone usage will likely be exclusively for coordinating purposes at the event, due to the cost of roaming calls. On the other hand, locals use their phones much more freely and for everyday purposes. Ignoring residents from the local state whose phone use is likely different from all other states, there are 1, 630, 553 connected triples in the full Kumbh social network.
Figure data removed from full text. Figure identifier and caption: 10.1371/journal.pone.0156794.g002 Schematics of homophily measures (A) and call detail records (B).For homophily measures (A), the three dotted lines represent spatial boundaries for the Voronoi tessellation around the cell towers, separating the shaded region into three Voronoi cells, in two (a low and high homophily) examples. The solid lines denote which nodes are in communication in the social network, either through voice call or text message. In the context of spatial homophily, two nodes are considered nearby if and only if they both are in the same spatial region (Voronoi cell) on the same day. The size of Voronoi cells range from as small as a 1/4km2 to as large as 20km2. For the call detail records (B), analysis of spatial homophily uses all pairwise communication events involving at least one customer of our operator who is present at the Kumbh, whereas analysis of social homophily only considers the ties between customers of our operator.
Letting Cijk = 1 if the (i, j, k) triple is closed and Cijk = 0 if it is open, and let Rijk be the state of the three nodes in the triple, with Wr as the proportion of the total cumulative Kumbh population by March 31, 2013, that belongs to state r = 1, …, 23. Across the 22 non-local states, Wr ranges from 0.018% to 7.45%, thus varying over 2.5 orders of magnitude. We fit the following regression model over all connected triples: logit(pr(Cijk=1))=β0+β1log10WRijk(1) The model requires independence between observations for accurate inference, and because the same individual can be involved in multiple triples, this independence does not hold. The estimate β^1 from Eq (1) is still unbiased, but its standard error and the P-value for the two-sided test of the null hypothesis β1 = 0 will not be correct if this dependence is ignored. Taking advantage of the large sample size, for accurate inference we select a random subset of triples where we do not allow the same individual to appear in more than one triple.
Let ncrd be the number of customers near cell tower c from state r on day d of the Kumbh, and let Nrd=∑c=1Cncrd be the total number of customers from state r at the Kumbh on day d, where the sum is taken over all C cell towers. To avoid double-counting, if a person uses multiple cell towers on the same day, only the first cell tower is recorded. The probability that any two given individuals from the state r are nearby on the day d is: prd=1Nrd∑c=1Cncrdncrd-1Nrd-1(2) Here two people are defined to be “nearby” on a particular day when they are both located in the same Voronoi cell on that day, using the cell tower designation mentioned above. The intuition behind Eq 2 is that, given the location of one person, the probability a different randomly selected person from their state is in the same Voronoi cell is (ncrd − 1)/(Nrd − 1). The probability in Eq 2 has the desirable property of not scaling with state representation Wr if spatial homophily is kept constant. To see this, suppose that we hold constant how a particular state’s attendees are spread out over the cell towers of the Kumbh, i.e. suppose we fix ncrd/Nrd. If we then increase the number of people present at the Kumbh from that state, prd will stay essentially unchanged with a negligible increase, because (a ⋅ ncrd − 1)/(a ⋅ Nrd − 1) > (ncrd − 1)/(Nrd − 1) for any a > 1. This property is essential if we wish to evaluate the relationship between spatial homophily and state attendance/representation. Finally, let QrA=∑d=190prd/90 be the probability that any two given individuals from state r are nearby averaged over all 90 days. To evaluate busy, or high volume, days, we consider the three days with the highest attendance. We grouped each of these three days together along with the two days that preceded each and the two days that followed each, leading to a set of 15 days we labeled as high volume days. The remaining 75 days were grouped together to form the set of low volume days. We let QrH be the average of the prd over the high volume days, QrL be the average of the prd over the low volume days, and we defined QrD=QrH/QrL to be the ratio of spatial homophily when comparing high volume days to low volume days.
|