The NBA season is divided into two parts: the regular season and the playoffs. In the regular season, each of the 30 teams plays 41 home games and 41 away games, which combine to a total of 1230 games in a season. For each of these games, “play-by-play” data is collected and made available through the official NBA website. These data list all the events that have an influence on the game, such as shot attempts, fouls, rebounds and assists, along with the names of the players involved and the time of each event.

There are three types of free throws in a basketball game, all of which come after a foul has been committed; the resulting penalty is a series of one, two or three consecutive free throws. A player who is awarded a free throw gets an uninterrupted attempt to score a basket from a predetermined distance. Since this distance is the same on all basketball courts, and since the defending skills of the opposing team and players do not affect the outcome, one is led to the conclusion that free-throw outcomes are a measure of the skill of the individual player at that point in time. This makes free throws an excellent candidate for finding a “hot hand”, provided that such a phenomenon exists.

Since we were interested only in free-throw data, the relevant information was extracted from the entire play-by-play database of five consecutive NBA regular seasons. After cleaning the data for various items (see below), the data set used in our analysis consists of free-throw attempts of three kinds: single attempts, pairs of attempts and triplets of attempts, taken by many different players over the five consecutive seasons.

The first step in analyzing the data was to clean it of all types of errors and inconsistencies: (i) The play-by-play data of a number of games was not complete on the NBA website, and these games were therefore excluded. (ii) Some records of two-throw sequences had only one entry (the first or the second trial); these were suspected to be typing errors and were moved into the single-throw sets. (iii) In some cases, two players from the same team share the same last name. In most of these cases the player ID or the initial of the first name tells them apart, but in several individual cases the data remained ambiguous; in all of these cases we simply ignored the data in the current analysis (these sum to a small fraction of the throws in the entire data set, some of which were part of two-throw sets). Altogether, only a small fraction of all data points was ignored in the current analysis, and there is no reason to believe that the resulting data is biased in any sense by the cleaning procedure. The cleaned data is available in Dataset S1.
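To make the cleaning procedure concrete, the following is a minimal sketch of the three rules applied to a hypothetical flat record format. The field names (game_id, player_id, seq_id, seq_len, hit) and the two exclusion lists are illustrative assumptions, not the schema of the actual NBA data.

```python
from collections import defaultdict

def clean_free_throws(records, incomplete_games, ambiguous_players):
    """records: list of dicts with keys 'game_id', 'player_id',
    'seq_id' (one id per trip to the line), 'seq_len' (1, 2 or 3)
    and 'hit' (bool). Returns the cleaned list of records."""
    # Rule (ii): a two-throw sequence recorded with only one entry is
    # suspected to be a typing error; move it into the single-throw set.
    by_seq = defaultdict(list)
    for r in records:
        by_seq[r['seq_id']].append(r)
    for seq in by_seq.values():
        if seq[0]['seq_len'] == 2 and len(seq) == 1:
            seq[0]['seq_len'] = 1
    flat = [r for seq in by_seq.values() for r in seq]
    # Rule (i): exclude games whose play-by-play data is incomplete.
    flat = [r for r in flat if r['game_id'] not in incomplete_games]
    # Rule (iii): drop throws whose player identity remains ambiguous.
    return [r for r in flat if r['player_id'] not in ambiguous_players]
```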
As mentioned above, there are three types of free-throw sequences: a single attempt, a sequence of two consecutive attempts, and a sequence of three consecutive attempts. For all the two-attempt sequences of every player $i$, we measured the success rates of the first and second free-throw attempts and denoted them by $p_1(i)$ and $p_2(i)$ respectively. Throughout this text, lower-case letters denote the properties of a single individual. We then measured the conditional success rate of the second free-throw attempt, given that the first throw went in or out, denoted by $p_{2|\mathrm{hit}}(i)$ and $p_{2|\mathrm{miss}}(i)$ respectively. The averages of the players' success rates and conditional probabilities were calculated and denoted by upper-case letters using the same notation: $P_1$, $P_2$, $P_{2|\mathrm{hit}}$ and $P_{2|\mathrm{miss}}$. We also measured the success rates and conditional success rates for the entire data set, aggregated over all players, and denoted them by the same upper-case symbols with a superscript, e.g. $P_2^{\mathrm{agg}}$ and $P_{2|\mathrm{hit}}^{\mathrm{agg}}$. An equivalent procedure was applied to the data of three consecutive free-throw attempts.

The results were then tested for statistical significance with respect to two measures: (i) Non-stationarity (NS): the change in success rate as the consecutive attempt number increases. (ii) Conditional probability (CP): the change in success rate of the second attempt for a given result of the previous attempt (for a sequence of three free-throw attempts the same was done with the third attempt as well).

Both of these measures can be studied with the aid of the hypergeometric distribution. To test NS, one can think of hits as “white balls” and misses as “black balls” and put them all in one urn after labeling them as first or second attempts. Since the null hypothesis is that there is no systematic deviation in the probability of success between the first and second attempts, one can sample, without replacement, one half of the total number of throws (first and second attempts combined) and check how many hits (white balls) are in the sample. The null hypothesis implies that the number of hits in the first (or second) attempt should be consistent with a random sample from this hypergeometric probability distribution function. To test the change in CP, one can think of putting all the second attempts as balls in the urn (hits are the “white balls” and misses are the “black balls”). This time, the number of balls drawn without replacement is the number of hits in the first throw. Once again, the null hypothesis, which states that the result of the second attempt is independent of the result of the first attempt, implies that the number of times one gets hits in both throws should agree with a random sample from this hypergeometric distribution function.

We describe the hypergeometric distribution function with the following parameters: $W$ (number of white balls in the urn), $B$ (number of black balls in the urn), $n$ (sample size) and $k$ (the number of white balls in the sample), and denote the data variables by $N_{hh}$, $N_{hm}$, $N_{mh}$, $N_{mm}$ and $N$: the number of Hit-Hit, Hit-Miss, Miss-Hit and Miss-Miss pairs, and the total number of pairs, respectively. Thus, the formulation of the NS hypergeometric distribution function is

P_{NS}(k) = \binom{W}{k}\binom{B}{n-k} \Big/ \binom{W+B}{n}, \quad W = 2N_{hh}+N_{hm}+N_{mh}, \; B = 2N-W, \; n = N, \; k_{obs} = N_{hh}+N_{hm},   (1)

while the formulation of the CP hypergeometric distribution function is

P_{CP}(k) = \binom{W}{k}\binom{B}{n-k} \Big/ \binom{W+B}{n}, \quad W = N_{hh}+N_{mh}, \; B = N_{hm}+N_{mm}, \; n = N_{hh}+N_{hm}, \; k_{obs} = N_{hh}.   (2)

After calculating these measures, in principle, one can calculate for each individual player the p-value resulting from an exact Fisher test or an exact Barnard test.
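The two urn constructions translate directly into code. The sketch below builds the NS and CP distributions of eqs. (1) and (2) with scipy.stats.hypergeom; the pair counts in the usage example are invented for illustration.

```python
from scipy.stats import hypergeom

def ns_distribution(N_hh, N_hm, N_mh, N_mm):
    """NS urn: all 2N throws, hits as white balls; draw N balls (one
    attempt's worth) and compare with the observed first-attempt hits."""
    N = N_hh + N_hm + N_mh + N_mm      # number of pairs
    W = 2 * N_hh + N_hm + N_mh         # total hits (white balls)
    B = 2 * N - W                      # total misses (black balls)
    k_obs = N_hh + N_hm                # hits observed in the first attempt
    # scipy's convention: hypergeom(M=population, n=successes, N=draws)
    return hypergeom(M=W + B, n=W, N=N), k_obs

def cp_distribution(N_hh, N_hm, N_mh, N_mm):
    """CP urn: second attempts only; draw as many balls as there were
    first-attempt hits, and compare with the observed Hit-Hit count."""
    W = N_hh + N_mh                    # second-attempt hits
    B = N_hm + N_mm                    # second-attempt misses
    n = N_hh + N_hm                    # first-attempt hits (sample size)
    return hypergeom(M=W + B, n=W, N=n), N_hh

# Invented example: 120 HH, 30 HM, 35 MH, 15 MM pairs for one player.
dist, k = cp_distribution(120, 30, 35, 15)
print(k, dist.mean(), dist.std())      # observed k vs. null expectation
```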
Both tests have their own disadvantages in accuracy and, more importantly, since the distributions are discrete it is not easy to analyze the collection of results for all individuals and deduce from it a resulting “p-value”, or a “p-value of the p-values” [16]. Hence, we decided to take two independent approaches to estimate the probability of obtaining the observed collection of results for all individuals just by chance (a “p-value”).

The first, computationally faster, approach involves estimating a “z-value” for each individual player; the z-value is the distance (including sign) of the observed value, $k$, from the expected value, $\langle k \rangle$, of the hypergeometric distribution, in units of its standard deviation $\sigma$:

z_x = (k - \langle k \rangle_x) / \sigma_x,   (3)

where the subscript $x$ is the notation we use to distinguish between the two measures defined above ($x = NS$ for non-stationarity and $x = CP$ for conditional probability). When calculating $z$ for the aggregated data, the total number of free-throw attempts is large enough that the distribution of $z$ can be approximated well by a normal distribution with zero mean and unit variance, from which the resulting p-value can be extracted. In cases where we are interested in the statistical significance of the collection of the players' z-values, one can look at their (the individual $z$'s) mean value $\bar{z}$: from the definition (eq. 3), assuming the $z$'s of the different players are independent, one should expect the mean value to be 0 and its variance to be $1/M$, where $M$ is the number of players. Following this, we define $Z \equiv \sqrt{M}\,\bar{z}$, which in turn allows the normal approximation to yield a “p-value” by

p = \mathrm{sgn}(Z) \left[ 1 - \mathrm{erf}\left( |Z|/\sqrt{2} \right) \right],   (4)

where $\mathrm{erf}$ is the Gauss error function and $\mathrm{sgn}(Z)$ is 1 for $Z \ge 0$ and $-1$ elsewhere. A positive value means “hot hand” while a negative value indicates “cold hand”. This approach is fast but relies on a normal approximation.

A more accurate, though computationally intensive, approach is a permutation approach: first, we reshuffle the second throws of each individual. After reshuffling, we calculate the z-value of each individual for the reshuffled data and record the mean z-value for this reshuffled realization. Repeating this procedure many times results in a collection of mean z-values (one for each realization/reshuffle), each corresponding to second throws that are independent of the first ones. The last step in estimating the corresponding p-value is to rank the results of the random reshuffles and see what fraction of them show larger values than the value actually observed in the original data. This fraction, times two (since we are calculating a two-tailed test), is an estimate of the p-value. In the current analysis we generated a large number of such realizations for each player in each season.
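A minimal sketch of both significance estimates follows, assuming the inputs are 0/1 numpy arrays of one player's paired throws. For simplicity it computes the permutation p-value for the CP measure of a single player, whereas the text applies the reshuffling to the mean z over all players; the function names are illustrative.

```python
import numpy as np
from math import erf, sqrt
from scipy.stats import hypergeom

def cp_z(first, second):
    """Eq. (3) for the CP measure: signed distance of the observed
    Hit-Hit count from the hypergeometric mean, in standard deviations."""
    f, s = first.astype(bool), second.astype(bool)
    N_hh, N_hm = int((f & s).sum()), int((f & ~s).sum())
    N_mh, N_mm = int((~f & s).sum()), int((~f & ~s).sum())
    d = hypergeom(M=N_hh + N_hm + N_mh + N_mm,
                  n=N_hh + N_mh,            # white balls: second-attempt hits
                  N=N_hh + N_hm)            # draws: first-attempt hits
    return (N_hh - d.mean()) / d.std()

def signed_p(z_values):
    """Eq. (4): normal approximation for the mean of M independent
    z-values; positive output suggests 'hot hand', negative 'cold hand'."""
    Z = sqrt(len(z_values)) * float(np.mean(z_values))
    sgn = 1.0 if Z >= 0 else -1.0
    return sgn * (1.0 - erf(abs(Z) / sqrt(2.0)))

def permutation_p(first, second, n_shuffles=10_000, seed=0):
    """Permutation estimate: reshuffling the second throws makes them
    independent of the first by construction."""
    rng = np.random.default_rng(seed)
    z_obs = cp_z(first, second)
    z_null = np.array([cp_z(first, rng.permutation(second))
                       for _ in range(n_shuffles)])
    # two-tailed: fraction of reshuffles beating the observed z, times two
    frac = np.mean(z_null >= z_obs) if z_obs >= 0 else np.mean(z_null <= z_obs)
    return 2.0 * float(frac)
```

Note that reshuffling the second throws leaves the urn parameters $W$, $B$ and $n$ unchanged and only moves the observed count $k = N_{hh}$, which is what makes each reshuffled realization cheap to evaluate.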