This second edition of the well-received book has been extensively revised, and the first two chapters of the previous edition have been extended and split into new chapters. Professor Downey is an expert writer with over 12 books under his belt.
This particular book is very comprehensive. The author guides an engineer with minimal statistical knowledge into the intricacies of statistics.
Comprehensiveness rating: 5. Professor Downey starts the book with the basic concepts of exploratory data analysis, distributions, plotting, and effect size, then moves to probability mass functions and cumulative distributions.
Then he untangles the complicated subject of modeling distributions and probability density functions. From Chapter 7 he starts a journey to hypothesis testing and regression analysis.
The concepts of hypothesis testing and regression analysis are not simple, so he begins by explaining the relationship between variables, demonstrating the relationships with scatter plots. He moves on to concepts like correlation, covariance, linear dependency, and the Pearson correlation coefficient. From there, he explains sampling distributions and sampling bias. By now the student has a strong understanding of sampling distributions and is ready to learn about hypothesis testing.
In the chapter on hypothesis testing, he describes the most common methods for comparing two different groups. In Chapter 10, the author explains the basic concepts necessary to understand regression, like least squares, residuals, goodness of fit, and weighted resampling.

That ideal is hard to achieve in practice, but people who conduct surveys come as close as they can. The NSFG is not representative; instead it is deliberately oversampled.
The designers of the study recruited three groups—Hispanics, African Americans, and teenagers—at rates higher than their representation in the U.S. population. Of course, the drawback of oversampling is that it is not as easy to draw conclusions about the general population based on statistics from the survey.
We will come back to this point later. Importing the Data The code and data used in this book are available from GitHub. Each line in the file is a record that contains data about one pregnancy. The format of the file is documented in the dictionary file, FemPreg.dct. For example, the dictionary describes two variables: caseid is a character string that represents the respondent ID; pregordr is a one-byte integer that indicates which pregnancy this record describes for this respondent.
The code you downloaded includes thinkstats2.py. ReadStataDct takes the name of the dictionary file and returns dct, a FixedWidthVariables object that contains the information from the dictionary file. In addition to the data, a DataFrame also contains the variable names and their types, and it provides methods for accessing and modifying the data.
The result is an Index, which is another pandas data structure. The result is a Series, yet another pandas data structure.
A Series is like a Python list with some additional features. In this example the indices are consecutive integers starting at 0, but in general they can be any sortable type. The elements are also integers, but they can be any type. The last line includes the variable name, Series length, and data type; int64 is one of the types provided by NumPy.
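As a self-contained sketch of these Series features, here is a toy column standing in for pregordr (the values are made up, not NSFG data):

```python
import pandas as pd

# Toy Series standing in for a column like pregordr
pregordr = pd.Series([1, 2, 3, 4, 1])

print(pregordr.index)   # the index: integer labels starting at 0
print(pregordr.values)  # the elements, as a NumPy array
print(pregordr.dtype)   # int64 on most platforms

single = pregordr[0]    # the index operator returns a single element
chunk = pregordr[2:5]   # a slice returns another Series
```

The same access patterns work on any column pulled out of a DataFrame.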
If you run this example on a 32-bit machine you might see int32. The result of the index operator is an int64; the result of the slice is another Series. Variables We have already seen two variables in the NSFG dataset, caseid and pregordr, and we have seen that there are 244 variables in total. For the explorations in this book, I use the following variables: caseid, prglngth, outcome, pregordr, birthord, birthwgt_lb, birthwgt_oz, agepreg, and finalwgt. For outcome, the code 1 indicates a live birth.
For outcomes other than live births, this field is blank. finalwgt is a floating-point value that indicates the number of people in the U.S. population this respondent represents. If you read the codebook carefully, you will see that many of the variables are recodes, which means that they are not part of the raw data collected by the survey; they are calculated using the raw data.
Recodes are often based on logic that checks the consistency and accuracy of the data. In general it is a good idea to use recodes when they are available, unless there is a compelling reason to process the raw data yourself. Transformation When you import data like this, you often have to check for errors, deal with special values, convert data into different formats, and perform calculations.
These operations are called data cleaning. In the data file, agepreg is encoded as an integer number of centiyears. So the first line divides each element of agepreg by 100, yielding a floating-point value in years. The replace method replaces the special codes with np.nan. But dealing with missing data will be a recurring issue.
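A minimal, self-contained sketch of this kind of cleaning, using toy values (the column names follow the text; the data and the exact special codes are assumptions, not the NSFG file):

```python
import numpy as np
import pandas as pd

# Toy pregnancy records; the values are made up
df = pd.DataFrame({
    'agepreg': [2575.0, 3141.0, 1930.0],  # mother's age in centiyears
    'birthwgt_lb': [7.0, 99.0, 51.0],     # 99 stands in for a "not ascertained" code
})

# agepreg is encoded in centiyears, so divide by 100 to get years
df['agepreg'] /= 100.0

# Replace special codes with NaN, the floating-point value for missing data
na_vals = [97, 98, 99]
df['birthwgt_lb'] = df['birthwgt_lb'].replace(na_vals, np.nan)

# Boolean indexing: the bracket expression yields a boolean Series, and
# assigning through it replaces the implausible 51-pound value with NaN
df.loc[df['birthwgt_lb'] > 20, 'birthwgt_lb'] = np.nan
```

After these steps, downstream statistics can simply skip the NaN values rather than treat error codes as real weights.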
Validation When data is exported from one software environment and imported into another, errors might be introduced. And when you are getting familiar with a new dataset, you might interpret data incorrectly or introduce other misunderstandings. If you take time to validate the data, you can save time later and avoid errors. One way to validate data is to compute basic statistics and compare them with published results.
For example, the NSFG codebook includes tables that summarize each variable. Comparing the results with the published table, it looks like the values in outcome are correct.
The counts for 6, 7, and 8 pounds check out, and if you add up the counts in the other bins, they check out, too. But if you look more closely, you will notice one value that has to be an error: a 51 pound baby! To deal with this error, I added a line to CleanFemPreg: df.birthwgt_lb[df.birthwgt_lb > 20] = np.nan
This statement replaces invalid values with np.nan. The expression in brackets yields a Series of type bool, where True indicates that the condition is true. When a Boolean Series is used as an index, it selects only the elements that satisfy the condition. Interpretation To work with data effectively, you have to think on two levels at the same time: the level of statistics and the level of context.
The iteritems method enumerates the index (row number) and caseid for each pregnancy. If you are not familiar with defaultdict, it is in the Python collections module. Using this list as an index into df.outcome selects the outcomes for a single respondent. Instead of printing the whole Series, I selected the values attribute, which is a NumPy array. The outcome code 1 indicates a live birth. Code 4 indicates a miscarriage; that is, a pregnancy that ended spontaneously, usually with no known medical cause.
Statistically this respondent is not unusual. Miscarriages are common and there are other respondents who reported as many or more. But remembering the context, this data tells the story of a woman who was pregnant six times, each time ending in miscarriage.
Her seventh and most recent pregnancy ended in a live birth. If we consider this data with empathy, it is natural to be moved by the story it tells.
Each record in the NSFG dataset represents a person who provided honest answers to many personal and difficult questions. We can use this data to answer statistical questions; at the same time, we have an obligation to consider the people represented by the data, and to afford them respect and gratitude. In the repository you downloaded, you should find a file named chap01ex.ipynb.
If IPython is installed, it should launch a server that runs in the background and open a browser to view the notebook. Open chap01ex.ipynb. Some cells are already filled in, and you should execute them.
Other cells give you instructions for exercises you should try. A solution to this exercise is in chap01soln.ipynb. Exercise: Create a file named chap01ex.py. You might want to start with a copy of nsfg.py. The variable pregnum is a recode that indicates how many times each respondent has been pregnant. You can also cross-validate the respondent and pregnancy files by comparing pregnum for each respondent with the number of records in the pregnancy file.
You can use nsfg.MakePregMap to make a dictionary that maps from each caseid to a list of indices into the pregnancy DataFrame. The best way to learn about statistics is to work on a project you are interested in.
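A dictionary like the one MakePregMap builds can be sketched with defaultdict; this is a minimal stand-in operating on a toy caseid column, not the function from nsfg.py itself:

```python
from collections import defaultdict

def make_preg_map(caseids):
    """Map each caseid to the list of row indices where it appears,
    in the spirit of nsfg.MakePregMap."""
    d = defaultdict(list)
    for index, caseid in enumerate(caseids):
        d[caseid].append(index)
    return d

# Toy caseid column: respondent 100 has two pregnancy records
preg_map = make_preg_map([100, 101, 100, 102])
```

Looking up a caseid then yields all of that respondent's pregnancy rows at once, which is exactly what the cross-validation exercise needs.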
Governments are good sources because data from public research is often freely available. If it seems like someone has already answered your question, look closely to see whether the answer is justified. There might be flaws in the data or the analysis that make the conclusion unreliable. In that case you could perform a different analysis of the same data, or look for a better source of data. If you find a published paper that addresses your question, you should be able to get the raw data.
Many authors make their data available on the web, but for sensitive data you might have to write to the authors, provide information about how you plan to use the data, or agree to certain terms of use. Be persistent! Glossary anecdotal evidence: Evidence, often personal, that is collected casually rather than by a well-designed study.

One of the best ways to describe a variable is to report the values that appear in the dataset and how many times each value appears.
The most common representation of a distribution is a histogram, which is a graph that shows the frequency of each value. In Python, an efficient way to compute frequencies is with a dictionary.
The result is a dictionary that maps from values to frequencies. Alternatively, the result is a Counter object, which is a subclass of dictionary. But for this book I created a class, Hist, that represents histograms and provides the methods that operate on them. For example, given hist = Hist([1, 2, 2, 3, 5]): hist.Freq(2) returns 2; hist.Freq(4) returns 0, because the value 4 never appears; and hist.Values() returns the values as an unsorted list, [1, 5, 3, 2]. To loop through the values in order, you can use the built-in function sorted: for val in sorted(hist.Values()): print(val, hist.Freq(val)). Or you can use Items to iterate through value-frequency pairs: for val, freq in hist.Items(): print(val, freq). Plotting Histograms For this book I wrote a module called thinkplot. It is based on pyplot, which is part of the matplotlib package. The code in this chapter is in first.py. When you start working with a new dataset, I suggest you explore the variables you are planning to use one at a time, and a good way to start is by looking at histograms.
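The Hist interface described above can be sketched as a small Counter subclass; this is a minimal stand-in, not the book's full thinkstats2.Hist:

```python
from collections import Counter

class Hist(Counter):
    """Minimal histogram: maps values to integer frequencies."""
    def Freq(self, x):
        return self[x]  # Counter returns 0 for values that never appear
    def Values(self):
        return list(self.keys())
    def Items(self):
        return list(self.items())

hist = Hist([1, 2, 2, 3, 5])
for val in sorted(hist.Values()):
    print(val, hist.Freq(val))
```

Because Counter already handles the counting, the class only needs to add the method names used in the text.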
In this section I use these variables to demonstrate some features of histograms. The expression in brackets is a boolean Series that selects rows from the DataFrame and returns a new DataFrame. To make and plot a histogram of birth weight in pounds: hist = thinkstats2.Hist(live.birthwgt_lb), then thinkplot.Hist(hist). When the argument passed to Hist is a pandas Series, any nan values are dropped.
The figure shows the result: the most common value, called the mode, is 7 pounds. The distribution is approximately bell-shaped, which is the shape of the normal distribution, also called a Gaussian distribution. The histogram of birthwgt_oz, the ounce part of birth weight, is different: in theory we expect this distribution to be uniform; that is, all values should have the same frequency. In fact, 0 is more common than the other values, and 1 and 15 are less common, probably because respondents round off birth weights that are close to an integer value.
For agepreg, the mother's age at the end of pregnancy, the mode is 21 years. The distribution is very roughly bell-shaped, but in this case the tail extends farther to the right than left; most mothers are in their 20s, fewer in their 30s. For prglngth, the length of the pregnancy in weeks, by far the most common value is 39 weeks.
The left tail is longer than the right; early babies are common, but pregnancies seldom go much past 43 weeks. Figure: Histogram of the ounce part of birth weight. Outliers Looking at histograms, it is easy to identify the most common values and the shape of the distribution, but rare values are not always visible. Hist provides methods Largest and Smallest, which take an integer n and return the n largest or smallest values from the histogram: for weeks, freq in hist.Smallest(10): print(weeks, freq)
In the list of pregnancy lengths for live births, the 10 lowest values are [0, 4, 9, 13, 17, 18, 19, 20, 21, 22]. Values below 10 weeks are certainly errors; the most likely explanation is that the outcome was not coded correctly.
Values higher than 30 weeks are probably legitimate. Between 10 and 30 weeks, it is hard to be sure; some values are probably errors, but some represent premature babies. On the other end of the range, the highest values are:

weeks  count
43     148
44     46
45     10
46     1
47     1
48     7
50     2

Most doctors recommend induced labor if a pregnancy exceeds 42 weeks, so some of the longer values are surprising.
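Smallest and Largest can be sketched with plain dictionaries; the histogram below uses made-up counts, not the NSFG values:

```python
def smallest(hist, n):
    """Return the n smallest (value, frequency) pairs from a histogram dict."""
    return sorted(hist.items())[:n]

def largest(hist, n):
    """Return the n largest (value, frequency) pairs, biggest value first."""
    return sorted(hist.items(), reverse=True)[:n]

# Toy pregnancy-length histogram (weeks -> count); counts are assumptions
lengths = {39: 4744, 43: 148, 50: 2, 4: 1, 0: 1}
print(smallest(lengths, 2))
print(largest(lengths, 1))
```

Sorting by value (not frequency) is the point: outliers live at the extremes of the value range regardless of how rare they are.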
In particular, 50 weeks seems medically unlikely. The best way to handle outliers depends on domain knowledge, and on what analysis you are planning to perform. In this example, the motivating question is whether first babies tend to be early or late. When people ask this question, they are usually interested in full-term pregnancies, so for this analysis I will focus on pregnancies longer than 27 weeks.
First Babies Now we can compare the distribution of pregnancy lengths for first babies and others. I make a Hist of prglngth for each group and plot them: first_hist = thinkstats2.Hist(firsts.prglngth); other_hist = thinkstats2.Hist(others.prglngth); thinkplot.PrePlot(2); thinkplot.Hist(first_hist, align='right'); thinkplot.Hist(other_hist, align='left'). PrePlot takes the number of histograms we are planning to plot; it uses this information to choose an appropriate collection of colors. Finally, I adjust the axis to show only data between 27 and 46 weeks. Histograms are useful for this kind of exploration, but they are not the best choice for comparing two distributions.
In the next chapter we address this problem using probability mass functions. Some of the characteristics we might want to report are central tendency (do the values tend to cluster around a particular point?), spread, and the shape of the tails. Statistics designed to answer these questions are called summary statistics. By far the most common summary statistic is the mean, which is meant to describe the central tendency of the distribution.
Sometimes the mean is a good description of a set of values. For example, apples are all pretty much the same size (at least the ones sold in supermarkets). So if I buy 6 apples and the total weight is 3 pounds, it would be a reasonable summary to say they are about a half pound each. But pumpkins are more diverse. Suppose I grow several varieties in my garden, and one day I harvest three decorative pumpkins that are 1 pound each, two pie pumpkins that are 3 pounds each, and one Atlantic Giant pumpkin that weighs 591 pounds. In this example, there is no meaningful average because there is no typical pumpkin.
Variance If there is no single number that summarizes pumpkin weights, we can do a little better with two numbers: mean and variance. Variance is often written S²; its square root, S, is the standard deviation. If you have prior experience, you might have seen a formula for variance with n - 1 in the denominator, rather than n. We will come back to this in Chapter 8.
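As a quick sketch, mean and variance (with n in the denominator, as in the text) applied to the pumpkin example; the 591-pound giant is the value assumed above:

```python
import math

def mean_and_var(xs):
    """Mean and variance with n in the denominator
    (use n - 1 instead for the unbiased sample variance)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return mean, var

# Pumpkin weights in pounds from the example above
weights = [1, 1, 1, 3, 3, 591]
m, var = mean_and_var(weights)
std = math.sqrt(var)  # standard deviation, in pounds
```

The huge standard deviation relative to the mean is the numerical signature of "there is no typical pumpkin."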
For all live births, the mean pregnancy length is 38.6 weeks, and the variance of pregnancy length is 7.3, which is hard to interpret because its units are weeks squared. Effect Size An effect size is a summary statistic intended to describe (wait for it) the size of an effect.
For example, to describe the difference between two groups, one obvious choice is the difference in the means. Mean pregnancy length for first babies is 38.601 weeks; for others it is 38.523 weeks, a difference of 0.078 weeks (about 13 hours). In fact, without observing a large number of pregnancies, it is unlikely that anyone would notice this difference at all. Another way to convey the size of the effect is to compare the difference between groups to the variability within groups. In this example, the difference in means is 0.029 standard deviations, which is small.
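This standardized difference is Cohen's d. A self-contained sketch, applied to toy groups (the values are assumptions, not the NSFG numbers):

```python
import math

def cohen_effect_size(group1, group2):
    """Difference in means divided by the pooled standard deviation."""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    diff = mean(group1) - mean(group2)
    n1, n2 = len(group1), len(group2)
    pooled_var = (n1 * var(group1) + n2 * var(group2)) / (n1 + n2)
    return diff / math.sqrt(pooled_var)

# Toy pregnancy lengths in weeks for two groups
d = cohen_effect_size([38, 39, 40, 41], [38, 39, 39, 40])
```

Because d is expressed in standard deviations, it can be compared across studies that measure different quantities.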
To put that in perspective, the difference in height between men and women is about 1.7 standard deviations. Reporting Results We have seen several ways to describe the difference in pregnancy length (if there is one) between first babies and others. How should we report these results? The answer depends on who is asking the question.
A scientist might be interested in any real effect, no matter how small. A doctor might only care about effects that are clinically significant; that is, differences that affect treatment decisions. A pregnant woman might be interested in results that are relevant to her, like the probability of delivering early or late.
How you report results also depends on your goals. If you are trying to reassure a patient, you might choose statistics that put the differences in context.
Of course your decisions should also be guided by professional ethics. But you should also do your best to make your reports honest, and to acknowledge uncertainty and limitations. Based on the results in this chapter, suppose you were asked to summarize what you learned about whether first babies arrive late. Which summary statistics would you use if you wanted to get a story on the evening news? Which ones would you use if you wanted to reassure an anxious patient? In the repository you downloaded, you should find a file named chap02ex.ipynb.
Other cells give you instructions for exercises. Follow the instructions and fill in the answers. A solution to this exercise is in chap02soln.ipynb. For the following exercises, create a file named chap02ex.py. You can find a solution in chap02soln.py. The mode of a distribution is the most frequent value; see Wikipedia. Write a function called Mode that takes a Hist and returns the most frequent value.
As a more challenging exercise, write a function called AllModes that returns a list of value-frequency pairs in descending order of frequency. Exercise: Using birth weight, investigate whether first babies are lighter or heavier than others, and compute Cohen's d to quantify the difference. How does it compare to the difference in pregnancy length? Glossary distribution: The values that appear in a sample and the frequency of each.
The code for this chapter is in probability.py. Pmfs Another way to represent a distribution is a probability mass function (PMF), which maps from each value to its probability. A probability is a frequency expressed as a fraction of the sample size, n.
To get from frequencies to probabilities, we divide through by n, which is called normalization. Or we can use the Pmf class provided by thinkstats2. Like Hist, the Pmf constructor can take a list, pandas Series, dictionary, Hist, or another Pmf object. The Pmf is normalized so total probability is 1. Pmf and Hist objects are similar in many ways; for example, the methods Values and Items work the same way for both. The biggest difference is that a Hist maps from values to integer counters; a Pmf maps from values to floating-point probabilities.
For example, given pmf = Pmf([1, 2, 2, 3, 5]): pmf.Prob(2) returns 0.4. Incr increases the probability associated with a value, so after pmf.Incr(2, 0.2), pmf.Prob(2) returns 0.6. Mult multiplies it, so after pmf.Mult(2, 0.5), pmf.Prob(2) returns 0.3. If you modify a Pmf, the result may not be normalized; that is, the probabilities may no longer add up to 1. To check, pmf.Total() now returns 0.9; after calling pmf.Normalize(), pmf.Total() returns 1.0. Pmf objects provide a Copy method so you can make and modify a copy without affecting the original.
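The Pmf interface can be sketched as a small dict subclass; this is a minimal stand-in, not the full thinkstats2.Pmf:

```python
class Pmf(dict):
    """Minimal PMF: maps values to floating-point probabilities."""
    def __init__(self, values=()):
        super().__init__()
        for x in values:
            self[x] = self.get(x, 0) + 1
        self.Normalize()
    def Prob(self, x):
        return self.get(x, 0)
    def Incr(self, x, term):
        self[x] = self.get(x, 0) + term
    def Mult(self, x, factor):
        self[x] = self.get(x, 0) * factor
    def Total(self):
        return sum(self.values())
    def Normalize(self):
        total = self.Total()
        if total == 0:
            return
        for x in self:
            self[x] /= total

pmf = Pmf([1, 2, 2, 3, 5])
```

Incr and Mult leave the Pmf unnormalized on purpose; the caller decides when to renormalize.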
My notation in this section might seem inconsistent, but there is a system: I use Pmf for the name of the class, pmf for an instance of the class, and PMF for the mathematical concept of a probability mass function. Bar graphs are most useful if the number of values in the Pmf is small. Step functions are most useful if there are a large number of values and the Pmf is smooth. This function also works with Hist objects. The figure shows PMFs of pregnancy length for first babies and others using bar graphs (left) and step functions (right).
Figure: PMF of pregnancy lengths for first babies and others, using bar graphs and step functions. By plotting the PMF instead of the histogram, we can compare the two distributions without being misled by the difference in sample size.
Based on this figure, first babies seem slightly less likely to be born in week 39 and slightly more likely to be late. PrePlot takes optional parameters rows and cols to make a grid of figures, in this case one row of two figures. The figure on the left displays the Pmfs using thinkplot.Hist, as we have seen before. The second call to PrePlot resets the color generator. Then SubPlot switches to the second figure (on the right) and displays the Pmfs using thinkplot.Pmfs.
I used the axis option to ensure that the two figures are on the same axes, which is generally a good idea if you intend to compare two figures.
Other Visualizations Histograms and PMFs are useful while you are exploring data and trying to identify patterns and relationships. Once you have an idea what is going on, a good next step is to design a visualization that makes the patterns you have identified as clear as possible.
In the NSFG data, the biggest differences in the distributions are near the mode. In this code, weeks is the range of weeks and diffs is the difference between the two PMFs in percentage points. The resulting bar chart makes the pattern clearer: first babies are less likely to be born in week 39, and somewhat more likely to be born in weeks 41 and 42. For now we should hold this conclusion only tentatively.
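The diffs computation can be sketched with plain dictionaries; the probabilities below are made-up toy values, not the NSFG results:

```python
def pmf_diffs(pmf1, pmf2, values):
    """Difference between two PMFs (dicts of value -> probability),
    in percentage points."""
    return [100 * (pmf1.get(v, 0) - pmf2.get(v, 0)) for v in values]

# Toy PMFs concentrated near week 39; probabilities are assumptions
first = {39: 0.47, 41: 0.09}
other = {39: 0.54, 41: 0.07}
weeks = range(35, 46)
diffs = pmf_diffs(first, other, weeks)
```

Using .get(v, 0) means weeks that appear in one PMF but not the other contribute the full probability difference instead of raising KeyError.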
We used the same dataset to identify an apparent difference and then chose a visualization that makes the difference apparent. Figure: Difference, in percentage points, by week. At many American colleges and universities, the student-to-faculty ratio is about 10:1. But students are often surprised to discover that their average class size is bigger than 10. There are two reasons for the discrepancy: students typically take more classes per semester than professors teach, and the number of students in a large class is (by definition) large. The first effect is obvious, at least once it is pointed out; the second is more subtle.
Suppose that a college offers 65 classes in a given semester, with the following distribution of sizes:

size   count
5-9    8
10-14  8
15-19  14
20-24  4
25-29  6
30-34  12
35-39  8
40-44  3
45-49  2

If you compute the mean of this distribution, the average class size is 23.7. But if you survey a group of students, ask them how many students are in their classes, and compute the mean, you would think the average class size was bigger.
For each class size, x, we multiply the probability by x, the number of students who observe that class size. The result is a new Pmf that represents the biased distribution. In the biased distribution there are fewer small classes and more large ones; the mean of the biased distribution is 29.1, almost 25% higher than the actual mean. Figure: Distribution of class sizes, actual and as observed by students.
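The biasing operation, and its inverse, can be sketched with plain dicts; the two-class example below is an assumption chosen for easy arithmetic:

```python
def bias_pmf(pmf):
    """Reweight each value's probability by the value itself,
    modeling what a surveyed student would observe."""
    biased = {x: p * x for x, p in pmf.items()}
    total = sum(biased.values())
    return {x: p / total for x, p in biased.items()}

def unbias_pmf(pmf):
    """Invert the bias by dividing each probability by its value."""
    unbiased = {x: p / x for x, p in pmf.items()}
    total = sum(unbiased.values())
    return {x: p / total for x, p in unbiased.items()}

# Toy class sizes: half the classes have 10 students, half have 50
actual = {10: 0.5, 50: 0.5}
observed = bias_pmf(actual)
```

In this toy case the 50-student class, though only half the classes, accounts for 5/6 of the student observations, and unbias_pmf recovers the original distribution exactly.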
It is also possible to invert this operation: if you cannot observe class sizes directly, an alternative is to choose a random sample of students, ask how many students are in their classes, and then unbias the result. DataFrame Indexing You can also provide row names. The set of row names is called the index; the row names themselves are called labels. If you know the integer position of a row, rather than its label, you can use the iloc attribute, which also returns a Series. My advice: if your rows have labels that are not simple integers, use the labels consistently and avoid using integer positions.
Exercises Solutions to these exercises are in chap03soln.ipynb. Something like the class size paradox appears if you survey children and ask how many children are in their family. Families with many children are more likely to appear in your sample, and families with no children have no chance to be in the sample.
Now compute the biased distribution we would see if we surveyed the children and asked them how many children under 18 (including themselves) are in their household. Plot the actual and biased distributions, and compute their means. As a starting place, you can use chap03ex.ipynb. To test these methods, check that they are consistent with the methods Mean and Var provided by Pmf. To address this version of the question, select respondents who have at least two babies and compute pairwise differences.
Does this formulation of the question yield a different result?

Think Stats v2.0