The population is the entire group that the
researchers are interested in. Because it is usually too costly to gather the
data for the entire population, researchers will collect data from a sample,
representing a subset of the population.
A parameter is a true quantity for the entire
population, while a statistic is what is calculated from the sample. A
parameter is about a population and a statistic is about a sample. Remember: p
goes with p and s goes with s.
Two common
summary quantities are the mean (for
numerical variables) and the proportion (for
categorical variables).
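
To make the parameter/statistic distinction concrete, here is a minimal sketch in Python. The population of heights, the 180 cm cutoff, and the sample size of 100 are all made-up assumptions for illustration; in practice the population values are usually unknown, which is exactly why a sample is taken.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (cm) of 10,000 individuals.
population = [random.gauss(170, 10) for _ in range(10_000)]

# Parameter: a quantity computed from the entire population.
population_mean = statistics.mean(population)

# Statistic: the same quantity computed from a random sample.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)

# Proportion for a categorical (yes/no) variable: "taller than 180 cm".
population_prop = sum(h > 180 for h in population) / len(population)
sample_prop = sum(h > 180 for h in sample) / len(sample)

print(f"mean: parameter={population_mean:.1f}, statistic={sample_mean:.1f}")
print(f"proportion: parameter={population_prop:.3f}, statistic={sample_prop:.3f}")
```

The statistics will typically land near the parameters without matching them exactly, and a different random sample would give slightly different statistics.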
Finding a
good estimate for a population parameter requires a random sample; do not generalize
from anecdotal
evidence.
There are
two primary types of data collection: observational studies
and experiments.
In an experiment,
researchers impose a treatment to look for a causal relationship between the
treatment and the response. In an observational study,
researchers simply collect data without imposing any treatment.
Remember: Correlation is not causation! In other words, an association
between two variables does not imply that one causes the other. Proving a
causal relationship requires a well-designed experiment.
In an
observational study, one must always consider the existence of confounding
factors. A confounding
factor is a “spoiler variable”
that could explain an observed relationship between the explanatory variable
and the response. Remember: For a variable to be confounding it must be
associated with both the explanatory variable and the response variable.
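
A small simulation can illustrate how a confounding factor produces an association without any causal link. This is only a sketch under stated assumptions: the variables `z` (confounder), `x` (explanatory), and `y` (response) are simulated, `x` has no effect on `y` by construction, and the `correlation` helper is written here for illustration.

```python
import random

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)  # fixed seed for reproducibility
n = 5000

# z is associated with both x and y; x never influences y directly.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

r_xy = correlation(x, y)

# Subtracting z's contribution is possible only because we simulated it;
# with real observational data the confounder may not even be measured.
r_adjusted = correlation(
    [xi - zi for xi, zi in zip(x, z)],
    [yi - zi for yi, zi in zip(y, z)],
)

print(f"association between x and y: {r_xy:.2f}")
print(f"after removing the confounder: {r_adjusted:.2f}")
```

In this setup `x` and `y` are clearly correlated, yet the association essentially disappears once the confounder's contribution is removed.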
When taking
a sample from a population, avoid convenience samples and volunteer samples,
which likely introduce bias. Instead, use a random sampling method.
Generalizations from a sample can be made to a population only if the sample is
random. Furthermore, the generalization can be made only to the population from
which the sample was randomly selected, not to a larger or different
population.
Random
sampling from the entire population of interest avoids the problem of
under-coverage bias. However, response bias and non-response bias can be present
in any type of sample, random or not.
In a simple
random
sample, every individual, as well as every group of individuals of the same size, has
the same probability of being in the sample. A common way to select a simple
random sample is to number each individual of the population from 1 to N. Using
a random digit table or a random number generator, numbers are randomly
selected without replacement and the corresponding individuals become part of
the sample.
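
The numbering procedure above can be sketched with Python's standard library. The function name `simple_random_sample` and the values N = 500 and n = 10 are assumptions chosen for illustration.

```python
import random

def simple_random_sample(N, n, seed=None):
    """Pick n distinct labels from 1..N, each subset of size n equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no label repeats.
    return rng.sample(range(1, N + 1), n)

chosen = simple_random_sample(N=500, n=10, seed=42)
print(sorted(chosen))
```

The individuals whose numbers appear in `chosen` become the sample.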
A systematic
random sample involves choosing a random
starting point in a population, and then selecting members
according to a fixed, periodic interval (such as every 10th member).
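
As a rough sketch, systematic sampling amounts to a random start followed by a fixed step through an ordered list. The function name `systematic_sample` and the 100-member roster are hypothetical.

```python
import random

def systematic_sample(members, k, seed=None):
    """Take every k-th member, starting at a random position among the first k."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random starting index in 0..k-1
    return members[start::k]

roster = list(range(1, 101))  # hypothetical population numbered 1..100
picked = systematic_sample(roster, k=10, seed=7)
print(picked)
```

One caveat worth keeping in mind: if the list has a periodic pattern that lines up with the interval, the resulting sample can be unrepresentative.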
A stratified
random sample involves randomly
sampling from every stratum, where the strata should correspond to a variable
thought to be associated with the variable of interest.
This ensures
that the sample will have appropriate representation from each of the different
strata and reduces variability in the sample estimates.
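
A minimal sketch of stratified sampling follows. The function `stratified_sample`, the made-up `people` list, and the choice of an equal number per stratum are assumptions for illustration; allocating sample sizes in proportion to stratum size is another common choice.

```python
import random
from collections import defaultdict

def stratified_sample(individuals, stratum_of, n_per_stratum, seed=None):
    """Randomly sample n_per_stratum individuals from every stratum."""
    rng = random.Random(seed)

    # Group individuals by their stratum label.
    strata = defaultdict(list)
    for person in individuals:
        strata[stratum_of(person)].append(person)

    # Draw a separate simple random sample within each stratum.
    sample = []
    for label in sorted(strata):
        sample.extend(rng.sample(strata[label], n_per_stratum))
    return sample

# Hypothetical population: (id, smoking status) pairs, 25% smokers.
people = [(i, "smoker" if i % 4 == 0 else "non-smoker") for i in range(200)]
strat = stratified_sample(people, stratum_of=lambda p: p[1], n_per_stratum=5, seed=3)
print(strat)
```

Because every stratum is sampled, neither group can be missed by chance, which a simple random sample cannot guarantee.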
A cluster random
sample involves randomly
selecting a set of clusters, or groups, and then collecting data on all
individuals in the selected clusters. This can be useful when sampling clusters
is more convenient and less expensive than sampling individuals, and it is an
effective strategy when each cluster is approximately representative of the
population.
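
Cluster sampling can be sketched in the same style. The function `cluster_sample` and the city-block data are hypothetical; the key point is that the randomness is applied to whole clusters, and then everyone in the chosen clusters is included.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose n_clusters clusters and keep all of their members."""
    rng = random.Random(seed)
    chosen_ids = rng.sample(sorted(clusters), n_clusters)
    sample = [member for cid in chosen_ids for member in clusters[cid]]
    return chosen_ids, sample

# Hypothetical population: 20 city blocks with 5 households each.
clusters = {
    f"block_{b}": [f"block_{b}_house_{h}" for h in range(5)]
    for b in range(20)
}
chosen_ids, clustered = cluster_sample(clusters, n_clusters=3, seed=11)
print(chosen_ids)
print(len(clustered), "households sampled")
```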
Remember:
Individual strata should be homogeneous (self-similar), while individual
clusters should be heterogeneous (diverse). For example, if smoking is
correlated with what is being estimated, let one stratum be all smokers and the
other be all non-smokers, then randomly select an appropriate number of
individuals from each stratum. Alternatively, if age is correlated with the
variable being estimated, one could randomly select a subset of clusters, where
each cluster has mixed age groups.