Population and Sample in data science and statistics

A gentle introduction to population, sample, and their characteristics in statistics.

Sanjay Nandakumar
5 min read · Mar 27, 2020

“Facts are stubborn things, but statistics are pliable.” – Mark Twain

Populations and samples play a major role in statistics and data science. Without the idea of a population and of samples drawn from it, much of statistics and data science could not exist.

Data is the foundational building block of all analysis, so it is important to know how data is segregated, collected, and sampled before moving on to statistical analysis. In this article, I will discuss the population and the sample from the perspective of statistics and data science.

Population

A population is the collection of all objects, individuals, or entities in a specified group that share some common observable characteristics. Each object in the population is termed an “elementary unit”.

Example- Suppose we have a list of the names of all the employees in a company. That list represents a population, and each employee is an elementary unit.

Types of Population

Finite population

This is a type of population in which the number of elementary units is exactly quantifiable.

Example- Books in a university library.

Infinite population

In this type of population, the count of elementary units cannot be quantified with certainty.

Example- Population of a country. Most of the time the population of a country cannot be quantified with certainty, although it can be approximated. This is because births and deaths change the count every second.

Real population

This is a type of population that is based on real data, so the information is concrete and reliable. It does not require approximation or hypothetical data.

Example- Employees working in a company.

Hypothetical population

This can be a finite or infinite imaginary population designed by a researcher. Here, the researcher typically takes a real-world scenario and applies hypotheses or assumptions to define the structure and information of a population.

Example- Possible outcomes of a die rolled ‘n’ times.
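To make the die example concrete, here is a minimal sketch that simulates such a hypothetical population. The number of rolls (n = 6,000), the seed, and the assumption of a fair six-sided die are my own choices for illustration.

```python
import random
from collections import Counter

random.seed(3)  # fixed seed so the sketch is reproducible

n = 6_000  # assumed number of rolls, chosen only for illustration

# Each simulated roll is one elementary unit of the hypothetical population
rolls = [random.randint(1, 6) for _ in range(n)]

# Count how often each face appears
counts = Counter(rolls)
for face in sorted(counts):
    print(face, counts[face])
```

With a fair die, each face should turn up in roughly one sixth of the rolls, though the exact counts depend on the seed.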

Sample

A part of the population, drawn according to a rule or plan in order to draw conclusions about the population's characteristics, is called a sample.

Example- Imagine a company, XYZ, that has around 50k employees. Analyzing information on all 50k employees is practically difficult for researchers in terms of time and money. A better approach is to select 5k employees (or some other suitable number) at random from this population and collect data from them for the analysis. This randomly selected group of employees is called a sample. The researchers carry out the analysis on the assumption that whatever inferences they draw from these 5k people will apply to the entire population.
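As a rough sketch of this selection step, the snippet below draws 5k employees at random from a population of 50k using Python's standard library. The employee IDs 1 to 50,000 are made up for illustration; a real study would draw from the actual employee list.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: employee IDs 1..50,000
population = list(range(1, 50_001))

# Simple random sample without replacement
sample = random.sample(population, k=5_000)

print(len(sample))       # 5000
print(len(set(sample)))  # 5000, no employee is picked twice
```

`random.sample` picks without replacement, so every employee has the same chance of selection and nobody appears twice.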

Sample size

The number of items in a sample is called the sample size. In the above example, 5k of the 50k employees were selected for analysis, which makes the sample size 5k.

Characteristics of the sample

A sample should have certain characteristics to make it fit for data analysis. Research done on a poorly chosen sample will lead to wrong inferences, which may contradict the behavior of the entire population and have dangerous consequences.

1. Representativeness

A sample should represent the overall behavior of the population. Consider the above example, in which 5k employees are selected out of 50k. Suppose the original population has 30k men and 20k women, but all 5k employees in the sample are women. Any analysis done on this sample will not represent the overall behavior of the population.
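One simple way to check representativeness is to compare a known proportion in the population with the same proportion in the sample. The sketch below does this for the 30k/20k split from the example; the labels "M" and "F" and the helper `share_of_women` are hypothetical names used only for this illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population matching the example: 30k men, 20k women
population = ["M"] * 30_000 + ["F"] * 20_000

def share_of_women(group):
    """Fraction of the group labelled 'F'."""
    return group.count("F") / len(group)

# A random sample versus a sample made up only of women
random_sample = random.sample(population, k=5_000)
biased_sample = [g for g in population if g == "F"][:5_000]

print(share_of_women(population))     # 0.4
print(share_of_women(random_sample))  # close to 0.4
print(share_of_women(biased_sample))  # 1.0
```

The random sample keeps roughly the 40% share of women seen in the population, while the all-female sample does not resemble the population at all.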

2. Homogeneity

Homogeneity is the agreement in behavior across multiple samples. If we draw multiple samples from a population, all of them are expected to lead to roughly the same conclusions about the population.

Imagine we want to estimate the mean salary of the 50k employees and we have 3 samples, each of size 5k.

· Sample 1 has a mean salary of $40k

· Sample 2 has a mean salary of $38k

· Sample 3 has a mean salary of $41k

We can say that these samples are homogeneous since all samples are giving approximately equal information regarding the salary of the employees.

What if the results look like this instead?

· Sample 1 has a mean salary of $40k

· Sample 2 has a mean salary of $15k

· Sample 3 has a mean salary of $100k

Here, the researcher will not be able to determine the approximate salary of a person in the company because the sample means vary so widely.
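The comparison of sample means can be sketched in a few lines. The salary data below is simulated, assuming salaries spread roughly normally around $40k with a standard deviation of $8k; these figures are assumptions for illustration, not data from a real company.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical salaries for 50k employees (assumed distribution)
salaries = [random.gauss(40_000, 8_000) for _ in range(50_000)]

# Draw three random samples of 5k each and compute each sample's mean
sample_means = [
    statistics.fmean(random.sample(salaries, k=5_000))
    for _ in range(3)
]

# Gap between the largest and smallest sample mean
spread = max(sample_means) - min(sample_means)

print([round(m) for m in sample_means])
print(round(spread))
```

When all three samples are drawn at random from the same population, the means land close together, which is what homogeneity looks like in practice. A large gap between them would be a warning sign about how the samples were drawn.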

3. Adequacy

The number of sampling units in a sample should be adequate for doing the research.

In the above example, with 50k employees, it will not be effective to draw a sample of size 5 or 6 for the research.
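To see why a tiny sample is inadequate, we can repeat the sampling many times and look at how much the sample mean moves around. This sketch again uses simulated salaries (assumed normal around $40k), and the helper `spread_of_means` and its 200 repetitions are hypothetical choices for illustration.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical salaries for 50k employees (assumed distribution)
salaries = [random.gauss(40_000, 8_000) for _ in range(50_000)]

def spread_of_means(sample_size, repeats=200):
    """Standard deviation of the sample mean over repeated random samples."""
    means = [
        statistics.fmean(random.sample(salaries, k=sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

small = spread_of_means(5)      # sample size 5
large = spread_of_means(5_000)  # sample size 5k

print(round(small))
print(round(large))
```

The mean of a 5-person sample swings by thousands of dollars from one draw to the next, while the mean of a 5k sample barely moves. That stability is what an adequate sample size buys.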

4. Similar regulating conditions

If multiple samples are needed, they should all be selected in a similar way.

In the above example, a sample of 5k employees was chosen at random out of 50k employees, so any other sample we select should also be chosen at random. Pre-conditions for selecting elementary units should be avoided.

Suppose sample 1, of size 5k, is chosen at random, but sample 2, of the same size and for the same analysis, contains only female employees. This will affect the homogeneity of the samples and lead to incorrect inferences.

Some important terminologies

Sampling unit

Similar to the elementary unit in a population, each element in the sample is called a sampling unit. Here, each of the 5k employees in the sample is a sampling unit.

Sampling frame

A complete list of sampling units, or maps or other acceptable material, that represents the population to be sampled is called the sampling frame.

Let’s say we have a list of salary details of 50k people in a company.

Here,

Each salary is a data point, which is the sampling unit.

The salary details will be collected from each of the 50k employees, which means they are the information providers. This makes each of the 50k employees an observational unit.

The list of salaries with all its subcomponents, including Provident Fund deductions, house rent allowances, and bonuses, is the sampling frame.

In the next part of this article, I cover the types of sampling, probabilistic and non-probabilistic, and how they can be created and used for statistical research.

Sampling in statistical research - Part 1

Thanks for reading!
