BUSINESS STATISTICS

COURSE CONTRIBUTOR
Dr. Ritika Sambyal
Asstt. Professor, Department of Commerce
Udhampur Campus, University of Jammu

PROOF READING BY
Prof. Sandeep Kour Tandon
Co-ordinator M.Com
Room No. 111, Ist Floor, D.D.E., University of Jammu

© Director of Distance Education, University of Jammu, Jammu, 2020.

All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without permission in writing from the DDE, University of Jammu.

The script writer shall be responsible for the lesson/script submitted to the DDE and any plagiarism shall be his/her entire responsibility.

Printed by : S.K. Printers/2020/
DIRECTORATE OF DISTANCE EDUCATION, UNIVERSITY OF JAMMU
M.COM. FIRST SEMESTER (NCBCS)
BUSINESS STATISTICS

Course : M.COM-E115          Max Marks : 100 Marks
Credit : 4                   External : 80 Marks
Time : 3.00 Hrs              Internal : 20 Marks

(Syllabus for Examinations to be held in Dec 2019 onwards)

Objective : To acquaint the students with the various concepts and techniques of business statistics along with their application to the problems associated with the field of trade and business.

UNIT-I : DATA COLLECTION & SAMPLING METHODS (Page No. 5 to 90)
Concept and role of business statistics; Sources of data - Secondary sources; Primary data collection methods - Questionnaire, Interview, Observation; Types of scales - Nominal, ordinal, interval and ratio scales; Sampling - Concept and essentials; Non-probability sampling methods - Convenience, judgement, quota and snowball sampling; Probability sampling - Simple random, systematic, stratified and cluster sampling; Sampling and non-sampling errors; Basics of data feeding and analysis software - SPSS.

UNIT-II : ASSOCIATION OF ATTRIBUTES (Page No. 90 to 170)
Concept of association of attributes; Consistency of data; Association and disassociation; Methods of attributes - Comparison method, proportion method, Yule's coefficient of association, coefficient of colligation, coefficient of contingency; Partial and multiple correlation and regression analysis.

UNIT-III : PROBABILITY AND ANALYSIS OF VARIANCE (Page No. 171 to 264)
Concept of probability, basic terms in probability, addition theorem, multiplication theorem; Theoretical frequency distributions: Elementary knowledge of normal, binomial and Poisson distributions and their application to business problems; Analysis of variance: Concept, assumptions, one way and two way classifications.

UNIT-IV : RELIABILITY, VALIDITY AND HYPOTHESIS TESTING (Page No. 265 to 341)
Reliability: Concept, types; Validity: Concept, types; Hypothesis: Concept and types of hypothesis - Null and alternative hypothesis, Type I and Type II errors, critical region, level of significance, p value; Large sample tests for population mean and population proportion;
Parametric tests : F-test, Z-test, t-test; Non-parametric tests - Chi square, Mann Whitney and Kruskal Wallis test.

STATISTICAL TABLES (Page No. 342 to 347)

BOOKS RECOMMENDED
1. Statistics for Management - Levin, Richard and David S. Rubin, Prentice Hall, Delhi.
2. Business Statistics - Levin and Berenson, Pearson Education, New Delhi.
3. Statistics for Business and Economics - Anderson, Sweeney and Williams, Thompson, New Delhi.
4. Statistics for Business and Economics - Hooda, R.P., Macmillan, New Delhi.
5. Statistics for Business & Economics - Kohler, Heinz, Harper Collins.
6. Quantitative Approach to Managerial Decisions - Hein, L.W., Prentice Hall, New Jersey.
7. Statistics for Business & Economics - Lawrence B. Morse, Harper Collins.
8. Statistics for Business and Economics - McClave, Benson and Sincich, Eleventh Edition, Prentice Hall Publication.
9. Organization Behaviour - Ricky Griffin & Gregory Moorhead, Houghton Mifflin Co., Boston.
10. Organisation Behaviour - Griffin, Ricky W., Houghton Mifflin Co., Boston.
11. Organization Behaviour - Hellriegel, Don, John W. Slocum, Jr. and Richard W. Woodman, South Western College Publishing, Ohio.
12. Management of Organisational Behaviour: Utilising Human Resources - Hersey, Paul, Kenneth H. Blanchard and Dewey E. Johnson, Prentice Hall, New Delhi.

MODE OF EXAMINATION
The paper consists of two sections. Each section will cover the whole of the syllabus without repeating the question in the entire paper.

Section A : It will consist of eight short answer questions, selecting two from each unit. A candidate has to attempt any six, and the answer to each question shall be within 200 words. Each question carries four marks and the total weightage of this section shall be 24 marks.

Section B : It will consist of six essay type questions with the answer to each question within 800 words. At least one question will be set from each unit and the candidate has to attempt four. Each question will carry 14 marks and the total weightage shall be 56 marks.
MODEL TEST PAPER
BUSINESS STATISTICS
Duration of examination: 3 hours                M. Marks: 80

SECTION A
Attempt any six questions. Each question carries four marks. Answer to each question should be within 200 words.

1. Suppose 2% of the items made by a factory are defective. Find the probability that there are 3 defective items in a sample of 100 items. (Given: e^-2 = 0.135)
2. What are the different methods of collection of data? Why are personal interviews usually preferred to questionnaires?
3. What do you understand by theoretical distributions? Discuss their utility in statistics.
4. Discuss the large sample test for testing the equality of two population means.
5. In a group of 800 students, the number of married students is 320. Out of 240 students who failed, 96 belonged to the married group. Find out whether the attributes marriage and failure are independent, using Yule's coefficient of colligation.
6. Explain multiple correlation and multiple regression with the help of an example.
7. Distinguish between stratified sampling and cluster sampling.
8. Define Student's t-statistic. State briefly the important properties of the t-distribution.

SECTION B
Attempt any four questions. Each question carries 14 marks. Answer to each question should be within 800 words.

1. What do you mean by a questionnaire? What is the difference between a questionnaire and a schedule? State the essential points to be remembered in drafting a questionnaire.
2. (i) What is binomial distribution? State its important properties.
   (ii) Five dice are thrown simultaneously. If the occurrence of an even number on a single die is considered a success, find the probability of at most 3 successes.
3. Distinguish between sampling and non-sampling errors. What are their sources? How can these errors be controlled?
4. Differentiate the following pairs of concepts:
   (i) Statistic and Parameter
   (ii) Critical Region and Region of Acceptance
   (iii) Null and Alternative Hypothesis
   (iv) Type I and Type II Errors
5. An instructor of mathematics wishes to determine the relationship of grades on a final examination to grades on two quizzes given during the semester. Denoting by X1, X2 and X3 the grades of a student on the first quiz, second quiz and final examination respectively, he made the following computations for a total of 120 students:
   Mean of X1 = 6.8,  Mean of X2 = 7.0,  Mean of X3 = 74
   S1 = 1.0,  S2 = 0.80,  S3 = 9.0
   r12 = 0.60,  r13 = 0.70,  r23 = 0.65
6. What is ANOVA? Outline various steps in carrying out one way and two way ANOVA. Also, explain the conditions necessary for ANOVA.
M.Com. I                                    Course No. M.Com-E115
Unit-I                                      Lesson No. 1

DATA COLLECTION AND SAMPLING METHODS
CONCEPT AND ROLE OF BUSINESS STATISTICS AND SOURCES OF DATA

STRUCTURE
1.1 Introduction
1.2 Objectives
1.3 Concept of Business Statistics
1.4 Role of Business Statistics
    1.4.1 Importance of Business Statistics
    1.4.2 Functions of Business Statistics
    1.4.3 Limitations of Business Statistics
1.5 Sources of Data
1.6 Summary
1.7 Glossary
1.8 Self Assessment Questions
1.9 Lesson End Exercise
1.10 Suggested Reading
1.1 INTRODUCTION

Business statistics, like many areas of study, has its own language. It is important to begin our study with an introduction of some basic concepts in order to understand and communicate about the subject. We begin with a discussion of the word statistics. The word statistics has many different meanings in our culture. Webster's Third New International Dictionary gives a comprehensive definition of statistics as a "science dealing with the collection, analysis, interpretation, and presentation of numerical data". Also, the study of business statistics is important, valuable, and interesting. However, because it involves a new language of terms, symbols, logic, and application of mathematics, it can at times be overwhelming.

The study of statistics can be organised in a variety of ways. One of the main ways is to subdivide statistics into two branches: descriptive statistics and inferential statistics. If a business analyst is using data gathered on a group to describe or reach conclusions about that same group, the statistics are called descriptive statistics. For example, if an instructor produces statistics to summarise a class's examination effort and uses those statistics to reach conclusions about that class only, the statistics are descriptive. Many of the statistical data generated by businesses are descriptive. They might include the number of employees on vacation during June, average salary at the Denver office, corporate sales for 2009, average managerial satisfaction score on a company-wide census of employee attitudes, and average return on investment for the Lofton Company for the years 1990 through 2008.

Another type of statistics is called inferential statistics. If a researcher gathers data from a sample and uses the statistics generated to reach conclusions about the population from which the sample was taken, the statistics are inferential statistics. The data gathered from the sample are used to infer something about a larger group. Inferential statistics are sometimes referred to as inductive statistics. The use and importance of inferential statistics continue to grow. One application of inferential statistics is in pharmaceutical research. Some new drugs are expensive to produce, and therefore tests must be limited to small samples of patients. Utilising inferential statistics, researchers can design experiments with small, randomly selected samples of patients and attempt to reach conclusions and make inferences about the population.
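The contrast between the two branches can be made concrete with a small sketch. The following is a minimal illustration in Python (the exam scores, population and sample sizes are invented for illustration; the lesson itself does not prescribe any particular software):

```python
import random
import statistics

# Descriptive statistics: summarise the group we actually observed in full.
class_scores = [62, 74, 81, 55, 90, 68, 77, 84, 59, 72]   # one class, fully observed
print("class mean =", statistics.mean(class_scores))
print("class standard deviation =", round(statistics.stdev(class_scores), 2))

# Inferential statistics: use a random sample to say something about a
# larger population that we have not fully observed.
population = [random.gauss(70, 10) for _ in range(10_000)]  # stand-in for the population
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)
sample_se = statistics.stdev(sample) / (len(sample) ** 0.5)

# Approximate 95% confidence interval for the unknown population mean.
print("estimated population mean:", round(sample_mean, 2),
      "+/-", round(1.96 * sample_se, 2))
```

The first half only describes the observed class; the second half infers a population value from a sample, which is exactly the distinction drawn above.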
Market researchers use inferential statistics to study the impact of advertising on various market segments. Suppose a soft drink company creates an advertisement depicting a dispensing machine that talks to the buyer, and market researchers want to measure the impact of the new advertisement on various age groups. The researcher could stratify the population into age categories ranging from young to old (that is, the strata are defined first and then sampled randomly) and use inferential statistics to determine the effectiveness of the advertisement for the various age groups in the population. The advantage of using inferential statistics is that they enable the researcher to study effectively a wide range of phenomena without having to conduct a census.

1.2 OBJECTIVES

After going through this lesson, you would be able to:
• define the concept of business statistics.
• explain the role and importance of business statistics.
• describe different sources of data.

1.3 CONCEPT OF BUSINESS STATISTICS

In the beginning, it may be noted that the word 'statistics' is used rather curiously in two senses, plural and singular. In the plural sense, it refers to a set of figures or data. In the singular sense, statistics refers to the whole body of tools that are used to collect data, organise and interpret them and, finally, to draw conclusions from them. It should be noted that both aspects of statistics are important if the quantitative data are to serve their purpose. If statistics, as a subject, is inadequate and consists of poor methodology, we could not know the right procedure to extract from the data the information they contain. Similarly, if our data are defective, inadequate or inaccurate, we could not reach the right conclusions even though our subject is well developed.

A.L. Bowley has defined statistics as: (i) statistics is the science of counting, (ii) statistics may rightly be called the science of averages, and (iii) statistics is the science of the measurement of the social organism, regarded as a whole, in all its manifestations. Boddington defined it as: statistics is the science of estimates and probabilities. Further, W.I. King has defined statistics in a wider context: the science of statistics is the method of judging collective, natural or social phenomena from the results obtained by the analysis or enumeration or collection of estimates. Seligman explained that statistics is a science that deals with the methods of collecting, classifying, presenting, comparing and interpreting numerical data collected to throw some light on any sphere of enquiry. Spiegel defines statistics, highlighting its role in decision-making particularly under uncertainty, as follows: statistics is concerned with scientific methods for collecting, organising, summarising, presenting and analysing data as well as drawing valid conclusions and making reasonable decisions on the basis of such analysis. According to Prof. Horace Secrist, statistics is the
aggregate of facts, affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner for a pre-determined purpose, and placed in relation to each other.

From the above definitions, we can highlight the major characteristics of statistics as follows:

(i) Statistics are aggregates of facts. A single figure is not statistics. For example, the national income of a country for a single year is not statistics, but the same for two or more years is statistics.

(ii) Statistics are affected by a number of factors. For example, the sale of a product depends on a number of factors such as its price, quality, competition, the income of the consumers, and so on.

(iii) Statistics must be reasonably accurate. Wrong figures, if analysed, will lead to erroneous conclusions. Hence, it is necessary that conclusions be based on accurate figures.

(iv) Statistics must be collected in a systematic manner. If data are collected in a haphazard manner, they will not be reliable and will lead to misleading conclusions.

(v) Statistics must be collected for a pre-determined purpose.

(vi) Lastly, statistics should be placed in relation to each other. If one collects data unrelated to each other, then such data will be confusing and will not lead to any logical conclusions. Data should be comparable over time and over space.

Most of the information around us is determined with the help of statistics, e.g., weather forecasts, medical studies, quality testing, stock markets, etc. Business statistics, therefore, involves the application of statistical tools in the areas of marketing, production, finance, research and development, manpower planning, etc. to extract relevant information for the purpose of decision making. Business managers use statistical tools and techniques to explore almost all areas of business operations of public and private enterprises.

1.4 ROLE OF BUSINESS STATISTICS

Statistics play an important role in business. A successful businessman must be very quick and accurate in decision making. He should know what his customers want; he should, therefore, know what to produce and sell and in what quantities. Statistics helps the businessman to plan production according to the taste of the customers, and the quality of the products can also be checked more efficiently by using statistical methods. Hence, all the activities of the
businessman are based on statistical information. He can make correct decisions about the location of business, marketing of the products, financial resources, etc.

1. In Business – It helps to make swift decisions by providing useful information about customer trends and variations, cost trends and variations, price trends and variations, etc.

2. In Mathematics – It helps in describing measurements and providing accuracy of theories.

3. In Economics – It helps to find the relationship between two variables like demand and supply, cost and revenue, imports and exports, and helps to establish relationships between the inflation rate, per capita income, income distribution, etc.

4. In Accounts – It helps to discover trends and create projections for the next year.

5. In Physics – It helps to compute distances between objects in space.

6. In Research – It helps in formulating and testing hypotheses.

7. In Government – The government takes the help of statistics to make budgets, set minimum wages, estimate the cost of living, etc.

1.4.1 Importance of Business Statistics

These days statistical methods are applicable everywhere. There is no field of work in which statistical methods are not applied. According to A.L. Bowley, "A knowledge of statistics is like a knowledge of foreign languages or of Algebra; it may prove of use at any time under any circumstances". The importance of statistical science is increasing in almost all spheres of knowledge, e.g., astronomy, biology, meteorology, demography, economics and mathematics.

Economic planning without statistics is bound to be baseless. Statistics serve in administration and facilitate the work of formulating new policies. Financial institutions and investors utilise statistical data to summarise past experience. Statistics are also helpful to an auditor when he uses sampling techniques or test checking to audit the accounts of his client. The importance of business statistics can be summarised through the following points:
1. It deals with uncertainties by forecasting seasonal, cyclic and general economic fluctuations.

2. It helps in sound decision making by providing accurate estimates about costs, demand, prices, sales, etc.

3. It helps in business planning on the basis of sound predictions and assumptions.

4. It helps in measuring variations in the performance of products, employees, business units, etc.

5. It allows comparison of two or more products, business units, sales teams, etc., and helps in identifying the relationship between various variables and their effect on each other, like the effect of advertisement on sales.

6. It helps in validating generalisations and theoretical concepts formulated by managers.

1.4.2 Functions of Statistics

The functions of statistics are as follows:

1. It presents facts in a definite form: Numerical expressions are convincing and, therefore, one of the most important functions of statistics is to present statements in a precise and definite form.

2. It simplifies a mass of figures: Data presented in the form of tables, graphs or diagrams, averages or coefficients are simple to understand.

3. It facilitates comparison: Once the data are simplified, they can be compared with other similar data. Without such comparison the figures would have been useless.

4. It helps in prediction: Plans and policies of organisations are invariably formulated well in advance of their implementation. Knowledge of future trends is very useful in framing suitable policies and plans.

5. It helps in formulating and testing hypotheses: Statistical methods like the z-test, t-test and chi-square (χ²) test are extremely helpful in formulating and testing hypotheses and in developing new theories (a minimal illustration is sketched after this list).

6. It helps in the formulation of suitable policies: Statistics provide the basic material for framing suitable policies. It helps in estimating export, import or production programmes in the light of changes that may occur.

7. Statistics indicates trend behaviour: Statistical techniques such as correlation, regression, time series analysis, etc. are useful in forecasting future events.
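As a small, hedged illustration of point 5 above, the sketch below runs a one-sample t-test in Python using SciPy (the syllabus points to SPSS for analysis, so this is only a language-neutral stand-in; the sales figures and the historical mean of 50 are invented for illustration):

```python
from scipy import stats

# Hypothetical monthly sales (in lakhs) after an advertising campaign.
# Null hypothesis H0: the true mean is still the historical mean of 50.
sales = [52, 49, 55, 51, 53, 50, 54, 56, 48, 52]

t_stat, p_value = stats.ttest_1samp(sales, popmean=50)

print(f"t statistic = {t_stat:.2f}, p value = {p_value:.3f}")
# If the p value is below the chosen level of significance (say 0.05),
# we reject H0 and conclude that mean sales have changed.
```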
1.4.3 Limitations of Business Statistics

The scope of the science of statistics is restricted by certain limitations:

1. Statistics deals only with quantitative characteristics: Statistics are numerical statements of facts. Data which cannot be expressed in numbers are incapable of statistical analysis. Qualitative characteristics like honesty, efficiency, intelligence, etc. cannot be studied directly.

2. Statistics deals with aggregates, not with individuals: Since statistics deals with aggregates of facts, the study of individual measurements lies outside the scope of statistics.

3. Statistical laws are not perfectly accurate: Statistics deals with characteristics which are affected by a multiplicity of causes, and it is not possible to study the effect of each of these factors separately. Due to this limitation, the results obtained are not perfectly accurate but only an approximation.

4. Statistical results are only an average: Statistical results reveal only the average behaviour. Conclusions obtained statistically are not universally true; they are true only under certain conditions.

5. Statistics is only one of the methods of studying a problem: Statistical tools do not provide the best solution under all circumstances.

6. Statistics can be misused: The greatest limitation of statistics is that it is liable to be misused. Data placed before an inexperienced person may reveal wrong results. Only persons having fundamental knowledge of statistical methods can handle the data properly.

1.5 SOURCES OF DATA

For studying a problem statistically, first of all the data relevant thereto must be collected. The numerical facts constitute the raw material of the statistical process. The interpretation of the ultimate conclusions and the decisions depend upon the accuracy with which the data are collected. Unless the data are collected with sufficient care and are as accurate as is necessary for the purposes of the inquiry, the results obtained cannot be expected to be valid or reliable.

Statistical data are the basic raw material of statistics. Data may relate to an activity of our interest, a phenomenon, or a problem situation under study. They derive as a result of the process of measuring, counting and/or observing. Statistical data, therefore, refer to those aspects of a problem situation that can be measured, quantified, counted, or classified.
Any object, subject, phenomenon, or activity that generates data through this process is termed a variable. In other words, a variable is one that shows a degree of variability when successive measurements are recorded. In statistics, data are classified into two broad categories: quantitative data and qualitative data. This classification is based on the kind of characteristics that are measured.

Quantitative data are those that can be quantified in definite units of measurement. These refer to characteristics whose successive measurements yield quantifiable observations. Depending on the nature of the variable observed for measurement, quantitative data can be further categorised as continuous and discrete data.

(i) Continuous data represent the numerical values of a continuous variable. A continuous variable is one that can assume any value between any two points on a line segment, thus representing an interval of values. The values are quite precise and close to each other, yet distinguishably different. Characteristics such as weight, length, height, thickness, velocity, temperature, tensile strength, etc. represent continuous variables. Thus, the data recorded on these and similar other characteristics are called continuous data. It may be noted that a continuous variable assumes the finest unit of measurement, finest in the sense that it enables measurements to the maximum degree of precision.

(ii) Discrete data are the values assumed by a discrete variable. A discrete variable is one whose outcomes are measured in fixed numbers. Such data are essentially count data. They are derived from a process of counting, such as the number of items possessing or not possessing a certain characteristic. The number of customers visiting a departmental store every day, the incoming flights at an airport, and the defective items in a consignment received for sale are all examples of discrete data.

Qualitative data refer to qualitative characteristics of a subject or an object. A characteristic is qualitative in nature when its observations are defined and noted in terms of the presence or absence of a certain attribute in discrete numbers. These data are further classified as nominal and rank data.

(i) Nominal data are the outcome of classification into two or more categories of items or units comprising a sample or a population according to some quality characteristic. Classification of students according to sex (as males and females), of workers according to skill (as skilled, semi-skilled, and unskilled), and of employees according to the level of education (as matriculates, undergraduates, and post-graduates) all result in nominal data. Given any such basis of classification, it is always possible to assign each item to a particular class and make a summation of items belonging to each class. The count data so obtained are called nominal data.
(ii) Rank data, on the other hand, are the result of assigning ranks to specify order in terms of the integers 1, 2, 3, ..., n. Ranks may be assigned according to the level of performance in a test, a contest, a competition, an interview, or a show. The candidates appearing in an interview, for example, may be assigned ranks in integers ranging from 1 to n, depending on their performance in the interview. Ranks so assigned can be viewed as the continuous values of a variable involving performance as the quality characteristic.

Before starting the collection of data, it is necessary to know the sources from which the data are to be collected. Data sources could be seen as of two types, viz., secondary and primary. The two can be defined as under:

(i) Secondary data: These already exist in some form, published or unpublished, in an identifiable secondary source. They are generally available from published source(s), though not necessarily in the form actually required. Secondary data analysis can save time that would otherwise be spent on collecting data, particularly in the case of quantitative data. Analysts of social and economic change also consider secondary data essential, because sometimes it is impossible to conduct a new survey that can adequately capture past change or development.

(ii) Primary data: These are data which do not already exist in any form, and thus have to be collected for the first time from the primary source(s). By their very nature, these data require fresh and first-time collection covering the whole population or a sample drawn from it. This type of information is collected specifically for the purpose of the research project in hand. An advantage of primary data is that they are specifically tailored to the research needs.

The original compiler of the data is the primary source. For example, the office of the Registrar General will be the primary source of the decennial population census figures. A secondary source is one that furnishes data that were originally compiled by someone else. The sources of data are also classified according to the character of the data yielded by them.

Therefore, the data gathered from a primary source are known as primary data, and those gathered from a secondary source are known as secondary data. When an investigator is making use of figures which he has obtained by field enumeration, he is said to be using primary data, and when he is making use of figures which he has obtained from some other source, he is said to be using secondary data.
An investigator has to decide whether he will collect fresh (primary) data or compile data from published sources. The decision to collect primary or secondary data would depend upon factors such as the source from which they have been obtained, their true significance, their completeness and the method of collection.

In addition to the above factors, there are some other factors to be considered while making a choice between primary and secondary data:

(i) Nature and scope of the enquiry,
(ii) Availability of time and money,
(iii) Degree of accuracy required, and
(iv) The status of the investigator, i.e., individual, private company, government, etc.

However, it may be pointed out that in certain investigations both primary and secondary data may have to be used, one being a supplement to the other.

The primary methods of collection of statistical information are the following:

1. Direct Personal Observation,
2. Indirect Personal Observation,
3. Schedules to be filled in by informants,
4. Information from Correspondents, and
5. Questionnaires in charge of enumerators.

The particular method adopted would depend upon the nature of the enquiry and the availability of time, money and other facilities to the investigator.

The methods of collecting secondary data are classified into published as well as unpublished sources.

1.6 SUMMARY

Statistics play an important role in business. A successful businessman must be very quick and accurate in decision making, so all the activities of the businessman are based on statistical information. He can make correct decisions about the location of business, marketing of the products, financial resources, etc.

Statistical studies are extremely important in our everyday life. Statistics is the method of conducting a study about a particular topic by collecting, organising, interpreting, and finally presenting data. Some major areas relying on statistics include government, education, science, and large companies.
1.7 GLOSSARY

 Business: Business is the activity of making one's living or making money by producing or buying and selling products (such as goods and services).

 Statistics: Statistics is a branch of mathematics dealing with data collection, organisation, analysis, interpretation and presentation. In applying statistics to, for example, a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model process to be studied.

 Business Statistics: Business statistics is the science of good decision making in the face of uncertainty and is used in many disciplines such as financial analysis, econometrics, auditing, production and operations including services improvement, and marketing research.

 Data: Data are individual pieces of factual information recorded and used for the purpose of analysis. They are the raw information from which statistics are created.

 Primary Data: Data observed or collected directly from first-hand experience.

 Secondary Data: Published data and data collected in the past or by other parties are called secondary data.

1.8 SELF ASSESSMENT QUESTIONS

i. Which one of these statistics is unaffected by outliers? (Tick () the correct option):
   a. Mean
   b. Inter-quartile range
   c. Standard deviation
   d. Range

ii. A list of 5 pulse rates is: 70, 64, 80, 74, 92. What is the median for this list?
   a. 74
   b. 76
   c. 77
   d. 80

iii. Which of the following would indicate that a dataset is not bell-shaped?
   a. The range is equal to 5 standard deviations.
   b. The range is larger than the inter-quartile range.
   c. The mean is much smaller than the median.
   d. There are no outliers.
6. Discuss the role of business statistics.
____________________________________________________________
____________________________________________________________
____________________________________________________________

1.10 SUGGESTED READING

 Gupta, S.P.: Statistical Methods, Sultan Chand & Sons, New Delhi.
 Gupta, S.C. and V.K. Kapoor: Fundamentals of Applied Statistics.
 Anderson, Sweeney and Williams: Statistics for Business and Economics, Thompson, New Delhi.
 Levin and Berenson: Business Statistics, Pearson Education, New Delhi.
 Hooda, R.P.: Statistics for Business and Economics, Macmillan, New Delhi.
M.Com. I                                    Course No. M.Com-115
Unit-I                                      Lesson No. 2

DATA COLLECTION METHODS AND TYPES OF MEASUREMENT SCALES

STRUCTURE
2.1 Introduction
2.2 Objectives
2.3 Methods of Collecting Primary Data
    2.3.1 Questionnaire
    2.3.2 Interview
        2.3.2.1 Personal Interview
        2.3.2.2 Telephone Interview
    2.3.3 Observation
    2.3.4 Other Methods of Collecting Primary Data
2.4 Methods of Collecting Secondary Data
    2.4.1 Characteristics of Secondary Data
    2.4.2 Advantages and Disadvantages
2.5 Types of Measurement Scales
    2.5.1 Nominal
    2.5.2 Ordinal
    2.5.3 Interval
    2.5.4 Ratio
2.6 Summary
2.7 Glossary
2.8 Self Assessment Questions
2.9 Lesson End Exercise
2.10 Suggested Reading

2.1 INTRODUCTION

We have already discussed the meaning of primary and secondary data in Lesson 1. But, for the researcher, the mere meaning does not serve any purpose. The researcher must know the sources of collecting primary as well as secondary data. Further, millions of numerical data are gathered in business every day, representing myriad items. For example, numbers represent costs of items produced, geographical locations of retail outlets, weights of shipments, and rankings of subordinates at yearly reviews. All such data should not be analysed in the same way statistically, because the entities represented by the numbers are different. For this reason, the business researcher needs to know the level of data measurement represented by the numbers being analysed. The disparate use of numbers can be illustrated by 40 and 80, which could represent the weights of two objects being shipped, the ratings received on a consumer test by two different products, or the football jersey numbers of a fullback and a wide receiver. Although 80 pounds is twice as much as 40 pounds, the wide receiver is probably not twice as big as the fullback. Averaging the two weights seems reasonable, but averaging the football jersey numbers makes no sense. The appropriateness of the data analysis depends on the level of measurement of the data gathered. The phenomenon represented by the numbers determines the level of data measurement.

Therefore, in this lesson, we will discuss methods of collecting primary as well as secondary data and also the different scales used for measuring collected data.
2.2 OBJECTIVES

After studying this lesson, you will be able to:
• understand various methods of collecting primary data.
• know the sources of collecting secondary data.
• describe different scales of measurement used in statistics.

2.3 METHODS OF COLLECTING PRIMARY DATA

Statistical data can be categorised in a number of ways, including primary versus secondary. Primary data refer to those generated by a researcher for the specific problem or decision at hand. Survey research, experimentation, and observational research are among the most popular methods for collecting primary data.

2.3.1 Questionnaire

Also referred to as the data collection instrument, the questionnaire is either filled out personally by the respondent or administered and completed by an interviewer. Under this method, a list of questions pertaining to the survey is prepared and sent to the various informants. The questionnaire contains questions and provides space for answers. A questionnaire may contain any of three types of questions:

(1) Multiple choice, in which there are several alternatives from which to choose;
(2) Dichotomous, having only two alternatives available (with "don't know" or "no opinion" sometimes present as a third choice); and
(3) Open-ended, where the respondent is free to formulate his or her own answer and expand on the subject of the question.

In general, multiple choice and dichotomous questions can be difficult to formulate, but data entry and analysis are easily accomplished. The reverse tends to be true for open-ended questions, where a respondent may state or write several paragraphs in response to even a very short question. Because they give the respondent an opportunity to fully express his feelings or describe his behaviours, open-ended questions are especially useful in exploratory research. A minimal sketch of how these three question types might look at the data-entry stage is given below.
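The sketch below is only an illustration in Python of the three question types from a data-entry point of view; the question wording, options and structure are invented and are not prescribed by the lesson:

```python
# Hypothetical questionnaire items, one per question type.
questionnaire = [
    {"type": "multiple_choice",
     "text": "Which category best describes your income last year?",
     "options": ["Below 3 lakh", "3-6 lakh", "6-10 lakh", "Above 10 lakh"]},
    {"type": "dichotomous",
     "text": "Have you gone to a movie within the past month?",
     "options": ["Yes", "No", "Don't know"]},
    {"type": "open_ended",
     "text": "What do you like most about our service?",
     "options": None},          # free text: flexible, but harder to code and analyse
]

# Closed questions can be tabulated directly; open-ended answers need
# manual coding before any statistical treatment, as noted above.
for q in questionnaire:
    print(q["type"], "->", "easy to tabulate" if q["options"] else "needs coding")
```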
Proper wording of each question is important, but often difficult to achieve. A number of problems can arise, including the following:

(1) The vocabulary level may be inappropriate for the type of person being surveyed;
(2) The respondent may assume a frame of reference other than the one the researcher intended;
(3) The question may contain "leading" words or phrases that unduly influence the response; and
(4) The respondent may hesitate to answer a question that involves a sensitive topic.

Examples of each situation follow:

Inappropriate vocabulary level
Poor wording: "Have you patronised a commercial source of cinematic entertainment within the past month?"
Problem: The vocabulary level will be too high for many respondents.
Better wording: "Have you gone to a movie within the past month?"

Confusing frame of reference
Poor wording: "Are you in better shape than you were a year ago?"
Problem: To what does "shape" refer: physical, financial, or emotional?
Better wording (if the desired frame of reference is physical): "Are you in better physical condition than you were a year ago?"

"Leading" words/phrases
Poor wording: "To help maintain the quality of our schools, hospitals, and public services, do you agree that taxes should be increased next year?"
Problem: "Schools, hospitals, and public services" has an emotional impact. Also, "do you agree" suggests that you should agree.
Better wording: "Do you support an increase in taxes next year?"

Sensitive topic
Poor wording: "How much money did you make last year?"
Problem: This question requests detailed, personal information that respondents are hesitant to provide. The question is also very blunt.
Better wording: "Which of the following categories best describes your income last year?"
The preceding represent only a few of the pitfalls that can be encountered when designing a questionnaire. In general, it is a good idea to pre-test the questionnaire by personally administering it to a small number of persons who are similar to the eventual sample members.

2.3.2 Interview

Interviews are a key qualitative data collection method for social research. They are mainly useful in cases where there is a need to obtain highly personalised data. In research there are generally two types of interviews:

2.3.2.1 The Personal Interview

In the personal interview, an interviewer personally secures the respondent's cooperation and carries out what could be described as a "purposeful conversation" in which the respondent replies to the questions asked of her. The personal interview tends to be relatively expensive compared to the other approaches, but offers a lot of flexibility in allowing the interviewer to explain questions, to probe more deeply into the answers provided, and even to record measurements that are not actually asked of the respondent. For example, along with obtaining responses, the interviewer may record such things as the respondent's gender, approximate age, physical characteristics, and mode of dress.

2.3.2.2 The Telephone Interview

The telephone interview is similar to the personal interview, but uses the telephone instead of personal interaction. Telephone interviews are especially useful for obtaining information on what the respondent is doing at the time of the call (e.g., the television program, if any, being watched). Also, use of the telephone makes it possible to complete a study in a relatively short span of time.

2.3.3 Observation

Observation is the collection of primary data through observing people, their actions and the situations they are in. Observation may be the easiest research to do. Typically, observation is also the most cost-effective method. Observation can also give us data that people aren't usually willing to tell us themselves, such as their feelings, emotions, attitudes or the motives behind their buying decisions.
How does observation work? It is extremely simple. Take a restaurant franchise owner. He may be planning on opening another location. He may also have little or no money to pay for marketing research. However, a lot of the data he needs he can collect himself. He can get into his car and drive around town, observing the traffic patterns. He can see where his clientele goes to shop. He can see at what time the traffic appears. He can call real estate agents and ask them for lease prices for different properties. He can drive around and look for areas that don't have his type of restaurant, looking for areas of little competition. He can do all of this for just the cost of the gas in his car.

a) Structured (descriptive) and Unstructured (exploratory) observation: When an observation is characterised by a careful definition of the units to be observed, the style of the observer, the conditions for observation and the selection of pertinent data of observation, it is a structured observation. When these characteristics are not thought of in advance or are not present, it is an unstructured observation.

b) Participant, Non-participant and Disguised observation: When the observer observes by making himself, more or less, a member of the group he is observing, it is participant observation; but when the observer observes by detaching himself from the group under observation, it is non-participant observation. If the observer observes in such a manner that his presence is unknown to the people he is observing, it is disguised observation.

2.3.4 Other Methods of Collecting Primary Data

1. Survey Research

In survey research, we communicate with a sample of individuals in order to generalise on the characteristics of the population from which they were drawn.

Surveys are the most common method of collecting primary data and the best way to get the descriptive information that you need for your marketing intelligence. Simply put, surveys collect data by asking other people a series of questions about their personal knowledge, emotions, attitudes, preferences, and buying behaviours. Surveys can provide you a wealth of data. There is always a golden nugget, a piece of data that can give you the insight you need to figure out the direction of your next campaign.

In a mail survey, a mailed questionnaire is typically accompanied by a cover letter and a postage-paid return envelope for the respondent's convenience. A good cover letter will
be brief and readable, and will explain who is doing the study, why it is being done, and why it is important for the reader to fill out and return the enclosed questionnaire.

Survey research may lead to different kinds of errors. These errors may be described as sampling error, response error, and non-response error. Sampling error, discussed below, is a random error. It can also be described as non-directional or non-systematic, because measurements exhibiting random error are just as likely to be too high as they are to be too low. On the other hand, response and non-response errors are both of the directional, or systematic, type.

2. Experimentation

Primary data can also be collected via experimentation. Experimentation is the practice of gathering data by selecting matched groups of people, giving them different treatments or scenarios, controlling related factors in their environments, and checking for differences in their responses. Experimentation gives us what we call "causal" data. Causal data help us explain cause and effect relationships. Experimenting helps us try to answer "why" someone is doing something, and what influences their buying behaviour.

A common example of experimentation is price testing. To the buyer, price will be the final emotional factor that determines whether or not they will give us their hard-earned money. Depending on the product and market segment, price may be the most important factor. How do you know what price is the right price? You have to test it. Many companies will test certain prices when collecting primary data on a new menu item that is being developed. How do you think McDonalds knows how much to charge for a Big Mac? They tested how much they can charge for that Big Mac, looking for that magic number that will provide the most sales and the most profit.

2.4 METHODS OF COLLECTING SECONDARY DATA

Secondary data have been gathered by someone else for some other purpose. Secondary data are readily available from other sources, and the researcher can obtain data from sources both internal and external to the organisation. The internal sources of secondary data are:

• Sales reports
• Financial statements
• Customer details, like name, age, contact details, etc.
• Company information
• Reports and feedback from dealers, retailers, and distributors
• Management information systems

There are several external sources from which secondary data can be collected. These are:

• Government censuses, like the population census, agriculture census, etc.
• Information from other government departments, like social security, tax records, etc.
• Business journals
• Social books
• Business magazines
• Libraries
• The Internet, where wide knowledge about different areas is easily available.

Further, secondary data can be qualitative or quantitative. Qualitative data can be obtained through newspapers, diaries, interviews, transcripts, etc., while quantitative data can be obtained through surveys, financial statements and statistics.

Also, secondary data may be classified as either published or unpublished data. Published data are available in publications of government, technical and trade journals, reports of various businesses, banks, etc., public records, and statistical or historical documents. Unpublished data may be found in letters, diaries, unpublished biographies or works.

2.4.1 Characteristics of Secondary Data

1. Reliability of data: Secondary data generally have a pre-established degree of validity and reliability which need not be re-examined by the researcher who is re-using such data.

2. Suitability of data: The object, scope and nature of the original enquiry must be studied and then the data carefully scrutinised for suitability.

3. Adequacy: The data are considered inadequate if the level of accuracy achieved in the data is found insufficient, or if they relate to an area which may be either narrower or wider than the area of the present enquiry.

2.4.2 Advantages and Disadvantages of Using Secondary Data

Secondary data are available from other sources and may already have been used in previous research, making it easier to carry out further research.
Secondary data are time-saving and cost-efficient, since the data were collected by someone other than the researcher. Administrative data and census data may cover both larger and much smaller samples of the population in detail. Information collected by the government will also cover parts of the population that may be less likely to respond to a census.

A clear benefit of using secondary data is that much of the background work needed has already been carried out, such as literature reviews or case studies. The data may have been used in published texts and statistics elsewhere, and the data could already be promoted in the media or bring in useful personal contacts. Secondary data generally have a pre-established degree of validity and reliability which need not be re-examined by the researcher who is re-using such data.

Secondary data can provide a baseline against which the collected primary data results can be compared, and they can also be helpful in research design.

However, secondary data can present problems, too. The data may be out of date or inaccurate. If using data collected for different research purposes, they may not cover those samples of the population researchers want to examine, or not in sufficient detail. Administrative data, which are not originally collected for research, may not be available in the usual research formats or may be difficult to access.

2.5 TYPES OF MEASUREMENT SCALES

Normally, when one hears the term measurement, one may think in terms of measuring the length of something (i.e. the length of a piece of wood) or measuring a quantity of something (i.e. a cup of flour). This represents a limited use of the term measurement. In statistics, the term measurement is used more broadly and is more appropriately termed scales of measurement. Scales of measurement refer to the ways in which variables/numbers are defined and categorised. Each scale of measurement has certain properties which in turn determine the appropriateness of certain statistical analyses. The four scales of measurement are nominal, ordinal, interval, and ratio. A detailed explanation of each of these scales is given below.

2.5.1 Nominal Scale

Categorical data and numbers that are simply used as identifiers or names represent a nominal scale of measurement. Numbers on the back of a baseball jersey (St. Louis Cardinals 1 = Ozzie Smith) and your social security number are examples of nominal data.
If we conduct a study including gender as a variable, we might code Female as 1 and Male as 2, or vice versa, when we enter the data into the computer. Thus, we are using the numbers 1 and 2 only to represent categories of data. Statistical techniques that are appropriate for analysing nominal data are limited. However, some of the more widely used statistics, such as the chi-square statistic, can be applied to nominal data, often producing useful information.

2.5.2 Ordinal Scale

An ordinal scale of measurement represents an ordered series of relationships or rank order. Individuals competing in a contest may be fortunate to achieve first, second, or third place. First, second, and third place represent ordinal data. If Roscoe takes first and Wilbur takes second, we do not know if the competition was close; we only know that Roscoe outperformed Wilbur. Likert-type scales (such as "On a scale of 1 to 10, with one being no pain and ten being high pain, how much pain are you in today?") also represent ordinal data. Fundamentally, these scales do not represent a measurable quantity. An individual may respond 8 to this question and be in less pain than someone else who responded 5. A person may not be in half as much pain if they responded 4 than if they responded 8. All we know from these data is that an individual who responds 6 is in less pain than if they responded 8 and in more pain than if they responded 4. Therefore, Likert-type scales only represent a rank ordering. With ordinal data, the distances or spacings represented by consecutive numbers are not always equal.

Mutual funds as investments are sometimes rated in terms of risk by using measures of default risk, currency risk, and interest rate risk. These three measures are applied to investments by rating them as having high, medium, and low risk. Suppose high risk is assigned a 3, medium risk a 2, and low risk a 1. If a fund is awarded a 3 rather than a 2, it carries more risk, and so on. However, the differences in risk between categories 1, 2, and 3 are not necessarily equal. Thus, these measurements of risk are only ordinal-level measurements.

Another example of the use of ordinal numbers in business is the ranking of the top 50 most admired companies in Fortune magazine. The numbers ranking the companies are only ordinal in measurement. Certain statistical techniques are specifically suited to ordinal data, but many other techniques are not appropriate for use on ordinal data. For example, it does not make sense to say that the average of "moderately helpful" and "very helpful" is "moderately helpful and a half." Because nominal and ordinal data are often derived from imprecise measurements such as demographic questions, the categorisation of people
or objects, or the ranking of items, nominal and ordinal data are non-metric data and are sometimes referred to as qualitative data.

2.5.3 Interval Scale

A scale which represents quantity and has equal units, but for which zero represents simply an additional point of measurement, is an interval scale. Interval-level data measurement is the next-to-highest level of data, in which the distances between consecutive numbers have meaning and the data are always numerical. The Fahrenheit scale is a clear example of the interval scale of measurement. Thus, 60 degrees Fahrenheit or -10 degrees Fahrenheit are interval data. Measurement of sea level is another example of an interval scale. With each of these scales there is a direct, measurable quantity with equality of units. In addition, zero does not represent the absolute lowest value. Rather, it is a point on the scale with numbers both above and below it (for example, -10 degrees Fahrenheit). With interval-level data, the zero point is a matter of convention or convenience and not a natural or fixed zero point. Zero is just another point on the scale and does not mean the absence of the phenomenon. For example, zero degrees Fahrenheit is not the lowest possible temperature. Some other examples of interval-level data are the percentage change in employment, the percentage return on a stock, and the dollar change in a stock price.

2.5.4 Ratio Scale

Ratio-level data measurement is the highest level of data measurement. Ratio data have the same properties as interval data, but ratio data have an absolute zero, and the ratio of two numbers is meaningful. The notion of absolute zero means that zero is fixed, and the zero value in the data represents the absence of the characteristic being studied. The value of zero cannot be arbitrarily assigned because it represents a fixed point. This definition enables the statistician to create ratios with the data.

Examples of ratio data are height, weight, time, volume, and Kelvin temperature. With ratio data, a researcher can state that 180 pounds of weight is twice as much as 90 pounds or, in other words, form the ratio 180:90. Many of the data gathered by machines in industry are ratio data.

Other examples in the business world that are ratio-level in measurement are production cycle time, work measurement time, passenger miles, number of trucks sold, complaints
per 10,000 fliers, and number of employees. With ratio-level data, no b factor is required in converting units from one measurement to another, that is, y = ax. As an example, in converting height from yards to feet: feet = 3 × yards. Because interval- and ratio-level data are usually gathered by precise instruments often used in production and engineering processes, in national standardised testing, or in standardised accounting procedures, they are called metric data and are sometimes referred to as quantitative data. The following table briefly summarises the purpose and nature of each scale:

Scale      Nature of the numbers                                Examples
Nominal    Labels or identifiers only; categories without       Jersey numbers, gender codes,
           any order                                            social security numbers
Ordinal    Rank order only; distances between ranks             Contest placings, Likert-type
           are unequal or unknown                               ratings, risk ratings
Interval   Equal units; zero is arbitrary, so ratios are        Fahrenheit temperature, percentage
           not meaningful                                       change in employment
Ratio      Equal units with an absolute zero; ratios            Height, weight, time, volume,
           are meaningful                                       Kelvin temperature
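To make the practical consequences of the four scales concrete, here is a minimal sketch in Python (the values are invented for illustration; the lesson itself is software-neutral):

```python
# Nominal: numbers are only labels; counting categories is meaningful,
# averaging the codes is not.
gender_codes = [1, 2, 2, 1, 1]                       # 1 = Female, 2 = Male
counts = {code: gender_codes.count(code) for code in set(gender_codes)}

# Ordinal: order is meaningful, distances are not.
risk = {"low": 1, "medium": 2, "high": 3}
assert risk["high"] > risk["medium"]                 # ranking is valid
# but (3 - 2) and (2 - 1) do not represent equal amounts of risk.

# Interval: differences are meaningful, ratios are not (arbitrary zero).
temps_f = [60, 30]
difference = temps_f[0] - temps_f[1]                 # a 30 degree F difference is meaningful
# but saying 60 F is "twice as hot" as 30 F is not.

# Ratio: absolute zero, so ratios are meaningful.
weights_lb = [180, 90]
ratio = weights_lb[0] / weights_lb[1]                # 2.0: 180 lb is twice 90 lb

print(counts, difference, ratio)
```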
2.6 SUMMARY

As can be seen from the above discussion, primary data are original and unique data which are directly collected by the researcher from a source according to his requirements, as opposed to secondary data, which are easily accessible but are not pure, as they have already undergone many statistical treatments. Also, in order to measure data, different scales are used by the researcher according to his or her purpose of study.

2.7 GLOSSARY

Discrete variable: A variable that assumes only some selected values in a range.

Continuous variable: A variable that assumes any value within a range.

Questionnaire: A set of printed or written questions with a choice of answers, devised for the purposes of a survey or statistical study.

Interview: An interview is a formal meeting at which someone is asked questions in order to find out if they are suitable for a job or a course of study.

Observation: Observation is the active acquisition of information from a primary source. In living beings, observation employs the senses. In science, observation can also involve the recording of data via the use of scientific instruments. The term may also refer to any data collected during the scientific activity.

Ordinal Scale: An ordinal scale is a scale (of measurement) that uses labels to classify cases (measurements) into ordered classes.

Nominal Scale: A nominal scale is a measurement scale in which numbers serve as "tags" or "labels" only, to identify or classify an object. A nominal scale measurement normally deals only with non-numeric (qualitative) variables or with numbers that have no quantitative value.

Interval Scale: The interval scale is defined as a quantitative measurement scale where the difference between two values is meaningful. The interval scale is the third level of measurement. This means that the difference between two values on the scale is an actual and equal distance.
Page 33 :
Ratio Scale: Ratio scale is a type of variable measurement scale which is quantitative, in nature. Ratio scale allows any researcher to compare the intervals or, differences. Ratio scale is the 4th level of measurement and possesses a zero point or, character of origin. This is a unique feature of ratio scale., 2.8 SELF ASSESSMENT QUESTIONS, i., , The science of collecting, organizing, presenting, analysing and interpreting data, to assist in making more effective decisions is called: (Tick the correct answer):-, , a) Statistic, b) Parameter, c) Population, d) Statistics, ii. When the characteristic being studied is nonnumeric, it is called a:, a) Quantitative variable, b) Qualitative variable, c) Discrete variable, d) Continuous variable, iii. When the variable studied can be reported numerically, the variable is called a:, a) Quantitative variable, b) Qualitative variable, c) Independent variable, d) Dependent variable, iv. Listings of the data in the form in which these are collected are known as:, a) Secondary data, b) Raw data, c) Arrayed data, d) Qualitative data, v. Data that are collected by anybody for some specific purpose and use are called:, a) Qualitative data, 33
Page 34 :
b) Primary data
c) Secondary data
d) Continuous data
vi. Data which have undergone any treatment previously are called:
a) Primary data
b) Secondary data
c) Symmetric data
d) Skewed data
2.9 LESSON END EXERCISE
1. What are measurement scales in research?
____________________________________________________________
____________________________________________________________
____________________________________________________________
2. What are the three methods of data collection?
____________________________________________________________
____________________________________________________________
____________________________________________________________
3. What are the advantages of using primary data?
____________________________________________________________
____________________________________________________________
____________________________________________________________
4. Differentiate between Questionnaire and Schedule.
____________________________________________________________
34
Page 35 :
____________________________________________________________, ____________________________________________________________, 5. What is interval and ratio data?, ____________________________________________________________, ____________________________________________________________, ____________________________________________________________, 6. Explain different sources of collecting secondary data., ____________________________________________________________, ____________________________________________________________, ____________________________________________________________, , 2.10, , SUGGESTED READING, , Gupta, S.P.: Statistical Methods, Sultan Chand & Sons, New Delhi., Gupta, S.C. and V.K. Kapoor : Fundamentals of Applied Statistics., Anderson, Sweeney and Williams: Statistics for Business and Economics- Thompson,, New Delhi., Levin, Richard and David S Rubin: Statistics for Management, Prentice Hall, Delhi., Levin and Brevson: Business Statistics, Pearson Education, New Delhi., Hooda, R.P.: Statistics for Business and Economics, Macmillan, New Delhi., , 35
Page 36 :
M.Com. I, , Course No. M.Com-115, , Unit- I, , Lesson No. 3, SAMPLING, , STRUCTURE, 3.1 Introduction, 3.2 Objectives, 3.3 Census, 3.4 Concept of Sampling, 3.4.1 Essentials of Sampling, 3.4.2 Probability Sampling, 3.4.3 Non-Probability Sampling, 3.5 Non Probability Sampling Methods, 3.5.1 Convenience Sampling, 3.5.2 Judgement Sampling, 3.5.3 Quota Sampling, 3.5.4 Snowball Sampling, 3.6 Summary, 3.7 Glossary, 3.8, , Self Assessment Questions, 36
Page 37 :
3.9 Lesson End Exercise
3.10 Suggested Reading
3.1 INTRODUCTION
The way in which we select a sample of individuals to be research participants is critical. How we select participants will determine the population to which we may generalise our research findings. The procedure that we use for assigning participants to different treatment conditions will determine whether bias exists in our treatment groups. We address the concept of sampling in this lesson. Further, the current lesson explores the process of sampling as well as the different methods used for selecting non-probability samples. Why do researchers often take a sample rather than conduct a census? What are the differences between random and non-random sampling? This lesson addresses these questions about sampling. Sampling is widely used in business as a means of gathering useful information about a population. Data are gathered from samples and conclusions are drawn about the population as part of the inferential statistics process.
3.2 OBJECTIVES
After studying this lesson, you will be able to:
• define the concepts of census and sampling;
• understand the difference between probability and non-probability sampling;
• contrast sampling with a census and differentiate among different methods of non-probability sampling, which include convenience, judgment, quota and snowball sampling;
• know the essentials of sampling.
3.3 CENSUS
Sometimes it is preferable to conduct a census of the entire population rather than taking a sample. There are at least two reasons why a business researcher may opt to take a census rather than a sample, provided there is adequate time and money available to conduct such a census:
37
Page 38 :
1) To eliminate the possibility that by chance a randomly selected sample may not be, representative of the population., 2) For the safety of the consumer. Even when proper sampling techniques are implemented, in a study, there is the possibility that a sample could be selected by chance that does not, represent the population. For example, if the population of interest is all truck owners in, the state of Colorado, a random sample of truck owners could yield mostly ranchers, when, in fact, many of the truck owners in Colorado are urban dwellers. If the researcher, or study sponsor cannot tolerate such a possibility, then taking a census may be the only, option. In addition, sometimes a census is taken to protect the safety of the consumer. For, example, there are some products, such as airplanes or heart defibrillators, in which the, performance of such is so critical to the consumer that 100% of the products are tested,, and sampling is not a reasonable option., Every research study has a target population that consists of the individuals, institutions, or, entities that are the object of investigation. The sample is taken from a population list, map,, directory, or other source used to represent the population. This list, map, or directory is, called the frame, which can be school lists, trade association lists, or even lists sold by list, brokers., Ideally, a one-to-one correspondence exists between the frame units and the population, units. In reality, the frame and the target population are often different. For example, suppose, the target population is all families living in Detroit. A feasible frame would be the residential, pages of the Detroit telephone books. How would the frame differ from the target, population? Some families have no telephone. Other families have unlisted numbers. Still, other families might have moved and/or changed numbers since the directory was printed., Some families even have multiple listings under different names., Frames that have over registration contain the target population units plus some additional, units. Frames that have under registration contain fewer units than does the target population., Sampling is done from the frame, not the target population. In theory, the target population, and the frame are the same. In reality, a business researcher’s goal is to minimise the, differences between the frame and the target population., , 38
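To illustrate the gap between a frame and a target population, the following short Python sketch (purely illustrative; the names are hypothetical) treats both as sets and reports the under-registration and over-registration described above.

# Hypothetical target population of families versus a telephone-directory frame.
target_population = {"Ahmed", "Brown", "Chen", "Dube", "Evans"}
frame = {"Brown", "Chen", "Evans", "Foster", "Gupta"}

under_registration = target_population - frame    # families missing from the frame
over_registration = frame - target_population     # listings that are not in the target population

print("Under-registration:", under_registration)  # {'Ahmed', 'Dube'}
print("Over-registration:", over_registration)    # {'Foster', 'Gupta'}
# Sampling is done from the frame, so these mismatches are a source of coverage error.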
Page 39 :
3.4 CONCEPT OF SAMPLING
The U.S. Bureau of the Census first used sampling in 1940; prior to that, recorded instances are relatively few in number. After 1920, sampling began to develop systematically, and much of the growth was in the agricultural field rather than in social research. In recent years, sampling has become an essential part of research procedure, and every researcher is required to be familiar with its logic and some of its important techniques. Sampling is not confined to science alone. In a way, we also practise crude versions of sampling in our day-to-day lives. Housewives, for example, press a few grains of boiled rice to be able to declare that it is ready to be served. Understandably, it is not feasible to examine each and every grain in the cooking pot; thus, instead of studying each and every unit, in the sampling method a small portion is selected which represents the whole population.
According to P.V. Young, "A statistical sample is a miniature picture or cross section of the entire group or aggregate from which the sample is taken." The entire group from which a sample is chosen is known as the 'population', 'universe' or 'supply'.
According to Goode and Hatt, "A sample, as the name implies, is a smaller representation of a larger whole."
The idea of sampling is quite old, though the theory of sampling has developed in recent years. Very often a handful of grains of boiling rice is examined to ascertain whether it is cooked or not, and a doctor examines a few drops of blood to ascertain the blood type. Likewise, this technique is employed in many other fields. The main objective of the sampling technique is to draw conclusions about the whole by examining only a part of it.
3.4.1 Essentials of Sampling
The choice of a sample as representative of the whole group is based upon the following assumptions:
1) Underlying homogeneity amidst complexity: Although things, especially phenomena, appear to be very complex in nature, so that no two things appear alike, a keener study has disclosed that beneath this apparent diversity there is an underlying fundamental unity. Apparently every student may appear to be different. There are differences of health,
39
Page 40 :
body, habits, personality etc. But fundamentally they are similar in many respects, so that a study of some of them will throw significant light upon the whole group. It is the possibility of such ideal types in the whole population that makes sampling possible. If no two students were alike in any respect, sampling would have been impossible.
2) Possibility of Representative Selection: The second assumption is that it is possible to draw a representative sample. It has been proved that if a certain number of units are selected from a mass on a purely random basis, every unit will have a chance of being included and the sample so selected will contain all types of units, so that it may be representative of the whole group. This principle is popularly known as the law of statistical regularity and is the very basis of all sampling enquiries.
3) Absolute Accuracy not Essential: The third basic factor is that absolute accuracy is not essential in the case of mass studies. In large-scale studies we have to depend upon averages, which are considered fairly significant in any type of enquiry; the results of sampling studies, although not hundred per cent accurate, are nevertheless sufficiently accurate to permit valid generalisations.
4) Independence: All units of a sample must be independent of each other. In other words, inclusion of one item in the sample should not be dependent upon the inclusion of other items of the universe.
5) Adequacy: The number of items in the sample should be fairly adequate so that some reliable conclusion can be drawn.
Advantages of Sampling
Sampling is a process of selecting samples from a group or population to become the foundation for estimating and predicting the outcomes of the population as well as to detect unknown pieces of information. A sample is a sub-unit of the population. There are a few advantages associated with the sampling process.
1. Low cost of sampling
If data were to be collected for the entire population, the cost would be quite high. A sample is a small proportion of a population, so the cost will be lower if data are collected for a sample of the population, which is a big advantage.
40
Page 41 :
2., , Less time consuming in sampling, , Use of sampling takes less time also. It consumes less time than census technique. Tabulation,, analysis etc., take much less time in the case of a sample than in the case of a population., 3., , Scope of sampling is high, , The investigator is concerned with the generalization of data. To study a whole population, in order to arrive at generalisations would be impractical. Some populations are so large, that their characteristics could not be measured. Before the measurement has been, completed, the population would have changed. But the process of sampling makes it, possible to arrive at generalisations by studying the variables within a relatively small, proportion of the population., 4., , Accuracy of data is high, , Having drawn a sample and computed the desired descriptive statistics, it is possible to, determine the stability of the obtained sample value. A sample represents the population, from which it is drawn. It permits a high degree of accuracy due to a limited area of, operations. Moreover, careful execution of field work is possible. Ultimately, the results of, sampling studies turn out to be sufficiently accurate., 5., , Organisation of convenience, , Organisational problems involved in sampling are very few. Since sample is of a small size,, vast facilities are not required. Sampling is therefore economical in respect of resources., Study of samples involves less space and equipment., 6., , Intensive and exhaustive data, , In sample studies, measurements or observations are made of a limited number. So, intensive, and exhaustive data are collected., 7., , Suitable in limited resources, , The resources available within an organisation may be limited. Studying the entire universe, is not viable. The population can be satisfactorily covered through sampling. Where limited, resources exist, use of sampling is an appropriate strategy while conducting marketing, research., , 41
Page 42 :
8. Better rapport
An effective research study requires a good rapport between the researcher and the respondents. When the population of the study is large, the problem of rapport arises. But manageable samples permit the researcher to establish adequate rapport with the respondents.
Disadvantages of Sampling
The reliability of the sample depends upon the appropriateness of the sampling method used. The purpose of sampling theory is to make sampling more efficient. But the real difficulties lie in the selection, estimation and administration of samples.
1. Chances of bias
The serious limitation of the sampling method is that it can involve biased selection and thereby lead us to draw erroneous conclusions. Bias arises when the method of selecting the sample is faulty. Relatively small samples properly selected may be much more reliable than large samples poorly selected.
2. Difficulties in selecting a truly representative sample
Sampling produces reliable and accurate results only when the sample is representative of the whole group. Selection of a truly representative sample is difficult when the phenomena under study are of a complex nature; selecting good samples is difficult.
3. Inadequate knowledge of the subject
Use of the sampling method requires adequate subject-specific knowledge of sampling technique. Sampling involves statistical analysis and calculation of probable error. When the researcher lacks specialised knowledge of sampling, he may commit serious mistakes. Consequently, the results of the study will be misleading.
4. Changeability of units
When the units of the population are not homogeneous, the sampling technique will be unscientific. In sampling, though the number of cases is small, it is not always easy to stick to the selected cases. The units of the sample may be widely dispersed. Some of the cases in the sample may not cooperate with the researcher and some others may be inaccessible. Because of these problems, all the cases may not be taken up. The selected cases may
42
Page 43 :
have to be replaced by other cases. Changeability of units stands in the way of results of, the study., 5., , Impossibility of sampling, , Deriving a representative sample is difficult, when the universe is too small or too, heterogeneous. In this case, census study is the only alternative. Moreover, in studies, requiring a very high standard of accuracy, the sampling method may be unsuitable. There, will be chances of errors even if samples are drawn most carefully., 3.4.2 Probability Sampling, In random sampling every unit of the population has the same probability of being selected, into the sample. Random sampling implies that chance enters into the process of selection., For example, most Americans would like to believe that winners of nationwide magazine, sweepstakes or numbers selected as state lottery winners are selected by some random, draw of numbers., 3.4.3 Non-Probability Sampling, In non random sampling not every unit of the population has the same probability of being, selected into the sample. Members of non random samples are not selected by chance., For example, they might be selected because they are at the right place at the right time or, because they know the people conducting the research., Sometimes random sampling is called probability sampling and non random sampling is, called non probability sampling. Because every unit of the population is not equally likely, to be selected, assigning a probability of occurrence in non random sampling is impossible., Sampling studies are becoming more and more popular in all types of mass studies, but, they are especially important in case of social surveys. The vastness of the population, the, difficulties of contacting people, high refusal rate, difficulties of ascertaining the universe, make sampling the best alternative in case of social studies. Recent developments in sampling, techniques have made this method more reliable and valid. The result of sampling has, attained a sufficiently high standard of accuracy. In social research a close study of the, people has to be made generally taking a sufficiently long period in studying each unit., Under such circumstances sampling is most suitable to be resorted to., , 43
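The contrast between the two approaches can be shown in a few lines of Python. The sketch below is only an illustration with hypothetical data: the first draw gives every unit the same chance of selection, while the second simply takes whichever units are easiest to reach.

import random

population = list(range(1, 101))       # 100 hypothetical population units, numbered 1 to 100
random.seed(42)                        # seeded only so the illustration is reproducible

# Probability (random) sampling: every unit has an equal chance of selection.
probability_sample = random.sample(population, 10)

# Non-probability (convenience) sampling: chance plays no role; take the first ten at hand.
convenience_sample = population[:10]

print("Probability sample:", sorted(probability_sample))
print("Convenience sample:", convenience_sample)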
Page 44 :
3.5 NON PROBABILITY SAMPLING METHODS, Sampling techniques used to select elements from the population by any mechanism that, does not involve a random selection process are called non random sampling techniques., Because chance is not used to select items from the samples, these techniques are nonprobability techniques and are not desirable for use in gathering data to be analysed by the, methods of inferential statistics. Sampling error cannot be determined objectively for these, sampling techniques. Four non-random sampling techniques are presented here: convenience, sampling, judgment sampling, quota sampling, and snowball sampling., 3.5.1 Convenience Sampling, In convenience sampling, elements for the sample are selected for the convenience of the, researcher. The researcher typically chooses elements that are readily available, nearby,, or willing to participate. The sample tends to be less variable than the population because, in many environments the extreme elements of the population are not readily available., The researcher will select more elements from the middle of the population. For example,, a convenience sample of homes for door-to-door interviews might include houses where, people are at home, houses with no dogs, houses near the street, first-floor apartments,, and houses with friendly people. In contrast, a random sample would require the researcher, to gather data only from houses and apartments that have been selected randomly, no, matter how inconvenient or unfriendly the location. If a research firm is located in a mall, a, convenience sample might be selected by interviewing only shoppers who pass the shop, and look friendly., 3.5.2 Judgement Sampling, Judgment sampling occurs when elements selected for the sample are chosen by the, judgment of the researcher. Researchers often believe they can obtain a representative, sample by using sound judgment, which will result in saving time and money. Sometimes, ethical, professional researchers might believe they can select a more representative sample, than the random process will provide. They might be right! However, some studies show, that random sampling methods outperform judgment sampling in estimating the population, mean even when the researcher who is administering the judgment sampling is trying to put, together a representative sample. When sampling is done by judgment, calculating the, probability that an element is going to be selected into the sample is not possible. The, sampling error cannot be determined objectively because probabilities are based on non, random selection., 44
Page 45 :
Other problems are associated with judgment sampling. The researcher tends to make, errors of judgment in one direction. These systematic errors lead to what are called biases., The researcher also is unlikely to include extreme elements. Judgment sampling provides, no objective method for determining whether one person’s judgment is better than another’s., 3.5.3 Quota Sampling, A third non random sampling technique is quota sampling, which appears to be similar to, stratified random sampling. Certain population subclasses, such as age group, gender, or, geographic region, are used as strata. However, instead of randomly sampling from each, stratum, the researcher uses a non random sampling method to gather data from one, stratum until the desired quota of samples is filled. Quotas are described by quota controls,, which set the sizes of the samples to be obtained from the subgroups. Generally, a quota, is based on the proportions of the subclasses in the population. In this case, the quota, concept is similar to that of proportional stratified sampling., Quotas often are filled by using available, recent, or applicable elements. For example,, instead of randomly interviewing people to obtain a quota of Italian Americans, the, researcher would go to the Italian area of the city and interview there until enough responses, are obtained to fill the quota. In quota sampling, an interviewer would begin by asking a, few filter questions; if the respondent represents a subclass whose quota has been filled,, the interviewer would terminate the interview., Quota sampling can be useful if no frame is available for the population. For example,, suppose a researcher wants to stratify the population into owners of different types of cars, but fails to find any lists of Toyota van owners. Through quota sampling, the researcher, would proceed by interviewing all car owners and casting out non-Toyota van owners, until the quota of Toyota van owners is filled., Quota sampling is less expensive than most random sampling techniques because it, essentially is a technique of convenience. However, cost may not be meaningful because, the quality of non random and random sampling techniques cannot be compared. Another, advantage of quota sampling is the speed of data gathering. The researcher does not have, to call back or send out a second questionnaire if he does not receive a response; he just, moves on to the next element. Also, preparatory work for quota sampling is minimal., The main problem with quota sampling is that, when all is said and done, it still is only a non, random sampling technique. Some researchers believe that if the quota is filled by randomly, 45
Page 46 :
selecting elements and discarding those not from a stratum, quota sampling is essentially a version of stratified random sampling. However, most quota sampling is carried out by the researcher going where the quota can be filled quickly. The object is to gain the benefits of stratification without the high field costs of stratification. Ultimately, it remains a non-probability sampling method.
3.5.4 Snowball Sampling
Another non-random sampling technique is snowball sampling, in which survey subjects are selected based on referral from other survey respondents. The researcher identifies a person who fits the profile of subjects wanted for the study. The researcher then asks this person for the names and locations of others who would also fit the profile of subjects wanted for the study. Through these referrals, survey subjects can be identified cheaply and efficiently, which is particularly useful when survey subjects are difficult to locate. This is the main advantage of snowball sampling; its main disadvantage is that it is non-random.
3.6 SUMMARY
In this lesson we studied the concepts of census and sampling. We also studied in detail the various techniques of sampling used in statistics for collecting information from a population. Sampling is the process whereby some elements (individuals) in the population are selected for a research study. The population consists of all individuals with a particular characteristic that is of interest to the researchers. If data are obtained from all members of the population, then we have a census; if data are obtained from some members of the population, then we have a sample. With probability sampling, a researcher can specify the probability of an element's (participant's) being included in the sample. With non-probability sampling, there is no way of estimating the probability of an element's being included in a sample. Convenience sampling is quick and inexpensive because it involves selecting individuals who are readily available at the time of the study (such as introductory psychology students). The disadvantage is that convenience samples are generally less representative than random samples; therefore, results should be interpreted with caution.
3.7 GLOSSARY
• Population: In statistics, a population is a set of similar items or events which is of interest for some question or experiment.
• Sample: In statistics and quantitative research methodology, a data sample is a set of data collected and/or selected from a statistical population by a defined procedure.
46
Page 47 :
• Parameter: A parameter in statistics is an important component of any statistical analysis. In simple words, a parameter is any numerical quantity that characterises a given population or some aspect of it. This means the parameter tells us something about the whole population.
• Statistic: A statistic is a characteristic of a sample. Generally, a statistic is used to estimate the value of a population parameter.
• Probability Sampling: A probability sampling method is any method of sampling that utilises some form of random selection.
• Non-probability Sampling: Non-probability sampling is a sampling technique where the samples are gathered in a process that does not give all the individuals in the population equal chances of being selected.
3.8 SELF ASSESSMENT QUESTIONS
(i) Fill in the blanks:
1. A sample is a study of ............................ of the population.
2. A population is the ............................. of units under study.
3. Random sampling is also referred to as ...................... sampling.
4. Any numerical value calculated from sample data is called a ......................
5. The standard deviation of the sampling distribution of any statistic is called the ......................
(ii) Indicate whether the following statements are True or False (T/F):
1. A sample is less expensive than a census. (T/F)
2. The results obtained in a census study are always more reliable than those obtained in a sample study. (T/F)
3. Judgement sampling is a type of probability sampling method. (T/F)
3.9 LESSON END EXERCISE
1. What type of sampling is best for qualitative research?
____________________________________________________________
____________________________________________________________
47
Page 49 :
3.10 SUGGESTED READING, •, , Levin, Richard and David S Rubin: Statistics for Management, Prentice Hall,, Delhi., , •, , Levin and Brevson: Business Statistics, Pearson Education, New Delhi., , •, , Hooda, R.P.: Statistics for Business and Economics, Macmillan, New Delhi., , •, , Hien, L.W: Quantitative Approach to Managerial Decisions, Prentice Hall,, New Jersy, India, Delhi., , •, , Lawrence B. Morse: Statistics for Business and Economics, Harper Collins., , 49
Page 50 :
M.Com. I, , Course No. M.Com-115, , Unit- I, , Lesson No. 4, PROBABILITY SAMPLING METHODS, , STRUCTURE, 4.1 Introduction, 4.2 Objectives, 4.3 Probability Sampling Methods, 4.3.1 Simple Random Sampling, 4.3.1.1 Merits, 4.3.1.2 Demerits, 4.3.2 Systematic Sampling, 4.3.2.1 Merits, 4.3.2.2 Demerits, 4.3.3 Stratified Sampling, 4.3.3.1 Merits, 4.3.3.2 Demerits, 4.3.4 Cluster Sampling, 4.3.4.1 Merits, 4.3.4.2 Demerits, 4.8 Summary, 4.9 Glossary, 4.10 Self Assessment Questions, 4.11 Lesson End Exercise, 4.12 Suggested Reading, , 50
Page 51 :
4.1 INTRODUCTION
In practical problems the statistician is often faced with the necessity of discussing a population of which he cannot examine every member. For example, an inquirer into the heights of the population of a city cannot afford the time or expense required to measure the height of every individual; nor can a producer who wants to know what proportion of his product is defective examine every single product. In such cases an investigator can examine a limited number of individuals of the population and hope that they will tell him, with reasonable trustworthiness, as much as he wants to know about the population from which they come. We are thus led to the question: what can be said about the population when we can examine only a limited number of its members? This specific question is the origin of the theory of sampling. There are two main methods used in survey research for selecting a sample, viz., probability sampling and non-probability sampling. The big difference is that in probability sampling every person has a known chance of being selected, and the results are more likely to accurately reflect the entire population. We have already discussed non-probability sampling techniques in lesson 3. Therefore, the focus of this lesson is on probability sampling techniques.
In sampling, there are a few terms that a researcher should be familiar with. For example, let us say you are working on a research project on a computing implementation for elderly and disabled citizens in a smart home system, and you are supposed to find out the average age of the senior and disabled citizens involved in your study.
(a) The community of families living in the town with smart homes forms the population or study population, usually denoted by the letter N.
(b) The group of senior citizens and disabled people in the vicinity of the smart home community from whom information is collected is called the sample.
(c) The number of senior citizens and disabled people from whom you obtain information to find their average age is called the sample size and is usually denoted by the letter n.
(d) The way you select the senior citizens and disabled people is called the sampling design or strategy.
(e) Each citizen or disabled person that becomes the basis for selecting your sample is called the sampling unit or sampling element.
51
Page 52 :
(f) A list identifying each respondent in the study population is called the sampling frame. In cases where all elements in a sampling population cannot be individually identified, you cannot have a sampling frame for the study population.
(g) Finally, the findings obtained from the information given by the respondents are called sample statistics.
4.2 OBJECTIVES
After going through this lesson, you will be able to understand:
• various terms associated with sampling;
• various methods of probability sampling;
• when to use the different methods of probability sampling;
• why sampling is so common in business decisions.
4.3 PROBABILITY SAMPLING METHODS
The various sampling techniques can be grouped into two categories, i.e., probability sampling (also known as random sampling) and non-probability sampling (non-random sampling). Non-probability sampling techniques were already discussed in lesson 3. Therefore, here our focus is on probability sampling techniques. Probability sampling methods are those in which every item in the universe has a known chance, or probability, of being selected in the sample. This implies that the selection of sample items is independent of the person making the sample; that is, the sampling operation is controlled so objectively that the items will be chosen strictly at random. It may be noted that the term random sample is not used to describe the data in the sample but the process employed to select the sample. Randomness is thus a property of the sampling procedure rather than of an individual sample. As such, randomness can enter the sampling process in a number of ways, and hence random samples may be of many kinds.
52
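Because randomness is a property of the procedure, the claim that every item has a known chance of selection can be checked empirically. The Python sketch below (illustrative only) repeats a simple random draw many times and estimates each unit's inclusion probability; with n = 20 drawn from N = 100, every unit should be included roughly 20% of the time.

import random
from collections import Counter

N, n, trials = 100, 20, 10_000
population = list(range(N))
counts = Counter()

random.seed(0)
for _ in range(trials):
    counts.update(random.sample(population, n))    # one simple random sample per trial

# Each unit's estimated inclusion probability should be close to n / N = 0.20.
estimated = [counts[unit] / trials for unit in population]
print(round(min(estimated), 3), round(max(estimated), 3))   # both values should be near 0.20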
Page 53 :
4.3.1 SIMPLE RANDOM SAMPLING, Researchers use two major sampling techniques: probability sampling and non-probability, sampling. With probability sampling, a researcher can specify the probability of an element’s, (participant’s) being included in the sample. With non-probability sampling, there is no, way of estimating the probability of an element’s being included in a sample. If the, researcher’s interest is in generalising the findings derived from the sample to the general, population, then probability sampling is far more useful and precise. Unfortunately, it is, also much more difficult and expensive than non-probability sampling. Probability sampling, is also referred to as random sampling or representative sampling. The word random, describes the procedure used to select elements (participants, cars, test items) from a, population., When random sampling is used, each element in the population has an equal chance of, being selected (simple random sampling). The sample is referred to as representative, because the characteristics of a properly drawn sample represent the parent population in, all ways., Step 1. Defining the Population, Before a sample is taken, we must first define the population to which we want to generalise, our results. The population of interest may differ for each study. It could be the population, of professional football players in the United States or the registered voters in Bowling, Green, Ohio. It could also be all college students at a given University, or all sophomores, at that institution. It could be female students, or introductory psychology students, or 10year-old children in a particular school, or members of the local senior citizens center. The, point should be clear; the sample should be drawn from the population to which you want, to generalise the population in which you are interested. It is unfortunate that many, researchers fail to make explicit their population of interest. Many investigators use only, college students in their samples, yet their interest is in the adult population of the United, States. To a large extent, the generalisability of sample data depends on what is being, studied and the inferences that are being made. For example, imagine a study that sampled, college juniors at a specific University. Findings showed that a specific chemical compound, produced pupil dilation. We would not have serious misgivings about generalising this, finding to all college students, even tentatively to all adults, or perhaps even to some nonhuman organisms. The reason for this is that physiological systems are quite similar from, 53
Page 54 :
one person to another, and often from one species to another. However, if we find that, controlled exposure to unfamiliar political philosophies led to radicalisation of the, experimental participants, we would be far more reluctant to extend this conclusion to the, general population., Step 2. Constructing a List, Before a sample can be chosen randomly, it is necessary to have a complete list of the, population. In some cases, the logistics and expense of constructing a list of the entire, population is simply too great, and an alternative procedure is forced upon the investigator., We could avoid this problem by restricting our population of interest-by defining it narrowly., However, doing so might increase the difficulty of finding or constructing a list from which, to make our random selection., For example, you would have no difficulty identifying female students at any given University, and then constructing a list of their names from which to draw a random sample. It would, be more difficult to identify female students coming from a three-child family, and even, more difficult if you narrowed your interest to firstborn females in a three-child family., Moreover, defining a population narrowly also means generalising results narrowly., Caution must be exercised in compiling a list or in using one already constructed. The, population list from which you intend to sample, must be both recent and exhaustive. If, not, problems can occur. By an exhaustive list, we mean that all members of the population, must appear on the list. Voter registration lists, telephone directories, homeowner lists,, and school directories are sometimes used, but these lists may have limitations. They must, be up to date and complete if the samples chosen from them are to be truly representative, of the population. In addition, such lists may provide very biased samples for some research, questions we ask. For example, a list of homeowners would not be representative of all, individuals in a given geographical region because it would exclude transients and renters., On the other hand, a ready-made list is often of better quality and less expensive to obtain, than a newly constructed list would be., Some lists are available from a variety of different sources. Professional organisations,, such as the American Psychological Association, the American Medical Association, and, the American Dental Association, have directory listings with mailing addresses of members., Keep in mind that these lists do not represent all psychologists, physicians, or dentists., Many individuals do not become members of their professional organisations. Therefore,, 54
Page 55 :
a generalisation would have to be limited to those professionals listed in the directory. In, universities and colleges, complete lists of students can be obtained from the registrar., Let’s look at a classic example of poor sampling in the hours prior to a presidential election., Information derived from sampling procedures is often used to predict election outcomes., Individuals in the sample are asked their candidate preferences before the election, and, projections are then made regarding the likely winner. More often than not, the polls, predict the outcome with considerable accuracy. However, there are notable exceptions,, such as the 1936 Literary Digest magazine poll that predicted “Landon by a Landslide”, over Roosevelt, and predictions in the U.S. presidential election of 1948 that Dewey, would defeat Truman., We have discussed the systematic error of the Literary Digest poll. Different reasons, resulted in the wrong prediction in the 1948 presidential election between Dewey and, Truman. Polls taken in 1948 revealed a large undecided vote. Based partly on this and, early returns on the night of the election, the editors of the Chicago Tribune printed and, distributed their newspaper before the election results were all in. The headline in bold, letters indicated that Dewey defeated Truman. Unfortunately for them, they were wrong., Truman won, and the newspaper became a collector’s item., One analysis of why the polls predicted the wrong outcome emphasised the consolidation, of opinion for many undecided voters. It was this undecided group that proved the, prediction wrong. Pollsters did not anticipate that those who were undecided would vote, in large numbers for Truman. Other factors generally operate to reduce the accuracy of, political polls. One is that individuals do not always vote the way they say they are going, to. Others may intend to do so but change their mind in the voting booth., Also, the proportion of potential voters who actually cast ballot differs depending upon, the political party and often upon the candidates who are running. Some political analysts, believe (along with politicians) that even the position of the candidate’s name on the ballot, can affect the outcome (the debate regarding butterfly ballots in Florida during the 2000, presidential election comes to mind)., Step 3. Drawing the Sample, After a list of population members has been constructed, various random sampling options, are available. Some common ones include tossing dice, flipping coins, spinning wheels,, 55
Page 56 :
drawing names out of a rotating drum, using a table of random numbers, and using computer, programs. Except for the last two methods, most of the techniques are slow and, cumbersome. Tables of random numbers are easy to use, accessible, and truly random., Let’s look at the procedures for using the table. The first step is to assign a number to each, individual on the list. If there were 1,000 people in the population, you would number, them 0 to 999 and then enter the table of random numbers. Let us assume your sample, size will be 100. Starting anywhere in the table, move in any direction you choose, preferably, up and down. Since there are 1,000 people on your list (0 through 999) you must give, each an equal chance of being selected. To do this, you use three columns of digits from, the tables. If the first three-digit number in the table is 218, participant number 218 on the, population list is chosen for the sample. If the next three-digit number is 007, the participant, assigned number 007 (or 7) is selected. Continue until you have selected all 100 participants, for the sample. If the same number comes up more than once, it is simply discarded. In the, preceding fictional population list, the first digit (9) in the total population of 1,000 (0–, 999) was large. Sometimes the first digit in the population total is small, as with a list of, 200 or 2,000. When this happens, many of the random numbers encountered in the table, will not be usable and therefore must be passed up. This is very common and does not, constitute a sampling problem. Also, tables of random numbers come in different column, groupings. Some come in columns of two digits, some three, some four, and so on. These, differences have no bearing on randomness. Finally, it is imperative that you not violate the, random selection procedure. Once the list has been compiled and the process of selection, has begun, the table of random numbers dictates who will be selected. The experimenter, should not alter this procedure. A more recent method of random sampling uses the special, functions of computer software. Many population lists are now available as software, databases (such as Excel, Quattro Pro, Lotus123) or can be imported to such a database., Many of these database programs have a function for generating a series of random numbers, and a function for selecting a random sample from a range of entries in the database., Step 4. Contacting Members of a Sample, Researchers using random sampling procedures must be prepared to encounter difficulties, at several points. As we noted, the starting point is an accurate statement that identifies the, population to which we want to generalise. Then we must obtain a listing of the population,, accurate and up-to-date, from which to draw our sample. Further, we must decide on the, random selection procedure that we wish to use. Finally, we must contact each of those, selected for our sample and obtain the information needed. Failing to contact all individuals, 56
Page 57 :
in the sample can be a problem, and the representativeness of the sample can be lost at, this point. To illustrate what we mean, assume that we are interested in the attitudes of, college students at your University. We have a comprehensive list of students and randomly, select 100 of them for our sample. We send a survey to the 100 students, but only 80, students return it. We are faced with a dilemma. Is the sample of 80 students who, participated representative? Because 20% of our sample was not located, does our sample, under represent some views? Does it over represent other views? In short, can we generalise, from our sample to the college population? Ideally, all individuals in a sample should be, contacted. As the number contacted decreases, the risk of bias and not being representative, increases. Thus, in our illustration, to generalise to the college population would be to, invite risk. Yet we do have data on 80% of our sample. Is it of any value? Other than, simply dropping the project or starting a new one, we can consider an alternative that, other researchers have used. In preparing our report, we would first clearly acknowledge, that not all members of the sample participated and therefore the sample may not be, random-that is, representative of the population. Then we would make available to the, reader or listener of our report the number of participants initially selected and the final, number contacted, the number of participants cooperating, and the number not cooperating. We would attempt to assess the reasons participants could not be contacted, and whether differences existed between those for whom there were data and those for, whom there were no data. If no obvious differences were found, we could feel a little, better about the sample’s being representative., Differences on any characteristic between those who participated and those who did not, should not automatically suggest that the information they might give would also differ., Individuals can share many common values and beliefs, even though they may differ on, characteristics such as gender or education. In situations requiring judgments, such as the, one described, the important thing is for the researcher to describe the strengths and, weaknesses of the study (especially telling the reader that only 80 of the 100 surveys were, returned), along with what might be expected as a result of them., The problem just described may be especially troublesome when surveys or questionnaires, deal with matters of a personal nature. Individuals are usually reluctant to provide information, on personal matters, such as sexual practices, religious beliefs, or political philosophy. The, more personal the question, the fewer the number of people who will respond. With surveys, or questionnaires of this nature, a large number of individuals may refuse to cooperate or, refuse to provide certain information. Some of these surveys have had return rates as low, 57
Page 58 :
as 20%. Even if we knew the population from which the sample was drawn and if the, sample was randomly selected, a return rate as low as 20% is virtually useless in terms of, generalising findings from the sample to the population. Those individuals responding to a, survey (20% of the sample) could be radically different from the majority of individuals not, responding (80% of the sample)., Let’s apply these four steps of random sampling to our TV violence study. Our first step is, to define the population. We might begin by considering the population as all children in, the United States that are 5–15 years old. Our next step will be to obtain an exhaustive list, of these children. Using U. S. Census data would be one approach, although the task, would be challenging and the Census does miss many people. The third step is to select a, random sample. As noted earlier in the chapter, the simplest technique would be to use a, database of the population and instruct the database software to randomly select children, from the population. The number to be selected is determined by the researcher and is, typically based on the largest number that can be sampled given the logistical resources of, the researcher. Of course, the larger the sample, the more accurately it will represent the, population. In fact, formulas can be used to determine sample size based on the size of the, population, the amount of variability in the population, the estimated size of the effect, and, the amount of sampling error that the researcher decides is acceptable (refer to statistics, books for specifics). After the sample is selected from the population, the final step is to, contact the parents of these children to obtain consent to participate. You will need to, make phone calls and send letters. Again, this will be a challenge; you expect that you will, be unable to contact a certain percentage, and that a certain percentage will decline to, participate. All this effort, and we have not even begun to talk about collecting data from, these children., From this example, it is clear that random sampling can require an incredible amount of, financial resources. As noted earlier in the chapter, we have two options. We can define, the population more narrowly (perhaps the 5- to 15-year-olds in a particular school district), and conduct random sampling from this population, or we can turn to a sampling technique, other than probability sampling., 4.3.1.1 Merits, 1., , Since it is a probability sampling, it eliminates the bias due to the personal judgement, or discretion of the investigator. Accordingly, the sample selected is more, representative of the population than in case of judgement sampling., 58
Page 59 :
2. Because of its random character, it is possible to ascertain the efficiency of the estimates by considering the standard errors of their sampling distributions.
3. The theory of random sampling is highly developed, so that it enables us to obtain the most reliable and maximum information at the least cost, and results in savings in time, money and labour.
4.3.1.2 Demerits
1. Simple random sampling requires an up-to-date frame, i.e., a complete and up-to-date list of the population units to be sampled. In practice, since this is not readily available in many inquiries, it restricts the use of this sampling design.
2. In field surveys, if the area of coverage is fairly large, then the units selected in the random sample are likely to be scattered widely geographically, and thus it may be quite time consuming and costly to collect the requisite information or data.
3. If the sample is not sufficiently large, then it may not be representative of the population and thus may not reflect the true characteristics of the population.
4. The numbering of the population units and the preparation of the slips is quite time consuming and uneconomical, particularly if the population is large. Accordingly, this method cannot be used effectively to collect most of the data in the social sciences.
4.3.2 SYSTEMATIC SAMPLING
Systematic sampling is another random sampling technique. Unlike stratified random sampling, systematic sampling is not done in an attempt to reduce sampling error. Rather, systematic sampling is used because of its convenience and relative ease of administration. With systematic sampling, every kth item is selected to produce a sample of size n from a population of size N. The value of k, sometimes called the sampling cycle, can be determined by the following formula (if k is not an integer value, the whole-number value should be used):
k = N/n
where
n = sample size
N = population size
k = size of the interval for selection
59
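A minimal Python sketch of this procedure follows (the values are illustrative): compute k = N/n, choose a random starting point between 1 and k, and then select every kth unit from the frame.

import random

def systematic_sample(frame, n):
    """Select every kth element of the frame after a random start between 1 and k."""
    N = len(frame)
    k = N // n                            # sampling interval, whole-number value of N/n
    start = random.randint(1, k)          # random starting point between 1 and k inclusive
    return frame[start - 1::k][:n]        # the start-th unit, then every kth unit thereafter

random.seed(1)
frame = list(range(1, 17001))             # e.g. 17,000 manufacturers listed alphabetically
sample = systematic_sample(frame, 1000)   # k = 17, so 1,000 companies are selected
print(len(sample), sample[:4])            # 1000, and the first few selected units (start, start+17, ...)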
Page 60 :
As an example of systematic sampling, a management information systems researcher, wanted to sample the manufacturers in Texas. He had enough financial support to sample, 1,000 companies (n). The Directory of Texas Manufacturers listed approximately 17,000, total manufacturers in Texas (N) in alphabetical order. The value of k was 17 (17,000/, 1,000) and the researcher selected every 17th company in the directory for his sample., Did the researcher begin with the first company listed or the 17th or one somewhere, between? In selecting every kth value, a simple random number table should be used to, select a value between 1 and k inclusive as a starting point. The second element for the, sample is the starting point plus k. In the example, k = 17, so the researcher would have, gone to a table of random numbers to determine a starting point between 1 and 17., Suppose he selected the number 5.He would have started with the 5th company, then, selected the 22nd (5 + 17), and then the 39th, and so on., Besides convenience, systematic sampling has other advantages. Because systematic, sampling is evenly distributed across the frame, a knowledgeable person can easily determine, whether a sampling plan has been followed in a study. However, a problem with systematic, sampling can occur if the data are subject to any periodicity, and the sampling interval is in, syncopation with it. In such a case, the sampling would be non-random. For example, if a, list of 150 college students is actually a merged list of five classes with 30 students in each, class and if each of the lists of the five classes has been ordered with the names of top, students first and bottom students last, then systematic sampling of every 30th student, could cause selection of all top students, all bottom students, or all mediocre students; that, is, the original list is subject to a cyclical or periodic organisation. Systematic sampling, methodology is based on the assumption that the source of population elements is random., 4.3.2.1 Merits, 1., , Systematic sampling is very easy to operate and checking can also be done quickly., Accordingly, it results in considerable saving in time and labour relative to simple, random sampling or stratified sampling., , 2., , Systematic sampling may be more efficient than simple random sampling provided, the frame is complete and up-to-date and the units are arranged serially in a random, order like the names in a telephone directory where the units are arranged in, alphabetical order. However, even in alphabetical arrangement, certain amount of, non-random character may persist., 60
Page 61 :
4.3.2.2 Demerits, 1., , Systematic sampling works well only if the complete and up-to-date frame is, available and if the units are randomly arranged. However, these requirements are, not generally fulfilled., , 2., , Systematic sampling gives biased results if there are periodic features in the frame, and the sampling interval (k) is equal to or a multiple of the period., , 4.3.3 STRATIFIED SAMPLING, This procedure known as stratified random sampling is also a form of probability sampling., To stratify means to classify or to separate people into groups according to some, characteristics, such as position, rank, income, education, or ethnic background. These, separate groupings are referred to as subsets or subgroups. For a stratified random sample,, the population is divided into groups or strata. A random sample is selected from each, stratum based upon the percentage that each subgroup represents in the population. Stratified, random samples are generally more accurate in representing the population than are simple, random samples. They also require more effort, and there is a practical limit to the number, of strata used. Because participants are to be chosen randomly from each stratum, a, complete list of the population within each stratum must be constructed. Stratified sampling, is generally used in two different ways. In one, primary interest is in the representativeness, of the sample for purposes of commenting on the population. In the other, the focus of, interest is comparison between and among the strata., Let’s look first at an example in which the population is of primary interest. Suppose we, are interested in the attitudes and opinions of university faculty in a certain state toward, faculty unionisation. Historically, this issue has been a very controversial one evoking strong, emotions on both sides. Assume that there are eight universities in the state, each with a, different faculty size (faculty size = 500 + 800 + 900 + 1,000 + 1,400 + 1,600 + 1,800 +, 2,000 = 10,000). We could simply take a simple random sample of all 10,000 faculty and, send those in the sample a carefully constructed attitude survey concerning unionisation., After considering this strategy, we decide against it. Our thought is that universities of, different size may have marked differences in their attitudes, and we want to be sure that, each university will be represented in the sample in proportion to its representation in the, total university population. We know that, on occasion, a simple random sample will not, do this. For example, if unionisation is a particularly “hot” issue on one campus, we may, 61
obtain a disproportionate number of replies from that faculty. Therefore, we would construct, a list of the entire faculty for each university and then sample randomly within each university, in proportion to its representation in the total faculty of 10,000. For example, the university, with 500 faculty members would represent 5% of our sample; assuming a total sample, size of 1,000, we would randomly select 50 faculty from this university. The university with, 2,000 faculty would represent 20% of our sample; thus, 200 of its faculty would be randomly, selected. We would continue until our sample was complete. It would be possible but, more costly and time consuming to include other strata of interest-for example, full, associate,, and assistant professors. In each case, the faculty in each stratum would be randomly, selected., As previously noted, stratified samples are sometimes used to optimise group comparisons., In this case, we are not concerned about representing the total population. Instead, our, focus is on comparisons involving two or more strata. If the groups involved in our, comparisons are equally represented in the population, a single random sample could be, used. When this is not the case, a different procedure is necessary. For example, if we, were interested in making comparisons between whites and blacks, a simple random, sample of 100 people might include about 85 to 90 whites and only 10 to 15 blacks. This, is hardly a satisfactory sample for making comparisons. With a stratified random sample,, we could randomly choose 50 whites and 50 blacks and thus optimize our comparison., Whenever strata rather than the population are our primary interest, we can sample in, different proportions from each stratum., Although random sampling is optimal from a methodological point of view, it is not always, possible from a practical point of view., Stratified random sampling can be either proportionate or disproportionate. Proportionate, stratified random sampling occurs when the percentage of the sample taken from each, stratum is proportionate to the percentage that each stratum is within the whole population., For example, suppose voters are being surveyed in Boston and the sample is being stratified, by religion as Catholic, Protestant, Jewish, and others. If Boston’s population is 90%, Catholic and if a sample of 1,000 voters is taken, the sample would require inclusion of, 900 Catholics to achieve proportionate stratification. Any other number of Catholics would, be disproportionate stratification. The sample proportion of other religions would also, have to follow population percentages. Or consider the city of El Paso, Texas, where the, population is approximately 77% Hispanic. If a researcher is conducting a citywide poll in, 62
El Paso and if stratification is by ethnicity, a proportionate stratified random sample should contain 77% Hispanics. Hence, an ethnically proportionate stratified sample of 160 residents from El Paso's 600,000 residents should contain approximately 123 Hispanics. Whenever the proportions of the strata in the sample are different from the proportions of the strata in the population, disproportionate stratified random sampling occurs.

4.3.3.1 Merits

1. More representative: Since the population is first divided into various strata and then a sample is drawn from each stratum, there is little possibility of any essential group of the population being completely excluded. A more representative sample is thus secured. C.J. Grohmann has rightly pointed out that this type of sampling balances the uncertainty of random sampling against the bias of deliberate selection.

2. Greater accuracy: Stratified sampling ensures greater accuracy. The accuracy is maximum if each stratum is so formed that it consists of uniform or homogeneous items.

3. Greater geographical concentration: As compared with a random sample, stratified samples can be more concentrated geographically, i.e., the units from the different strata may be selected in such a way that all of them are localised in one geographical area. This would greatly reduce the time and expense of interviewing.

4.3.3.2 Demerits

1. Disproportionate stratified sampling requires the assignment of weights to different strata, and if the weights assigned are faulty, the resulting sample will not be representative and might give biased results.

2. The items from each stratum should be selected at random. But this may be difficult to achieve in the absence of skilled sampling supervisors, and a random selection within each stratum may not be ensured.

3. Because of the likelihood that a stratified sample will be more widely distributed geographically than a simple random sample, cost per observation may be quite high.
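The proportionate allocations used in the examples above (the eight-university faculty survey and the El Paso poll) amount to a simple calculation. A minimal Python sketch, with the stratum sizes taken from the text and simple rounding assumed:

    # Proportionate allocation for the eight-university faculty example
    strata_sizes = [500, 800, 900, 1000, 1400, 1600, 1800, 2000]   # faculty per university
    population = sum(strata_sizes)                                  # 10,000 faculty in all
    sample_size = 1000

    allocation = [round(sample_size * size / population) for size in strata_sizes]
    print(allocation)        # [50, 80, 90, 100, 140, 160, 180, 200]
    print(sum(allocation))   # 1000

    # The El Paso poll works the same way: 77% of a sample of 160
    print(round(0.77 * 160))  # 123 Hispanic respondents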
4.3.4 CLUSTER SAMPLING, Cluster (or area) sampling is a fourth type of random sampling. Cluster (or area) sampling, involves dividing the population into non-overlapping areas, or clusters. However, in contrast, to stratified random sampling where strata are homogeneous within, cluster sampling, identifies clusters that tend to be internally heterogeneous. In theory, each cluster contains, a wide variety of elements, and the cluster is a miniature, or microcosm, of the population., Examples of clusters are towns, companies, homes, colleges, areas of a city, and geographic, regions. Often clusters are naturally occurring groups of the population and are already, identified, such as states or Standard Metropolitan Statistical Areas. Although area sampling, usually refers to clusters that are areas of the population, such as geographic regions and, cities, the terms cluster sampling and area sampling are used interchangeably in this text., After randomly selecting clusters from the population, the business researcher either selects, all elements of the chosen clusters or randomly selects individual elements into the sample, from the clusters. One example of business research that makes use of clustering is test, marketing of new products. Often in test marketing, the United States is divided into, clusters of test market cities, and individual consumers within the test market cities are, surveyed., Sometimes the clusters are too large, and a second set of clusters is taken from each, original cluster. This technique is called two-stage or multistage sampling. For example, a, researcher could divide the United States into clusters of cities. He could then divide the, cities into clusters of blocks and randomly select individual houses from the block clusters., The first stage is selecting the test cities and the second stage is selecting the blocks., Clusters are usually convenient to obtain, and the cost of sampling from the entire population, is reduced because the scope of the study is reduced to the clusters. The cost per element, is usually lower in cluster or area sampling than in stratified sampling because of lower, element listing or locating costs. The time and cost of contacting elements of the population, can be reduced, especially if travel is involved, because clustering reduces the distance to, the sampled elements. In addition, administration of the sample survey can be simplified., Sometimes cluster or area sampling is the only feasible approach because the sampling, frames of the individual elements of the population are unavailable and therefore other, random sampling techniques cannot be used. If the elements of a cluster are similar, cluster, sampling may be statistically less efficient than simple random sampling. Moreover, the, costs and problems of statistical analysis are greater with cluster or area sampling than, with simple random sampling., 64
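The two-stage procedure just described (first sample whole clusters, then sample elements within the chosen clusters) can be sketched as follows; the cities and households below are hypothetical and stand in for any naturally occurring clusters:

    import random

    # 40 clusters (cities) of 500 households each, invented for illustration
    population = {f"City {c}": [f"City {c}, household {h}" for h in range(1, 501)]
                  for c in range(1, 41)}

    stage1 = random.sample(list(population), 8)                      # stage 1: pick 8 cities
    stage2 = [house for city in stage1
              for house in random.sample(population[city], 25)]      # stage 2: 25 households per city

    print(len(stage2))                                               # 200 sampled households

Only the frames for the eight selected cities are needed at the second stage, which is exactly the cost advantage noted above.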
4.3.4.1 Merits, Multistage sampling is more flexible as compared to other methods of sampling. It is, simple to carry out and results in administrative convenience by permitting the field work, to be concentrated and yet covering large area., An important practical advantage of multistage sampling is that we need the second stage, frame only for those units which are selected in the first sample and this leads to great, saving of operating cost., Consequently this technique is of great utility, particularly in surveys of under developed, area where no up-to-date frame is available for subdivision of the material into reasonably, small sampling units., 4.3.4.2 Demerits, Errors are likely to be larger in this method. The variability of the estimates under this, method may be greater than that of estimates based on simple random sampling. This, variability depends on the composition of the primary units. In general, a multistage sampling, is usually less efficient than a suitable single stage sampling of the same size., 4.8 SUMMARY, When we conduct research, we are generally interested in drawing some conclusion about, a population of individuals that have some common characteristic. However, populations, are typically too large to allow observations on all individuals, and we resort to selecting a, sample. In order to make inferences about the population, the sample must be representative., Thus, the manner in which the sample is drawn is critical. Probability sampling uses random, sampling in which each element in the population (or a subgroup of the population with, stratified random sampling) has an equal chance of being selected for the sample. When, probability sampling is not possible, non-probability sampling must be used. Larger samples, are more likely to accurately represent characteristics of the population, and smaller samples, are less likely to accurately represent characteristics of the population. Therefore, researchers, strive for samples that are large enough to reduce sampling error to an acceptable level., Even when samples are large enough, it is important to evaluate the specific method by, which the sample was drawn. We are increasingly exposed to information obtained from, self-selected samples that represent only a very narrow subgroup of individuals. Much of, such information is meaningless because the subgroup is difficult to identify., 65
Therefore, in this lesson we have studied the concept and uses of various probability sampling techniques with their merits and demerits.

4.9 GLOSSARY

•	Stratified Sampling: Stratified sampling is a probability sampling technique wherein the researcher divides the entire population into different subgroups or strata, then randomly selects the final subjects proportionally from the different strata.

•	Sampling Frame: A list of the items or people forming a population from which a sample is taken.

•	Sample Size: The number (n) of observations taken from a population through which statistical inferences for the whole population are made.

•	Cluster Sampling: With cluster sampling, the researcher divides the population into separate groups, called clusters. Then, a simple random sample of clusters is selected from the population. The researcher conducts his analysis on data from the sampled clusters.

•	Simple Random Sampling: Simple random sampling is a sampling technique where every item in the population has an equal chance and likelihood of being selected in the sample.

•	Systematic Sampling: Systematic sampling is a type of probability sampling method in which sample members from a larger population are selected according to a random starting point and a fixed, periodic interval. This interval, called the sampling interval, is calculated by dividing the population size by the desired sample size.

4.10 SELF ASSESSMENT QUESTIONS

(i) Indicate whether the following statements are true or false.

1. With probability samples the chance, or probability, of each case being selected from the population is unknown. (T, F)

2. The sampling frame for any probability sample is a complete list of all the cases in the population from which your sample will be drawn. (T, F)
3., , Choice of sampling technique or techniques is dependent on your research, question(s) and objectives and the feasibility of gaining access to the data., (T, , 4., , F), , Generalisations about populations from data collected using any probability sample, is based on intuition., (T, F), , 4.11 LESSON END EXERCISE, 1., , Describe the various methods of drawing a sample. Which one would you suggest, and why?, __________________________________________________________________, __________________________________________________________________, __________________________________________________________________, , 2., , Describe the importance of sampling. Critically examine the merits of probability, sampling methods., __________________________________________________________________, __________________________________________________________________, __________________________________________________________________, , 3., , Specify and explain the factors that make sampling preferable to a complete census, in statistical investigation., ____________________________________________________________________, __________________________________________________________________, __________________________________________________________________, , 4., , How would you determine the sample size for stratified sampling? Explain with, the help of a suitable example., __________________________________________________________________, __________________________________________________________________, __________________________________________________________________, 67
5., , Discuss the purpose and the use of sampling frames in social work research., Describe how a sampling frame is different from the sample in which it was drawn., __________________________________________________________________, __________________________________________________________________, __________________________________________________________________, , 6., , What are the limitations of this sampling method, and in what specific ways could, the sampling method have affected the findings?, __________________________________________________________________, __________________________________________________________________, __________________________________________________________________, , 4.12 SUGGESTED READING, •, , Gupta, S.P.: Statistical Methods, Sultan Chand & Sons, New Delhi., , •, , Gupta, S.C. and V.K. Kapoor : Fundamentals of Applied Statistics., , •, , Anderson, Sweeney and Williams: Statistics for Business and EconomicsThompson, New Delhi., , •, , Levin, Richard and David S Rubin: Statistics for Management, Prentice Hall,, Delhi., , •, , Levin and Brevson: Business Statistics, Pearson Education, New Delhi., , •, , Hooda, R.P.: Statistics for Business and Economics, Macmillan, New Delhi., , 68
M.Com. I, , Course No. M.Com-115, , Unit- I, , Lesson No. 5, , SAMPLING AND NON-SAMPLING ERRORS AND BASICS OF SPSS, STRUCTURE, 5.1 Introduction, 5.2 Objectives, 5.3 Sampling Errors, 5.3.1 Types of Sampling Errors, 5.3.2 Sources of Sampling Errors, 5.3.3 Measurement and Control of Sampling Errors, 5.4 Non Sampling Errors, 5.4.1 Types of Non-sampling Errors, 5.4.2 Sources of Non-sampling Errors, 5.4.3 Measurement and Control of Non-Sampling Errors, 5.5, , Basics of data feeding and analysis Software-SPSS, 5.5.1 Features of SPSS, 5.5.2 Benefits of SPSS, 5.5.3 Variable View, 5.5.4 Data View, 5.5.5 Statistical Test Using SPSS, , 5.6, , Summary, , 5.7, , Glossary, 69
5.8 Self Assessment Questions
5.9 Lesson End Exercise
5.10 Suggested Reading

5.1 INTRODUCTION

The aim of sampling is usually to estimate one or more population values (parameters) from a sample. The next lesson deals in depth with this issue of estimation, but we mention here that estimates such as sample means or proportions are random quantities. If we were to repeat the sampling process, the estimate would vary, and this sample-to-sample variability can be described by a distribution (e.g. the distribution of the sample mean or sample proportion). The estimate is not guaranteed to be the same as the value that we are estimating, so we call the difference the error in the estimate. There are different kinds of error. Further, there is no question that business, education, and all fields of science have come to rely heavily on the computer. This dependence has become so great that it is no longer possible to understand social and health science research without substantial knowledge of statistics and without an understanding of statistical software. The number and types of statistical software packages available continue to grow each year. In this lesson we have chosen to understand the basic concepts of SPSS, known as the Statistical Package for the Social Sciences. SPSS is chosen because of its popularity within both academic and business circles, making it the most widely used package of its type. SPSS is also a versatile package that allows many different types of analyses, data transformations, and forms of output - in short, it will more than adequately serve our purposes.

5.2 OBJECTIVES

After reading this lesson, you will be able to:

•	understand the concept of errors in statistics.

•	know about sampling and non-sampling errors.

•	know about the basics of statistical software, i.e. SPSS.

•	understand how sampling and non-sampling errors can be minimised.
5.3 SAMPLING ERRORS, Sampling error is the error that arises in a data collection process as a result of taking a, sample from a population rather than using the whole population., Sampling error is one of two reasons for the difference between an estimate of a population, parameter and the true, but unknown, value of the population parameter. The other reason, is non-sampling error. Even if a sampling process has no non-sampling errors then estimates, from different random samples (of the same size) will vary from sample to sample, and, each estimate is likely to be different from the true value of the population parameter., The sampling error for a given sample is unknown but when the sampling is random, for, some estimates (for example, sample mean, sample proportion) theoretical methods may, be used to measure the extent of the variation caused by sampling error., 5.3.1 Types of Sampling Errors, When you survey a sample, your interest usually goes beyond just the people in the sample., Rather, you are trying to get information to project onto a larger population. For this, reason, it is important to understand common sampling errors so you can avoid them. Five, common types of sampling errors:, , , Population Specification Error: This error occurs when the researcher does not, understand who they should survey. For example, imagine a survey about breakfast, cereal consumption. Who to survey? It might be the entire family, the mother, or the, children. The mother might make the purchase decision, but the children influence her, choice., , , , Sample Frame Error: A frame error occurs when the wrong sub-population is used, to select a sample. A classic frame error occurred in the 1936 presidential election, between Roosevelt and Landon. The sample frame was from car registrations and, telephone directories. In 1936, many Americans did not own cars or telephones, and, those who did were largely Republicans. The results wrongly predicted a Republican, victory., , 71
, , Selection Error: This occurs when respondents self-select their participation in the, study-only those that are interested respond. Selection error can be controlled by, going extra lengths to get participation. A typical survey process includes initiating presurvey contact requesting cooperation, actual surveying, and post-survey follow-up., If a response is not received, a second survey request follows, and perhaps interviews, using alternate modes such as telephone or person-to-person., , , , Non-Response: Non-response errors occur when respondents are different than, those who do not respond. This may occur because either the potential respondent, was not contacted or they refused to respond. The extent of this non-response error, can be checked through follow-up surveys using alternate modes., , , , Sampling Errors: These errors occur because of variation in the number or, representativeness of the sample that responds. Sampling errors can be controlled by, (1) careful sample designs, (2) large samples, and (3) multiple contacts to assure, representative response., , 5.3.2 Sources of Sampling Errors, A sampling error is a problem in the way that members of a population are selected for, research or data collection, which impacts the validity of results. Numerically, a sampling, error expresses the difference between results for the sample and estimated results for the, population., Subjects are selected through several different methods, broadly categorised as probabilitybased or non-probability-based. Probability-based methods are considered to yield the, most valid results because each member of a population has an equal chance of selection;, as long as a sufficiently large sample is selected, the group should be representative of the, population., No sampling method is infallible. In simple random sampling, considered to be the most, foolproof method, subjects for the sample are randomly selected from the entire population, to create a subset. Even in this case, however, sample size is an issue. In general, a larger, group of subjects will be more representative of the population. Imagine, for example,, a study in which thirty subjects are selected from a population of a thousand-random, selection could not ensure that the sample would represent the population. Other, sampling errors include:, 72
Non-response: Subjects may fail to respond, and those who respond may differ from, those who don’t in significant ways., Self-selection: If subjects volunteer, that may indicate that they have a particular bias, related to the study, which can skew results., Sample frame error: A non-representative subgroup may be selected as a sample., Population specification error: The researcher fails to identify the population of interest, with enough precision., A sufficiently large sample size, randomised selection and attention to study design can all, help to improve the validity of data., 5.3.3 Measurement and Control of Sampling Errors, Of the two types of errors, sampling error is easier to identify. Although sampling error is, unavoidable when collecting a random sample, we can take measures to estimate and, reduce sampling error. The margin of error that you commonly see with survey results is in, fact an estimate of sampling error. Because it is just an estimate, there is a small chance, (typically five percent or less) that the margin of error is actually larger than stated in a, report., The techniques used for measuring and reducing sampling error are:, Increase the Sample Size, A larger sample size leads to a more precise result because the study gets closer to the, actual population size., Divide the Population into Groups, Instead of a random sample, test groups according to their size in the population. For, example, if people of a certain demographic make up 35% of the population, make sure, 35% of the study is made up of this variable., , 73
Know your Population, The error of population specification is when a research team selects an inappropriate, population to obtain data. Know who buys your product, uses it, works with you, and so, forth. With basic socio-economic information, it is possible to reach a consistent sample of, the population. In cases like marketing research, studies often relate to one specific, population like Facebook users, Baby Boomers, or even homeowners., Reducing Sampling Error by Increasing Sample Size, One way to reduce sampling error is to increase the size of your sample by selecting more, subjects to observe. Sampling error and sample size have an inversely correlated, relationship, meaning that as sample size grows, sampling error decreases. However, it’s, important to note that increasing sample size usually results in an increase in cost., The more people that you want to survey in your study, the more expensive your study will, be, as there are costs associated with identifying respondents or participants. There will, also be an increased cost from a time usage perspective., We’ve found that after bringing sample size to 1,000 participants, researchers generally, start to get fewer bangs for their buck. This is due to the relationship between sample size, and margin of error. Once you have a sample size of 1,000, even if you more than double, your sample size to 2,500 you are only decreasing your margin of error by one percent., Reducing Sampling Error with Solid Sample Design, Sampling error can also be reduced by ensuring that you have a solid sample design. For, example, if your target population is made up of defined subpopulations, then you could, reduce margin of error by sampling each subpopulation independently. The tactics above, only reduce sampling error -they do not eliminate it. The only true way to eliminate sampling, error entirely is to examine each and every individual member of your target population., This is often impractical, and in many cases impossible. But that's not to say that generating, random samples is an ineffective means of investigating a population. It's still a convenient, and effective way to examine a large-scale, complex population., , 74
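The diminishing return mentioned above (going from 1,000 to 2,500 respondents buys only about one percentage point of precision) can be checked with the usual margin-of-error formula for a sample proportion. A small Python sketch, assuming a 95% confidence level and the worst case p = 0.5:

    import math

    def margin_of_error(n, p=0.5, z=1.96):
        # 95% margin of error for a sample proportion, worst case p = 0.5
        return z * math.sqrt(p * (1 - p) / n)

    for n in (100, 1000, 2500):
        print(n, round(100 * margin_of_error(n), 1), "%")
    # 100   9.8 %
    # 1000  3.1 %
    # 2500  2.0 %   (only about one percentage point better than n = 1,000)

The error shrinks with the square root of n, which is why the early gains in sample size are large and the later gains small.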
Example, A population specification error means that XYZ does not understand the specific types, of consumers who should be included in the sample. If, for example, XYZ creates a, population of people between the ages of 15 and 25 years old, many of those consumers, do not make the purchasing decision about a video streaming service because they do not, work full-time. On the other hand, if XYZ put together a sample of working adults who, make purchase decisions, the consumers in this group may not watch 10 hours of video, programming each week., Selection error also causes distortions in the results of a sample, and a common example, is a survey that only relies on a small portion of people who immediately respond. If XYZ, makes an effort to follow up with consumers who don’t initially respond, the results of the, survey may change. Furthermore, if XYZ excludes consumers who don’t respond right, away, the sample results may not reflect the preferences of the entire population., 5.4 NON SAMPLING ERRORS, It is a general assumption in sampling theory that the true value of each unit in the population, can be obtained and tabulated without any errors. In practice, this assumption may be, violated due to several reasons and practical constraints. This results in errors in observations, as well as in tabulation. Such errors which are due to factors other than sampling are called, non-sampling errors. The non-sampling errors are unavoidable in census and surveys. The, data collected by complete enumeration in census is free from sampling error but would, not remain free from non-sampling errors. The data collected through sample surveys can, have both-sampling errors as well as non-sampling errors. Non-sampling errors arise, because of the factors other than the inductive process of inferring about the population, from a sample. In general, the sampling errors decrease as the sample size increases, whereas non-sampling error increases as the sample size increases. In some situations,, the non-sampling errors may be large and deserve greater attention than the sampling, error., In any survey, it is assumed that the value of the characteristic to be measured has been, defined precisely for every population unit. Such a value exists and is unique. This is called, the true value of the characteristic for the population value. In practical applications, data, collected on the selected units are called survey values and differ from the true values., Such difference between the true and observed values is termed as observational error or, 75
response error. Such an error arises mainly from the lack of precision in measurement, techniques and variability in the performance of the investigators. Therefore, non-sampling, error is the error that arises in a data collection process as a result of factors other than, taking a sample., Non-sampling errors have the potential to cause bias in polls, surveys or samples., There are many different types of non-sampling errors and the names used to describe, them are not consistent. Examples of non-sampling errors are generally more useful than, using names to describe them., 5.4.1 Types of Non-sampling Errors, Non-sampling errors may be broadly classified into three categories., (a), , Specification Errors: These errors occur at planning stage due to various reasons,, e.g., inadequate and inconsistent specification of data with respect to the objectives, of surveys/census, omission or duplication of units due to imprecise definitions,, faulty method of enumeration/interview/ambiguous schedules etc., , (b), , Ascertainment Errors: These errors occur at field stage due to various reasons, e.g., lack of trained and experienced investigations, recall errors and other type of, errors in data collection, lack of adequate inspection and lack of supervision of, primary staff etc., , Ascertainment errors may be further sub-divided into, i., , Coverage errors owing to over-enumeration or under-enumeration of the, population or sample, resulting from duplication or omission of units and from, non-response., , ii., , Content errors relating to wrong entries due to errors on the part of investigators, and respondents., , (c) Tabulation Errors: These errors occur at tabulation stage due to various reasons,, e.g., inadequate scrutiny of data, errors in processing the data, errors in publishing the, tabulated results, graphs etc., Same division can be made in the case of tabulation error also. There is a possibility of, missing data or repetition of data at tabulation stage which gives rise to coverage errors, and also of errors in coding, calculations etc. which gives rise to content errors., 76
5.4.2 Sources of Non-Sampling Errors, Non sampling errors can occur at every stage of planning and execution of survey or, census. It occurs at planning stage, field work stage as well as at tabulation and computation, stage. The main sources of non-sampling errors are lack of proper specification of the, domain of study and scope of investigation, incomplete coverage of the population or, sample, faulty definition, defective methods of data collection and tabulation errors., More specifically, one or more of the following reasons may give rise to non-sampling, errors or indicate its presence:, a), , The data specification may be inadequate and inconsistent with the objectives of, the survey or census., , b), , Due to imprecise definition of the boundaries of area units, incomplete or wrong, identification of units, faulty methods of enumeration etc, data may be duplicated, or may be omitted., , c), , The methods of interview and observation collection may be inaccurate or, inappropriate., , d), , The questionnaire, definitions and instructions may be ambiguous., , e), , The investigators may be inexperienced or not trained properly., , f), , The recall errors may pose difficulty in reporting the true data., , g), , The scrutiny of data is not adequate., , h), , The coding, tabulation etc. of the data may be erroneous., , i), , There can be errors in presenting and printing the tabulated results, graphs etc., , j), , In a sample survey, the non-sampling errors arise due to defective frames and, faulty selection of sampling units., , k), , These sources are not exhaustive but surely indicate the possible source of errors, , The non-response error may occur due to refusal by respondents to give information or, the sampling units may be inaccessible. This error arises because the set of units getting, 77
excluded may have characteristic so different from the set of units actually surveyed as to, make the results biased. This error is termed as non-response error since it arises from the, exclusion of some of the anticipated units in the sample or population. One way of dealing, with the problem of non-response is to make all efforts to collect information from a subsample of the units not responding in the first attempt., 5.4.3 Measurement and Control of Non- Sampling Errors, Some suitable methods and adequate procedures for control can be adopted before initiating, the main census or sample survey. Some separate programmes for estimating the different, types of non-sampling errors are also required. Some such procedures are as follows:, 1., , Consistency check: Certain items in the questionnaires can be added which may, serve as a check on the quality of collected data. To locate the doubtful observations,, the data can be arranged in increasing order of some basic variable. Then they can, be plotted against each sample unit. Such graph is expected to follow a certain, pattern and any deviation from this pattern would help in spotting the discrepant, values., , 2., , Sample check: An independent duplicate census or sample survey can be, conducted on a comparatively smaller group by trained and experienced staff. If, the sample is properly designed and if the checking operation is efficiently carried, out, it is possible to detect the presence of non-sampling errors and to get an idea, of their magnitude. Such procedure is termed as method of sample check., , 3., , Post-census and post-survey checks: It is a type of sample check in which a, sample (or subsample) is selected of the units covered in the census (or survey), and re-enumerate or re-survey it by using better trained and more experienced, survey staff than those involved in the main investigation. This procedure is called, as post-survey check or post-census. The effectiveness of such check surveys, can be increased by re-enumerating or re-surveying immediately after the main, census to avoid recall error taking steps to minimise the conditioning effect that the, main survey may have on the work of the check-survey., , 4., , External record check: Take a sample of relevant units from a different source,, if available, and to check whether all the units have been enumerated in the main, investigation and whether there are discrepancies between the values when matched., 78
The list from which the check-sample is drawn for this purpose, need not be a, complete one., 5., , Quality control techniques: The use of tools of statistical quality control like, control chart and acceptance sampling techniques can be used in assessing the, quality of data and in improving the reliability of final results in large scale surveys, and census., , 6., , Study or recall error: Response errors arise due to various factors like the attitude, of respondents towards the survey, method of interview, skill of the investigators, and recall errors. Recall error depends on the length of the reporting period and, on the interval between the reporting period and data of survey. One way of studying, recall error is to collect and analyze data related to more than one reporting period, in a sample (or sub-sample) of units covered in the census or survey., , 7., , Interpenetrating sub-samples: The use of interpenetrating sub-sample technique, helps in providing an appraisal of the quality of information as the interpenetrating, sub-samples can be used to secure information on non-sampling errors such as, differences arising from differential interviewer bias, different methods of eliciting, information etc. After the sub-samples have been surveyed by different groups of, investigators and processed by different team of workers at the tabulation stage, a, comparison of the final estimates based on the sub-samples provides a broad, check on the quality of the survey results., , 5.5 BASICS OF DATA FEEDING AND ANALYSIS SOFTWARE -SPSS, SPSS is a software which is widely used as a Statistical Analytic Tool in the Field of Social, Science, Such as market research, surveys, competitor analysis, and others. It is a, comprehensive and flexible statistical analysis and data management tool. It is one of the, most popular statistical package which can perform highly complex data manipulation and, analysis with ease. It is designed for both interactive and non interactive users., The SPSS Consists of 2 Sheets -One Is the Data View and the Variable View, 5.5.1 Features of SPSS, 1., , It is easy for you to learn and use, , 2., , SPSS includes a lot of data management system and editing tools, 79
3. It offers in-depth statistical capabilities.

4. It offers excellent plotting, reporting and presentation features.

5.5.2 Benefits of SPSS

Here are a few key points why SPSS is considered one of the best tools to use:

1. Effective data management: SPSS makes data analysis easier and quicker because the program knows the location of the cases and the variables. It reduces the manual work of the user to a great extent.

2. Wide range of options: SPSS offers a wide range of methods, graphs and charts. It also comes with better screening and cleaning options for the information as preparation for further analysis.

3. Separate output: In SPSS the output is kept separate from the data itself; results are stored in a separate file.

5.5.3 Variable View

This is the sheet where you define the variables of your data. The Variable View consists of the following column headings:

•	Name: Enter a unique, identifiable and sortable variable name. E.g., in data about students the variables can be ID, Sex, Age, Class, etc. Note: special characters and spaces are not allowed while naming the variables. Once you enter the first variable, you can immediately see SPSS generating default settings for all the other columns, which you can then adjust for that variable.

•	Type: You can change the type of variable, whether numeric, alphabetic or alphanumeric, by selecting the respective type in this column; this restricts any other type from being used under that variable column.

•	Width: Defines the character width this variable should allow. This is especially helpful while entering a mobile number, which should allow only 10 characters.

•	Decimal: Defines the decimal places you want to display, e.g. in the case of percentages.
, , Label: Since the Name Column Doesn’t allow you to use any Special Character or, Space, here you can give any name as an Label for that Variable you wanted to assign, , , , Value : This is to define/ Label a Value Wherever you see in the Data, Eg: You can, Label “0” in the data as ABSENT for Exam, So when you find 0 in the data, it will be, labelled as ABSENT for Exam, You Can also Label the Employee ID Number with, their Name, So that Using the value Label Switch Button you can view the Name of, the employee, but in report the Name will not appear, only the EMP ID number will, appear, This helps in reading the data better in data view, , , , Missing : You can mention the Data which you don’t want the SPSS to consider, while analyzing, Like “0” Value is considered as Absent, so for analysis it will neglect, “0” if its mentioned in Missing, which will be helpful in Mean, Mode Etc,, , , , Align : You can mention the alignment of the data in the data sheet, Left, Right of, Middle,, , , , Measure: This is where you will define the measure of the Variable that you have, entered, whether Scale, Ordinal or Nominal type of Variable., , After defining all the variables of the data, once you click on the DATA View Sheet, You, can See the Labels entered by you as the Columns Label,, 5.5.4 Data View, The Data View is a Spreadsheet which contains rows and columns, the Data Can be, entered in the Data View Sheet Either Manually or the data can be imported from the data, file, SPSS can read the data file in any of the one format, which can be Excel, Plain text, files or relational (SQL) Databases, Before importing the excel sheet change the excel file, format to (.xls)., SPSS can read the data file better in numeric than the String (Text), So it’s always better, to convert most of the data as Numeric Variables Data, Eg: In case of Survey they Use, more of Yes/ no, Good/Average/Bad, Male/Female, – Here in all these cases you can use, codes in data file like 1 – Yes, 2 – No; Also make sure that the Excel data is arranged in, such a way that Rows always contain Responses from Different people and Columns, contains responses to different questions;, , 81
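The coding and value-label idea described above is not specific to SPSS. As a rough illustration of the same scheme in Python's pandas (the survey values below are made up, and this is only a sketch, not part of the SPSS workflow):

    import pandas as pd

    # Responses stored as numeric codes, as recommended above
    data = pd.DataFrame({"id": [101, 102, 103, 104],
                         "sex": [1, 2, 2, 1],          # 1 = Male, 2 = Female
                         "attended": [1, 0, 1, 1]})    # 0 = Absent, 1 = Present

    # Value labels kept alongside the codes; applied only for display,
    # much like toggling the value-label button in the SPSS Data View
    labels = {"sex": {1: "Male", 2: "Female"},
              "attended": {0: "Absent", 1: "Present"}}
    print(data.replace(labels))

Analysis is done on the numeric codes; the labels exist only to make the output readable.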
Importing Excel Data File into SPSS:, , , Click File …. Select Open ….. Select Data, , , , In the Dialog Box, in Files of Type Drop Box Select .xls File, , , , Then select the excel file data that you have stored in your system and Select, Open, , , , In the Dialog box, Ensure that you tick the “ Read Variable names from the first, row of data”, , , , Click Ok, , , , SPSS data Editor will open and you can find your data file in it;, , Analyse : After the data in imported or entered in the data view sheet, you can run the, reports and analyze the data using the Analyze option on Top Toolbar, You can find All the, Analytical tools in this option, Let Us take that we have to find the Mean, Median, Mode, for the data that is entered,, , , Go to Analyse option, , , , Select Descriptive Statistics, , , , Select Frequencies, , , , In the Dialog Box, on the left Side You will Have List of Variables That you have, mentioned in the data, Select the Variables for which you have to find the Mean,, Median, Mode and drop them in the right side box, using the drop button in the, center,, , , , Click on Statistics Button,, , , , You will find the Check box of Mean, Median and Mode in the central tendency, Dialog Box, Check them,, , , , Click Continue,, , , , In the Frequencies Dialog Box, Click the Chart Button, and select the type of, Chart you want the data to be shown,, , , , Then click ok,, 82
, , The Output will be opened in a separate window, with List of Mean, Median,, Mode and with pictorial representation of the chart that you have selected earlier;, , 5.5.5 Statistical Test Using SPSS, Quick Data Check: Before using any statistical test, It’s always advisable to do a data, check to know how the data has been distributed and clearly defined, whether the missing, values are neglected etc, Data Check is usually done using Charts, so that any abnormalities, can be easily detected and data can be corrected,, Histogram -is widely used to check the data in case of One Variable Tests – Creating an, histogram is already been explained;, Scatter Plot Chart -is used for Two Variable Tests:, , , Click Graphs……… Select Legacy Dialogs…….. Select Scatter/Dot, , , , Select the Simple Scatter Chart, , , , Click define, , , , You will Find the X axis and Y axis for Comparison, , , , Drop the Variable for X axis and Y axis respectively, , , , Click Ok, , Majorly Tests can be diversified based on the purpose, it can be 2 Types – Comparison, Tests and Association Test, Comparison Test can be further divided into 3 types based on, the number of variables you want to compare, one variable tests, two variable tests and, multi variable test;, Comparison Tests – One Variable using SPSS, 1. A. Chi Square Test, , , Click Analyse…. Select Non Parametric Tests… Select Legacy Dialogs, , , , Select Chi- Square Test, , 83
, , In the Chi- Square Dialog Box, Drop the Variable that you want to run the test in, the “Test Variable List” using the Drop Button in the center, , , , In Expected Range Tick “ Get from Data”, , , , In Expected Values Tick “ All Categories Equal”, , , , Click OK, , , , Chi Square result will appear in the Output Window, , 1. B. One Sample T- Test, , , Click Analyse…. Select Compare Means…, , , , Select One Sample T – Test, , , , In the One Sample T Test Dialog Box, Drop the Variable that you want to run the, test in the “Test Variable List” using the Drop Button in the center, , , , In the Test Value enter the Population value, , , , Click Paste – Syntax will appear with all the conditions, , , , Click OK and run the Test, , , , One Sample T – Test result will appear in the Output Window, , Comparison Tests – Two Variable using SPSS, 1. C. Paired Sample T test:, , , Click Analyse…. Select Compare Means…, , , , Select One Paired Sample T Test, , , , In the One Sample T Test Dialog Box, Drop the Two Variables that you want to, run the test in the “Paired Variable List” using the Drop Button in the center one by, one, , , , Click Paste – Syntax will appear with all the conditions, , , , Click OK and run the Test, 84
, , Paired Sample T – Test result will appear in the Output Window, , Comparison Tests – More Variable using SPSS, 1. Repeated Measures ANOVA, , , Click Analyse…. Select General Linear Model…, , , , Select Repeated Measures, , , , In the Repeated Measures define Factor Dialog Box, we are going to define the, multiple variables that we going to run,, , , , Give a Name to the Set of Variables that you going to compare (Factor Name) –, Like Courses, in the “Within-Subject Factor Name”, , , , Enter the number of Variable under this Factor in the “ Number of Levels”, , , , Click ADD, , , , Give the Measure Name, Like Rank, Rating,, , , , Click ADD, , , , Click “Define” to Define the Variables under the Mentioned Factor Name – Courses, , , , Drop all the Variables you wanted to include below the Factor Name ( Courses), to Within – Subject Variables, , , , Click Option and select descriptive Statistics, , , , Click Paste – Syntax will appear with all the conditions, , , , Click OK and run the Test, , , , Repeated Measures ANOVA – Test result will appear in the Output Window, , Associate Tests – Two Variables, 2., , A. Correlation Tests, , , Click Analyse…. Select Correlate…, , 85
, , Select Bivariate…, , , , In the Bivariate Correlation Dialog Box, Drop the Two Variables that you want to, run the test in the “Variable List” using the Drop Button in the center one by one, , , , Ensure You tick “Pearson” Correlation Coefficients, , , , Test of Significance tick “Two Tailed”, , , , Also Tick Flag Significant Correlations, , , , Click Paste – Syntax will appear with all the conditions, , , , Click OK and run the Test, , , , Bivariate Correlation – Test result will appear in the Output Window, , 2., , B. One Way ANOVA Test, , , Click Analyze…. Select Compare Means…, , , , Select One Way ANOVA…, , , , In the One Way ANOVA Dialog Box, Drop the Variables in the Dependent List, and the List of variables in the Factor List, Example: Weight of Children in the, dependent List and the Health Drinks in the Factor List, , , , Select Option and Select Descriptive, , , , Click Paste – Syntax will appear with all the conditions, , , , Click OK and run the Test, , , , One Way ANOVA – Test result will appear in the Output Window, , SPSS Output Window, This Window Contains all the Output which is run after the Statistical Tests, The output of, the Statistical Test will be displayed in the Output window in the Table and Chart/graph, format only;, , 86
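For readers working without SPSS, the tests walked through above (chi-square, one-sample and paired t-tests, Pearson correlation and one-way ANOVA) have equivalents in Python's scipy library. The sketch below uses small made-up data sets and an assumed test value of 50; it is only a rough parallel to the SPSS menus, not part of the SPSS procedure itself:

    import numpy as np
    from scipy import stats

    observed = np.array([18, 22, 20, 25, 15])         # category counts, all categories expected equal
    print(stats.chisquare(observed))                  # chi-square test

    sample = np.array([52, 48, 55, 51, 49, 53, 50, 54])
    print(stats.ttest_1samp(sample, popmean=50))      # one-sample t-test against the test value 50

    before = np.array([72, 75, 70, 68, 74])
    after = np.array([70, 74, 71, 66, 73])
    print(stats.ttest_rel(before, after))             # paired-sample t-test

    x = np.array([1, 2, 3, 4, 5, 6])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])
    print(stats.pearsonr(x, y))                       # Pearson (bivariate) correlation

    drink_a = [30, 32, 31, 29]
    drink_b = [35, 36, 34, 37]
    drink_c = [28, 27, 29, 30]
    print(stats.f_oneway(drink_a, drink_b, drink_c))  # one-way ANOVA across three groups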
The Output Window consists of two segments: on the left-hand side you can see the output outline, and the right-hand side shows the actual output. The output outline shows the titles and subtitles of the output organised in a hierarchical tree structure; by clicking a title or subtitle you can view the corresponding output, and you can hide or delete a branch of the tree structure. Tables can easily be copied and pasted into a Word or Excel sheet, and the same applies to any graph or chart displayed in the actual output.

The Data Editor window is saved in (.sav) format as an SPSS data file, an SPSS syntax file is saved in (.sps) format, and the Output Viewer window is saved in (.spv) or (.spo) format as an SPSS output file.

How to open a saved SPSS file and run the output again:

•	Click File …. Select Open ….. Select Data

•	In the Dialog Box, in the Files of Type drop box, select the (.sav) file type

•	Then select the SPSS data file that you have stored in your system and select Open

•	Click Ok

•	The SPSS Data Editor will open and you can find your data file in it

•	Make the necessary changes if you want to edit the data or the variables, and run the test required

The bottom line is that though Excel offers a good way of organising data, SPSS is more suitable for in-depth data analysis.

5.6 SUMMARY

The objective of this lesson is to know about sampling and non-sampling errors as well as to discuss the basic concepts of the Statistical Package for the Social Sciences (SPSS). Sampling error includes systematic error and random error. Systematic error occurs when the sample is not properly drawn (an error of the researcher). Random error is the degree
to which the sample is not perfectly representative of the population. Even with the best, sampling techniques, some degree of random error is expected. We have studied different, methods to sample from population, viz., simple random sample, stratified sample, cluster, sample etc. Each of these involves randomness in the sample-selection process, so the, estimated mean or proportion is unlikely to be exactly the same as the underlying population, parameter that is being estimated. When sampling books from a library or sacks of rice, from the output of a factory, sampling error is the main or only type of error. Further, when, sampling from some types of population- especially human populations - problems often, arise when conducting one of the above sampling schemes. For example, some sampled, people are likely to refuse to participate in your study. Such difficulties also result in errors, and these are called non-sampling errors. Non-sampling errors can be much higher than, sampling errors and are much more serious. Unlike sampling errors, the size of nonsampling errors cannot be estimated from a single sample-it is extremely difficult to assess, their likely size., Non-sampling errors often distort estimates by pulling them in one direction. It is therefore, important to design a survey to minimise the risk of non-sampling errors. In the end this, discussion, it is true to say that sampling error is one which is completely related to the, sampling design and can be avoided, by expanding the sample size. Conversely, nonsampling error is a basket that covers all the errors other than the sampling error and so,, it is unavoidable by nature as it is not possible to completely remove it., 5.7, , GLOSSARY, , , , Confidence Interval: The confidence interval is a range of values, above and, below a finding, in which the actual value is likely to fall. The confidence, interval represents the accuracy or precision of an estimate, , , , Random Sampling Error: Sampling error is the error caused by observing, a sample instead of the whole population. The sampling error is the difference, between a sample statistic used to estimate a population parameter and the actual, but unknown value of the parameter., , , , Non-Sampling Error: Non-sampling error is a catch-all term for the deviations, of estimates from their true values that are not a function of the sample chosen,, including various systematic errors and random errors that are not due to sampling., 88
•	SPSS: SPSS stands for Statistical Package for the Social Sciences. It is used by various kinds of researchers for complex statistical data analysis.

5.8 SELF ASSESSMENT QUESTIONS

(i) Please tick the correct answer:

1. Which of the following is not an example of non-sampling risk?
(a) Failing to evaluate results properly
(b) Use of an audit procedure inappropriate to achieve a given audit objective
(c) Obtaining an unrepresentative sample
(d) Failure to recognise an error

2. Why do sampling errors occur?
(a) Differences between sample and population
(b) Differences among samples themselves
(c) Choice of elements of sampling
(d) All of the above

3. Any calculation on the sample data is called:
(a) Parameter
(b) Statistic
(c) Mean
(d) Error
4. Probability distribution of a statistics is called:, (a) Sampling, (b) Parameter, (c) Data, (d) Sampling distribution, 5. In probability sampling, probability of selecting an item from the population is known, and is:, (a) Equal to zero, (b) Non zero, (c) Equal to one, (d) All of the above, 6. Standard deviation of sample mean without replacement__________ standard deviation, of sample mean with replacement:, (a) Less than, (b) More than, (c) 2 times, (d) Equal to, 5.9 LESSON END EXERCISE, 1., , Discuss how sampling errors vary with the size of the sample., ___________________________________________________________, ___________________________________________________________, ___________________________________________________________, 90
___________________________________________________________, ___________________________________________________________, , 5.10 SUGGESTED READING, Levin, Richard and David S Rubin: Statistics for Management, Prentice Hall, Delhi., Hooda, R.P.: Statistics for Business and Economics, Macmillan, New Delhi., Lawrence B. Morse: Statistics for Business and Economics, Harper Collins., Mc Clave, Benson and Sincich: Statistics for Business and Economics, Eleventh, Edition, Prentice Hall Publication., Gupta, S.C.: Fundamentals of Statistics, 7th edition, Himalaya Publishing House, New, Delhi, India., , 92
M.Com. I	Course No. M.Com-115
Unit-II	Lesson No. 6

ASSOCIATION OF ATTRIBUTES

STRUCTURE
6.1 Introduction
6.2 Objectives
6.3 Concept of Association of Attributes
    6.3.1 Difference between Correlation and Association
6.4 Notation and Terminology
    6.4.1 Classes and Class Frequencies
    6.4.2 Order of Classes and Class Frequencies
    6.4.3 Ultimate Class Frequencies
    6.4.4 Contingency Table
    6.4.5 Relationship between the Class Frequencies
6.5 Consistency of Data
6.6 Summary
6.7 Glossary
6.8 Self Assessment Questions
6.9 Lesson End Exercise
6.10 Suggested Reading
6.1 INTRODUCTION

Generally statistics deal with quantitative data only. But in the behavioural sciences, one often deals with variables which are not quantitatively measurable. Literally, an attribute means a quality or characteristic which is not related to quantitative measurement. Examples of attributes are health, honesty, blindness, etc. They cannot be measured directly; the observer may only note the presence or absence of these attributes. Statistics of attributes are therefore based on descriptive characters. An attribute refers to the quality of a characteristic. The theory of attributes deals with qualitative types of characteristics that are studied by means of quantitative measurements, namely counts. Therefore, attributes need slightly different kinds of statistical treatment, which variables do not get. Attributes refer to the characteristics of the item under study, like the habit of smoking or drinking; so 'smoking' and 'drinking' are both examples of an attribute.

In the theory of attributes, the researcher puts more emphasis on quality (rather than on quantity). Since statistical techniques deal with quantitative measurements, qualitative data are converted into quantitative data in the theory of attributes.

There are certain representations that are made in the theory of attributes. The population in the theory of attributes is divided into two classes, namely the negative class and the positive class. The positive class signifies that the attribute is present in that particular item under study, and this class is represented as A, B, C, etc. The negative class signifies that the attribute is not present in that particular item under study, and this class is conventionally represented by Greek letters such as α, β, γ, etc.

The assembling of two attributes, i.e. combining the letters under consideration (such as AB), denotes the joint presence of the two attributes; such classification by presence and absence of attributes is termed dichotomous classification. The number of observations allocated to a class is known as its class frequency. Class frequencies are symbolically denoted by bracketing the attribute terminology, i.e. (B) stands for the class frequency of attribute B. Class frequencies also have orders: a class defined by n attributes is of the nth order. For example, (AB) refers to a class frequency of the 2nd order in the theory of attributes.

There is also the notion of independence in the theory of attributes. Two attributes are said to be independent only if the two attributes are absolutely uncorrelated with each other.
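The class-frequency notation and the independence idea can be made concrete with a small numerical illustration; the counts below are invented, and the comparison rule anticipates the association conditions stated in the next paragraphs:

    # Invented counts for N items classified by attributes A and B
    N = 1000    # total observations
    A = 400     # (A): items possessing attribute A
    B = 300     # (B): items possessing attribute B
    AB = 150    # (AB): items possessing both A and B (a 2nd-order class frequency)

    expected_AB = A * B / N     # (A)(B)/N = 120, the value expected if A and B were independent

    if AB > expected_AB:
        print("A and B are positively associated")   # 150 > 120 here
    elif AB < expected_AB:
        print("A and B are negatively associated")
    else:
        print("A and B are independent")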
Page 95 :
In the theory of attributes, A and B are said to be associated with each other only if the two attributes are not independent but are related to each other in some way or another.

Positive association between the two attributes exists under the following condition:
(AB) > (A) (B) / N

Negative association between the two attributes exists under the following condition:
(AB) < (A) (B) / N

The situation of complete association between the two attributes arises when the occurrence of attribute A is completely dependent upon the occurrence of attribute B, even though attribute B may occur without attribute A; the same holds, with the roles reversed, when B is completely dependent upon A.

Ordinarily, the two attributes are said to be associated if the two occur together in a large number of cases.

Therefore, qualitative characteristics such as deafness, blindness, employment, beauty, hair colour, sex, etc. of the individuals of a universe or population are termed attributes. Attributes are not orderable into a series from least to most or vice versa.

The classification which divides a group into two classes according to one attribute is called classification by dichotomy, or simple classification. The classification which divides the group into more than two classes according to one attribute is called manifold classification. For example, according to the attribute "hair colour" the population of a city may be divided into different classes:

i. Fair-haired people
ii. Red-haired people
iii. Brown-haired people
iv. Black-haired people

If several (more than two) attributes are noted, the process of classification may, however, be continued indefinitely. Such a classification may be called classification as a series

95
Page 96 :
of dichotomies. For example, consider two attributes, namely "blindness" and "deafness". The people of a city may first be divided into two classes according to the attribute 'blindness', and then each of these two classes may further be classified according to the attribute 'deafness'. Ultimately, therefore, we have four classes:

i. The class of blind and deaf people.
ii. The class of blind and non-deaf people.
iii. The class of non-blind and non-deaf people.
iv. The class of non-blind and deaf people.

6.2 OBJECTIVES

The objectives of studying this lesson are to:

•	know the concept of association of attributes.
•	understand how correlation is different from association.
•	learn the terminology used in the study of association.
•	solve exercises based on these concepts.
•	provide the concept of consistency of data.
•	develop the relationship between class frequencies.

6.3 CONCEPT OF ASSOCIATION OF ATTRIBUTES

As pointed out earlier, statistics deals with quantitative phenomena only. However, the characteristic under study may arise in either of the following two ways:

1.	In the first place, we may measure the actual magnitude or size of some phenomenon. For example, we may measure the heights of the students of a class, their weights, etc. Similarly, we may study the wage structure of the workers of a particular factory, or the amount of rainfall in a year. Phenomena of this type are known as statistics of variables. The various statistical techniques like measures of central tendency, dispersion, correlation, etc. deal with such variables.

2.	In the second place, there are certain phenomena, like blindness, deafness, etc., which are not capable of direct quantitative measurement. In such cases the quantitative character arises only indirectly, in the process of counting. For example, we can determine, out of 1000 persons, how many are blind and how many are not

96
Page 97 :
blind, but we cannot precisely measure blindness. Such phenomena, where direct quantitative measurement is not possible, i.e., where we can study only the presence or absence of a particular characteristic, are called statistics of attributes.

6.3.1 Difference between Correlation and Association

The tool of correlation is used to measure the degree of relationship between two such phenomena as are capable of direct quantitative measurement. On the other hand, the method of association of attributes is employed to measure the relationship between two phenomena whose size we cannot measure and where we can only determine the presence or absence of a particular attribute.

While dealing with statistics of attributes we have to classify the data. The classification is done on the basis of the presence or absence of a particular attribute or characteristic. When we are studying only one attribute, two classes are formed: one possessing that attribute and another not possessing it. For example, when we are studying the attribute employment, two classes shall be formed, i.e., those who are employed and those who are not employed. When two attributes are studied, four classes shall be formed. If, besides employment, we study the gender-wise distribution, four classes shall be formed: number of males employed, number of females employed, number of males unemployed and number of females unemployed.

It should be noted that in some cases, while classifying the attributes, no clear-cut definition of an attribute and line of demarcation between classes can be drawn. For example, when the attribute 'employment' is being studied, the data are classified into 'employed' and 'unemployed'. But there can be a further category of those people who are partially employed (i.e. part-time). Also there may be some persons who were employed before the survey but on the date of the survey are unemployed. We cannot treat them simply as employed or as unemployed, because there is some difference between those persons who have not got any job and those who had got some job but were retrenched after some time. Hence, it is absolutely essential to lay down clear-cut definitions of the various attributes under study. This is often a difficult task, and this limitation must be kept in mind while studying association between attributes.

97
Page 98 :
6.4 NOTATION AND TERMINOLOGY

To develop the theory of attributes it is necessary to have some standard notation for the attributes, the classes formed and the observations assigned to each of them. Therefore, for the sake of simplicity and convenience it is customary to use certain symbols to represent the different classes and their frequencies. Capital letters A and B are used to represent the presence of attributes, which may be called positive attributes, whereas the Greek letters α (alpha) and β (beta) are generally used to represent the absence of the attributes A and B, also called negative attributes. Thus 'α' = not A and 'β' = not B, so the classes A and α, and B and β, are complementary to each other. A combination of attributes is represented simply by the juxtaposition of letters. For example, if A denotes blindness and B denotes deafness, then AB denotes blindness and deafness. Similarly, if A denotes 'married', B denotes 'man' and C denotes 'left-handed', then:

Aβ	denotes married women
AB	denotes married men
αB	denotes unmarried men
αC	denotes unmarried left-handed persons
ABC	denotes married left-handed men
Aβγ	denotes married right-handed women

Any combination of letters, say Aβ, AB, αβ, αB, etc., by means of which we specify the characteristics of the members of a class, may be called a class symbol. The collection of all individuals is denoted by N. Further, if A represents males then α would represent females; similarly, if B represents literates then β would represent illiterates.

Let us look at the combinations:

(AB): number of literate males
(Aβ): number of illiterate males
(αB): number of literate females

98
Page 99 :
(αβ): number of illiterate females

6.4.1 Classes and Class Frequencies

Different attributes, their sub-groups and their combinations are called different classes, and the numbers of observations assigned to them are called their class frequencies. If two attributes are studied the number of classes will be 9, i.e., (A), (α), (B), (β), (AB), (Aβ), (αB), (αβ) and N.

The number of observations or units belonging to each class is known as its frequency and is denoted within brackets. Thus (A) stands for the frequency of A, and (AB) stands for the number of objects possessing both the attributes A and B. Further, class frequencies of the type (A), (B), (AB), (AC), (BC), (ABC), etc., which involve only positive attributes, are called positive frequencies. Class frequencies of the type (α), (β), (αβ), (αγ), (αβγ), etc., which involve only negative attributes, are called negative frequencies. Class frequencies of the type (Aβ), (αB), (ABγ), (αBC), etc., which involve a mixture of positive and negative attributes, are called contrary frequencies.

6.4.2 Order of Classes and Class Frequencies

A class represented by n attributes is called a class of the nth order, and the corresponding frequency a frequency of the nth order. Thus (A), (B), (α), (β), etc. are class frequencies of order 1; (AB), (αB), (Aβ), (αβ), etc. are class frequencies of the second order; (ABC), (ABγ), (αβγ), etc. are frequencies of the third order, and so on. N, the total number of members of the population, without any specification of attributes, is known as the frequency of zero order. Thus, the order of a class and of its class frequency depends upon the number of attributes assigned to that class.

In general, the following rules are used to determine the number of class frequencies (these counts are also verified in the short sketch given after Section 6.4.4 below):

Rule 1: With n attributes there are in all 2ⁿ positive classes.

Rule 2: With n attributes the total number of classes is 3ⁿ.

That is,

i. For one attribute, there are three (3¹ = 3) frequencies.

99
Page 100 :
ii. For two attributes, there are 9 (3² = 9) frequencies.

iii. For three attributes, there are 27 (3³ = 27) frequencies.

6.4.3 Ultimate Class Frequencies

The classes specified by all n attributes, i.e. those of the highest order, are called ultimate classes and their frequencies the ultimate class frequencies. Every class frequency can be expressed as the sum of ultimate class frequencies, so if these are given the data are completely determined. If there are n attributes, then there will be 2ⁿ ultimate class frequencies. If two attributes are studied, the number of classes of the ultimate order is 2² = 4; in the case of three attributes there are 2³ = 8 classes of the ultimate order.

Therefore, for two attributes A and B there are 2 × 2 = 4 ultimate class frequencies, namely (AB), (Aβ), (αB) and (αβ). In the case of three attributes, say A, B and C, the number of ultimate class frequencies is 2³ = 8, i.e., (ABC), (ABγ), (AβC), (Aβγ), (αBC), (αBγ), (αβC) and (αβγ).

Also, the total number of ultimate class frequencies equals the total number of positive class frequencies.

6.4.4 Contingency Table

A table which represents the classification according to the distinct classes of two characteristics A and B is called a two-way contingency table. It is a table in matrix format that displays the (multivariate) frequency distribution of the variables; it provides a basic picture of the interrelation between two variables and can help find interactions between them. Suppose the attribute A has m distinct classes denoted by A1, A2, ..., Am and the attribute B has n distinct classes B1, B2, ..., Bn; then there are in all m × n distinct classes (called cells) in the contingency table. In the contingency table, the totals of the various rows A1, A2, ..., Am and the totals of the various columns B1, B2, ..., Bn give the first-order frequencies, and the cells contain the frequencies of the second order. The grand total of all frequencies gives the total number of observations, i.e. N. The contingency table can therefore be laid out with the classes of A as rows, the classes of B as columns, the cell frequencies (AiBj) in the body of the table, the row totals (Ai) and column totals (Bj) in the margins, and N as the grand total.

100
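The two counting rules of Section 6.4.2 and the list of ultimate classes in Section 6.4.3 are easy to verify by brute-force enumeration. The short Python sketch below is only an illustration added here for checking the counts; the function name and the use of lower-case letters in place of the Greek symbols are our own conventions, not part of the lesson.

```python
from itertools import product

def class_symbols(n):
    """Enumerate the class symbols for n attributes.

    Each attribute may be recorded as present (capital letter), absent
    (lower-case letter standing in for the Greek symbol), or left out of
    the specification altogether, which gives 3**n classes in all; the
    classes with nothing left out are the 2**n ultimate classes.
    """
    letters = "ABC"[:n]
    all_classes, ultimate = [], []
    for choice in product(("present", "absent", "omitted"), repeat=n):
        symbol = "".join(
            L if c == "present" else L.lower() if c == "absent" else ""
            for L, c in zip(letters, choice)
        )
        all_classes.append(symbol or "N")     # empty specification = N itself
        if "omitted" not in choice:
            ultimate.append(symbol)
    return all_classes, ultimate

all_classes, ultimate = class_symbols(2)
print(len(all_classes), all_classes)   # 9 classes for two attributes (3**2 = 9)
print(len(ultimate), ultimate)         # 4 ultimate classes (2**2 = 4)
```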
Page 101 :
The classification by dichotomy with two attributes A and B is generally known as a 2 × 2 (read as two by two) contingency table. The contingency table of order (2 × 2) for the two attributes A and B can be displayed as given below:

		B	β	Total
	A	(AB)	(Aβ)	(A)
	α	(αB)	(αβ)	(α)
	Total	(B)	(β)	N

With the help of the 2 × 2 contingency table, we can relate the ultimate class frequencies to the positive class frequencies for two attributes:

(A) + (α) = N;  (B) + (β) = N;  (AB) + (Aβ) = (A);  (AB) + (αB) = (B);  (αB) + (αβ) = (α);  (Aβ) + (αβ) = (β)

6.4.5 Relationship between the Class Frequencies

All the class frequencies of the various orders are not independent of each other, and any class frequency can always be expressed in terms of class frequencies of higher order. Thus, for two attributes,

(A) = (AB) + (Aβ),  (α) = (αB) + (αβ),  (B) = (AB) + (αB),  (β) = (Aβ) + (αβ),
N = (A) + (α) = (B) + (β) = (AB) + (Aβ) + (αB) + (αβ).

101
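Because every class frequency is a sum of ultimate class frequencies, the whole nine-square table for two attributes can be rebuilt from (AB), (Aβ), (αB) and (αβ) alone. The following is a minimal Python sketch of this reconstruction, written under our own naming ('Ab', 'aB' and 'ab' stand for the classes written with α and β in the text); it is an illustration, not part of the original lesson.

```python
def nine_square(AB, Ab, aB, ab):
    """Rebuild all nine class frequencies of the 2 x 2 (nine-square) table
    from the four ultimate class frequencies.

    Lower-case 'a' and 'b' stand in for the Greek alpha and beta used in
    the lesson (absence of A and of B).
    """
    A, a = AB + Ab, aB + ab          # first-order frequencies
    B, b = AB + aB, Ab + ab
    N = A + a                        # zero-order frequency (whole population)
    assert N == B + b                # (A) + (alpha) = (B) + (beta) = N
    return {"N": N, "A": A, "a": a, "B": B, "b": b,
            "AB": AB, "Ab": Ab, "aB": aB, "ab": ab}

# Illustration with the frequencies worked out in Example 1 of Section 6.5 below:
print(nine_square(AB=85, Ab=335, aB=585, ab=1495))
# {'N': 2500, 'A': 420, 'a': 2080, 'B': 670, 'b': 1830, 'AB': 85, ...}
```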
Page 103 :
"Any class frequency can be expressed as the sum of the 2ⁿ ultimate class frequencies."

Therefore, in a dichotomous classification of attributes, the data can be specified completely by:

i. the set of all the ultimate class frequencies, or
ii. the set of all the positive class frequencies.

By this we mean that if we know all the ultimate class frequencies or all the positive class frequencies, then we can obtain the frequencies of all other classes of different orders.

6.5 CONSISTENCY OF DATA

In order to find out whether the given data are consistent or not we have to apply a very simple test: find out whether any one or more of the ultimate class frequencies is negative. If none of the ultimate class frequencies is negative we can safely conclude that the given data are consistent (i.e. the frequencies do not conflict with each other in any way). On the other hand, if any of the ultimate class frequencies comes out to be negative, the given data are inconsistent.

Example 1: Given N = 2500, (A) = 420, (AB) = 85 and (B) = 670. Find the missing values.

Solution: We know that

(1) N = (A) + (α) = (B) + (β)
(2) (A) = (AB) + (Aβ)
(3) (α) = (αB) + (αβ)
(4) (B) = (AB) + (αB)
(5) (β) = (Aβ) + (αβ)

From (2): 420 = 85 + (Aβ), so (Aβ) = 420 – 85 = 335
From (4): 670 = 85 + (αB), so (αB) = 670 – 85

103
Page 104 :
(αB) = 585

From (1): 2500 = 420 + (α), so (α) = 2500 – 420 = 2080
From (1): (β) = 2500 – 670 = 1830
From (3): 2080 = 585 + (αβ), so (αβ) = 1495

Example 2: Test the consistency of the following data, with the symbols having their usual meaning: N = 1000, (A) = 600, (B) = 500, (AB) = 50.

Solution: We are given:
N = 1000; (A) = 600; (B) = 500; (AB) = 50

We find the missing class frequencies with the help of the contingency table as under:

	Attributes	B			β			Total
	A		(AB) = 50		(Aβ) = 600–50 = 550	(A) = 600
	α		(αB) = 500–50 = 450	(αβ) = 400–450 = –50	(α) = 1000–600 = 400
	Total		(B) = 500		(β) = 1000–500 = 500	N = 1000

Since (αβ) = –50, the given data are inconsistent.

Example 3: Examine the consistency of the given data: N = 60; (A) = 51; (B) = 32; (AB) = 25.

Solution: We are given:
N = 60; (A) = 51; (B) = 32; (AB) = 25

We find the missing class frequencies with the help of the contingency table as under:

	Attributes	B			β			Total
	A		(AB) = 25		(Aβ) = 51–25 = 26	(A) = 51
	α		(αB) = 32–25 = 7	(αβ) = 9–7 = 2		(α) = 60–51 = 9
	Total		(B) = 32		(β) = 60–32 = 28	N = 60

Since none of the ultimate class frequencies is negative, the given data are consistent.

104
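The consistency test just applied in Examples 2 and 3 is mechanical: derive the ultimate class frequencies and see whether any of them is negative. Below is a small Python sketch of the check, assuming the data are given as N, (A), (B) and (AB); the function and variable names are our own and are only meant as an illustration.

```python
def check_consistency(N, A, B, AB):
    """Derive the ultimate class frequencies from N, (A), (B) and (AB)
    and report whether the data are consistent (no negative frequency).

    Lower-case 'a' and 'b' stand in for alpha and beta.
    """
    Ab = A - AB              # (A beta)
    aB = B - AB              # (alpha B)
    ab = N - A - B + AB      # (alpha beta)
    ultimate = {"(AB)": AB, "(Ab)": Ab, "(aB)": aB, "(ab)": ab}
    return ultimate, all(v >= 0 for v in ultimate.values())

print(check_consistency(N=1000, A=600, B=500, AB=50))
# ({'(AB)': 50, '(Ab)': 550, '(aB)': 450, '(ab)': -50}, False)  -> inconsistent (Example 2)
print(check_consistency(N=60, A=51, B=32, AB=25))
# ({'(AB)': 25, '(Ab)': 26, '(aB)': 7, '(ab)': 2}, True)        -> consistent (Example 3)
```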
Page 105 :
6.6 SUMMARY

In this lesson we introduced the concept of association of attributes and became familiar with class frequencies, ultimate class frequencies and the contingency table. We also developed the relationships between class frequencies and learnt the concept of consistency of data.

6.7 GLOSSARY

•	Attributes: Qualitative characteristics are termed attributes.
•	Positive attributes: The presence of an attribute is termed a positive attribute.
•	Class Frequency: The number of observations assigned to any class is called its class frequency.
•	Negative Attributes: The absence of an attribute is called a negative attribute.
•	Ultimate Class Frequency: A class frequency specified by the highest order is called an ultimate class frequency.
•	Consistency: The necessary and sufficient condition for the consistency of a set of class frequencies is that none of the ultimate class frequencies should be negative.

6.8 SELF ASSESSMENT QUESTIONS

1. Fill in the blanks:

a)	Association of attributes helps us to study the relationship between phenomena which are of .................. nature.

105
Page 106 :
b)	(AB) denotes the number of individuals possessing attributes ..................

c)	When we study two attributes, the total number of frequencies is ........................

d)	The order of classes depends upon the number of ................ under study.

Tick (✓) the correct option:

2. Which of the following describes the middle part of a group of numbers?
a) Measure of Variability	b) Measure of Central Tendency
c) Measure of Association	d) Measure of Shape

3. If an attribute has two classes, it is called:
a) Trichotomy	b) Simple classification
c) Dichotomy	d) Manifold classification

4. If an attribute has more than two classes, it is said to be:
a) Manifold classification	b) Trichotomy
c) Dichotomous	d) All of the above

5. The total of all frequencies, N, is of order:
a) Zero	b) One
c) Two	d) Three

106
Page 107 :
6. In case of consistent data, no class frequency can be:
a) Positive	b) Negative
c) Both (a) and (b)	d) Neither (a) nor (b)

7. With two attributes A and B, the total number of ultimate class frequencies is:
a) Two	b) Four
c) Six	d) None

6.9 LESSON END EXERCISE

1.	What is the difference between variables and attributes?
	___________________________________________________________
	___________________________________________________________
	___________________________________________________________

2.	Differentiate between correlation and association.
	___________________________________________________________
	___________________________________________________________
	___________________________________________________________

3.	Explain consistency of data.
	___________________________________________________________
	___________________________________________________________
	___________________________________________________________

107
Page 108 :
___________________________________________________________________

4.	For three attributes A, B and C, write down all the class frequencies of order zero, one, two and three.
	___________________________________________________________
	___________________________________________________________
	___________________________________________________________

5.	Is there any inconsistency in the data given below?
	i. N = 1000, (A) = 150, (B) = 300, (AB) = 200
	ii. N = 1000, (A) = 50, (B) = 60, (AB) = 20
	___________________________________________________________
	___________________________________________________________
	___________________________________________________________

6.	From the following given frequencies, find the other missing frequencies:
	(AB) = 250, (Aβ) = 120, (αB) = 200, (αβ) = 70
	___________________________________________________________
	___________________________________________________________
	___________________________________________________________

108
Page 109 :
6.10 SUGGESTED READING

•	Gupta, S.P.: Statistical Methods, Sultan Chand & Sons, New Delhi.
•	Gupta, S.C. and V.K. Kapoor: Fundamentals of Applied Statistics.
•	Levin, Richard and David S Rubin: Statistics for Management, Prentice Hall, Delhi.
•	Levin and Brevson: Business Statistics, Pearson Education, New Delhi.
•	Hooda, R.P.: Statistics for Business and Economics, Macmillan, New Delhi.

109
Page 110 :
M.Com. I					Course No. M.Com-115
Unit-II						Lesson No. 7

ASSOCIATION AND DISASSOCIATION

STRUCTURE
7.1	Introduction
7.2	Objectives
7.3	Association and Disassociation
	7.3.1	Independence of Attributes
	7.3.2	Association of Attributes
	7.3.3	Partial Association
7.4	Summary
7.5	Glossary
7.6	Self Assessment Questions
7.7	Lesson End Exercise
7.8	Suggested Readings

110
Page 111 :
7.1 INTRODUCTION

Association is when one is into something, connected with it, being an active participant. Dissociation is when one is outside something, detached from it, observing it. A very simple principle in life is that it is probably a good idea to be associated with the things one wants and dissociated from the things one does not want. This is basically what we try to accomplish in processing: more of what the person wants, and less of what he or she does not want. Negative material is no longer negative once we dig into it and find out what it is really about, and any experience has a valid reason for being there; so it is not as simple as just getting rid of the bad. But there is no reason one should feel forced to endure continuously the worst effects of what is there in one's life.

Two attributes, A and B, are usually defined to be independent, within any given field of observation or "universe", when the chance of finding them together is the product of the chances of finding either of them separately. The physical meaning of the definition is perhaps clearer in a different form of statement, viz., A and B are independent when the proportion of A's amongst the B's of the given universe is the same as in that universe at large. If, for instance, the question were put "What is the test for independence of small-pox attack and vaccination?", the natural reply would be "The percentage of vaccinated amongst the attacked should be the same as in the general population", or "The percentage of attacked amongst the vaccinated should be the same as in the general population." Further, the two attributes are termed positively or negatively associated according as (AB) is greater or less than the value it would have in the case of independence, or, to put the same thing another way, according as (AB)/(A) is greater or less than (B)/N, or (AB)/(B) is greater or less than (A)/N.

7.2 OBJECTIVES

After studying this lesson, you would be able to:

•	explain the concept of association.
•	differentiate between association and disassociation.
•	explain independence of attributes.

111
Page 112 :
7.3 ASSOCIATION AND DISASSOCIATION

The word association, as used in statistics, has a technical meaning different from the one in ordinary speech. In common language one speaks of A and B as being 'associated' if they simply appear together in a number of cases. In statistics, however, A and B are associated only if they appear together in a greater number (or proportion) of cases than is to be expected if they were independent. On the other hand, if this number (or proportion) is less than that expected under independence, they are disassociated. Thus, in terms of the second-order frequencies, A and B are:

i. associated if (AB) (αβ) > (Aβ) (αB); and
ii. disassociated if (AB) (αβ) < (Aβ) (αB).

Hence it should carefully be noted that association cannot be inferred from the mere fact that some A's are B's, however great that proportion may be.

7.3.1 Independence of Attributes

Attributes are said to be independent if the presence or absence of one attribute does not affect the presence or absence of the other. For example, the attributes skin colour and intelligence of persons are independent.

If two attributes A and B are independent then the actual frequency is equal to the expected frequency, i.e.,

(AB) = (A) . (B) / N

Similarly, (αβ) = (α) . (β) / N

Two attributes are said to be independent if there does not exist any relationship between them.

The criteria of independence of two attributes are:

Criterion 1 (Proportion method): In this method, we compare the presence or absence of a given attribute within the other. If two attributes A and B are independent, then we would expect:

(a) the same proportion of A's amongst the B's as amongst the β's;
(b) the same proportion of B's amongst the A's as amongst the α's.

112
Page 113 :
For example, if the attributes 'intelligence' and 'beauty' are independent, then the proportion of intelligent persons among beautiful and non-beautiful persons must be the same or, conversely, the proportion of beautiful persons amongst intelligent and unintelligent persons must be the same.

Symbolically, if attributes A and B are independent, the criteria in (a) and (b) respectively imply that:

(AB)/(B) = (Aβ)/(β)
1 – (AB)/(B) = 1 – (Aβ)/(β)
[(B) – (AB)]/(B) = [(β) – (Aβ)]/(β)
(αB)/(B) = (αβ)/(β)

and

(AB)/(A) = (αB)/(α)
1 – (AB)/(A) = 1 – (αB)/(α)
[(A) – (AB)]/(A) = [(α) – (αB)]/(α)
(Aβ)/(A) = (αβ)/(α)

Criterion 2: This criterion of independence expresses the ultimate class frequencies in terms of the frequencies of the first order. We get:

(AB) = (A) × (B)/N, i.e. (AB)/N = [(A)/N] × [(B)/N],

which leads to the following fundamental rule:

"If attributes A and B are independent, the proportion of AB's in the population is equal to the product of the proportion of A's and the proportion of B's in the population."

Criterion 3: This criterion of independence of two attributes is based on the class frequencies of the second order. If A and B are independent, then by Criterion 1 we have:

(AB)/(B) = (Aβ)/(β) and (αB)/(B) = (αβ)/(β)

Dividing these equations, we get:

113
Page 114 :
(AB)/(αB) = (Aβ)/(αβ), i.e. (AB)(αβ) = (Aβ)(αB)

Example 1: In a sample of 1000 individuals, 100 possess the attribute A and 300 possess the attribute B. A and B are independent. How many individuals possess both A and B, and how many possess neither?

Solution: In the usual notation, we are given: (A) = 100; (B) = 300; N = 1000

(α) = N – (A) = 1000 – 100 = 900;  (β) = N – (B) = 1000 – 300 = 700

Since A and B are independent, we have (AB) = (A) × (B)/N = 100 × 300/1000 = 30.

Hence, the number of persons possessing both the attributes A and B is 30.

Since A and B are independent, α and β are also independent.

Therefore, (αβ) = (α) × (β)/N = 900 × 700/1000 = 630.

Hence, the number of persons who possess neither of the attributes A and B is 630.

Example 2: In an analysis of two attributes, if N = 160, (A) = 96 and (B) = 50, find the frequencies of the remaining classes on the assumption that A and B are independent.

Solution: Since the attributes A and B are independent (given),

(AB) = (A) (B)/N = 96 × 50/160 = 30

The remaining frequencies can now be obtained on completing the nine-square table as given below:

		A		α		Total
	B	(AB) = 30	(αB) = 20	(B) = 50
	β	(Aβ) = 66	(αβ) = 44	(β) = 110
	Total	(A) = 96	(α) = 64	N = 160

Therefore, (α) = 64; (β) = 110; (αB) = 20; (Aβ) = 66; (αβ) = 44

114
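The expected-frequency reasoning used in Examples 1 and 2 amounts to multiplying marginal proportions. The following short Python sketch is only an illustration under our own naming, not part of the lesson:

```python
def expected_cells(N, A, B):
    """Expected ultimate class frequencies when attributes A and B are
    independent: each cell is the product of its marginal proportions
    times N. Lower-case 'a' and 'b' stand in for alpha and beta."""
    a, b = N - A, N - B
    return {"(AB)": A * B / N, "(Ab)": A * b / N,
            "(aB)": a * B / N, "(ab)": a * b / N}

print(expected_cells(N=1000, A=100, B=300))
# {'(AB)': 30.0, '(Ab)': 70.0, '(aB)': 270.0, '(ab)': 630.0}   (Example 1)
print(expected_cells(N=160, A=96, B=50))
# {'(AB)': 30.0, '(Ab)': 66.0, '(aB)': 20.0, '(ab)': 44.0}     (Example 2)
```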
Page 115 :
7.3.2 Association of Attributes

Two attributes A and B are said to be associated if they are not independent but are related with each other in some way or other.

The attributes A and B are said to be positively associated if (AB) > (A).(B)/N.

If (AB) < (A).(B)/N, then they are said to be negatively associated.

Example 3: Show whether A and B are independent, positively associated or negatively associated, given (AB) = 128, (αB) = 384, (Aβ) = 24 and (αβ) = 72.

Solution:

(A) = (AB) + (Aβ) = 128 + 24 = 152
(B) = (AB) + (αB) = 128 + 384 = 512
(α) = (αB) + (αβ) = 384 + 72 = 456
N = (A) + (α) = 152 + 456 = 608

Here, (AB) = 128, (A) = 152, (B) = 512, N = 608

(A) × (B)/N = 152 × 512/608 = 128

Since (AB) = 128 = (A) × (B)/N, A and B are independent.

115
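Section 7.3.2 reduces to comparing the observed (AB) with (A)(B)/N. The minimal sketch below is our own illustration of that comparison; it reproduces the conclusion of Example 3:

```python
def nature_of_association(AB, Ab, aB, ab):
    """Classify two attributes from their ultimate class frequencies by
    comparing the observed (AB) with its value under independence.
    Lower-case 'a' and 'b' stand in for alpha and beta."""
    A, B = AB + Ab, AB + aB
    N = AB + Ab + aB + ab
    expected = A * B / N
    if AB > expected:
        return "positively associated"
    if AB < expected:
        return "negatively associated"
    return "independent"

print(nature_of_association(AB=128, Ab=24, aB=384, ab=72))   # independent (Example 3)
```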
Page 116 :
7.3.3 Partial Association

So far we have discussed the association of A and B in the universe as a whole, without considering the other attributes present in the universe. It is possible, however, that there is no direct relationship between A and B, i.e., the association between attributes A and B may be due to their association with a third attribute, say C. Thus, if A is positively associated with C and B is positively associated with C, A may be found to be positively associated with B. But this type of association between A and B is not direct; it is the effect of their association with the third attribute C. To find out whether the association between A and B is due to a third attribute C, it is necessary to study the association of A and B separately in the sub-populations C and γ. If A and B are associated in both the sub-populations C and γ, it would indicate that A and B are really associated with each other. The associations between A and B in the sub-populations are called partial associations, to distinguish them from the total association between A and B in the population at large. The following example illustrates the concept of partial association clearly:

An association is observed between vaccination and prevention from attack by small-pox; it suggests that vaccination prevents attack of small-pox. However, on a detailed analysis one may find that the attributes vaccination and attack of small-pox are not directly associated; the association between them is due to a third factor, namely economic condition. Those people who are economically well-off live in better conditions and open houses, get better food and nutrition and can afford the cost of vaccination, and as such the possibility of their getting an attack of small-pox is less. On the other hand, those who are poor live in filthy conditions, dirty surroundings and dirty houses, and because of illiteracy do not believe in vaccination; they cannot afford the cost of vaccination either. As such they are liable to suffer more from disease. If we denote vaccination by A, small-pox by B and economic condition by C, we may find that there is positive association between A and C and also between B and C. Hence, in order to arrive at correct conclusions it is necessary that, on the basis of economic condition, the population is divided into two parts, rich (C) and poor (γ), and in each sub-population the association is ascertained between vaccination (A) and prevention from small-pox (B). If this third attribute is ignored it will give rise to misleading conclusions.

116
Page 117 :
The association found between the attributes A and B in the universe of C's and in the universe of γ's is termed partial association, to distinguish it from the total association found between A and B in the universe at large. The methods of finding out partial association are the same as those used in finding the total association; the only difference is that we have to find out separately the association of A and B within C and within γ.

Further, in order to ascertain whether two attributes are associated or not, we are going to apply the following methods in the subsequent lessons:

I.	Comparison of Observed and Expected Frequencies Method (or simply the Comparison Method)
II.	Proportion Method
III.	Yule's Coefficient of Association
IV.	Coefficient of Colligation
V.	Coefficient of Contingency

7.4 SUMMARY

The simplest possible form of statistical classification is "division" (as the logicians term it) "by dichotomy", i.e. the sorting of the objects or individuals observed into one or other of two mutually exclusive classes according as they do or do not possess some character or attribute; as one may divide men into sane and insane, the members of a species of plants into hairy and glabrous, or the members of a race of animals into males and females. The mere fact that we employ such a classification in any case must not, of course, be held to imply a natural and clearly defined boundary between the two classes; e.g. sanity and insanity, hairiness and glabrousness, may pass into each other by such fine gradations that judgments may differ as to the class in which a given individual should be entered. The judgment must, however, be finally decisive; intermediates are not classed as such even when observed. The theory of statistics of this kind is of a good deal of importance, not merely because such statistics are of a fairly common type (the statistics of hybridisation experiments given by the followers of Mendel may be cited as examples), but because the ideas and conceptions required in such theory form a useful introduction to the more complex and less purely logical theory of variables. In this lesson we introduced the concept of association and disassociation with practical examples. We also studied the concept of independence of attributes.

117
Page 118 :
7.5 GLOSSARY

•	Association: In the theory of attributes, the attributes A and B are said to be associated with each other only if the two attributes are not independent but are related to each other in some way or another.

•	Disassociation: Dissociation is when one is outside something, detached from it, observing it.

•	Frequency: In statistics, the frequency (or absolute frequency) of an event is the number of times the event occurred in an experiment or study.

•	Class: The population in the theory of attributes is divided into two classes, namely the negative class and the positive class. The positive class signifies that the attribute is present in that particular item under study, and this class is represented as A, B, C, etc.

7.6 SELF ASSESSMENT QUESTIONS

Tick the correct answer:

1. The value of the chi-square statistic is always:
(a) Negative	(b) Zero	(c) Non-negative	(d) One

2. The shape of the chi-square distribution depends upon:
(a) Parameters	(b) Degrees of freedom	(c) Number of cells	(d) Standard deviation

3. For a 3 × 3 contingency table, the number of cells in the table is:
(a) 3	(b) 6	(c) 9	(d) 4

4. If all the class frequencies are the same, the value of chi-square is:
(a) Zero	(b) One	(c) Infinite	(d) All of the above

5. The eye colour of 100 women is:
(a) A variable	(b) A constant	(c) An attribute	(d) Discrete

6. If two attributes A and B have perfect positive association, the value of the coefficient of association is equal to:
(a) +1	(b) –1	(c) 0	(d) (r–1)(c–1)

118
Page 119 :
7.7 LESSON END EXERCISE

1.	With three attributes A, B and C, write down:
	i. the number of positive class frequencies;
	ii. the number of ultimate class frequencies;
	iii. the number of all the class frequencies;
	iv. all the class frequencies in symbols.
	_________________________________________________________________
	_________________________________________________________________
	_________________________________________________________________

2.	Find the missing frequencies from the following frequencies:
	N = 1000, (A) = 877, (B) = 1086
	_________________________________________________________________
	_________________________________________________________________
	_________________________________________________________________

3.	Among the adult population of a certain town 50% are male, 60% are wage earners and 50% are 45 years of age or above; 10% of the males are not wage earners and 40% of the males are under 45 years of age. Can we infer anything about the percentage of the population aged 45 or over who are wage earners?
	_________________________________________________________________
	_________________________________________________________________
	_________________________________________________________________

119
Page 120 :
7.8 SUGGESTED READINGS, •, , Gupta, S.P.: Statistical Methods, Sultan Chand & Sons, New Delhi., , •, , Gupta, S.C. and V.K. Kapoor : Fundamentals of Applied Statistics., , •, , Anderson, Sweeney and Williams: Statistics for Business and Economics-Thompson,, New Delhi., , •, , Hooda, R.P.: Statistics for Business and Economics, Macmillan, New Delhi., , •, , Lawrence B. Morse: Statistics for Business and Economics, Harper Collins., , 120
Page 121 :
M.Com. I					Course No. M.Com-E-115
Unit-II						Lesson No. 8

METHODS OF ATTRIBUTES

STRUCTURE
8.1	Introduction
8.2	Objectives
8.3	Comparison Method
	8.3.1	Limitations
8.4	Proportion Method
	8.4.1	Limitations
8.5	Summary
8.6	Glossary
8.7	Self Assessment Questions
8.8	Lesson End Exercise
8.9	Suggested Readings

8.1 INTRODUCTION

When data are collected on the basis of some attribute or attributes, we have what is commonly termed statistics of attributes. It is not necessary that the objects possess only one attribute; rather, it will often be found that the objects possess more than one attribute. In such a situation our interest may lie in knowing whether the attributes are associated with each other or not. Technically, we say that two attributes are associated if they appear together in a greater number of cases than is to be expected if they are independent,

121
Page 122 :
and not simply on the basis that they appear together in a number of cases, as is done in ordinary life. The association may be positive or negative (negative association is also known as disassociation). Independence, positive association, negative association, etc. are known as the nature of association. Therefore, in this lesson our purpose is to explore the nature of association between two attributes by using the comparison and proportion methods.

8.2 OBJECTIVES

After studying this lesson, you would be able to:

•	assess what kind of association among attributes is likely to occur.
•	find the nature of association between two attributes by using the comparison and proportion methods.

8.3 COMPARISON METHOD

This method is also known as the method of comparison of observed and expected frequencies, because when the comparison method is applied the actual observations are compared with the expected observations. If the actual observation is equal to the expectation, the attributes are said to be independent; if the actual observation is more than the expectation, the attributes are said to be positively associated; and if the actual observation is less than the expectation, the attributes are said to be negatively associated.

Symbolically, attributes A and B are:

(i) independent if (AB) = (A) × (B)/N
(ii) positively associated if (AB) > (A) × (B)/N
(iii) negatively associated if (AB) < (A) × (B)/N

The same is true for the attributes α and B, α and β, and A and β. Thus, the attributes α and β are called:

122
Page 123 :
(i) independent if (αβ) = (α) × (β)/N
(ii) positively associated if (αβ) > (α) × (β)/N
(iii) negatively associated if (αβ) < (α) × (β)/N

Example 1: From the following data, find out whether A and B are independent, associated or disassociated: N = 100, (A) = 40, (B) = 80, (AB) = 30.

Solution: By the comparison method, the attributes shall be:

(i) independent if (AB) = (A) × (B)/N
(ii) positively associated if (AB) > (A) × (B)/N
(iii) negatively associated or disassociated if (AB) < (A) × (B)/N

Here, (AB) = 30, (A) = 40, (B) = 80 and N = 100

(A) × (B)/N = 40 × 80/100 = 32

Thus, (AB) = 30 is less than (A) × (B)/N = 32.

Hence, the attributes A and B are disassociated.

Example 2: From the following ultimate class frequencies, find the frequencies of the positive and negative classes and the total number of observations:

123
Page 124 :
(AB) = 100, (αB) = 80, (Aβ) = 50, (αβ) = 40.

Solution: By putting these values in the nine-square table, we can find the desired information.

		A	α	Total
	B	100	80	180
	β	50	40	90
	Total	150	120	270

N = (AB) + (Aβ) + (αB) + (αβ) = 100 + 50 + 80 + 40 = 270
(A) = (AB) + (Aβ) = 100 + 50 = 150
(B) = (AB) + (αB) = 100 + 80 = 180
(α) = (αB) + (αβ) = 80 + 40 = 120
(β) = (Aβ) + (αβ) = 50 + 40 = 90

8.3.1 Limitations

With the help of this method we can only determine the nature of association (i.e., whether there is positive or negative association or no association) and not the degree of association (i.e., whether the association is high or low). Yule's coefficient is superior because it provides information not only on the nature but also on the degree of association. (A short computational sketch of this method, together with the proportion method, follows Example 4 below.)

8.4 PROPORTION METHOD

If there is no relationship of any kind between two attributes A and B, we expect to find the same proportion of A's amongst the B's as amongst the β's. Thus, if a coin is tossed, we expect the same proportion of heads irrespective of whether the coin is tossed by the right hand or the left hand.

124
Page 125 :
Symbolically, two attributes may be termed as:

(i) independent if (AB)/(B) = (Aβ)/(β)
(ii) positively associated if (AB)/(B) > (Aβ)/(β)
(iii) negatively associated or disassociated if (AB)/(B) < (Aβ)/(β)

This case is applicable when the proportion is taken with respect to B.

Similarly, when the proportion is taken with respect to A, the attributes are:

(i) independent if (AB)/(A) = (αB)/(α)
(ii) positively associated if (AB)/(A) > (αB)/(α)
(iii) negatively associated or disassociated if (AB)/(A) < (αB)/(α)

Example 3: In a population of 500 students the number of married students is 200. Out of 150 students who failed, 60 belonged to the married group. It is required to find out whether the attributes marriage and failure are independent, positively associated or negatively associated.

Solution: Let A denote married students, so that α represents unmarried students. Let B denote failure, so that β represents success, and let N represent the total number of students.

On the basis of the information given in the question we have:

N = 500, (A) = 200, (B) = 150, (AB) = 60.

Applying the proportion method, the attributes A and B shall be independent if (AB)/(A) = (αB)/(α).

125
Page 126 :
In other words, if the proportion of married students who failed is the same as the proportion of unmarried students who failed, we can say that the attributes marriage and failure are independent.

Proportion of married students who failed: (AB)/(A) = 60/200 = 0.3 or 30%

Proportion of unmarried students who failed: (αB)/(α) = 90/300 = 0.3 or 30%

Since the two proportions are the same, we conclude that the attributes marriage and failure are independent.

Example 4: Find whether A and B are independent in the following case by the proportion method: (AB) = 256, (αB) = 768, (Aβ) = 48, (αβ) = 144.

Solution: Attributes A and B shall be independent if:

(AB)/(A) = (αB)/(α)

For finding (A) and (α), let us prepare a nine-square table:

		A	α	Total
	B	256	768	1024
	β	48	144	192
	Total	304	912	1216

Therefore, by applying the above equation: 256/304 = 0.84 and 768/912 = 0.84.

Since the left-hand side and the right-hand side are equal, i.e., (AB)/(A) = (αB)/(α), the attributes A and B are independent.

8.4.1 Limitations

Just like the previous method, with this method also we can only determine the nature of association and not the degree of association.

126
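Both tests of this lesson, the comparison method of Section 8.3 and the proportion method of Section 8.4, come down to a couple of arithmetic checks. The sketch below is an illustration under our own naming, not part of the original text; it reproduces Example 1 and Example 4:

```python
def comparison_method(N, A, B, AB):
    """Comparison method (Section 8.3): compare the observed (AB) with
    the expectation (A)(B)/N."""
    expected = A * B / N
    if AB > expected:
        return "positively associated"
    if AB < expected:
        return "negatively associated (disassociated)"
    return "independent"

def proportion_method(AB, Ab, aB, ab):
    """Proportion method (Section 8.4): compare the proportion of B's
    amongst the A's with that amongst the alpha's, using
    cross-multiplication so the check stays in whole numbers.
    Lower-case 'a' and 'b' stand in for alpha and beta."""
    lhs = AB * (aB + ab)     # (AB) x (alpha)
    rhs = aB * (AB + Ab)     # (alpha B) x (A)
    if lhs > rhs:
        return "positively associated"
    if lhs < rhs:
        return "negatively associated (disassociated)"
    return "independent"

print(comparison_method(N=100, A=40, B=80, AB=30))        # disassociated (Example 1)
print(proportion_method(AB=256, Ab=48, aB=768, ab=144))   # independent   (Example 4)
```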
Page 127 :
Example 5: In a certain class it was found that 70% of the students passed the half-yearly examination, 30% of the students passed both the half-yearly and the annual examinations, while 28% passed the annual but failed the half-yearly examination. Find the percentage of students who:

(a) passed the annual examination,
(b) passed the half-yearly but failed the annual examination, and
(c) failed both the examinations.

Solution: Let A denote those passing the annual examination and B those passing the half-yearly examination; α and β will then represent, respectively, those failing the annual examination and those failing the half-yearly examination.

We are given N = 100, (B) = 70, (AB) = 30 and (Aβ) = 28, and we require (A), (αB) and (αβ).

(i) (A) = (AB) + (Aβ) = 30 + 28 = 58%

Hence the percentage of students who passed the annual examination is 58.

(ii) (αB) = (B) – (AB) = 70 – 30 = 40%

Hence the percentage of students who failed the annual examination but passed the half-yearly examination is 40.

(iii) (αβ) = (β) – (Aβ) = N – (B) – (Aβ) = 100 – 70 – 28 = 2%

Hence 2% of the students failed both the examinations.

Example 6: A survey was conducted in respect of marital status and success in examinations. Out of 2000 persons who appeared for an examination, 80% were boys and the rest were girls. Among 300 married boys, 140 were successful, while 1100 unmarried boys were successful. In respect of 100 married girls, 40 were successful,

127
Page 128 :
while 200 unmarried girls were successful. Construct two separate nine-square tables and determine the association between marital status and passing of the examination.

Solution: Let A denote married boys, so that α denotes unmarried boys. Let B denote those who were successful and β those who were unsuccessful.

In respect of boys, we are given the following information:

N = 1600, (A) = 300, (AB) = 140, (αB) = 1100

We can find the missing values from the nine-square table:

		A	α	Total
	B	140	1100	1240
	β	160	200	360
	Total	300	1300	1600

Expectation of (AB) = (A) × (B)/N = 300 × 1240/1600 = 232.5

Since (AB), the actual observation (140), is less than the expected observation (232.5), the attributes marriage and success in the examination are negatively associated.

We can construct another table in respect of girls. Taking A as married girls, the given information is:

N = 400, (A) = 100, (AB) = 40, (αB) = 200

We can find the missing frequencies from the nine-square table:

		A	α	Total
	B	40	200	240
	β	60	100	160
	Total	100	300	400

Expectation of (AB) = (A) × (B)/N = 100 × 240/400 = 60

128
Page 129 :
Since (AB), the actual observation (40), is less than the expectation (60), the attributes marriage and success in the examination are negatively associated.

Example 7: Out of 3000 unskilled workers of a factory, 2000 come from rural areas, and out of 1200 skilled workers, 300 come from rural areas. Determine the association between skill and residence by the method of proportion.

Solution: Let A denote skilled workers, so that α denotes unskilled workers. Let B denote workers from rural areas and β denote workers from urban areas.

We are given: (A) = 1200, (α) = 3000, (αB) = 2000, (AB) = 300

According to the method of proportions, two attributes A and B are said to be independent if (AB)/(A) = (αB)/(α).

In the given case: (AB)/(A) = 300/1200 = 0.25 and (αB)/(α) = 2000/3000 = 0.67

Since (AB)/(A) is less than (αB)/(α), there is negative association between skill and residence.

Example 8: Use the proportion method to determine the nature of association between A and B:

		B	β	Total
	A	30	50	80
	α	20	100	120
	Total	50	150	200

129
Page 130 :
Solution: According to the proportion method, if (AB)/(A) = (αB)/(α) when the proportion is taken with respect to A, the attributes A and B are independent.

We have (AB) = 30, (A) = 80, (αB) = 20, (α) = 120

(AB)/(A) = 30/80 = 0.375 and (αB)/(α) = 20/120 = 0.167

Since (AB)/(A) > (αB)/(α), the attributes A and B are positively associated.

8.5 SUMMARY

In this lesson we have discussed the nature of association between two attributes with the help of the comparison and proportion methods. In statistics, attributes A and B are associated only if they appear together in a greater number of cases than is to be expected if they were independent; in common language, association merely means that A and B occur together a number of times. If there exists no relationship of any kind between two attributes then they are said to be independent, otherwise they are said to be associated. Attributes A and B are said to be:

•	positively associated if (A)(B)/N < (AB),
•	negatively associated if (A)(B)/N > (AB),
•	independent if (A)(B)/N = (AB).

8.6 GLOSSARY

•	Correlation - A common statistical analysis, usually abbreviated as r, that measures the degree of relationship between pairs of interval variables in a sample. The range of correlation is from -1.00 through zero to +1.00; it describes a non-cause-and-effect relationship between two variables.

•	Covariate - A product of the correlation of two related variables times their standard deviations. Used in true experiments to measure the difference of treatment between them.

•	Statistical Analysis - Application of statistical processes and theory to the compilation, presentation, discussion, and interpretation of numerical data.

130
Page 131 :
, , Statistical Bias - Characteristics of an experimental or sampling design, or the, mathematical treatment of data, that systematically affects the results of a study so, as to produce incorrect, unjustified, or inappropriate inferences or conclusions., , , , Statistical Significance - The probability that the difference between the outcomes of the control and experimental group are great enough that it is unlikely, due solely to chance. The probability that the null hypothesis can be rejected at a, predetermined significance level (0.05 or 0.01)., , , , Theory - A general explanation about a specific behaviour or set of events that is, based on known principles and serves to organise related events in a meaningful, way. A theory is not as specific as a hypothesis., , , , Comparison method- By comparison method, the nature of association between, two attributes is calculated by comparing observed and expected frequencies., , 8.7, , SELF ASSESSMENT QUESTIONS, , True/False:1., 2., 3., 4., 8.8, , Yule’s coefficient of association is used to study only the nature of association, between two attributes., T/F, In comparison method, we compare observed and expected frequencies. T/F, (AB) indicate actual or observed frequency., T/F, Association is the relationship between two or more variables., T/F, LESSON END EXERCISE, , 1., , Show whether A and B are independent, positively associated or negatively, associated in each of the following cases: Use comparison method., , (i), , N = 1000; (A) = 450; (B) = 600; (AB) = 340, , (ii), , (B) = 383 (α)= 585; (A) = 480; (AB) = 290, ___________________________________________________________, ___________________________________________________________, , 131
Page 132 :
___________________________________________________________

2.	Find whether A and B are independent, positively associated or negatively associated by using the proportion method:
	N = 1000; (A) = 500; (B) = 400; (AB) = 200
	___________________________________________________________
	___________________________________________________________
	___________________________________________________________

3.	From the following data prepare a 2 × 2 table and, using the comparison method, discuss whether there is any association between literacy and unemployment:
	Illiterate unemployed: 250 persons
	Literate employed: 25 persons
	Illiterate employed: 180 persons
	Total number of persons: 500 persons
	___________________________________________________________
	___________________________________________________________
	___________________________________________________________

4.	Can vaccination be regarded as a preventive measure for small-pox from the data given below?
	'Of 1482 persons in a locality exposed to small-pox, 368 in all were attacked.'
	'Of 1482 persons, 343 had been vaccinated and of these only 35 were attacked.'
	___________________________________________________________
	___________________________________________________________
	___________________________________________________________

132
Page 133 :
8.9 SUGGESTED READINGS

•	Gupta, S.P.: Statistical Methods, Sultan Chand & Sons, New Delhi.
•	Gupta, S.C. and V.K. Kapoor: Fundamentals of Applied Statistics.
•	Hooda, R.P.: Statistics for Business and Economics, Macmillan, New Delhi.
•	Hien, L.W.: Quantitative Approach to Managerial Decisions, Prentice Hall, New Jersey.
•	Lawrence B. Morse: Statistics for Business and Economics, Harper Collins.
•	Mc Clave, Benson and Sincich: Statistics for Business and Economics, Eleventh Edition, Prentice Hall Publication.

133
Page 134 :
M.Com. I					Course No. M.Com-115
Unit-II						Lesson No. 9

METHODS OF ASSOCIATION

STRUCTURE
9.1	Introduction
9.2	Objectives
9.3	Yule's Coefficient of Association
9.4	Coefficient of Colligation
	9.4.1	Yule's Coefficient of Colligation
9.5	Coefficient of Contingency
	9.5.1	Contingency Table: Manifold Classification
	9.5.2	Chi-Square and Coefficient of Contingency
9.6	Summary
9.7	Glossary
9.8	Self Assessment Questions
9.9	Lesson End Exercise
9.10	Suggested Reading

9.1 INTRODUCTION

So far we have discussed the nature of association, i.e., whether two or more attributes are independent, positively associated or negatively associated, with the help of the comparison and proportion methods in Lesson 8. But the statistician is more interested in the degree of association along with its nature. Therefore, in this lesson we are going to

134
Page 135 :
explain the degree of association between two attributes by using an important method, i.e., Yule's coefficient of association, along with the coefficient of colligation and the coefficient of contingency.

9.2 OBJECTIVES

After reading this lesson, you would be able to:

•	explain the degree of association between two attributes by using Yule's coefficient of association;
•	describe the concept of the contingency table for manifold classification;
•	compute the expected frequencies for different cells, which are necessary for the computation of chi-square;
•	calculate the coefficient of contingency and interpret the level of association with its help;
•	clarify the difference between the coefficient of colligation and Yule's coefficient of colligation.

9.3 YULE'S COEFFICIENT OF ASSOCIATION

The most popular method of studying association is Yule's coefficient, because here we can determine not only the nature of association, i.e., whether the attributes are positively associated, negatively associated or independent, but also the degree or extent to which the two attributes are associated. Yule's coefficient is denoted by the symbol Q and is obtained by applying the formula given below:

Q = [(AB)(αβ) – (Aβ)(αB)] / [(AB)(αβ) + (Aβ)(αB)]

The value of Yule's coefficient lies between ±1. When the value of:

Q = +1, there is a perfect positive association between the attributes.

Q = –1, there is a perfect negative association between the attributes, or perfect disassociation.

135
Page 136 :
Q = 0, the attributes are independent.

The coefficient of association can be used to compare the intensity of association between two attributes with the intensity of association between two other attributes.

Example 1: Find Yule's coefficient of association for the following data:

N = 1500, (α) = 1117, (B) = 360, (AB) = 35

Solution: By putting the known values of the frequencies into the 2 × 2 contingency table and finding the remaining unknown frequencies:

		B	β	Total
	A	35	348	383
	α	325	792	1117
	Total	360	1140	1500

Yule's coefficient of association is:

Q = [(AB)(αβ) – (Aβ)(αB)] / [(AB)(αβ) + (Aβ)(αB)]
  = [(35)(792) – (325)(348)] / [(35)(792) + (325)(348)]
  = (27720 – 113100) / (27720 + 113100)
  = –85380/140820 = –0.606

Example 2: In a group of 800 students, the number of married students is 320. Out of 240 students who failed, 96 belonged to the married group. Find out whether the attributes marriage and failure are independent.

136
Page 137 :
Solution: Let A stand for married students and B for those who failed. We are given N = 800, (A) = 320, (B) = 240, (AB) = 96. Putting the information in the nine-square table, we have:

		B	β	Total
	A	96	224	320
	α	144	336	480
	Total	240	560	800

Calculating Yule's coefficient of association:

Q = [(AB)(αβ) – (Aβ)(αB)] / [(AB)(αβ) + (Aβ)(αB)]
  = [(96)(336) – (224)(144)] / [(96)(336) + (224)(144)]
  = (32256 – 32256) / (32256 + 32256) = 0

Since Q = 0, the attributes marriage and failure are independent.

Example 3: Investigate the association between the eye colour of husbands and the eye colour of wives from the data given below:

Husbands with light eyes and wives with light eyes = 309
Husbands with light eyes and wives with non-light eyes = 214
Husbands with non-light eyes and wives with light eyes = 132
Husbands with non-light eyes and wives with non-light eyes = 119

Solution: Since we have to find the association between the eye colour of the husband and that of the wife, we take one attribute as A and the other as B.

Let A denote husbands with light eyes, so that α denotes husbands with non-light eyes; let B denote wives with light eyes, so that β denotes wives with non-light eyes.

The given data in terms of these symbols are:

(AB) = 309, (Aβ) = 214, (αB) = 132, (αβ) = 119

137
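Yule's Q is a one-line computation once the four ultimate class frequencies are known. The sketch below is our own illustration, not part of the lesson; it reproduces Examples 1 and 2 and also evaluates Q for the Example 3 data:

```python
def yules_q(AB, Ab, aB, ab):
    """Yule's coefficient of association Q, lying between -1 and +1.
    Lower-case 'a' and 'b' stand in for alpha and beta."""
    return (AB * ab - Ab * aB) / (AB * ab + Ab * aB)

print(round(yules_q(AB=35, Ab=348, aB=325, ab=792), 3))    # -0.606 (Example 1)
print(yules_q(AB=96, Ab=224, aB=144, ab=336))              #  0.0   (Example 2)
print(round(yules_q(AB=309, Ab=214, aB=132, ab=119), 3))   #  0.131 (Example 3 data)
```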
Page 140 :
γ = [1 – √{(Aβ)(αB) / (AB)(αβ)}] / [1 + √{(Aβ)(αB) / (AB)(αβ)}]

γ = [1 – √{(25)(40) / (50)(20)}] / [1 + √{(25)(40) / (50)(20)}] = (1 – √1)/(1 + √1) = (1 – 1)/(1 + 1) = 0/2 = 0

Since the value of gamma is 0, there is no association between failure in Economics and failure in English.

9.5 COEFFICIENT OF CONTINGENCY

In Lesson 8 we discussed that the classification of the data can be dichotomous or manifold. If an attribute has only two classes it is said to be dichotomous, and if it has many classes the classification is called manifold. For example, the criterion 'location' can be divided into big city and small town, and the other characteristic, 'nature of occupancy', can be divided into 'owner occupied' and 'rented to private parties'. This is dichotomous classification. Now suppose we have N observations classified according to both criteria.

Table 1
	Nature of Occupancy	Location: Big Town	Location: Small Town	Total
	Owner occupied		54			67			121
	Rented to parties	107			22			129
	Total			161			89			250

Here we have classification by two criteria: one, location (two categories), and the other, nature of occupancy (two categories). Such a two-way table is called a contingency table. The table above is a 2 × 2 contingency table, where both the attributes have two categories each; it has 2 rows and 2 columns and 2 × 2 = 4 distinct cells. We also discussed in the previous lesson that the purpose behind the construction of such a table is to study the relationship between two attributes, i.e. whether the two attributes or characteristics

140
Page 141 :
appear to occur independently of each other or whether there is some association between the two. In the above case our interest lies in ascertaining whether the two attributes, i.e. location and nature of occupancy, are independent. In practical situations, instead of two classes, an attribute can be classified into a number of classes. Such a classification is called manifold classification. For example, stature can be classified as very tall, tall, medium, short and very short. In the present lesson, we shall discuss manifold classification, the related contingency table and the methodology to test the intensity of association between two attributes which are classified into a number of classes. The main focus of this lesson is the computation of Yule's coefficient of association, the coefficient of colligation, chi-square and the coefficient of contingency, which are used to measure the degree of association between two attributes.

9.5.1 Contingency Table : Manifold Classification

We have already learnt that if an attribute is divided into more than two parts or groups, we have manifold classification. For example, instead of dividing the universe into two parts - heavy and not heavy - we may sub-divide it into a larger number of parts: very heavy, heavy, normal, light and very light. This type of sub-division can be done for both the attributes of the universe. Thus, attribute A can be divided into a number of groups A1, A2, ..., Ar. Similarly, attribute B can be sub-divided into B1, B2, ..., Bs. When the observations are classified according to two attributes and arranged in a table, the display is called a contingency table. This table can be 3 × 3, 4 × 4, etc. In a 3 × 3 table both of the attributes A and B have three sub-divisions. Similarly, in a 4 × 4 table, each of the attributes A and B is divided into four parts, viz. A1, A2, A3, A4 and B1, B2, B3, B4. The number of classes for the two attributes may also be different. If attribute A is divided into 3 parts and B into 4 parts, then we have a 3 × 4 contingency table. In the same way, we can have 3 × 5, 4 × 3, etc. contingency tables. It should be noted that if one of the attributes has two classes and the other has more than two classes, the classification is still manifold. Thus, we can have 2 × 3, 2 × 4, etc. contingency tables. We shall confine our attention to two attributes A and B, where A is sub-divided into r classes, A1, A2, ..., Ar, and B is sub-divided into s classes, B1, B2, ..., Bs. Following is the layout of the r × s contingency table.
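In practice such a table is seldom tallied by hand. The short Python sketch below is not part of the original lesson; it shows how raw observations on two attributes can be tabulated into a contingency table with row and column totals. The pandas library is assumed to be available and the records are hypothetical.

```python
import pandas as pd

# Hypothetical records, one row per observation, each carrying two attributes.
data = pd.DataFrame({
    "occupancy": ["Owner", "Owner", "Rented", "Rented", "Owner", "Rented",
                  "Owner", "Rented"],
    "location":  ["Big Town", "Small Town", "Big Town", "Village",
                  "Village", "Small Town", "Big Town", "Big Town"],
})

# pd.crosstab tabulates the joint frequencies into a 2 x 3 contingency
# table; margins=True adds the row and column totals.
table = pd.crosstab(data["occupancy"], data["location"], margins=True)
print(table)
```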
Page 143 :
We can have manifold classification of the two attributes, in which case each of the two attributes is first observed and then each one is classified into two or more subclasses, resulting in what is called a contingency table. The following is an example of a 4 × 4 contingency table with two attributes A and B, each of which has been further classified into four sub-categories.

Association can be studied in a contingency table through Yule's coefficient of association as stated above, but for this purpose we have to reduce the contingency table to a 2 × 2 table by combining some classes. For instance, if we combine (A1) + (A2) to form (A) and (A3) + (A4) to form (α), and similarly if we combine (B1) + (B2) to form (B) and (B3) + (B4) to form (β) in the above contingency table, then we can write the table in the form of a 2 × 2 table as shown in the table below.

After reducing a contingency table to a two-by-two table through the process of combining classes, we can work out the association as explained above. But the practice of combining classes is not considered very correct and at times it is also inconvenient. Therefore, Karl Pearson has suggested a measure known as the coefficient of mean square contingency, or simply the coefficient of contingency, for studying association in contingency tables.
Page 144 :
This can be obtained as under:

C = √[χ² / (χ² + N)]

While finding out the value of C we proceed on the assumption of the null hypothesis, i.e., that the two attributes are independent and exhibit no association.

For the calculation of C we have to determine the value of χ² (pronounced as chi-square). The steps in calculating the value of χ² are:

(i) Find the expected frequency (the frequency expected under independence) for each cell. Thus, for cell (A1B1), the expectation is (A1) × (B1) / N.

(ii) Obtain the difference between the observed and expected frequencies in each cell, i.e., find (O − E).

(iii) Square (O − E) and divide the figure by E, the expected frequency for that cell.

(iv) Add up the figures obtained in step (iii). This gives the value of χ².

Thus χ² = Σ (O − E)² / E

Once the value of χ² is obtained, it is easy to determine the value of C.

Example 8: The following table gives the association among 1000 advocates between their weights and mental level. Determine the coefficient of contingency between the two.
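To make steps (i)–(iv) concrete, the following minimal Python sketch, which is not part of the original lesson, applies them to the frequencies of Table 1 (location versus nature of occupancy) from the previous section; NumPy is assumed to be available.

```python
# Steps (i)-(iv): expected frequencies, (O - E)^2 / E, chi-square, and then
# the coefficient of contingency C. Observed counts are those of Table 1.
import numpy as np

observed = np.array([[54.0, 67.0],
                     [107.0, 22.0]])

N = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)   # (Ai)
col_totals = observed.sum(axis=0, keepdims=True)   # (Bj)

expected = row_totals @ col_totals / N             # (Ai)(Bj)/N for each cell
chi_square = ((observed - expected) ** 2 / expected).sum()
C = np.sqrt(chi_square / (chi_square + N))

print(round(chi_square, 2), round(C, 3))           # roughly 39.98 and 0.371
```

The same code works unchanged for larger r × s tables, since the expected-frequency matrix is formed directly from the row and column totals.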
Page 147 :
9.6 SUMMARY

Sometimes mere knowledge of the association (whether positive or negative) or independence between attributes is not sufficient. We are interested in finding the extent or degree of association between attributes, so that we can take decisions more precisely and easily. In this regard, we have discussed Yule's coefficient of association, the coefficient of colligation and the coefficient of contingency in this unit. The value of Yule's coefficient of association lies between −1 and +1. If Q = +1, A and B are perfectly associated, and if Q = −1 they are perfectly dissociated; between −1 and +1 lie different degrees of association. Another important measure of association is the coefficient of colligation. We have also discussed the following: a contingency table is a table of the joint frequencies of occurrence of two variables classified into categories; χ² is used for finding association and relationship between attributes; and the calculation of χ² is based on observed frequencies and theoretically determined (expected) frequencies. We have seen that if the observed frequency of each cell is equal to the expected frequency of the respective cell for the whole contingency table, then the attributes A and B are completely independent, and if they are not the same for some of the cells, then there exists some association between the attributes. The degree or extent of association between attributes in an r × s contingency table can be found by computing the coefficient of mean square contingency

C = √[χ² / (χ² + N)]

The value of C lies between 0 and 1 but it never attains the value unity. A value near 1 shows a great degree of association between the two attributes and a value near 0 shows no association.
Page 148 :
9.7 GLOSSARY

•  Yule's Method: Yule's Y, also known as the coefficient of colligation, is a measure of association between two binary variables. The measure was developed by George Udny Yule in 1912, and should not be confused with Yule's coefficient for measuring skewness based on quartiles.

•  Coefficient of Contingency: The contingency coefficient is a coefficient of association that tells whether two variables or data sets are independent of or dependent on each other. It is also known as Pearson's coefficient (not to be confused with Pearson's coefficient of skewness).

•  Contingency Table: A table showing the distribution of one variable in rows and another in columns, used to study the association between the two variables.

9.8 SELF ASSESSMENT QUESTIONS

1.  If N = 500, (AB) = 140, (B) = 150 and (A) = 100, find the other frequencies.
    ___________________________________________________________
    ___________________________________________________________
    ___________________________________________________________

2.  For the data given in Question 1, find the association between A and B by different methods.
    ___________________________________________________________
    ___________________________________________________________
    ___________________________________________________________

3.  Prepare a 2×2 contingency table from the following data and find Yule's coefficient of association: (AB) = 100, (αB) = 80, (Aβ) = 50, (αβ) = 40.
    ___________________________________________________________
    ___________________________________________________________
    ___________________________________________________________
Page 149 :
9.9 LESSON END EXERCISE

1.  Find Yule's coefficient of association from the following data: N = 800, (A) = 470, (β) = 450 and (AB) = 230.
    ___________________________________________________________
    ___________________________________________________________
    ___________________________________________________________

2.  When are two attributes said to be independent? From the following data check whether the attributes A and B are independent or not by using the coefficient of colligation method: N = 100, (A) = 50, (B) = 70 and (AB) = 30.
    ___________________________________________________________
    ___________________________________________________________
    ___________________________________________________________

3.  The male population of a certain state is 250 lakhs. The number of literate males is 26 lakhs and the total number of male criminals is 32 thousand. The number of literate male criminals is 3000. Do you find any association between literacy and criminality?
    ___________________________________________________________
    ___________________________________________________________
    ___________________________________________________________

4.  Out of a total population of 1000, the number of vaccinated persons was 600. In all, 200 had an attack of smallpox and out of these 30 were those who were vaccinated. Do you find any association between vaccination and freedom from attack? Use the coefficient of colligation.
Page 150 :
    ___________________________________________________________
    ___________________________________________________________
    ___________________________________________________________

5.  In an area with a total population of 7000 adults, 3400 are males and out of a total of 600 graduates, 70 are females. Out of 120 graduate employees, 20 are females. (i) Is there any association between sex and education? (ii) Is there any association between appointment and sex? Use the coefficient of contingency.
    ___________________________________________________________
    ___________________________________________________________
    ___________________________________________________________

6.  Find whether A and B are independent, positively associated or negatively associated in each of the following cases:
    (i)   N = 100; (A) = 47; (B) = 62 and (AB) = 32
    (ii)  (A) = 495; (AB) = 292; (α) = 572 and (αβ) = 380
    (iii) (AB) = 2560; (αB) = 7680; (Aβ) = 480 and (αβ) = 1440
    ___________________________________________________________
    ___________________________________________________________
    ___________________________________________________________

7.  Eighty-eight residents of an Indian city, who were interviewed during a sample survey, are classified below according to their smoking and tea-drinking habits. Calculate Yule's coefficient of association and comment on its value.

                         Smokers    Non-Smokers
    Tea Drinkers            40           33
    Non-tea Drinkers         3           12
Page 151 :
    ___________________________________________________________
    ___________________________________________________________
    ___________________________________________________________

8.  Find the association between literacy and unemployment from the following figures:
    Total adults:          10,000
    Literates:              1,290
    Unemployed:             1,390
    Literate unemployed:      820
    ___________________________________________________________
    ___________________________________________________________
    ___________________________________________________________

9.10 SUGGESTED READINGS

•  Gupta, S.P.: Statistical Methods, Sultan Chand & Sons, New Delhi.
•  Gupta, S.C. and V.K. Kapoor: Fundamentals of Applied Statistics.
•  Anderson, Sweeney and Williams: Statistics for Business and Economics, Thompson, New Delhi.
•  Lawrence B. Morse: Statistics for Business and Economics, Harper Collins.
•  McClave, Benson and Sincich: Statistics for Business and Economics, Eleventh Edition, Prentice Hall Publication.
Page 152 :
M.Com. I                                    Course No. M.Com-115
Unit-II                                     Lesson No. 10

PARTIAL AND MULTIPLE CORRELATION AND REGRESSION ANALYSIS

STRUCTURE

10.1   Introduction
10.2   Objectives
10.3   Correlation
       10.3.1  Simple Correlation
       10.3.2  Partial Correlation
               10.3.2.1  Order of Partial Correlation Coefficient
               10.3.2.2  Characteristics and Uses of Partial Correlation
               10.3.2.3  Limitations
       10.3.3  Multiple Correlation
               10.3.3.1  Properties of Multiple Correlation Coefficient
               10.3.3.2  Advantages of Using Multiple Correlation Analysis
               10.3.3.3  Limitations of Multiple Correlation Analysis
10.4   Regression
       10.4.1  Simple Regression
       10.4.2  Multiple Regression
               10.4.2.1  Objectives of Multiple Regression Analysis
               10.4.2.2  Assumptions of Linear Multiple Regression Analysis
               10.4.2.3  Multiple Regression Equation
10.5   Summary
Page 153 :
10.6   Glossary
10.7   Self Assessment Questions
10.8   Lesson End Exercise
10.9   Suggested Reading

10.1 INTRODUCTION

In the study of multivariate correlation models one is naturally interested in the relationships among the variables. One set of measures useful to this end consists of the coefficients of multiple correlation and the coefficients of partial correlation. All partial correlation coefficients measure the correlation between two variables after the influence of the remaining variables has been removed. Also, in industry and business today, large amounts of data are continuously being generated. This may be data pertaining to a company's annual production, annual sales, turnover, profits or some other variable of direct interest to management. The accumulated data may be used to gain information about the system (for instance, what happens to the output of the plant when the temperature is reduced to half) or to visually depict its past pattern. Our interest in regression is primarily for the first purpose, mainly to extract the main features of the relationships hidden in or implied by the mass of data.

The term 'regression' was first used by the British biometrician Sir Francis Galton, in the later part of the nineteenth century, in connection with the heights of parents and their offspring. He found that the offspring of tall or short parents tend to regress towards the average height: though tall fathers do tend to have tall sons, the average height of the sons of tall fathers is less than that of their fathers, and the average height of the sons of short fathers is more than that of their fathers, i.e., there is a "stepping back" towards the average. Nowadays, however, the term 'regression' stands for some sort of functional relationship between two or more related variables. After having established that two variables are closely related, we may be interested in estimating the value of one variable given the value of the other.

As we know, the correlation coefficient and regression analysis measure the degree of dependence and the nature of the effect of one variable on another variable. If we study the combined effect of a group of variables (at least two variables) upon a single variable which is not included in that group, our study is that of multiple regression and multiple correlation. If, however, we wish to examine the effect of one variable on another after eliminating the effect of the remaining variables, our study is that of partial correlation and partial regression.
Page 154 :
10.2 OBJECTIVES

After reading this lesson, you would be able to:

•  describe the concepts of partial correlation, multiple correlation and multiple regression,
•  define the multiple correlation coefficient, the partial correlation coefficient and multiple regression coefficients,
•  derive the partial and multiple correlation coefficient formulae,
•  understand the role of regression in establishing a mathematical relationship between dependent and independent variables,
•  find and interpret the least-squares multiple regression equation,
•  forecast the value of the dependent variable for given values of the independent variables, and
•  calculate and interpret the coefficient of multiple determination (R²).

10.3 CORRELATION

The correlation and regression coefficients discussed earlier measure the degree and nature of the effect of one variable on another variable. While it is useful to know how one phenomenon is influenced by another, it is also important to know how one phenomenon is affected by several other variables. In nature, relationships tend to be complex rather than simple: one variable is related to a great number of others, many of which may be interrelated among themselves. For example, the yield of rice is affected by the type of soil, temperature, amount of rainfall, etc. Whether phenomena are biological, physical, chemical or economic, they are affected by a multiplicity of causal factors. It is part of the statistician's task to determine the effect of one cause, or of two or more causes acting separately or simultaneously, or of one cause when the effect of the others is eliminated. This is done with the help of multiple and partial correlation analysis.

The basic distinction between multiple and partial correlation analysis is that whereas in the former we measure the degree of relationship between a dependent variable Y and all the independent variables X1, X2, X3, ..., Xn taken together, in the latter we measure the degree of relationship between Y and one of the variables X1, X2, ..., Xn with the effect of all the other variables removed.
Page 155 :
10.3.1 Simple Correlation

Simple correlation is a measure used to determine the strength and the direction of the relationship between two variables, say X and Y. The value of the simple correlation coefficient ranges from −1 to +1. Simple correlation ignores the effect of all other variables, even though these variables might be quite closely related to the independent variable.

10.3.2 Partial Correlation

It is often important to measure the correlation between a dependent variable and one particular independent variable when all other variables involved are kept constant, i.e., when the effects of all other variables are removed (often referred to as 'other things being equal'). This can be obtained by calculating the coefficient of partial correlation. Partial correlation analysis, therefore, measures the strength of the relationship between the dependent variable Y and one independent variable in such a way that variations in the other independent variables are taken into account.

The partial correlation coefficient provides a measure of the relationship between the dependent variable and another variable, with the effect of the rest of the variables eliminated.

We denote by r12.3 the partial correlation coefficient between X1 and X2, keeping X3 constant. We find that:

r12.3 = (r12 − r13 r23) / √[(1 − r13²)(1 − r23²)]

Similarly,

r13.2 = (r13 − r12 r23) / √[(1 − r12²)(1 − r23²)]

where r13.2 is the coefficient of partial correlation between X1 and X3, keeping X2 constant.
Page 156 :
r23.1 = (r23 − r12 r13) / √[(1 − r12²)(1 − r13²)]

where r23.1 is the coefficient of partial correlation between X2 and X3, keeping X1 constant.

Thus, for three variables X1, X2 and X3, there will be three coefficients of partial correlation, each studying the relationship between two variables when the third is held constant. The partial correlation coefficient thus helps us to answer questions such as: is the correlation between, say, X1 and X2 merely due to the fact that both are affected by X3, or is there a net co-variation between X1 and X2 over and above the association due to the common influence of X3? Therefore, in determining the partial correlation coefficient between X1 and X2, we attempt to remove the influence of X3 from each of the two variables so as to ascertain whether a net relationship exists between the "unexplained" residuals that remain. Also, it should be noted that the value of a partial correlation coefficient is interpreted via the corresponding coefficient of partial determination, i.e., by squaring the partial correlation coefficient. Thus, if X1, X2 and X3 represent sales, advertisement expenditure and price respectively, and we get r²12.3 = 0.912, this means that more than 91% of the variation in sales which is not associated with price is associated with advertisement expenditure.

10.3.2.1 Order of Partial Correlation Coefficient

The order of a partial correlation coefficient depends upon the number of variables held constant. If no variable is held constant, it is known as a zero-order correlation or simple correlation coefficient. If one variable is held constant, it is a case of first-order partial correlation, and if two variables are kept constant, it is known as a second-order partial correlation coefficient, and so on. Thus, partial coefficients such as r12.3 and r13.2 are often referred to as first-order coefficients, since one variable has been held constant; r12, r13, etc. are examples of simple correlation coefficients.

10.3.2.2 Characteristics and Uses of Partial Correlation

Partial correlation analysis is the measurement of the relationship between two factors, with the effects of other factors eliminated. If the assumptions of the method are
Page 157 :
true for a series of data, the power of partial analysis is great. The problem of holding certain variables constant while the relationship between the others is measured often presents difficulty in statistical analysis. Partial correlation is especially useful in the analysis of interrelated series. It is particularly pertinent to uncontrolled experiments of various kinds, in which such interrelationships usually exist. Most economic data fall in this category.

Partial correlation is of great value when used in conjunction with gross and multiple correlation in the analysis of the factors affecting variations in a phenomenon.

Partial analysis, like all correlation analysis, has the advantage that the relationships are expressed concisely in a few well-defined coefficients. It is also adaptable to small amounts of data, and the reliability of the results can be rather easily tested.

10.3.2.3 Limitations

1. The usefulness of the partial correlation coefficient is somewhat limited by the following assumptions:
   i.   The zero-order correlations must have linear regression.
   ii.  The effects of the independent variables must be additive, and not jointly related.
   iii. Because the reliability of a partial coefficient decreases as its order increases, the number of observations in the gross correlations should be fairly large. Very often students carry the analysis beyond the limits of the data; this weakness can to some extent be guarded against by tests of reliability.

2. Even when the above-mentioned assumptions have been satisfied, partial correlation has the disadvantage of laborious calculations and difficult interpretation, even for statisticians. The interpretation of partial and multiple correlation results tends to assume that the independent variables have causal effects on the dependent variable. This assumption is sometimes true, but more often untrue in varying degrees.

Example 1: On the basis of the following information compute (i) r23.1, (ii) r13.2 and (iii) r12.3, when r12 = 0.70, r13 = 0.61 and r23 = 0.40.
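The first-order formula also lends itself to a quick numerical check. The following minimal Python sketch is not part of the original lesson; the function name partial_r is illustrative. It reproduces the results of the worked examples on the following pages.

```python
# First-order partial correlation:
# r12.3 = (r12 - r13*r23) / sqrt((1 - r13^2) * (1 - r23^2))
from math import sqrt

def partial_r(r12, r13, r23):
    """Partial correlation between X1 and X2, holding X3 constant."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# Example 2 of this lesson: r12 = 0.80, r13 = 0.65, r23 = 0.70
print(round(partial_r(0.80, 0.65, 0.70), 3))   # 0.636 (given as 0.635 in the text)

# Example 3 of this lesson: r12 = 0.86, r13 = 0.65, r23 = 0.72
print(round(partial_r(0.86, 0.65, 0.72), 3))   # 0.743
```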
Page 159 :
Solution: We have to find the partial correlation between the yield of cotton and the number of bolls, eliminating the effect of height, i.e., in terms of symbols, we have to calculate r12.3.

r12.3 = (r12 − r13 r23) / √[(1 − r13²)(1 − r23²)]

Substituting the given values:

r12.3 = [0.80 − (0.65 × 0.70)] / √[(1 − 0.65²)(1 − 0.70²)]
      = (0.80 − 0.455) / √[(1 − 0.4225)(1 − 0.49)]
      = 0.345 / √(0.5775 × 0.51)
      = 0.635

Example 3: If r12 = 0.86, r13 = 0.65 and r23 = 0.72, find the partial correlation coefficient between the first and second variables, keeping the third variable constant.

Solution: Here we have to find the value of r12.3.

r12.3 = (r12 − r13 r23) / √[(1 − r13²)(1 − r23²)]

Substituting the given values:

r12.3 = [0.86 − (0.65 × 0.72)] / √[(1 − 0.65²)(1 − 0.72²)]
      = (0.86 − 0.468) / √(0.5775 × 0.4816)
      = 0.743

10.3.3 Multiple Correlation

In multiple correlation we deal with situations that involve three or more variables. For example, we may consider the association between the yield of wheat per acre and both the amount of rainfall and the average daily temperature. We try to make estimates of the value of one of these variables based on the values of all the other variables. The variable whose value we are trying to estimate is called the dependent variable and the other variables, on which our estimates are based, are known as independent variables. The statistician himself chooses which variable is to be dependent and which variables are to be independent; it is merely a question of the problem being studied.
Page 160 :
The coefficient of multiple linear correlation is represented by R, and it is common to add subscripts designating the variables involved. Thus, R1.234 would represent the coefficient of multiple linear correlation between X1 on the one hand and X2, X3 and X4 on the other. The subscript of the dependent variable is always written to the left of the point.

The coefficient of multiple correlation can be expressed in terms of r12, r13 and r23 as follows:

R1.23 = √[(r12² + r13² − 2 r12 r13 r23) / (1 − r23²)]

R3.12 = √[(r13² + r23² − 2 r12 r13 r23) / (1 − r12²)]

R2.31 = √[(r12² + r23² − 2 r12 r13 r23) / (1 − r13²)]

It must be noted that R1.23 is the same as R1.32. By squaring R1.23, we obtain the coefficient of multiple determination.

10.3.3.1 Properties of Multiple Correlation Coefficients

The following are some of the properties of multiple correlation coefficients:

1. The multiple correlation coefficient is the degree of association between the observed value of the dependent variable and its estimate obtained by multiple regression.

2. The multiple correlation coefficient lies between 0 and 1 and is always positive in sign.

3. If the multiple correlation coefficient is 1, the association is perfect and the multiple regression equation may be said to be a perfect prediction formula.

4. If the multiple correlation coefficient is 0, the dependent variable is uncorrelated with the independent variables. From this, it can be concluded that the multiple regression equation fails to predict the value of the dependent variable when the values of the independent variables are known.

5. The multiple correlation coefficient is always greater than or equal to any total correlation coefficient. If R1.23 is the multiple correlation coefficient, then R1.23 ≥ r12, r13 or r23, and
Page 161 :
6. The multiple correlation coefficient obtained by the method of least squares would always be greater than the multiple correlation coefficient obtained by any other method.

10.3.3.2 Advantages of Using Multiple Correlation Analysis

Multiple correlation serves the following purposes:

1. It serves as a measure of the degree of association between one variable taken as the dependent variable and a group of other variables taken as the independent variables.

2. It also serves as a measure of the goodness of fit of the calculated plane of regression and, consequently, as a measure of the general degree of accuracy of estimates made by reference to the equation for the plane of regression.

10.3.3.3 Limitations of Multiple Correlation Analysis

1) Multiple correlation analysis is based on the assumption that the relationship between the variables is linear. In other words, the rate of change of one variable in terms of the others is assumed to be constant for all values. In practice most relationships are not linear but follow some other pattern, which limits somewhat the use of multiple correlation analysis. Linear regression coefficients are not accurately descriptive of curvilinear data.

2) Another limitation is the assumption that the effects of the independent variables on the dependent variable are separate, distinct and additive. When the effects of the variables are additive, a given change in one has the same effect on the dependent variable regardless of the sizes of the other independent variables.

3) Linear multiple correlation involves a great deal of work relative to the results frequently obtained. When the results are obtained, only a few students are able to interpret them. The misuse of correlation results has probably led to more doubt being cast on the method than is justified. However, this lack of understanding and the resulting misuse are due to the complexity of the method.
Page 162 :
Example 4: The following zero-order correlation coefficients are given: r12 = 0.98, r13 = 0.44 and r23 = 0.54. Calculate the multiple correlation coefficient treating the first variable as dependent and the second and third variables as independent.

Solution: We have to calculate the multiple correlation coefficient treating the first variable as dependent and the second and third variables as independent, i.e., we have to find R1.23.

R1.23 = √[(r12² + r13² − 2 r12 r13 r23) / (1 − r23²)]

Substituting the given values:

R1.23 = √[(0.98² + 0.44² − 2 × 0.98 × 0.44 × 0.54) / (1 − 0.54²)]
      = √[(0.9604 + 0.1936 − 0.4657) / 0.7084]
      = √(0.6883 / 0.7084)
      = 0.986

Example 5: If r12 = 0.9, r13 = 0.75 and r23 = 0.7, find R1.23.

Solution:

R1.23 = √[(r12² + r13² − 2 r12 r13 r23) / (1 − r23²)]

Substituting the given values:

R1.23 = √[(0.9² + 0.75² − 2 × 0.9 × 0.75 × 0.7) / (1 − 0.7²)]
      = √[(0.81 + 0.5625 − 0.945) / 0.51]
      = √(0.4275 / 0.51)
      = 0.916

Example 6: If r12 = 0.77, r13 = 0.72 and r23 = 0.52, find R3.12.

Solution:

R3.12 = √[(r13² + r23² − 2 r12 r13 r23) / (1 − r12²)]

Substituting the given values:

R3.12 = √[(0.72² + 0.52² − 2 × 0.77 × 0.72 × 0.52) / (1 − 0.77²)]
      = √[(0.5184 + 0.2704 − 0.5766) / 0.4071]
      = √(0.2122 / 0.4071)
      = 0.722
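The substitutions in Examples 4–6 can be checked with a similar script. Again, the following code is not part of the original lesson, and the function name multiple_r is illustrative only.

```python
# A minimal sketch of the multiple correlation coefficient
# R1.23 = sqrt((r12^2 + r13^2 - 2*r12*r13*r23) / (1 - r23^2)).
from math import sqrt

def multiple_r(r12, r13, r23):
    """Multiple correlation of X1 on X2 and X3 from zero-order coefficients."""
    return sqrt((r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2))

print(round(multiple_r(0.98, 0.44, 0.54), 3))   # Example 4: ~0.986
print(round(multiple_r(0.90, 0.75, 0.70), 3))   # Example 5: ~0.916

# Example 6 asks for R3.12, i.e. X3 as the dependent variable:
# pass the coefficients with the roles of the variables permuted.
print(round(multiple_r(0.72, 0.52, 0.77), 3))   # Example 6: ~0.722
```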
Page 163 :
10.4 REGRESSION ANALYSIS

Linear regression is a common statistical data analysis technique. It is used to determine the extent to which there is a linear relationship between a dependent variable and one or more independent variables. There are two types of linear regression: simple linear regression and multiple linear regression. In simple linear regression a single independent variable is used to predict the value of a dependent variable. In multiple linear regression two or more independent variables are used to predict the value of a dependent variable. The difference between the two is the number of independent variables; in both cases there is only a single dependent variable.

10.4.1 Simple Regression

It is also called simple linear regression. It establishes the relationship between two variables using a straight line. Simple linear regression attempts to draw the line that comes closest to the data by finding the slope and intercept that define the line and minimise the regression errors. Many data relationships do not follow a straight line, so statisticians use nonlinear regression instead. The two are similar in that both track a particular response from a set of variables graphically, but nonlinear models are more complicated than linear models because the function is created through a series of assumptions that may stem from trial and error. In simple regression, two variables exist: one is dependent and the other is independent. For example, if X and Y are two variables, we shall have two simple regression equations, i.e., the regression equation of X on Y, taking X as dependent and Y as independent, and the regression equation of Y on X, taking Y as dependent. One thing that must be noted here is that the regression lines cut each other at the point of the averages of X and Y, i.e., if from the point where both regression lines intersect we drop a perpendicular on the X-axis, we get the mean value of X, and if from that point we draw a horizontal line to the Y-axis, we get the mean value of Y.

10.4.2 Multiple Regression

It is rare that a dependent variable is explained by only one variable. In such cases, an analyst uses multiple regression, which attempts to explain a dependent variable using more than one independent variable. Multiple regression can be linear or nonlinear.
Page 164 :
Multiple regression is based on the assumption that there is a linear relationship between the dependent variable and each of the independent variables. It also assumes no major correlation between the independent variables.

Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the values of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes the predictor, explanatory or regressor variables).

For example, you could use multiple regression to understand whether exam performance can be predicted on the basis of revision time, test anxiety, lecture attendance and gender. Alternatively, you could use multiple regression to understand whether daily cigarette consumption can be predicted on the basis of smoking duration, age when smoking started, smoker type, income and gender. Multiple regression also allows you to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance explained. For example, you might want to know how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance and gender "as a whole", but also the "relative contribution" of each independent variable in explaining the variance.
Page 165 :
using multiple regression. You need to do this because it is only appropriate to use multiple, regression if your data assumptions that are required for multiple regression to give you a, valid results. These assumptions are:, •, , Dependent variable should be measured on a continuous scale (i.e., it is either, an interval or ratio variable). Examples of variables that meet this criterion include, revision time (measured in hours), intelligence (measured using IQ score), exam, performance (measured from 0 to 100), weight (measured in kg) etc., , •, , Two or more independent variables, which can be either continuous (i.e.,, an interval or ratio variable) or categorical (i.e., an ordinal or nominal variable)., Examples of nominal variables include gender (e.g., 2 groups: male and female),, ethnicity (e.g., 3 groups: Caucasian, African American and Hispanic), physical activity, level (e.g., 4 groups: sedentary, low, moderate and high), profession (e.g., 5 groups:, surgeon, doctor, nurse, dentist, therapist), and so forth., , •, , Independence of observations i.e., independence of residuals., , •, , There needs to be a linear relationship between the dependent variable and each, of independent variables., , •, , Data needs to show homoscedasticity, which is where the variances along the line, of best fit remain similar as we move along the line., , •, , Data must be free from multi collinearity, which occurs when you have two or, more independent variables that are highly correlated with each other., , •, , The residuals (errors) are approximately normally distributed., , 10.4.2.3 Multiple Regression Equation, The multiple regression equation describes the average relationship between these variables, and this relationship is used to predict or control the dependent variable., A regression equation is an equation for estimating a dependent variable, Say X1 from the, independent variables say X2, X3, .....and is called a regression equation of X1 on X2 and, X3,....In functional notation, this is sometimes written briefly as X1 = F (X2, X3, X4....), read as “X1 is a function of X2, X3 and so on”., , 165
Page 169 :
10.5 SUMMARY

This lesson is designed to discuss: 1. multiple correlation, which is the study of the joint effect of a group of two or more variables on a single variable not included in that group; 2. the estimate obtained by the regression equation of that variable on the other variables; 3. the limits of the multiple correlation coefficient, which lies between 0 and +1; 4. numerical problems on the multiple correlation coefficient; and 5. the properties of the multiple correlation coefficient. Generally, a large number of factors simultaneously influence all social and natural phenomena. Therefore, correlation and regression studies aim at studying the effects of a large number of factors on one another. Multiple correlation measures the extent up to which the dependent variable X1 is influenced by a group of independent variables, say X2, X3, ..., Xn. On the other hand, partial correlation provides a measure of the relationship between X1 and X2 after eliminating the effect of the other variables. Further, this lesson provides the fundamentals of multiple regression. Broadly speaking, the fitting of any chosen mathematical function to given data is termed regression analysis. The method of least squares is used to estimate the unknown parameters. Therefore, regression is a technique for establishing relationships between variables from the given data.

10.6 GLOSSARY

•  Correlation: Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. For example, height and weight are related; taller people tend to be heavier than shorter people.

•  Regression: Regression is a statistical measurement used in finance, investing and other disciplines that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables (known as independent variables).

•  Partial Correlation: Partial correlation is the measure of association between two variables, while controlling for or adjusting the effect of one or more additional variables.

•  Multiple Correlation: The coefficient of multiple correlation is a measure of how well a given variable can be predicted using a linear function of a set of other variables.
Page 170 :
•  Multiple Regression: Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the values of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes the outcome, target or criterion variable).

•  Regression Coefficients: Regression coefficients are estimates of the unknown population parameters and describe the relationship between a predictor variable and the response. In linear regression, coefficients are the values that multiply the predictor values.

•  Coefficient of Determination: The coefficient of determination, denoted R² or r² and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

10.7 SELF ASSESSMENT QUESTIONS

Fill in the blanks:

1. ________ measures the strength of the linear relationship between the dependent and the independent variable.

2. The correlation coefficient may assume any value between ________ and ________.

3. While the range for r² is between 0 and 1, the range for r is between __________.

True/False:

1. The dependent variable is the variable that is being described, predicted, or controlled.  T/F

2. A simple linear regression model is an equation that describes the straight-line relationship between a dependent variable and an independent variable.  T/F

3. If r = −1, then we can conclude that there is a perfect relationship between X and Y.  T/F

10.8 LESSON END EXERCISE

1. Distinguish between partial and multiple correlation by giving suitable examples.
   ___________________________________________________________
Page 172 :
   ___________________________________________________________
   ___________________________________________________________

6. Find the partial correlation coefficient between the first and second variables from the following data: r12 = 0.40; r23 = 0.60; r13 = 0.70.
   ___________________________________________________________
   ___________________________________________________________
   ___________________________________________________________

10.9 SUGGESTED READING

•  Gupta, S.P.: Statistical Methods, Sultan Chand & Sons, New Delhi.
•  Gupta, S.C. and V.K. Kapoor: Fundamentals of Applied Statistics.
•  Levin, Richard and David S. Rubin: Statistics for Management, Prentice Hall, Delhi.
•  Levin and Brevson: Business Statistics, Pearson Education, New Delhi.
•  Lawrence B. Morse: Statistics for Business and Economics, Harper Collins.
•  McClave, Benson and Sincich: Statistics for Business and Economics, Eleventh Edition, Prentice Hall Publication.