REGENTS' TESTING PROGRAM

DESCRIPTION AND DOCUMENTATION



Kathleen Burk
Anne R. Fitzpatrick





© 1982, Regents' Testing Program



Note that this document was produced in 1982 and does not necessarily include current information. Its purpose is to provide information on the development and initial validation of the Regents' Test.


Table of Contents


List of Tables

List of Figures

I. OVERVIEW OF THE REGENTS' TESTING PROGRAM
The Content of the Regents' Test

Administration and Scoring of the Regents' Test

II. DEVELOPMENT AND VALIDATION OF THE REGENTS' TEST
History of the Regents' Testing Program

Procedures Used for Test Development and Validation

Development and Validation of the Reading Test
Development of Test Specifications
Procedures for Passage Selection and Item Writing
The Passing Score Set on the Reading Test
Evidence of Content Validity
Other Evidence of Validity
Development and Validation of the Essay Test
Development of Test Procedure
Procedure for Selecting Essay Topics
Rationale for Essay Scoring Standards
Evidence of Content Validity
Other Evidence of Validity

III. ADDITIONAL TECHNICAL INFORMATION
Reliability of Reading Test Scores

Analysis of Reading Test Items

Equating of Reading Test Forms

Reliability of Essay Test Scores

IV. REGENTS' TEST RESULTS FROM 1972 TO 1982



References

Appendix A
Statement of Policy by the Board of Regents
Concerning the Regents' Testing Program



Appendix B
List of Members of the Testing Subcommittee
of the Academic Committee on English and Members of
the Committee on the Regents' Reading Test
                          List of Tables


Table

  1    Skill Categories of the Regents' Reading
       Test

  2    Findings from Two Studies on the Correlations
       Between the Regents' Reading Test and Selected
       Academic Variables

  3    Appraisal of Speededness in Recent Administrations
       of the Reading Test

  4   Correlations between the Difficulties of Reading
      Form F Items for Samples of Students of a Given
      Ability Within Different Types of Institutions

  5   Items from Form 15 of the Regents' Reading Test
      That Were Relatively More Difficult for Black
      Students than for White Students

  6   Correlations between the Regents' Reading Test
      and Selected Academic Variables Within Five
      Different Institutions

  7   Means and Standard Deviations of Scores Obtained
      on the Regents' Reading Test and on Selected
      Academic Variables by Students at Five
      Different Institutions

  8   Mean Analytic Ratings on 22 Components for Regents'
      Essays Given Holistic Ratings of 1, 2, 3, and 4

  9   Mean Analytic Ratings on 22 Components for Regents'
      Essays Given Holistic Ratings of 1 and 2

 10   Percent of "Passing" Analytic Ratings Assigned on
      22 Components to Regents' Essays Holistically
      Graded as 1's and 2's

 11   The Pass Rates Attained on the Regents' Essay
      Test by Students Performing at Different Levels on
      an Objective Writing Test

 12   Findings from Two Studies on the Correlations
      between the Regents' Essay Test and Selected
      Academic Variables

 13   Relations between Scores on the Verbal Section
      of the Scholastic Aptitude Test (SAT-V) and Passing
      Rates on the Regents' Essay Test

 14   Relation between English Grade Point Averages and
      Passing Rates on the Regents' Essay Test

 15   Mean Ratings Assigned by Essay Raters from
      Predominantly Black and Non-Black Institutions to
      Essays Written by Students from Predominantly Black
      and Non-Black Institutions

 16   Correlations between Regents' Essay Test and 
      Selected Academic Variables Within Five
      Different Institutions

 17   Faculty Responses to Questions about the
      Regents' Test

 18   Classification of Repeaters on Two Administrations
      of the Reading Test

 19   KR-20 Reliability Estimates for Form 17 and
      Form 20

 20   Item Analysis Data for the Spring, 1984
      Administration of Form 23

 21   Raw Score to Scaled Score Conversion Table
      for Form 23 of the Regents' Reading Test

 22   Rater Performance Summary Statistics for
      Fall, 1980 through Summer, 1981

 23   Estimates of Rating Reliability for the Essay
      Portion of the Language Skills Examination

 24   Regents' Test Results from 1972 to 1982


                      List of Figures

Figure
  
  1   Mean ratings assigned by essay raters from
      predominantly black and non-black institutions
      to essays written by students from predominantly
      black and non-black institutions






                   OVERVIEW OF THE REGENTS' TESTING PROGRAM


    By a policy statement issued in 1972, the Board of Regents of the 

University System of Georgia instituted the Regents' Testing Program.  As 

described in this statement, the Program serves as one means by which each 

institution in the University System can ensure that students receiving degrees 

from the institution possess "literacy competence," which was defined as 

"certain minimum skills of reading and writing."  The Board of Regents 

identified two specific objectives for the Testing Program:


    (1) "to provide Systemwide information on the status of student         

        competence in the areas of reading and writing; and

    (2) to provide a uniform means of identifying those students who fail to 

        attain the minimum levels of competence in the areas of reading and  

        writing."


    The Regents' Test was developed to satisfy these objectives.  It is 

composed of two components, a reading test and an essay test.  Students' scores 

on the tests are used to determine whether they have the minimum levels of 

reading and writing skills required for graduation.

    According to the Regents' policy, students may be required to take the 

test in the quarter after they have attained 45 hours of degree credit, and 

they must take the test once before they have acquired 60 hours of credit.  If 

a student has not passed both components of the test by the quarter in which 75 

credit hours are acquired, enrollment in remedial courses is required until 

passing status on the two components of the test has been attained.  There is 

no limit on the number of times a student may take remediation and retake the 

test.  The full text of the current Regents' policy is given in Appendix A.
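
    The credit-hour thresholds in this policy can be illustrated with a brief
sketch in Python.  The function name, the status labels, and the treatment of
the thresholds as fixed cutoffs on accumulated hours (ignoring the
quarter-by-quarter timing in the policy) are assumptions made for illustration
only; Appendix A remains the authoritative statement of the policy.

```python
def regents_test_status(credit_hours, passed_reading, passed_essay):
    """Return a coarse status label for a student (hypothetical helper).

    Thresholds follow the policy summary: 45 hours (may be required to test),
    60 hours (must have tested by then), 75 hours (remediation if not passed).
    """
    if passed_reading and passed_essay:
        return "requirement met"
    if credit_hours >= 75:
        # Remediation continues until both components are passed.
        return "remediation required"
    if credit_hours >= 60:
        return "should already have tested"
    if credit_hours >= 45:
        return "eligible to test"
    return "not yet eligible"


print(regents_test_status(50, False, False))
print(regents_test_status(80, True, False))
```

    Because there is no limit on retakes, a student in the "remediation
required" status simply remains there until both components are passed.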

    The paragraphs that follow briefly describe the content of both the

reading and the essay components of the Regents' Test, the manner in which

these tests are administered and scored, and the manner in which students'

scores are reported.


                       The Content of the Regents' Test


The Reading Test

    The Reading Test, which has an administration time of one hour, is a 

60-item, multiple-choice test comprised of ten reading passages and five to 

eight questions about each passage.  The passages usually range from 175 to 325 

words in length, treat topics drawn from a variety of subject areas (social 

science, mathematics and natural science, and humanities), and entail various 

modes of discourse (exposition, narration, and argumentation).  The questions 

that accompany the passages of the Reading Test have been designed to assess 

four major aspects of reading: (1) Vocabulary, (2) Literal Comprehension, (3) 

Inferential Comprehension, and (4) Analysis.  A description of these skills is 

given in Table 1, and a description of the types of items that are used to 

measure each of the skills is available from the faculty member at each

institution who is responsible for reading remediation.  A sample form of the 

Regents' Reading Test, which provides examples of the types of passages and 

items comprising the test, has been distributed to a Regents' Test coordinator 

at each institution. 


                                    Table 1        

                 Skill Categories of the Regents' Reading Test


    Vocabulary: entails identifying the meanings of words as they are used in 
    passages.  The student may use context clues, structural analysis and/or a 
    general understanding of the meaning of the passage to determine the meaning 
    of a word.
    
    Literal Comprehension: entails recognizing information and ideas presented 
    explicitly in passages.  Literal comprehension items require a student to 
    recognize (1) details or facts, (2) a sequence of events, (3) a comparative 
    relationship, (4) a cause and effect relationship, or (5) the referent for 
    which a word or group of words has been substituted in a passage.  
    
    Inferential Comprehension:  entails synthesizing and interpreting material 
    that is presented in a passage.  Inferential comprehension items involve the 
    following skills:  (1) identifying the main idea of a passage or paragraph, 
    (2) inductive reasoning, (3) deductive reasoning, and (4) interpretation of 
    figurative or other language.
    
    Analysis: is concerned with how or why a passage is written rather than what 
    a passage is about. In general, analysis items require inferences to be made 
    about the style, purpose, or organization of a passage.   

    
                              Test Specifications

    The test consists of ten passages with five to eight items for each passage.  
    In all, there are sixty items on the test.  The categories of Vocabulary, 
    Literal Comprehension, and Analysis are each assessed by twelve to fourteen 
    items.  There are twenty to twenty-four items for the Inferential 
    Comprehension category.  
    
    Passages on the test are from textbooks, literary works, magazines, 
    newspapers, and other written material that, in the judgment of committee 
    members, all students receiving college degrees should be able to comprehend.
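
    The numerical specifications above lend themselves to a mechanical check.
    The following Python sketch is an illustration only: the function name,
    the input layout (a list of item counts per passage and a list of category
    labels per item), and the wording of the messages are assumptions, not
    part of the Program's procedures.

```python
from collections import Counter

# Allowed item counts per skill category, taken from the specifications above.
CATEGORY_RANGES = {
    "Vocabulary": (12, 14),
    "Literal Comprehension": (12, 14),
    "Inferential Comprehension": (20, 24),
    "Analysis": (12, 14),
}


def check_form(items_per_passage, item_categories):
    """Return a list of specification violations (empty if the form conforms)."""
    problems = []
    if len(items_per_passage) != 10:
        problems.append("form must have ten passages")
    if any(not 5 <= n <= 8 for n in items_per_passage):
        problems.append("each passage needs five to eight items")
    if sum(items_per_passage) != 60 or len(item_categories) != 60:
        problems.append("form must have sixty items")
    counts = Counter(item_categories)
    for category, (low, high) in CATEGORY_RANGES.items():
        if not low <= counts[category] <= high:
            problems.append(
                f"{category}: {counts[category]} items, expected {low}-{high}"
            )
    return problems


# A hypothetical form layout that satisfies every constraint.
categories = (
    ["Vocabulary"] * 13
    + ["Literal Comprehension"] * 13
    + ["Inferential Comprehension"] * 22
    + ["Analysis"] * 12
)
print(check_form([6] * 10, categories))
```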



The Essay Test 

    Students who take the Essay Test have one hour in which to choose and 

write on one of two topics that are given.  A partial list of the topics that 

have been used on past forms of the Essay Test is provided in the Regents' 

Testing Program Essay Scoring Manual, which has been distributed to all 

institutions in the System. 

    Students taking the Essay Test are given the following directions:

         Organization of your essay is important.  Think toward a 
         good thesis sentence, some specific supporting points, 
         and a definite conclusion.  In general, passing the 
         essay will require that you (1) state and develop a 
         central idea; (2) have an organization which is
         indicative of an overall plan; (3) deal with the assigned 
         topic; and (4) avoid serious errors in diction, sentence 
         structure, and paragraph development.



                Administration and Scoring of the Regents' Test


Administration

   Each quarter, during a two-day testing period specified by the Regents' 

Testing Program office, the Regents' Test is administered to eligible students 

at all institutions in the University System.  Just before the testing period, 

the Regents' Testing Program office sends to the Regents' Test Coordinator at 

each institution the test materials that are needed.  Because each institution 

is responsible for its own test administrations, the Test Coordinator oversees 

the distribution of these materials and arranges for supervisors and proctors 

to administer the test.  An Administration Manual that is provided by the 

Regents' Testing Program office details the testing procedures that are to be 

followed so that all test administrations are standardized.  Administration 

sites are also monitored periodically by staff from the Regents' Testing 

Program office to ensure that the standardized procedures are followed at each 

institution.  After the last test administration at an institution, all testing 

materials are returned to the Regents' Testing Program office so that the 

students' test responses can be scored.



Scoring the Reading Test

    Students' responses to the items of the Reading Test are recorded on 

machine-readable answer sheets so that these responses can be read and scored 

by computer.  A standard score is used to describe the Reading Test 

performance of each examinee.  This score is derived by translating the 

student's total raw (number-right) score on the test to a Rasch score scale 

with a range from 0 to 99.  Whether the student has met the minimum 

requirements established for reading is determined by comparing this 

translated score to the passing score that has been set for the Reading Test.



Scoring the Essay Test

    The essays to be scored are distributed by the Regents' Testing Program 

office among six scoring centers in the state.  All institutions in the System 

send representatives to be raters at the nearest scoring center.  The number 

of raters sent by each institution is determined by the ratio of its sophomore 

enrollment to the sophomore enrollment of the entire System.  As each essay is 

identified only by the student's social security number, the essay raters do 

not know the identity or the institution of the students whose papers are 

graded.  

    Each essay is graded independently by three raters who use a holistic 

procedure to assign ratings to the essay.  When rating the essays, raters use 

a four-point scale.  A "4" on the scale indicates superior performance, a "3" 

clearly passing performance, a "2" barely passing performance, and a "1" 

substandard or failing performance.  Model essays define the four points of the

rating scale by indicating the meaning of the division points (i.e., 4/3, 3/2, 

2/1) between the ratings on the scale:


                      Ratings:     4    3     2     1
                               ------|-----|-----|------
                      Models:       4/3   3/2   2/1  



One model essay is used to represent each division point.  An essay that is 

judged to be better than the 4/3 model is given a "4"; an essay judged to be 

better than the 3/2 model but not as good as the 4/3 model is given a "3"; an 

essay judged to be better than the 2/1 model but not as good as the 3/2 model is given a 

"2"; and an essay judged to be poorer than the 2/1 model is given a "1."  The 

set of standard model essays used to define the division points on the scale 

is included in the Description of Essay Scoring Procedures, which is provided 

in the Regents' Testing Program Essay Scoring Manual.  Also included in this 

description are analyses of the model essays, definitions of the four score 

levels used as the basis for selecting model essays, and answers to questions 

that raters frequently ask about the procedures for scoring the Essay Test.  

These materials are provided to all raters before each quarterly scoring 

session.  For raters who are grading essays for the first time, additional 

information and samples of essays that have been graded are provided in the 

Essay Scoring Manual.

     The final score assigned to an essay is usually the rating on which at 

least two out of three raters agree.  When there is no agreement among the 

raters, the final score is the middle rating of the three assigned to the 

essay.  One consequence of this scoring procedure is that no essay can receive 

a failing grade unless at least two of three raters have given it a failing 

grade.  Further description of the essay scoring procedure is provided in the 

Essay Scoring Manual. 
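
    The rule for combining the three ratings can be sketched in Python as
follows.  The function names are hypothetical, and the sketch covers only the
usual case described above; any exceptions documented in the Essay Scoring
Manual are not represented.

```python
from collections import Counter


def final_essay_score(ratings):
    """Combine three holistic ratings (1-4) into a final essay score."""
    if len(ratings) != 3 or any(r not in (1, 2, 3, 4) for r in ratings):
        raise ValueError("expected three ratings on the 1-4 scale")
    value, count = Counter(ratings).most_common(1)[0]
    if count >= 2:
        # At least two of the three raters agree on this rating.
        return value
    # No agreement: take the middle of the three ratings.
    return sorted(ratings)[1]


def passed_essay(ratings):
    """A final score of 2 or higher is passing; a 1 is failing."""
    return final_essay_score(ratings) >= 2


print(final_essay_score([1, 2, 4]))  # no agreement, middle rating
print(passed_essay([1, 1, 4]))       # two failing ratings
```

    Both branches return the median of the three ratings, which is why an
essay cannot fail unless at least two raters have assigned it a "1."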

    As is indicated in the Regents' policy, given in Appendix A, a student 

may request a formal review of a failing essay if (1) there is one passing 

score among the three grades the essay was assigned and (2) the student has 

completed all English composition courses required by the institution.  The 

review is initiated on the student's campus.  If the student's appeal is 

sustained, the essay is sent to the Regents' Testing Program office to be 

rescored by a systemwide review panel.
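
    The two conditions for requesting a review can be expressed as a simple
check.  The Python sketch below is illustrative: the function name and
arguments are hypothetical, and it assumes that a rating of 2 or higher is
passing, as on the four-point scale described earlier.

```python
def may_request_review(ratings, completed_required_composition):
    """Return True if a failing essay meets both review conditions."""
    if len(ratings) != 3:
        raise ValueError("expected three ratings")
    passing = [r for r in ratings if r >= 2]
    # Exactly one passing rating among three means the essay failed
    # (two raters gave a 1) but one rater judged it passing.
    return len(passing) == 1 and completed_required_composition


print(may_request_review([1, 1, 2], True))
print(may_request_review([1, 1, 1], True))
```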



Score Reporting

    Within the three-week period following a quarterly administration of the 

Regents' Test, each institution in the University System is issued a Report of 

Results.  In an institution's Report, data are provided that describe the test 

performance of each student from the institution who participated in the 

quarterly administration of the Regents' Test.  Also provided to each

institution is an Institutional Summary Report, which includes the following

information:  a summary of the performance of the institution's examinees on the 

Reading Test and the Essay Test for first-time examinees, repeaters, and these 

two groups combined; a description of the institution's performance on each 

skill category of the Reading Test; and, to facilitate year-to-year

comparisons, an historical summary of results for first-time examinees and repeaters.  

To allow comparisons with similar institutions, various statistics are reported 

by institutional type (university, senior college, and junior college).  Also 

provided is a report of the test performance of students at each institution 

in the System.

    Personnel at each institution are responsible for reporting scores to 

individual students.



                                 Chapter II




                 DEVELOPMENT AND VALIDATION OF THE REGENTS' TEST


    Given that the primary purpose of the Regents' Testing Program is to

appraise students' reading and writing skills, the procedures used to develop

the Regents' Test are central to determination of its validity. Because

development and validation of this test are highly inter-related, both of these

issues are discussed in this chapter.  Prior to this discussion, a brief history

of the Testing Program is given to explain the rationale underlying its

inception.



 

                    HISTORY OF THE REGENTS' TESTING PROGRAM


    In the middle of the 1960's, the University System of Georgia defined a

core curriculum that all students attending the academic institutions of the

System were henceforth expected to complete.  The requirements pertaining to the

core were somewhat general in nature, since they identified only the types of

courses (e.g., "literature" and "composition") that students would have to take

in order to complete a specified number of credit hours in each of four areas: 

humanities, social studies, science/mathematics, and the major subject.

    In the late 1960's, well before the subject of accountability became a

national preoccupation, the Chancellor of the University System of Georgia

expressed interest in ascertaining what skills had been acquired by those

students who had participated in the core curriculum (Johnson, 1980).  Thus,

during the 1968 - 1969 academic year, samples of students were administered the

College-Level Examination Program (CLEP) tests in the interest of measuring their

level of skill in the three core areas of Humanities, Social Studies, and

Mathematics/Science.  In the spring of 1970, the Survey of College Achievement

was administered in lieu of the CLEP because it covered the same subject areas 

and yet required less time to administer.

    The notion of testing only reading and writing skills was born out of 

deliberations between the Chancellor and the presidents of institutions in the

Georgia System during the spring of 1970.  In light of findings that suggested a

statewide as well as national decline in the level of college students'

abilities to read and write, the Chancellor and these administrators deemed that

college students' proficiency in these skills should be the focus of a statewide

testing program.  It was also concluded that skills gained from core curriculum

courses would be difficult to define on a systemwide basis since students could

satisfy the core requirements in mathematics, science, and social studies

through a number of different courses.

    Reading and writing skills were judged to fall within the province of the

Academic Committee on English, which is an advisory committee comprised of

English department heads from the 33 institutions of the University System.  A

Testing Subcommittee was appointed by the Academic Committee on English to work

with testing experts on the development of an appropriate test that could be

administered experimentally in the spring of 1971.

    Concerned with devising a test that could be administered and scored

efficiently and inexpensively, the Subcommittee and testing experts agreed that

multiple-choice items could be used to assess students' reading comprehension

and some of their writing skills, namely, their knowledge of grammar and word

usage.  However, the Subcommittee also determined that a writing sample should

be required as part of the writing skills tests; it was believed that students'

ability to organize and express their ideas was important to measure and that

this ability could be validly appraised only by such a measure (Johnson, 1980).

    In light of these views, the Testing Subcommittee and testing experts

devised an experimental version of the test that consisted of three parts:  a

multiple-choice test of reading comprehension, a multiple-choice test of writing

(grammar and usage), and an essay test. The items comprising the

multiple-choice reading and writing components of this test were drawn from

retired forms of the Sequential Test of Educational Progress I and the Cooperative

English Test through a lease agreement with the Educational Testing Service. 

The essay topic assigned to each form of the test was selected and approved by

the Academic Committee on English.  Examinees were to be given 30 minutes to

write on the selected essay topic as well as 30 minutes to work on each of the

two multiple-choice components.  The experimental version of the test, which was

called the Language Skills Test of the University System Junior Testing Program,

was administered to samples of students during the spring of 1971.  The test was

then formally administered systemwide in the winter of 1972 to the System's

6,500 "rising juniors," who had between 60 and 75 hours of college credit.

    As the Junior Testing Program was being developed, the Board of Regents was 

formulating a policy statement about the testing program, which it called the

University System Regents' Testing Program.  This statement, issued in 1972,

described the purposes and procedures of the Program.  As noted in Chapter I,

the specified purpose of the Program was to serve as one means by which each

institution in the University System could ensure that students receiving

degrees from the institution possessed "literacy competence," which was defined as

"certain minimum skills of reading and writing."  Undergraduate students who

were enrolled in degree programs and had acquired between 60 and 75 quarter

hours of degree credit (i.e., "rising juniors") were required to take the test

to demonstrate competence in reading and writing.  Satisfactory performance on

the Language Skills Test was evidence of competence, but institutions could use

other methods to certify the competence of students who failed this test. 

    Since 1972, some modifications have been made in the Board of Regents'

policy and in the content of the Language Skills Test, but the procedure of

using this test to assess students' basic reading and writing skills has been

routinely carried out once during each quarter of the school year since 1972. 

In 1973, after the test had been administered for several quarters, passing the

reading component and one of the two writing components of the Language Skills

Test became a requirement for graduation.  In 1974, the objective writing

component was dropped from the testing program so that, since that time, passing both

the reading and the essay components of the test has been a graduation

requirement.  In 1979, the Regents established the current eligibility

requirements for the test, which specify that students may be required to take the test

in the quarter after they have attained 45 hours of degree credit, and that they

must take the Language Skills Test before they have acquired 60 hours of credit. 

If a student has not passed one or both components of the test by the quarter in

which 75 hours have been acquired, enrollment in remedial courses is required

until passing status on all components of the test has been attained (see

Appendix A for the full text of the Regents' policy).                         

    With respect to the content of the test, in 1974 it was decided that

students taking the Essay Test should be given a choice between two topics on

which to write and, by 1978, the time limits for both the Reading and the Essay

Test were extended to one hour.  Also in 1974, responsibility for developing the

reading items was given to the Testing Subcommittee because the lease on the

item pool from ETS had expired.  Subsequently, in Winter, 1982, this

responsibility was consigned to a joint committee that consisted of the Testing

Subcommittee and the Committee on the Regents' Reading Test.  This joint

committee and its activities are described in a section that follows.



                                 
                           


              PROCEDURES USED FOR TEST DEVELOPMENT AND VALIDATION


    Development and validation of the Regents' Test are properly discussed

together because the validity of this test rests primarily on its content

validity, which was established in the course of developing the test.  Validity

refers to the degree to which a test provides information that is relevant to

the particular descriptions or decisions that are to be made using the scores of

the test (Hambleton, 1980a; Thorndike & Hagen, 1977).  Traditionally, several

kinds of validity have been defined by test specialists.  These different kinds

of validity refer to the relevance of the information provided by a test to

different score interpretations or uses, and they rest on different methods for

establishing this relevance (APA, AERA, & NCME, 1974; Anastasi, 1976). 

For a skills test, it is most important to demonstrate content validity, which

refers to the appropriateness of claiming that the behaviors assessed by a test

represent the behaviors that the test is intended to assess (APA et al., 1974). 

This kind of validity is established not by empirical studies of test scores,

but rather by judgments of the degree to which the items of a test adequately

sample the specified types of behaviors that the test is intended to assess.  A

test developer can claim that a test is content valid when (1) the skills the

test is to assess have been clearly specified, and (2) experts have judged that

these skills are adequately sampled by the items that have been written for the

test (APA et al., 1974). 

    Described in the sections that follow are the procedures that have been

used both to develop forms of the Regents' Reading and Essay Tests and to show

that these tests are valid. 




                                   Section I


  
                Development and Validation of The Reading Test



DEVELOPMENT OF TEST SPECIFICATIONS  

    The initial specifications for the Reading Test were written by the Testing

Subcommittee of the Academic Committee on English prior to the development of

the first forms of this test in Spring, 1971.  As noted above, the items for

these initial forms were to be drawn from a leased pool of items that had been

used by the Educational Testing Service (ETS) on forms of the Sequential Test of

Educational Progress I (STEP I) and the Cooperative English Test.  The

specifications that the Subcommittee adopted resembled those that had been used by

ETS to define what was measured by the STEP I reading test.  The Subcommittee's

specifications indicated that the following categories should be covered by

items of the Reading Test: (1) Reproduce Ideas, (2) Translate Ideas and Make

Inferences, (3) Analyze Motivation, (4) Analyze Presentation, and (5) Criticize

Selection.  These skills were to be assessed as they were in the STEP I, that is, by

items that referred to written passages presented in the Reading Test. 

Vocabulary was later added to this list when the objective writing test, which

had contained some vocabulary items, was dropped from the Regents' Testing

Program in 1974.  The Subcommittee specified that Vocabulary would be assessed

in a separate section of the Reading Test by multiple-choice items that

presented a sentence context for the words to be defined.

    The current specifications for the Reading Test were developed in the

winter of 1982 by a joint committee that was charged with the responsibilities

of (1) evaluating the content of the current Reading Test, and (2) developing

detailed descriptions of the skills the test should measure and of the types of

items that should be used to measure these skills.  As noted above, this joint

committee consisted of the Testing Subcommittee and the Committee on the

Regents' Reading Test.  The members of the reading committee were specialists in

reading on the faculty of University System institutions.  They all were

specially appointed by the presidents of their institutions to participate in

specifying the content for the Regents' Reading Test.  In Appendix B, a list is

provided of both the members of the Reading Committee and the members of the

Testing Subcommittee.

    After reviewing the existing specifications for the Reading Test, the joint

committee suggested that several revisions be made.  Noting that the skills

covered by the Reading Test could be delimited by four rather than six

categories, the joint committee recommended that these categories be designated

(1) Vocabulary, (2) Literal Comprehension, (3) Inferential Comprehension, and

(4) Analysis.  The members of the joint committee then formulated descriptions

of the skills defined by these categories and agreed upon detailed

specifications that described the kinds of items that should be used to measure

these skills.  These skill descriptions are presented in Table 1, and the item

specifications have been distributed to the members of the faculty responsible for

reading remediation at the institutions in the System.  The major revision in

test content made by the committee pertained to the definition of the Vocabulary

category.  It was concluded that Vocabulary would be most appropriately assessed

by items that referred to words presented within the context of a passage; it

therefore was recommended that, like the other aspects of reading, Vocabulary

items refer to content presented in the passages of the Reading Test.

    Literature discussing the process of reading formed the primary basis for

the recommendations made by the joint committee.  In particular, the taxonomy or

classification system formulated by Barrett (1976) was heavily relied on by the

committee in its deliberations about how it would define those reading skills

covered by the Reading Test.  According to Pearson and Johnson (1978), Barrett's

system is the taxonomy that has been most widely used in reading courses and

workshops designed for college-level readers.

    In Barrett's taxonomy, four types of reading skills are defined:  Literal

Comprehension, Inferential Comprehension, Evaluation, and Appreciation.  With

some modification, the joint committee agreed that the first two categories, as

Barrett defined them, described well certain skills of the Reading Test.  The

committee did make two amendments to Barrett's description of literal

comprehension tasks.  In his taxonomy, Barrett indicated that certain questions

about the main idea of a paragraph or passage could be answered using literal

rather than inferential reading skills.  The committee believed that main ideas

would not be explicitly stated in the reading passages likely to appear on the

Reading Test, and so it recommended that the main idea questions posed in the

Test be classified as assessing inferential reading skills.  Also, the committee

suggested that the category of literal comprehension skills should include

comprehension of anaphoric and cataphoric references, which Barrett did not

include in his list of literal comprehension tasks (see Bormuth, 1970).

    With respect to Barrett's Evaluation category, the committee concluded that

this category was generally not applicable to the Reading Test.  According to

Barrett, evaluation requires a student to judge the adequacy and desirability of

passage content in light of the student's knowledge about the passage topic. 

Because evaluation involves not just comprehension of a passage but also know-

ledge of a topic, the committee decided that this category pertained to matters

that should not be assessed by the Reading Test (see Tuinman, 1973-1974).

    Barrett's Appreciation category was thought pertinent to the analytic

skills assessed by items of the Reading Test.  As Barrett defined this skill, it

involves the identification of literary techniques, forms, styles, and

structures employed by an author to evoke an intellectual or emotional response

from the reader.  The committee concluded that Barrett's discussion of this

skill was too narrowly focused on application to narrative and descriptive

literature and so suggested that the broader conceptualization of this skill

specified by Bloom, Madaus, and Hastings (1981) would more adequately describe

the analytic category of the Reading Test.  These three researchers called the

skill "Analysis," and they defined it as the process of decomposing any

communication into constituent parts "to clarify how the communication is

organized and the way in which it manages to convey its effects" (p. 249).

    Finally, with respect to the matter of vocabulary, this skill was not a

category included in Barrett's taxonomy.  Descriptions of this skill given by

Dale, O'Rourke, and Bamman (1971), however, were thought by the committee to

describe well the word-meaning-in-context tasks of the Reading Test.  These

descriptions were therefore adapted by the committee to define the vocabulary 

skills assessed by this test.

    Other specifications developed by the joint committee concerned the content

of the reading passages and the numbers of passages and items to be included in

the Reading Test.  These specifications indicated that the passages used in the

test should be 175 to 325 words in length and should be drawn from textbooks,

literary works, magazines, newspapers, and other written material that, in the

judgment of the committee, college graduates should be able to comprehend. 

Moreover, it was specified that these passages should concern various subjects

of the social sciences, humanities, and natural sciences and mathematics and

that the passages should differ in mode.  In light of the one-hour time limit

established for the test, it was also decided that each form of the test should

comprise ten passages and 60 items, with five to eight items accompanying

each passage.  Finally, the committee members agreed that each skill category

should be assessed by the following numbers of items:


         Vocabulary                             12 - 14 items
         Literal Comprehension                  12 - 14 items
         Inferential Comprehension              20 - 24 items
         Analysis                               12 - 14 items

         
The category of inferential comprehension was assigned the largest number of

items because the committee considered this skill the most central to students'

understanding of the types of passages included in the test.
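
The numerical specifications above lend themselves to a mechanical check of a draft form.  The following is a minimal sketch in Python, not a procedure described in this report; the function name check_form and the representation of a form as a list of (passage, category) pairs are assumptions made for illustration.

```python
from collections import Counter

# Ranges taken from the specifications described above.
CATEGORY_RANGES = {
    "Vocabulary": (12, 14),
    "Literal Comprehension": (12, 14),
    "Inferential Comprehension": (20, 24),
    "Analysis": (12, 14),
}
TOTAL_ITEMS = 60
TOTAL_PASSAGES = 10
ITEMS_PER_PASSAGE = (5, 8)


def check_form(items):
    """Return a list of specification violations for a draft form.

    `items` holds one (passage_id, category) pair per item.  This data
    layout is hypothetical and is used only for illustration.
    """
    problems = []
    if len(items) != TOTAL_ITEMS:
        problems.append(f"expected {TOTAL_ITEMS} items, found {len(items)}")

    # Count items per passage and compare against the 5-to-8 rule.
    per_passage = Counter(passage for passage, _ in items)
    if len(per_passage) != TOTAL_PASSAGES:
        problems.append(
            f"expected {TOTAL_PASSAGES} passages, found {len(per_passage)}"
        )
    low, high = ITEMS_PER_PASSAGE
    for passage, count in sorted(per_passage.items()):
        if not low <= count <= high:
            problems.append(f"passage {passage} has {count} items")

    # Count items per skill category and compare against the ranges.
    per_category = Counter(category for _, category in items)
    for category, (low, high) in CATEGORY_RANGES.items():
        count = per_category.get(category, 0)
        if not low <= count <= high:
            problems.append(f"{category} has {count} items")
    return problems
```

A form with ten six-item passages and 13, 13, 21, and 13 items in the four categories would pass this check with no violations reported.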



PROCEDURES FOR PASSAGE SELECTION AND ITEM WRITING

    The joint committee works in conjunction with staff from the Regents'

Testing Program office to develop new forms of the Reading Test.  Using the

skill and item specifications devised for the test, members of the joint

committee select passages and write items that they regard as suitable for

inclusion in the test.  These materials are then submitted to the Regents'

Testing Program office where the items are reviewed for technical soundness and

edited when necessary.  Subsequently, staff from this office identify and orga-

nize into a test form passages and items that appear to conform to the specifi-

cations concerning the types of passages and items that must appear in the

Reading Test.  Some of the passages and items on each form are new, and some

that have been used on previous forms are used so that the new form can be

equated to past forms of the Reading Test (see Chapter III).  The preliminary

test form is then submitted to the joint committee, which judges whether the

passages and items making up the form conform to the test specifications and

adequately sample the skills that these specifications designate for

measurement.  After a preliminary test form is approved, it becomes a final form and

is used at a regular, quarterly administration of the Regents' Test.  The new

passages and items included on the final form are regarded as experimental, and

students' responses to the new items are examined after the form has been ad-

ministered.  Any items that appear, on the basis of these responses, to be 

flawed are not counted when students' Reading Test scores are calculated.     



THE PASSING SCORE SET ON THE READING TEST  

    As is the case with all standards for competency that are set (see

Hambleton, 1980a; Popham, 1978; Shepard, 1980), the passing score on the Reading

Test was set by judgmental methods.  That is, experts decided on rational

grounds the minimal level of performance that was needed to pass the test.  The

procedures that these experts used to make this judgment and the considerations

that influenced this judgment are described in the paragraphs that follow.    

    After the Reading Test was formally administered for the first time in the

Winter of 1972, the Subcommittee on Testing met to consider what level of

reading performance should be required to pass the test.  A standard score of 51

was tentatively chosen after the Subcommittee had reviewed the reading scores

that students had received at the first administration of the Reading Test.  The

standard score of 51 represented a percentile rank of 10, meaning that 10% of

the students in the Winter 1972 administration received reading scores below 51. 

After performance data from this and two subsequent test administrations were

examined, the cut-score was set at this level; the Subcommittee believed that no

more than 10% of the students in the University System should fail the Reading

Test until more information became available and also concluded that this

cut-off score would effectively serve to identify those students having the most

serious reading problems.

    In 1978, an Ad Hoc Committee on the Regents' Testing Program was convened

to consider, among other issues, the current passing score of 51; of concern was

what seemed to be an inexplicable disparity between the very high pass rates (of

about 98%) that had been recently observed on the Reading Test and the more

moderate pass rates that had been observed on the Essay Test.  After study of

data that had been collected, this committee proposed that the passing score on

the Reading Test gradually be raised to 61, which is the level at which it is

set today.  Some of these data had indicated that a standard score of 58 earned

on the Reading Test was comparable to the level of reading performance that was

required for exit from the remedial reading programs offered in the University

System.  This level of reading performance was considered to be necessary just

for college entry, yet the passing score on the Reading Test was meant to be

indicative of the level of reading proficiency expected of college graduates. 

The Ad Hoc Committee therefore concluded that a passing score that exceeded 58

was essential.  Other data showed (1) the pass rates that would result from the

use of various passing scores above 58, and (2) the relations between students'

scores on the Reading Test and their performance on the Essay Test and the

verbal section of the Scholastic Aptitude Test (SAT).  In light of these data,

the Committee recommended that the passing score be raised from 51 to 59 in the 

Fall Quarter of 1978, and that this score be advanced one point each year until

it reached a level of 61 in the Fall Quarter of 1980.
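
The kind of tabulation the Ad Hoc Committee reviewed, pass rates under alternative passing scores, can be sketched as follows.  This is an illustration only: the scores listed are invented, and the rule that a score at or above the cut passes is an assumption inferred from the description of failing scores as those below the cut.

```python
def pass_rate(scores, cut_score):
    """Percentage of examinees scoring at or above the cut score."""
    passed = sum(1 for score in scores if score >= cut_score)
    return 100.0 * passed / len(scores)


# Hypothetical standard scores, invented for illustration only.
example_scores = [48, 52, 57, 58, 59, 60, 61, 63, 66, 70]

# Candidate passing scores: the original cut and the 1978-1980 schedule.
for cut in (51, 59, 60, 61):
    print(cut, pass_rate(example_scores, cut))
```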



EVIDENCE OF CONTENT VALIDITY

    To understand the primary process used to establish the validity of the

Reading Test, it is useful to consider Anastasi's (1976) explanation of the

process of content validation, which describes fully the manner in which content

validity is established.  In her textbook on psychological testing, Anastasi

wrote:

         Content validity is built into a test from the outset through the 
         choice of appropriate items.  For educational tests, the preparation of 
         items is preceded by a thorough and systematic examination of relevant 
         course syllabi and textbooks, as well as by consultation with subject 
         matter experts.  On the basis of the information gathered, test
         specifications are drawn up for item writers.  These specifications 
         should show the content areas or topics to be covered, the instruc-
         tional objectives or processes to be tested, and the relative impor-
         tance of individual topics and processes.  On this basis, the number of 
         items of each kind to be prepared on each topic can be established.  
         (pp. 135-136)


    As a previous section shows, the procedures that have been used to develop

the Regents' Reading Test are like those that Anastasi described.  The

specifications written for the test were devised by a joint committee of experts

in the subjects of reading and English.  To formulate these specifications,

these experts consulted literature that treated the process of reading and

selected from that literature skill descriptions that appeared to most clearly

and completely describe the skills to be assessed by the Reading Test.  The test

specifications drawn up by these experts not only described the skills to be

covered by the Reading Test and each skill's relative importance but also

provided detailed descriptions of the kinds of items that should be written to

assess each of the skills covered by the test.  As Cronbach (1971) has noted,

when a test's specifications are given in detail, a guide to item writers is

provided that greatly enhances the chance that appropriate items will be written

for the test.

    The passages that are selected and items that are written on the basis of

the test specifications are subsequently appraised first by testing experts for

technical soundness and then by the joint committee of English and reading

experts, which considers their conformity to the test specifications and the

appropriateness of their content.  Any items that appear flawed in some way are

revised during this review so that the final test form will represent the

test specifications well and, hence, can be regarded as content valid.



OTHER EVIDENCE OF VALIDITY


Relation between Reading Test Scores and Another Measure of Reading Skills


    Although the validity of the Reading Test is largely determined by its

content validity, additional evidence of the test's validity can be gained by

examining the correlation between the Reading Test and another measure of

students' reading skills.  The finding that these two measures correlate well

suggests that the two measures assess the same characteristic and, hence,

supports the claim that students' scores on the Reading Test can be validly

regarded as a measure of their reading skills (see Campbell, 1964; Campbell &

Fiske, 1959).

    In Summer, 1982, the Regents' Testing Program office conducted a study of

the relation between students' scores on Form 20 of the Regents' Reading Test

and their scores on Reading Form 1A of the Sequential Tests of Educational

Progress (STEP II).  Reading Form 1A is part of the STEP II battery of

achievement tests that have a level of difficulty appropriate for college

freshmen and sophomores.  The Reading Form has two parts: a 30-item Vocabulary

Section, which presents single-sentence contexts for the words to be tested,

and a 30-item Reading Section, which is composed of passages and questions

about these passages.  Thus, the STEP II Reading Form differs slightly from the

Regents' Reading Test in that the vocabulary words are presented in a separate

section rather than within the context of the passages to be read.  However,

the STEP vocabulary items resemble items of the Reading Test in that context

clues or inferential skills must be used to answer the items correctly.

    A total of 116 students from three junior colleges and two universities

participated in the Regents' Testing Program study.  These students were en-

rolled either in remedial reading courses or in required English courses at

their institutions.  They were administered the STEP Reading Form in their

classes.  The Regents' Reading Test was also administered in class to those

students who did not take the Regents' Test at the regular Summer quarter

administration.

    The results of this study were as follows.  The mean and standard deviation

of the students' scores on the STEP Reading Test were x̄ = 29.44 and

s = 10.18, respectively.  On the Regents' Reading Test, the mean and standard

deviation of these students' scores were x̄ = 39.36 and s = 8.9.  The

correlation between the total scores on the two measures was .82.  A raw score

of 38 on Form 20 of the Regents' Reading Test is needed for passing, and it was

found that this level of performance was predicted by a score on the STEP

Reading Test that corresponded to the 15th percentile, where this percentile

was calculated on the basis of normative data on college sophomores collected

for the STEP.
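
The report does not state how the STEP score corresponding to the passing level was obtained.  One common approach, assumed here purely for illustration, is the least-squares line predicting Regents' scores from STEP scores, which can be computed from the summary statistics reported above.  The STEP norms table is not reproduced in this report, so no conversion to a percentile is attempted.

```python
def regression_line(mean_x, sd_x, mean_y, sd_y, r):
    """Slope and intercept of the least-squares line predicting y from x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Summary statistics reported for the Summer 1982 study.
STEP_MEAN, STEP_SD = 29.44, 10.18
REGENTS_MEAN, REGENTS_SD = 39.36, 8.9
R = 0.82
PASSING_RAW_SCORE = 38  # raw passing score on Form 20

slope, intercept = regression_line(
    STEP_MEAN, STEP_SD, REGENTS_MEAN, REGENTS_SD, R
)

# STEP score whose predicted Regents' score equals the passing score
# (an illustrative calculation, not a figure taken from the report).
step_score_at_cut = (PASSING_RAW_SCORE - intercept) / slope
print(round(slope, 3), round(intercept, 2), round(step_score_at_cut, 1))
```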

    The correlation between the Reading Test and STEP Reading Form is quite

high, which suggests that the skills assessed by the Reading Test are quite

similar to those measured by the STEP Reading Form.  This finding lends

considerable support to the claim that students' scores on the test reflect

their levels of reading skill.



Relation between Reading Test Scores and Selected Academic Variables

    Additional evidence of the validity of the Reading Test can be gained by

examining the relation between individuals' scores on the test and their scores

on other variables of interest.  A finding that the test correlates in the

expected manner with other variables presumed to be related to reading skill

lends further support to the claim that scores on the test can be

validly regarded as a measure of individuals' reading levels.

    Hickman (1973) and Prather and Smith (1975) examined the relation of

students' Reading Test scores to their high school and college grade point

averages and to their scores on the verbal and mathematical sections of the

Scholastic Aptitude Test.  The results of these two studies are presented in

Table 2.



                                         Table 2

                      Findings from Two Studies on the Correlations
            between the Regents' Reading Test and Selected Academic Variables

        
___________________________________________________________________________
                                               Hickman          Prather &
          Academic Variables                    (1973)*       Smith (1975)**
        
___________________________________________________________________________
          Aptitude Measures
          Scholastic Aptitude Test (Verbal)      .75               .59
          Scholastic Aptitude Test (Math)        .57               .37
    
          Grade Point Averages
          Cumulative - High School               .21               .09
          Cumulative - College                   .34               .47
          Freshman - College                     .28               .27
          English Composition - College          .21               .17
          
      
        
___________________________________________________________________________
    
           *Correlations based on 684 to 906 students attending five               
            different institutions in the University System of Georgia.
    
          **Correlations based on 1910 students attending one university in the 
            University System of Georgia.
    
                                         
    
    As is shown in the table, although different kinds of samples were used in

the two studies, similar correlations were obtained.  In both studies, the

correlation between students' Reading scores and their performance on the

verbal section of the Scholastic Aptitude Test (SAT-V) was substantial and

considerably greater than the correlation between these scores and students'

grade point averages.  Less substantial was the correlation between the Reading

Test and the mathematical section of the Scholastic Aptitude Test (SAT-M). 

Such findings are reasonable to expect.  Both the SAT-V and the SAT-M are

measures of abilities developed in the course of schooling.  It is likely that

these abilities are directly affected by the effectiveness with which written

materials can be read and comprehended.  Thus, the correlations between these

measures and the Reading Test obtain.  Of course, the relation between the

SAT-V and the Reading Test should be stronger than that between the SAT-M and

this test, because reading and verbal reasoning skills are more closely

inter-related than are reading and quantitative reasoning skills.  The weaker

and relatively low correlations between reading performance and students' grade

point averages that are shown in Table 2 also should be expected: although

reading skill might have a bearing on course performance, this influence is

unlikely to be strong because many factors other than reading skill affect

students' grade point averages (see Cronbach, 1971).



Relation between Reading Test Scores and Irrelevant Variables

    A claim of test validity is supported not only by findings that the test

correlates in the expected manner with other variables, but also by findings

that performance on the test does not relate to variables that are presumed to

be irrelevant to the skills assessed by the test (see Campbell, 1964).  For a

skills test like the Regents' Reading Test, there are two other variables that

are routinely examined in terms of their relations to individuals' scores on

the test.  These variables are (1) the speed with which examinees can respond

to questions posed on the test, and (2) bias due to the influence of examinees'

ethnic background.  Data have been collected to assess the relation of these

two variables to Reading Test performance; these data are described in the

paragraphs that follow.



SPEEDEDNESS.  There is much evidence to suggest that response speed and re-

sponse power are best regarded as different attributes (Kendall, 1964;

Terranova, 1972).  The Regents' Reading Test has been designed primarily as a

measure of reading power rather than reading speed.  That is, it is intended

that examinees' scores on the test will reflect the accuracy, not the rate,

with which they can read the passages and respond to the items comprising the

test.  For tests of skill such as the Reading Test, the rate of response should

not greatly influence individuals' scores on the test:  on some tests response

rate has been found to be more closely associated with irrelevant examinee

qualities like temperament than with the ability to respond correctly

(Himmelweit, 1946; Wesman, 1960).

    There are several ways to appraise the speededness of a test (see Donlon,

1978; Marahnich, 1980; Rindler, 1979).  Ideally, it is appraised by analyzing

the test performance of examinees who have been administered parallel forms of

a test under both timed and untimed test conditions (see Cronbach & Warrington,

1951).  If the test is unspeeded, one should find that the additional time

given to examinees effects no change in their relative test performances.

    Such complex testing procedures often are not feasible to carry out.  As a 

consequence, speededness is commonly estimated on the basis of data gathered

from a single test administration even though this simpler approach is re-

cognized as somewhat inadequate (Rindler, 1979).  Using the single test admini-

stration approach, the Educational Testing Service (ETS) has developed a set of

practical criteria that are used to make preliminary judgments about the

speededness of their tests.  According to Swineford (1974), ETS regards a test

as possibly speeded when (1) fewer than 100% of the examinees reach 75% of the

test, and (2) fewer than 100% of the items are reached by 80% of the examinees.

These criteria have been noted by Swineford to be somewhat arbitrary, but it is

thought that these criteria function well as signals of potential speededness. 

Donlon (1978) has noted that when a test meets these preliminary criteria it is

unlikely that the test is speeded.
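
The two screening statistics just described can be computed from a record of the last item each examinee reached.  The sketch below follows the wording of the criteria as stated in this section; the function name, the input format, and the rounding of the 75% point up to a whole item are assumptions made for illustration and are not drawn from ETS documentation.

```python
import math


def speededness_summary(last_item_reached, n_items):
    """Compute the two screening statistics for one test administration.

    `last_item_reached` gives, for each examinee, the number of the last
    item reached (1-based).  This input format is hypothetical.
    """
    n_examinees = len(last_item_reached)

    # Criterion 1: examinees reaching at least 75% of the items.
    threshold = math.ceil(0.75 * n_items)
    reaching_75 = sum(1 for r in last_item_reached if r >= threshold)

    # Criterion 2: items reached by at least 80% of the examinees
    # (integer arithmetic avoids floating-point comparison problems).
    items_reached_by_80 = sum(
        1
        for item in range(1, n_items + 1)
        if 100 * sum(1 for r in last_item_reached if r >= item)
        >= 80 * n_examinees
    )

    return {
        "pct_examinees_reaching_75pct": 100.0 * reaching_75 / n_examinees,
        "pct_items_reached_by_80pct": 100.0 * items_reached_by_80 / n_items,
        # Flagged only when both conditions hold, as the text states them.
        "possibly_speeded": reaching_75 < n_examinees
        and items_reached_by_80 < n_items,
    }
```

For example, if seven of ten examinees finish a 60-item test and three stop at item 40, both statistics fall below 100% and the administration is flagged as possibly speeded.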

    In one study, the effect of the time limits of the Reading Test was ex-

amined by administering the test without enforcing the 60-minute time limit

(Fort Valley College, 1974).  In this study, 161 students were told that they

could work on the test as long as they wished.  Note was made of the time

beyond 60 minutes that these students used.  Subsequent analyses indicated that

there was no difference in the passing rates of students who spent different

amounts of time working on the test.  This finding does suggest that an

increase in the 60-minute limit would not necessarily improve students' test

scores.  However, because the students in this study were not administered the

Reading Test under timed as well as untimed conditions, it is not possible to

discern from the data gathered in this study how the 60-minute limit affects

students' test performance.  Hence, the data do not allow determination of the

speededness of the test.

    In a second study recently carried out by the Regents' Testing Program

office, data were gathered from recent administrations of the Reading Test to

determine whether this test was potentially speeded according to the criteria

used by ETS.  The statistics calculated on the basis of these data are given in

Table 3.



                                         Table 3

                Appraisal of Speededness in Recent Administrations of the
                                  Regents' Reading Test


______________________________________________________________________________

                                           Form 17                  Form 18
Speededness
Criteria                           Spring, 1981    Spring, 1982     Summer, 1982
______________________________________________________________________________

Total number of items on                  70              70              60
the test

Percentage of examinees                   99             100             100
answering 75% of the
items on the test

Percentage of items answered             100             100             100
by 80% of examinees

______________________________________________________________________________
                                          



    In interpreting these data, it is important to keep in mind a distinction

between items that are "attempted," with which the ETS criteria are concerned,

and items that are "answered," which are referred to in Table 3.  An item is

attempted when an examinee has either marked an answer to the item or omitted

the item but gone on to answer subsequent items.  An item is not attempted when

this item appears near the end of a test and an examinee has omitted this item

and the remaining items on the test.  In item analyses that are done by ETS,

the number of examinees who have answered an item is distinguished from the

number who have attempted and omitted it and the number who have not reached

it.  The number of examinees reaching an item can then be determined by adding

the number who have attempted and omitted it to the number who have answered

it.  In the item analyses currently done for the Regents' Reading Test, all

examinees who have not answered an item are counted as omitting it, whether

they have attempted and omitted it or actually not reached it.  Because only

two categories of responses (answers and omits) can be distinguished on the basis of this

analysis, the values in Table 3 reflect only the numbers

of items answered.  Were the number of items attempted and omitted added to

these values in order to count the number of items that had been reached, these

values would probably be higher.  Thus, the values in Table 3 can be regarded

as overestimates of the degree of speededness evident in the Reading Test.
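
The distinction drawn above among answered, omitted, and not-reached items can be made concrete with a short sketch.  The function name and the use of None to mark a blank response are assumptions made for illustration; the rule itself follows the definitions given in the preceding paragraph.

```python
def classify_responses(responses):
    """Count answered, omitted, and not-reached items for one examinee.

    `responses` lists the examinee's marks in item order, with None
    standing for a blank.  A blank followed by any later answer counts
    as omitted; blanks after the last answer count as not reached.
    """
    last_answered = max(
        (i for i, mark in enumerate(responses) if mark is not None),
        default=-1,
    )
    answered = sum(1 for mark in responses if mark is not None)
    omitted = sum(
        1 for mark in responses[: last_answered + 1] if mark is None
    )
    not_reached = len(responses) - (last_answered + 1)
    return {
        "answered": answered,
        "omitted": omitted,
        "not_reached": not_reached,
        # Items reached are those answered plus those attempted and omitted.
        "reached": answered + omitted,
    }
```

An answer sheet marked A, blank, C, B, blank, blank would thus show three items answered, one omitted, two not reached, and four reached.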

    In light of the values reported in Table 3, it appears improbable that the

Reading Test is speeded.  As the table indicates, at least 99% of the examinees

administered recent forms of the test completed three-fourths of the items on

these forms, and all of the items were answered by at least 80% of the

examinees.  Unless many examinees randomly marked answers to the final items as

the testing time ran out, these values indicate that examinees have little

difficulty finishing the Reading Test and that this test exceeds the ETS

requirements that should be met for a test to be considered unspeeded.  The

hypothesis of random responses can be ruled out by the response pattern evident

on the final items of the Reading forms referred to in Table 3.  Although not

indicated in this table, it was found that the correct answers to these items

were identified by 84% to 88% of the examinees who took these forms, which

suggests that most examinees' responses to these items were not at all random. 

In sum, these findings do not suggest that individuals' scores on the Reading

Test are influenced by the irrelevant variable of test speededness.  



BIAS DUE TO THE INFLUENCE OF ETHNIC BACKGROUND.  There are many definitions of

bias (see Flaugher, 1978; Scheuneman, 1981), but it is, perhaps, best

understood when viewed as a matter related to the validity of a measure (see

Shepard, 1981).  As noted earlier, validity refers to the degree to which the

information given by a test is relevant to the decision or use for which the

test is intended (Thorndike & Hagen, 1977).  Bias occurs when the information

given by a test does not have the same meaning for all groups that are tested;

that is, the information is more relevant for one group than it is for

another.  The groups for which a test may have this kind of differential

validity may differ in religion, sex, race, ethnicity, or the like.  In any

case, however, the finding that individuals' membership in a particular group

affects the meaning of their test scores is undesirable, since this means that

the intended decision or use to be made of these scores is less valid for this

group than it is for others.

    In literature treating the matter of bias, two general approaches to 

investigating bias are usually noted.  When a test is to be used to predict

performance on a criterion measure, the test-criterion relations for different

groups are usually compared (see Hunter & Schmidt, 1976; Linn, 1973).  Such

studies would be conducted for tests that are to be used for college admissions

or employee selection since these tests are intended to predict future success. 

For a test like the Regents' Reading Test, which is not intended to predict

performance on a particular criterion, studies of the test's internal and

external properties are carried out to determine whether there are any

differences in the meaning of the test for members of different groups. 

Studies of a test's internal properties entail investigations of the test's

content, the difficulty of its items, and its internal consistency, among other

things, in the interest of determining whether there is evidence that the test

behaves differently for different groups that are assessed.  External analyses

include studies of the relations of the scores on the test to other variables

in the interest of examining whether these relations are the same for different

groups (see Jensen, 1980, for a detailed discussion of these two approaches).

    External and internal analyses of the Reading Test shed light on the

question of whether the test is differentially valid for groups of white and

black students who are assessed.  The findings from these analyses are

discussed in the paragraphs below, with those studies bearing on the test's

internal properties being discussed prior to those studies bearing on the

test's external properties.



Studies of bias based on internal analyses.

    ANALYSIS OF TEST CONTENT. Since the validity of a skills test is strongly

affected by the quality of the content of which it is comprised, studies of

test content constitute one important basis for detecting factors that could

contribute to the differential validity of the test for different racial

groups.  Studies of this content should be carried out in the course of

developing the test.  One of these studies should entail (1) an examination of

the clarity with which the domains of skills to be assessed by the test have

been described, and (2) an examination of the degree to which the items written

for the test represent these specifications.  When the specifications for the

test are judged to be clear, and the items written for the test are judged to

represent these specifications well, many potential sources of invalidity are

eliminated from the test.  As Shepard (1981) explained,


       The meaning of a test score depends ultimately on how well  
       the items on the test represent the intended subject matter 
       or implied ability.  Establishing logical validity is often 
       characterized as a sampling problem, that is, the accuracy  
       of the inferences made from the test will depend on how well 
       the test content domain is specified and how well the items 
       sampled represent the test content domain.  The more am-
       biguous the definition of the intended domain or the more   
       elusive our grasp of the intended construct, the more po-
       tential there is for what is captured in the test to be a   
       distortion of the intended meaning.  (p.83)


    In addition to studies of domain clarity and item representativeness, the

content of a test should also be studied to identify any items that present

stereotypic images, ambiguous wording, or unfamiliar language that might alter

the meaning of the items for the members of a particular group (see Hambleton,

1980b; Cole & Nitko, 1981).

    As noted previously, the skill and item specifications for the Reading Test

were developed after careful review by a joint committee of English and reading

experts in the Winter of 1982.  After the committee had considered the clarity, completeness, and

appropriateness of these specifications, the specifications were approved and

have been subsequently used as guides for selecting passages and writing items

intended for the Reading Test.

    The items that were written for Form 20 of this test were first reviewed by

staff from the Regents' Testing Program office for technical soundness and were

edited where necessary.  These passages and items were subsequently organized

into a preliminary test form that was submitted to the joint committee for

further review.  This committee judged the conformity of the items to the test

specifications and noted any instances in which ambiguous wording, unfamiliar

language, or difficult vocabulary occurred.  After some revisions in the items

were made, the joint committee approved the items, indicating that the items

satisfied the requirements for item content, item structure, and representative-

ness set forth in the test specifications.  In the final version of Form 20 of

the Reading Test, no apparent features were found that would introduce bias and

differentially affect the meaning of the Reading Test scores for members of

different racial groups.



    ANALYSES OF TEST RESPONSES.  Analyses of the responses to items constitute

the next stage in an internal analysis designed to detect the presence of bias. 

These analyses entail comparative studies of the statistical properties of the

item responses made by different groups.  The finding that there are marked

differences in these properties would suggest that there may be items that do

not have the same meaning for the tested groups and, hence, might be biased.

    In 1974 Litaker examined the difficulties of Reading Test items for stu-

dents attending four different types of institutions in the University System of

Georgia.  The types of institutions and the number of schools in each type were

as follows:  universities (3), senior colleges (9), junior colleges (11), and

four-year institutions that were historically black (3).  The data analyzed by

Litaker consisted of responses to Form F of the Regents' Reading Test, which was

taken by students in the Spring and Fall quarters of 1972 and 1973.  In his

study Litaker grouped the students from the four types of institutions into four

ability levels, using as a measure of this ability students' scores on the

objective writing test that had been a component of the Regents' Test at the

time.  For his final sample he selected students at each of the four ability

levels from each of four different kinds of institutions.  He subsequently

compared, within each ability level, the difficulty of the Reading Test items

for students attending the four types of institutions.

    In Table 4 the most important of Litaker's findings are given.  This table

shows the correlations, within each ability level, between the item difficulties

for pairs of institutional types.  These correlations are most important because

they reflect the degree to which the items of Reading Form F have the same rela-

tive difficulty for the two groups compared.  When relative item difficulties

for the two groups are very similar, the values of these correlations will be

close to 1.00.  Alternatively, correlations substantially less than 1.00 will be

found when there are items that are relatively more difficult or easy for one or

the other of the two groups compared.
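
    The computation described above can be illustrated with a brief sketch,
written here in Python.  It is offered only as an illustration under stated
assumptions and is not a record of Litaker's actual procedure.  The response
data and the function names (item_difficulties, pearson) are hypothetical.
Item difficulty is taken to be the proportion of a group answering the item
correctly, and the two sets of difficulties are then correlated.

```python
from math import sqrt


def item_difficulties(responses):
    """Proportion of examinees answering each item correctly.

    `responses` is a list of examinee records, each a list of 0/1 item scores.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_examinees for j in range(n_items)]


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scored responses (1 = correct) for two small groups on five items.
# These values are invented for illustration and do not come from any study.
group_a = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 1],
]
group_b = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1],
]

p_a = item_difficulties(group_a)  # difficulties for the first group
p_b = item_difficulties(group_b)  # difficulties for the second group
r = pearson(p_a, p_b)             # near 1.00 when relative difficulties agree
print(p_a, p_b, round(r, 3))
```

A value of r close to 1.00 indicates that the items are ordered by difficulty
in much the same way for both groups, even when one group finds every item
harder than the other group does.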




                                         Table 4


              Correlations between the Difficulties of Reading Form F Items
   for Samples of Students of a Given Ability* Within Different Types of Institutions
                                     [Litaker, 1974]

__________________________________________________________________________________________

Ability Level /
  Types of Institutions Compared                                     Correlation
__________________________________________________________________________________________

Ability Level 1

  Universities vs. Senior Colleges                                      .992
  Universities vs. Junior Colleges                                      .983
  Senior Colleges vs. Junior Colleges                                   .992
  Universities vs. Historically Black Colleges                          .944
  Senior Colleges vs. Historically Black Colleges                       .961
  Junior Colleges vs. Historically Black Colleges                       .952

Ability Level 2

  Universities vs. Senior Colleges                                      .987
  Universities vs. Junior Colleges                                      .990
  Senior Colleges vs. Junior Colleges                                   .995
  Universities vs. Historically Black Colleges                          .882
  Senior Colleges vs. Historically Black Colleges                       .909
  Junior Colleges vs. Historically Black Colleges                       .897

__________________________________________________________________________________________
  
   *Litaker determined students' abilities using the objective writing test
   that was part of the Regents' Test at the time.  He selected for his
   study students performing in the 2nd, 4th, 6th, and 8th deciles on this
   test.  Because there was not adequate representation in the 6th and 8th
   deciles among students attending the historically black colleges, only
   the item difficulties of students who were in the 2nd and 4th deciles
   and attended these colleges were analyzed by Litaker and reported here.



    To appraise the correlations presented in Table 4, it is useful to compare

their values to those obtained by Angoff and Ford (1973) in their classic study

of bias in the Preliminary Scholastic Aptitude Test (PSAT).  For the verbal and

mathematical sections of this test, Angoff and Ford reported that values of .959

and .923, respectively, were obtained when the item difficulties for matched

black and white students were correlated.  As shown in Table 4, Litaker's

findings for the correlations that involved historically black colleges range

from .882 to .961 and are, for the most part, slightly lower than the correlations obtained by

Angoff and Ford.  Because these findings suggested that some items of the

Reading Test might be relatively more difficult or easy for students at the

historically black institutions, Litaker conducted further analyses in an

attempt to identify such items.  He found four items that were relatively easier

and four that were relatively harder for these students than for students

attending the three other types of institutions in the University System. 

Litaker examined the content of these aberrant items and speculated about the

reasons for their differential difficulty, but ultimately he concluded that

there was no clear explanation for their possible bias.  It is important to note

that Litaker's conclusion reflects the lack of success experienced by other

researchers who have attempted to discern the particular features of items that

make the items relatively more difficult or easy for members of a particular

group.  In studies by Jensen (1979), Sandoval and Miller (1980), and Plake

(1980), there was no more agreement from judgmental and empirical analyses of

bias than that level of agreement that would occur by chance.

    It also is important to note that it would be inappropriate to generalize

Litaker's findings and consider them applicable to recent versions of the

Reading Test.  The form of the Reading Test examined by Litaker was the first

form of the test that had been devised, and it entailed items drawn from a pool

that the Regents' Testing Program had leased from the Educational Testing

Service.  Since 1974, the items comprising forms of the Reading Test have not

been drawn from this pool.  Also, the content of these items has been consider-

ably refined since that time.  Therefore, whatever content characteristics

produced the possible bias evident in Form F may not still be evident in more

recent test forms.  Particularly since Litaker was unable to identify any type

of item characteristic that could be consistently related to the discrepant item

difficulties that he observed, there is no basis for the inference that biasing

content characteristics like those that he found will be evident in more recent

forms of the Reading Test.

    To examine the question of bias in a more recent form of the Reading Test,

the Regents' Testing Program office conducted a study like Litaker's on Form 15

of this test.  This study compared the relative difficulty of Form 15 items for

unmatched groups of black and white students who had been administered this

form during the 1978-1979 academic year.  Hence, this study differed from

Litaker's in that the item difficulties for black and white students, rather

than for historically black and non-black institutions, were compared.

    When the difficulties of Form 15 items for the unmatched black and white

students were correlated, a value of .938 was obtained.  This value is high and

compares favorably with the item difficulty correlations of .929 and .901

obtained by Angoff and Ford when they compared the relative difficulties of PSAT

verbal and mathematical items for unmatched groups of blacks and whites.  It

also suggests that the relative difficulty of Form 15 items for black and white

students was about the same.  An even higher correlation than .938 probably

would have been obtained if the relative difficulty of Form 15 items had been

compared using groups of black and white students who had been matched on

ability.  As Angoff and Ford observed, matching appears to reduce some of the

disparities in the test performance of different groups.

    Since the correlation of .938 fell slightly below 1.00, the Form 15 items

were examined individually in an attempt to identify those items for which there

was the most marked difference in relative difficulties when the black and white

student groups were compared.  Two items (Nos. 21 and 63) that were found to be

slightly more difficult for the black students than were other items on the test

are shown in Table 5.



                                         Table 5

                     Items from Form 15 of the Regents' Reading Test
             That Were Relatively More Difficult for Black Students Than for
                                     White Students

 
__________________________________________________________________________________________

                                                            Percent Choosing Each Option
                                                           _______________________________
                                                                  White        Black
                                                               Students     Students
__________________________________________________________________________________________

  21.  Aerobes are

     *1.  live organisms.                                         95           82
      2.  atmospheric gases.                                       2            7
      3.  animal residue.                                          2            8
      4.  decaying plants.                                         1            3


  63.  This passage was obviously written

     *1.  in the mid to late 20th century.                        96           83
      2.  at the turn of the century.                              2            5
      3.  in the mid to late 17th century.                         1            3
      4.  in the early 20th century.                               1            3
          (omit)                                                  (1)          (5)

__________________________________________________________________________________________

 Note:  Percentages based on responses of 7,468 white students and 926 black students who 
        were administered Form 15 during the 1978-1979 academic year.         
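
    One common way to screen items individually for differences in relative
difficulty is the transformed item difficulty (delta-plot) approach associated
with Angoff and Ford (1973).  This report does not state which computations
were applied to the Form 15 items, so the following Python sketch is an
assumption offered for illustration only.  The proportions correct and the
function names (delta, major_axis_distances) are hypothetical.  Each proportion
is converted to the delta scale, a principal axis line is fitted to the paired
deltas, and items lying farthest from that line are flagged for review.

```python
from math import sqrt
from statistics import NormalDist, mean


def delta(p):
    """Convert a proportion correct to the delta scale (mean 13, SD 4).

    Harder items (lower p) receive larger delta values.
    """
    return 13.0 + 4.0 * NormalDist().inv_cdf(1.0 - p)


def major_axis_distances(x, y):
    """Signed perpendicular distance of each (x, y) point from the principal axis.

    A positive distance means the item is relatively harder for the group
    plotted on the y axis than the overall trend would predict.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    # Slope of the major (principal) axis of the scatter of points.
    slope = (syy - sxx + sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    intercept = my - slope * mx
    return [(b - slope * a - intercept) / sqrt(slope ** 2 + 1.0) for a, b in zip(x, y)]


# Hypothetical proportions correct on six items for two groups.
# These values are invented for illustration and are not Form 15 results.
p_group_1 = [0.90, 0.80, 0.75, 0.60, 0.70, 0.50]
p_group_2 = [0.80, 0.66, 0.45, 0.43, 0.54, 0.34]

deltas_1 = [delta(p) for p in p_group_1]
deltas_2 = [delta(p) for p in p_group_2]
distances = major_axis_distances(deltas_1, deltas_2)

# Report the item that departs most from the common trend.
flagged = max(range(len(distances)), key=lambda i: abs(distances[i]))
print("item index farthest from the line:", flagged)
```

Because the distances are measured from a line fitted to all of the items, an
item is flagged only when its difficulty differs between groups by more than
the overall difference in performance would lead one to expect.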



    To determine why the items shown in Table 5 were, relative to other items,

slightly more difficult for black students, it is reasonable to consider the

material of the passages to which each item referred.  In the expository passage

to which Item 21 refers, a method of constructing a compost heap is described. 

The word aerobes is used in a sentence that provides several context clues that

suggest the meaning of this word.  The sentence is:

                
                The soil organisms which break down the plant 
                and animal residues and convert them into 
                compost are aerobes, i.e., they must have oxygen 
                from the atmosphere to carry on their life 
                activities.


Thus, Item 21 can be answered correctly by the student who is able to use "i.e."

and the pronoun reference "they" to make the link between the soil organisms and

the life activities referred to in the passage.  As is evident from Table 5,

most white and black students did answer the item correctly.  However,

the analysis of this item for bias indicated that, relatively speaking, the item

posed slightly more difficulty for some black students.  It is not clear,

though, why this was the case.

    With respect to Item 63, the reason for the discrepancy in black and white

students' performance may be clearer.  Item 63 refers to a narrative passage

about a man's rise to success as an advertising agent, singer, and poet during the

"Eisenhower years."  The reference to the Eisenhower years and a reference to

computers would appear to be the only information in the passage which would

allow a student to discern that the passage was written about events occurring

in the mid to late 20th century, which is the correct answer to Item 63.  In

particular, linking Eisenhower or computers to the middle or late 20th century

requires information that has not been presented in the passage.  It may be that

a greater proportion of black than white students lacked this information and so

found this item relatively more difficult.



Studies of bias based on external analyses

    Data that were gathered by Hickman (1973) in the course of a larger study

permit rough comparisons of the relation between academic variables and the

Reading Test scores obtained by students who differ in race.  In her study,

Hickman examined these relations for each of five institutions, which included

two junior colleges, one four-year college, and one university, all of which were

predominantly non-black, and one four-year college that was predominantly black.

For the students at each institution, she calculated the correlation between

their Reading Test scores and both their scores on the Scholastic Aptitude Test

and their grade point averages.  Thus, through a comparison of the correlations

obtained for the predominantly black college to those obtained for the four

predominantly non-black institutions in Hickman's study, some conclusions can be

drawn about the similarity or differences in the meaning of the Reading Test

scores for students attending these different types of institutions.  Because

the institutions studied were not homogeneous with respect to race, only very

tentative inferences can be made about the meaning of the Reading Test scores

for students who differ in race.  Hickman's findings also have limited

generalizability because she included in her study only one predominantly black

college that may or may not have a student body that is representative of the

black student population in the University System.

    The results of Hickman's correlational analyses are presented in Table 6. 

As is evident from the table, with two exceptions the correlations obtained at

the predominantly black college differed little from those obtained at the other

four institutions that Hickman examined.  For example, the correlation of .27

between the Reading Test scores and cumulative high school grade point averages

obtained by students at the predominantly black institution was about the same as

those reported for students attending the junior and 4-year colleges in

Hickman's study; why the correlation reported for the University was somewhat

lower than those of these other institutions is not clear.  Further study of

Table 6 also shows that institutional type did not seem to influence the

relation of students' Reading Test scores to their SAT(M) scores and to their

cumulative college and freshman grade point averages; for all five institutions,

these correlations were about the same.



                                         Table 6

                    Correlations between Regents' Reading Test Scores
                         and Selected Academic Variables Within
                               Five Different Institutions
                                     [Hickman, 1973]

__________________________________________________________________________________________

                                             Institutional Types
                         _______________________________________________________________

                                   Predominantly Non-Black               Predominantly
                                                                            Black
                         _______________________________________________________________
    Academic
    Variables              Jr.       Jr.       4-Year       University         4-year
                         College   College     College                         College
__________________________________________________________________________________________

Test Scores

SAT(V)*                   .61       .76         .74             .70              .58
SAT(M)**                  .41       .57         .41             .39              .41

Grade Point Averages

Cumulative-High School    .27       .26         .26             .17              .27
Cumulative-College        .38       .49         .48             .44              .46
Freshman-College          .35       .46         .48             .42              .44
Eng. Composition-College  .35       .40         .44             .39              .32

__________________________________________________________________________________________

 Note: Correlations based on sample sizes ranging from 88 to 560.  

  *SAT(V) refers to the verbal section of the Scholastic Aptitude Test.

 **SAT(M) refers to the mathematical section of the Scholastic Aptitude test.



    The two exceptions to this trend of similar correlations are the markedly

lower correlation between the Reading Test and the SAT(V) reported for the

predominantly black college and the slightly lower correlation between the

Reading Test and students' grade point average in English Composition reported

for this college.  The Reading Test - SAT(V) correlation of .58 that was

obtained at the predominantly black college was quite a bit lower than the values

for this correlation obtained at the other institutions in Hickman's study (r's

= .61 to .76).  Similarly, the correlation between the Reading Test and

students' grades in English Composition obtained at the predominantly black

institution (r = .32) was slightly lower than the correlations obtained at the

other institutions (r's = .35 to .44).

    Restriction of range is the most plausible explanation for the lower

correlations obtained at the predominantly black institution.  That is, the students

attending this institution received scores on the correlated variables that were

more homogeneous than the scores of students attending the other institutions

studied by Hickman.  This explanation is supported by the data presented in

Table 7, which gives the means and standard deviations of the Reading Test

scores, the SAT scores, and the grade point averages obtained by the students at

the five institutions involved in Hickman's study.  When the standard deviations

shown in this table are examined, it becomes clear that the students from the

predominantly black college were considerably more homogeneous in their Reading

Test scores, their SAT(V) scores, and their English composition grades than were

students attending the other four institutions in Hickman's study.  Since

homogeneity restricts the size of the correlation that can be obtained, it is

most probable that the correlations among these variables were relatively lower

for the predominantly black college because the students at this college were

less variable in their performance on both variables involved in the

correlations.
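
    The effect of homogeneity on the size of a correlation can be illustrated
with the standard formula for direct restriction of range on one variable.
The Python sketch below is an illustration under stated assumptions and is not
an analysis reported by Hickman: it assumes a linear relation with equal error
variance across the score range, and it treats the university as the
comparison group purely for the sake of example.  The function names
(restricted_r, corrected_r) are hypothetical.  The standard deviations and the
correlation of .70 are taken from Tables 7 and 6.

```python
from math import sqrt


def restricted_r(unrestricted_r, sd_ratio):
    """Correlation expected after direct range restriction on one variable.

    `sd_ratio` is the restricted SD divided by the unrestricted SD (0 < ratio <= 1).
    """
    r, u = unrestricted_r, sd_ratio
    return r * u / sqrt(1.0 - r * r + r * r * u * u)


def corrected_r(observed_r, sd_ratio):
    """Inverse of restricted_r: estimate the correlation in the unrestricted group."""
    r, k = observed_r, 1.0 / sd_ratio
    return r * k / sqrt(1.0 - r * r + r * r * k * k)


# SAT(V) standard deviations from Table 7:
# 57.33 at the predominantly black college and 95.20 at the university.
sd_ratio = 57.33 / 95.20

# Reading Test - SAT(V) correlation reported for the university in Table 6.
expected = restricted_r(0.70, sd_ratio)
print(round(expected, 2))
```

Under these assumptions, a correlation of .70 in the more variable group would
be expected to shrink noticeably in a group whose SAT(V) scores are only about
six-tenths as variable, which is the direction of the difference observed in
Table 6.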



                                         Table 7
                                            
                    Means and Standard Deviations* of Scores Obtained
                      on the Regents' Reading Test and on Selected
                             Academic Variables by Students 
                             at Five Different Institutions
                                     [Hickman, 1973]

__________________________________________________________________________________________

                                             Institutional Types
                         _______________________________________________________________

                                   Predominantly Non-Black               Predominantly
                                                                            Black
                         _______________________________________________________________
Academic
Variables                  Jr.       Jr.       4-Year       University         4-year
                         College   College     College                         College
__________________________________________________________________________________________

Regents' Reading          63.25    63.47       64.94         67.16              48.65
Test Scores               (9.39)   (9.57)      (9.30)       (10.24)             (7.96)


Other Test Scores

SAT(V)**                 422.98   411.94      456.31        455.22              302.62
                         (83.82)  (85.38)    (100.25)       (95.20)             (57.33)

SAT(M)***                431.90   448.92      451.92        471.30              328.66
                         (84.84)  (98.43)     (89.70)       (94.09)             (59.66)


Grade Point Averages

Cumulative-High School     2.56     3.25        2.64          2.65                2.65
                           (.70)    (.60)       (.62)         (.67)               (.58)

Cumulative-College         2.68     2.53        2.45          2.39                2.56
                           (.55)    (.65)       (.58)         (.68)               (.48)

Freshman-College           2.57     2.45        2.37          2.31                2.64
                           (.58)    (.67)       (.65)         (.63)               (.48)

Eng. Composition-College   2.54     2.20        2.50          2.49                2.89
                           (.75)    (.70)       (.71)         (.59)               (.69)
__________________________________________________________________________________________

 Note:  Statistics based on sample sizes ranging from 88 to 560.
     
   *Standard deviations reported within parentheses.
 
  **SAT(V) refers to the verbal section of the Scholastic Aptitude Test.

 ***SAT(M) refers to the mathematical section of the Scholastic Aptitude Test.






                                   Section II

 

                  DEVELOPMENT AND VALIDATION OF THE ESSAY TEST



DEVELOPMENT OF THE TEST PROCEDURE

    The first administration of the Regents' Test entailed two measures of

students' writing skills:  an essay test and a multiple-choice grammar and usage

test.  Students were given 30 minutes to work on each of these tests, and they

were assigned the essay topic on which they were to write.

    The decision to assess students' writing skills in this way was made after

much debate between members of the Subcommittee on Testing and testing experts

about the manner in which writing could properly be appraised.  The Subcommittee

members felt strongly that writing skills could be validly assessed only on the

basis of a writing sample.  In contrast, the testing experts preferred a

multiple-choice test of writing skills because of the psychometric problem of

reliably scoring a writing sample and because of the administrative and economic

difficulties entailed in scoring large numbers of these samples.  Inclusion of

both a 30-minute essay test and a 30-minute multiple-choice test of writing

(grammar and usage) was the compromise these two groups ultimately reached

(Johnson, 1980; Thompson & Rentz, 1973).  Students were assigned their essay

topics in part because both the Subcommittee members and testing experts held

the view that if students were given a choice of topics they would spend too

much of the relatively short testing time selecting their topics and too little

of that time organizing and writing their essays.  It was also thought that

permitting a choice would make unduly complicated the scoring procedures used to

rate the essays (Thompson & Rentz, 1973).

    The multiple-choice writing test and the essay test that provided no choice

of topics were used in the Regents' Test until 1974, when several changes were

made in the writing assessment procedures.  The multiple-choice writing test was

dropped from the Language Skills Examination after the Summer, 1974,

administration.  This decision was made because data had been collected which

showed that students' essays were being reliably scored and the Subcommittee

concluded that students' grammar and usage skills could be considered by essay

raters when they read and scored the students' essays (Johnson, 1980). 

Concurrently, the Subcommittee decided to give students 45 rather than 30

minutes to write their essays and to allow students a choice between two essay

topics to write on.  These two matters were decided in light of data gathered by

the Regents' Testing Program office that showed that students' passing rates

might be improved by extending their essay-writing time to 45 minutes and that

these rates would not be adversely affected by giving students a choice between

two topics (Regents' Testing Program, 1974).  In 1978, members of the Ad Hoc

Committee recommended that the time allowed for the essay test be extended to

the current 60-minute limit because they believed that the quality of students' 

essays might improve if more time to work on them were allowed.

    With respect to the essay topics presented in the Essay Test, the Testing

Subcommittee specified that these topics should be narrow enough to elicit an

essay in 60 minutes but broad enough to bear on students' common knowledge and

experiences rather than on any specialized knowledge that only some students

might have (Thompson & Rentz, 1973).  Also these topics were not to (1) contain

difficult vocabulary, (2) appear to have a rural-urban or an ethnic bias, (3)

closely resemble topics previously used, (4) involve highly controversial or

emotional subjects, or (5) seem to encourage students to identify their

institutions in their essays. It was also specified that the two topics

presented on a form of the test should be sufficiently different from one

another so that students had a reasonable choice between topics, e.g., one topic

might bear on a contemporary idea or event, and the other might bear on a

personal event or experience.



PROCEDURES FOR SELECTING ESSAY TOPICS  

    The topics used on the earliest forms of the Essay Test were written by the

Testing Subcommittee.  More recently, proposed topics have been solicited

through the president of each institution, who is asked to obtain suggestions

for topics from students, faculty, and administrators.  Over 950 suggestions

were proffered in response to the most recent request for essay topics, which

was made in the Fall Quarter of 1980. 

    The essay topics that are submitted are first reviewed by the Scoring

Coordinators and the members of the Testing Subcommittee, who select those

topics that conform to the specifications for the Essay Test.  After being

revised where necessary, the topics selected are then submitted to the Academic

Committee on English for further revision and final approval.  The approved

topics are then put in pairs by the Regents' Testing Program staff for use on

test forms.  The Testing Subcommittee and Scoring Coordinators subsequently

review and revise the pairs of topics to ensure that the two topics presented on

each form are sufficiently different from each other to offer students a

reasonable choice between topics.



PROCEDURES FOR SCORING THE ESSAY TEST  

    Each student's essay is graded independently by three raters who use a

holistic scoring procedure to assign ratings to the essay.  Often, raters who

use holistic scoring procedures are required to read an essay quickly to gain an

overall impression of its quality and to use standards established by the raters

themselves to assign a rating to it (Godshalk, Swineford, & Coffman, 1966).  A

variant of this procedure is used to score the essays written for the Essay

Test.  Raters of these essays do assign ratings that reflect their judgment of

the overall quality of an essay; however, the standards used to assign these

ratings have been formulated by the Testing Subcommittee rather than the raters

themselves.  The development of these standards is described in a section below.

     Procedures that entail holistic or impressionistic judgments of essays are

used in many large-scale testing programs because essays can be scored quickly

when these procedures are used.  Because of this efficiency, it is possible to

have three raters read and score each essay submitted by the students who have

taken the Essay Test.  The analytic scoring method, which is another method of

directly appraising an essay, requires a rater to evaluate individual features

of the essay.  This method is more time-consuming and expensive than the

holistic approach, and it is not known to be any more reliable or valid (Coffman

& Kurfman, 1968; Raven, Veal, & Rentz, 1974).

    A disadvantage of the holistic approach is that holistic ratings provide no

information about the particular strengths and weaknesses of the rated essays. 

A holistic rating reflects a judgment about the quality of the essay as a whole,

and the particular features of the essay that led to this rating are not

described by the rater.  However, it is not reasonable to expect that the

results of a large-scale testing program provide extensive diagnostic

information.  This kind of information is probably more appropriately and

effectively gained in the classroom.  For a reliable diagnosis about the strong

and weak aspects of a student's writing, several essays should be written at

different times by the student and then appraised analytically.  Students who

have taken the Essay Test and have questions about their essay scores can review

their essays with faculty who are familiar with the procedures used to rate the

Essay Test.  Although it is not possible to determine from this review the

reasons why a particular score was assigned to an essay, strengths and

weaknesses in the essay can be pointed out by the faculty member and, more

importantly, further writing samples can be requested so that an accurate

diagnosis can be made.

    Until 1981, the instructions that were given to essay raters included a

description of the scoring procedure they were to use and the set of model

essays.  These instructions had been developed by the Testing Subcommittee.

    In 1981, the instructions to the essay raters were expanded and a new set

of model essays was selected.  These expanded instructions did not entail any

change in the scoring standards or criteria; minor revisions were made in the

description of the essay scoring procedure, and a new section was added wherein

answers to questions that raters commonly pose were provided.  These revised

instructions, which are part of the Description of Essay Scoring Procedures,

were approved by the Academic Committee on English in February, 1981, and they

have been used by Regents' Test essay raters since Spring quarter, 1981.

    In order to monitor the procedures for scoring essays, the members of the 

Testing Subcommittee meet with the Scoring Coordinators each quarter after the

test has been administered and before the first essay scoring session.  At this

meeting members of the group read essays written on the topics used that quarter

and look for any specific problems that raters could have in rating papers on

the topics.  A discussion of anticipated problems provides the basis for the

guidelines that are developed for use in scoring the current essays.  At the

meeting, the group also selects practice essays for use at the scoring sessions. 

The ratings assigned to the practice essays are based on unanimous agreement of

the group members.  At the scoring session held later in the quarter, Scoring

Coordinators discuss both the guidelines and the practice essays with essay

raters.


    
Rationale for Scoring Standards  

    As noted above, the standards used as a basis for scoring the essays

written for the Regents' Test were developed by the Testing Subcommittee, whose

members had an average of 20 years of experience in grading the compositions of

high school and college students.  The Subcommittee's choice of standards was

based both on the members' extensive experience in appraising the writing of

students from widely different backgrounds and on its reading and discussion of

hundreds of essays written for the experimental form of the Essay Test

administered in the spring of 1971.

    The following statement by a former chairman of the Testing Subcommittee

conveys the considerations on which the essay-scoring standards have been

based:


         The fundamental assumption which justifies the whole test is 
         that language is the primary tool of thought as well as of 
         expression and that the inability to manipulate language 
         accurately not only impairs communication but blights the 
         entire thinking process.  The writer, by giving up such 
         advantages as gestures, intonation, and the personal 
         interplay between speaker and hearer, is forced to a far 
         greater precision of diction and organization of material 
         than the speaker.  Thus in the examination it was necessary 
         to demand tight, logical organization as well as precise word 
         choice and unambiguous phrasing.
         
         . . . .What we sought to establish in the Rising Junior test 
         was a level of composition which would not disgrace a letter 
         of application for a job and which would convey clearly such 
         information as a chemist, a business man, or a policeman 
         would need to convey to his superiors, his colleagues, or his 
         customers.  I believe that the standards which our committee
         set represent a bare minimum of what the business world would
         accept in terms of organization, coherence, explicitness, and 
         freedom from gross linguistic bad manners.
         
         I believe that the professors who established the standards 
         for the essay, working as they did out of extensive and 
         nationwide experience of student writing, were thoroughly 
         qualified to define both the criteria by which writing should 
         be judged and the level at which it should be considered 
         minimally acceptable.  I can testify that the labor, 
         extending over a number of days and over hundreds of papers, 
         was carried out conscientiously and with a deep realization 
         of the responsibility which we carried.  Although many 
         believe the standards too low, I believe we can defend the 
         minimally acceptable papers as only slightly below the 
         standard required in most Freshman English courses; the 
         Rising Junior test is, after all, taken under somewhat more 
         harassing conditions than a Freshman in-class theme 
         (Pendexter III, undated).

   

   Although a two-point, pass-fail rating scale would have adequately served

the purposes of the Essay Test, the Testing Subcommittee decided to use a

four-point scale so that students and their institutions would be given more

information about the quality of the students' writing performance.  With three

acceptable levels of writing, exceptional writing performance would not go

unrecognized, and marginal levels of skill could be distinguished from the

clearly acceptable levels.  The Subcommittee also decided that one failing score

was sufficient because, in the words of one committee member, "We all felt

strongly that distinguishing between failing and failing miserably would be

needlessly depressing to the student who received the lowest grade."

   The use of a model essay to represent each division point on the four-point

scale was thought to provide the most effective means of precisely defining the

scale.  Each of these models describes just one point on this scale.  Were

models representing the ratings of "1," "2," "3," and "4" used, less precise

definition of the scale would invariably result because a range of performance

is represented by each of these ratings.  The use of models representing the

mid-point of each rating on the scale would similarly result in a less

well-defined scale:  when an essay is judged to fall between two midpoints, a

rater has no clear means for deciding what rating the essay should be assigned.

   In order to have some systematic method of selecting model essays, in 1971

members of the Testing Subcommittee identified 22 aspects of writing and indicated

how these aspects should be weighted when the model essays were selected.

The committee indicated that 40% of the weight should be given to organizational

aspects, 40% to rhetorical aspects, and 20% to the mechanical aspects of the

essays considered.  These categories and the specific features of writing to

which they pertain are listed in Tables 8 to 10 and are discussed in a

subsequent section where two studies that used the Testing Subcommittee's list

of 22 criteria are described.

   During the first few years in which the Regents' Test was administered, the

Testing Subcommittee selected new model essays for each topic each quarter so

that essay raters could be given model essays written on the topics used in the

Essay Test administered that quarter.  Selecting these "topic-specific" model

essays was time-consuming and, even with the careful procedures for selecting

these models, there was no way to ensure that the selected models were

equivalent.  Therefore, in Spring 1976, it was suggested that one standard set

of model essays that could be used each quarter be identified and used in lieu

of the topic-specific sets of model essays that had to be changed each quarter. 

Research subsequently conducted by the Regents' Testing Program in Summer, 1976,

indicated that students' pass rates on the Essay Test and rater reliability

would not be significantly affected by using the standard models in lieu of

topic-specific models.  Consequently, the Testing Subcommittee agreed to select

a set of standard models that would be used as bases for rating all essays

written for the Regents' Test.  These models were chosen from among the essays

that had been previously used as topic-specific models. 

   In 1978, the Ad Hoc Committee on the Regents' Test examined the standards

that were used to judge the quality of the essays written for the Essay Test. 

After considering the procedures used to select the model essays that establish

these standards and appraising the rationale underlying these procedures, the Ad

Hoc Committee concluded that the essay scoring procedure was sound, and it

recommended that responsibility for setting the scoring standards remain with

the Academic Committee on English.



EVIDENCE OF CONTENT VALIDITY

   As in the case of the Reading Test, the validity of the Essay Test rests

primarily on evidence that it is content valid.  As noted in Section I above, a

test developer builds content validity into a test by (1) carefully considering

what content and skills the test should cover, (2) carefully defining the

content and skills the test will assess, and (3) choosing items for the test

that adequately represent the content and skill specifications.  Experts in the

subject matter and skills to be assessed should be engaged at each of these

stages (Anastasi, 1976).

   Forms of the Essay Test have been developed in such a manner.  The aspects

of writing to be appraised by the test were discussed at length by the Testing

Subcommittee in 1971.  These aspects were illustrated by the model essays that

have been used to define the Essay Scoring scale and are discussed in the

Description of Essay Scoring Procedures given to the raters who grade the

Regents' essays.  With respect to the topics that the Essay Test is to cover, as

noted above, potential topics for the test are reviewed by the Testing

Subcommittee, and topics are selected that conform to the aforementioned

criteria used by this committee.  The selected topics are then reviewed by the

full Academic Committee on English, which consists of representatives from all

33 institutions in the University System.  Any topic that still appears

inappropriate or flawed is either revised, when possible, or eliminated.  The

Testing Subcommittee and the Scoring Coordinators conduct a final review of the

approved topics after they have been paired for use on forms of the Essay Test.



OTHER EVIDENCE OF VALIDITY

Relation between Essay Test Scores and Other Measures of Writing Skill

    As with the Reading Test, the validity of the Essay Test largely rests on the

quality of its content, but additional support for this validity can be gained

from studies of the relation between students' essay scores and other measures

of their writing skill; the finding that the scores on the two measures relate

well would provide support for the claim that the Essay Test does, in fact,

appraise students' writing skills (see Campbell, 1964; Campbell & Fiske, 1959). 

    Three studies have been conducted to examine such relations.  In the studies

by Ravan (1973) and Henderson (1977), Regents' essays that had been holistically

graded were rated again using another scoring method so that comparisons could

be made between the essays' holistic scores and the scores resulting from

another method of appraising the quality of writing in essays.  In contrast,

Veal and Rentz (1975) compared students' holistic essay scores to their

performance on an objective test of writing skills in order to examine the

relation between the Essay Test and an entirely different approach to the

measurement of writing skills.  The results of these studies are described in the

paragraphs that follow.

     Ravan (1973) and Henderson (1977) examined the relation between the holistic

scores students received on the Essay Test and the scores these students

obtained when their essays were analytically graded.  In contrast to the global

rating of writing quality associated with the holistic method, when an essay is

appraised analytically, a rater reads the essay slowly and then rates individual

components of the essay.  Both Ravan and Henderson had raters individually rate

22 components of the essays that they read.  These 22 components are those that

the Testing Subcommittee specified in 1971 and used as the initial basis for

selecting the model essays that would define the score scale for the Essay Test. 

If the holistic scores assigned to Regents' essays validly reflect the levels of

quality described by the essay score scale, these scores should establish a rank

order of essays that is similar to that which results when these essays are

analytically graded on the 22 components.

    For her study, Ravan selected from the Winter, 1973, test administration 40

essays that had been given holistic scores of 1, 2, 3, or 4, with 10 essays

selected to represent each of these scores.  Two raters then analytically graded

each of the 40 essays on the 22 components that were noted above and are listed

in Table 8.  For the components pertaining to Organization and Rhetoric, the

raters used a 4-point scale that ranged from (1) substandard to (4) superior. 

The components pertaining to Mechanics were also rated on a 4-point scale that

differed slightly in that it ranged from (1) demonstrates incompetence to (4)

demonstrates competence.  The results of these analytic ratings are reported in

Table 8. 



                                     Table 8
                                        

           Mean Analytic Ratings on 22 Components for Regents' Essays
                    Given Holistic Ratings of 1, 2, 3, and 4
                                  [Ravan, 1973]

________________________________________________________________________________

                                        Holistic Scores
Analytic                 _________________________________________    Component
Components                     1         2         3         4          Means
________________________________________________________________________________

Category of
Organization:

 1.  Limiting the
      Subject                1.40      1.60      1.60       2.40        1.75*

 2.  Evidence of a
      Thesis                 1.40      2.10      2.00       3.00        2.12*

 3.  Development of
      Thesis:  Unity         1.10      1.80      2.00       2.60        1.86*

 4.  Development of Thesis:
      Logical Development    1.00      1.80      1.80       2.80        1.85*
 
 5.  Development of Thesis:
      Coherence              1.00      1.50      2.00       2.70        1.80*

 6.  Development of Thesis:
      Evidence               1.00      1.50      1.50       2.60        1.65*


      Category Mean          1.15       1.7      1.82       2.68        1.84



Category of Rhetoric:

Diction:

 7.  Clarity                 1.20      1.80      2.80       2.90        2.17*

 8.  Economy                 1.30      1.80      2.50       2.80        2.10*

 9.  Precision               1.20      1.90      2.50       2.90        2.12*

10.  Consistency             1.80      2.20      2.90       3.20        2.53*


Sentence Structure:

11.  Clarity                 1.30      1.90      2.60       2.90        2.17*

12.  Variety                 1.60      2.00      2.80       3.10        2.38*

13.  Economy                 1.40      2.00      2.50       3.00        2.23*      

14.  Parallelism             1.90      2.70      2.60       3.20        2.60*


Paragraph Structure:

15.  Unity                   1.20      1.70      2.30       2.80        2.00*

16.  Logical Development     1.20      1.70      2.30       2.80        2.00*

17.  Coherence               1.30      1.70      2.20       2.80        2.00*


Point of View:

18.  Appropriateness         1.10      1.60      2.40       3.00        2.02*

19.  Consistency             1.00      1.80      2.10       2.80        1.92*

      Category Mean          1.35      1.91      2.50       2.94        2.17


Category of Mechanics:

20.  Spelling                2.60      3.30      3.80       3.80        3.38*

21.  Punctuation             2.60      3.20      3.10       3.60        3.12

22.  Usage                   2.40      2.90      3.10       3.60        3.00

      Category Mean          2.53      3.13      3.33       3.67        3.17

                            

OVERALL ANALYTIC RATING      1.45      2.02      2.43       2.97
________________________________________________________________________________

Note: Each component mean reflects the analytic rating assigned by two raters to 
      10 essays given the specified holistic score.  

* On this component, an analysis of variance indicated significant differences    
in the mean analytic scores assigned to essays differing in their holistic    
scores (p < .05).

                            
                                        
    As is shown by the final statistics listed in Table 8, Ravan found that the

overall analytic scores assigned to the 40 essays confirmed the rank order

established by these essays' holistic scores.  That is, essays given higher holistic

scores also were assigned total analytic ratings that were higher, which

suggests that the holistic scores do effectively reflect the overall quality of

writing found in the Regents' Test essays.  Ravan also found that the holistic

scores generally reflected the quality of individual features of essays that

were graded.  As shown in the table, the analytic scores for the three

categories of Organization, Rhetoric, and Mechanics reflected a rank order like

that established by these essays' holistic scores.  Thus, Ravan found the

expected qualitative differences in the organization, rhetoric, and mechanics of

essays assigned different holistic scores.  Table 8 also shows that even for the

components of the analytic categories, the analytic ratings for all components

but punctuation and usage progressed from low to high in accord with the

holistic scores.  For example, on all 22 components, the analytic scores

assigned to the essays rated holistically as 2's were higher than the analytic

scores assigned to the essays holistically rated as 1's.  With respect to the

components of punctuation, usage, and also spelling, Ravan noted that both good

and poor essays were assigned relatively high scores on these mechanical aspects

of writing.  These findings indicated that organizational and stylistic problems

rather than mechanics had impeded the communication of the poor writers in her

sample.   Ravan's general conclusion was that her data demonstrated that the

holistic procedures used to grade the Essay Test were valid.  As she noted, "the

results of the analytic procedure placed the forty essays in four ranks of essay

quality, ranks which correspond to the essay ranks previously determined (by

holistic scoring)" (p. 110).
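
    The rank-order comparison described above can be illustrated with a short

sketch.  The values are the category means and overall analytic ratings copied

from Table 8; the Python function and variable names are illustrative

assumptions and are not part of Ravan's study.

```python
# Category means and overall analytic ratings copied from Table 8,
# listed in order of holistic score (1, 2, 3, 4).
table8_means = {
    "Organization": [1.15, 1.70, 1.82, 2.68],
    "Rhetoric": [1.35, 1.91, 2.50, 2.94],
    "Mechanics": [2.53, 3.13, 3.33, 3.67],
    "Overall": [1.45, 2.02, 2.43, 2.97],
}


def is_strictly_increasing(values):
    """Return True when each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# True for a row means the analytic ratings rise with the holistic score,
# which is the sense in which the rank order is "confirmed" in this sketch.
rank_order_confirmed = {
    name: is_strictly_increasing(means) for name, means in table8_means.items()
}

for name, confirmed in rank_order_confirmed.items():
    print(f"{name}: {confirmed}")
```

    This check looks only at the ordering of the means; it says nothing about

statistical significance, which Ravan assessed with analyses of variance.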

    Henderson's study differed slightly from Ravan's in that he was interested

in the question of whether only those essays that had been holistically graded

as 1's and 2's (Failing and Barely Passing) would be judged to differ in quality

when analytically graded.  For his study, Henderson selected 72 essays to be

analytically graded, with 18 essays obtained from each of four

institutions.  Of the 72 essays, 36 had been given holistic grades of 1, and 36

had been given holistic grades of 2.  Twelve raters graded each of the 72 essays

on the abovementioned 22 components using a two-point scale having the values

(0) Fail and (1) Pass.  Henderson then calculated the average component ratings

assigned to the 36 essays that had been holistically graded as 1's, and the

average component ratings assigned to the 36 essays holistically graded as 2's. 

He also calculated the percentage of analytic "pass" ratings each set of 36 was

assigned on each of the 22 components.

    As noted in Table 9, Henderson found that, on all components, the set of

essays holistically graded as 2's was assigned mean analytic ratings that were

significantly higher (p < .001) than the mean analytic ratings assigned to the

set that had been holistically graded as 1's.  In Henderson's view, these

findings showed that the holistic grades of 1 and 2 were valid indicators of

different levels of writing quality.  He noted, "(this) evidence that every one

of the twenty-two criteria does differentiate between a pass/fail holistic

rating for the essay unquestionably validates the holistic rating method at

Levels 1 (fail) and 2 (minimal pass).  The (holistic) rating procedure is,



                                         Table 9

               Mean Analytic Ratings on 22 Components for Regents' Essays
                            Given Holistic Ratings of 1 and 2
                                    [Henderson, 1977]

________________________________________________________________________________________

                                        Holistic Scores
Analytic                 _________________________________________      Component
Components                          1                    2                Means
________________________________________________________________________________________

Category of
Organization:

 1.  Limiting the
      Subject                    7.08                10.28              8.68

 2.  Evidence of a
      Thesis                     8.53                11.33              9.93            

 3.  Development of
      Thesis:  Unity             5.22                 9.72              7.47

 4.  Development of Thesis:
      Logical Development        3.56                 8.28              5.92            
 
 5.  Development of Thesis:
      Coherence                  3.53                 8.25              5.89

 6.  Development of Thesis:
      Evidence                   3.67                 7.75              5.71            


Category of Rhetoric:

Diction:

 7.  Clarity                     6.22                10.00              8.11

 8.  Economy                     5.72                 8.58              7.15

 9.  Precision                   4.39                 8.22              6.31

10.  Consistency                 7.97                10.67              9.32            

Sentence Structure:

11.  Clarity                     5.53                10.28              7.90            

12.  Variety                     6.47                 9.44              7.96

13.  Economy                     5.19                 8.36              6.78

14.  Parallelism                 6.75                10.19              8.47  


Paragraph Structure:

15.  Unity                       6.22                10.06              8.14            

16.  Logical Development         2.87                 6.94              4.90            

17.  Coherence                   4.39                 8.67              6.53


Point of View:

18.  Appropriateness             8.69                10.97              9.83

19.  Consistency                 8.25                10.78              9.51            

      
Category of Mechanics:

20.  Spelling                    7.81                11.19              9.50            

21.  Punctuation                 8.33                11.00              9.67            

22.  Usage                       6.86                10.94              8.90            
                                                     

OVERALL ANALYTIC RATING          6.06                 9.63      
________________________________________________________________________________________

   Note: Each component score reflects the analytic ratings (1 = pass, 0 = fail)
         assigned to 36 essays by 12 raters when the 12 ratings assigned to each
         essay are summed over essays and this total is divided by 36.  For all
         components, analyses of variance indicated that the analytic scores
         assigned to essays holistically graded as 1's differed significantly
         from those assigned to essays holistically graded as 2's (p < .05).



therefore, functional.  The fact that the results are not just marginally

significant but significant at the .001 level of confidence reinforces the

degree of assurance for the validation contention" (p. 84).

    Examining the analytic scores in more detail, Henderson observed that the

essays given the failing holistic grades of 1 differed most markedly from those

given the passing holistic grade of 2 on components 3-6, 9, 13, 16, and 19,

which are components pertaining to organization and rhetoric.  As is indicated

in Table 10, on these components only 13% to 40% of the analytic ratings

assigned to the 36 failing essays were "passes," whereas such passes were given

in 58% to 81% of the analytic ratings given to the 36 barely passing essays.  In

contrast, Henderson noted that, in general, the failing essays did not have

serious mechanical problems; 65% to 69% of the analytic ratings assigned to 

these essays on spelling, punctuation, and usage were "passes."  On the basis of

these findings, Henderson concluded, in accord with Ravan, that mechanics had

not been the major factor causing the failures among the essays in his sample.




                                        Table 10
     
                 Percentage of "Passing" Analytic Ratings Assigned on 22
            Components to Regents' Essays Holistically Graded as 1's and 2's
                                    [Henderson, 1977]
     
                                            
     _________________________________________________________________________
     
                                                    Holistic Ratings
     Analytic                        _________________________________________
     Components                                                              
                                                1                    2             
     _________________________________________________________________________
     Category of
     Organization:
     
     1.  Limiting the                         
         Subject                              59                    87

     2.  Evidence of a
         Thesis                               71                    94

     3.  Development of
         Thesis:  Unity                       43                    81

     4.  Development of Thesis:
         Logical Development                  30                    69
 
     5.  Development of Thesis:
         Coherence                            29                    69

     6.  Development of Thesis:
         Evidence                             31                    66


     Category of Rhetoric:
     
     Diction:

     7.  Clarity                              52                    83

     8.  Economy                              48                    69

     9.  Precision                            37                    68

    10.  Consistency                          66                    89


    Sentence Structure:

    11.  Clarity                              46                    86

    12.  Variety                              54                    78
    
    13.  Economy                              13                    70

    14.  Parallelism                          56                    85


    Paragraph Structure:

    15.  Unity                                52                    84

    16.  Logical Development                  24                    58

    17.  Coherence                            37                    72

    Point of View:

    18.  Appropriateness                      73                    91

    19.  Consistency                          68                    90

      
    Category of Mechanics:
  
    20.  Spelling                             65                    93

    21.  Punctuation                          69                    92

    22.  Usage                                57                    91


   
     _________________________________________________________________________

     Note: Each entry is the percentage of passing grades (1 = pass) among the
           analytic ratings assigned by 12 raters to 36 essays.
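
    The notes to Tables 9 and 10 suggest how the two tables are related.  A

Table 9 entry is the mean number of passing ratings an essay received from its

12 raters, so dividing that entry by 12 gives the proportion of passing ratings. 

The sketch below applies this conversion to two components.  The relation is

inferred from the table notes rather than stated by Henderson, some published

entries differ from the converted values, and the function and variable names

are hypothetical.

```python
RATERS_PER_ESSAY = 12  # each essay was rated by 12 raters (note to Table 9)


def percent_passing(mean_passes_per_essay, raters=RATERS_PER_ESSAY):
    """Convert a mean count of passing ratings per essay to a whole percentage."""
    return round(100 * mean_passes_per_essay / raters)


# Table 9 entries for essays holistically graded 1 and 2, respectively.
table9_examples = {
    "Evidence of a Thesis": (8.53, 11.33),
    "Spelling": (7.81, 11.19),
}

# Converted values, for comparison with the corresponding rows of Table 10.
table10_from_table9 = {
    name: tuple(percent_passing(score) for score in scores)
    for name, scores in table9_examples.items()
}

for name, percentages in table10_from_table9.items():
    print(name, percentages)
```

    For these two components the converted values agree with the percentages

listed in Table 10.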



    The third investigation to be noted is that of Veal and Rentz (1975), who

examined the relation between the Essay Test and an objective writing test.  The

objective test that these researchers used was a part of the Regents' Test at

the time they collected their data.  It was composed of multiple-choice items

that appraised grammar and usage skills.  From the Spring, 1972, administration

of the Regents' Test, Veal and Rentz obtained the scores of 292 students who had

taken both the objective test and the Essay Test.  To examine the relation

between these two measures, Veal and Rentz first grouped the students into

quartiles on the basis of their objective test performance.  They then calculated

the percentage of students in each quartile who obtained passing scores on the

Essay Test.  Since grammar and usage are two factors that Regents' Test raters

take into account when grading students' essays, it is reasonable to expect some

positive relation between performance on the objective test and the pass rates

that are attained on the Essay Test.  However, this relation should not be very

strong because the description of the essay scoring procedures indicates that

factors such as organization and style also will influence the grades assigned

to Regents' Test essays.

    Veal and Rentz's findings are presented in Table 11.  In general, these

findings indicate that, as expected, there is a positive relation between

students' performance on the objective test and their scores on the Essay Test. 

Veal and Rentz did not calculate a measure of the precise relation between these

measures, but it is evident from study of Table 11 that increases in students'

performance on the objective test were accompanied by increases in the

percentages of students who obtained passing Essay Test scores of 2, 3, or 4.  Of

those students who performed in the lowest quartile on the objective test, for

example, only 38.7% attained passing status on the Essay Test.  In contrast,

passing status was attained by 69.7% of the students performing in the second

quartile on the objective test, and higher pass rates were obtained by those

performing in the third and fourth quartiles on this test.  Thus, Veal and

Rentz's findings show that students' performance on the Essay Test is related to

their performance on another test of their writing skills.  This finding lends

some support to the claim that students' essay test scores are valid measures of

their writing skill.



                                      Table 11

        The Pass Rates Attained on the Regents' Essay Test by Students Performing
                    at Different Levels on an Objective Writing Test
                                  [Veal & Rentz, 1975]
 

__________________________________________________________________________________________

                          Performance on Objective Writing Test
__________________________________________________________________________________________

    1st Quartile           2nd Quartile           3rd Quartile            4th Quartile
   (Mean = 40.55)         (Mean = 49.55)         (Mean = 55.18)          (Mean = 63.75)
__________________________________________________________________________________________

      .387                  .697                  .806                   .971
    (n = 78)              (n = 82)              (n = 69)               (n = 63)

__________________________________________________________________________________________



Relations between Essay Test Scores and Selected Academic Variables

    As noted previously, additional evidence of a test's validity can be pro-

vided by studies that show that the test is correlated in the manner expected

with other variables of interest.  In the case of the Essay Test, positive

correlations between this test and certain academic variables are reasonable to

expect.  As part of their studies of the Regents' Test, Hickman (1973) and

Prather and Smith (1975) examined the relations of students' Essay Test scores

to their high school and college grade point averages and to their scores on the

verbal and mathematical sections of the Scholastic Aptitude Test.  The findings

from these two studies are given in Table 12.



                                        Table 12

                      Findings from Two Studies on the Correlations
             between the Regents' Essay Test and Selected Academic Variables


         
___________________________________________________________________________

                                               Hickman          Prather &
          Academic Variables                   (1973)*       Smith (1975)**
         
___________________________________________________________________________

          Aptitude Measures
          Scholastic Aptitude Test (Verbal)      .37              .29
          Scholastic Aptitude Test (Math)        .21              .24

          Grade Point Averages
          Cumulative - High School               .23              .11
          Cumulative - College                   .26              .30
          Freshman - College                     .21              .23
          English Composition - College          .20              .13
      
         
___________________________________________________________________________

           *Correlations based on 660 to 892 students attending five different 
            institutions in the University System of Georgia.

          **Correlations based on 1,910 students attending one university in the 
            University System of Georgia.



    As is evident from the table, the correlations obtained in the two studies

were similar.  Students' Essay Test scores were most strongly correlated with

their performance on the verbal section of the Scholastic Aptitude Test (SAT-V),

but this correlation was not high.  Some relation between these two measures

should be expected.  The SAT-V assesses individuals' verbal reasoning skills,

which undoubtedly influence the effectiveness of their writing, but it does not

directly assess those skills in organizing and expressing ideas that are also

germane to successful performance on the Essay Test.  Also, the correlations

between the SAT-V and the Essay Test reported by Hickman and by Prather and

Smith might have been higher if Essay Test scores were not restricted to a

four-point range.
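
    The attenuating effect of a coarse score scale can be illustrated with a
small simulation.  The sketch below is not drawn from either study: the
underlying correlation of .50, the sample size, the random seed, and the cut
points used to collapse a continuous writing score into four categories are all
hypothetical values chosen only for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def to_four_point(score, cuts=(-1.0, 0.0, 1.0)):
    """Collapse a continuous score into a 1-4 rating (hypothetical cut points)."""
    return 1 + sum(score > c for c in cuts)


def simulate(n=20000, rho=0.5, seed=1982):
    """Return (r using the continuous score, r using the four-point rating)."""
    rng = random.Random(seed)
    aptitude, writing = [], []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)
        # By construction, the writing score correlates rho with aptitude.
        w = rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        aptitude.append(a)
        writing.append(w)
    ratings = [to_four_point(w) for w in writing]
    return pearson(aptitude, writing), pearson(aptitude, ratings)


r_continuous, r_four_point = simulate()
print(f"continuous: {r_continuous:.3f}   four-point: {r_four_point:.3f}")
```

    Under these assumptions the correlation computed from the four-point
rating falls below the one computed from the continuous score, which is the
direction of the effect described above.  The size of the drop depends on the
assumed cut points and should not be read as an estimate for the Essay Test.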

    With respect to the low correlation between SAT-M and students' Essay Test

scores, restriction of range also may be influential here, but these

correlations would be expected to be low in any case.  Although some portion of

the ability influencing individuals' performance on the SAT-M might also in-

fluence their writing skills, the SAT-M and the Essay Test largely assess

different skills. 

    Finally, the low correlations found between students' Essay Test scores and

their grade point averages (GPA's) should be considered.  Since students' GPA's

can range over only a 5-point scale and the Essay Test scores are on a 4-point

scale, both variables involved in these correlations have restricted ranges that

limit the size of these correlations.  Also, because many factors other than

writing skill affect students' grades in most courses that they take (Cronbach,

1971), the relation of their essay scores to their cumulative GPA's is unlikely

to be large.  However, the low relation between the essay scores and students'

English composition grades is somewhat surprising.  Presumably, these grades

primarily reflect students' writing skill.  Hickman (1973) suggested that the

low relation was obtained because the essays graded in English composition

classes are unlike those entailed in the Essay Test in terms of both the time

allowed for writing and the audience to which the two types of essays are

addressed.  Also, she noted that students' grades in composition are based on

several appraisals of the students' writing, whereas their Essay Test score is

based on only one.  However, it is also possible that the low relation occurs

because the grading of Regents' essays is systematic, whereas the grades

assigned in English composition courses are based on standards that vary across

courses and instructors and are influenced by many factors that are

uncontrolled.

    The meaning of the correlations reported by Hickman and by Prather and

Smith is clarified by data collected by Citron (1980).  As part of a larger

study, she examined the relations of students' SAT-V scores and their grades in

English to the pass rates they attained on the Essay Test.  For her study,

Citron collected data on 1,971 students who attended the Georgia Institute of

Technology and took the Regents' Test between Summer, 1977, and Spring, 1978.

    Citron's findings are presented in Tables 13 and 14.  As shown in Table 13,

performance on the SAT-V actually bears a stronger relationship to the passing

rates on the Essay Test than the correlations reported above convey.  Only 26%

of the students with SAT-V scores between 200 and 299 passed the Essay Test,

whereas 75% of those with SAT-V scores between 400 and 499 were given passing

scores on this test, and all but a few of the students at the top of the SAT-V

score range passed the Essay Test.  Like Hickman (1973) and Prather and Smith

(1975), Citron had obtained a low correlation (r = .285) between the SAT-V and

the Essay Test, but this correlation clearly understates the fairly strong

relation between performance on these two measures shown by Table 13.



                                          Table 13
    
                      Relation between Scores on the Verbal Section
                       of the Scholastic Aptitude Test (SAT-V) and
                        Passing Rates on the Regents' Essay Test
                                     [Citron, 1980]
 
             ________________________________________________________________
                                            
                SAT-V                                  Percent Passing
                Scores                                 Regents' Essay Test 
             ________________________________________________________________
     
                200-299                               26    (n = 23)
                300-399                               61    (n = 152)
                400-499                               75    (n = 634)
                500-599                               86    (n = 760)
                600-699                               89    (n = 336)
                700-800                               92    (n = 66)
             ________________________________________________________________            
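
    The way in which a steep gradient in pass rates can coexist with a low
correlation coefficient can be checked directly from the grouped data in
Table 13.  The sketch below is an illustration only, not Citron's analysis:
it assumes that every student in a band scored at the band midpoint, and it
correlates that midpoint with a pass/fail indicator rather than with the
essay score itself.

```python
import math

# (band midpoint, percent passing, n) transcribed from Table 13.
# Treating the midpoint as every student's SAT-V score is an assumption.
TABLE_13 = [
    (250, 26, 23),
    (350, 61, 152),
    (450, 75, 634),
    (550, 86, 760),
    (650, 89, 336),
    (750, 92, 66),
]


def grouped_point_biserial(rows):
    """Correlation between band midpoint and a pass (1) / fail (0) indicator."""
    n = sum(k for _, _, k in rows)
    mean_x = sum(x * k for x, _, k in rows) / n
    p = sum(pct / 100 * k for _, pct, k in rows) / n  # overall pass rate
    var_x = sum(k * (x - mean_x) ** 2 for x, _, k in rows) / n
    # Covariance as E[xy] - E[x]E[y], where y is the pass indicator.
    cov = sum(x * pct / 100 * k for x, pct, k in rows) / n - mean_x * p
    return n, p, cov / math.sqrt(var_x * p * (1 - p))


n, pass_rate, r = grouped_point_biserial(TABLE_13)
print(f"n = {n}, overall pass rate = {pass_rate:.3f}, r = {r:.3f}")
```

    The band sizes sum to the 1,971 students in Citron's sample.  Although the
pass rate climbs from 26% to 92% across the bands, the coefficient computed in
this way is only in the low .20s, largely because most students fall in the
middle bands and most of them pass.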


 
   In Table 14, where the relation of students' Essay Test scores to their

English GPA's is displayed, a marked relation between these two measures also is

evident.  Among those students with English GPA's in the 1.0 to 1.9 range, only

58% passed the Essay Test.  In contrast, 78% of those having English GPA's

between 2.0 and 2.9 and 90% of those with GPA's of 3.0 or above passed the test.

Although a correlation was not calculated for the data presented in Table 14, the

relation between Essay Test scores and English GPA is evidently stronger than

that conveyed by the low correlation coefficients calculated by Hickman and by

Prather and Smith. 



                                        Table 14

                      Relation between English Grade Point Averages
                      and Passing Rates on the Regents' Essay Test
                                     [Citron, 1980]


             __________________________________________________________________

                 English Grade Point                   Percentage Passing
                Averages                              Regents' Essay Test
             __________________________________________________________________

                  0 -  .99                            NA*
                1.0 - 1.99                            58    (n = 168)
                2.0 - 2.99                            78    (n = 804)
                3.0 - 4.00                            90    (n = 90)
             __________________________________________________________________ 

              *The pass rate is not given for students having English GPA's 
               less than 1.0 because it would not be meaningful.  Citron 
               assigned GPA's of 0 to the 219 students in her sample who 
               had not taken English and grouped these students with the 21 
               students who had actually obtained English GPA's of less than 
               1.0.



Relation Between Essay Test Scores and Irrelevant Variables   

    As was suggested apropos of the Reading Test, support for a claim of test

validity is also gained by findings that a test does not relate to variables

that are thought to be unrelated to the quality assessed by the test (see

Campbell, 1964).  With respect to the Essay Test, two variables that should be

unrelated to individuals' scores on this test are those of handwriting and

neatness since essay raters are not directed to consider these two variables

when grading Regents' essays.  Therefore, a finding that the holistic scores

assigned to essays are unaffected by the handwriting and neatness evident in

the essays would support the claim that these scores validly reflect writing

skill rather than such irrelevant variables.  The impact on these

scores of two other presumably irrelevant variables, speededness and bias due

to the influence of ethnic background, was also examined and is discussed in

the pages that follow.


HANDWRITING AND NEATNESS. Gwinn & Renfrow (1980) investigated the effects of

handwriting on the holistic scores assigned to Regents' essays and found no

evidence that handwriting affected these scores.  In their study, these re-

searchers used three essays that were on the borderline between passing and

failing.  Each essay was copied three times, once in a very clear, neat hand-

writing, once in an average handwriting, and once in a legible but very poor

handwriting.  Eighteen graders then rated the three essays, with each grader

rating one essay from each of the three levels of handwriting. The mean

ratings for essays written in good, average, and poor handwriting were 1.33

(s=.49), 1.44 (s=.51), and 1.56 (s=.62), respectively.  An analysis of variance 

showed that there were no significant differences in the mean essay ratings

assigned to the essays that differed in the quality of their handwriting

[F(1,17) = 1.21; p > .05]. 

    In 1981, Renfrow conducted a second study examining the effects of hand-

writing on essay ratings.  In this case, Renfrow analyzed a systematic sample

of 154 essays written by students at Georgia State University during the Fall,

1979, administration of the Regents' Test.  As part of a larger study, she

rated each of these essays on the composite variable of handwriting and

neatness.  A nine-point scale was used for her ratings.  The handwriting ratings

were then correlated with the sums of the three holistic ratings assigned to

the essays by Regents' Test raters in a regular scoring session.  A very small

negative correlation (r = -.11) was found between handwriting and the essay

ratings.  Thus, the conclusions of the second study were similar to those of

the first: student handwriting does not seem to be a salient characteristic

affecting the rating of essays written for the Regents' Testing Program;

students who write good essays in poor handwriting do not seem to be penalized

for their handwriting, and students who write poor essays do not seem to be

rewarded for good handwriting.
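
    How little a correlation of this size can be distinguished from zero may be
gauged with an approximate confidence interval.  The sketch below is an
illustration rather than part of Renfrow's analysis: it assumes a sample of 154
essays, approximately bivariate-normal ratings, and the usual Fisher z
transformation with a critical value of 1.96.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation via Fisher's z."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


low, high = fisher_ci(r=-0.11, n=154)
print(f"approximate 95% CI for r = -.11, n = 154: ({low:.2f}, {high:.2f})")
```

    Under these assumptions the interval includes zero, and it continues to do
so if the sample is a few essays smaller.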



SPEEDEDNESS.  As in the case of the Reading Test, the Essay Test is intended to

be a power test, not a speed test.  That is, students' Essay Test scores are

expected to primarily reflect the quality of their writing skills rather than

the rate at which they can compose and write an essay.

    Because students who are administered the Essay Test select and write on

just one topic, in effect the test is a one-item test.  Therefore, some of the

less complex methods of appraising test speededness cannot be applied to the

Essay Test because these methods pertain only to tests that have multiple items

(see Donlon, 1978; Rindler, 1979).  The ideal, albeit most complex, way to ap-

praise speededness would be to compare the scores that students obtain on

parallel test forms that they have taken in both timed and untimed administra-

tions (see Cronbach & Warrington, 1951).  This kind of experimental study, while

highly desirable, has not been conducted for the current 60-minute Essay Test

both because of its administrative complexity and because of the difficulty

entailed in creating test conditions like those students actually encounter when

taking the Essay Test. 

    Some studies of the Essay Test were carried out when the time limit on this

test was less than 60 minutes.  These studies do not unequivocally demonstrate

that the 60-minute limit introduces no speededness, but their findings do

suggest that students' scores on the 60-minute Essay Test are less likely to be

influenced by the factor of speededness than they would be if the time limit

were shorter.

    One of these studies was carried out by the Regents' Testing Program in

1974 to investigate, in part, the effect on students' performance of an increase

in the time limits on the Essay Test from 30 minutes to 45 minutes.  In this

study, one to three classes in English composition at each of five institutions

in the University System were randomly assigned to 30-minute and 45-minute test

administrations.  In all, 338 students were allowed 30 minutes and 291 students

were allowed 45 minutes to take the Essay Test.  The essays that were written

were then graded at the regular Winter Quarter grading session.  Subsequently,

the essay scores were analyzed, and it was found that 216 (74%) of the students

working under the 45-minute time limit had attained passing scores, whereas only

220 (65%) of the students working under the 30-minute time limit had passed the

test.  Thus, the 15-minute increase in the time limit produced a higher pass

rate.  This finding suggests that speededness may have been a factor that

depressed performance on the 30-minute test, and that an extension of the time

limit to 45 minutes would diminish the effects of this factor and improve test

performance.
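
    The size of the difference between the two pass rates can be examined with
a simple test for two proportions.  The sketch below is an illustration and was
not part of the 1974 study.  It treats each student as an independent
observation, which is an assumption: intact classes, not individual students,
were assigned to the two time limits, so the test is likely to overstate the
precision of the comparison.

```python
import math


def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Pooled two-sample z test for a difference between two proportions."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return p_a, p_b, z, p_two_sided


# 45-minute group: 216 of 291 passed; 30-minute group: 220 of 338 passed.
p45, p30, z, p_value = two_proportion_z(216, 291, 220, 338)
print(f"45 min: {p45:.1%}   30 min: {p30:.1%}   z = {z:.2f}   p = {p_value:.3f}")
```

    The computed proportions reproduce the 74% and 65% reported above, and the
difference of about nine percentage points is larger than would be expected
from sampling error alone under the independence assumption.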

    In light of these findings, the time limits on the Essay Test were subse-

quently extended to 45 minutes, and Henderson (1977) later concluded that these

limits should be extended further to one hour.  As previously noted, Henderson

examined 22 analytic, component ratings assigned to Regents' essays that had

been given holistic scores of 1 (Fail) and 2 (Barely Passing).  In the

discussion of his findings, Henderson noted that both the failing and the barely

passing essays had been given relatively low analytic scores on the components

pertaining to thesis and paragraph development and to economy and precision of

diction.  In Henderson's view, these low scores could have been a product of the

45-minute time limit on the Essay Test.  He contended that this time limit

prevented students both from developing their theses fully and from revising

their writing to make its wording more economical and precise.  Henderson re-

commended that the time limit on the test be extended from 45 minutes to one

hour.  This was done in 1978.

    The impact of the 60-minute time limit on students' test performance has

not been formally examined, but there is no evidence from informal studies of

students' essays that they have difficulty completing the test: the final para-

graphs of most students' essays are found to be complete, and students usually

make revisions in their essays, which suggests that they have time to review and

edit their work.  Also, as Willig (1980) determined from his survey of

composition teachers in Georgia, most teachers of writing (71%) think that the

1-hour time limit is appropriate.  Thus, there appears to be little evidence

that the 60-minute limit is hindering students' test performance.  However, a

formal study of the performance effects of the 60-minute limit would be carried

out if problems produced by this limit became evident. 



BIAS DUE TO THE INFLUENCE OF ETHNIC BACKGROUND.  In the case of the Essay Test,

bias can be viewed in the same way as that suggested apropos of the Reading Test,

that is, as a matter related to the validity of the test.  More specifically,

bias can be thought of as pertinent to the finding that the scores on a test do

not have the same meaning for all groups of individuals who are tested.  This

finding is undesirable because individuals' scores on a test should reflect the

skill or the ability one wishes to assess and not irrelevant variables such as

group membership.

    Some of the types of internal and external analyses described as appro-

priate for detecting bias in the Reading Test are also applicable in the

investigation of bias in the Essay Test.  As noted previously, internal analyses

include considerations of the content of the test, the difficulty of the test,

its internal consistency, and the like, in the interest of determining whether

the test behaves differently for different groups that are assessed.  External

analyses include studies of the relations between the test and other variables

in order to examine whether these relations are the same for different groups

(see Jensen, 1980, for a detailed discussion of these two types of analyses).

    Internal and external analyses that have been conducted on the Essay Test

bear on the question of whether scores on this test have the same meaning for

groups of black and white students that are assessed.  The findings from these

studies are reported in the pages that follow.  The results of the internal

analyses are treated prior to those from the external analyses.



Studies of bias based on internal analyses

    ANALYSES OF TEST CONTENT.  As the validity of the Essay Test, like that of

the Reading Test, is primarily determined by the validity of its content,

studies of this content comprise one important means of detecting factors that

could distort the meaning of the test for members of different groups (Shepard,

1981).  Two types of studies of test content should be carried out.  One of

these entails an examination of the content validity of the test, that is, a

study of (1) the clarity with which the domains of skills to be assessed by the

test are described, and (2) the degree to which the items written for the test

represent these domains (APA et al., 1974).  The second type of study entails

examining the items of a test to detect any occurrences in these items of

ambiguous wording, stereotypic images, or unfamiliar language that might alter

the meaning of the test for the members of a particular group.

    The procedures used to appraise the content of the Essay Test have been

previously described, but can be briefly summarized here.  The Testing

Subcommittee selected model essays to define the manner in which the Regents'

Test essays should be appraised and provided both analyses of these essays and a

detailed description of the manner in which the essays should be scored.  The

Testing Subcommittee also defined the content of the Essay Test, indicating that

essay topics used on the Regents' Test should be narrow enough to elicit an

essay in 60 minutes, but broad enough to bear on students' common knowledge and

experiences rather than on specialized knowledge that only some students might

have (Thompson & Rentz, 1973).  Also, the topics were not to (1) contain

difficult vocabulary, (2) appear to have a rural-urban or ethnic bias,         

(3) closely resemble topics previously used, (4) involve highly controversial or

emotional subjects, or (5) seem to encourage students to identify their

institutions in their essays. 

    Potential essay topics are solicited from students, faculty, and

administrators in the University System so that a pool of diverse topics becomes

available for use on the Essay Test.  These topics are reviewed by the Testing

Subcommittee, and those that do not fit its specifications for the Test are

either revised or eliminated.  The topics that are regarded as acceptable are

subsequently submitted to the full Academic Committee on English for further

consideration and final approval.  This Committee, composed of representatives

from all 33 institutions in the System, considers the wording and difficulty of

the topics and suggests revisions and deletions in the list of topics where

necessary.  Since four members of the committee come from the historically black

institutions in the University System, the committee's review of essay topics is

expected to detect any topics having content that might not be equally familiar

to black and white students.



ANALYSES OF TEST RESPONSES.   Analyses of individuals' responses to a test

constitute a second component of an internal analysis designed to detect the

presence of bias.  These analyses are comparative in nature, conducted to

discern whether there are differences in the test responses of different groups

that might indicate the presence of bias. 

    To detect biasing features of test content, it is usual to conduct a study

of the relative difficulties of a test's items for members of different groups

(Shepard, 1981).  For the Essay Test, this type of study has not yet been done

because of its administrative complexity: to conduct such a study, numerous

Essay Tests would have to be administered either to the same samples of black

and white students or to many samples of black students and white students that

are equivalent in ability.  Such a study, however complex, is desirable to carry

out and shall be conducted by the Regents' Testing Program office if its

inherent problems can be worked out.

    Because the essays written for the Essay Test are not scored mechanically,

the raters who grade these essays constitute a second possible source of bias

(see Guion, 1978).  Raters' appraisals of an essay may be influenced by

variables that are unrelated to writing quality so that, as a consequence, these

variables inappropriately affect the ratings given to members of a particular

ethnic group.

    This matter was investigated by the Regents' Testing Program office in

1974.  Of interest in the study was the question of whether raters who differed

in race assigned grades to the essays written by black students that were dif-

ferent from those that were assigned to the essays written by white students. 

The raters of Regents' essays are given no information concerning the race of

the essay writers.  Therefore, if students' race is found to affect the essay

scores that raters of different race assign, this bias effect is probably due to

subtle features of students' essays and would not be recognized unless such a

study were done.

    The study involved 3,218 essays that were graded at the regular, quarterly

scoring sessions held at three Regents' Essay Scoring Centers. It was not

possible to determine the race of the students or the raters. Therefore, these

raters and students were classified in terms of their affiliation with a

predominantly black or a predominantly non-black school.  The effects of

students' institutional type, raters' institutional type, and the interaction of

these factors on students' essay scores were examined by calculating the mean

scores assigned to the essays that had been classified by students' and raters'

institutional type.  These calculations are presented in Table 15 and depicted

in Figure 1.



                                        Table 15

             Mean Ratings* Assigned by Essay Raters from Predominantly Black
       and Non-Black Institutions to Essays Written by Students from Predominantly
                            Black and Non-Black Institutions

                                                                                 

______________________________________________________________________________

                                          Raters' Institutions
                              ________________________________________________

Essay Writers'                     Predominantly               Predominantly
Institutions                         Non-black                     Black
______________________________________________________________________________

Predominantly Non-black              1.89                         1.96
                                     (.72)                        (.72)
                                    n = 5,777                    n = 3,094

Predominantly Black                  1.38                         1.44
                                     (.58)                        (.57)
                                    n = 495                      n = 288

______________________________________________________________________________

*Standard deviations given within parentheses
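
    The pattern in the cell means of Table 15 can be summarized with a few
differences.  The sketch below is descriptive only and is not the analysis
carried out in the 1974 study.  It assumes that each n in the table counts
ratings rather than essays; that reading is an inference from the fact that
the four n's sum to three times the 3,218 essays in the study, and it is not
stated in the table.

```python
# Cell entries transcribed from Table 15: (mean rating, number of ratings),
# keyed by (essay writer's institution type, rater's institution type).
CELLS = {
    ("non-black", "non-black"): (1.89, 5777),
    ("non-black", "black"): (1.96, 3094),
    ("black", "non-black"): (1.38, 495),
    ("black", "black"): (1.44, 288),
}


def rater_gap(writer_type):
    """Mean from black-institution raters minus mean from non-black ones."""
    return CELLS[(writer_type, "black")][0] - CELLS[(writer_type, "non-black")][0]


def writer_gap(rater_type):
    """Mean for non-black-institution writers minus mean for black ones."""
    return CELLS[("non-black", rater_type)][0] - CELLS[("black", rater_type)][0]


total_ratings = sum(n for _, n in CELLS.values())
# Interaction contrast: does the rater gap depend on the writer's institution?
interaction = rater_gap("non-black") - rater_gap("black")

print(f"total ratings:                {total_ratings}")
print(f"rater gap, non-black writers: {rater_gap('non-black'):.2f}")
print(f"rater gap, black writers:     {rater_gap('black'):.2f}")
print(f"writer gap, non-black raters: {writer_gap('non-black'):.2f}")
print(f"writer gap, black raters:     {writer_gap('black'):.2f}")
print(f"interaction contrast:         {interaction:.2f}")
```

    The interaction contrast is 0.01 rating points, which is very small
relative to the standard deviations of .57 to .72 shown in the table.  This
contrast describes the cell means only; it is not a test of significance.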







    As is somewhat evident in Table 15 and very clear in Figure 1, raters from
the predominantly black institutions assigned both groups of students slightly
higher mean essay scores than did raters from predominantly non-black
institutions.  However, the raters from both types of institutions were similar
in that both gave higher mean essay scores to the students from predominantly
non-black institutions than they did to those students from the predominantly
black schools.  Thus, raters from both types of institutions rated students from
predominantly black and non-black institutions in the same way and did not
appear to be differently influenced by the type of institution from which an
essay writer came.  Therefore, there was no evidence from this study that raters
showed bias in the grades they assigned to the essays written by students from
predominantly black and non-black institutions.



Studies of bias based on external analyses

    Hickman (1973) examined the relations between the Essay Test and selected
academic variables as part of a larger study, and her findings permit some rough
determination of whether these relations differ for black and non-black groups
of students.  In her study, Hickman calculated these relations for each of five
different institutions.  This group of five institutions included one four-year
college that was predominantly black, and two junior colleges, one four-year
college, and one university that were primarily non-black.  For each of these
five institutions, Hickman calculated the correlations between students' Essay
Test scores and both their scores on the Scholastic Aptitude Test and their
grade point averages.  Thus, through a comparison of the correlations for the
predominantly black college with those found at the four predominantly non-black
institutions, some tentative conclusions can be drawn about the similarities or
differences in the meanings of black students' and non-black students' essay
scores.
Of course, such conclusions must be regarded as rough because the institutions
in Hickman's study were not homogeneous with respect to race and, therefore, are
not adequate indicators of students' race.  Also, the generalizability of these
findings is questionable because Hickman used in her study only one
predominantly black college, which may or may not be representative of other
black institutions or black students in the University System.

    The results of Hickman's correlational analyses are presented in Table 16.
As is evident from the table, the correlations reported for the predominantly
black college are, in general, like those reported for the second junior college
she examined, and they are higher than the correlations reported for the
four-year college, the university, and the other junior college that were
examined.  For example, the correlations between students' Essay Test scores and
their SAT(V) scores are substantially higher at the predominantly black college
(r = .46) and at the second junior college (r = .45) than are these correlations
when calculated for the other three institutions that Hickman studied (r's = .29
to .32).  The correlations between the Essay Test and students' grade point
averages conform to a similar pattern, with the correlations for the
predominantly black college and the second junior college exceeding those
reported for the other three institutions examined.



                                        Table 16

          Correlations between the Regents' Essay Test and Selected Academic
                       Variables Within Five Different Institutions
                                     [Hickman, 1973]

________________________________________________________________________________________

                                               Institutional Types
                            ____________________________________________________________

                                                                          Predominantly
                                     Predominantly Non-Black                  Black
                            ___________________________________________   _____________

Academic                      Jr.        Jr.        4-Year                    4-Year
Variables                   College    College     College    University     College
________________________________________________________________________________________

Test Scores
  SAT(V)*                     .29        .45         .30         .32           .46
  SAT(M)**                    .07        .30         .10         .17           .20

Grade Point Averages
  Cumulative-High School      .20        .31         .13         .18           .35
  Cumulative-College          .25        .41         .25         .26           .41
  Freshman-College            .22        .33         .19         .20           .40
  Eng. Composition-College    .21        .39         .03         .30           .44
________________________________________________________________________________________

Note: Correlations based on sample sizes ranging from 88 to 302.

 *SAT(V) refers to the verbal section of the Scholastic Aptitude Test.
**SAT(M) refers to the mathematical section of the Scholastic Aptitude Test.



    It is not possible to ascertain from Hickman's data the reasons why the
correlations for the predominantly black college should exceed those reported
for three of the four other institutions that she studied.  Also, it is not
possible to ascertain whether similar findings would be obtained if other
institutions in the System were compared.  It can be said, however, that the
correlations that Hickman reported indicate that, for the institutions studied,
the Essay Test scores obtained by students from the predominantly black
institution generally bear a slightly stronger relation to selected academic
variables than do the scores obtained by students from the predominantly
non-black institutions.



Faculty Perceptions of the Regents' Essay Test

    The perception of a test by those who use it can shed light on validity.
If, for example, a test were perceived to have inappropriate content or to
measure irrelevant variables, there would be a basis for questioning whether the
measure will provide the type of information that is desired.  In the case of
the Regents' Essay Test, it is useful to examine how it is regarded by faculty
members in the University System of Georgia, since these faculty teach the
students who take the test and may often be involved in their remediation.
If it were found that this faculty believes that the test has improper content
or unsound scoring procedures or that the test is detrimental in its effects,
the validity of the test would have to be questioned.

    In the interest of determining faculty members' perceptions of the
Regents' Test, Willig (1980) distributed a questionnaire to composition
teachers in the University System of Georgia.  Many of the questions posed
pertained to the Essay Test, since a few critics of the test had publicly
questioned certain features of this test (see House, 1980; Watters, 1979).
Willig sent a questionnaire to each of the 498 full-time faculty members who
taught composition in the System.  Approximately 60% of these questionnaires
were completed and returned.  Of these respondents, approximately 15% had
taught composition fewer than five years, and 61% had taught it for more than
ten years.

    A summary of the responses to the questionnaire is presented in Table 17.
In general, the composition teachers indicated support for the Regents' Test
and lack of agreement with the critics of the Test.  Over 75% of the
respondents indicated that they were in favor of the Test and indicated that
it is "a meaningful check of minimal reading and writing skills."  Most of the
respondents indicated that the one-hour time limit is sufficient (71%), that
the writing of one essay is adequate (85%), and that the present anonymous
system of grading is preferable to grading done on individual campuses (82%).
Furthermore, most of the respondents indicated that the Essay Test neither
overemphasizes nor underemphasizes grammar (74%) and that the Test does not
discriminate against black students (75%).


                                   Table 17

            Faculty Responses* to Questions about the Regents' Test
                                [Willig, 1980]
______________________________________________________________________________

What is your overall opinion of the Regents' Test?
     I am in favor of it. (75%)
     I am opposed to it. (18%)
     I am neutral. (7%)

How does the existence of the Regents' Test affect the teaching of
composition at your institution?
     It tends to improve the overall teaching of composition. (57%)
     It tends to harm the overall teaching of composition. (17%)
     It has no noticeable positive or negative effect. (26%)

The test is a meaningful check of minimal reading and writing ability. (76%)
The test is not a meaningful check of minimal reading and writing
ability. (**)
I have no opinion. (**)

The test discriminates against black students. (13%)
The test does not discriminate against black students. (75%)
I have no opinion. (12%)

Students should have more than one hour to write the required essay. (24%)
One hour is sufficient writing time for the limited demands of this
essay. (71%)
I have no opinion. (5%)

The writing section of the Regents' Test emphasizes grammar too much. (5%)
The writing section of the Regents' Test emphasizes grammar too little. (10%)
The emphasis on grammar is appropriate. (74%)
I have no opinion. (11%)

The writing section of the Regents' Test should require more than one
essay. (8%)
The writing section of the Regents' Test is adequate with one essay. (85%)
I have no opinion. (7%)

The Regents' Test should be graded by professors on the campus where it is
administered. (13%)
The anonymous "mass-grading" of the Regents' Test as presently done is
adequate. (82%)
I have no opinion. (5%)

Students should be allowed to use a dictionary during the writing
section. (67%)
Students should not be allowed to use a dictionary during the writing
section. (25%)
I have no opinion. (8%)
______________________________________________________________________________
 * Responses reported in terms of percent of respondents choosing each option.
** Data not available.


    The one criticism of the Test that received support from the majority of
respondents concerned the rule against students' use of a dictionary for the
Essay Test.
(Use of a dictionary has never been permitted because of the problems with
test administration and test security that providing dictionaries would cause.
Raters of the Essay Test are aware that students do not have access to
dictionaries and are supposed to take this into account while rating.  There
is no evidence that spelling is a major cause of failing scores [Henderson,
1977; Ravan, 1973].)



                                  Chapter III

                       ADDITIONAL TECHNICAL INFORMATION
 

                      Reliability of Reading Test Scores

    Reliability is concerned not with what a test measures but with how con-

sistently it measures whatever is measured.  Unreliability in test scores

results from variation due to chance factors such as guessing, the health of 

an examinee on a particular day, and the sampling of test items.  There are

different methods for examining reliability that take into account different

types of random error and provide different types of information about

consistency.  Most traditional estimates of reliability such as alternate

forms correlations and internal consistency coefficients are based on

correlational methods.  Thus, these estimates provide information on the

consistency with which examinees are rank-ordered and are highly dependent on

the variability of test scores.  Because discriminating among examinees is not

the major purpose of the Regents' Test and no attempt is made to maximize

variability, the internal consistency and alternate forms reliability

estimates for this test must be interpreted with caution.

    Because the Regents' Reading Test is used to determine whether students

score above a predetermined criterion rather than to compare students with

each other, the most important estimate of reliability is an estimate of the

consistency with which pass-fail decisions about students are made.  In order

to obtain such an estimate, a representative group of examinees should be

given two different forms of the Reading Test under the same conditions and

with no instruction between the two test administrations.  The similarity of

pass-fail decisions yielded by the two administrations could then be examined.

Unfortunately, such a study has not been conducted on the Regents' Reading

Test because of practical problems.  The major problem is the difficulty of

administering two forms under the same conditions.  For example, if one form

were to be used for practice and another for the official test administration,

the two administrations would differ on conditions such as student motivation

and anxiety.  An alternative is to conduct two official test administrations

and use the student's higher score as the final score.  However, implementing

this study would cause problems.  It would not be fair to use a sample of

students or schools, as this would give some students a greater opportunity to

pass the test than others.  On the other hand, the study could not be

implemented at all institutions because some would find it administratively

impossible to give the Reading Test twice to all students.

    Because it was not feasible to administer two forms of the test under the

same conditions to a representative sample, other data that had been collected

were used to estimate the results of such a study.  This analysis, as well as

alternate forms and internal consistency reliability of the Reading Test, is

discussed in the remainder of this section.



Consistency of Decisions

    The consistency of pass-fail decisions was examined through a comparison

of the results from two quarters for a sample of examinees who had initially

failed the Regents' Test.  This comparison is less than ideal for two reasons: 

1) the sample of examinees is not representative because it consists only of

students who initially failed one or both parts of the Regents' Test, and 2)

some students in the sample had remedial work between the two administrations

which, if effective, should cause inconsistency in pass-fail classifications. 

Despite these problems, the data provide useful information about the

consistency of decisions.

    The sample consisted of the 2,613 students who took Form 15 of the

Reading Test in Winter, 1979 and repeated the Regents' Test with Form 16 of

the Reading Test in Spring, 1979.  Table 18 shows the classifications of

students on the two administrations of the test.



                                   Table 18

                      Classification of Repeaters on Two
                      Administrations of the Reading Test


                                Classification
                                    Test 1

                                       
                                FAIL      PASS
                            |-------------------|
                            |         |         |
                            |    a    |    b    | 
                            |         |         | 
                      FAIL  |   230   |    99   |
                            |   8.8%  |  3.8%   |
     Classification         |         |         |
                            |---------|---------|  
        Test 2              |    c    |    d    |
                            |         |         |
                      PASS  |   360   |  1924   |
                            |  13.8%  |  73.6%  |   
                            |         |         |            
                            ---------------------


     Consistency is indicated by the proportion falling in cells a and d. 

This proportion, which is called the coefficient of agreement, is .824.  This

value is misleading because some of the inconsistency reflected in cell c is

the result of remediation rather than error.  Because some of the students who

failed took remediation before repeating the test, the lower failure rate on

the second administration is to be expected.  Therefore, only cell b provides

an unambiguous indication of inconsistency due to unreliability.  If it is

assumed that cell b is in fact the best estimate of unreliability of

classification, and if that value is used as an estimate for cell c, then

total unreliability is approximately 8%.  In other words, classification

consistency is estimated at 92%.  
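
    The arithmetic behind these two figures can be reproduced from the cell
counts in Table 18.  The short Python sketch below is offered for illustration
only; the function and variable names are not part of the original study.

```python
def classification_consistency(a, b, c, d):
    """Return (coefficient of agreement, adjusted consistency) for a 2x2 table.

    a: failed both tests          b: passed Test 1, failed Test 2
    c: failed Test 1, passed 2    d: passed both tests
    """
    n = a + b + c + d
    # Observed agreement: proportion of examinees in the two consistent cells.
    agreement = (a + d) / n
    # Adjusted estimate: treat cell b as the error rate in each direction,
    # that is, substitute the cell b count for the cell c count.
    adjusted = 1 - 2 * b / n
    return agreement, adjusted


agreement, adjusted = classification_consistency(230, 99, 360, 1924)
print(round(agreement, 3), round(adjusted, 2))  # 0.824 0.92
```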

    While the estimated consistency of classification of 92% is quite high,

this value would be higher if a representative sample of students had been

included in the study.  Because the sample consisted of repeaters only, the

mean of the sample was lower than that of the total group, and more students

had scores near the cutoff score.  Inconsistency of classification is more

likely for scores bordering on the cutoff score than for those well above the

cutoff; therefore, the inconsistency in the sample should be an overestimate

of inconsistency in the total group.



Correlation Between Alternate Forms

     The data from the study described above were also used to estimate the

alternate forms reliability coefficient for the Regents' Reading Test.  The

correlation between Form 15 scores and Form 16 scores for the sample of

repeaters was .70.  This coefficient is an underestimate of the alternate

forms reliability because the variability of the sample was smaller than the

variability usually found in the total group and because some of the

inconsistency is the result of remedial work taken by some students between

the two administrations.  Also, as noted above, this coefficient must be

interpreted in light of the purpose of the Regents' Test.  The coefficient

indicates the extent to which the examinees' relative positions were

maintained from the first test to the second test.  The Regents' Test is not

developed to maximize this type of consistency because discriminating among

students is not the major purpose of the test.


Internal Consistency

    The KR-20 reliability estimates for the two most recently used forms of

the Reading Test are presented in the following table.



                               Table 19

                           KR-20 Reliability
                   Estimates for Form 17 and Form 20
     _____________________________________________________________
      Quarter       Form       No. of Items       N       KR-20
     _____________________________________________________________

     Spring 1981    17            69            7757     .865
     Spring 1982    17            69            7169     .869
     Summer 1982    20            58            3180     .864
     _____________________________________________________________



    Although these KR-20 reliability coefficients provide information about

the reliability of discriminating among examinees and may be underestimates of

the reliability of the Regents' Reading Test, the values, which are quite

high, provide additional evidence of Reading Test reliability.
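
    For readers unfamiliar with the statistic, the standard Kuder-Richardson
formula 20 can be sketched in Python as shown below.  The sketch assumes a
complete matrix of right-wrong (1/0) item scores and uses the population
variance of total scores; some programs use the sample variance instead.  The
data are hypothetical and are not Regents' Test results.

```python
def kr20(responses):
    """KR-20 for a list of examinee rows, each a list of 0/1 item scores."""
    n = len(responses)          # number of examinees
    k = len(responses[0])       # number of items
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    # Population variance of total scores (an assumption of this sketch).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    # Sum over items of p * q, where p is the proportion answering correctly.
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in responses) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical scores for four examinees on three items.
sample = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(sample))
```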

    Additional KR-20 reliability estimates were calculated for a sample of

116 examinees taking both Form 20 of the Regents' Reading Test and Form 1A of

the STEP reading test in Summer Quarter, 1982.  For this sample, the KR-20

reliability of the Regents' Reading Test was .88, and the KR-20 reliability of

the STEP was .89.  This comparison provides further evidence that the

reliability of the Reading Test is quite high: its reliability is comparable

to that of the STEP, a test that has been designed to provide maximum

discrimination among students in the entire range of scores.



  

                        Analysis of Reading Test Items


    Item analysis results for the Spring Quarter, 1984, administration of

Form 23 of the Regents' Reading Test are provided in Table 20.  Presented for

each item is the following information: item classification, the percentage of

students choosing each of the four options, the p-value, the point-biserial

correlation, the biserial correlation, and the Rasch difficulty.



                                      TABLE 20

                        Item Analysis Data for the Spring, 1984
                               Administration of Form 23

           ________________________________________________________________

             ITEM     ITEM     % CHOOSING EACH OPTION     P-    PB    BIS  
            NUMBER    CLASS         1    2    3    4    VALUE  CORR   CORR          
           ________________________________________________________________  


              1     Literal         2   93*   3    2     93    .19    .35
              2     Literal         0    2    3   94*    94    .33    .66
              3     Inference       4    1   93*   2     93    .33    .62
              4     Analysis       88*   6    5    1     88    .16    .27
              5     Literal         2   78*   5   14     78    .42    .59
              6     Inference       2    7   87*   4     87    .35    .55
              7     Vocabulary      4   10    1   85*    85    .36    .56
              8     Vocabulary     23    7   67*   3     67    .36    .47
              9     Literal        86*   4    8    2     86    .24    .38
             10     Inference      74*  17    2    6     74    .39    .53
             11     Inference      80*   2    2   16     80    .23    .33
             12     Vocabulary      6   75*  13    6     75    .32    .44
             13     Inference       8   55*   2   35     55    .41    .52
             14     Analysis        7    6    2   85*    85    .37    .57
             15     Analysis       14    4   16   65*    65    .27    .35
             16     Vocabulary      4    8   22   66*    66    .45    .58
             17     Vocabulary     30   56*   4   10     56    .42    .53
             18     Literal        75*   5    4   17     75    .39    .54
             19     Inference      11   10   73*   5     73    .32    .44
             20     Inference       3   76*  10   11     76    .43    .58
             21     Analysis        6    5   77*  11     77    .37    .52
             22     Analysis        4    6    2   88*    88    .31    .51
             23     Inference      48*  13   30    9     48    .41    .51
             24     Inference      82*   1    1   16     82    .37    .55
             25     Literal         1    2   96*   1     96    .22    .49
             26     Inference       8    4    6   82*    82    .39    .58
             27     Analysis       79*   8    5    8     79    .41    .59
             28     Vocabulary      4   74*  11   11     74    .43    .58
             29     Inference      11    5    6   78*    78    .39    .54
             30     Inference       4    3   88*   5     88    .36    .59
             31     Inference      88*   4    4    3     88    .35    .57
             32     Literal         3    3   81*  13     81    .49    .70
             33     Analysis       30   44*  24    1     44    .42    .52
             34     Vocabulary      2   88*   6    4     88    .45    .72
             35     Literal         6   85*   3    5     85    .44    .67
             36     Inference      76*   8   13    2     76    .30    .41
             37     Vocabulary     91*   2    3    4     91    .36    .63
             38     Literal        19    6    2   72*    72    .43    .58
             39     Inference       2   91*   3    3     91    .31    .54
             40     Literal         2   72*  16    9     72    .42    .55
             41     Literal         1    1    0   98*    98    .24    .64
             42     Analysis        8    2   81*   9     81    .39    .56
             43     Inference      81*  12    5    2     81    .28    .40
             44     Inference       1    5   81*  13     81    .39    .56
             45     Literal        12   74*   1   13     74    .27    .37
             46     Vocabulary      1   96*   1    2     96    .22    .48
             47     Analysis       12   17    4   67*    67    .31    .40
             48     Analysis       50*   8   29   12     50    .38    .48
             49     Vocabulary      3    1   94*   1     94    .28    .55
             50     Inference       1    6   80*  12     80    .45    .64
             51     Vocabulary      5   86*   2    6     86    .43    .68
             52     Literal        28   61*   4    5     61    .31    .39
             53     Inference       3   10    5   80*    81    .34    .48
             54     Vocabulary     72*  20    4    3     72    .56    .74
             55     Inference       6   75*   7    9     75    .51    .70
             56     Vocabulary     82*   6    8    2     82    .45    .66
             57     Analysis       76*   8    3    9     76    .56    .77
             58     Inference       9    4   26   58*    58    .56    .71
             59     Analysis        5    4    4   84*    84    .44    .66
             60     Inference      19   26   18   32*    32    .37    .49



    The "ITEM CLASS" column indicates the skill category classification of

each item.  The four classifications included on the Regents' Reading Test are

Vocabulary (VOC), Literal Comprehension (LIT), Inferential Comprehension

(INF), and Analysis (ANA). These skills were briefly described in Table 1.

    The "% CHOOSING EACH OPTION" columns indicate the percentage of students

that chose each distractor and the percentage that chose the correct answer. 

The correct answer for each item is indicated with an asterisk (*).

    The p-value is the percentage of students getting the item correct.  It

is identical to the percentage choosing the option marked with an asterisk.

    The point-biserial correlation (PB CORR) is the Pearson product moment

correlation between item score (correct-incorrect) and total test score.  It

is an index of item discrimination and indicates the extent to which those who

did well on the total test tended to get the item right more often than those

who did less well on the test.  The maximum value of point-biserial

correlations is always less than 1.0, and the maximum value for very hard or

very easy items is less than the maximum value for items of middle difficulty.

Thus, the values of the point-biserial correlations are dependent on item

difficulty and should be interpreted in light of these difficulties.

    The biserial correlation (BIS CORR), which is always higher than the

point-biserial correlation, is an estimate of the Pearson product moment corre-

lation between the total test score and a hypothetical continuum of

performance on an item.  While this estimate is based on some assumptions that

are not always tenable, the biserial correlation is useful in that it provides

an index of discrimination that is independent of item difficulty.
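
    The two discrimination indices described above can be sketched in Python
as follows.  The point-biserial function uses the usual computing formula for
a right-wrong item, and the conversion to the biserial uses the standard
relation that assumes a normally distributed trait underlying the item.  The
function names are illustrative, and the sketch is not the program actually
used to produce Table 20.

```python
from math import sqrt
from statistics import NormalDist


def point_biserial(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and total test scores."""
    n = len(item_scores)
    p = sum(item_scores) / n                      # proportion correct
    mean_all = sum(total_scores) / n
    sd_all = sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    right = [t for s, t in zip(item_scores, total_scores) if s == 1]
    mean_right = sum(right) / len(right)          # mean total of those correct
    return (mean_right - mean_all) / sd_all * sqrt(p / (1 - p))


def biserial_from_point_biserial(r_pb, p):
    """Convert a point-biserial to a biserial, given proportion correct p."""
    z = NormalDist().inv_cdf(p)    # normal deviate cutting off proportion p
    y = NormalDist().pdf(z)        # height of the normal curve at that point
    return r_pb * sqrt(p * (1 - p)) / y


# Item 5 in Table 20: p-value 78, PB CORR .42.
bis_item5 = biserial_from_point_biserial(0.42, 0.78)
print(round(bis_item5, 2))  # 0.59, which agrees with the BIS CORR entry
```

    Because the tabled point-biserials are rounded to two places, values
converted in this way can differ from the tabled biserials by about .01.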

    The last column of the table presents the Rasch difficulty values (RASCH

DIFF) for each item.  The Rasch difficulty values are transformations of the

p-values.  These difficulties are described in more detail in the section of

this chapter concerned with equating.  Of interest here is the fact that the

difficulties can be related to the cutoff score on the test.  A scale score of

61, which is the minimum passing score, corresponds to a Rasch difficulty

value of 1.1.  More than 50% of students with scores at or above the cutoff

correctly answer items with difficulties below 1.1; less than 50% of the

students at the cutoff correctly answer items with difficulties greater than

1.1.

    The results of the item analysis are not used as the sole basis for selec-

tion of items for forms of the Regents' Reading Test, as item content is con-

sidered more important than performance data.  However, the item analysis data

are routinely used in the revision of items.  For example, a popular

distractor for an extremely hard item may be revised to make the item easier;

an item with a low discrimination index is carefully examined for any evidence

that the item is ambiguous and needs revision.

    When an item is used for the first time on a form of the Reading Test,

the results of the item analysis are used to determine whether the item should

be included in the scoring of the test.  Occasionally, the data indicate

problems that were not foreseen in the test development process.  When

problems with an item are found, the item is not included in the scoring or

equating of the test.  Two such items, items 12 and 57, were deleted from the

scoring of Form 20 of the Regents' Reading Test.  Thus, Form 20 consisted of

58 rather than 60 items that were scored and used in the equating.            




  


                        Equating of Reading Test Forms


    The score that is used to describe a student's performance on the Reading

Test is a translation of the student's total raw (number right) score to a

standard score scale that is common to all forms of the test.  Scaled scores

rather than raw scores are used so that a student's score is independent of

the particular form of the test taken.  It is not possible to develop alter-

nate forms of a test that are equivalent in difficulty; some forms are

slightly easier or more difficult than others.  Therefore, it is important

that these differences in difficulty be taken into account in the reporting of

scores.  The scaled scores take these differences into account: a scaled score

of 61 indicates the same level of skill regardless of the relative difficulty

of the particular form taken.  For one form of the test, a scaled score of 61

may represent 70% of the items answered correctly; for a slightly more

difficult form, a scaled score of 61 may represent 68% of the items answered

correctly.  Because of the use of scaled scores, a student is not penalized

for taking a more difficult form or rewarded for taking an easier form. 

Furthermore, the use of the scaled scores allows performance comparisons to be

made from one quarter or year to another; although different forms are used,

equal scaled scores indicate the same level of performance from one

administration to another. 

    Before students' reading scores are reported, each form of the Reading

Test must be equated to the common score scale.  This equating is accomplished

through the use of a bank of items whose difficulties have been calibrated

with Rasch procedures.  The Rasch model is a latent trait model that expresses

the probability of an examinee's correctly answering an item as a function of

two parameters: the item difficulty and the examinee's ability (or

achievement, in the case of the Regents' Test).  Item difficulty and examinee

ability are calibrated on the same logistic scale.  In fact, item difficulty

is defined as the point on the ability scale at which an examinee has a .50

probability of getting the item correct.
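
    In its standard form, the Rasch model gives the probability of a correct
answer as exp(ability - difficulty) / (1 + exp(ability - difficulty)), with
both quantities on the logistic scale.  The Python sketch below illustrates
this relation; the function name is illustrative only.

```python
from math import exp


def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the one-parameter Rasch model."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))


# An examinee whose ability equals the item difficulty has a .50 chance.
print(rasch_probability(1.1, 1.1))
# The same examinee is more likely than not to answer an easier item.
print(rasch_probability(1.1, 0.0) > 0.5)
```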

    An advantage of using the Rasch model for equating tests is that, if the

data fit the model, the estimates of item difficulty are independent of the

ability of the particular group of examinees, and the estimates of examinee

ability are independent of the particular set of items on the test.  When

items drawn from a Rasch-calibrated item bank are used to construct a test,

the test should automatically be equated to other tests constructed from the

item bank (Wright, 1977).  When difficulties are calibrated, the ability of

the sample of examinees is taken into account so that, unlike the traditional

p-value, the difficulty estimate should be the same regardless of whether the

particular sample is of relatively high or low ability. 

    Whenever an item is used on a form of the Regents' Reading Test, a Rasch

difficulty for the item is estimated.  If the item appears to be functioning

appropriately, it is included in an item bank.  When a new form of the test is

developed, items from the item bank, as well as new and revised items, are

included on the form.  The items from the bank, which are to be used in the

equating, are chosen on the basis of fit to the Rasch model as well as on the

basis of content.  After the new form is administered, Rasch difficulties are

calculated for all items.  Fit to the model is re-examined for those items

with bank difficulties, and stability of these difficulties is checked.  An

item is not used in the equating if a problem is found.  Through a comparison

of the bank difficulties and the difficulties calculated from the

administration of the new form, an equating constant is derived.  This

constant is used in conjunction with further Rasch analyses to convert

number-right scores to scores on the common scale.  
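
    The report does not give the formula used to derive the equating
constant.  One common approach in Rasch common-item equating, shown here only
as an assumption, is to take the mean difference between the bank
difficulties and the newly estimated difficulties of the items retained for
equating.  The item labels and values below are hypothetical.

```python
def equating_constant(bank_difficulties, new_difficulties):
    """Mean shift that places new-form difficulties on the bank scale.

    Both arguments map item labels to Rasch difficulties; only items present
    in both mappings (the common items) enter the calculation.
    """
    common = sorted(set(bank_difficulties) & set(new_difficulties))
    diffs = [bank_difficulties[i] - new_difficulties[i] for i in common]
    return sum(diffs) / len(diffs)


# Hypothetical difficulties: items A, B, and C come from the bank; D is new.
bank = {"A": -0.50, "B": 0.20, "C": 1.10}
new_form = {"A": -0.80, "B": -0.10, "C": 0.80, "D": 0.40}

constant = equating_constant(bank, new_form)
# Adding the constant expresses every new-form difficulty on the bank scale.
on_bank_scale = {item: b + constant for item, b in new_form.items()}
print(round(constant, 2))
```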

    The tentative conversion table derived from the process described above 

is then verified through other procedures.  Because accuracy of equating at

the cutoff score is crucial, the raw score to scale score conversion of scores

near the cutoff in the tentative conversion table is examined with equipercen-

tile and conditional p-value procedures.

    Items that are common to the new form and one of the existing forms serve

as a basis for equipercentile equating.  The new form and the existing form

are equated to scores on the common items; scores on the two forms correspon-

ding to the same scores on these items are presumed to be equivalent (Angoff,

1971).  While the number of common items is usually not sufficient for equi-

percentile equating of scores throughout the score scale, the results are

useful as a means of verifying the Rasch equating.  The scores examined are

those in a narrow range around the cutoff, and these scores have sufficiently 

high frequencies to allow interpretation of the results of equipercentile

equating.

    An additional procedure used to verify the equating is an examination of

conditional p-values for items common to the new form and an existing form

(Burk, 1980).  This procedure is based on the assumption that examinees with

the same scaled score should have the same level of achievement on common

items regardless of the test form used to obtain the scores.  The adequacy of

equating is verified through a comparison of the proportion of examinees with

equivalent scores from the two forms who correctly answer items common to both

forms.  The equating is considered accurate if the two forms yield similar

p-values for examinees with equivalent scores near the cutoff.
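
    The comparison can be sketched in Python as follows.  The sketch assumes
that, for one common item, each record holds an examinee's scaled score and a
right-wrong (1/0) item score; the data and the score range around the cutoff
are hypothetical.

```python
def conditional_p_value(records, low, high):
    """Proportion answering a common item correctly among examinees whose
    scaled scores fall between low and high, inclusive.

    records: (scaled_score, item_correct) pairs, with item_correct 0 or 1.
    """
    selected = [correct for score, correct in records if low <= score <= high]
    return sum(selected) / len(selected)


# Hypothetical records for one common item on an existing and a new form.
existing_form = [(60, 1), (61, 1), (61, 0), (62, 1), (75, 1), (40, 0)]
new_form = [(60, 0), (61, 1), (62, 1), (62, 1), (80, 1), (35, 0)]

p_existing = conditional_p_value(existing_form, 60, 62)
p_new = conditional_p_value(new_form, 60, 62)
# Similar values near the cutoff support the accuracy of the equating.
print(p_existing, p_new)
```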

    For Form 20, the equipercentile and conditional p-value procedures veri-

fied the equating near the cutoff produced by the Rasch model.  When Form 16

was equated, the verification procedures and other analyses indicated that a

small adjustment was needed.  Further examination of the Rasch equating

indicated some instability in the difficulties of a few items.  When this

instability was taken into account, the raw score corresponding to the cutoff was

lowered by one point.  The equating was then consistent with the results of

the equipercentile and conditional p-value methods.

    Before scores are reported, the logistic scale scores obtained from the

Rasch model and verified through the equipercentile and conditional p-value

procedures are linearly transformed to a scale that ranges from 0 to 99. 

Specifically, each logistic score is multiplied by 10 and increased by 50

points.  This transformation has no effect on the equating; its only purpose

is to place the scores on a more convenient scale. 
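
    The transformation can be written as scaled score = 10 x (logistic
score) + 50.  In the Python sketch below, rounding to a whole number and
limiting the result to the 0 to 99 range are assumptions; the report states
only the multiplier, the additive constant, and the range of the scale.

```python
def to_scaled_score(logistic_score):
    """Convert a Rasch logistic score to the reporting scale."""
    scaled = round(10 * logistic_score + 50)
    # Keep the result within the reported 0 to 99 range (an assumption).
    return max(0, min(99, scaled))


# A logistic score of 1.1 corresponds to the minimum passing score of 61.
print(to_scaled_score(1.1))
```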

    The final conversion table for Form 23 is presented in Table 21.  The

cutoff score, a scaled score of 61, is equivalent to a raw score of 43 on this

form.


                                   Table 21

                  Raw Score* to Scaled Score Conversion Table
                   for Form 23 of the Regents' Reading Test

______________________________________________________________________________
      Raw Score    Scaled Score                 Raw Score    Scaled Score
______________________________________________________________________________

          1              5                         31             51     
          2             12                         32             52
          3             17                         33             53
          4             20                         34             53
          5             23                         35             54
          6             25                         36             55
          7             27                         37             56
          8             28                         38             57
          9             30                         39             58
         10             31                         40             58
         11             33                         41             59
         12             34                         42             60
         13             35                         43             61**
         14             36                         44             62
         15             37                         45             63
         16             38                         46             64
         17             39                         47             65
         18             40                         48             67
         19             41                         49             68
         20             42                         50             69
         21             43                         51             71
         22             44                         52             72
         23             45                         53             74
         24             45                         54             76
         25             46                         55             78
         26             47                         56             81
         27             48                         57             84
         28             49                         58             88
         29             49                         59             96
         30             50                         60             99
 _________________________________________________________________ 
 *The raw score is the number of items answered correctly.                                   
 **minimum passing score                                                          



 


                       Reliability of Essay Test Scores

                                    
    The two major sources of inconsistency on the Essay Test arise from the

sampling of writing and the rating of essays.  Students write one essay at

each test administration, and three graders rate each essay.  The reliability

of the Essay Test could be increased by having students write more than one

essay on more than one day with more than three raters grading each essay. 

However, practical considerations limit the extent to which the reliability

can be increased.

    In any consideration of reliability, the importance of the decision to be

made on the basis of a test score must be taken into account.  It would not be

reasonable to base a major decision such as eligibility for graduation on only

one writing sample.  However, under Regents' policy, a student has many

opportunities to take the test before graduation.  Thus, if a student who

fails the test complies with policy, he or she would have written on more than

one topic on more than one day and received grades from numerous raters before

fulfilling other requirements for graduation.  Furthermore, a student may

continue to take the test as many times as necessary after all other

requirements for graduation have been completed.  Thus, no irreversible

decisions about college graduation are made on the basis of Regents' Test

results.

    As discussed in the section on Reading Test reliability, information on

reliability should be obtained by administering different forms of the test

under the same conditions to a representative group of examinees.  Consistency

of classifications made on the basis of the different administrations could

then be examined.  Such a study has not been conducted for the Essay Test

because of the practical problems discussed in the section on Reading Test

reliability.  The reliability of raters, however, is reported each year, and

additional studies have been conducted to examine this reliability.  This

information on rater reliability is described in the remainder of this section.



Rater Reliability Reports

    Every Fall Quarter, the performance of each rater who graded essays in

any of the previous four quarters is examined.  Reported for each rater are

the number of days rated each quarter; the total number of days rated for four

quarters; the total number of essays rated; the number of essays rated per

day; the agreement percentage, which is the percentage of essays for which the

rating was identical with the rating of at least one of the other two raters

scoring the essay; and the percentage of 1's, 2's, 3's, and 4's given over the

four-quarter period. 

    Copies of this report are provided to members of the Academic Committee

on English and to Scoring Coordinators.  Committee members receive the report

on raters from their institutions, and Scoring Coordinators receive reports on

all raters attending their scoring centers.  Raters are instructed to examine

their performance statistics in relation to the average performance statistics

for the system.  Committee members are instructed to use the statistics to

identify discrepant raters at their institutions and, when raters with

discrepant performance are identified, to make sure that they are not sent to

another scoring session until they have reviewed the Essay Scoring Manual and

the source of the discrepancies has been identified.

    The rater performance summary statistics for Fall, 1980, through Summer,

1981, are presented in Table 22.  These data are provided with the rater

reports to assist in the interpretation of data for individual raters.  An

indication of rater reliability is the "Percentage Agreement" statistic of

80.22.  This value is the mean percentage agreement defined as the percentage

of essays for which an individual rater's rating is identical to the rating of

at least one of the other two raters scoring the essay.  Also provided is a

different type of percentage agreement statistic, the percentage of times at

least two of the three raters agreed on the essay rating.  This agreement for

1980-1981 was 97.2%.
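
    As an illustration only, the two agreement statistics defined above can be
computed as in the following Python sketch.  The rater identifiers, the sample
ratings, and the function names are hypothetical and were chosen for this
example; they are not Regents' Test data or program software.

```python
from collections import defaultdict


def agreement_by_rater(essays):
    """Per-rater percentage of essays on which the rater's rating was
    identical to at least one of the other two ratings of that essay."""
    rated = defaultdict(int)
    agreed = defaultdict(int)
    for ratings in essays:  # ratings: {rater id: rating on the 1-4 scale}
        for rater, score in ratings.items():
            others = [s for r, s in ratings.items() if r != rater]
            rated[rater] += 1
            if score in others:
                agreed[rater] += 1
    return {rater: 100.0 * agreed[rater] / rated[rater] for rater in rated}


def two_of_three_agreement(essays):
    """Percentage of essays on which at least two of the three ratings agree."""
    hits = sum(1 for ratings in essays if len(set(ratings.values())) < 3)
    return 100.0 * hits / len(essays)


# Hypothetical ratings: four essays, each scored by three of raters A-D.
sample_essays = [
    {"A": 1, "B": 1, "C": 2},
    {"A": 2, "B": 2, "D": 2},
    {"A": 1, "C": 2, "D": 3},
    {"B": 2, "C": 3, "D": 2},
]

for rater, pct in sorted(agreement_by_rater(sample_essays).items()):
    print(rater, round(pct, 2))
print("At least two of three agree:", two_of_three_agreement(sample_essays))
```

    The first statistic is computed per rater and is the one to compare with
individual rater reports; the second is computed per essay, which is why it is
always at least as high.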



                                      Table 22 
 
                      Rater Performance Summary Statistics
                      for Fall, 1980 through Summer, 1981




                                                           System Average
Number of Essays Rated Per Day                                 112.94
(based on average of 4.15 hrs. per day)

Percentage Agreement                                            80.22


Distribution of Ratings:

                      Rating                    Percentage
      
                        1                          38.69
                        2                          55.13
                        3                           5.76
                        4                            .41

__________________________________________________________________________

Distribution of Essay Scores (This is the distribution of student scores
reported to institutions; use the above distribution for comparisons with
rater reports.)

                      Score                     Percentage
      
                        1                          36.58
                        2                          60.12
                        3                           3.21
                        4                            .09

Percentage of times at least two of the three raters agreed: 97.2
(Use "Percentage Agreement" above for comparisons with rater reports.)



Results of Review Process

    The results of the review process for essay scores provide information

about the frequency with which mistakes are made in the rating of essays. 

Since January, 1980, students who failed the Essay Test with failing scores

from two raters and a passing score from one rater have had the opportunity to

request a re-evaluation of the essay by a systemwide review panel.  This

review panel consists of members of the Testing Subcommittee and the Scoring

Coordinators.  The specific procedures for review are described in the Essay

Scoring Manual.

    The review process was implemented so that mistakes in the rating of

essays could be corrected.  The results of the review process indicate that

very few mistakes are made in the rating of failing essays.  From Fall, 1980,

through Summer, 1981, 11,331 essays received failing scores.  Of these, 6,042

were given a passing grade by one rater and were thus eligible for review.  Of

these essays, 201 were submitted for re-scoring on the recommendation of the

on-campus review panels.  In the final review, 56 essays were given passing

scores by the systemwide review panel.  Thus, failing grades were reversed on

review for fewer than 1% of the essays that were eligible by virtue of

receiving one passing rating.  Furthermore, the members of the systemwide

review panel have noted that very few serious mistakes in essay ratings are

found; most of the essays that pass on review are on the borderline between

passing and failing, and, even when the systemwide reviewers assign passing

grades to these essays, they can also identify reasons why they were failed on

the initial scoring.
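
    The proportions implied by the counts in the preceding paragraph can be
checked with a short calculation.  The sketch below uses only the four counts
reported above; the variable names are assumptions made for this example.

```python
# Counts reported for Fall, 1980, through Summer, 1981.
failing_essays = 11331       # essays that received failing scores
eligible_for_review = 6042   # failing essays with one passing rating
sent_to_system_panel = 201   # recommended for re-scoring by campus panels
passed_on_review = 56        # given passing scores by the systemwide panel


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


print("Eligible, of all failing essays:", round(percent(eligible_for_review, failing_essays), 2))
print("Re-scored, of eligible essays:  ", round(percent(sent_to_system_panel, eligible_for_review), 2))
print("Reversed, of eligible essays:   ", round(percent(passed_on_review, eligible_for_review), 2))
print("Reversed, of re-scored essays:  ", round(percent(passed_on_review, sent_to_system_panel), 2))
```

    The third figure, about 0.93 percent, is the basis for the statement that
fewer than 1% of eligible failing grades were reversed.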



Additional Analyses

    Singleton examined the reliability of Regents' Test essay ratings through

four different procedures.  Her summary of the types of estimates used and of

the results of her analyses is provided in Table 23 (1976, p. 106). 



                                        Table 23

      Estimates of Rating Reliability for the Essay Portion of the Language Skills
                                       Examination
                                 [from Singleton, 1976]


______________________________________________________________________________________________
Statistic       Percentage of      Product-Moment     Reliability of       Coefficient of
                Rater Agreement    Correlation        Average Ratings      Concordance
______________________________________________________________________________________________

Sample          N = 92,469         N = 162            N = 17,095           N = 43,508

Value of
Statistic       92.97              .624               .7248                .8208

Interpretation  At least 2 of      Correlation        If one averaged      An estimate of
                3 raters agree     between student    three ratings for    the degree of
                on an essay        scores and scores  each ratee and       concordance for
                rating 92.97%      assigned by a      could correlate      three independent
                of the time        panel of experts   the set of averages  judgments, given
                                                      from comparable      that all scoring
                                                      raters, the results  patterns with a
                                                      would be about       score difference
                                                      .7248                of one point are
                                                                           adjusted to
                                                                           concordant patterns

______________________________________________________________________________________________
                                 



    The "Percentage of Rater Agreement" of 92.97% reported by Singleton was 

based on the ratings of 92,469 essays graded over seventeen quarters.  This 

value, which was 97.2% for 1980-1981, is the percentage of times at least two 

of the three raters agreed on the essay rating.  (The increased percentage 

agreement for 1980-1981 does not necessarily reflect higher reliability; the 

increase is due partly to the fact that fewer ratings of 3 and 4 have been 

given in recent years.)

    The second statistic reported, the "Product-Moment Correlation," is the 

correlation between ratings assigned to 162 essays by members of the Testing 

Subcommittee and the ratings assigned during the regular scoring session.  The 

Testing Subcommittee did not use the usual procedure for rating essays: in 

addition to the four-point scale, they also used borderline scores (e.g., 1.5 

to indicate a score between 1 and 2).  The final score for an essay was the 

mean of the ratings rather than the middle score usually used as the final 

score.  The correlation between these ratings and the ratings assigned during 

the regular scoring session was .624.  As discussed in the section on reading 

reliability, correlational estimates of reliability for the Regents' Test must 

be interpreted with caution because they indicate only the extent to which 

consistency in the rank-ordering of examinees is maintained and do not provide 

information on consistency in relation to the cutoff score.

    The third estimate of rater reliability reported by Singleton is an

intraclass correlation calculated according to the procedure described by Ebel

(1967).  Ratings from 17,095 essays written during a two-year period were 

analyzed with this procedure, which is based on the analysis of variance.  The 

estimate of reliability for single ratings was .47.  The reliability of the 

average of three ratings was estimated as .72.  This can be interpreted as an 

estimate of the correlation between two sets of scores when each score is 

based on the average of three ratings.
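
    A minimal sketch of one common form of the analysis-of-variance estimate
is given below.  It assumes a one-way layout in which every essay has the same
number of ratings and raters are not separately identified; whether this
matches the exact variant of Ebel's procedure that Singleton applied is an
assumption, and the sample ratings are hypothetical.

```python
def anova_rating_reliability(ratings):
    """Intraclass reliability of single and of averaged ratings.

    ratings: one row per essay, each row holding the same number of ratings.
    Returns (reliability of a single rating, reliability of the row average).
    """
    n = len(ratings)       # number of essays
    k = len(ratings[0])    # ratings per essay
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-essay and within-essay sums of squares.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2 for row, m in zip(ratings, row_means) for x in row)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average


# Hypothetical ratings of five essays by three raters each.
sample_ratings = [(1, 1, 2), (2, 2, 2), (2, 3, 2), (1, 2, 1), (3, 3, 4)]
single_r, average_r = anova_rating_reliability(sample_ratings)
print(round(single_r, 3), round(average_r, 3))
```

    In this form the two values are linked by the Spearman-Brown relation, so
the reliability of the average is always at least that of a single rating when
the single-rating estimate is positive.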

    The final analysis of reliability reported by Singleton was an intraclass 

correlation based on the degree of concordance among essay ratings.  For this 

analysis, all score patterns with a difference of only one point by one rater 

(e.g., ratings of 1,1,2 or 2,2,3) were adjusted to concordant patterns.  This 

adjustment seems reasonable because the one deviant rating does not affect the 

final score of an essay. The average correlation between pairs of ratings, 

which is an estimate of the reliability of single ratings, was .60.  The 

Spearman-Brown formula was used to estimate the reliability of the average of 

three ratings.  This value, which may be interpreted as the correlation 

between the averages of two sets of three ratings when patterns with a 

one-point discrepancy are adjusted for concordance, was .82.
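
    The adjustment and the Spearman-Brown step described above can be sketched
as follows.  The function names are assumptions made for this example, and
replacing the one deviant rating with the value shared by the other two raters
is an assumed reading of "adjusted to concordant patterns."

```python
def adjust_to_concordant(ratings):
    """If two of three ratings agree and the third differs by exactly one
    point, return the fully concordant pattern; otherwise return the
    ratings unchanged."""
    low, mid, high = sorted(ratings)
    if low == mid and high - mid == 1:
        return (low, low, low)
    if mid == high and mid - low == 1:
        return (high, high, high)
    return tuple(ratings)


def spearman_brown(single_reliability, k):
    """Estimated reliability of the average of k ratings, given the
    reliability of a single rating."""
    return k * single_reliability / (1 + (k - 1) * single_reliability)


print(adjust_to_concordant((1, 1, 2)))    # one-point deviation, adjusted
print(adjust_to_concordant((1, 1, 3)))    # two-point deviation, unchanged
print(round(spearman_brown(0.60, 3), 2))  # single-rating value reported above
```

    With the reported single-rating value of .60 and three ratings, the
formula gives 1.80/2.20, or about .82, which agrees with the value reported.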

    Singleton found that the results of her analyses were comparable with the 

results found in other situations in which essay tests are rated. On the 

basis of her analyses, she concluded that the Essay Test was scored

consistently and that the reliability of the ratings was sufficient for the

intended use of the Regents' Test.




                                  Chapter IV

  



                    REGENTS' TEST RESULTS FROM 1972 TO 1982


    Presented in Table 24 are the percentages of examinees passing the 

Reading, Essay, and total Test each year (Fall Quarter through Summer Quarter) 

from Winter Quarter, 1972, through Summer Quarter, 1982.  Results are reported 

separately for repeaters and first-time examinees.  (No data are reported for 

repeaters for the first two years because of the small number of examinees in 

that category during these periods.)



                                          Table 24
                                                  
                           Regents' Test Results from 1972 to 1982
                                                  
 
______________________________________________________________________________________________
             Percent Passing Under Reading Cutoff               Percent Passing
             Score in Effect at the Time of Test        Under Reading Cutoff Score of 61
                       Administration                        (Effective Fall, 1980)
            ________________________________________    __________________________________
               Reading         Essay         Total             Reading             Total
            ___________    ___________   ___________     __________________  _____________
    Year    Rep.  1st      Rep.  1st     Rep.  1st          Rep.      1st      Rep.    1st
   ______  ______ _____    _____ _____   _____ _____     _________ ________   ______ _____
    1972          90.4           67.4          65.0                  66.2              51.7
   72-73          90.2           74.1          70.8                  67.0              56.2
   73-74   82.5   91.0    56.4   75.1   51.0   71.7        41.2      63.6     29.2     53.9
   74-75   89.8   97.6    46.2   80.2   54.6   75.3        54.5      80.0     38.0     64.9
   75-76   98.6   99.4    49.4   69.8   49.2   69.7        76.7      88.5     43.3     66.0
   76-77   98.5   99.4    50.5   69.3   50.3   69.2        78.8      88.9     44.4     65.2
   77-78   96.8   98.7    51.3   68.5   50.8   68.3        73.6      86.1     43.2     63.2
   78-79   82.9   89.9*   52.5   68.4   48.2   65.2*       76.1      85.8     45.6     63.2
   79-80   66.3   87.3**  45.7   71.4   48.3   65.9**      44.9      85.1     44.2     65.3
   80-81   47.0   83.7*** 49.5   69.7   46.8   63.4***     47.0      83.7     46.8     63.4
   81-82   45.7   82.9    52.1   69.3   47.9   62.6        45.7      82.9     47.9     62.6

______________________________________________________________________________________________

             *Reading cutoff score raised from 51 to 59
            **Reading cutoff score raised another point to 60
           ***Reading cutoff score raised another point to 61



    Results for the Reading and total Test are presented in two ways:  the 

first columns show the percentages of students who actually passed the test 

under the cutoff scores in effect at the time the test was taken; the last 

four columns show the percentages of students who would have passed the test 

under the current cutoff score of 61 (effective Fall Quarter, 1980).  The per-

centage passing under the new cutoff score should be used for any year-to-year 

comparisons.
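
    The idea of restating earlier results under a single cutoff can be
illustrated with a small sketch.  The scores below are hypothetical, not
Regents' Test data, and the sketch assumes that an examinee passes when the
score is at or above the cutoff.

```python
def percent_passing(scores, cutoff):
    """Percentage of scores at or above the cutoff (assumed passing rule)."""
    passed = sum(1 for score in scores if score >= cutoff)
    return 100.0 * passed / len(scores)


# Hypothetical Reading Test scores for one group of examinees.
hypothetical_scores = [48, 52, 59, 60, 61, 63, 70, 75]

print(percent_passing(hypothetical_scores, 51))  # cutoff in effect at the time
print(percent_passing(hypothetical_scores, 61))  # cutoff effective Fall, 1980
```

    Applying the same cutoff to every year's scores removes the effect of the
cutoff changes themselves, so that any remaining differences reflect the
examinees and the policies under which they were tested.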

    The total test results for repeaters beginning with the Winter, 1980 ad-

ministration were computed differently from the repeaters' results for 

previous quarters.  Before Winter, 1980, all repeaters had to retake both the 

Reading Test and the Essay Test.  The policy was changed as of Winter, 1980, 

to allow students who passed one part and failed the other to retake only the 

part that was failed.  As of Winter, 1980, the total percentage passing 

statistics for repeaters include students who passed both parts of the test 

during that year and students who passed one part that year and the other part 

in a previous year.  Thus, because of the changes in the administration of the

test and in the computation of the statistics, results for repeaters for ad-

ministrations after Fall, 1979, are not directly comparable with results from 

previous administrations.  Results for first-time examinees are not affected by 

this change.

    Another problem with the comparability of results across years is that 

changes in the policy resulted in changes in the population of students taking 

the test.  For example, before the most recent policy went into effect in 

Winter, 1980, there was less pressure on students who failed the test to repeat 

the test each quarter.  The new policy requires students who fail to take re-

mediation and to retake the test.  There is no way to predict the effect of the 

new policy on the percentage of students passing the test:  a decreased

percentage could be predicted because the poorer students are required to take the 

test more often, or an increased percentage could be predicted because of the 

remediation requirement.  In either case, the comparability of results is 

affected.  While trends in performance from year to year can be examined, it is 

not possible to determine the causes of any changes observed in passing rates.

    Some trends are evident in the year-to-year comparison of the percentages 

of students passing the test.  For first-time examinees, Essay Test performance 

increased from 1972 to 1974-1975, decreased in 1975-1976, and has fluctuated 

only slightly since 1975-1976.  The percentages of repeaters passing the Essay 

Test show a slight increase each year from 1974-1975 to 1978-1979, a decrease 

in 1979-1980, and an increase since 1979-1980.  These fluctuations may be a 

result of changes in the policy.

    While the percentages of students passing the Essay Test do not indicate 

substantial improvement over time, some of the essay raters have indicated im-

provement that would not be evident in the statistics.  Raters have reported 

that the failing essays, in general, are not as poorly written as they were in 

the beginning of the testing program; a large decrease in the number of 

egregious essays has been noted.

    Performance on the Reading Test improved from 1973-1974 to 1976-1977.  The 

percentage of first-time examinees passing the Reading Test decreased slightly 

each year after 1976-1977.  The lower performance of repeaters on the Reading 

Test since 1979-1980 is probably caused by the change in policy.  Under the old 

policy, students who failed the Essay Test and passed the Reading Test had to 

retake the Reading Test when they retook the Essay Test.  Because these re-

peaters usually passed the Reading Test each time they took it, the percentage 

of repeaters passing the Reading Test was rather high.  Under the new policy, 

only the students who fail the Reading Test have to retake it; thus, the 

population of repeaters is different from what it was before the new policy 

went into effect.  The lower passing rates shown in the years 1980 to 1982 are 

more likely a product of the change in the population of students required to 

repeat the test than a product of an actual decrease in these repeaters' level 

of performance on the test.  Had the performance of those repeaters who had 

initially failed the Reading Test decreased, this decrease would have effected 

a decrease in the percentage of repeaters passing the total test.  As is 

evident in Table 24, the percentage of students passing the total test showed 

little decrease in 1979-1980; so it seems likely that the apparent decline in 

the Reading Test performance of repeaters is caused by the change in the policy 

rather than by change in student performance.             



        
  
            

                                 References


Anastasi, A.  Psychological testing.  New York: Macmillan, 1976.

American Psychological Association, American Educational Research Association,
 
    National Council on Measurement in Education.  Standards for educational 

    and psychological tests.  Washington, D.C.: American Psychological 

    Association, 1974.

Angoff, W. H.  Scales, scores, and norms.  In R. L. Thorndike (Ed.) Educational

    measurement. Washington, D.C.: American Council on Education, 1971.

Angoff, W. H. & Ford, S. F. Item-race interaction on a test of scholastic 

    aptitude.  Journal of Educational Measurement, 1973, 10, 95-106. 

Barrett, T. Taxonomy of reading comprehension.  In R. Smith & T.C. Barrett 

    (Eds.) Teaching reading in the middle grades.  Reading, MA: Addison-

    Wesley, 1976.

Bloom, B.S., Madaus, G.F., & Hastings, J.T.  Evaluation to improve learning.

    McGraw-Hill, 1981.

Bormuth, J.  On the theory of achievement test items.  Chicago: The University

    of Chicago Press, 1970.

Burk, K.  Verifying the results of equating for minimum competency tests.  

    Paper presented at the annual meeting of the American Educational 

    Research Association, Boston, 1980.

Buros, O. K. (Ed.)  The seventh mental measurements yearbook (Vol. I).

    Highland Park, NJ: Gryphon Press, 1972.

Campbell, D. T.  Recommendations for APA test standards regarding construct, 

    trait or discriminant validity.  American Psychologist, 1960, 15, 546-553.

Campbell, D. T. & Fiske, D. W.  Convergent and discriminant validation by the 

    multitrait-multimethod matrix.  Psychological Bulletin, 1959, 56, 81-105.

Citron, H. R.  Analysis of predictive variables for the essay scores on the 

    Regents' Test in one Georgia institution.  Unpublished doctoral 

    dissertation, Georgia State University, 1980.

Coffman, W. E. & Kurfman, D.  A comparison of two methods of reading essay 

    examinations.  American Educational Research Journal, 1968, 5, 99-107.

Cole, N. S. & Nitko, A. J.  Measuring program effects.  In R. A. Berk (Ed.) 

    Educational evaluation methodology: State of the Art.  Baltimore, MD: 

    Johns Hopkins University Press, 1981.

Cronbach, L. J.  Test validation.  In R. L. Thorndike (Ed.) Educational 

    Measurement.  Washington, D.C.: American Council on Education, 1971.

Cronbach, L. J. & Warrington, W. G.  Time limit tests: Estimating their 

    reliability and degree of speeding.  Psychometrika, 1951, 16, 167-188.

Dale, E., O'Rourke, J., & Bamman, H.  Techniques of teaching vocabulary.

    Palo Alto, CA: Field Educational Publications, 1971.

Donlon, T. F.  An exploratory study of the implications of test speededness.  

    Graduate Record Examinations Report 546-27.  Princeton, NJ:  Educational 

    Testing Service, 1978.

Ebel, R. L.  Estimation of the reliability of ratings.  In W. A. Mehrens & 

    R. L. Ebel (Eds.), Principles of Educational and Psychological           

    Measurement.  Chicago: Rand McNally, 1967.

Fort Valley College.  An examination of the effects of time for test 

    administration on students' performance on the Language Skills 

    Examination of the Regents' Testing Program.  Report from Fort Valley 

    College (Abstract only), 1974.

Flaugher, R. L.  The many definitions of test bias.  American Psychologist,

    1978, 33, 671-679.

Godshalk, F. I., Swineford, F., & Coffman, W. E.  The measurement of writing 

    ability.  New York: College Entrance Examination Board, 1966.

Guion, R. M.  Scoring content samples:  The problem of fairness.  Journal

    of Applied Psychology, 1978, 63, 499-506.

Hambleton, R.K.  Test score validity and standard-setting methods.  In R.A.

    Berk (Ed.) Criterion-referenced measurement: State of the art.

    Baltimore, MD: Johns Hopkins University Press, 1980.  (a)

Hambleton, R.K.  Review methods for criterion-referenced test items.  Paper

    presented at the annual meeting of the American Educational Research

    Association, Boston, April, 1980.  (b)

Henderson, F. N.  An analysis and comparison of essay evaluations among raters 

    from four institutions in the University System of Georgia.  Unpublished 

    major applied research project, Nova University, 1977.

Hickman, M. A.  Study of the relationships between selected antecedent 

    variables and the Language Skills Examination of the University System 

    of Georgia, 1972.  Dissertation Abstracts International, 1973, 33, 

    4877-4878A.  (University Microfilm No. 73-5710, 120).

Himmelweit, H. T.  Speed and accuracy of work as related to temperament.  

    British Journal of Psychology, 1946, 36, 132-144.

House, E. B.  Testing and teaching: A critique of the Georgia Regents' Test.

    Paper presented at the annual meeting of the Conference on College 

    Composition and Communication, Washington, D.C., March, 1980.

Hunter, J. E. & Schmidt, F. L.  A critical analysis of the statistical and 

    ethical implications of various definitions of "test bias."  Psychological

    Bulletin, 1976, 83, 1053-1071.

Jensen, A. R.  An examination of cultural bias in the Wonderlic Personnel 

    Test.  Intelligence, 1977, 1, 51-64.

Jensen, A. R.  Bias in mental testing.  New York: Free Press, 1980.

Johnson, W. J.  The origin and development of the University System of 

    Georgia's Regents' Testing Program.  Paper presented at the annual 

    meeting of the Mid-South Educational Research Association, November, 1980.

Kendall, L. M.  The effects of varying time limits on test validity.  

    Educational and Psychological Measurement, 1964, 24, 789-800.

Linn, R. L.  Fair test use in selection.  Review of Educational Research, 1973,
    43, 139-161.

Litaker, R. G.  An investigation of item bias in the Language Skills 

    Examination.  Unpublished doctoral dissertation, University of Georgia, 

    1974.  Dissertation Abstracts International, 1974, 35, 6366A.  (University

    Microfilm No. 75-8175, 99).

Marahnich, N.  An empirical comparison of four indicators of test speededness. 

    Paper presented at the annual meeting of the American Educational 

    Research Association, Boston, 1980.

Pearson, P.D., & Johnson, D.D.  Teaching reading comprehension.  New York: 

    Holt, Rinehart and Winston, 1978.

Pendexter III, H.  Personal communication. Undated.

Plake, B. S.  A comparison of statistical and subjective procedures to 

    ascertain item validity: One step in the test validation process.  

    Educational and Psychological Measurement, 1980, 40, 397-404.

Popham, W.J.  As always, provocative.  Journal of Educational Measurement,

    1978, 15, 297-300.

Prather, J. E. & Smith, G.  Factors influencing student performance on a 

    language skills examination: The Regents' Test.  Office of Institutional 

    Planning Report No. 76-1.  Atlanta, GA: Georgia State University, 1975.

Ravan, F. O., Veal, L. R., & Rentz, R. R.  A validity study of the essay test 

    of the Georgia Language Skills Examination.  Paper presented at the 

    Annual Meeting of the National Council on Measurement in Education, New 

    Orleans, 1974.

Regents' Testing Program.  An examination of student performance on the essay 

    test of the Language Skills Examination under different conditions of 

    time and choice of topic. (Abstract) 1974.

Rindler, S. E.  Pitfalls in assessing test speededness.  Journal of 

    Educational Measurement, 1979, 16, 261-270.

Scheuneman, J. D.  A new look at bias in aptitude tests.  In P. Merrifield 

    (Ed.) New directions for testing and measurement - Measuring human       

    abilities (No. 12).  San Francisco, CA: Jossey-Bass, 1981.

Sandoval, J. & Miille, M. P. W.  Accuracy judgments of WISC-R item difficulty 

    for minority groups.  Journal of Consulting and Clinical Psychology, 

    1980, 48, 249-253.

Shepard, L. Standard setting issues and methods.  Applied Psychological 

    Measurement, 1980, 4, 447-467.  

Shepard, L. A.  Bias in test items.  In B. F. Green (Ed.) New directions for 

    testing and measurement - Issues in testing: Coaching, disclosure, and 

    ethnic bias (No. 11).  San Francisco, CA: Jossey-Bass, 1981.

Singleton, D. J.  The reliability of ratings on the essay portion of the 

    Language Skills Examination.  Unpublished doctoral dissertation, 
    University of Georgia, 1976.

Swineford, F.  Test analysis manual.  Statistical Report 74-06.  Princeton, 

    NJ: Educational Testing Service, 1974.

Terranova, G.  The relationship between test scores and test-time.  The 

    Journal of Experimental Education, 1972, 40, 81-83.

Thompson, D. J. & Rentz, R. R.  Large-scale essay testing: Implications for 

    test construction.  Paper presented at the International Symposium on 

    Educational Testing, The Hague, The Netherlands, July, 1973.

Thorndike, R. L. & Hagen, E.  Measurement and evaluation in psychology and 

    education.  New York: Wiley, 1977.

Tuinman, J. J.  Determining the passage dependency of comprehension in 5 major

    tests.  Reading Research Quarterly, 1973-1974, 9, 206-223.

Veal, R. & Rentz, R.  Large-scale essay testing.  Paper presented at the 

    annual meeting of the National Council on Measurement in Education, 

    Chicago, 1974.

Watters, P.  Faith, hope, and parity.  Change Magazine, October, 1979, 

    pp. 10-13.

Wesman, A. G.  Some effects of speed in test use.  Educational and 

    Psychological Measurement, 1960, 20, 267-274.

Willig, C.  The University System of Georgia Regents' Test: A faculty 

    perspective.  Paper presented at the annual meeting of the Mid-South 

    Educational Research Association, November, 1980.

Wright, B. D.  Solving measurement problems with the Rasch model.  Journal of 

    Educational Measurement, 1977, 14, 97-116.     


                              Appendix A
CURRENT Regents' Testing Program Policy and Procedures (from the Academic Affairs Handbook)

Board Policy
Administrative Procedures (includes Special Administration for students with disabilities)
Use of Dictionaries on the Essay Test
"Grandfather" Issue
 
                              Appendix B

                 MEMBERS OF THE COMMITTEE ON THE REGENTS'                 
                        READING TEST, 1981-1982       


Ms. Marolyn Howell                              Dr. Helen Naugle
Developmental Studies                           English Department
Abraham Baldwin Agricultural College            Georgia Institute of Technology
 
Dr. William Dodd                                Ms. Verdery Deal
Developmental Studies                           Developmental Studies
Augusta College                                 Georgia Southern College

Mrs. Annie Russell                              Dr. Joan Elifson
Developmental Studies                           Developmental Studies
Emanuel County Junior College                   Georgia State University

Miss Patricia Ann Solomon                       Ms. Judy L. Shank
Developmental Studies                           Developmental Studies
Albany Junior College                           Southern Technical Institute

Mrs. Rosa Tift, Chairperson                     Dr. William Diehl
Developmental Studies                           Developmental Studies
Albany State College                            University of Georgia

Dr. Philip Scriven                              Dr. Bob W. Jerrolds
Developmental Studies                           Reading Education
Savannah State College                          University of Georgia

Ms. Brenda Jackson                              Mrs. Annie Robinson
Developmental Studies                           Developmental Studies
Georgia Southwestern College                    Fort Valley State College

Dr. Monica Jean Hiler                           Dr. Henrietta Miller
Developmental Studies                           Developmental Studies
Gainesville Junior College                      Clayton Junior College

Dr. Nancy Bland                                 Ms. Dorothy Randall
Elementary Education                            Developmental Studies
Armstrong State College                         Bainbridge Junior College

Dr. Joan H. Marshall                            Dr. George M. McNinch
Learning Center                                 Department of Education
Columbus College                                West Georgia College

Dr. Ola M. Brown                                Mrs. Elle Billiard
Education Department                            Developmental Studies
Valdosta State College                          Atlanta Junior College

Ms. Teresa T. Deen
Developmental Studies
Kennesaw State College

 

       

                      MEMBERS OF THE TESTING SUBCOMMITTEE OF THE
                      ACADEMIC COMMITTEE ON ENGLISH, 1981-1982





Dr. William J. Johnson
Languages and Literature
Augusta College

Dr. Luetta Milledge
Department of English
Savannah State College

Dr. James W. Mathews
Department of English
West Georgia College

Dr. Larry Corse
Humanities Division
Clayton Junior College

Dr. Thomas A. Wilkerson
Humanities Division
Dalton Junior College

Dr. Jean B. Bridges
Humanities Division
Emanuel County Junior College

Dr. Betty Jo Strickland
Humanities Division
Brunswick Junior College


Last updated: November 8, 1996