The Content of the Regents' Test
Administration and Scoring of the Regents' Test
History of the Regents' Testing Program
Procedures Used for Test Development and Validation
Development and Validation of the Reading Test
Development of Test Specifications
Procedures for Passage Selection and Item Writing
The Passing Score Set on the Reading Test
Evidence of Content Validity
Other Evidence of Validity
Development and Validation of the Essay Test
Development of Test Procedure
Procedure for Selecting Essay Topics
Rationale for Essay Scoring Standards
Evidence of Content Validity
Other Evidence of Validity
Reliability of Reading Test Scores
Analysis of Reading Test Items
Equating of Reading Test Forms
Reliability of Essay Test Scores
Statement of Policy by the Board of Regents
Concerning the Regents' Testing Program
List of Members of the Testing Subcommittee
of the Academic Committee on English and Members of
the Committee on the Regents' Reading Test
List of Tables
Table
1 Skill Categories of the Regents' Reading
Test
2 Findings from Two Studies on the Correlations
Between the Regents' Reading Test and Selected
Academic Variables
3 Appraisal of Speededness in Recent Administrations
Of the Reading Test
4 Correlations between the Difficulties of Reading
Form F Items for Samples of Students of a Given
Ability Within Different Types of Institutions
5 Items from Form 15 of the Regents' Reading Test
That Were Relatively More Difficult for Black
Students than for White Students
6 Correlations between the Regents' Reading Test
And Selected Academic Variables Within Five
Different Institutions
7 Means and Standard Deviation of Scores Obtained
On the Regents' Reading Test and on Selected
Academic Variables by Students at Five
Different Institutions
8 Mean Analytic Ratings on 22 Components for Regents'
Essays Given Holistic Ratings of 1, 2, 3, and 4
9 Mean Analytic Ratings on 22 Components for Regents'
Essays Given Holistic Ratings of 1 and 2
10 Percent of "Passing" Analytic Ratings Assigned on
22 Components to Regents' Essays Holistically
Graded as 1's and 2's
11 The Pass Rates Attained on the Regents' Essay
Test by Students Performing at Different Levels on
an Objective Writing Test
12 Findings from Two Studies on the Correlations
between the Regents' Essay Test and Selected
Academic Variables
13 Relations between Scores on the Verbal Section
of the Scholastic Aptitude Test (SAT-V) and Passing
Rates on the Regents' Essay Test
14 Relation between English Grade Point Averages and
Passing Rates on the Regents' Essay Test
15 Mean Ratings Assigned by Essay Raters from
Predominantly Black and Non-Black Institutions to
Essays Written by Students from Predominantly Black
and Non-Black Institutions
16 Correlations between Regents' Essay Test and
Selected Academic Variables Within Five
Different Institutions
17 Faculty Responses to Questions about the
Regents' Test
18 Classification of Repeaters on Two Administrations
of the Reading Test
19 KR-20 Reliability Estimates for Form 17 and
Form 20
20 Item Analysis Data for the Spring, 1984
Administration of Form 23
21 Raw Score to Scaled Score Conversion Table
For Form 23 of the Regents' Reading Test
22 Rater Performance Summary Statistics for
Fall, 1980 through Summer, 1981
23 Estimates of Rating Reliability for the Essay
Portion of the Language Skills Examination
24 Regents' Test Results from 1972 to 1982
List of Figures
Figure
1 Mean ratings assigned by essay raters from
Predominantly black and non-black institutions
to essays written by students from predominantly
black and non-black institutions
Chapter I
OVERVIEW OF THE REGENTS' TESTING PROGRAM
By a policy statement issued in 1972, the Board of Regents of the
University System of Georgia instituted the Regents' Testing Program. As
described in this statement, the Program serves as one means by which each
institution in the University System can ensure that students receiving degrees
from the institution possess "literacy competence," which was defined as
"certain minimum skills of reading and writing." The Board of Regents
identified two specific objectives for the Testing Program:
(1) "to provide Systemwide information on the status of student
competence in the areas of reading and writing; and
(2) to provide a uniform means of identifying those students who fail to
attain the minimum levels of competence in the areas of reading and
writing."
The Regents' Test was developed to satisfy these objectives. It is
composed of two components, a reading test and an essay test. Students' scores
on the tests are used to determine whether they have the minimum levels of
reading and writing skills required for graduation.
According to the Regents' policy, students may be required to take the
test in the quarter after they have attained 45 hours of degree credit, and
they must take the test once before they have acquired 60 hours of credit. If
a student has not passed both components of the test by the quarter in which 75
credit hours are acquired, enrollment in remedial courses is required until
passing status on the two components of the test has been attained. There is
no limit on the number of times a student may take remediation and retake the
test. The full text of the current Regents' policy is given in Appendix A.
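The credit-hour rules summarized above can be expressed as a small decision function. The following Python sketch is illustrative only: the function name, the status labels, and the handling of boundary values (for example, treating exactly 45 or 75 hours as having reached the threshold) are assumptions and are not taken from the policy text in Appendix A.

```python
def regents_test_status(credit_hours, passed_reading, passed_essay):
    """Classify a student under the credit-hour rules summarized above.

    The status labels and boundary handling are illustrative assumptions.
    """
    if credit_hours < 0:
        raise ValueError("credit_hours must be non-negative")
    if passed_reading and passed_essay:
        # Both components passed: the graduation requirement is met.
        return "requirement satisfied"
    if credit_hours >= 75:
        # Remediation is required until both components are passed.
        return "remediation required"
    if credit_hours >= 60:
        # The test must be taken before 60 hours are acquired.
        return "test should already have been taken"
    if credit_hours >= 45:
        # Institutions may require the test after 45 hours are attained.
        return "may be required to take test"
    return "not yet eligible"
```

Because the policy places no limit on retakes, the sketch has no state for a final failure; a student remains in remediation until both components are passed.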
The paragraphs that follow briefly describe the content of both the reading
and the essay components of the Regents' Test. They also describe the manner in
which these tests are administered and scored and the manner in which students'
scores are reported.
The Content of the Regents' Test
The Reading Test
The Reading Test, which has an administration time of one hour, is a
60-item, multiple-choice test composed of ten reading passages and five to
eight questions about each passage. The passages usually range from 175 to 325
words in length, treat topics drawn from a variety of subject areas (social
science, mathematics and natural science, and humanities), and entail various
modes of discourse (exposition, narration, and argumentation). The questions
that accompany the passages of the Reading Test have been designed to assess
four major aspects of reading: (1) Vocabulary, (2) Literal Comprehension, (3)
Inferential Comprehension, and (4) Analysis. A description of these skills is
given in Table 1, and a description of the types of items that are used to
measure each of the skills is available from the faculty member at each
institution who is responsible for reading remediation. A sample form of the
Regents' Reading Test, which provides examples of the types of passages and
items comprising the test, has been distributed to a Regents' Test coordinator
at each institution.
Table 1
Skill Categories of the Regents' Reading Test
Vocabulary: entails identifying the meanings of words as they are used in
passages. The student may use context clues, structural analysis and/or a
general understanding of the meaning of the passage to determine the meaning
of a word.
Literal Comprehension: entails recognizing information and ideas presented
explicitly in passages. Literal comprehension items require a student to
recognize (1) details or facts, (2) a sequence of events, (3) a comparative
relationship, (4) a cause and effect relationship, or (5) the referent for
which a word or group of words has been substituted in a passage.
Inferential Comprehension: entails synthesizing and interpreting material
that is presented in a passage. Inferential comprehension items involve the
following skills: (1) identifying the main idea of a passage or paragraph,
(2) inductive reasoning, (3) deductive reasoning, and (4) interpretation of
figurative or other language.
Analysis: is concerned with how or why a passage is written rather than what
a passage is about. In general, analysis items require inferences to be made
about the style, purpose, or organization of a passage.
Test Specifications
The test consists of ten passages with five to eight items for each passage.
In all, there are sixty items on the test. The categories of Vocabulary,
Literal Comprehension, and Analysis are each assessed by twelve to fourteen
items. There are twenty to twenty-four items for the Inferential
Comprehension category.
Passages on the test are from textbooks, literary works, magazines,
newspapers, and other written material that, in the judgment of committee
members, all students receiving college degrees should be able to comprehend.
The Essay Test
Students who take the Essay Test have one hour in which to choose and
write on one of two topics that are given. A partial list of the topics that
have been used on past forms of the Essay Test is provided in the Regents'
Testing Program Essay Scoring Manual, which has been distributed to all
institutions in the System.
Students taking the Essay Test are given the following directions:
Organization of your essay is important. Think toward a
good thesis sentence, some specific supporting points,
and a definite conclusion. In general, passing the
essay will require that you (1) state and develop a
central idea; (2) have an organization which is
indicative of an overall plan; (3) deal with the assigned
topic; and (4) avoid serious errors in diction, sentence
structure, and paragraph development.
Administration and Scoring of the Regents' Test
Administration
Each quarter, during a two-day testing period specified by the Regents'
Testing Program office, the Regents' Test is administered to eligible students
at all institutions in the University System. Just before the testing period,
the Regents' Testing Program office sends to the Regents' Test Coordinator at
each institution the test materials that are needed. Because each institution
is responsible for its own test administrations, the Test Coordinator oversees
the distribution of these materials and arranges for supervisors and proctors
to administer the test. An Administration Manual that is provided by the
Regents' Testing Program office details the testing procedures that are to be
followed so that all test administrations are standardized. Administration
sites are also monitored periodically by staff from the Regents' Testing
Program office to ensure that the standardized procedures are followed at each
institution. After the last test administration at an institution, all testing
materials are returned to the Regents' Testing Program office so that the
students' test responses can be scored.
Scoring the Reading Test
Students' responses to the items of the Reading Test are recorded on
machine-readable answer sheets so that these responses can be read and scored
by computer. A standard score is used to describe the Reading Test
performance of each examinee. This score is derived by translating the
student's total raw (number-right) score on the test to a Rasch score scale
with a range from 0 to 99. Whether the student has met the minimum
requirements established for reading is determined by comparing this
translated score to the passing score that has been set for the Reading Test.
Scoring the Essay Test
The essays to be scored are distributed by the Regents' Testing Program
office among six scoring centers in the state. All institutions in the System
send representatives to be raters at the nearest scoring center. The number
of raters sent by each institution is determined by the ratio of its sophomore
enrollment to the sophomore enrollment of the entire System. As each essay is
identified only by the student's social security number, the essay raters do
not know the identity or the institution of the students whose papers are
graded.
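The proportional rule for assigning raters can be illustrated with a short calculation. The Python sketch below is an assumption-laden example, not the procedure used by the Regents' Testing Program office: the function name, the total number of raters, and the largest-remainder rounding used to obtain whole numbers of raters are all hypothetical, since the text states only that the allocation follows the ratio of sophomore enrollments.

```python
def allocate_raters(sophomore_enrollment, total_raters):
    """Split total_raters among institutions in proportion to enrollment.

    sophomore_enrollment maps an institution name to its sophomore count.
    Largest-remainder rounding is an assumption made for this sketch.
    """
    system_total = sum(sophomore_enrollment.values())
    if system_total <= 0:
        raise ValueError("system enrollment must be positive")
    # Exact (fractional) share of raters for each institution.
    exact = {name: total_raters * n / system_total
             for name, n in sophomore_enrollment.items()}
    # Start from the whole-number part of each share.
    allocation = {name: int(share) for name, share in exact.items()}
    leftover = total_raters - sum(allocation.values())
    # Give remaining raters to the largest fractional remainders.
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation
```

With hypothetical enrollments of 500, 300, and 200 sophomores and seven raters in all, the exact shares are 3.5, 2.1, and 1.4, so the sketch assigns 4, 2, and 1 raters respectively.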
Each essay is graded independently by three raters who use a holistic
procedure to assign ratings to the essay. When rating the essays, raters use
a four-point scale. A "4" on the scale indicates superior performance, a "3"
clearly passing performance, a "2" barely passing performance, and a "1"
substandard or failing performance. Model essays define the four points of the
rating scale by indicating the meaning of the division points (i.e., 4/3, 3/2,
2/1) between the ratings on the scale:
Ratings:      4     3     2     1
           ------|-----|-----|------
Models:         4/3   3/2   2/1
One model essay is used to represent each division point. An essay that is
judged to be better than the 4/3 model is given a "4"; an essay judged to be
better than the 3/2 model but not as good as the 4/3 model is given a "3"; an
essay judged to be better than a 2/1 model but not as good as a 3/2 is given a
"2"; and an essay judged to be poorer than the 2/1 model is given a "1." The
set of standard model essays used to define the division points on the scale
is included in the Description of Essay Scoring Procedures, which is provided
in the Regents' Testing Program Essay Scoring Manual. Also included in this
description are analyses of the model essays, definitions of the four score
levels used as the basis for selecting model essays, and answers to questions
that raters frequently ask about the procedures for scoring the Essay Test.
These materials are provided to all raters before each quarterly scoring
session. For raters who are grading essays for the first time, additional
information and samples of essays that have been graded are provided in the
Essay Scoring Manual.
The final score assigned to an essay is usually the rating on which at
least two out of three raters agree. When there is no agreement among the
raters, the final score is the middle rating of the three assigned to the
essay. One consequence of this scoring procedure is that no essay can receive
a failing grade unless at least two of three raters have given it a failing
grade. Further description of the essay scoring procedure is provided in the
Essay Scoring Manual.
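The rule for combining three ratings can be sketched in a few lines. The following Python example is an illustration only; the function names are hypothetical. It relies on the observation that, for three ratings, the middle value is also the rating on which two or three raters agree whenever such agreement exists, so a single median computation covers both cases described above.

```python
def final_essay_score(ratings):
    """Return the final score for an essay from its three holistic ratings."""
    if len(ratings) != 3 or any(r not in (1, 2, 3, 4) for r in ratings):
        raise ValueError("expected three ratings on the four-point scale")
    # After sorting, index 1 holds the middle rating. If two or three
    # raters agree, their shared rating is necessarily the middle value.
    return sorted(ratings)[1]


def essay_passes(ratings):
    """A final score of 2 or higher is passing; 1 is failing."""
    return final_essay_score(ratings) >= 2
```

For example, ratings of 1, 1, and 4 yield a final score of 1, and ratings of 2, 3, and 4 yield 3. The median equals 1 only when at least two ratings are 1, which matches the stated consequence that no essay fails unless two of the three raters have failed it.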
As is indicated in the Regents' policy, given in Appendix A, a student
may request a formal review of a failing essay if (1) there is one passing
score among the three grades the essay was assigned and (2) the student has
completed all English composition courses required by the institution. The
review is initiated on the student's campus. If the student's appeal is
sustained, the essay is sent to the Regents' Testing Program office to be
rescored by a systemwide review panel.
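The two conditions for requesting a review can likewise be written as a simple check. This Python sketch is an illustration under stated assumptions: the function name and the argument representing course completion are hypothetical, and the final score is taken to be the middle of the three ratings, as in the scoring procedure described earlier.

```python
def may_request_review(ratings, completed_composition_courses):
    """Return True if a failing essay meets both review conditions."""
    if len(ratings) != 3 or any(r not in (1, 2, 3, 4) for r in ratings):
        raise ValueError("expected three ratings on the four-point scale")
    # The essay fails when the middle rating is 1.
    essay_failed = sorted(ratings)[1] == 1
    # Condition (1): one passing rating (2 or higher) among the three.
    passing_ratings = sum(1 for r in ratings if r >= 2)
    # Condition (2): all required English composition courses completed.
    return essay_failed and passing_ratings == 1 and completed_composition_courses
```

An essay rated 1, 1, and 2 would qualify for review if the student had completed the required courses, whereas an essay rated 1 by all three raters would not.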
Score Reporting
Within the three-week period following a quarterly administration of the
Regents' Test, each institution in the University System is issued a Report of
Results. In an institution's Report, data are provided that describe the test
performance of each student from the institution who participated in the
quarterly administration of the Regents' Test. Also provided to each
institution is an Institutional Summary Report, which includes the following
information: a summary of the performance of the institution's examinees on the
Reading Test and the Essay Test for first-time examinees, repeaters, and these
two groups combined; a description of the institution's performance on each
skill category of the Reading Test; and, to facilitate year-to-year
comparisons, an historical summary of results for first-time examinees and repeaters.
To allow comparisons with similar institutions, various statistics are reported
by institutional type (university, senior college, and junior college). Also
provided is a report of the test performance of students at each institution
in the System.
Personnel at each institution are responsible for reporting scores to
individual students.
Chapter II
DEVELOPMENT AND VALIDATION OF THE REGENTS' TEST
Given that the primary purpose of the Regents' Testing Program is to
appraise students' reading and writing skills, the procedures used to develop
the Regents' Test are central to determination of its validity. Because
development and validation of this test are highly inter-related, both of these
issues are discussed in this chapter. Prior to this discussion, a brief history
of the Testing Program is given to explain the rationale underlying its
inception.
HISTORY OF THE REGENTS' TESTING PROGRAM
In the middle of the 1960's, the University System of Georgia defined a
core curriculum that all students attending the academic institutions of the
System were henceforth expected to complete. The requirements pertaining to the
core were somewhat general in nature, since they identified only the types of
courses (e.g., "literature" and "composition") that students would have to take
in order to complete a specified number of credit hours in each of four areas:
humanities, social studies, science/mathematics, and the major subject.
In the late 1960's, well before the subject of accountability became a
national preoccupation, the Chancellor of the University System of Georgia
expressed interest in ascertaining what skills had been acquired by those
students who had participated in the core curriculum (Johnson, 1980). Thus,
during the 1968-1969 academic year, samples of students were administered
College-Level Examination Program (CLEP) tests in the interest of measuring their
level of skill in the three core areas of Humanities, Social Studies, and
Mathematics/Science. In the spring of 1970, the Survey of College Achievement
was administered in lieu of the CLEP because it covered the same subject areas
and yet required less time to administer.
The notion of testing only reading and writing skills was born out of
deliberations between the Chancellor and the presidents of institutions in the
Georgia System during the spring of 1970. In light of findings that suggested a
statewide as well as national decline in the level of college students'
abilities to read and write, the Chancellor and these administrators deemed that
college students' proficiency in these skills should be the focus of a statewide
testing program. It was also concluded that skills gained from core curriculum
courses would be difficult to define on a systemwide basis since students could
satisfy the core requirements in mathematics, science, and social studies
through a number of different courses.
Reading and writing skills were judged to fall within the province of the
Academic Committee on English, which is an advisory committee composed of
English department heads from the 33 institutions of the University System. A
Testing Subcommittee was appointed by the Academic Committee on English to work
with testing experts on the development of an appropriate test that could be
administered experimentally in the spring of 1971.
Concerned with devising a test that could be administered and scored
efficiently and inexpensively, the Subcommittee and testing experts agreed that
multiple-choice items could be used to assess students' reading comprehension
and some of their writing skills, namely, their knowledge of grammar and word
usage. However, the Subcommittee also determined that a writing sample should
be required as part of the writing skills tests; it was believed that students'
ability to organize and express their ideas was important to measure and that
this ability could be validly appraised only by such a measure (Johnson, 1980).
In light of these views, the Testing Subcommittee and testing experts
devised an experimental version of the test that consisted of three parts: a
multiple-choice test of reading comprehension, a multiple-choice test of writing
(grammar and usage), and an essay test. The items comprising the
multiple-choice reading and writing components of this test were drawn from
retired forms of the Sequential Test of Educational Progress I and the Cooperative
English Test through a lease agreement with the Educational Testing Service.
The essay topic assigned to each form of the test was selected and approved by
the Academic Committee on English. Examinees were to be given 30 minutes to
write on the selected essay topic as well as 30 minutes to work on each of the
two multiple-choice components. The experimental version of the test, which was
called the Language Skills Test of the University System Junior Testing Program,
was administered to samples of students during the spring of 1971. The test was
then formally administered systemwide in the winter of 1972 to the System's
6,500 "rising juniors," who had between 60 and 75 hours of college credit.
As the Junior Testing Program was being developed, the Board of Regents was
formulating a policy statement about the testing program, which it called the
University System Regents' Testing Program. This statement, issued in 1972,
described the purposes and procedures of the Program. As noted in Chapter I,
the specified purpose of the Program was to serve as one means by which each
institution in the University System could ensure that students receiving
degrees from the institution possessed "literacy competence," which was defined as
"certain minimum skills of reading and writing." Undergraduate students who
were enrolled in degree programs and had acquired between 60 and 75 quarter
hours of degree credit (i.e., "rising juniors") were required to take the test
to demonstrate competence in reading and writing. Satisfactory performance on
the Language Skills Test was evidence of competence, but institutions could use
other methods to certify the competence of students who failed this test.
Since 1972, some modifications have been made in the Board of Regents'
policy and in the content of the Language Skills Test, but the procedure of
using this test to assess students' basic reading and writing skills has been
routinely carried out once during each quarter of the school year.
In 1973, after the test had been administered for several quarters, passing the
reading component and one of the two writing components of the Language Skills
Test became a requirement for graduation. In 1974, the objective writing
component was dropped from the testing program so that, since that time, passing both
the reading and the essay components of the test has been a graduation
requirement. In 1979, the Regents established the current eligibility
requirements for the test, which specify that students may be required to take the test
in the quarter after they have attained 45 hours of degree credit, and that they
must take the Language Skills Test before they have acquired 60 hours of credit.
If a student has not passed one or both components of the test by the quarter in
which 75 hours have been acquired, enrollment in remedial courses is required
until passing status on all components of the test has been attained (see
Appendix A for the full text of the Regents' policy).
With respect to the content of the test, in 1974 it was decided that
students taking the Essay Test should be given a choice between two topics on
which to write and, by 1978, the time limits for both the Reading and the Essay
Test were extended to one hour. Also in 1974, responsibility for developing the
reading items was given to the Testing Subcommittee because the lease on the
item pool from ETS had expired. Subsequently, in Winter, 1982, this
responsibility was consigned to a joint committee that consisted of the Testing
Subcommittee and the Committee on the Regents' Reading Test. This joint
committee and its activities are described in a section that follows.
PROCEDURES USED FOR TEST DEVELOPMENT AND VALIDATION
Development and validation of the Regents' Test are properly discussed
together because the validity of this test rests primarily on its content
validity, which was established in the course of developing the test. Validity
refers to the degree to which a test provides information that is relevant to
the particular descriptions or decisions that are to be made using the scores of
the test (Hambleton, 1980a; Thorndike & Hagen, 1977). Traditionally, several
kinds of validity have been defined by test specialists. These different kinds
of validity refer to the relevance of the information provided by a test to
different score interpretations or uses, and they rest on different methods for
establishing this relevance (APA, AERA, & NCME, 1974; Anastasi, 1976).
For a skills test, it is most important to demonstrate content validity, which
refers to the appropriateness of claiming that the behaviors assessed by a test
represent the behaviors that the test is intended to assess (APA et al., 1974).
This kind of validity is established not by empirical studies of test scores,
but rather by judgments of the degree to which the items of a test adequately
sample the specified types of behaviors that the test is intended to assess. A
test developer can claim that a test is content valid when (1) the skills the
test is to assess have been clearly specified, and (2) experts have judged that
these skills are adequately sampled by the items that have been written for the
test (APA et al., 1974).
Described in the sections that follow are the procedures that have been
used both to develop forms of the Regents' Reading and Essay Tests and to show
that these tests are valid.
Section I
Development and Validation of the Reading Test
DEVELOPMENT OF TEST SPECIFICATIONS
The initial specifications for the Reading Test were written by the Testing
Subcommittee of the Academic Committee on English prior to the development of
the first forms of this test in Spring, 1971. As noted above, the items for
these initial forms were to be drawn from a leased pool of items that had been
used by the Educational Testing Service (ETS) on forms of the Sequential Test of
Educational Progress I (STEP I) and the Cooperative English Test. The
specifications that the Subcommittee adopted resembled those that had been used by
ETS to define what was measured by the STEP I reading test. The Subcommittee's
specifications indicated that the following categories should be covered by
items of the Reading Test: (1) Reproduce Ideas, (2) Translate Ideas and Make
Inferences, (3) Analyze Motivation, (4) Analyze Presentation, and (5) Criticize
Selection. These skills were to be assessed as they were in the STEP I, that is, by
items that referred to written passages presented in the Reading Test.
Vocabulary was later added to this list when the objective writing test, which
had contained some vocabulary items, was dropped from the Regents' Testing
Program in 1974. The Subcommittee specified that Vocabulary would be assessed
in a separate section of the Reading Test by multiple-choice items that
presented a sentence context for the words to be defined.
The current specifications for the Reading Test were developed in the
winter of 1982 by a joint committee that was charged with the responsibilities
of (1) evaluating the content of the current Reading Test, and (2) developing
detailed descriptions of the skills the test should measure and of the types of
items that should be used to measure these skills. As noted above, this joint
committee consisted of the Testing Subcommittee and the Committee on the
Regents' Reading Test. The members of the reading committee were specialists in
reading on the faculty of University System institutions. They all were
specially appointed by the presidents of their institutions to participate in
specifying the content for the Regents' Reading Test. In Appendix B, a list is
provided of both the members of the Reading Committee and the members of the
Testing Subcommittee.
After reviewing the existing specifications for the Reading Test, the joint
committee suggested that several revisions be made. Noting that the skills
covered by the Reading Test could be delimited by four rather than six
categories, the joint committee recommended that these categories be designated
(1) Vocabulary, (2) Literal Comprehension, (3) Inferential Comprehension, and
(4) Analysis. The members of the joint committee then formulated descriptions
of the skills defined by these categories and agreed upon detailed
specifications that described the kinds of items that should be used to measure these
skills. These skill descriptions are presented in Table 1, and the item
specifications have been distributed to the members of the faculty responsible for
reading remediation at the institutions in the System. The major revision in
test content made by the committee pertained to the definition of the Vocabulary
category. It was concluded that Vocabulary would be most appropriately assessed
by items that referred to words presented within the context of a passage; it
therefore was recommended that, like the other aspects of reading, Vocabulary
items refer to content presented in the passages of the Reading Test.
Literature discussing the process of reading formed the primary basis for
the recommendations made by the joint committee. In particular, the taxonomy or
classification system formulated by Barrett (1976) was heavily relied on by the
committee in its deliberations about how it would define those reading skills
covered by the Reading Test. According to Pearson and Johnson (1978), Barrett's
system is the taxonomy that has been most widely used in reading courses and
workshops designed for college-level readers.
In Barrett's taxonomy, four types of reading skills are defined: Literal
Comprehension, Inferential Comprehension, Evaluation, and Appreciation. With
some modification, the joint committee agreed that the first two categories, as
Barrett defined them, described well certain skills of the Reading Test. The
committee did make two amendments to Barrett's description of literal
comprehension tasks. In his taxonomy, Barrett indicated that certain questions
about the main idea of a paragraph or passage could be answered using literal
rather than inferential reading skills. The committee believed that main ideas
would not be explicitly stated in the reading passages likely to appear on the
Reading Test, and so it recommended that the main idea questions posed in the
Test be classified as assessing inferential reading skills. Also the committee
suggested that the category of literal comprehension skills should include
comprehension of anaphoric and cataphoric references, which Barrett did not
include in his list of literal comprehension tasks (see Bormuth, 1970).
With respect to Barrett's Evaluation category, the committee concluded that
this category was generally not applicable to the Reading Test. According to
Barrett, evaluation requires a student to judge the adequacy and desirability of
passage content in light of the student's knowledge about the passage topic.
Because evaluation involves not just comprehension of a passage but also
knowledge of a topic, the committee decided that this category pertained to matters
that should not be assessed by the Reading Test (see Tuinman, 1973-1974).
Barrett's Appreciation category was thought pertinent to the analytic
skills assessed by items of the Reading Test. As Barrett defined this skill, it
involves the identification of literary techniques, forms, styles, and
structures employed by an author to evoke an intellectual or emotional response
from the reader. The committee concluded that Barrett's discussion of this
skill was too narrowly focused on application to narrative and descriptive
literature and so suggested that the broader conceptualization of this skill
specified by Bloom, Madaus, and Hastings (1981) would more adequately describe
the analytic category of the Reading Test. These three researchers called the
skill "Analysis," and they defined it as the process of decomposing any
communication into constituent parts "to clarify how the communication is
organized and the way in which it manages to convey its effects" (p. 249).
Finally, with respect to the matter of vocabulary, this skill was not a
category included in Barrett's taxonomy. Descriptions of this skill given by
Dale, O'Rourke, and Bamman (1971), however, were thought by the committee to
describe well the word-meaning-in-context tasks of the Reading Test. These
descriptions were therefore adapted by the committee to define the vocabulary
skills assessed by this test.
Other specifications developed by the joint committee concerned the content
of the reading passages and the numbers of passages and items to be included in
the Reading Test. These specifications indicated that the passages used in the
test should be 175 to 325 words in length and should be drawn from textbooks,
literary works, magazines, newspapers, and other written material that, in the
judgment of the committee, college graduates should be able to comprehend.
Moreover, it was specified that these passages should concern various subjects
of the social sciences, humanities, and natural sciences and mathematics and
that the passages should differ in mode. In light of the one-hour time limit
established for the test, it was also decided that ten passages and 60 items
should comprise each form of the test, with five to eight items accompanying
each passage. Finally, the committee members agreed that each skill category
should be assessed by the following numbers of items:
     Vocabulary                    12 - 14 items
     Literal Comprehension         12 - 14 items
     Inferential Comprehension     20 - 24 items
     Analysis                      12 - 14 items
The category of inferential comprehension was assigned the largest number of
items because the committee considered this skill the most central to students'
understanding of the types of passages included in the test.
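The numerical specifications above lend themselves to a mechanical check of a draft form. The following is a minimal sketch in Python, not a procedure described in this report: the function name `check_form` and the draft form shown are hypothetical, and only the ranges and totals are taken from the specifications.

```python
# Figures taken from the specifications described above.
CATEGORY_RANGES = {
    "Vocabulary": (12, 14),
    "Literal Comprehension": (12, 14),
    "Inferential Comprehension": (20, 24),
    "Analysis": (12, 14),
}
TOTAL_ITEMS = 60
TOTAL_PASSAGES = 10
MIN_PER_PASSAGE, MAX_PER_PASSAGE = 5, 8


def check_form(category_counts, items_per_passage):
    """Return a list of specification violations (empty if the form conforms)."""
    problems = []
    for category, (low, high) in CATEGORY_RANGES.items():
        count = category_counts.get(category, 0)
        if not low <= count <= high:
            problems.append(f"{category}: {count} items (expected {low}-{high})")
    total = sum(category_counts.values())
    if total != TOTAL_ITEMS:
        problems.append(f"total items: {total} (expected {TOTAL_ITEMS})")
    if len(items_per_passage) != TOTAL_PASSAGES:
        problems.append(
            f"passages: {len(items_per_passage)} (expected {TOTAL_PASSAGES})"
        )
    for number, count in enumerate(items_per_passage, start=1):
        if not MIN_PER_PASSAGE <= count <= MAX_PER_PASSAGE:
            problems.append(f"passage {number}: {count} items (expected 5-8)")
    if sum(items_per_passage) != total:
        problems.append("items per passage do not add up to the category total")
    return problems


# Hypothetical draft form, invented for illustration only.
draft_counts = {
    "Vocabulary": 13,
    "Literal Comprehension": 13,
    "Inferential Comprehension": 22,
    "Analysis": 12,
}
draft_passages = [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
print(check_form(draft_counts, draft_passages))  # prints []
```

Because the category ranges sum to between 56 and 66 items, a form can satisfy every range and still miss the 60-item total, which is why the total is checked separately.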
PROCEDURES FOR PASSAGE SELECTION AND ITEM WRITING
The joint committee works in conjunction with staff from the Regents'
Testing Program office to develop new forms of the Reading Test. Using the
skill and item specifications devised for the test, members of the joint
committee select passages and write items that they regard as suitable for
inclusion in the test. These materials are then submitted to the Regents'
Testing Program office where the items are reviewed for technical soundness and
edited when necessary. Subsequently, staff from this office identify and
organize into a test form passages and items that appear to conform to the
specifications concerning the types of passages and items that must appear in the
Reading Test. Some of the passages and items on each form are new, and some
that have been used on previous forms are used so that the new form can be
equated to past forms of the Reading Test (see Chapter III). The preliminary
test form is then submitted to the joint committee, which judges whether the
passages and items comprising the form conform to the test specifications and
adequately sample the skills these specifications designate the test is to
measure. After a preliminary test form is approved, it becomes a final form and
is used at a regular, quarterly administration of the Regents' Test. The new
passages and items included on the final form are regarded as experimental, and
students' responses to the new items are examined after the form has been
administered. Any items that appear, on the basis of these responses, to be
flawed are not counted when students' Reading Test scores are calculated.
THE PASSING SCORE SET ON THE READING TEST
As is the case with all standards for competency that are set (see
Hambleton, 1980a; Popham, 1978; Shepard, 1980), the passing score on the Reading
Test was set by judgmental methods. That is, experts decided on rational
grounds the minimal level of performance that was needed to pass the test. The
procedures that these experts used to make this judgment and the considerations
that influenced this judgment are described in the paragraphs that follow.
After the Reading Test was formally administered for the first time in the
Winter of 1972, the Subcommittee on Testing met to consider what level of
reading performance should be required to pass the test. A standard score of 51
was tentatively chosen after the Subcommittee had reviewed the reading scores
that students had received at the first administration of the Reading Test. The
standard score of 51 represented a percentile rank of 10, meaning that 10% of
the students in the Winter 1972 administration received reading scores below 51.
After performance data from this and two subsequent test administrations were
examined, the cut-score was set at this level; the Subcommittee believed that no
more than 10% of the students in the University System should fail the Reading
Test until more information became available and also concluded that this
cut-off score would effectively serve to identify those students having the most
serious reading problems.
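The relation between a cut-score and the resulting failure rate can be illustrated with a short sketch. It assumes a list of standard scores is available; the scores below are invented, and the helper names are hypothetical. The Subcommittee's decision was a judgment informed by such data, not the output of a formula.

```python
def percent_below(scores, cut):
    """Percentage of examinees scoring below the cut (the failure rate)."""
    return 100.0 * sum(score < cut for score in scores) / len(scores)


def highest_cut_within_fail_rate(scores, max_fail_percent):
    """Largest observed score usable as a cut without exceeding the failure limit."""
    best = min(scores)
    for cut in sorted(set(scores)):
        if percent_below(scores, cut) <= max_fail_percent:
            best = cut
    return best


# Invented standard scores for 20 examinees (illustration only).
scores = [45, 49, 51, 53, 55, 56, 58, 59, 60, 61,
          62, 63, 64, 65, 66, 67, 68, 70, 72, 75]

print(percent_below(scores, 51))                 # failure rate at a cut of 51
print(highest_cut_within_fail_rate(scores, 10))  # cut that keeps failures <= 10%
```

With these invented scores, two of the twenty examinees fall below 51, so a cut of 51 fails 10% of the group.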
In 1978, an Ad Hoc Committee on the Regents' Testing Program was convened
to consider, among other issues, the current passing score of 51; of concern was
what seemed to be an inexplicable disparity between the very high pass rates (of
about 98%) that had been recently observed on the Reading Test and the more
moderate pass rates that had been observed on the Essay Test. After study of
data that had been collected, this committee proposed that the passing score on
the Reading Test gradually be raised to 61, which is the level at which it is
set today. Some of these data had indicated that a standard score of 58 earned
on the Reading Test was comparable to the level of reading performance that was
required for exit from the remedial reading programs offered in the University
System. This level of reading performance was considered to be necessary just
for college entry, yet the passing score on the Reading Test was meant to be
indicative of the level of reading proficiency expected of college graduates.
The Ad Hoc Committee therefore concluded that a passing score that exceeded 58
was essential. Other data showed (1) the pass rates that would result from the
use of various passing scores above 58, and (2) the relations between students'
scores on the Reading Test and their performance on the Essay Test and the
verbal section of the Scholastic Aptitude Test (SAT). In light of these data,
the Committee recommended that the passing score be raised from 51 to 59 in the
Fall Quarter of 1978, and that this score be advanced one point each year until
it reached a level of 61 in the Fall Quarter of 1980.
EVIDENCE OF CONTENT VALIDITY
To understand the primary process used to establish the validity of the
Reading Test, it is useful to consider Anastasi's (1976) explanation of the
process of content validation, which describes fully the manner in which content
validity is established. In her textbook on psychological testing, Anastasi
wrote:
Content validity is built into a test from the outset through the
choice of appropriate items. For educational tests, the preparation of
items is preceded by a thorough and systematic examination of relevant
course syllabi and textbooks, as well as by consultation with subject
matter experts. On the basis of the information gathered, test
specifications are drawn up for item writers. These specifications
should show the content areas or topics to be covered, the instruc-
tional objectives or processes to be tested, and the relative impor-
tance of individual topics and processes. On this basis, the number of
items of each kind to be prepared on each topic can be established.
(pp. 135-136)
As a previous section shows, the procedures that have been used to develop
the Regents' Reading Test are like those that Anastasi described. The
specifications written for the test were devised by a joint committee of experts
in the subjects of reading and English. To formulate these specifications,
these experts consulted literature that treated the process of reading and
selected from that literature skill descriptions that appeared to most clearly
and completely describe the skills to be assessed by the Reading Test. The test
specifications drawn up by these experts not only described the skills to be
covered by the Reading Test and each skill's relative importance but also
provided detailed descriptions of the kinds of items that should be written to
assess each of the skills covered by the test. As Cronbach (1971) has noted,
when a test's specifications are given in detail, a guide to item writers is
provided that greatly enhances the chance that appropriate items will be written
for the test.
The passages that are selected and items that are written on the basis of
the test specifications are subsequently appraised first by testing experts for
technical soundness and then by the joint committee of English and reading
experts, which considers their conformity to the test specifications and the
appropriateness of their content. Any items that appear flawed in some way are
revised during this review so that the final test form will represent the
test specifications well and, hence, can be regarded as content valid.
OTHER EVIDENCE OF VALIDITY
Relation between Reading Test Scores and Another Measure of Reading Skills
Although the validity of the Reading Test is largely determined by its
content validity, additional evidence of the test's validity can be gained by
examining the correlation between the Reading Test and another measure of
students' reading skills. The finding that these two measures correlate well
suggests that the two measures assess the same characteristic and, hence,
supports the claim that students' scores on the Reading Test can be validly
regarded as a measure of their reading skills (see Campbell, 1964; Campbell &
Fiske, 1959).
In Summer, 1982, the Regents' Testing Program office conducted a study of
the relation between students' scores on Form 20 of the Regents' Reading Test
and their scores on Reading Form 1A of the Sequential Tests of Educational
Progress (STEP II). Reading Form 1A is part of the STEP II battery of
achievement tests that have a level of difficulty appropriate for college
freshmen and sophomores. The Reading Form has two parts: a 30-item Vocabulary
Section, which presents single-sentence contexts for the words to be tested,
and a 30-item Reading Section, which is composed of passages and questions
about these passages. Thus, the STEP II Reading Form differs slightly from the
Regents' Reading Test in that the vocabulary words are presented in a separate
section rather than within the context of the passages to be read. However,
the STEP vocabulary items resemble items of the Reading Test in that context
clues or inferential skills must be used to answer the items correctly.
A total of 116 students from three junior colleges and two universities
participated in the Regents' Testing Program study. These students were
enrolled either in remedial reading courses or in required English courses at
their institutions. They were administered the STEP Reading Form in their
classes. The Regents' Reading Test was also administered in class to those
students who did not take the Regents' Test at the regular Summer quarter
administration.
The results of this study were as follows. The mean and standard deviation
of the students' scores on the STEP Reading Test were x̄ = 29.44 and
s = 10.18, respectively. On the Regents' Reading Test, the mean and standard
deviation of these students' scores were x̄ = 39.36 and s = 8.9. The
correlation between the total scores on the two measures was .82. A raw score
of 38 on Form 20 of the Regents' Reading Test is needed for passing, and it was
found that this level of performance was predicted by a score on the STEP
Reading Test that corresponded to the 15th percentile, where this percentile
was calculated on the basis of normative data on college sophomores collected
for the STEP.
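The report does not state how the STEP score corresponding to the passing raw score was derived. One plausible reading, offered here only as an assumption, is a simple linear regression of Regents' raw scores on STEP scores built from the summary statistics above. The sketch below uses those reported values; the function name is hypothetical, and the percentile lookup itself would require the STEP norm tables, which are not reproduced here.

```python
def regression_line(mean_x, sd_x, mean_y, sd_y, r):
    """Slope and intercept of the least-squares line predicting y from x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Summary statistics reported for the 116 students in the study.
STEP_MEAN, STEP_SD = 29.44, 10.18
REGENTS_MEAN, REGENTS_SD = 39.36, 8.9
R = 0.82
PASSING_RAW_SCORE = 38  # Form 20 of the Regents' Reading Test

# Assumption: Regents' raw score is regressed on STEP score.
slope, intercept = regression_line(STEP_MEAN, STEP_SD, REGENTS_MEAN, REGENTS_SD, R)

# STEP score whose predicted Regents' raw score equals the passing score.
step_at_cut = (PASSING_RAW_SCORE - intercept) / slope

# Proportion of variance shared by the two measures.
shared_variance = R ** 2

print(round(step_at_cut, 2), round(shared_variance, 4))
```

Under this assumption the passing raw score corresponds to a STEP score of roughly 27.5, and a correlation of .82 implies that the two measures share about 67% of their variance.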
The correlation between the Reading Test and STEP Reading Form is quite
high, which suggests that the skills assessed by the Reading Test are quite
similar to those measured by the STEP Reading Form. This finding lends
considerable support to the claim that students' scores on the test reflect
their levels of reading skill.
Relation between Reading Test Scores and Selected Academic Variables
Additional evidence of the validity of the Reading Test can be gained by
examining the relation between individuals' scores on the test and their scores
on other variables of interest. By finding that the test correlates in the
expected manner with other variables presumed to be related to reading skill,
further support is gained for the claim that scores on the test can be
validly regarded as a measure of individuals' reading levels.
Hickman (1973) and Prather and Smith (1975) examined the relation of
students' Reading Test scores to their high school and college grade point
averages and to their scores on the verbal and mathematical sections of the
Scholastic Aptitude Test. The results of these two studies are presented in
Table 2.
Table 2
Findings from Two Studies on the Correlations
between the Regents' Reading Test and Selected Academic Variables
___________________________________________________________________________
                                            Hickman         Prather &
Academic Variables                          (1973)*         Smith (1975)**
___________________________________________________________________________
Aptitude Measures
  Scholastic Aptitude Test (Verbal)           .75              .59
  Scholastic Aptitude Test (Math)             .57              .37
Grade Point Averages
  Cumulative - High School                    .21              .09
  Cumulative - College                        .34              .47
  Freshman - College                          .28              .27
  English Composition - College               .21              .17
___________________________________________________________________________
*Correlations based on 684 to 906 students attending five
different institutions in the University System of Georgia.
**Correlations based on 1910 students attending one university in the
University System of Georgia.
As is shown in the table, although different kinds of samples were used in
the two studies, similar correlations were obtained. In both studies, the
correlation between students' Reading scores and their performance on the
verbal section of the Scholastic Aptitude Test (SAT-V) was substantial and
considerably greater than the correlation between these scores and students'
grade point averages. Less substantial was the correlation between the Reading
Test and the mathematical section of the Scholastic Aptitude Test (SAT-M).
Such findings are reasonable to expect. Both the SAT-V and the SAT-M are
measures of abilities developed in the course of schooling. It is likely that
these abilities are directly affected by the effectiveness with which written
materials can be read and comprehended. Thus, both of these measures
correlate with the Reading Test. Of course, the relation between the
SAT-V and the Reading Test should be stronger than that between the SAT-M and
this test, because reading and verbal reasoning skills are more closely
inter-related than are reading and quantitative reasoning skills. The weaker
and relatively low correlations between reading performance and students' grade
point averages that are shown in Table 2 also should be expected: although
reading skill might have a bearing on course performance, this influence is
unlikely to be strong because many factors other than reading skill affect
students' grade point averages (see Cronbach, 1971).
Relation between Reading Test Scores and Irrelevant Variables
A claim of test validity is supported not only by findings that the test
correlates in the expected manner with other variables, but also by findings
that performance on the test does not relate to variables that are presumed to
be irrelevant to the skills assessed by the test (see Campbell, 1964). For a
skills test like the Regents' Reading Test, there are two other variables that
are routinely examined in terms of their relations to individuals' scores on
the test. These variables are (1) the speed with which examinees can respond
to questions posed on the test, and (2) bias due to the influence of examinees'
ethnic background. Data have been collected to assess the relation of these
two variables to Reading Test performance; these data are described in the
paragraphs that follow.
SPEEDEDNESS. There is much evidence to suggest that response speed and
response power are best regarded as different attributes (Kendall, 1964;
Terranova, 1972). The Regents' Reading Test has been designed primarily as a
measure of reading power rather than reading speed. That is, it is intended
that examinees' scores on the test will reflect the accuracy, not the rate,
with which they can read the passages and respond to the items comprising the
test. For tests of skill such as the Reading Test, the rate of response should
not greatly influence individuals' scores on the test: on some tests response
rate has been found to be more closely associated with irrelevant examinee
qualities like temperament than with the ability to respond correctly
(Himmelweit, 1946; Wesman, 1960).
There are several ways to appraise the speededness of a test (see Donlon,
1978; Marahnich, 1980; Rindler, 1979). Ideally, it is appraised by analyzing
the test performance of examinees who have been administered parallel forms of
a test under both timed and untimed test conditions (see Cronbach & Warrington,
1951). If the test is unspeeded, one should find that the additional time
given to examinees effects no change in their relative test performances.
Such complex testing procedures often are not feasible to carry out. As a
consequence, speededness is commonly estimated on the basis of data gathered
from a single test administration even though this simpler approach is
recognized as somewhat inadequate (Rindler, 1979). Using the single test
administration approach, the Educational Testing Service (ETS) has developed a set of
practical criteria that are used to make preliminary judgments about the
speededness of their tests. According to Swineford (1974), ETS regards a test
as possibly speeded when (1) fewer than 100% of the examinees reach 75% of the
test, and (2) fewer than 100% of the items are reached by 80% of the examinees.
These criteria have been noted by Swineford to be somewhat arbitrary, but it is
thought that these criteria function well as signals of potential speededness.
Donlon (1978) has noted that when a test passes these preliminary checks (that
is, when neither condition signals potential speededness), it is unlikely that
the test is speeded.
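The two ETS signals described above can be computed from a record of the last item each examinee reached. The sketch below is an illustration only: the function name, the input format, and the example data are hypothetical, and the two signals are reported separately rather than combined into a single verdict.

```python
import math


def speededness_check(last_item_reached, n_items):
    """Compute the two ETS-style speededness signals.

    last_item_reached: for each examinee, the 1-based position of the
    last item that examinee reached.
    """
    n_examinees = len(last_item_reached)
    three_quarters = math.ceil(0.75 * n_items)

    # Signal 1: percentage of examinees who reach 75% of the test.
    pct_examinees = (
        100.0 * sum(r >= three_quarters for r in last_item_reached) / n_examinees
    )

    # Signal 2: percentage of items reached by at least 80% of examinees.
    items_ok = sum(
        1
        for item in range(1, n_items + 1)
        if sum(r >= item for r in last_item_reached) >= 0.80 * n_examinees
    )
    pct_items = 100.0 * items_ok / n_items

    return {
        "pct_examinees_reaching_75pct": pct_examinees,
        "pct_items_reached_by_80pct": pct_items,
        "signal_1": pct_examinees < 100.0,
        "signal_2": pct_items < 100.0,
    }


# Invented data: ten examinees on a 60-item test.
mostly_finished = speededness_check([60] * 9 + [50], 60)
many_unfinished = speededness_check([60] * 7 + [40] * 3, 60)
print(mostly_finished)
print(many_unfinished)
```

In the first invented group neither signal is raised; in the second, both are, because three of ten examinees stop at item 40.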
In one study, the effect of the time limits of the Reading Test was
examined by administering the test without enforcing the 60-minute time limit
(Fort Valley College, 1974). In this study, 161 students were told that they
could work on the test as long as they wished. Note was made of the time
beyond 60 minutes that these students used. Subsequent analyses indicated that
there was no difference in the passing rates of students who spent different
amounts of time working on the test. This finding does suggest that an
increase in the 60-minute limit would not necessarily improve students' test
scores. However, because the students in this study were not administered the
Reading Test under timed as well as untimed conditions, it is not possible to
discern from the data gathered in this study how the 60-minute limit affects
students' test performance. Hence, the data do not allow determination of the
speededness of the test.
In a second study recently carried out by the Regents' Testing Program
office, data were gathered from recent administrations of the Reading Test to
determine whether this test was potentially speeded according to the criteria
used by ETS. The statistics calculated on the basis of these data are given in
Table 3.
Table 3
Appraisal of Speededness in Recent Administrations of the
Regents' Reading Test
______________________________________________________________________________
                                      Form 17        Form 18
Speededness Criteria                Spring, 1981   Spring, 1982   Summer, 1982
______________________________________________________________________________
Total number of items on the test        70             70             60
Percentage of examinees answering        99            100            100
  75% of the items on the test
Percentage of items answered by         100            100            100
  80% of examinees
______________________________________________________________________________
In interpreting these data, it is important to keep in mind a distinction
between items that are "attempted," with which the ETS criteria are concerned,
and items that are "answered," which are referred to in Table 3. An item is
attempted when an examinee has either marked an answer to the item or omitted
the item but gone on to answer subsequent items. An item is not attempted when
this item appears near the end of a test and an examinee has omitted this item
and the remaining items on the test. In item analyses that are done by ETS,
the number of examinees who have answered an item is distinguished from the
number who have attempted and omitted it and the number who have not reached
it. The number of examinees reaching an item can then be determined by adding
the number who have attempted and omitted it to the number who have answered
it. In the item analyses currently done for the Regents' Reading Test, all
examinees who have not answered an item are counted as omitting it, whether
they have attempted and omitted it or actually not reached it. Because only
two categories of responses (answers and omits) can be distinguished on the
basis of this analysis, the values in Table 3 reflect only the numbers of
items answered. Were the number of items attempted and omitted added to
these values in order to count the number of items that had been reached, these
values would probably be higher. Thus, the values in Table 3 are conservative:
if anything, they overestimate the degree of speededness evident in the Reading Test.
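The distinction drawn above among answered, attempted-and-omitted, and not-reached items can be made concrete with a small sketch. The function name and the response coding (None for a blank) are assumptions made for illustration; they do not describe the item-analysis software used by ETS or by the Regents' Testing Program.

```python
def classify_responses(responses):
    """Label each item as 'answered', 'omitted', or 'not reached'.

    responses: one entry per item in test order; None marks a blank.
    A blank followed by any later answer is an omit; trailing blanks
    are treated as not reached.
    """
    last_answered = -1
    for position, response in enumerate(responses):
        if response is not None:
            last_answered = position

    labels = []
    for position, response in enumerate(responses):
        if response is not None:
            labels.append("answered")
        elif position < last_answered:
            labels.append("omitted")
        else:
            labels.append("not reached")
    return labels


# Invented answer sheet for a six-item test.
labels = classify_responses(["A", None, "C", "B", None, None])
n_answered = labels.count("answered")
n_reached = n_answered + labels.count("omitted")
print(labels, n_answered, n_reached)
```

For this invented sheet, three items are answered but four are reached, which shows why counts of answered items can understate the number of items reached.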
In light of the values reported in Table 3, it appears improbable that the
Reading Test is speeded. As the table indicates, at least 99% of the examinees
administered recent forms of the test completed three-fourths of the items on
these forms, and all of the items were answered by at least 80% of the
examinees. Unless many examinees randomly marked answers to the final items as
the testing time ran out, these values indicate that examinees have little
difficulty finishing the Reading Test and that this test exceeds the ETS
requirements that should be met for a test to be considered unspeeded. The
hypothesis of random responses can be ruled out by the response pattern evident
on the final items of the Reading forms referred to in Table 3. Although not
indicated in this table, it was found that the correct answers to these items
were identified by 84% to 88% of the examinees who took these forms, which
suggests that most examinees' responses to these items were not at all random.
In sum, these findings do not suggest that individuals' scores on the Reading
Test are influenced by the irrelevant variable of test speededness.
BIAS DUE TO THE INFLUENCE OF ETHNIC BACKGROUND. There are many definitions of
bias (see Flaugher, 1978; Scheuneman, 1981), but it is, perhaps, best
understood when viewed as a matter related to the validity of a measure (see
Shepard, 1981). As noted earlier, validity refers to the degree to which the
information given by a test is relevant to the decision or use for which the
test is intended (Thorndike & Hagen, 1977). Bias occurs when the information
given by a test does not have the same meaning for all groups that are tested;
that is, the information is more relevant for one group than it is for
another. The groups for which a test may have this kind of differential
validity may differ in religion, sex, race, ethnicity, or the like. In any
case, however, the finding that individuals' membership in a particular group
affects the meaning of their test scores is undesirable, since this means that
the intended decision or use to be made of these scores is less valid for this
group than it is for others.
In literature treating the matter of bias, two general approaches to
investigating bias are usually noted. When a test is to be used to predict
performance on a criterion measure, the test-criterion relations for different
groups are usually compared (see Hunter & Schmidt, 1976; Linn, 1973). Such
studies would be conducted for tests that are to be used for college admissions
or employee selection since these tests are intended to predict future success.
For a test like the Regents' Reading Test, which is not intended to predict
performance on a particular criterion, studies of the test's internal and
external properties are carried out to determine whether there are any
differences in the meaning of the test for members of different groups.
Studies of a test's internal properties entail investigations of the test's
content, the difficulty of its items, and its internal consistency, among other
things, in the interest of determining whether there is evidence that the test
behaves differently for different groups that are assessed. External analyses
include studies of the relations of the scores on the test to other variables
in the interest of examining whether these relations are the same for different
groups (see Jensen, 1980, for a detailed discussion of these two approaches).
External and internal analyses of the Reading Test shed light on the
question of whether the test is differentially valid for groups of white and
black students who are assessed. The findings from these analyses are
discussed in the paragraphs below, with those studies bearing on the test's
internal properties being discussed prior to those studies bearing on the
test's external properties.
Studies of bias based on internal analyses.
ANALYSIS OF TEST CONTENT. Since the validity of a skills test is strongly
affected by the quality of the content of which it is comprised, studies of
test content constitute one important basis for detecting factors that could
contribute to the differential validity of the test for different racial
groups. Studies of this content should be carried out in the course of
developing the test. One of these studies should entail (1) an examination of
the clarity with which the domains of skills to be assessed by the test have
been described, and (2) an examination of the degree to which the items written
for the test represent these specifications. When the specifications for the
test are judged to be clear, and the items written for the test are judged to
represent these specifications well, many potential sources of invalidity are
eliminated from the test. As Shepard (1981) explained,
The meaning of a test score depends ultimately on how well
the items on the test represent the intended subject matter
or implied ability. Establishing logical validity is often
characterized as a sampling problem, that is, the accuracy
of the inferences made from the test will depend on how well
the test content domain is specified and how well the items
sampled represent the test content domain. The more am-
biguous the definition of the intended domain or the more
elusive our grasp of the intended construct, the more po-
tential there is for what is captured in the test to be a
distortion of the intended meaning. (p.83)
In addition to studies of domain clarity and item representativeness, the
content of a test should also be studied to identify any items that present
stereotypic images, ambiguous wording, or unfamiliar language that might alter
the meaning of the items for the members of a particular group (see Hambleton,
1980b; Cole & Nitko, 1981).
As noted previously, the skill and item specifications for the Reading Test
were developed after careful review by a joint committee of English and reading
experts in the Winter of 1982. After considering the clarity, completeness, and
appropriateness of these specifications, the committee approved them, and they
have subsequently been used as guides for selecting passages and writing items
intended for the Reading Test.
The items that were written for Form 20 of this test were first reviewed by
staff from the Regents' Testing Program office for technical soundness and were
edited where necessary. These passages and items were subsequently organized
into a preliminary test form that was submitted to the joint committee for
further review. This committee judged the conformity of the items to the test
specifications and noted any instances in which ambiguous wording, unfamiliar
language, or difficult vocabulary occurred. After some revisions in the items
were made, the joint committee approved the items, indicating that the items
satisfied the requirements for item content, item structure, and
representativeness set forth in the test specifications. In the final version of
Form 20 of the Reading Test, no apparent features were found that would introduce
bias and differentially affect the meaning of the Reading Test scores for members of
different racial groups.
ANALYSES OF TEST RESPONSES. Analyses of the responses to items constitute
the next stage in an internal analysis designed to detect the presence of bias.
These analyses entail comparative studies of the statistical properties of the
item responses made by different groups. The finding that there are marked
differences in these properties would suggest that there may be items that do
not have the same meaning for the tested groups and, hence, might be biased.
In 1974 Litaker examined the difficulties of Reading Test items for
students attending four different types of institutions in the University System of
Georgia. The types of institutions and the number of schools in each type were
as follows: universities (3), senior colleges (9), junior colleges (11), and
four-year institutions that were historically black (3). The data analyzed by
Litaker consisted of responses to Form F of the Regents' Reading Test, which was
taken by students in the Spring and Fall quarters of 1972 and 1973. In his
study Litaker grouped the students from the four types of institutions into four
ability levels, using as a measure of this ability students' scores on the
objective writing test that had been a component of the Regents' Test at the
time. For his final sample he selected students at each of the four ability
levels from each of four different kinds of institutions. He subsequently
compared, within each ability level, the difficulty of the Reading Test items
for students attending the four types of institutions.
In Table 4 the most important of Litaker's findings are given. This table
shows the correlations, within each ability level, between the item difficulties
for pairs of institutional types. These correlations are most important because
they reflect the degree to which the items of Reading Form F have the same
relative difficulty for the two groups compared. When relative item difficulties
for the two groups are very similar, the values of these correlations will be
close to 1.00. Alternatively, correlations substantially less than 1.00 will be
found when there are items that are relatively more difficult or easy for one or
the other of the two groups compared.
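The statistic described above is an ordinary Pearson correlation computed across items, with each item contributing its difficulty (proportion correct) in each of the two groups. The sketch below illustrates the computation on invented 0/1 response data; the function names are hypothetical, and the report does not describe the software Litaker used.

```python
import math


def item_difficulties(response_matrix):
    """Proportion correct per item (rows = examinees, columns = items, 1 = correct)."""
    n_examinees = len(response_matrix)
    return [sum(column) / n_examinees for column in zip(*response_matrix)]


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented responses of four examinees per group to a four-item test.
group_a = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1], [1, 0, 1, 0]]
group_b = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]

difficulties_a = item_difficulties(group_a)
difficulties_b = item_difficulties(group_b)
r_items = pearson(difficulties_a, difficulties_b)
print(difficulties_a, difficulties_b, round(r_items, 3))
```

Note that this correlation is insensitive to an overall difference in difficulty between the groups: if every item were harder for one group by the same amount, the correlation would still be 1.00. It reflects only whether the items keep the same relative difficulty.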
Table 4
Correlations between the Difficulties of Reading Form F Items
for Samples of Students of a Given Ability* Within Different Types of Institutions
[Litaker, 1974]
______________________________________________________________________________
Ability Level/
Types of Institutions Compared                              Correlation
______________________________________________________________________________
Ability Level 1
  Universities vs. Senior Colleges                              .992
  Universities vs. Junior Colleges                              .983
  Senior Colleges vs. Junior Colleges                           .992
  Universities vs. Historically Black Colleges                  .944
  Senior Colleges vs. Historically Black Colleges               .961
  Junior Colleges vs. Historically Black Colleges               .952
Ability Level 2
  Universities vs. Senior Colleges                              .987
  Universities vs. Junior Colleges                              .990
  Senior Colleges vs. Junior Colleges                           .995
  Universities vs. Historically Black Colleges                  .882
  Senior Colleges vs. Historically Black Colleges               .909
  Junior Colleges vs. Historically Black Colleges               .897
______________________________________________________________________________
*Litaker determined students' abilities using the objective writing test
that was part of the Regents' Test at the time. He selected for his
study students performing in the 2nd, 4th, 6th, and 9th deciles on this
test. Because there was not adequate representation in the 6th and 8th
deciles among students attending the historically black colleges, only
the item difficulties of students who were in the 2nd and 4th deciles
and attended these colleges were analyzed by Litaker and reported here.
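The kind of correlation reported in Table 4 can be illustrated with a short
sketch. It assumes that an item's difficulty is expressed as the proportion of
a group answering the item correctly; the source does not state which
difficulty index Litaker used, and the three groups and their values below are
hypothetical. The function name pearson is likewise only illustrative.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical proportions answering each of six items correctly.
group_a = [0.95, 0.88, 0.74, 0.66, 0.52, 0.41]
# Every item is harder for group_b, but the ordering of items is preserved.
group_b = [0.90, 0.80, 0.69, 0.55, 0.47, 0.30]
# For group_c, items 2 and 5 depart from the pattern seen in group_a.
group_c = [0.90, 0.62, 0.69, 0.55, 0.58, 0.30]

r_ab = pearson(group_a, group_b)  # close to 1.00: same relative difficulties
r_ac = pearson(group_a, group_c)  # noticeably lower: some items behave differently

print(f"A vs. B: {r_ab:.3f}   A vs. C: {r_ac:.3f}")
```

A uniform difference in overall difficulty leaves the correlation near 1.00;
only items that shift in relative difficulty pull it down, which is why these
correlations are read as an index of possible item bias.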
To appraise the correlations presented in Table 4, it is useful to compare
their values to those obtained by Angoff and Ford (1973) in their classic study
of bias in the Preliminary Scholastic Aptitude Test (PSAT). For the verbal and
mathematical sections of this test, Angoff and Ford reported that values of .959
and .923, respectively, were obtained when the item difficulties for matched
black and white students were correlated. As shown in Table 4, Litaker's
correlations that involved historically black colleges range from .882 to .961;
most of these values are slightly lower than the correlations obtained by
Angoff and Ford. Because these findings suggested that some items of the
Reading Test might be relatively more difficult or easy for students at the
historically black institutions, Litaker conducted further analyses in an
attempt to identify such items. He found four items that were relatively easier
and four that were relatively harder for these students than for students
attending the three other types of institutions in the University System.
Litaker examined the content of these aberrant items and speculated about the
reasons for their differential difficulty, but ultimately he concluded that
there was no clear explanation for their possible bias. It is important to note
that Litaker's conclusion reflects the lack of success experienced by other
researchers who have attempted to discern the particular features of items that
make the items relatively more difficult or easy for members of a particular
group. In studies by Jensen (1979), Sandoval and Miille (1980), and Plake
(1980), there was no more agreement from judgmental and empirical analyses of
bias than that level of agreement that would occur by chance.
It also is important to note that it would be inappropriate to generalize
Litaker's findings and consider them applicable to recent versions of the
Reading Test. The form of the Reading Test examined by Litaker was the first
form of the test that had been devised, and it entailed items drawn from a pool
that the Regents' Testing Program had leased from the Educational Testing
Service. Since 1974, the items comprising forms of the Reading Test have not
been drawn from this pool. Also, the content of these items has been consider-
ably refined since that time. Therefore, whatever content characteristics
produced the possible bias evident in Form F may not still be evident in more
recent test forms. Particularly since Litaker was unable to identify any type
of item characteristic that could be consistently related to the discrepant item
difficulties that he observed, there is no basis for the inference that biasing
content characteristics like those that he found will be evident in more recent
forms of the Reading Test.
To examine the question of bias in a more recent form of the Reading Test,
the Regents' Testing Program office conducted a study like Litaker's on Form 15
of this test. This study compared the relative difficulty of Form 15
items for unmatched groups of black and white students who had been administered
this form during the 1978-1979 academic year. Hence, this study differed from
Litaker's in that the item difficulties for black and white students, rather
than for historically black and non-black institutions, were compared.
When the difficulties of Form 15 items for the unmatched black and white
students were correlated, a value of .938 was obtained. This value is high and
compares favorably with the item difficulty correlations of .929 and .901
obtained by Angoff and Ford when they compared the relative difficulties of PSAT
verbal and mathematical items for unmatched groups of blacks and whites. It
also suggests that the relative difficulty of Form 15 items for black and white
students was about the same. An even higher correlation than .938 probably
would have been obtained if the relative difficulty of Form 15 items had been
compared using groups of black and white students who had been matched on
ability. As Angoff and Ford observed, matching appears to reduce some of the
disparities in the test performance of different groups.
Since the correlation of .938 fell slightly below 1.00, the Form 15 items
were examined individually in an attempt to identify those items for which there
was the most marked difference in relative difficulties when the black and white
student groups were compared. Two items (Nos. 21 and 63) that were found to be
slightly more difficult for the black students than were other items on the test
are shown in Table 5.
Table 5
Items from Form 15 of the Regents' Reading Test
That Were Relatively More Difficult for Black Students Than for
White Students
______________________________________________________________________________
                                                  Percent Choosing Each Option
                                                  ____________________________
                                                     White          Black
                                                    Students       Students
______________________________________________________________________________
21. Aerobes are
    *1. live organisms.                                95             82
     2. atmospheric gases.                              2              7
     3. animal residue.                                 2              8
     4. decaying plants.                                1              3
63. This passage was obviously written
    *1. in the mid to late 20th century.               96             83
     2. at the turn of the century.                     2              5
     3. in the mid to late 17th century.                1              3
     4. in the early 20th century.                      1              3
        (omit)                                         (1)            (5)
______________________________________________________________________________
Note: Percentages based on responses of 7,468 white students and 926 black students who
were administered Form 15 during the 1978-1979 academic year.
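One simple way to screen individual items for differences in relative
difficulty is sketched below. It is an illustration only: the source does not
describe the exact procedure the Regents' Testing Program office used. The
sketch standardizes each group's item difficulties within that group and
flags items whose standardized values differ by more than a chosen threshold.
The function names, the threshold, and the data are all hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert item difficulties to z-scores within one group."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def flag_items(p_ref, p_focal, threshold=0.5):
    """Return (item_number, gap) pairs for items with a large standardized gap.

    p_ref and p_focal hold the proportion answering each item correctly in
    the reference and focal groups. A positive gap means the item is
    relatively harder for the focal group than the other items are.
    """
    z_ref, z_focal = standardize(p_ref), standardize(p_focal)
    gaps = [zr - zf for zr, zf in zip(z_ref, z_focal)]
    return [(i + 1, round(g, 2)) for i, g in enumerate(gaps) if abs(g) >= threshold]

# Hypothetical proportions correct on six items.
p_ref = [0.95, 0.88, 0.74, 0.66, 0.52, 0.41]
p_focal = [0.90, 0.62, 0.69, 0.55, 0.58, 0.30]

flagged = flag_items(p_ref, p_focal)
for item_number, gap in flagged:
    direction = "harder" if gap > 0 else "easier"
    print(f"Item {item_number}: relatively {direction} for the focal group (gap {gap})")
```

Because the comparison is made after standardizing within each group, an
overall difference in performance between the groups does not by itself cause
an item to be flagged.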
To determine why the items shown in Table 5 were, relative to other items,
slightly more difficult for black students, it is reasonable to consider the
material of the passages to which each item referred. In the expository passage
to which Item 21 refers, a method of constructing a compost heap is described.
The word aerobes is used in a sentence that provides several context clues that
suggest the meaning of this word. The sentence is:
The soil organisms which break down the plant
and animal residues and convert them into
compost are aerobes, i.e., they must have oxygen
from the atmosphere to carry on their life
activities.
Thus, Item 21 can be answered correctly by the student who is able to use "i.e."
and the pronoun reference "they" to make the link between the soil organisms and
the life activities referred to in the passage. As is evident from Table 5,
most white and black students did answer the item correctly. However,
the analysis of this item for bias indicated that, relatively speaking, the item
posed slightly more difficulty for some black students. It is not clear,
though, why this was the case.
With respect to Item 63, the reason for the discrepancy in black and white
students' performance may be clearer. Item 63 refers to a narrative passage
about a man's rise to success as an advertising agent, singer, and poet during the
"Eisenhower years." The reference to the Eisenhower years and a reference to
computers would appear to be the only information in the passage which would
allow a student to discern that the passage was written about events occurring
in the mid to late 20th century, which is the correct answer to Item 63. In
particular, linking Eisenhower or computers to the middle or late 20th century
requires information that has not been presented in the passage. It may be that
a greater proportion of black than white students lacked this information and so
found this item relatively more difficult.
Studies of bias based on external analyses
Data that were gathered by Hickman (1973) in the course of a larger study
permit rough comparisons of the relation between academic variables and the
Reading Test scores obtained by students who differ in race. In her study,
Hickman examined these relations for each of five institutions, which included
two junior colleges, one 4-year college, and one university, all of which were
predominantly non-black, and one 4-year college that was predominantly black.
For the students at each institution, she calculated the correlation between
their Reading Test scores and both their scores on the Scholastic Aptitude Test
and their grade point averages. Thus, through a comparison of the correlations
obtained for the predominantly black college to those obtained for the four
predominantly non-black institutions in Hickman's study, some conclusions can be
drawn about the similarity or differences in the meaning of the Reading Test
scores for students attending these different types of institutions. Because
the institutions studied were not homogeneous with respect to race, only very
tentative inferences can be made about the meaning of the Reading Test scores
for students who differ in race. Hickman's findings also have limited
generalizability because she included in her study only one predominantly black
college that may or may not have a student body that is representative of the
black student population in the University System.
The results of Hickman's correlational analyses are presented in Table 6.
As is evident from the table, with two exceptions the correlations obtained at
the predominantly black college differed little from those obtained at the other
four institutions that Hickman examined. For example, the correlation of .27
between the Reading Test scores and cumulative high school grade point averages
obtained by students at the historically black institution was about the same as
those reported for students attending the junior and 4-year colleges in
Hickman's study; why the correlation reported for the University was somewhat
lower than those of these other institutions is not clear. Further study of
Table 6 also shows that institutional type did not seem to influence the
relation of students' Reading Test scores to their SAT(M) scores and to their
cumulative college and freshman grade point averages; for all five institutions,
these correlations were about the same.
Table 6
Correlations between Regents' Reading Test Scores
and Selected Academic Variables Within
Five Different Institutions
[Hickman, 1973]
______________________________________________________________________________
                                          Institutional Types
                             _________________________________________________
                                                                 Predominantly
                                  Predominantly Non-Black            Black
                             ___________________________________ _____________
Academic                      Jr.      Jr.     4-Year                4-Year
Variables                   College  College  College  University   College
______________________________________________________________________________
Test Scores
  SAT(V)*                     .61      .76      .74       .70         .58
  SAT(M)**                    .41      .57      .41       .39         .41
Grade Point Averages
  Cumulative-High School      .27      .26      .26       .17         .27
  Cumulative-College          .38      .49      .48       .44         .46
  Freshman-College            .35      .46      .48       .42         .44
  Eng. Composition-College    .35      .40      .44       .39         .32
______________________________________________________________________________
Note: Correlations based on sample sizes ranging from 88 to 560.
*SAT(V) refers to the verbal section of the Scholastic Aptitude Test.
**SAT(M) refers to the mathematical section of the Scholastic Aptitude Test.
The two exceptions to this trend of similar correlations are the markedly
lower correlation between the Reading Test and the SAT(V) reported for the
predominantly black college and the slightly lower correlation between the
Reading Test and students' grade point average in English Composition reported
for this college. The Reading Test - SAT(V) correlation of .58 that was
obtained at the predominantly black college was quite a bit lower than the values
for this correlation obtained at the other institutions in Hickman's study (r's
= .61 to .76). Similarly, the correlation between the Reading Test and
students' grades in English Composition obtained at the predominantly black
institution (r = .32) was slightly lower than the correlations obtained at the
other institutions (r's = .35 to .44).
Restriction of range is the most plausible explanation for the lower
correlations obtained at the predominantly black institution. That is, the students
attending this institution received scores on the correlated variables that were
more homogeneous than the scores of students attending the other institutions
studied by Hickman. This explanation is supported by the data presented in
Table 7, which gives the means and standard deviations of the Reading Test
scores, the SAT scores, and the grade point averages obtained by the students at
the five institutions involved in Hickman's study. When the standard deviations
shown in this table are examined, it becomes clear that the students from the
predominantly black college were considerably more homogeneous in their Reading
Test scores, their SAT(V) scores, and their English composition grades than were
students attending the other four institutions in Hickman's study. Since
homogeneity restricts the size of the correlation that can be obtained, it is
most probable that the correlations among these variables were relatively lower
for the predominantly black college because the students at this college were
less variable in their performance on both variables involved in the
correlations.
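The effect of restricted range on a correlation can be illustrated with the
standard correction for direct range restriction (often called Thorndike's
Case II). This correction is not part of Hickman's analysis; it is offered
here only as a rough sketch, and it assumes a linear relation with equal
error variance across the score range. The inputs are the Reading Test -
SAT(V) correlation of .58 from Table 6 and the SAT(V) standard deviations from
Table 7. Using the University's standard deviation as the reference value is
an arbitrary choice made for illustration.

```python
from math import sqrt

def correct_for_range_restriction(r_restricted, sd_restricted, sd_reference):
    """Estimate the correlation expected if the restricted group had the
    reference standard deviation (correction for direct range restriction)."""
    if not -1.0 <= r_restricted <= 1.0:
        raise ValueError("correlation must lie between -1 and 1")
    if sd_restricted <= 0 or sd_reference <= 0:
        raise ValueError("standard deviations must be positive")
    k = sd_reference / sd_restricted  # ratio of reference SD to restricted SD
    return (r_restricted * k) / sqrt(1.0 - r_restricted ** 2 + (r_restricted * k) ** 2)

r_observed = 0.58          # Reading Test - SAT(V), predominantly black college (Table 6)
sd_black_college = 57.33   # SAT(V) standard deviation, predominantly black college (Table 7)
sd_university = 95.20      # SAT(V) standard deviation, University (Table 7)

r_estimated = correct_for_range_restriction(r_observed, sd_black_college, sd_university)
print(f"observed r = {r_observed:.2f}, estimate with wider range = {r_estimated:.2f}")
```

Under these assumptions the estimate is roughly .76, which falls within the
range of values (.61 to .76) reported for the other four institutions. This is
consistent with, but does not prove, the restriction-of-range explanation.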
Table 7
Means and Standard Deviations* of Scores Obtained
on the Regents' Reading Test and on Selected
Academic Variables by Students
at Five Different Institutions
[Hickman, 1973]
______________________________________________________________________________
                                          Institutional Types
                             _________________________________________________
                                                                 Predominantly
                                  Predominantly Non-Black            Black
                             ___________________________________ _____________
Academic                      Jr.      Jr.     4-Year                4-Year
Variables                   College  College  College  University   College
______________________________________________________________________________
Regents' Reading             63.25    63.47    64.94     67.16       48.65
Test Scores                  (9.39)   (9.57)   (9.30)   (10.24)      (7.96)
Other Test Scores
  SAT(V)**                  422.98   411.94   456.31    455.22      302.62
                            (83.82)  (85.38) (100.25)   (95.20)     (57.33)
  SAT(M)***                 431.90   448.92   451.92    471.30      328.66
                            (84.84)  (98.43)  (89.70)   (94.09)     (59.66)
Grade Point Averages
  Cumulative-High School      2.56     3.25     2.64      2.65        2.65
                              (.70)    (.60)    (.62)     (.67)       (.58)
  Cumulative-College          2.68     2.53     2.45      2.39        2.56
                              (.55)    (.65)    (.58)     (.68)       (.48)
  Freshman-College            2.57     2.45     2.37      2.31        2.64
                              (.58)    (.67)    (.65)     (.63)       (.48)
  Eng. Composition-College    2.54     2.20     2.50      2.49        2.89
                              (.75)    (.70)    (.71)     (.59)       (.69)
______________________________________________________________________________
Note: Correlations based on sample sizes ranging from 88 to 560.
*Standard deviations reported within parentheses.
**SAT(V) refers to the verbal section of the Scholastic Aptitude Test.
***SAT(M) refers to the mathematical section of the Scholastic Aptitude Test.
Section II
DEVELOPMENT AND VALIDATION OF THE ESSAY TEST
DEVELOPMENT OF THE TEST PROCEDURE
The first administration of the Regents' Test entailed two measures of
students' writing skills: an essay test and a multiple-choice grammar and usage
test. Students were given 30 minutes to work on each of these tests, and they
were assigned the essay topic on which they were to write.
The decision to assess students' writing skills in this way was made after
much debate between members of the Subcommittee on Testing and testing experts
about the manner in which writing could properly be appraised. The Subcommittee
members felt strongly that writing skills could be validly assessed only on the
basis of a writing sample. In contrast, the testing experts preferred a
multiple-choice test of writing skills because of the psychometric problem of
reliably scoring a writing sample and because of the administrative and economic
difficulties entailed in scoring large numbers of these samples. Inclusion of
both a 30-minute essay test and a 30-minute multiple-choice test of writing
(grammar and usage) was the compromise these two groups ultimately reached
(Johnson, 1980; Thompson & Rentz, 1973). Students were assigned their essay
topics in part because both the Subcommittee members and testing experts held
the view that if students were given a choice of topics they would spend too
much of the relatively short testing time selecting their topics and too little
of that time organizing and writing their essays. It was also thought that
permitting a choice would make unduly complicated the scoring procedures used to
rate the essays (Thompson & Rentz, 1973).
The multiple-choice writing test and the essay test that provided no choice
of topics were used in the Regents' Test until 1974, when several changes were
made in the writing assessment procedures. The multiple-choice writing test was
dropped from the Language Skills Examination after the Summer, 1974,
administration. This decision was made because data had been collected which
showed that students' essays were being reliably scored and the Subcommittee
concluded that students' grammar and usage skills could be considered by essay
raters when they read and scored the students' essays (Johnson, 1980).
Concurrently, the Subcommittee decided to give students 45 rather than 30
minutes to write their essays and to allow students a choice between two essay
topics to write on. These two matters were decided in light of data gathered by
the Regents' Testing Program office that showed that students' passing rates
might be improved by extending their essay-writing time to 45 minutes and that
these rates would not be adversely affected by giving students a choice between
two topics (Regents' Testing Program, 1974). In 1978, members of the Ad Hoc
Committee recommended that the time allowed for the essay test be extended to
the current 60-minute limit because they believed that the quality of students'
essays might improve if more time to work on them were allowed.
With respect to the essay topics presented in the Essay Test, the Testing
Subcommittee specified that these topics should be narrow enough to elicit an
essay in 60 minutes but broad enough to bear on students' common knowledge and
experiences rather than on any specialized knowledge that only some students
might have (Thompson & Rentz, 1973). Also, these topics were not to (1) contain
difficult vocabulary, (2) appear to have a rural-urban or an ethnic bias, (3)
closely resemble topics previously used, (4) involve highly controversial or
emotional subjects, or (5) seem to encourage students to identify their
institutions in their essays. It was also specified that the two topics
presented on a form of the test should be sufficiently different from one
another so that students had a reasonable choice between topics, e.g., one topic
might bear on a contemporary idea or event, and the other might bear on a
personal event or experience.
PROCEDURES FOR SELECTING ESSAY TOPICS
The topics used on the earliest forms of the Essay Test were written by the
Testing Subcommittee. More recently, proposed topics have been solicited
through the president of each institution, who is asked to obtain suggestions
for topics from students, faculty, and administrators. Over 950 suggestions
were proffered in response to the most recent request for essay topics, which
was made in the Fall Quarter of 1980.
The essay topics that are submitted are first reviewed by the Scoring
Coordinators and the members of the Testing Subcommittee, who select those
topics that conform to the specifications for the Essay Test. After being
revised where necessary, the topics selected are then submitted to the Academic
Committee on English for further revision and final approval. The approved
topics are then put in pairs by the Regents' Testing Program staff for use on
test forms. The Testing Subcommittee and Scoring Coordinators subsequently
review and revise the pairs of topics to ensure that the two topics presented on
each form are sufficiently different from each other to offer students a
reasonable choice between topics.
PROCEDURES FOR SCORING THE ESSAY TEST
Each student's essay is graded independently by three raters who use a
holistic scoring procedure to assign ratings to the essay. Often, raters who
use holistic scoring procedures are required to read an essay quickly to gain an
overall impression of its quality and to use standards established by the raters
themselves to assign a rating to it (Godshalk, Swineford, & Coffman, 1966). A
variant of this procedure is used to score the essays written for the Essay
Test. Raters of these essays do assign ratings that reflect their judgment of
the overall quality of an essay; however, the standards used to assign these
ratings have been formulated by the Testing Subcommittee rather than the raters
themselves. The development of these standards is described in a section below.
Procedures that entail holistic or impressionistic judgments of essays are
used in many large-scale testing programs because essays can be scored quickly
when these procedures are used. Because of this efficiency, it is possible to
have three raters read and score each essay submitted by the students who have
taken the Essay Test. The analytic scoring method, which is another method of
directly appraising an essay, requires a rater to evaluate individual features
of the essay. This method is more time-consuming and expensive than the
holistic approach, and it is not known to be any more reliable or valid (Coffman
& Kurfman, 1968; Ravan, Veal, & Rentz, 1974).
A disadvantage of the holistic approach is that holistic ratings provide no
information about the particular strengths and weaknesses of the rated essays.
A holistic rating reflects a judgment about the quality of the essay as a whole,
and the particular features of the essay that led to this rating are not
described by the rater. However, it is not reasonable to expect that the
results of a large-scale testing program provide extensive diagnostic
information. This kind of information is probably more appropriately and
effectively gained in the classroom. For a reliable diagnosis about the strong
and weak aspects of a student's writing, several essays should be written at
different times by the student and then appraised analytically. Students who
have taken the Essay Test and have questions about their essay scores can review
their essays with faculty who are familiar with the procedures used to rate the
Essay Test. Although it is not possible to determine from this review the
reasons why a particular score was assigned to an essay, strengths and
weaknesses in the essay can be pointed out by the faculty member and, more
importantly, further writing samples can be requested so that an accurate
diagnosis can be made.
Until 1981, the instructions that were given to essay raters included a
description of the scoring procedure they were to use and a set of model
essays. These instructions had been developed by the Testing Subcommittee.
In 1981, the instructions to the essay raters were expanded and a new set
of model essays was selected. These expanded instructions did not entail any
change in the scoring standards or criteria; minor revisions were made in the
description of the essay scoring procedure, and a new section was added wherein
answers to questions that raters commonly pose were provided. These revised
instructions, which are part of the Description of Essay Scoring Procedures,
were approved by the Academic Committee on English in February, 1981, and they
have been used by Regents' Test essay raters since Spring quarter, 1981.
In order to monitor the procedures for scoring essays, the members of the
Testing Subcommittee meet with the Scoring Coordinators each quarter after the
test has been administered and before the first essay scoring session. At this
meeting members of the group read essays written on the topics used that quarter
and look for any specific problems that raters could have in rating papers on
the topics. A discussion of anticipated problems provides the basis for the
guidelines that are developed for use in scoring the current essays. At the
meeting, the group also selects practice essays for use at the scoring sessions.
The ratings assigned to the practice essays are based on unanimous agreement of
the group members. At the scoring session held later in the quarter, Scoring
Coordinators discuss both the guidelines and the practice essays with essay
raters.
Rationale for Scoring Standards
As noted above, the standards used as a basis for scoring the essays
written for the Regents' Test were developed by the Testing Subcommittee, whose
members had an average of 20 years of experience in grading the compositions of
high school and college students. The Subcommittee's choice of standards was
based both on the members' extensive experience in appraising the writing of
students from widely different backgrounds and on its reading and discussion of
hundreds of essays written for the experimental form of the Essay Test
administered in the spring of 1971.
The following statement by a former chairman of the Testing
Subcommittee conveys the considerations on which the essay-scoring standards have
been based:
The fundamental assumption which justifies the whole test is
that language is the primary tool of thought as well as of
expression and that the inability to manipulate language
accurately not only impairs communication but blights the
entire thinking process. The writer, by giving up such
advantages as gestures, intonation, and the personal
interplay between speaker and hearer, is forced to a far
greater precision of diction and organization of material
than the speaker. Thus in the examination it was necessary
to demand tight, logical organization as well as precise word
choice and unambiguous phrasing.
. . . .What we sought to establish in the Rising Junior test
was a level of composition which would not disgrace a letter
of application for a job and which would convey clearly such
information as a chemist, a business man, or a policeman
would need to convey to his superiors, his colleagues, or his
customers. I believe that the standards which our committee
set represent a bare minimum of what the business world would
accept in terms of organization, coherence, explicitness, and
freedom from gross linguistic bad manners.
I believe that the professors who established the standards
for the essay, working as they did out of extensive and
nationwide experience of student writing, were thoroughly
qualified to define both the criteria by which writing should
be judged and the level at which it should be considered
minimally acceptable. I can testify that the labor, extend-
ing over a number of days and over hundreds of papers, was
carried out conscientiously and with a deep realization of
the responsibility which we carried. Although many believe
the standards too low, I believe we can defend the minimally
acceptable papers as only slightly below the standard re-
quired in most Freshman English courses; the Rising Junior
test is, after all, taken under somewhat more harassing
conditions than a Freshman in-class theme (Pendexter III,
undated).
Although a two-point, pass-fail rating scale would have adequately served
the purposes of the Essay Test, the Testing Subcommittee decided to use a
four-point scale so that students and their institutions would be given more
information about the quality of the students' writing performance. With three
acceptable levels of writing, exceptional writing performance would not go
unrecognized, and marginal levels of skill could be distinguished from the
clearly acceptable levels. The Subcommittee also decided that one failing score
was sufficient because, in the words of one committee member, "We all felt
strongly that distinguishing between failing and failing miserably would be
needlessly depressing to the student who received the lowest grade."
The use of a model essay to represent each division point on the four-point
scale was thought to provide the most effective means of precisely defining the
scale. Each of these models describes just one point on this scale. Were
models representing the ratings of "1," "2," "3," and "4" used, less precise
definition of the scale would invariably result because a range of performance
is represented by each of these ratings. The use of models representing the
mid-point of each rating on the scale would similarly result in a less
well-defined scale: when an essay is judged to fall between two midpoints, a
rater has no clear means for deciding what rating the essay should be assigned.
In order to have some systematic method of selecting model essays, in 1971
members of the Testing Subcommittee identified 22 aspects of writing and indi-
cated how these aspects should be weighted when the model essays were selected.
The committee indicated that 40% of the weight should be given to organizational
aspects, 40% to rhetorical aspects, and 20% to the mechanical aspects of the
essays considered. These categories and the specific features of writing to
which they pertain are listed in Tables 8 to 10 and are discussed in a
subsequent section where two studies that used the Testing Subcommittee's list
of 22 criteria are described.
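The 40/40/20 weighting can be expressed as a simple weighted sum. The sketch
below is hypothetical: the source states only that the Testing Subcommittee
used these weights as a guide when selecting model essays, not that a numeric
composite was ever calculated. The function name, the category keys, and the
example ratings are assumptions made for illustration.

```python
# Category weights stated by the Testing Subcommittee:
# organizational 40%, rhetorical 40%, mechanical 20%.
WEIGHTS = {"organizational": 0.40, "rhetorical": 0.40, "mechanical": 0.20}

def weighted_composite(category_means, weights=WEIGHTS):
    """Combine per-category mean ratings into one weighted value.

    Hypothetical helper: category_means maps each category name to the mean
    rating of the writing aspects in that category.
    """
    if set(category_means) != set(weights):
        raise ValueError("category names must match the weight table")
    return sum(weights[name] * category_means[name] for name in weights)

# Hypothetical essay: strong mechanics, weaker rhetorical development.
example = {"organizational": 3.0, "rhetorical": 2.0, "mechanical": 4.0}
composite = weighted_composite(example)
print(f"weighted composite = {composite:.2f}")
```

With these weights, a high rating on mechanical aspects contributes only half
as much to the composite as an equally high rating on organizational or
rhetorical aspects.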
During the first few years in which the Regents' Test was administered, the
Testing Subcommittee selected new model essays for each topic each quarter so
that essay raters could be given model essays written on the topics used in the
Essay Test administered that quarter. Selecting these "topic-specific" model
essays was time-consuming and, even with the careful procedures for selecting
these models, there was no way to ensure that the selected models were
equivalent. Therefore, in Spring 1976, it was suggested that one standard set
of model essays that could be used each quarter be identified and used in lieu
of the topic-specific sets of model essays that had to be changed each quarter.
Research subsequently conducted by the Regents' Testing Program in Summer, 1976,
indicated that students' pass rates on the Essay Test and rater reliability
would not be significantly affected by using the standard models in lieu of
topic-specific models. Consequently, the Testing Subcommittee agreed to select
a set of standard models that would be used as bases for rating all essays
written for the Regents' Test. These models were chosen from among the essays
that had been previously used as topic-specific models.
In 1978, the Ad Hoc Committee on the Regents' Test examined the standards
that were used to judge the quality of the essays written for the Essay Test.
After considering the procedures used to select the model essays that establish
these standards and appraising the rationale underlying these procedures, the Ad
Hoc Committee concluded that the essay scoring procedure was sound, and it
recommended that responsibility for setting the scoring standards remain with
the Academic Committee on English.
EVIDENCE OF CONTENT VALIDITY
As in the case of the Reading Test, the validity of the Essay Test rests
primarily on evidence that it is content valid. As noted in Section I above, a
test developer builds content validity into a test by (1) carefully considering
what content and skills the test should cover, (2) by carefully defining the
content and skills the test will assess, and (3) by choosing items for the test
that adequately represent the content and skill specifications. Experts in the
subject matter and skills to be assessed should be engaged at each of these
stages (Anastasi, 1976).
Forms of the Essay Test have been developed in such a manner. The aspects
of writing to be appraised by the test were discussed at length by the Testing
Subcommittee in 1971. These aspects were illustrated by the model essays that
have been used to define the essay scoring scale and are discussed in the
Description of Essay Scoring Procedures given to the raters who grade the
Regents' essays. With respect to the topics that the Essay Test is to cover, as
noted above, potential topics for the test are reviewed by the Testing
Subcommittee, and topics are selected that conform to the aforementioned criteria
used by this committee. The selected topics are then reviewed by the full
Academic Committee on English, which consists of representatives from all 33
institutions in the University System. Any topic that still appears
inappropriate or flawed is either revised, when possible, or eliminated. The Testing
Subcommittee and the Scoring Coordinators conduct a final review of the approved
topics after they have been paired for use on forms of the Essay Test.
OTHER EVIDENCE OF VALIDITY
Relation between Essay Test Scores and Other Measures of Writing Skill
As with the Reading Test, the validity of the Essay Test largely rests on the
quality of its content, but additional support for this validity can be gained
from studies of the relation between students' essay scores and other measures
of their writing skill; the finding that the scores on the two measures relate
well would provide support for the claim that the Essay Test does, in fact,
appraise students' writing skills (see Campbell, 1964; Campbell & Fiske, 1959).
Three studies have been conducted to study such relations. In the studies
by Ravan (1973) and Henderson (1977), Regents' essays that had been holistically
graded were rated again using another scoring method so that comparisons could
be made between the essays' holistic scores and the scores resulting from
another method of appraising the quality of writing in essays. In contrast,
Veal and Rentz (1975) compared students' holistic essay scores to their
performance on an objective test of writing skills in order to examine the
relation between the Essay Test and an entirely different approach to the
measure of writing skills. The results of these studies are described in the
paragraphs that follow.
Ravan (1973) and Henderson (1977) examined the relation between the holistic
scores students received on the Essay Test and the scores these students
obtained when their essays were analytically graded. In contrast to the global
rating of writing quality associated with the holistic method, when an essay is
appraised analytically, a rater reads the essay slowly and then rates individual
components of the essay. Both Ravan and Henderson had raters individually rate
22 components of the essays that they read. These 22 components are those that
the Testing Subcommittee specified in 1971 and used as the initial basis for
selecting the model essays that would define the score scale for the Essay Test.
If the holistic scores assigned to Regents' essays validly reflect the levels of
quality described by the essay score scale, these scores should establish a rank
order of essays that is similar to that which results when these essays are
analytically graded on the 22 components.
For her study, Ravan selected from the Winter, 1973, test administration 40
essays that had been given holistic scores of 1, 2, 3, or 4, with 10 essays
selected to represent each of these scores. Two raters then analytically graded
each of the 40 essays on the 22 components that were noted above and are listed
in Table 8. For the components pertaining to Organization and Rhetoric, the
raters used a 4-point scale that ranged from (1) substandard to (4) superior.
The components pertaining to Mechanics were also rated on a 4-point scale that
differed slightly in that it ranged from (1) demonstrates incompetence to (4)
demonstrates competence. The results of these analytic ratings are reported in
Table 8.
Table 8
Mean Analytic Ratings on 22 Components for Regents' Essays
Given Holistic Ratings of 1, 2, 3, and 4
[Ravan, 1973]
____________________________________________________________________________________
                                                     Holistic Scores
Analytic                                        _________________________  Component
Components                                        1      2      3      4     Means
____________________________________________________________________________________
Category of Organization:
 1. Limiting the Subject                        1.40   1.60   1.60   2.40   1.75*
 2. Evidence of a Thesis                        1.40   2.10   2.00   3.00   2.12*
 3. Development of Thesis: Unity                1.10   1.80   2.00   2.60   1.86*
 4. Development of Thesis: Logical Development  1.00   1.80   1.80   2.80   1.85*
 5. Development of Thesis: Coherence            1.00   1.50   2.00   2.70   1.80*
 6. Development of Thesis: Evidence             1.00   1.50   1.50   2.60   1.65*
Category Mean                                   1.15   1.7    1.82   2.68   1.84
Category of Rhetoric:
Diction:
 7. Clarity                                     1.20   1.80   2.80   2.90   2.17*
 8. Economy                                     1.30   1.80   2.50   2.80   2.10*
 9. Precision                                   1.20   1.90   2.50   2.90   2.12*
10. Consistency                                 1.80   2.20   2.90   3.20   2.53*
Sentence Structure:
11. Clarity                                     1.30   1.90   2.60   2.90   2.17*
12. Variety                                     1.60   2.00   2.80   3.10   2.38*
13. Economy                                     1.40   2.00   2.50   3.00   2.23*
14. Parallelism                                 1.90   2.70   2.60   3.20   2.60*
Paragraph Structure:
15. Unity                                       1.20   1.70   2.30   2.80   2.00*
16. Logical Development                         1.20   1.70   2.30   2.80   2.00*
17. Coherence                                   1.30   1.70   2.20   2.80   2.00*
Point of View:
18. Appropriateness                             1.10   1.60   2.40   3.00   2.02*
19. Consistency                                 1.00   1.80   2.10   2.80   1.92*
Category Mean                                   1.35   1.91   2.50   2.94   2.17
Category of Mechanics:
20. Spelling                                    2.60   3.30   3.80   3.80   3.38*
21. Punctuation                                 2.60   3.20   3.10   3.60   3.12
22. Usage                                       2.40   2.90   3.10   3.60   3.00
Category Mean                                   2.53   3.13   3.33   3.67   3.17
OVERALL ANALYTIC RATING                         1.45   2.02   2.43   2.97
____________________________________________________________________________________
Note: Each component mean reflects the analytic rating assigned by two raters to
10 essays given the specified holistic score.
* On this component, an analysis of variance indicated significant differences
in the mean analytic scores assigned to essays differing in their holistic
scores (p < .05).
As is shown by the final statistics listed in Table 8, Ravan found that the
overall analytic scores assigned to the 40 essays confirmed the rank order
established by these essays' holistic scores. That is, essays given higher holistic
scores also were assigned higher total analytic ratings, which
suggests that the holistic scores do effectively reflect the overall quality of
writing found in the Regents' Test essays. Ravan also found that the holistic
scores generally reflected the quality of individual features of essays that
were graded. As shown in the table, the analytic scores for the three
categories of Organization, Rhetoric, and Mechanics reflected a rank order like
that established by these essays' holistic scores. Thus, Ravan found the
expected qualitative differences in the organization, rhetoric, and mechanics of
essays assigned different holistic scores. Table 8 also shows that even for the
components of the analytic categories, the analytic ratings for all components
but punctuation and usage progressed from low to high in accord with the
holistic scores. For example, on all 22 components, the analytic scores
assigned to the essays rated holistically as 2's were higher than the analytic
scores assigned to the essays holistically rated as 1's. With respect to the
components of punctuation, usage, and also spelling, Ravan noted that both good
and poor essays were assigned relatively high scores on these mechanical aspects
of writing. These findings indicated that organizational and stylistic problems
rather than mechanics had impeded the communication of the poor writers in her
sample. Ravan's general conclusion was that her data demonstrated that the
holistic procedures used to grade the Essay Test were valid. As she noted, "the
results of the analytic procedure placed the forty essays in four ranks of essay
quality, ranks which correspond to the essay ranks previously determined (by
holistic scoring)" (p. 110).
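The rank-order comparison described above can be checked directly against the summary rows of Table 8. The following Python sketch is illustrative only and is not part of Ravan's study; the helper name is_strictly_increasing is hypothetical, and the values are the category means and overall analytic ratings as printed in Table 8.

```python
# Category means and overall analytic ratings copied from Table 8
# (Ravan, 1973), listed in order of holistic score 1, 2, 3, 4.
table8_means = {
    "Organization": [1.15, 1.7, 1.82, 2.68],
    "Rhetoric": [1.35, 1.91, 2.50, 2.94],
    "Mechanics": [2.53, 3.13, 3.33, 3.67],
    "Overall": [1.45, 2.02, 2.43, 2.97],
}


def is_strictly_increasing(values):
    """Return True when each value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# True for a row when its analytic means rise with the holistic score.
rank_order_confirmed = {
    name: is_strictly_increasing(means) for name, means in table8_means.items()
}

for name, confirmed in rank_order_confirmed.items():
    status = "increases" if confirmed else "does not increase"
    print(f"{name}: analytic mean {status} with holistic score")
```

Each of the four rows rises from holistic score 1 to holistic score 4, which is the pattern the text describes for the three categories and the overall rating.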
Henderson's study differed slightly from Ravan's in that he was interested
in the question of whether only those essays that had been holistically graded
as 1's and 2's (Failing and Barely Passing) would be judged to differ in quality
when analytically graded. For his study, Henderson selected 72 essays to be
analytically graded, where eighteen essays were obtained from each of four
institutions. Of the 72 essays, 36 had been given holistic grades of 1, and 36
had been given holistic grades of 2. Twelve raters graded each of the 72 essays
on the above-mentioned 22 components using a two-point scale having the values
(0) Fail and (1) Pass. Henderson then calculated the average component ratings
assigned to the 36 essays that had been holistically graded as 1's, and the
average component ratings assigned to the 36 essays holistically graded as 2's.
He also calculated the percentage of analytic "pass" ratings each set of 36 was
assigned on each of the 22 components.
As noted in Table 9, Henderson found that, on all components, the set of
essays holistically graded as 2's were assigned mean analytic ratings that were
significantly higher (p < .001) than the mean analytic ratings assigned to the
set that had been holistically graded as 1's. In Henderson's view, these
findings showed that the holistic grades of 1 and 2 were valid indicators of
different levels of writing quality. He noted, "(this) evidence that every one
of the twenty-two criteria does differentiate between a pass/fail holistic
rating for the essay unquestionably validates the holistic rating method at
Levels 1 (fail) and 2 (minimal pass). The (holistic) rating procedure is,
therefore, functional. The fact that the results are not just marginally
significant but significant at the .001 level of confidence reinforces the
degree of assurance for the validation contention" (p. 84).
Table 9
Mean Analytic Ratings on 22 Components for Regents' Essays
Given Holistic Ratings of 1 and 2
[Henderson, 1977]
______________________________________________________________________________
                                               Holistic Scores
Analytic                                        _____________   Component
Components                                        1       2       Means
______________________________________________________________________________
Category of Organization:
 1. Limiting the Subject                         7.08   10.28    8.68
 2. Evidence of a Thesis                         8.53   11.33    9.93
 3. Development of Thesis: Unity                 5.22    9.72    7.47
 4. Development of Thesis: Logical Development   3.56    8.28    5.92
 5. Development of Thesis: Coherence             3.53    8.25    5.89
 6. Development of Thesis: Evidence              3.67    7.75    5.71
Category of Rhetoric:
Diction:
 7. Clarity                                      6.22   10.00    8.11
 8. Economy                                      5.72    8.58    7.15
 9. Precision                                    4.39    8.22    6.31
10. Consistency                                  7.97   10.67    9.32
Sentence Structure:
11. Clarity                                      5.53   10.28    7.90
12. Variety                                      6.47    9.44    7.96
13. Economy                                      5.19    8.36    6.78
14. Parallelism                                  6.75   10.19    8.47
Paragraph Structure:
15. Unity                                        6.22   10.06    8.14
16. Logical Development                          2.87    6.94    4.90
17. Coherence                                    4.39    8.67    6.53
Point of View:
18. Appropriateness                              8.69   10.97    9.83
19. Consistency                                  8.25   10.78    9.51
Category of Mechanics:
20. Spelling                                     7.81   11.19    9.50
21. Punctuation                                  8.33   11.00    9.67
22. Usage                                        6.86   10.94    8.90
OVERALL ANALYTIC RATING                          6.06    9.63
______________________________________________________________________________
Note: Each component score reflects the analytic ratings (1 = pass, 0 = fail)
assigned to 36 essays by 12 raters when the 12 ratings assigned to each essay
are summed over essays and this total is divided by 36. For all components,
analyses of variance indicated that the analytic scores assigned to essays
holistically graded as 1's differed significantly from those assigned to essays
holistically graded as 2's (p < .05).
Examining the analytic scores in more detail, Henderson observed that the
essays given the failing holistic grades of 1 differed most markedly from those
given the passing holistic grade of 2 on components 3-6, 9, 13, 16, and 19,
which are components pertaining to organization and rhetoric. As is indicated
in Table 10, on these components only 13% to 40% of the analytic ratings
assigned to the 36 failing essays were "passes," whereas such passes were given
in 58% to 81% of the analytic ratings given to the 36 barely passing essays. In
contrast, Henderson noted that, in general, the failing essays did not have
serious mechanical problems; 57% to 69% of the analytic ratings assigned to
these essays on spelling, punctuation, and usage were "passes." On the basis of
these findings, Henderson concluded, in accord with Ravan, that mechanics had
not been the major factor causing the failures among the essays in his sample.
Table 10
Percentage of "Passing" Analytic Ratings Assigned on 22
Components to Regents' Essays Holistically Graded as 1's and 2's
[Henderson, 1977]
_________________________________________________________________________
                                               Holistic Ratings
Analytic                                        _____________
Components                                        1       2
_________________________________________________________________________
Category of Organization:
 1. Limiting the Subject                           59      87
 2. Evidence of a Thesis                           71      94
 3. Development of Thesis: Unity                   43      81
 4. Development of Thesis: Logical Development     30      69
 5. Development of Thesis: Coherence               29      69
 6. Development of Thesis: Evidence                31      66
Category of Rhetoric:
Diction:
 7. Clarity                                        52      83
 8. Economy                                        48      69
 9. Precision                                      37      68
10. Consistency                                    66      89
Sentence Structure:
11. Clarity                                        46      86
12. Variety                                        54      78
13. Economy                                        13      70
14. Parallelism                                    56      85
Paragraph Structure:
15. Unity                                          52      84
16. Logical Development                            24      58
17. Coherence                                      37      72
Point of View:
18. Appropriateness                                73      91
19. Consistency                                    68      90
Category of Mechanics:
20. Spelling                                       65      93
21. Punctuation                                    69      92
22. Usage                                          57      91
_________________________________________________________________________
Note: Each component score reflects the percentage of passing grades (1 = pass)
among the analytic ratings assigned by 12 raters to 36 essays.
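The means in Table 9 and the percentages in Table 10 appear to be two views of the same ratings: because each essay received 12 pass/fail ratings, a mean summed rating divided by 12 gives the proportion of "pass" ratings. The Python sketch below illustrates this arithmetic for three components of the essays holistically graded as 1's. It assumes this is how the percentages were derived, which the report does not state explicitly; the function name percent_passing is hypothetical.

```python
RATERS_PER_ESSAY = 12  # each essay was rated pass (1) or fail (0) by 12 raters

# Mean summed ratings for essays holistically graded as 1's,
# copied from Table 9 (Henderson, 1977).
table9_means_failing = {
    "Limiting the Subject": 7.08,
    "Evidence of a Thesis": 8.53,
    "Development of Thesis: Logical Development": 3.56,
}


def percent_passing(mean_summed_rating, raters=RATERS_PER_ESSAY):
    """Convert a mean summed pass/fail rating to a whole-number percentage."""
    return round(100 * mean_summed_rating / raters)


percentages = {
    name: percent_passing(mean) for name, mean in table9_means_failing.items()
}

for name, pct in percentages.items():
    print(f"{name}: about {pct}% of analytic ratings were passes")
```

Because the Table 9 means are themselves rounded, a percentage recomputed this way can differ by a point from the value printed in Table 10 for some components.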
The third investigation to be noted is that of Veal and Rentz (1975), who
examined the relation between the Essay Test and an objective writing test. The
objective test that these researchers used was a part of the Regents' Test at
the time they collected their data. It was composed of multiple-choice items
that appraised grammar and usage skills. From the Spring, 1972, administration
of the Regents' Test, Veal and Rentz obtained the scores of 292 students who had
taken both the objective test and the Essay Test. To examine the relation
between these two measures, Veal and Rentz first grouped the students into
quartiles on the basis of their objective test performance. They then calculated
the percent of students in each quartile who obtained passing scores on the
Essay Test. Since grammar and usage are two factors that Regents' Test raters
take into account when grading students' essays, it is reasonable to expect some
positive relation between performance on the objective test and the pass rates
that are attained on the Essay Test. However, this relation should not be very
strong because the description of the essay scoring procedures indicates that
factors such as organization and style also will influence the grades assigned
to Regents' Test essays.
Veal and Rentz's findings are presented in Table 11. In general, these
findings indicate that, as expected, there is a positive relation between
students' performance on the objective test and their scores on the Essay Test.
Veal and Rentz did not calculate a measure of the precise relation between these
measures, but it is evident from study of Table 11 that increases in students'
performance on the objective test were accompanied by increases in the
percentages of students who obtained passing Essay Test scores of 2, 3, or 4. Of
those students who performed in the lowest quartile on the objective test, for
example, only 38.7% attained passing status on the Essay Test. In contrast,
passing status was attained by 69.7% of the students performing in the second
quartile on the objective test, and higher pass rates were obtained by those
performing in the third and fourth quartiles on this test. Thus, Veal and
Rentz's findings show that students' performance on the Essay Test is related to
their performance on another test of their writing skills. This finding lends
some support to the claim that students' essay test scores are valid measures of
their writing skill.
Table 11
The Pass Rates Attained on the Regents' Essay Test by Students Performing
at Different Levels on an Objective Writing Test
[Veal & Rentz, 1975]
______________________________________________________________________________
                    Performance on Objective Writing Test
______________________________________________________________________________
             1st Quartile     2nd Quartile     3rd Quartile     4th Quartile
            (Mean = 40.55)   (Mean = 49.55)   (Mean = 55.18)   (Mean = 63.75)
______________________________________________________________________________
Pass rate        .387             .697             .806             .971
               (n = 78)         (n = 82)         (n = 69)         (n = 63)
______________________________________________________________________________
Relations between Essay Test Scores and Selected Academic Variables
As noted previously, additional evidence of a test's validity can be
provided by studies that show that the test is correlated in the manner expected
with other variables of interest. In the case of the Essay Test, positive
correlations between this test and certain academic variables are reasonable to
expect. As part of their studies of the Regents' Test, Hickman (1973) and
Prather and Smith (1975) examined the relations of students' Essay Test scores
to their high school and college grade point averages and to their scores on the
verbal and mathematical sections of the Scholastic Aptitude Test. The findings
from these two studies are given in Table 12.
Table 12
Findings from Two Studies on the Correlations
between the Regents' Essay Test and Selected Academic Variables
___________________________________________________________________________
Hickman Prather &
Academic Variables (1973)* Smith (1975)**
___________________________________________________________________________
Aptitude Measures
Scholastic Aptitude Test (Verbal) .37 .29
Scholastic Aptitude Test (Math) .21 .24
Grade Point Averages
Cumulative - High School .23 .11
Cumulative - College .26 .30
Freshman - College .21 .23
English Composition - College .20 .13
___________________________________________________________________________
*Correlations based on 660 to 892 students attending five different
institutions in the University System of Georgia.
**Correlations based on 1,910 students attending one university in the
University System of Georgia.
As is evident from the table, the correlations obtained in the two studies
were similar. Students' Essay Test scores were most strongly correlated with
their performance on the verbal section of the Scholastic Aptitude Test (SAT-V),
but this correlation was not high. Some relation between these two measures
should be expected. The SAT-V assesses individuals' verbal reasoning skills,
which undoubtedly influence the effectiveness of their writing, but it does not
directly assess those skills in organizing and expressing ideas that are also
germane to successful performance on the Essay Test. Also, the size of the
correlations between the SAT-V and the Essay Test reported by Hickman and by
Prather and Smith might have been higher if the four-point range of Essay Test
scores were not so small.
With respect to the low correlations between the mathematical section of the
SAT (SAT-M) and students' Essay Test scores, restriction of range also may be
influential here, but these correlations would be expected to be low in any
case. Although some portion of the ability influencing individuals' performance
on the SAT-M might also influence their writing skills, the SAT-M and the Essay
Test largely assess different skills.
Finally, the low correlations found between students' Essay Test scores and
their grade point averages (GPA's) should be considered. Since students' GPA's
can range over only a 5-point scale and the Essay Test scores are on a 4-point
scale, both variables involved in these correlations have restricted ranges that
limit the size of these correlations. Also, because many factors other than
writing skill affect students' grades in most courses that they take (Cronbach,
1971), the relation of their essay scores to their cumulative GPA's is unlikely
to be large. However, the low relation between the essay scores and students'
English composition grades is somewhat surprising. Presumably, these grades
primarily reflect students' writing skill. Hickman (1973) suggested that the
low relation was obtained because the essays graded in English composition
classes are unlike those entailed in the Essay Test in terms of both the time
allowed for writing and the audience to which the two types of essays are
addressed. Also, she noted that students' grades in composition are based on
several appraisals of the students' writing, whereas their Essay Test score is
based on only one. However, it is also possible that the low relation occurs
because the grading of Regents' essays is systematic, whereas the grades
assigned in English composition courses are based on standards that vary across
courses and instructors and are influenced by many factors that are
uncontrolled.
The meaning of the correlations reported by Hickman and by Prather and
Smith is clarified by data collected by Citron (1980). As part of a larger
study, she examined the relations of students' SAT-V scores and their grades in
English to the pass rates they attained on the Essay Test. For her study,
Citron collected data on 1,971 students who attended the Georgia Institute of
Technology and took the Regents' Test between Summer, 1977, and Spring, 1978.
Citron's findings are presented in Tables 13 and 14. As shown in Table 13,
performance on the SAT-V actually bears a stronger relationship to the passing
rates on the Essay Test than the correlations reported above convey. Only 26%
of the students with SAT-V scores between 200 and 299 passed the Essay Test,
whereas 75% of those with SAT-V scores between 400 and 499 were given passing
scores on this test, and all but a few of the students at the top of the SAT-V
score range passed the Essay Test. Like Hickman (1973) and Prather and Smith
(1975), Citron had obtained a low correlation (r = .285) between the SAT-V and
the Essay Test, but this correlation clearly understates the fairly strong
relation between performance on these two measures shown by Table 13.
Table 13
Relation between Scores on the Verbal Section
of the Scholastic Aptitude Test (SAT-V) and
Passing Rates on the Regents' Essay Test
[Citron, 1980]
________________________________________________________________
SAT-V Percent Passing
Scores Regents' Essay Test
________________________________________________________________
200-299 26 (n = 23)
300-399 61 (n = 152)
400-499 75 (n = 634)
500-599 86 (n = 760)
600-699 89 (n = 336)
700-800 92 (n = 66)
________________________________________________________________
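The group sizes and pass rates in Table 13 can be combined to confirm that the six SAT-V bands account for the full sample and to estimate the overall pass rate. The Python sketch below is illustrative arithmetic only and is not part of Citron's analysis; because the table reports whole-number percentages, the estimated number of passing students is approximate.

```python
# (SAT-V band, percent passing Essay Test, number of students),
# copied from Table 13 (Citron, 1980).
table13 = [
    ("200-299", 26, 23),
    ("300-399", 61, 152),
    ("400-499", 75, 634),
    ("500-599", 86, 760),
    ("600-699", 89, 336),
    ("700-800", 92, 66),
]

# The band sizes should add up to the 1,971 students in the sample.
total_students = sum(n for _, _, n in table13)

# Approximate count of passing students implied by the rounded percentages.
approx_passers = sum(pct / 100 * n for _, pct, n in table13)
overall_pass_rate = approx_passers / total_students

# Spread in pass rate between the lowest and highest SAT-V bands.
spread_in_points = table13[-1][1] - table13[0][1]

print(f"Students in table: {total_students}")
print(f"Approximate overall pass rate: {overall_pass_rate:.1%}")
print(f"Lowest to highest band: {spread_in_points} percentage points")
```

The difference of 66 percentage points between the lowest and highest bands is the contrast that the correlation coefficient alone does not convey.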
In Table 14, where the relation of students' Essay Test scores to their
English GPA's is displayed, a marked relation between these two measures also is
evident. Among those students with English GPA's in the 1.0 to 1.9 range, only
58% passed the Essay Test. In contrast, 78% of those having English GPA's
between 2.0 and 2.9 and 90% of those with GPA's of 3.0 or above passed the test.
Although a correlation was not calculated for the data presented in Table 14, the
relation between Essay Test scores and English GPA is evidently stronger than
that conveyed by the low correlation coefficients calculated by Hickman and by
Prather and Smith.
Table 14
Relation between English Grade Point Averages
and Passing Rates on the Regents' Essay Test
[Citron, 1980]
__________________________________________________________________
English Grade Point Percentage Passing
Averages Regents' Essay Test
__________________________________________________________________
0 - .99 NA*
1.0 - 1.99 58 (n = 168)
2.0 - 2.99 78 (n = 804)
3.0 - 4.00 90 (n = 90)
__________________________________________________________________
*The pass rate is not given for students having English GPA's
less than 1.0 because it would not be meaningful. Citron
assigned GPA's of 0 to the 219 students in her sample who
had not taken English and grouped these students with the 21
students who had actually obtained English GPA's of less than
1.0.
Relation Between Essay Test Scores and Irrelevant Variables
As was suggested apropos of the Reading Test, support for a claim of test
validity is also gained by findings that a test does not relate to variables
that are thought to be unrelated to the quality assessed by the test (see
Campbell, 1964). With respect to the Essay Test, two variables that should be
unrelated to individuals' scores on this test are those of handwriting and
neatness since essay raters are not directed to consider these two variables
when grading Regents' essays. Therefore, by finding that the holistic scores
assigned to essays are unaffected by the handwriting and neatness evident in
the essays, support is gained for the validity of the claim that the holistic
scores are not influenced by such irrelevant variables. The impact on these
scores of two other presumably irrelevant variables, speededness and bias due
to the influence of ethnic background, was also examined and is discussed in
the pages that follow.
HANDWRITING AND NEATNESS. Gwinn and Renfrow (1980) investigated the effects of
handwriting on the holistic scores assigned to Regents' essays and found no
evidence that handwriting affected these scores. In their study, these
researchers used three essays that were on the borderline between passing and
failing. Each essay was copied three times, once in a very clear, neat
handwriting, once in an average handwriting, and once in a legible but very poor
handwriting. Eighteen graders then rated the three essays, with each grader
rating one essay from each of the three levels of handwriting. The mean
ratings for essays written in good, average, and poor handwriting were 1.33
(s=.49), 1.44 (s=.51), and 1.56 (s=.62), respectively. An analysis of variance
showed that there were no significant differences in the mean essay ratings
assigned to the essays that differed in the quality of their handwriting
[F(1,17) = 1.21; p > .05].
In 1981, Renfrow conducted a second study examining the effects of
handwriting on essay ratings. In this case, Renfrow analyzed a systematic sample
of 154 essays written by students at Georgia State University during the Fall,
1979, administration of the Regents' Test. As part of a larger study, she
rated each of these essays on the composite variable of handwriting and
neatness. A nine-point scale was used for her ratings. The handwriting ratings
were then correlated with the sums of the three holistic ratings assigned to
the essays by Regents' Test raters in a regular scoring session. A very small
negative correlation (r = -.11) was found between handwriting and the essay
ratings. Thus, the conclusions of the second study were similar to those of
the first: student handwriting does not seem to be a salient characteristic
affecting the rating of essays written for the Regents' Testing Program;
students who write good essays in poor handwriting do not seem to be penalized
for their handwriting, and students who write poor essays do not seem to be
rewarded for good handwriting.
SPEEDEDNESS. As in the case of the Reading Test, the Essay Test is intended to
be a power test, not a speed test. That is, students' Essay Test scores are
expected to primarily reflect the quality of their writing skills rather than
the rate at which they can compose and write an essay.
Because students who are administered the Essay Test select and write on
just one topic, in effect the test is a one-item test. Therefore, some of the
less complex methods of appraising test speededness cannot be applied to the
Essay Test because these methods pertain only to tests that have multiple items
(see Donlon, 1978; Rindler, 1979). The ideal, albeit most complex, way to
appraise speededness would be to compare the scores that students obtain on
parallel test forms that they have taken in both timed and untimed
administrations (see Cronbach & Warrington, 1951). This kind of experimental study, while
highly desirable, has not been conducted for the current 60-minute Essay Test
both because of its administrative complexity and because of the difficulty
entailed in creating test conditions like those students actually encounter when
taking the Essay Test.
Some studies of the Essay Test were carried out when the time limit on this
test was less than 60 minutes. These studies do not unequivocally demonstrate
that the 60-minute limit introduces no speededness, but their findings do
suggest that students' scores on the 60-minute Essay Test are less likely to be
influenced by the factor of speededness than they would be if the test were
shorter.
One of these studies was carried out by the Regents' Testing Program in
1974 to investigate, in part, the effect on students' performance of an increase
in the time limits on the Essay Test from 30 minutes to 45 minutes. In this
study, one to three classes in English composition at each of five institutions
in the University System were randomly assigned to 30-minute and 45-minute test
administrations. In all, 338 students were allowed 30 minutes and 291 students
were allowed 45 minutes to take the Essay Test. The essays that were written
were then graded at the regular Winter Quarter grading session. Subsequently,
the essay scores were analyzed, and it was found that 216 (74%) of the students
working under the 45-minute time limit had attained passing scores, whereas only
220 (65%) of the students working under the 30-minute time limit had passed the
test. Thus, the 15-minute increase in the time limit produced a higher pass
rate. This finding suggests that speededness may have been a factor that
depressed performance on the 30-minute test, and that an extension of the time
limit to 45 minutes would diminish the effects of this factor and improve test
performance.
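The pass rates quoted above follow directly from the reported counts. The short Python sketch below reproduces that arithmetic; it is illustrative only, uses the counts given in the text, and does not test whether the difference is statistically significant, which the report does not address.

```python
# Counts reported for the 1974 time-limit study.
passed_45, total_45 = 216, 291  # students allowed 45 minutes
passed_30, total_30 = 220, 338  # students allowed 30 minutes

rate_45 = passed_45 / total_45
rate_30 = passed_30 / total_30

# Difference between the two pass rates, in percentage points.
difference_in_points = 100 * (rate_45 - rate_30)

print(f"45-minute pass rate: {rate_45:.1%}")
print(f"30-minute pass rate: {rate_30:.1%}")
print(f"Difference: {difference_in_points:.1f} percentage points")
```

The two rates round to the 74% and 65% cited in the text, a difference of roughly nine percentage points.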
In light of these findings, the time limits on the Essay Test were
subsequently extended to 45 minutes, and Henderson (1977) later concluded that these
limits should be extended further to one hour. As previously noted, Henderson
examined 22 analytic, component ratings assigned to Regents' essays that had
been given holistic scores of 1 (Fail) and 2 (Barely Passing). In the
discussion of his findings, Henderson noted that both the failing and the barely
passing essays had been given relatively low analytic scores on the components
pertaining to thesis and paragraph development and to economy and precision of
diction. In Henderson's view, these low scores could have been a product of the
45-minute time limit on the Essay Test. He contended that this time limit
prevented students both from developing their theses fully and from revising
their writing to make its wording more economical and precise. Henderson
recommended that the time limit on the test be extended from 45 minutes to one
hour. This was done in 1978.
The impact of the 60-minute time limit on students' test performance has
not been formally examined, but there is no evidence from informal studies of
students' essays that they have difficulty completing the test: the final
paragraphs of most students' essays are found to be complete, and students usually
make revisions in their essays, which suggests that they have time to review and
edit their work. Also, as Willig (1980) determined from his survey of
composition teachers in Georgia, most teachers of writing (71%) think that the
1-hour time limit is appropriate. Thus, there appears to be little evidence
that the 60-minute limit is hindering students' test performance. However, a
formal study of the performance effects of the 60-minute limit would be carried
out if problems produced by this limit became evident.
BIAS DUE TO THE INFLUENCE OF ETHNIC BACKGROUND. In the case of the Essay Test,
bias can be viewed in the same way as that suggested apropos of the Reading Test,
that is, as a matter related to the validity of the test. More specifically,
bias can be thought of as pertinent to the finding that the scores on a test do
not have the same meaning for all groups of individuals who are tested. This
finding is undesirable because individuals' scores on a test should reflect the
skill or the ability one wishes to assess and not irrelevant variables such as
group membership.
Some of the types of internal and external analyses described as appro-
priate for detecting bias in the Reading Test are also applicable in the
investigation of bias in the Essay Test. As noted previously, internal analyses
include considerations of the content of the test, the difficulty of the test,
its internal consistency, and the like, in the interest of determining whether
the test behaves differently for different groups that are assessed. External
analyses include studies of the relations between the test and other variables
in order to examine whether these relations are the same for different groups
(see Jensen (1980) for a detailed discussion of these two types of analyses).
Internal and external analyses that have been conducted on the Essay Test
bear on the question of whether scores on this test have the same meaning for
groups of black and white students that are assessed. The findings from these
studies are reported in the pages that follow. The results of the internal
analyses are treated prior to those from the external analyses.
Studies of bias based on internal analyses
ANALYSES OF TEST CONTENT. As the validity of the Essay Test, like that of
the Reading Test, is primarily determined by the validity of its content,
studies of this content comprise one important means of detecting factors that
could distort the meaning of the test for members of different groups (Shepard,
1981). Two types of studies of test content should be carried out. One of
these entails an examination of the content validity of the test, that is, a
study of (1) the clarity with which the domains of skills to be assessed by the
test are described, and (2) the degree to which the items written for the test
represent these domains (APA et al., 1974). The second type of study entails
examining the items of a test to detect any occurrences in these items of
ambiguous wording, stereotypic images, or unfamiliar language that might alter
the meaning of the test for the members of a particular group.
The procedures used to appraise the content of the Essay Test have been
previously described, but can be briefly summarized here. The Testing
Subcommittee selected model essays to define the manner in which the Regents'
Test essays should be appraised and provided both analyses of these essays and a
detailed description of the manner in which the essays should be scored. The
Testing Subcommittee also defined the content of the Essay Test, indicating that
essay topics used on the Regents' Test should be narrow enough to elicit an
essay in 60 minutes, but broad enough to bear on students' common knowledge and
experiences rather than on specialized knowledge that only some students might
have (Thompson & Rentz, 1973). Also, the topics were not to (1) contain
difficult vocabulary, (2) appear to have a rural-urban or ethnic bias,
(3) closely resemble topics previously used, (4) involve highly controversial or
emotional subjects, or (5) seem to encourage students to identify their
institutions in their essays.
Potential essay topics are solicited from students, faculty, and
administrators in the University System so that a pool of diverse topics becomes
available for use on the Essay Test. These topics are reviewed by the Testing
Subcommittee, and those that do not fit its specifications for the Test are
either revised or eliminated. The topics that are regarded as acceptable are
subsequently submitted to the full Academic Committee on English for further
consideration and final approval. This Committee, composed of representatives
from all 33 institutions in the System, considers the wording and difficulty of
the topics and suggests revisions and deletions in the list of topics where
necessary. Since four members of the committee come from the historically black
institutions in the University System, the committee's review of essay topics is
expected to detect any topics having content that might not be equally familiar
to black students and white.
ANALYSES OF TEST RESPONSES. Analyses of individuals' responses to a test
constitute a second component of an internal analysis designed to detect the
presence of bias. These analyses are comparative in nature, conducted to
discern whether there are differences in the test responses of different groups
that might indicate the presence of bias.
To detect biasing features of test content, it is usual to conduct a study
of the relative difficulties of a test's items for members of different groups
(Shepard, 1981). For the Essay Test, this type of study has not yet been done
because of its administrative complexity: to conduct such a study, numerous
Essay Tests would have to be administered either to the same samples of black
and white students or to many samples of black students and white students that
are equivalent in ability. Such a study, however complex, is desirable to carry
out and shall be conducted by the Regents' Testing Program office if its
inherent problems can be worked out.
Because the essays written for the Essay Test are not scored mechanically,
the raters who grade these essays constitute a second possible source of bias
(see Guion, 1978). Raters' appraisals of an essay may be influenced by
variables that are unrelated to writing quality so that, as a consequence, these
variables inappropriately affect the ratings given to members of a particular
ethnic group.
This matter was investigated by the Regents' Testing Program office in
1974. Of interest in the study was the question of whether raters who differed
in race assigned grades to the essays written by black students that were dif-
ferent from those that were assigned to the essays written by white students.
The raters of Regents' essays are given no information concerning the race of
the essay writers. Therefore, if students' race is found to affect the essay
scores that raters of different race assign, this bias effect is probably due to
subtle features of students' essays and would not be recognized unless such a
study were done.
The study involved 3,218 essays that were graded at the regular, quarterly
scoring sessions held at three Regents' Essay Scoring Centers. It was not
possible to determine the race of the students or the raters. Therefore, these
raters and students were classified in terms of their affiliation with a
predominantly black or a predominantly non-black school. The effects of
students' institutional type, raters' institutional type, and the interaction of
these factors on students' essay scores were examined by calculating the mean
scores assigned to the essays that had been classified by students' and raters'
institutional type. These calculations are presented in Table 15 and depicted
in Figure 1.
                                  Table 15

      Mean Ratings* Assigned by Essay Raters from Predominantly Black
  and Non-Black Institutions to Essays Written by Students from Predominantly
                      Black and Non-Black Institutions
______________________________________________________________________________
                                        Raters' Institutions
                               _______________________________________
Essay Writers'                 Predominantly        Predominantly
Institutions                   Non-Black            Black
______________________________________________________________________________
Predominantly Non-Black        1.89                 1.96
                               (.72)                (.72)
                               n = 5,777            n = 3,094

Predominantly Black            1.38                 1.44
                               (.58)                (.57)
                               n = 495              n = 288
______________________________________________________________________________
*Standard deviations given within parentheses
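One way to read Table 15 is as a two-by-two layout of cell means in which
rater bias would show up as an interaction: raters of one institutional type
would favor writers of one type more than the other raters do. The Python
sketch below computes that difference of differences from the four means
printed in the table. The key labels, the function name, and the choice of a
simple difference-of-differences contrast are illustrative assumptions; the
source does not say that the Regents' Testing Program office computed this
quantity, and no significance test is implied.

```python
# Cell means from Table 15, keyed by
# (essay writer's institution type, rater's institution type).
# The labels are illustrative; the values are those printed in the table.
means = {
    ("non_black", "non_black"): 1.89,
    ("non_black", "black"): 1.96,
    ("black", "non_black"): 1.38,
    ("black", "black"): 1.44,
}


def rater_gap(writer_type):
    """Mean given by raters from black institutions minus mean given by
    raters from non-black institutions, for one group of writers."""
    gap = means[(writer_type, "black")] - means[(writer_type, "non_black")]
    return round(gap, 2)


gap_non_black_writers = rater_gap("non_black")
gap_black_writers = rater_gap("black")

# Difference of differences: a value near zero means the two rater groups
# treated the two writer groups alike.
interaction = round(gap_non_black_writers - gap_black_writers, 2)

print(gap_non_black_writers, gap_black_writers, interaction)
```

On these means the rater gap is almost the same for both groups of writers,
which is the pattern the report describes as an absence of rater bias.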
As is somewhat evident in Table 15 and very clear in Figure 1, raters from
the predominantly black institutions assigned both groups of students slightly
higher mean essay scores than did raters from predominantly non-black
institutions. However, the raters from both types of institutions were similar
in that both gave higher mean essay scores to the students from predominantly
non-black institutions than they did to those students from the predominantly
black schools. Thus, raters from both types of institutions rated students from
predominantly black and non-black institutions in the same way and did not
appear to be differently influenced by type of institution from which an essay
writer came. Therefore, there was no evidence from this study that raters
showed bias in the grades they assigned to the essays written by students from
predominantly black and non-black institutions.
Studies of bias based on external analyses
Hickman (1973) examined the relations between the Essay Test and selected
academic variables as part of a larger study, and her findings permit some
rough determination of whether these relations differ for black and non-black
groups of students. In her study, Hickman calculated these relations for each
of five different institutions. This group of five institutions included one
four-year college that was predominantly black, and two junior colleges, one
four-year college, and one university that were primarily non-black. For each
of these five institutions, Hickman calculated the correlations between
students' Essay Test scores and both their scores on the Scholastic Aptitude
Test and their grade-point averages. Thus, through a comparison of the
correlations for the predominantly black college with those found at the four
predominantly non-black institutions, some tentative conclusions can be drawn
about the similarities or differences in the meanings of black students' and
non-black students' essay scores.
Of course, such conclusions must be regarded as rough because the
institutions in Hickman's study were not homogeneous with respect to race and,
therefore, are not adequate indicators of students' race. Also, the
generalizability of these findings is questionable because Hickman used in her
study only one predominantly black college, which may or may not be
representative of other black institutions or black students in the University
System.
The results of Hickman's correlational analyses are presented in Table 16.
As is evident from the table, the correlations reported for the predominantly
black college are, in general, like those reported for the second junior
college she examined, and they are higher than the correlations reported for
the four-year college, the university, and the other junior college that were
examined. For example, the correlations between students' Essay Test scores and
their SAT(V) scores are substantially higher at the predominantly black college
(r = .46) and at the second junior college (r = .45) than are these
correlations when calculated for the other three institutions that Hickman
studied (r's = .29 to .32). The correlations between the Essay Test and
students' grade point averages conform to a similar pattern, with the
correlations for the predominantly black college and the second junior college
exceeding those reported for the other three institutions examined.

                                  Table 16

          Correlations between the Regents' Essay Test and Selected
             Academic Variables Within Five Different Institutions
                               [Hickman, 1973]
________________________________________________________________________________
                                          Institutional Types
                            ____________________________________________________
                                                                   Predominantly
                                  Predominantly Non-Black              Black
Academic                    ______________________________________  ____________
Variables                   Jr.      Jr.      4-Year                4-Year
                            College  College  College  University   College
________________________________________________________________________________
Test Scores
  SAT(V)*                     .29      .45      .30      .32          .46
  SAT(M)**                    .07      .30      .10      .17          .20
Grade Point Averages
  Cumulative-High School      .20      .31      .13      .18          .35
  Cumulative-College          .25      .41      .25      .26          .41
  Freshman-College            .22      .33      .19      .20          .40
  Eng. Composition-College    .21      .39      .03      .30          .44
________________________________________________________________________________
Note: Correlations based on sample sizes ranging from 88 to 302.
*SAT(V) refers to the verbal section of the Scholastic Aptitude Test.
**SAT(M) refers to the mathematical section of the Scholastic Aptitude Test.

It is not possible to ascertain from Hickman's data the reasons why the
correlations for the predominantly black college should exceed those reported
for three of the four other institutions that she studied. Also, it is not
possible to ascertain whether similar findings would be obtained if other
institutions in the System were compared. It can be said, however, that the
correlations that Hickman reported indicate that, for the institutions studied,
the Essay scores obtained by students from the predominantly black institutions
generally bear a slightly stronger relation to selected academic variables than
do the scores obtained by students from the predominantly non-black
institutions.
Faculty Perceptions of the Regents' Essay Test
The perception of a test by those who use it can shed light on validity.
If, for example, a test were perceived to have inappropriate content or to
measure irrelevant variables, there is basis for questioning whether the
measure will provide the type of information that is desired. In the case of
the Regents' Essay Test, it is useful to examine how it is regarded by faculty
members in the University System of Georgia, since these faculty teach the
students who take the test and may often be involved in their remediation. If
it were found that this faculty believes that the test has improper content or
unsound scoring procedures or that the test is detrimental in its effects, the
validity of the test would have to be questioned.
In the interest of determining faculty members' perceptions of the Regents'
Test, Willig (1980) distributed a questionnaire to composition teachers in the
University System of Georgia. Many of the questions posed pertained to the
Essay Test, since a few critics of the test had publicly questioned certain
features of this test (see House, 1980; Watters, 1979).
Willig sent a questionnaire to each of the 498 full-time faculty members
who taught composition in the System. Approximately 60% of these questionnaires
were completed and returned. Of these respondents, approximately 15% had taught
composition fewer than five years, and 61% had taught it for more than ten
years.
A summary of the responses to the questionnaire is presented in Table 17.
In general, the composition teachers indicated support for the Regents' Test
and lack of agreement with the critics of the Test. Over 75% of the respondents
indicated that they were in favor of the Test and indicated that it is "a
meaningful check of minimal reading and writing skills." Most of the
respondents indicated that the one-hour time limit is sufficient (71%), that
the writing of one essay is adequate (85%), and that the present anonymous
system of grading is preferable to grading done on individual campuses (82%).
Furthermore, most of the respondents indicated that the Essay Test neither
overemphasizes nor underemphasizes grammar (74%) and that the Test does not
discriminate against black students (75%).

                                  Table 17

           Faculty Responses* to Questions about the Regents' Test
                               [Willig, 1980]
________________________________________________________________________________
What is your overall opinion of the Regents' Test?
    I am in favor of it. (75%)
    I am opposed to it. (18%)
    I am neutral. (7%)

How does the existence of the Regents' Test affect the teaching of
composition at your institution?
    It tends to improve the overall teaching of composition. (57%)
    It tends to harm the overall teaching of composition. (17%)
    It has no noticeable positive or negative effect. (26%)

    The test is a meaningful check of minimal reading and writing
    ability. (76%)
    The test is not a meaningful check of minimal reading and writing
    ability. (**)
    I have no opinion. (**)

    The test discriminates against black students. (13%)
    The test does not discriminate against black students. (75%)
    I have no opinion. (12%)

    Students should have more than one hour to write the required
    essay. (24%)
    One hour is sufficient writing time for the limited demands of this
    essay. (71%)
    I have no opinion. (5%)

    The writing section of the Regents' Test emphasizes grammar too
    much. (5%)
    The writing section of the Regents' Test emphasizes grammar too
    little. (10%)
    The emphasis on grammar is appropriate. (74%)
    I have no opinion. (11%)

    The writing section of the Regents' Test should require more than one
    essay. (8%)
    The writing section of the Regents' Test is adequate with one
    essay. (85%)
    I have no opinion. (7%)

    The Regents' Test should be graded by professors on the campus where
    it is administered. (13%)
    The anonymous "mass-grading" of the Regents' Test as presently done is
    adequate. (82%)
    I have no opinion. (5%)

    Students should be allowed to use a dictionary during the writing
    section. (67%)
    Students should not be allowed to use a dictionary during the writing
    section. (25%)
    I have no opinion. (8%)
________________________________________________________________________________
* Responses reported in terms of percent of respondents choosing each option.
** Data not available.

The one criticism of the Test that received support from the majority of
respondents concerned the rule against students' use of a dictionary for the
Essay Test. (Use of a dictionary has never been permitted because of the
problems with test administration and test security that providing dictionaries
would cause. Raters of the Essay Test are aware that students do not have
access to dictionaries and are supposed to take this into account while rating.
There is no evidence that spelling is a major cause of failing scores
[Henderson, 1977; Ravan, 1973].)

Chapter III
ADDITIONAL TECHNICAL INFORMATION
Reliability of Reading Test Scores
Reliability is concerned not with what a test measures but with how con-
sistently it measures whatever is measured. Unreliability in test scores
results from variation due to chance factors such as guessing, the health of
an examinee on a particular day, and the sampling of test items. There are
different methods for examining reliability that take into account different
types of random error and provide different types of information about
consistency. Most traditional estimates of reliability such as alternate
forms correlations and internal consistency coefficients are based on
correlational methods. Thus, these estimates provide information on the
consistency with which examinees are rank-ordered and are highly dependent on
the variability of test scores. Because discriminating among examinees is not
the major purpose of the Regents' Test and no attempt is made to maximize
variability, the internal consistency and alternate forms reliability
estimates for this test must be interpreted with caution.
Because the Regents' Reading Test is used to determine whether students
score above a predetermined criterion rather than to compare students with
each other, the most important estimate of reliability is an estimate of the
consistency with which pass-fail decisions about students are made. In order
to obtain such an estimate, a representative group of examinees should be
given two different forms of the Reading Test under the same conditions and
with no instruction between the two test administrations. The similarity of
pass-fail decisions yielded by the two administrations could then be examined.
Unfortunately, such a study has not been conducted on the Regents' Reading
Test because of practical problems. The major problem is the difficulty of
administering two forms under the same conditions. For example, if one form
were to be used for practice and another for the official test administration,
the two administrations would differ on conditions such as student motivation
and anxiety. An alternative is to conduct two official test administrations
and use the student's higher score as the final score. However, implementing
this study would cause problems. It would not be fair to use a sample of
students or schools, as this would give some students a greater opportunity to
pass the test than others. On the other hand, the study could not be
implemented at all institutions because some would find it administratively
impossible to give the Reading Test twice to all students.
Because it was not feasible to administer two forms of the test under the
same conditions to a representative sample, other data that had been collected
were used to estimate the results of such a study. This analysis, as well as
alternate forms and internal consistency reliability of the Reading Test, is
discussed in the remainder of this section.
Consistency of Decisions
The consistency of pass-fail decisions was examined through a comparison
of the results from two quarters for a sample of examinees who had initially
failed the Regents' Test. This comparison is less than ideal for two reasons:
1) the sample of examinees is not representative because it consists only of
students who initially failed one or both parts of the Regents' Test, and 2)
some students in the sample had remedial work between the two administrations
which, if effective, should cause inconsistency in pass-fail classifications.
Despite these problems, the data provide useful information about the
consistency of decisions.
The sample consisted of the 2,613 students who took Form 15 of the
Reading Test in Winter, 1979 and repeated the Regents' Test with Form 16 of
the Reading Test in Spring, 1979. Table 18 shows the classifications of
students on the two administrations of the test.
                              Table 18

                 Classification of Repeaters on Two
                 Administrations of the Reading Test

                                 Classification, Test 1
                                   FAIL          PASS
                              |-------------|-------------|
                              |      a      |      b      |
                      FAIL    |     230     |      99     |
  Classification,             |     8.8%    |     3.8%    |
  Test 2                      |-------------|-------------|
                              |      c      |      d      |
                      PASS    |     360     |    1924     |
                              |    13.8%    |    73.6%    |
                              |-------------|-------------|
Consistency is indicated by the proportion falling in cells a and d.
This proportion, which is called the coefficient of agreement, is .824. This
value is misleading because some of the inconsistency reflected in cell c is
the result of remediation rather than error. Because some of the students who
failed took remediation before repeating the test, the lower failure rate on
the second administration is to be expected. Therefore, only cell b provides
an unambiguous indication of inconsistency due to unreliability. If it is
assumed that cell b is in fact the best estimate of unreliability of
classification, and if that value is used as an estimate for cell c, then
total unreliability is approximately 8%. In other words, classification
consistency is estimated at 92%.
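The arithmetic behind these figures can be retraced from the cell counts in
Table 18. The Python sketch below is offered only as a check on the reasoning
in the text; the variable names are illustrative, and doubling the cell b
proportion is simply the assumption stated above (cell b used as the estimate
for cell c), not an independent method.

```python
# Cell counts from Table 18 (rows: Test 2 result, columns: Test 1 result).
a = 230   # failed both administrations
b = 99    # passed Test 1, failed Test 2
c = 360   # failed Test 1, passed Test 2
d = 1924  # passed both administrations

n = a + b + c + d

# Coefficient of agreement: proportion classified the same way both times.
agreement = (a + d) / n

# Adjusted estimate described in the text: cell b stands in for cell c,
# so total unreliability is taken as twice the cell b proportion.
unreliability = 2 * b / n
adjusted_consistency = 1 - unreliability

print(f"n = {n}")
print(f"coefficient of agreement = {agreement:.3f}")
print(f"estimated unreliability  = {unreliability:.1%}")
print(f"estimated consistency    = {adjusted_consistency:.1%}")
```

The unrounded unreliability is a little under 8%, which the report rounds to
8% and, correspondingly, to a consistency of 92%.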
While the estimated consistency of classification of 92% is quite high,
this value would be higher if a representative sample of students had been
included in the study. Because the sample consisted of repeaters only, the
mean of the sample was lower than that of the total group, and more students
had scores near the cutoff score. Inconsistency of classification is more
likely for scores bordering on the cutoff score than for those well above the
cutoff; therefore, the inconsistency in the sample should be an overestimate
of inconsistency in the total group.
Correlation Between Alternate Forms
The data from the study described above were also used to estimate the
alternate forms reliability coefficient for the Regents' Reading Test. The
correlation between Form 15 scores and Form 16 scores for the sample of
repeaters was .70. This coefficient is an underestimate of the alternate
forms reliability because the variability of the sample was smaller than the
variability usually found in the total group and because some of the
inconsistency is the result of remedial work taken by some students between
the two administrations. Also, as noted above, this coefficient must be
interpreted in light of the purpose of the Regents' Test. The coefficient
indicates the extent to which the examinees' relative positions were
maintained from the first test to the second test. The Regents' Test is not
developed to maximize this type of consistency because discriminating among
students is not the major purpose of the test.
Internal Consistency
The KR-20 reliability estimates for the two most recently used forms of
the Reading Test are presented in the following table.
Table 19
KR-20 Reliability
Estimates for Form 17 and Form 20
__________________________________________________________
Quarter Form No. of Items N KR-20
_____________________________________________________________
Spring 1981 17 69 7757 .865
Spring 1982 17 69 7169 .869
Summer 1982 20 58 3180 .864
_____________________________________________________________
Although these KR-20 reliability coefficients provide information about
the reliability of discriminating among examinees and may be underestimates of
the reliability of the Regents' Reading Test, the values, which are quite
high, provide additional evidence of Reading Test reliability.
Additional KR-20 reliability estimates were calculated for a sample of
116 examinees taking both Form 20 of the Regents' Reading Test and Form 1A of
the STEP reading test in Summer Quarter, 1982. For this sample, the KR-20
reliability of the Regents' Reading Test was .88, and the KR-20 reliability of
the STEP was .89. This comparison provides further evidence that the
reliability of the Reading Test is quite high: its reliability is comparable
to that of the STEP, a test that has been designed to provide maximum
discrimination among students in the entire range of scores.
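For readers unfamiliar with the statistic, KR-20 for a test of k items scored
right or wrong is (k / (k - 1)) * (1 - sum of p*q over items / variance of
total scores), where p is the proportion answering an item correctly and
q = 1 - p. The Python sketch below illustrates the computation on a made-up
response matrix of five examinees and four items. The function name, the toy
data, and the use of the population (divide-by-N) variance are assumptions
made for illustration; they are not taken from the Regents' Testing Program's
own computations.

```python
def kr20(responses):
    """KR-20 for a matrix of 0/1 item scores (rows = examinees, columns = items)."""
    n = len(responses)
    k = len(responses[0])

    # Sum of the item variances p * q.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        sum_pq += p * (1 - p)

    # Population variance of the total (number-right) scores; this divide-by-N
    # convention is an assumption of the sketch.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n

    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical data: five examinees, four items.
toy_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(round(kr20(toy_responses), 3))
```

Because the coefficient depends on the variance of total scores, a group with
little spread in scores will yield a lower value, which is why the report
treats such coefficients cautiously for a pass-fail test.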
Analysis of Reading Test Items
Item analysis results for the Spring Quarter, 1984, administration of
Form 23 of the Regents' Reading Test are provided in Table 20. Presented for
each item is the following information: item classification, the percentage of
students choosing each of the four options, the p-value, the point-biserial
correlation, the biserial correlation, and the Rasch difficulty.
TABLE 20
Item Analysis Data for the Spring, 1984
Administration of Form 23
________________________________________________________________
ITEM ITEM % CHOOSING EACH OPTION P- PB BIS
NUMBER CLASS 1 2 3 4 VALUE CORR CORR
________________________________________________________________
1 Literal 2 93* 3 2 93 .19 .35
2 Literal 0 2 3 94* 94 .33 .66
3 Inference 4 1 93* 2 93 .33 .62
4 Analysis 88* 6 5 1 88 .16 .27
5 Literal 2 78* 5 14 78 .42 .59
6 Inference 2 7 87* 4 87 .35 .55
7 Vocabulary 4 10 1 85* 85 .36 .56
8 Vocabulary 23 7 67* 3 67 .36 .47
9 Literal 86* 4 8 2 86 .24 .38
10 Inference 74* 17 2 6 74 .39 .53
11 Inference 80* 2 2 16 80 .23 .33
12 Vocabulary 6 75* 13 6 75 .32 .44
13 Inference 8 55* 2 35 55 .41 .52
14 Analysis 7 6 2 85* 85 .37 .57
15 Analysis 14 4 16 65* 65 .27 .35
16 Vocabulary 4 8 22 66* 66 .45 .58
17 Vocabulary 30 56* 4 10 56 .42 .53
18 Literal 75* 5 4 17 75 .39 .54
19 Inference 11 10 73* 5 73 .32 .44
20 Inference 3 76* 10 11 76 .43 .58
21 Analysis 6 5 77* 11 77 .37 .52
22 Analysis 4 6 2 88* 88 .31 .51
23 Inference 48* 13 30 9 48 .41 .51
24 Inference 82* 1 1 16 82 .37 .55
25 Literal 1 2 96* 1 96 .22 .49
26 Inference 8 4 6 82* 82 .39 .58
27 Analysis 79* 8 5 8 79 .41 .59
28 Vocabulary 4 74* 11 11 74 .43 .58
29 Inference 11 5 6 78* 78 .39 .54
30 Inference 4 3 88* 5 88 .36 .59
31 Inference 88* 4 4 3 88 .35 .57
32 Literal 3 3 81* 13 81 .49 .70
33 Analysis 30 44* 24 1 44 .42 .52
34 Vocabulary 2 88* 6 4 88 .45 .72
35 Literal 6 85* 3 5 85 .44 .67
36 Inference 76* 8 13 2 76 .30 .41
37 Vocabulary 91* 2 3 4 91 .36 .63
38 Literal 19 6 2 72* 72 .43 .58
39 Inference 2 91* 3 3 91 .31 .54
40 Literal 2 72* 16 9 72 .42 .55
41 Literal 1 1 0 98* 98 .24 .64
42 Analysis 8 2 81* 9 81 .39 .56
43 Inference 81* 12 5 2 81 .28 .40
44 Inference 1 5 81* 13 81 .39 .56
45 Literal 12 74* 1 13 74 .27 .37
46 Vocabulary 1 96* 1 2 96 .22 .48
47 Analysis 12 17 4 67* 67 .31 .40
48 Analysis 50* 8 29 12 50 .38 .48
49 Vocabulary 3 1 94* 1 94 .28 .55
50 Inference 1 6 80* 12 80 .45 .64
51 Vocabulary 5 86* 2 6 86 .43 .68
52 Literal 28 61* 4 5 61 .31 .39
53 Inference 3 10 5 80* 81 .34 .48
54 Vocabulary 72* 20 4 3 72 .56 .74
55 Inference 6 75* 7 9 75 .51 .70
56 Vocabulary 82* 6 8 2 82 .45 .66
57 Analysis 76* 8 3 9 76 .56 .77
58 Inference 9 4 26 58* 58 .56 .71
59 Analysis 5 4 4 84* 84 .44 .66
60 Inference 19 26 18 32* 32 .37 .49
The "ITEM CLASS" column indicates the skill category classification of
each item. The four classifications included on the Regents' Reading Test are
Vocabulary (VOC), Literal Comprehension (LIT), Inferential Comprehension
(INF), and Analysis (ANA). These skills were briefly described in Table 1.
The "% CHOOSING EACH OPTION" columns indicate the percentage of students
that chose each distractor and the percentage that chose the correct answer.
The correct answer for each item is indicated with an asterisk (*).
The p-value is the percentage of students getting the item correct. It
is identical to the percentage choosing the option marked with an asterisk.
The point-biserial correlation (PB CORR) is the Pearson product moment
correlation between item score (correct-incorrect) and total test score. It
is an index of item discrimination and indicates the extent to which those who
did well on the total test tended to get the item right more often than those
who did less well on the test. The maximum value of point-biserial
correlations is always less than 1.0, and the maximum value for very hard or
very easy items is less than the maximum value for items of middle difficulty.
Thus, the values of the point-biserial correlations are dependent on item
difficulty and should be interpreted in light of these difficulties.
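As a concrete illustration of this definition, the Python sketch below
computes a point-biserial correlation as an ordinary Pearson correlation
between right-wrong item scores and total scores. The data are hypothetical
and the function name is illustrative; the total score here includes the item
itself, which matches the description above but is still an assumption about
how the reported values were obtained.

```python
from math import sqrt


def point_biserial(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and total test scores."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_t = sum(total_scores) / n
    cov = sum(
        (x - mean_x) * (t - mean_t) for x, t in zip(item_scores, total_scores)
    ) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in item_scores) / n)
    sd_t = sqrt(sum((t - mean_t) ** 2 for t in total_scores) / n)
    return cov / (sd_x * sd_t)


# Hypothetical data: five examinees' scores on one item and on the whole test.
item = [1, 1, 1, 0, 0]
totals = [4, 3, 2, 1, 0]

r_pb = point_biserial(item, totals)
print(round(r_pb, 3))
```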
The biserial correlation (BIS CORR), which is always higher than the
point-biserial correlation, is an estimate of the Pearson product moment corre-
lation between the total test score and a hypothetical continuum of
performance on an item. While this estimate is based on some assumptions that
are not always tenable, the biserial correlation is useful in that it provides
an index of discrimination that is independent of item difficulty.
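The relation between the two indices can be sketched with the standard
textbook conversion, in which the biserial equals the point-biserial times
sqrt(p*q) / y, where y is the height of the normal curve at the point dividing
it into proportions p and q. The Python sketch below applies this conversion
to item 5 of Table 20. The function name is illustrative, and whether the
tabled values were produced with exactly this formula is an assumption; small
discrepancies are expected because the published values are rounded.

```python
from math import sqrt
from statistics import NormalDist


def biserial_from_point_biserial(r_pb, p):
    """Convert a point-biserial correlation to a biserial correlation
    for an item answered correctly by proportion p of examinees."""
    q = 1 - p
    z = NormalDist().inv_cdf(p)  # normal deviate dividing the curve into p and q
    y = NormalDist().pdf(z)      # height of the normal curve at that point
    return r_pb * sqrt(p * q) / y


# Item 5 of Table 20: p-value 78 (that is, .78) and point-biserial .42.
# The table reports a biserial of .59 for this item.
print(round(biserial_from_point_biserial(0.42, 0.78), 2))
```

The multiplier sqrt(p*q) / y is greater than 1 for every p between 0 and 1,
which is why the biserial exceeds the point-biserial in magnitude.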
The last column of the table presents the Rasch difficulty values (RASCH
DIFF) for each item. The Rasch difficulty values are transformations of the
p-values. These difficulties are described in more detail in the section of
this chapter concerned with equating. Of interest here is the fact that the
difficulties can be related to the cutoff score on the test. A scale score of
61, which is the minimum passing score, corresponds to a Rasch difficulty
value of 1.1. More than 50% of students with scores at or above the cutoff
correctly answer items with difficulties below 1.1; less than 50% of the
students at the cutoff correctly answer items with difficulties greater than
1.1.
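The 50% statement follows from the form of the Rasch model, in which the
probability of a correct answer is exp(ability - difficulty) divided by
1 + exp(ability - difficulty). The Python sketch below evaluates that
probability for a student exactly at the cutoff. It assumes that ability and
difficulty are expressed on the same logit scale, so that the cutoff
corresponds to an ability of 1.1; the three item difficulties are
hypothetical.

```python
from math import exp


def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the Rasch model (values in logits)."""
    return 1 / (1 + exp(-(ability - difficulty)))


CUTOFF_ABILITY = 1.1  # assumed ability for a scaled score of 61, per the text

# Hypothetical items: one easier than, one equal to, and one harder than the cutoff.
for difficulty in (0.1, 1.1, 2.1):
    p = rasch_probability(CUTOFF_ABILITY, difficulty)
    print(f"difficulty {difficulty:.1f}: P(correct) = {p:.2f}")
```

An item whose difficulty equals the cutoff value is answered correctly with
probability .50 by a student at the cutoff; easier items fall above .50 and
harder items below it.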
The results of the item analysis are not used as the sole basis for selec-
tion of items for forms of the Regents' Reading Test, as item content is con-
sidered more important than performance data. However, the item analysis data
are routinely used in the revision of items. For example, a popular
distractor for an extremely hard item may be revised to make the item easier;
an item with a low discrimination index is carefully examined for any evidence
that the item is ambiguous and needs revision.
When an item is used for the first time on a form of the Reading Test,
the results of the item analysis are used to determine whether the item should
be included in the scoring of the test. Occasionally, the data indicate
problems that were not foreseen in the test development process. When
problems with an item are found, the item is not included in the scoring or
equating of the test. Two such items, items 12 and 57, were deleted from the
scoring of Form 20 of the Regents' Reading Test. Thus, Form 20 consisted of
58 rather than 60 items that were scored and used in the equating.
Equating of Reading Test Forms
The score that is used to describe a student's performance on the Reading
Test is a translation of the student's total raw (number right) score to a
standard score scale that is common to all forms of the test. Scaled scores
rather than raw scores are used so that a student's score is independent of
the particular form of the test taken. It is not possible to develop alter-
nate forms of a test that are equivalent in difficulty; some forms are
slightly easier or more difficult than others. Therefore, it is important
that these differences in difficulty be taken into account in the reporting of
scores. The scaled scores take these differences into account: a scaled score
of 61 indicates the same level of skill regardless of the relative difficulty
of the particular form taken. For one form of the test, a scaled score of 61
may represent 70% of the items answered correctly; for a slightly more difficult
form, a scaled score of 61 may represent 68% of the items answered
correctly. Because of the use of scaled scores, a student is not penalized
for taking a more difficult form or rewarded for taking an easier form.
Furthermore, the use of the scaled scores allows performance comparisons to be
made from one quarter or year to another; although different forms are used,
equal scaled scores indicate the same level of performance from one
administration to another.
Before students' reading scores are reported, each form of the Reading
Test must be equated to the common score scale. This equating is accomplished
through the use of a bank of items whose difficulties have been calibrated
with Rasch procedures. The Rasch model is a latent trait model that expresses
the probability of an examinee's correctly answering an item as a function of
two parameters: the item difficulty and the examinee's ability (or
achievement, in the case of the Regents' Test). Item difficulty and examinee
ability are calibrated on the same logistic scale. In fact, item difficulty
is defined as the point on the ability scale at which an examinee has a .50
probability of getting the item correct.
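
The relationship between ability, difficulty, and the probability of a correct answer can be illustrated with a minimal sketch of the standard one-parameter logistic (Rasch) function. The function name and the example item difficulties of 0.1 and 2.1 are hypothetical; only the cutoff value of 1.1 comes from this chapter.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the Rasch model.

    Both arguments are expressed on the same logistic (logit) scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Logit value that this chapter associates with the passing scaled score of 61.
CUTOFF_ABILITY = 1.1

# An examinee at the cutoff facing items of three different difficulties.
# The difficulties 0.1 and 2.1 are made-up illustration values.
p_equal = rasch_probability(CUTOFF_ABILITY, 1.1)   # difficulty equals ability
p_easier = rasch_probability(CUTOFF_ABILITY, 0.1)  # easier item
p_harder = rasch_probability(CUTOFF_ABILITY, 2.1)  # harder item

print(p_equal, p_easier, p_harder)
```

When difficulty equals ability the probability is .50, which is the definition of item difficulty given above; easier items yield probabilities above .50 and harder items yield probabilities below .50.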
An advantage of using the Rasch model for equating tests is that, if the
data fit the model, the estimates of item difficulty are independent of the
ability of the particular group of examinees, and the estimates of examinee
ability are independent of the particular set of items on the test. When
items drawn from a Rasch-calibrated item bank are used to construct a test,
the test should automatically be equated to other tests constructed from the
item bank (Wright, 1977). When difficulties are calibrated, the ability of
the sample of examinees is taken into account so that, unlike the traditional
p-value, the difficulty estimate should be the same regardless of whether the
particular sample is of relatively high or low ability.
Whenever an item is used on a form of the Regents' Reading Test, a Rasch
difficulty for the item is estimated. If the item appears to be functioning
appropriately, it is included in an item bank. When a new form of the test is
developed, items from the item bank, as well as new and revised items, are
included on the form. The items from the bank, which are to be used in the
equating, are chosen on the basis of fit to the Rasch model as well as on the
basis of content. After the new form is administered, Rasch difficulties are
calculated for all items. Fit to the model is re-examined for those items
with bank difficulties, and stability of these difficulties is checked. An
item is not used in the equating if a problem is found. Through a comparison
of the bank difficulties and the difficulties calculated from the
administration of the new form, an equating constant is derived. This
constant is used in conjunction with further Rasch analyses to convert
number-right scores to scores on the common scale.
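
The report does not give the formula for the equating constant. The sketch below assumes one common approach to Rasch linking: the constant is taken as the mean difference between the bank difficulties and the new-form difficulties of the common items. The item labels and difficulty values are hypothetical.

```python
from statistics import mean


def equating_constant(bank: dict, new_form: dict) -> float:
    """Mean shift (bank minus new form) over items present in both sets.

    Assumption: a simple mean-difference link; the report does not
    state which linking formula was actually used.
    """
    common = bank.keys() & new_form.keys()
    if not common:
        raise ValueError("no common items available for equating")
    return mean(bank[item] - new_form[item] for item in common)


# Hypothetical difficulties in logits.
bank = {"A": -0.50, "B": 0.20, "C": 1.10}
new_form = {"A": -0.80, "B": -0.10, "C": 0.80, "D": 0.40}

shift = equating_constant(bank, new_form)

# Place every new-form difficulty, including the new item D, on the bank scale.
on_bank_scale = {item: d + shift for item, d in new_form.items()}
print(round(shift, 2), round(on_bank_scale["D"], 2))
```

In this sketch an item whose difficulty proved unstable would simply be omitted from the `bank` dictionary before the constant is computed, which parallels the exclusion of problem items described above.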
The tentative conversion table derived from the process described above
is then verified through other procedures. Because accuracy of equating at
the cutoff score is crucial, the raw score to scale score conversion of scores
near the cutoff in the tentative conversion table is examined with equipercen-
tile and conditional p-value procedures.
Items that are common to the new form and one of the existing forms serve
as a basis for equipercentile equating. The new form and the existing form
are equated to scores on the common items; scores on the two forms correspon-
ding to the same scores on these items are presumed to be equivalent (Angoff,
1971). While the number of common items is usually not sufficient for equi-
percentile equating of scores throughout the score scale, the results are
useful as a means of verifying the Rasch equating. The scores examined are
those in a narrow range around the cutoff, and these scores have sufficiently
high frequencies to allow interpretation of the results of equipercentile
equating.
An additional procedure used to verify the equating is an examination of
conditional p-values for items common to the new form and an existing form
(Burk, 1980). This procedure is based on the assumption that examinees with
the same scaled score should have the same level of achievement on common
items regardless of the test form used to obtain the scores. The adequacy of
equating is verified through a comparison of the proportion of examinees with
equivalent scores from the two forms who correctly answer items common to both
forms. The equating is considered accurate if the two forms yield similar
p-values for examinees with equivalent scores near the cutoff.
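
The conditional p-value comparison can be sketched as follows. This is an illustration only: the data layout (pairs of scaled score and correctness on one common item), the score band of 59 to 63, and the sample records are assumptions, not details taken from Burk (1980) or from the Regents' analyses.

```python
def conditional_p_value(records, score_band):
    """Proportion answering a common item correctly among examinees
    whose scaled scores fall inside score_band (inclusive)."""
    low, high = score_band
    answers = [correct for score, correct in records if low <= score <= high]
    if not answers:
        raise ValueError("no examinees in the requested score band")
    return sum(answers) / len(answers)


# Hypothetical records for one common item: (scaled score, answered correctly).
existing_form = [(60, True), (61, True), (61, False), (62, True), (75, True)]
new_form = [(60, True), (61, False), (62, True), (62, True), (40, False)]

band_near_cutoff = (59, 63)  # assumed narrow band around the cutoff of 61
p_existing = conditional_p_value(existing_form, band_near_cutoff)
p_new = conditional_p_value(new_form, band_near_cutoff)

print(p_existing, p_new, abs(p_existing - p_new))
```

A small difference between the two proportions for each common item would be consistent with accurate equating near the cutoff; how small a difference is acceptable is a judgment the report does not quantify.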
For Form 20, the equipercentile and conditional p-value procedures veri-
fied the equating near the cutoff produced by the Rasch model. When Form 16
was equated, the verification procedures and other analyses indicated that a
small adjustment was needed. Further examination of the Rasch equating indicated
some instability in the difficulties of a few items. When this instability
was taken into account, the raw score corresponding to the cutoff was
lowered by one point. The equating was then consistent with the results of
the equipercentile and conditional p-value methods.
Before scores are reported, the logistic scale scores obtained from the
Rasch model and verified through the equipercentile and conditional p-value
procedures are linearly transformed to a scale that ranges from 0 to 99.
Specifically, each logistic score is multiplied by 10 and increased by 50
points. This transformation has no effect on the equating; its only purpose
is to place the scores on a more convenient scale.
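
The linear transformation can be checked with a short sketch. The multiplier of 10 and the addition of 50 come from the text; the rounding to whole numbers and the clipping to the range 0 to 99 are assumptions made for illustration.

```python
def logit_to_scaled(logit: float) -> int:
    """Convert a logistic (Rasch) score to the reported 0-99 scale.

    Source rule: multiply by 10 and add 50.
    Assumptions: the result is rounded to a whole number and values
    outside 0-99 are clipped to the ends of the scale.
    """
    scaled = round(logit * 10 + 50)
    return max(0, min(99, scaled))


cutoff_scaled = logit_to_scaled(1.1)   # the logit value tied to the cutoff
midpoint_scaled = logit_to_scaled(0.0)
ceiling_scaled = logit_to_scaled(6.0)  # would be 110 without clipping

print(cutoff_scaled, midpoint_scaled, ceiling_scaled)
```

A logistic score of 1.1 maps to the scaled cutoff of 61, which agrees with the Rasch difficulty value of 1.1 reported for the minimum passing score.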
The final conversion table for Form 23 is presented in Table 21. The
cutoff score, a scaled score of 61, is equivalent to a raw score of 43 on this
form.
Table 21
Raw Score* to Scaled Score Conversion Table
for Form 23 of the Regents' Reading Test
_________________________________________________________________
Raw Score   Scaled Score     Raw Score   Scaled Score
_________________________________________________________________
     1             5             31            51
     2            12             32            52
     3            17             33            53
     4            20             34            53
     5            23             35            54
     6            25             36            55
     7            27             37            56
     8            28             38            57
     9            30             39            58
    10            31             40            58
    11            33             41            59
    12            34             42            60
    13            35             43            61**
    14            36             44            62
    15            37             45            63
    16            38             46            64
    17            39             47            65
    18            40             48            67
    19            41             49            68
    20            42             50            69
    21            43             51            71
    22            44             52            72
    23            45             53            74
    24            45             54            76
    25            46             55            78
    26            47             56            81
    27            48             57            84
    28            49             58            88
    29            49             59            96
    30            50             60            99
_________________________________________________________________
*The raw score is the number of items answered correctly.
**minimum passing score
Reliability of Essay Test Scores
The two major sources of inconsistency on the Essay Test arise from the
sampling of writing and the rating of essays. Students write one essay at
each test administration, and three graders rate each essay. The reliability
of the Essay Test could be increased by having students write more than one
essay on more than one day with more than three raters grading each essay.
However, practical considerations limit the extent to which the reliability
can be increased.
In any consideration of reliability, the importance of the decision to be
made on the basis of a test score must be taken into account. It would not be
reasonable to base a major decision such as eligibility for graduation on only
one writing sample. However, under Regents' policy, a student has many
opportunities to take the test before graduation. Thus, if a student who
fails the test complies with policy, he or she would have written on more than
one topic on more than one day and received grades from numerous raters before
fulfilling other requirements for graduation. Furthermore, a student may
continue to take the test as many times as necessary after all other
requirements for graduation have been completed. Thus, no irreversible
decisions about college graduation are made on the basis of Regents' Test
results.
As discussed in the section on Reading Test reliability, information on
reliability should be obtained by administering different forms of the test
under the same conditions to a representative group of examinees. Consistency
of classifications made on the basis of the different administrations could
then be examined. Such a study has not been conducted for the Essay Test
because of the practical problems discussed in the section on Reading Test
reliability. The reliability of raters, however, is reported each year, and
additional studies have been conducted to examine this reliability. This
information on rater reliability is described in the remainder of this section.
Rater Reliability Reports
Every Fall Quarter, the performance of each rater who graded essays in
any of the previous four quarters is examined. Reported for each rater are
the number of days rated each quarter; the total number of days rated for four
quarters; the total number of essays rated; the number of essays rated per
day; the agreement percentage, which is the percentage of essays for which the
rating was identical with the rating of at least one of the other two raters
scoring the essay; and the percentage of 1's, 2's, 3's, and 4's given over the
four-quarter period.
Copies of this report are provided to members of the Academic Committee
on English and to Scoring Coordinators. Committee members receive the report
on raters from their institutions, and Scoring Coordinators receive reports on
all raters attending their scoring centers. Raters are instructed to examine
their performance statistics in relation to the average performance statistics
for the system. Committee members are instructed to use the statistics to
identify discrepant raters at their institutions and, when raters with
discrepant performance are identified, to make sure that they are not sent to
another scoring session until they have reviewed the Essay Scoring Manual and
the source of the discrepancies has been identified.
The rater performance summary statistics for Fall, 1980, through Summer,
1981, are presented in Table 22. These data are provided with the rater
reports to assist in the interpretation of data for individual raters. An
indication of rater reliability is the "Percentage Agreement" statistic of
80.22. This value is the mean percentage agreement defined as the percentage
of essays for which an individual rater's rating is identical to the rating of
at least one of the other two raters scoring the essay. Also provided is a
different type of percentage agreement statistic, the percentage of times at
least two of the three raters agreed on the essay rating. This agreement for
1980-1981 was 97.2%.
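
The two agreement statistics defined above differ in what they count, and a minimal sketch makes the distinction concrete. The rater labels, the data layout (one dictionary of three ratings per essay), and the four sample essays are hypothetical.

```python
def percentage_agreement(essays, rater):
    """Percentage of a rater's essays on which that rater's rating equals
    the rating of at least one of the other two raters."""
    rated = [essay for essay in essays if rater in essay]
    if not rated:
        raise ValueError("rater did not score any essays")
    hits = sum(
        any(score == essay[rater]
            for other, score in essay.items() if other != rater)
        for essay in rated
    )
    return 100.0 * hits / len(rated)


def two_of_three_agreement(essays):
    """Percentage of essays on which at least two raters gave the same rating."""
    hits = sum(len(set(essay.values())) < len(essay) for essay in essays)
    return 100.0 * hits / len(essays)


# Hypothetical ratings on the four-point scale; each essay has three raters.
essays = [
    {"r1": 2, "r2": 2, "r3": 1},
    {"r1": 1, "r2": 2, "r3": 2},
    {"r1": 2, "r2": 2, "r3": 2},
    {"r1": 3, "r2": 1, "r3": 2},
]

r1_agreement = percentage_agreement(essays, "r1")
essay_agreement = two_of_three_agreement(essays)
print(r1_agreement, essay_agreement)
```

In the second sample essay the rater labeled r1 disagrees with both colleagues, which lowers that rater's individual percentage, yet the essay still counts toward the two-of-three statistic because the other two raters agree. This is why the essay-level figure is higher than the rater-level figure.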
Table 22
Rater Performance Summary Statistics
for Fall, 1980 through Summer, 1981
                                                   System Average
Number of Essays Rated Per Day                         112.94
  (based on average of 4.15 hrs. per day)
Percentage Agreement                                    80.22
Distribution of Ratings:
    Rating        Percentage
      1             38.69
      2             55.13
      3              5.76
      4               .41
__________________________________________________________________________
Distribution of Essay Scores (This is the distribution of student scores
reported to institutions; use the above distribution for comparisons with
rater reports.)
    Score         Percentage
      1             36.58
      2             60.12
      3              3.21
      4               .09
Percentage of times at least two of the three raters agreed: 97.2
(Use "Percentage Agreement" above for comparisons with rater reports.)
Results of Review Process
The results of the review process for essay scores provide information
about the frequency with which mistakes are made in the rating of essays.
Since January, 1980, students who failed the Essay Test with failing scores
from two raters and a passing score from one rater have had the opportunity to
request a re-evaluation of the essay by a systemwide review panel. This
review panel consists of members of the Testing Subcommittee and the Scoring
Coordinators. The specific procedures for review are described in the Essay Scoring Manual.
The review process was implemented so that mistakes in the rating of
essays could be corrected. The results of the review process indicate that
very few mistakes are made in the rating of failing essays. From Fall, 1980,
through Summer, 1981, 11,331 essays received failing scores. Of these, 6,042
were given a passing grade by one rater and were thus eligible for review. Of
these essays, 201 were submitted for re-scoring on the recommendation of the
on-campus review panels. In the final review, 56 essays were given passing
scores by the systemwide review panel. Thus, failing grades were reversed on
review for fewer than 1% of the essays that were eligible by virtue of
receiving one passing rating. Furthermore, the members of the systemwide
review panel have noted that very few serious mistakes in essay ratings are
found; most of the essays that pass on review are on the borderline between
passing and failing, and, even when the systemwide reviewers assign passing
grades to these essays, they can also identify reasons why they were failed on
the initial scoring.
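
The percentages implied by these counts can be verified directly. The counts below are those reported in the preceding paragraph; the variable names are illustrative.

```python
# Counts reported for Fall, 1980, through Summer, 1981.
failing_essays = 11331      # essays that received failing scores
eligible_for_review = 6042  # failing essays with one passing rating
submitted = 201             # essays sent on for re-scoring
reversed_on_review = 56     # essays passed by the systemwide panel

pct_eligible = 100 * eligible_for_review / failing_essays
pct_submitted = 100 * submitted / eligible_for_review
pct_reversed = 100 * reversed_on_review / eligible_for_review

print(round(pct_eligible, 1), round(pct_submitted, 1), round(pct_reversed, 2))
```

The last figure, about 0.93%, is the basis for the statement that fewer than 1% of eligible essays had their failing grades reversed.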
Additional Analyses
Singleton examined the reliability of Regents' Test essay ratings through
four different procedures. Her summary of the types of estimates used and of
the results of her analyses is provided in Table 23 (1976, p. 106).
Table 23
Estimates of Rating Reliability for the Essay Portion of the Language Skills
Examination
[from Singleton, 1976]
______________________________________________________________________________
Statistic                       Sample        Value    Interpretation
______________________________________________________________________________
Percentage of Rater Agreement   N = 92,469    92.97    At least 2 of 3 raters agree on
                                                       an essay rating 92.97% of the
                                                       time
Product-Moment Correlation      N = 162       .624     Correlation between student
                                                       scores and scores assigned by a
                                                       panel of experts
Reliability of Average Ratings  N = 17,095    .7248    If one averaged three ratings
                                                       for each ratee and could
                                                       correlate the set of averages
                                                       from comparable raters, the
                                                       results would be about .7248
Coefficient of Concordance      N = 43,508    .8208    An estimate of the degree of
                                                       concordance for three
                                                       independent judgments, given
                                                       that all scoring patterns with
                                                       a score difference of one point
                                                       are adjusted to concordant
                                                       patterns
______________________________________________________________________________
The "Percentage of Rater Agreement" of 92.97% reported by Singleton was
based on the ratings of 92,469 essays graded over seventeen quarters. This
value, which was 97.2% for 1980-1981, is the percentage of times at least two
of the three raters agreed on the essay rating. (The increased percentage
agreement for 1980-1981 does not necessarily reflect higher reliability; the
increase is due partly to the fact that fewer ratings of 3 and 4 have been
given in recent years.)
The second statistic reported, the "Product-Moment Correlation," is the
correlation between ratings assigned to 162 essays by members of the Testing
Subcommittee and the ratings assigned during the regular scoring session. The
Testing Subcommittee did not use the usual procedure for rating essays: in
addition to the four-point scale, they also used borderline scores (e.g., 1.5
to indicate a score between 1 and 2). The final score for an essay was the
mean of the ratings rather than the middle score usually used as the final
score. The correlation between these ratings and the ratings assigned during
the regular scoring session was .624. As discussed in the section on reading
reliability, correlational estimates of reliability for the Regents' Test must
be interpreted with caution because they indicate only the extent to which
consistency in the rank-ordering of examinees is maintained and do not provide
information on consistency in relation to the cutoff score.
The third estimate of rater reliability reported by Singleton is an intra-
class correlation calculated according to the procedure described by Ebel
(1967). Ratings from 17,095 essays written during a two-year period were
analyzed with this procedure, which is based on the analysis of variance. The
estimate of reliability for single ratings was .47. The reliability of the
average of three ratings was estimated as .72. This can be interpreted as an
estimate of the correlation between two sets of scores when each score is
based on the average of three ratings.
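
An intraclass correlation based on the analysis of variance can be sketched as follows. The sketch uses the usual one-way formulation, in which variation among essays is compared with variation among ratings of the same essay; whether this matches every detail of the computation Singleton performed is an assumption, and the sample ratings are hypothetical.

```python
def intraclass_reliability(ratings):
    """One-way ANOVA intraclass estimates for a table of ratings.

    ratings: one list per essay, each holding k ratings.
    Returns (reliability of a single rating,
             reliability of the average of k ratings).
    """
    n = len(ratings)     # number of essays
    k = len(ratings[0])  # ratings per essay
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average


# Hypothetical ratings: four essays, three ratings each.
sample = [[1, 1, 2], [2, 2, 2], [2, 3, 3], [1, 2, 1]]
single_r, average_r = intraclass_reliability(sample)
print(round(single_r, 3), round(average_r, 3))
```

As with the values of .47 and .72 reported above, the estimate for the average of three ratings is higher than the estimate for a single rating, because averaging reduces the influence of any one rater's inconsistency.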
The final analysis of reliability reported by Singleton was an intraclass
correlation based on the degree of concordance among essay ratings. For this
analysis, all score patterns with a difference of only one point by one rater
(e.g., ratings of 1,1,2 or 2,2,3) were adjusted to concordant patterns. This
adjustment seems reasonable because the one deviant rating does not affect the
final score of an essay. The average correlation between pairs of ratings,
which is an estimate of the reliability of single ratings, was .60. The
Spearman-Brown formula was used to estimate the reliability of the average of
three ratings. This value, which may be interpreted as the correlation
between the averages of two sets of three ratings when patterns with a
one-point discrepancy are adjusted for concordance, was .82.
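
The Spearman-Brown step can be reproduced from the single-rating values quoted in this section. The formula is the standard one; the function name is illustrative, and the inputs are the two-decimal values printed above rather than Singleton's unrounded figures.

```python
def spearman_brown(single_reliability: float, k: int) -> float:
    """Reliability of the average of k parallel ratings, given the
    reliability of a single rating."""
    return k * single_reliability / (1 + (k - 1) * single_reliability)


# Concordance analysis: single-rating reliability of .60, three raters.
concordance_three = spearman_brown(0.60, 3)

# ANOVA-based analysis: single-rating reliability of .47, three raters.
anova_three = spearman_brown(0.47, 3)

print(round(concordance_three, 4), round(anova_three, 4))
```

A single-rating value of .60 yields approximately .82 for three ratings, as reported. Applying the same formula to the rounded value of .47 gives approximately .727, close to the reported .7248; the small difference is presumably due to rounding of the single-rating estimate.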
Singleton found that the results of her analyses were comparable with the
results found in other situations in which essay tests are rated. On the
basis of her analyses, she concluded that the Essay Test was scored consis-
tently and that the reliability of the ratings was sufficient for the intended
use of the Regents' Test.
Chapter IV
REGENTS' TEST RESULTS FROM 1972 TO 1982
Presented in Table 24 are the percentages of examinees passing the
Reading, Essay, and total Test each year (Fall Quarter through Summer Quarter)
from Winter Quarter, 1972, through Summer Quarter, 1982. Results are reported
separately for repeaters and first-time examinees. (No data are reported for
repeaters for the first two years because of the small number of examinees in
that category during these periods.)
Table 24
Regents' Test Results from 1972 to 1982
_________________________________________________________________________________________________
       Percent Passing Under Reading Cutoff Score in Effect  Percent Passing Under Reading Cutoff
       at the Time of Test Administration                    Score of 61 (Effective Fall, 1980)
       ____________________________________________________  __________________________________
       Reading           Essay             Total             Reading           Total
       ________________  ________________  ________________  ________________  ______________
Year   Rep.     1st      Rep.     1st      Rep.     1st      Rep.     1st      Rep.     1st
_________________________________________________________________________________________________
1972            90.4              67.4              65.0              66.2              51.7
72-73           90.2              74.1              70.8              67.0              56.2
73-74  82.5     91.0     56.4     75.1     51.0     71.7     41.2     63.6     29.2     53.9
74-75  89.8     97.6     46.2     80.2     54.6     75.3     54.5     80.0     38.0     64.9
75-76  98.6     99.4     49.4     69.8     49.2     69.7     76.7     88.5     43.3     66.0
76-77  98.5     99.4     50.5     69.3     50.3     69.2     78.8     88.9     44.4     65.2
77-78  96.8     98.7     51.3     68.5     50.8     68.3     73.6     86.1     43.2     63.2
78-79  82.9     89.9*    52.5     68.4     48.2     65.2*    76.1     85.8     45.6     63.2
79-80  66.3     87.3**   45.7     71.4     48.3     65.9**   44.9     85.1     44.2     65.3
80-81  47.0     83.7***  49.5     69.7     46.8     63.4***  47.0     83.7     46.8     63.4
81-82  45.7     82.9     52.1     69.3     47.9     62.6     45.7     82.9     47.9     62.6
_________________________________________________________________________________________________
*Reading cutoff score raised from 51 to 59
**Reading cutoff score raised another point to 60
***Reading cutoff score raised another point to 61
Results for the Reading and total Test are presented in two ways: the
first columns show the percentages of students who actually passed the test
under the cutoff scores in effect at the time the test was taken; the last
four columns show the percentages of students who would have passed the test
under the current cutoff score of 61 (effective Fall Quarter, 1980). The per-
centage passing under the new cutoff score should be used for any year-to-year
comparisons.
The total test results for repeaters beginning with the Winter, 1980 ad-
ministration were computed differently from the repeaters' results for
previous quarters. Before Winter, 1980, all repeaters had to retake both the
Reading Test and the Essay Test. The policy was changed as of Winter, 1980,
to allow students who passed one part and failed the other to retake only the
part that was failed. As of Winter, 1980, the total percentage passing
statistics for repeaters include students who passed both parts of the test
during that year and students who passed one part that year and the other part
in a previous year. Thus, because of the changes in the administration of the
test and in the computation of the statistics, results for repeaters for ad-
ministrations after Fall, 1979, are not directly comparable with results from
previous administrations. Results for first-time examinees are not affected by
this change.
Another problem with the comparability of results across years is that
changes in the policy resulted in changes in the population of students taking
the test. For example, before the most recent policy went into effect in
Winter, 1980, there was less pressure on students who failed the test to repeat
the test each quarter. The new policy requires students who fail to take re-
mediation and to retake the test. There is no way to predict the effect of the
new policy on the percentage of students passing the test: a decreased percentage
could be predicted because the poorer students are required to take the
test more often, or an increased percentage could be predicted because of the
remediation requirement. In either case, the comparability of results is
affected. While trends in performance from year to year can be examined, it is
not possible to determine the causes of any changes observed in passing rates.
Some trends are evident in the year-to-year comparison of the percentages
of students passing the test. For first-time examinees, Essay Test performance
increased from 1972 to 1974-1975, decreased in 1975-1976, and has fluctuated
only slightly since 1975-1976. The percentages of repeaters passing the Essay
Test show a slight increase each year from 1974-1975 to 1978-1979, a decrease
in 1979-1980, and an increase since 1979-1980. These fluctuations may be a
result of changes in the policy.
While the percentages of students passing the Essay Test do not indicate
substantial improvement over time, some of the essay raters have indicated im-
provement that would not be evident in the statistics. Raters have reported
that the failing essays, in general, are not as poorly written as they were in
the beginning of the testing program; a large decrease in the number of
egregious essays has been noted.
Performance on the Reading Test improved from 1973-1974 to 1976-1977. The
percentage of first-time examinees passing the Reading Test decreased slightly
each year after 1976-1977. The lower performance of repeaters on the Reading
Test since 1979-1980 is probably caused by the change in policy. Under the old
policy, students who failed the Essay Test and passed the Reading Test had to
retake the Reading Test when they retook the Essay Test. Because these re-
peaters usually passed the Reading Test each time they took it, the percentage
of repeaters passing the Reading Test was rather high. Under the new policy,
only the students who fail the Reading Test have to retake it; thus, the
population of repeaters is different from what it was before the new policy
went into effect. The lower passing rates shown in the years 1980 to 1982 are
more likely a product of the change in the population of students required to
repeat the test than a product of an actual decrease in these repeaters' level
of performance on the test. Had the performance of those repeaters who had
initially failed the Reading Test decreased, this decrease would have effected
a decrease in the percentage of repeaters passing the total test. As is
evident in Table 24, the percentage of students passing the total test showed
little decrease in 1979-1980; so it seems likely that the apparent decline in
the Reading Test performance of repeaters is caused by the change in the policy
rather than by change in student performance.
References
Anastasi, A. Psychological testing. New York: Macmillan, 1976.
American Psychological Association, American Educational Research Association,
National Council on Measurement in Education. Standards for educational
and psychological tests. Washington, D.C.: American Psychological
Association, 1974.
Angoff, W. H. Scales, scores, and norms. In R. L. Thorndike (Ed.) Educational
measurement. Washington, D.C.: American Council on Education, 1971.
Angoff, W. H. & Ford, S. F. Item-race interaction on a test of scholastic
aptitude. Journal of Educational Measurement, 1973, 10, 95-106.
Barrett, T. Taxonomy of reading comprehension. In R. Smith & T.C. Barrett
(Eds.) Teaching reading in the middle grades. Reading, MA: Addison-
Wesley, 1976.
Bloom, B.S., Madaus, G.F., & Hastings, J.T. Evaluation to improve learning.
McGraw-Hill, 1981.
Bormuth, J. On the theory of achievement test items. Chicago: The University
of Chicago Press, 1970.
Burk, K. Verifying the results of equating for minimum competency tests.
Paper presented at the annual meeting of the American Educational
Research Association, Boston, 1980.
Buros, O. K. (Ed.) The seventh mental measurements yearbook (Vol I)
Highland Park, NJ: Gryphon Press, 1972.
Campbell, D. T. Recommendations for APA test standards regarding construct,
trait or discriminant validity. American Psychologist, 1960, 15, 546-553.
Campbell, D. T. & Fiske, D. W. Convergent and discriminant validation by the
multitrait-multimethod matrix. Psychological Bulletin, 1959, 56, 81-105.
Citron, H. R. Analysis of predictive variables for the essay scores on the
Regents' Test in one Georgia institution. Unpublished doctoral
dissertation, Georgia State University, 1980.
Coffman, W. E. & Kurfman, D. A comparison of two methods of reading essay
examinations. American Educational Research Journal, 1968, 5, 99-107.
Cole, N. S. & Nitko, A. J. Measuring program effects. In R. A. Berk (Ed.)
Educational evaluation methodology: State of the Art. Baltimore, MD:
Johns Hopkins University Press, 1981.
Cronbach, L. J. Test validation. In R. L. Thorndike (Ed.) Educational
Measurement. Washington, D.C.: American Council on Education, 1971.
Cronbach, L. J. & Warrington, W. G. Time limit tests: Estimating their
reliability and degree of speeding. Psychometrika, 1951, 16, 167-188.
Dale, E., O'Rourke, J., & Bamman, H. Techniques of teaching vocabulary.
Palo Alto, CA: Field Educational Publications, 1971.
Donlon, T. F. An exploratory study of the implications of test speededness.
Graduate Record Examinations Report 546-27. Princeton, NJ: Educational
Testing Service, 1978.
Ebel, R. L. Estimation of the reliability of ratings. In W. A. Mehrens &
R. L. Ebel (Eds.), Principles of Educational and Psychological
Measurement. Chicago: Rand McNally, 1967.
Fort Valley College. An examination of the effects of time for test
administration on students' performance on the Language Skills
Examination of the Regents' Testing Program. Report from Fort Valley
College (Abstract only), 1974.
Flaugher, R. L. The many definitions of test bias. American Psychologist,
1978, 33, 671-679.
Godshalk, F. I., Swineford, F., & Coffman, W. E. The measurement of writing
ability. New York: College Entrance Examination Board, 1966.
Guion, R. M. Scoring content samples: The problem of fairness. Journal
of Applied Psychology, 1978, 63, 499-506.
Hambleton, R.K. Test score validity and standard-setting methods. In R.A.
Berk (Ed.) Criterion-referenced measurement: State of the art.
Baltimore, MD: Johns Hopkins University Press, 1980. (a)
Hambleton, R.K. Review methods for criterion-referenced test items. Paper
presented at the annual meeting of the American Educational Research
Association, Boston, April, 1980. (b)
Henderson, F. N. An analysis and comparison of essay evaluations among raters
from four institutions in the University System of Georgia. Unpublished
major applied research project, Nova University, 1977.
Hickman, M. A. Study of the relationships between selected antecedent
variables and the Language Skills Examination of the University System
of Georgia, 1972. Dissertation Abstracts International, 1973, 33,
4877-4878A. (University Microfilm No. 73-5710, 120).
Himmelweit, H. T. Speed and accuracy of work as related to temperament.
British Journal of Psychology, 1946, 36, 132-144.
House, E. B. Testing and teaching: A critique of the Georgia Regents' Test.
Paper presented at the annual meeting of the Conference on College
Composition and Communication, Washington, D.C., March, 1980.
Hunter, J. E. & Schmidt, F. L. A critical analysis of the statistical and
ethical implications of various definitions of "test bias." Psychological
Bulletin, 1976, 83, 1053-1071.
Jensen, A. R. An examination of cultural bias in the Wonderlic Personnel
Test. Intelligence, 1977, 1, 51-64.
Jensen, A. R. Bias in mental testing. New York: Free Press, 1980.
Johnson, W. J. The origin and development of the University System of
Georgia's Regents' Testing Program. Paper presented at the annual
meeting of the Mid-South Educational Research Association, November, 1980.
Kendall, L. M. The effects of varying time limits on test validity.
Educational and Psychological Measurement, 1964, 24, 789-800.
Linn, R. L. Fair test use in selection. Review of Educational Research, 1973,
43, 139-161.
Litaker, R. G. An investigation of item bias in the Language Skills
Examination. Unpublished doctoral dissertation, University of Georgia,
1974. Dissertation Abstracts International, 1974, 35, 6366A (University
Microfilm No. 75-8175, 99).
Marahnich, N. An empirical comparison of four indicators of test speededness.
Paper presented at the annual meeting of the American Educational
Research Association, Boston, 1980.
Pearson, P.D., & Johnson, D.D. Teaching reading comprehension. New York:
Holt, Rinehart and Winston, 1978.
Pendexter III, H. Personal communication. Undated.
Plake, B. S. A comparison of statistical and subjective procedures to
ascertain item validity: One step in the test validation process.
Educational and Psychological Measurement, 1980, 40, 397-404.
Popham, W.J. As always, provocative. Journal of Educational Measurement,
1978, 15, 297-300.
Prather, J. E. & Smith, G. Factors influencing student performance on a
language skills examination: The Regents' Test. Office of Institutional
Planning Report No. 76-1. Atlanta, GA: Georgia State University, 1975.
Ravan, F. O., Veal, L. R., & Rentz, R. R. A validity study of the essay test
of the Georgia Language Skills Examination. Paper presented at the
Annual Meeting of the National Council on Measurement in Education, New
Orleans, 1974.
Regents' Testing Program. An examination of student performance on the essay
test of the Language Skills Examination under different conditions of
time and choice of topic. (Abstract) 1974.
Rindler, S. E. Pitfalls in assessing test speededness. Journal of
Educational Measurement, 1979, 16, 261-270.
Scheuneman, J. D. A new look at bias in aptitude tests. In P. Merrifield
(Ed.), New directions for testing and measurement - Measuring human
abilities (No. 12). San Francisco, CA: Jossey-Bass, 1981.
Sandoval, J. & Miille, M. P. W. Accuracy of judgments of WISC-R item difficulty
for minority groups. Journal of Consulting and Clinical Psychology,
1980, 48, 249-253.
Shepard, L. Standard setting issues and methods. Applied Psychological
Measurement, 1980, 4, 447-467.
Shepard, L. A. Bias in test items. In B. F. Green (Ed.), New directions for
testing and measurement - Issues in testing: Coaching, disclosure, and
ethnic bias (No. 11). San Francisco, CA: Jossey-Bass, 1981.
Singleton, D. J. The reliability of ratings on the essay portion of the
Language Skills Examination. Unpublished doctoral dissertation,
University of Georgia, 1976.
Swineford, F. Test analysis manual. Statistical Report 74-06. Princeton,
NJ: Educational Testing Service, 1974.
Terranova, G. The relationship between test scores and test-time. The
Journal of Experimental Education, 1972, 40, 81-83.
Thompson, D. J. & Rentz, R. R. Large-scale essay testing: Implications for
test construction. Paper presented at the International Symposium on
Educational Testing, The Hague, The Netherlands, July, 1973.
Thorndike, R. L. & Hagen, E. Measurement and evaluation in psychology and
education. New York: Wiley, 1977.
Tuinman, J. J. Determining the passage dependency of comprehension questions
in 5 major tests. Reading Research Quarterly, 1973-1974, 9, 206-223.
Veal, R. & Rentz, R. Large-scale essay testing. Paper presented at the
annual meeting of the National Council on Measurement in Education,
Chicago, 1974.
Watters, P. Faith, hope, and parity. Change Magazine, October, 1979,
pp. 10-13.
Wesman, A. G. Some effects of speed in test use. Educational and
Psychological Measurement, 1960, 20, 267-274.
Willig, C. The University System of Georgia Regents' Test: A faculty
perspective. Paper presented at the annual meeting of the Mid-South
Educational Research Association, November, 1980.
Wright, B. D. Solving measurement problems with the Rasch model. Journal of
Educational Measurement, 1977, 14, 97-116.
Appendix A
Board Policy
Administrative Procedures (includes Special Administration for students with disabilities)
Use of Dictionaries on the Essay Test
"Grandfather" Issue
Appendix B
MEMBERS OF THE COMMITTEE ON THE REGENTS' READING TEST, 1981-1982
Ms. Marolyn Howell, Developmental Studies, Abraham Baldwin Agricultural College
Dr. Helen Naugle, English Department, Georgia Institute of Technology
Dr. William Dodd, Developmental Studies, Augusta College
Ms. Verdery Deal, Developmental Studies, Georgia Southern College
Mrs. Annie Russell, Developmental Studies, Emanuel County Junior College
Dr. Joan Elifson, Developmental Studies, Georgia State University
Miss Patricia Ann Solomon, Developmental Studies, Albany Junior College
Ms. Judy L. Shank, Developmental Studies, Southern Technical Institute
Mrs. Rosa Tift, Chairperson, Developmental Studies, Albany State College
Dr. William Diehl, Developmental Studies, University of Georgia
Dr. Philip Scriven, Developmental Studies, Savannah State College
Dr. Bob W. Jerrolds, Reading Education, University of Georgia
Ms. Brenda Jackson, Developmental Studies, Georgia Southwestern College
Mrs. Annie Robinson, Developmental Studies, Fort Valley State College
Dr. Monica Jean Hiler, Developmental Studies, Gainesville Junior College
Dr. Henrietta Miller, Developmental Studies, Clayton Junior College
Dr. Nancy Bland, Elementary Education, Armstrong State College
Ms. Dorothy Randall, Developmental Studies, Bainbridge Junior College
Dr. Joan H. Marshall, Learning Center, Columbus College
Dr. George M. McNinch, Department of Education, West Georgia College
Dr. Ola M. Brown, Education Department, Valdosta State College
Mrs. Elle Billiard, Developmental Studies, Atlanta Junior College
Ms. Teresa T. Deen, Developmental Studies, Kennesaw State College
MEMBERS OF THE TESTING SUBCOMMITTEE OF THE ACADEMIC COMMITTEE ON ENGLISH, 1981-1982
Dr. William J. Johnson, Languages and Literature, Augusta College
Dr. Luetta Milledge, Department of English, Savannah State College
Dr. James W. Mathews, Department of English, West Georgia College
Dr. Larry Corse, Humanities Division, Clayton Junior College
Dr. Thomas A. Wilkerson, Humanities Division, Dalton Junior College
Dr. Jean B. Bridges, Humanities Division, Emanuel County Junior College
Dr. Betty Jo Strickland, Humanities Division, Brunswick Junior College
Last updated: November 8, 1996