Three recent Canadian developments around the use of student opinion surveys may be of interest to AUNBT members.

The Kaplan Decision

In June 2018, arbitrator William Kaplan released his decision in an interest arbitration case over the use of student evaluations of teaching (SETs) in employment-related decisions, such as promotion and tenure, at Ryerson University.

While the Ryerson Faculty Association and Ryerson University agreed that evidence of high-quality teaching is an essential requirement for promotion and tenure, and that SETs can be a valuable source of information from students about their educational experience, the parties disagreed over the use of SETs to assess teaching effectiveness.  

The arbitrator’s ruling concludes that SETs cannot assess teaching performance and effectiveness.  They are “imperfect at best and downright biased and unreliable at worst” [Kaplan, p5]. The decision states:  

According to the evidence, which was largely uncontested, and which came in the form of expert testimony and peer reviewed publications, numerous factors, especially personal characteristics—and this is just a partial list—such as race, gender, accent, age and ‘attractiveness’ skew SET results. [Kaplan, p6]

The arbitrator also identified several other factors as influencing SET reliability:

  • systemic factors (online vs. in-class administration of surveys);
  • course characteristics (class size, quantitative vs. humanities, traditional teaching vs. innovative);  
  • the possibility of a lack of expertise amongst students to comment on course content and teaching methods;
  • the impact of SETs on behaviour (teaching to desired SET outcomes); and
  • the correlation between SET scores and student grade expectations.  

In his ruling, the arbitrator directed changes to the collective agreement to ensure that the SETs will not be used to measure teaching effectiveness.  He also directed that the numerical system should be replaced with an alphabetical one, that a process for educating those involved with the promotion and tenure process about the limitations of SETs should be further developed, and that online evaluations for non-online courses for probationary faculty should be discontinued. Effectively, while SETs can continue to be used to measure student experience, averages should not be used to compare faculty members or to determine teaching effectiveness.

The Kaplan arbitration relied for its expert opinion on reports by Richard Freishtat, director of the Center for Teaching and Learning at the University of California, Berkeley, and by Philip B. Stark, Professor of Statistics, Associate Dean of Mathematical and Physical Sciences, and Director of the Statistical Computing Facility at the University of California, Berkeley.  

Of particular note in Freishtat’s report is his review of the research on how gender bias and stereotypes inform student assessments in ways that disadvantage female instructors; how ethnicity and race negatively affect SETs, with racial-minority instructors and non-native English speakers being rated lower; and how age and attractiveness, qualities an instructor cannot change, matter in teaching evaluations [Freishtat, II.D.1-5].  

Stark’s report covers similar terrain and usefully explains why it is statistically inappropriate to average SET scores across courses, instructors, and disciplines: possible answers such as strongly disagree, disagree, neither agree nor disagree, etc., are “ordinal categorical variables,” and while it is common practice to replace these category names with numbers (e.g. strongly disagree=1),

the numbers themselves are not quantities, just new labels. They are codes that happen to be numerical. The actual magnitude of the numbers does not mean anything [Stark, #28].

So averaging such numbers is meaningless:  “adding or subtracting labels from each other does not make sense, any more than it makes sense to add or average postal codes” [#29]. Averages, however, are often reported as if they were meaningful, creating “the illusion of scientific precision when in fact the result is not valid” [#30].  In Stark’s view, labels should not be numerical.
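Stark’s point can be illustrated with a small sketch (hypothetical data, not drawn from the report): two courses whose coded responses produce the same average can have completely different response distributions, so the average hides exactly the information that matters.

```python
from collections import Counter
from statistics import mean

# Hypothetical SET responses coded 1-5 (strongly disagree .. strongly agree).
# The numbers are labels for ordered categories, not measurements.
course_a = [1, 1, 1, 5, 5, 5]   # polarized: students split sharply
course_b = [3, 3, 3, 3, 3, 3]   # uniform: every student neutral

# The averages are identical...
print(mean(course_a), mean(course_b))   # prints: 3 3

# ...but the frequency distributions tell very different stories.
print(sorted(Counter(course_a).items()))   # prints: [(1, 3), (5, 3)]
print(sorted(Counter(course_b).items()))   # prints: [(3, 6)]
```

Reporting the full distribution of responses, rather than a single averaged score, is the kind of summary Stark argues is appropriate for ordinal data.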

The reports were prepared for the Ryerson Faculty Association and the Ontario Confederation of University Faculty Associations (OCUFA), the Ontario counterpart to the Federation of New Brunswick Faculty Associations (FNBFA).

OCUFA Report

In February 2019, the OCUFA Student Questionnaires on Courses and Teaching Working Group released its own very substantial report on student questionnaires.  Based on research and expert reports, responses from Ontario faculty associations, and documentation pertaining to student questionnaires and research ethics, the Working Group’s report agrees with most of the findings of the Kaplan arbitration, while also adding further detail to the picture of the problems with student opinion surveys.

The report assesses student questionnaires on courses and teaching (SQCTs) from several directions:

Methodology: The report indicates that in creating numerical statistics (“metrics”), SQCTs are prone to misinterpretation and asserts they do not measure student learning but rather student satisfaction.  It too finds that SQCTs are influenced by factors unrelated to the quality of teaching, such as time of day, class size, subject discipline, and whether or not the course is an elective. Furthermore, women, racialized, and LGBTQ2S+ faculty often receive lower scores than white male faculty, and it is impossible to correct for students’ unconscious bias.   

Research Ethics: the Working Group recommends that students be given the opportunity for active informed consent and be made aware of how their responses will be used and who will have access to them. Both faculty and students should be assured of the data security measures that protect their privacy.

Human Rights: the Working Group offers an extensive analysis of the human rights context and lays out how the summative use of SQCTs does not meet the bar of equal treatment and freedom from discrimination set by the Ontario Human Rights Code, which is supported by faculty collective agreements. Given the evidence that women, racialized people, people with disabilities, and LGBTQ2S+ people often receive lower scores than their white male able-bodied colleagues, the summative use of SQCTs represents a form of systemic discrimination; inequity results particularly when they are used in decisions on pay and employment. While some faculty may experience disadvantages with respect to class size and scheduling, these balance out as schedules change over a career; the same is not true of bias experienced on the grounds of gender, race, and other protected categories. The report also highlights the need to consider the impact of context on teaching, particularly of “cultural taxation,” the higher service workloads often required of faculty who serve on multiple university committees as diversity representatives. That equity-seeking groups are also more likely to be targeted by abusive remarks in the comments sections of questionnaires is equally unacceptable from a human rights perspective.

The Working Group argues that SQCTs should be used for their original formative purpose, to provide teachers with feedback on the course and students’ learning experience, but should not be used for “summative” purposes, evaluating faculty performance. Indeed, instead of supporting high quality teaching, SQCTs can actually subvert teaching excellence: “In addition to incentivizing teaching strategies which do nothing to advance student learning, they can work to discourage classroom innovation or the study of challenging subjects. New or challenging approaches may yield lower SQCT scores due to student resistance, even when these improve student learning” [OCUFA Report, p8].

The Working Group also assesses the roots of the current reliance on SQCT scores.  The summative use of SQCTs, it says, is a “cheap substitute for qualitative assessment of teaching” [p7] that has emerged in an era defined by reduced funding for higher education and an increased emphasis on performance indicators and the preparation of students for job markets. Teaching should instead, the Group argues, be evaluated through teaching dossiers and evaluation by peers—a position that Freishtat also develops [Freishtat, I. 1-7].  SQCTs that continue to be used should be redesigned for a formative purpose.


In line with the Kaplan arbitration with respect to Ryerson, the new Collective Agreement ratified by members of the University of Western Ontario Faculty Association in November 2018 changed how student questionnaires will be used at Western University. They can now be employed as information about students’ experience in the classroom but not as evidence in the evaluation of teaching performance in the appointments, annual evaluation, and promotion and tenure processes.  The University of Southern California has switched to a peer-review model for evaluating teaching, and the University of Oregon will stop using numerical rankings in promotion and tenure assessment [CAUT Bulletin].

AUNBT is studying the details of these documents with a view to determining how they can best be understood in the context of our Collective Agreements.  Members who are preparing for the upcoming round of assessments should direct questions to the Grievance Chairs or the Professional Officer (Robert Gagne). We will be elaborating further on these matters at the assessment workshops in the fall.

Further Reading

The End of Student Questionnaires? CAUT Bulletin, November 2018. 

Farr, M. (2018). Arbitration decision on student evaluations of teaching applauded by faculty.  University Affairs, August 2018.  

Stark, P. B., & Freishtat, R. (2014). An evaluation of course evaluations. ScienceOpen Research. doi:10.14293/S2199-1006.1.SOR-EDU.AOFRQA.v1.

The reports from OCUFA, Stark, and Freishtat all have extensive bibliographies.

(March 2019)
