Will Computers Ever Replace Teachers?

The classroom looked like a call center. Long tables were divided by partitions into individual work stations, where students sat at computers. At one of the stations, a student was logged into software on a distant server, working on math problems at her own pace. The software presented questions, she answered them, and the computer instantly evaluated the answer. When she answered a sufficient number of problems correctly, she advanced to the next section. The gas-plasma monitors attached to the computers displayed the text and graphics with a monochromatic orange glow.

This was in 1972. The computer terminals were connected to a mainframe computer at the University of Illinois at Urbana-Champaign, which ran software called Programmed Logic for Automated Teaching Operations (PLATO). The software had been developed in the nineteen-sixties as an experiment in computer-assisted instruction, and by the seventies a nationwide network allowed thousands of terminals to simultaneously connect to the mainframes in Urbana.* Yet for all its success, PLATO was hardly a new concept: from the earliest days of mainframe computing, technologists have explored how to use computers to complement or supplement human teachers.

At first glance, this impulse makes sense. Computers are exceptionally good at tasks that can be encoded into routines with well-defined inputs, outputs, and goals. Among the early programming languages developed for PLATO was TUTOR, an authoring language that allowed programmers to write online problem sets.* A problem written in TUTOR had, at minimum, a question and an answer bank. Some answer banks were quite simple. For example, the answer bank for the question “What is 3 + 2?” might be “5.” But answer banks could also be more complicated, accepting multiple correct spellings and ignoring certain prefatory words. For instance, the answer bank for that same question could be “<it, is, the, answer> (5, five, fiv),” where the parentheses hold accepted answers and the angle brackets hold words to ignore. With this more sophisticated answer bank, TUTOR would accept “5,” “five,” “fiv,” “it is 5,” or “the answer is fiv” as correct answers.
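To see how little machinery this requires, here is a minimal sketch of that judging logic in modern Python. (This is an illustration, not actual TUTOR code; the ignored words and accepted spellings are taken from the example above.)

```python
# A sketch of TUTOR-style answer judging: discard the words the author
# marked as ignorable, then check what remains against the bank of
# accepted spellings. The matching is purely syntactic; the program
# never performs the arithmetic.

IGNORED = {"it", "is", "the", "answer"}   # prefatory words to discard
ACCEPTED = {"5", "five", "fiv"}           # the answer bank

def judge(response: str) -> bool:
    words = [w for w in response.lower().split() if w not in IGNORED]
    return len(words) == 1 and words[0] in ACCEPTED

assert judge("5")
assert judge("it is 5")
assert judge("the answer is fiv")
assert not judge("3 + 2")   # arithmetically right, syntactically rejected
```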

This sort of pattern-matching was at the heart of TUTOR. Students typed characters in response to a prompt, and TUTOR determined whether those characters matched the accepted answers in the bank. But TUTOR had no real semantic understanding of the problem being posed or the answers given. “What is 3 + 2?” is just an arbitrary string of characters, and “5” is just one more arbitrary character. TUTOR did not need to evaluate the arithmetic of the problem. It could simply evaluate whether the syntax of a student’s answer matched the syntax of the answer bank.

Humans are much slower than computers at this kind of pattern-matching, as anyone who has graded a stack of homework can attest, and so educators have developed a variety of technologies to speed up the process. Scantron systems allow students to encode answers as bubbles on multiple-choice forms, and optical-recognition tools quickly identify whether the filled-in bubbles match the answer key. In nineteenth-century America, one-room schoolhouses employed the monitorial method, in which older students evaluated the recitations of younger ones. Younger students memorized sections of textbooks and recited those sections to older students, who had previously memorized the same sections and had the book in front of them for good measure. Monitors often did not understand the semantic meaning of the words being recited, but they insured that the syntactical input (the recitation) matched the answer bank (the textbook). TUTOR and its descendants are very fast and accurate versions of these monitors.

Forty years after PLATO, interest in computer-assisted instruction is surging. New firms, such as DreamBox and Knewton, have joined more established companies like Achieve3000 and Carnegie Learning in providing “intelligent tutors” for “adaptive instruction” or “personalized learning.” In the first quarter of 2014, over half a billion dollars was invested in education-technology startups. Not surprisingly, these intelligent tutors have grown fastest in fields in which many problems have well-defined correct answers, such as mathematics and computer science. In domains where student performances are more nuanced, machine-learning algorithms have seen more modest success.

Take, for instance, essay-grading software. Computers cannot understand the semantic meaning of student texts, so autograders work by reducing student writing to syntax. First, humans grade a small training set of essays, which then go through a process of text preparation. Autograders remove the most common words, like “a,” “and,” and “the.” The order of the remaining words is then ignored, and the words are aggregated into a list, evocatively called a “bag of words.” Computers calculate relationships among these words, such as how often each possible pair of words appears together, and summarize those relationships numerically. For each document in the training set, the autograder then correlates this quantitative representation of the essay’s syntax with the grade assigned by the human, an assessment of semantic quality.
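As a rough illustration, the bag-of-words reduction takes only a few lines of Python. (This is a minimal sketch using scikit-learn; the two “essays” are invented, and production systems compute far richer features.)

```python
# A sketch of the bag-of-words reduction: common stop words are dropped,
# word order is discarded, and each essay becomes a vector of word counts.
from sklearn.feature_extraction.text import CountVectorizer

essays = [
    "The data support the hypothesis, and the experiment was sound.",
    "The experiment was flawed and the data were weak.",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(essays)

print(vectorizer.get_feature_names_out())  # the shared vocabulary
print(counts.toarray())                    # one row of counts per essay
```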

The final step is pattern-matching. The algorithm takes each ungraded essay, compares its syntactic patterns to those of the essays in the training set, and assigns a grade based on the similarity. In other words, if an ungraded bag of words has quantitative properties close to those of a high-scoring bag of words from the training set, the software assigns a high score. If its syntactic patterns are more similar to those of a low-scoring essay, the software assigns a low score.
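Continuing the sketch above, this matching step can be approximated with a nearest-neighbors model, which gives an ungraded essay a score close to the scores of the training essays it most resembles. (The training essays and grades here are invented for illustration.)

```python
# Grade by pattern-matching: an ungraded bag of words receives a score
# near those of the most similar human-graded bags of words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsRegressor

train_essays = [
    "a clear thesis supported by strong evidence and careful reasoning",
    "strong evidence and a clear, well-organized argument",
    "no thesis, little evidence, and a rambling structure",
    "a weak argument with almost no supporting evidence",
]
train_grades = [5.0, 5.0, 2.0, 1.0]   # hypothetical human scores

vectorizer = CountVectorizer(stop_words="english")
features = vectorizer.fit_transform(train_essays)

grader = KNeighborsRegressor(n_neighbors=2, metric="cosine")
grader.fit(features, train_grades)

new_essay = ["careful reasoning and strong evidence support a clear thesis"]
print(grader.predict(vectorizer.transform(new_essay)))  # close to 5.0
```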

In some ways, grading by machine learning is a marvel of modern computation. In other ways, it’s a kluge that reduces a complex human performance to patterns that can be algorithmically matched. The performance of these autograding systems is still limited, and public suspicion of them is high, so most intelligent tutoring systems have made no effort to parse student writing. They stick to the parts of the curriculum with the most constrained, structurally defined answers, not because those parts are the most important but because the pattern-matching is easier to program.

This presents an odd conundrum. In the forty years since PLATO, educational technologists have made progress in teaching the parts of the curriculum that can be most easily reduced to routines, but very little progress in expanding the range of what these programs can do. During those same forty years, in nearly every other sector of society, computers have reduced the need for humans to perform routine tasks. Computers, therefore, are best at assessing human performance on precisely the sorts of tasks in which humans have already been replaced by computers.

Perhaps the most concerning part of these developments is that our technology for high-stakes testing mirrors our technology for intelligent tutors. We use machine learning in a limited way for grading essays on tests, but for the most part those tests are dominated by assessment methods—multiple choice and quantitative input—in which computers can quickly compare student responses to an answer bank. We’re pretty good at testing the kinds of things that intelligent tutors can teach, but we’re not nearly as good at testing the kinds of things that the labor market increasingly rewards. In “Dancing with Robots,” an excellent paper on contemporary education, Frank Levy and Richard Murnane argue that the pressing challenge of the educational system is to “educate many more young people for the jobs computers cannot do.” Schooling that trains students to efficiently conduct routine tasks is training them for jobs that pay minimum wage—or jobs that simply no longer exist.


Correction: An earlier version of this post suggested that Michigan hosted thousands of terminals connected to the PLATO network. It also incorrectly suggested that TUTOR was the first programming language used on PLATO.