Manufactured Minds: The Complex Ethics of Artificial Intelligence

Work Reviewed: Christian, Brian. The Alignment Problem: Machine Learning and Human Values.* W.W. Norton & Company, 2020.

As an Amazon Associate I earn from qualifying purchases. To provide total transparency over which hyperlinks are affiliated, I’ll mark all affiliate hyperlinks with an asterisk (*) at the end.

[Image: Robotic hand with a representation of an artificial intelligence neural network.]

Have you ever felt uncomfortable with the idea of self-driving vehicles, or wondered about the implications of programs like ChatGPT for our future? If so, explore Brian Christian’s The Alignment Problem to find out how artificial intelligence (AI) systems learn, the potential benefits and risks they pose, and the many ways researchers have tried to keep such systems aligned with our goals and values.

Introduction to Artificial Intelligence

Christian begins his work with a troubling example from an AI system created by researchers at Google. The word2vec system learned numerical representations of words that could be combined with simple arithmetic. For example, China + river might yield Yangtze, while Paris – France + Italy yields Rome. Some other equations, however, led to disturbing outcomes, such as doctor – man + woman = nurse, or programmer – man + woman = homemaker.
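
To make that “math with words” concrete, here is a minimal sketch of the kind of vector arithmetic the passage describes, using a handful of made-up toy vectors rather than real word2vec embeddings:

```python
import numpy as np

# Toy 3-dimensional "embeddings" -- invented numbers purely to illustrate the
# analogy arithmetic; real word2vec vectors have hundreds of dimensions.
vectors = {
    "paris":  np.array([0.9, 0.1, 0.0]),
    "france": np.array([0.7, 0.1, 0.0]),
    "italy":  np.array([0.7, 0.8, 0.0]),
    "rome":   np.array([0.9, 0.8, 0.0]),
    "berlin": np.array([0.9, 0.1, 0.9]),
}

def closest(query, exclude):
    """Return the word whose vector is most similar (cosine) to the query vector."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vectors[w], query))

# "Paris - France + Italy" lands nearest to "Rome" in this toy space.
query = vectors["paris"] - vectors["france"] + vectors["italy"]
print(closest(query, exclude={"paris", "france", "italy"}))  # -> rome
```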

So how do such problems arise within artificial intelligence programs? Two key factors are datasets and rewards. Some AI programs are fed huge amounts of data and extrapolate from there; others receive more curated datasets; still others learn by encountering scenarios and then facing some reward or punishment based on their actions. If the dataset you choose contains bias, that bias will surface in the finished application, as it did in word2vec.

Another central problem is the difficulty of pinning down exactly the behavior you wish to reward. In an example that is only funny so long as the stakes stay confined to a video game, an AI endlessly loops a boat through a section of a racing game full of the power-ups that were meant to incentivize winning the race.

The common thread running through both of these issues (datasets and incentives) is the alignment problem: how do we ensure that the AI programs we create align with our values and intent?

Chapter One: Representation

Chapter one traces the history of machine learning, beginning with Frank Rosenblatt’s 1958 “Perceptron,” a machine trained to recognize punch cards using an early form of stochastic gradient descent. When the machine gets something right, nothing changes; when it gets something wrong, its weights are adjusted toward the correct answer. The machine showed great promise, and this line of research eventually led to neural networks that could read checks and zip codes. By 1973, however, the field had plateaued, in part because of the massive computing power and data such work demands.
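
To see the error-driven learning rule in action, here is a toy sketch in the spirit of the perceptron (my own illustration on made-up data, not Rosenblatt’s hardware): correct answers leave the weights alone, and mistakes nudge them toward the right answer.

```python
import numpy as np

# A minimal perceptron: weights only change when the prediction is wrong,
# and then they are nudged toward the correct answer.
def train_perceptron(examples, labels, epochs=10, lr=0.1):
    weights = np.zeros(examples.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):   # y is +1 or -1
            prediction = 1 if weights @ x + bias > 0 else -1
            if prediction != y:              # correct answers leave weights untouched
                weights += lr * y * x
                bias += lr * y
    return weights, bias

# Toy, linearly separable data standing in for Rosenblatt's punch cards.
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print([1 if w @ x + b > 0 else -1 for x in X])  # -> [1, 1, -1, -1]
```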

Christian identifies a turning point in Alex Krizhevsky’s “AlexNet,” which won the ImageNet Large Scale Visual Recognition Challenge by a landslide in 2012. But this leap in the accuracy and speed of image-recognition AIs led to new problems. You might recall the outrage when Google Photos labeled photos of Black people as “gorillas.” How does such a thing happen? Image-recognition networks train on particular sets of images, so if an image set is skewed in some way, the AI trained on it will be, too. Christian reports that Labeled Faces in the Wild, a dataset widely used for training and benchmarking facial-recognition systems, was 77% male and 83% white. Women of color, in particular, would therefore see much higher error rates in such systems’ identifications.

Insidious Bias

When the problem is a lack of representation, researchers can improve things by changing the dataset. But what about more insidious bias, like the word2vec examples from the introduction? Those cases are harder to fix. If an artificial intelligence is ostensibly trained to disregard gender on job applications, for instance, it may find other ways to reproduce the bias. For example, it might notice that applicants named “John” are more common in the field of software engineering than applicants named “Mary.”
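
A hypothetical toy example of that proxy effect, with invented application records, might look like this: the gender column is gone, but the bias survives through first names.

```python
# Hypothetical toy applications: gender has been withheld, but first name remains.
applications = [
    {"name": "John", "hired": True},  {"name": "John", "hired": True},
    {"name": "John", "hired": True},  {"name": "Mary", "hired": False},
    {"name": "Mary", "hired": True},  {"name": "Mary", "hired": False},
]

# A model need only notice that "John" correlates with past hiring decisions
# to reproduce the very bias that dropping the gender column was meant to remove.
hire_rate = {}
for app in applications:
    stats = hire_rate.setdefault(app["name"], [0, 0])  # [hired, total]
    stats[0] += app["hired"]
    stats[1] += 1

for name, (hired, total) in hire_rate.items():
    print(f"{name}: {hired / total:.0%} hired in the training data")
# John: 100% hired, Mary: 33% hired -- the proxy carries the bias forward.
```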

Christian compares this to the jury at a blind orchestra audition noticing the sound of women’s heels on the floor. Bias was not reduced until carpet was laid down and the jury could no longer use context clues to infer the identity of the musician. The author will later describe attributes like gender as “redundantly coded”: signaled by multiple factors that are difficult to tease apart when trying to prevent an AI from being biased.

Chapter Two: Fairness

The concerns of the previous chapter lead naturally into a discussion of fairness, centered on the use of AI in parole decisions. The 1951 “Manual of Parole Prediction” was an early example of computing with punch cards. By 1998, Brennan and Wells had developed COMPAS (Correctional Offender Management Profiling for Alternative Sanctions). At first, such systems seemed promising. But for many people, it was disturbing to think that a computer would help decide whether someone stays behind bars.

These concerns surface in “Machine Bias,” ProPublica’s 2016 investigation into the outcomes of COMPAS decisions. The findings were startling. The system was 61% accurate at predicting recidivism, the likelihood that someone would reoffend. But it turned out to be making fundamentally different kinds of mistakes for Black defendants than for white defendants: Black defendants were far more likely to be rated high risk and then not reoffend, while white defendants were more likely to be rated low risk and then go on to reoffend. Some defended the system on the strength of its 61% accuracy, but the example highlights the difference between equality and equity. The overall error rate might be the same for both groups, yet when the errors lean so clearly in different directions, the outcomes are not equitable.
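
To see how equal accuracy can coexist with unequal errors, here is a small sketch with invented confusion counts (not ProPublica’s actual figures), chosen so both groups hit the same 61% accuracy while their errors point in opposite directions:

```python
# Hypothetical confusion counts for two groups -- invented numbers, not the
# actual "Machine Bias" data -- chosen so overall accuracy is identical.
groups = {
    #            (rated high & reoffended, rated high & did NOT reoffend,
    #             rated low & did NOT reoffend, rated low & reoffended)
    "Group A": (30, 25, 31, 14),
    "Group B": (30, 14, 31, 25),
}

for name, (tp, fp, tn, fn) in groups.items():
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    false_positive_rate = fp / (fp + tn)   # wrongly flagged as high risk
    false_negative_rate = fn / (fn + tp)   # wrongly rated low risk
    print(f"{name}: accuracy {accuracy:.0%}, "
          f"FPR {false_positive_rate:.0%}, FNR {false_negative_rate:.0%}")
# Same 61% accuracy for both groups, yet the errors lean in opposite directions.
```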

Christian points out another problem. The algorithm can only predict “future policing,” not reoffense, because only those who are re-arrested and re-convicted factor into the system. So if someone reoffends in an area not monitored by police, and no one arrests them, their data is not part of the model. Even more disturbingly, if heavily policed areas continue to see arrests, it will justify more police presence, and more re-arrests, further skewing the model. Christian will return to such feedback loops later on, as they are tricky to avoid.

Chapter Three: Transparency 

Chapter three turns to a classic problem of correlation vs. causation. How do we train AIs to recognize significant factors and discard spurious ones? In one dataset Christian describes, asthma and heart disease patients, as well as individuals over the age of 100, had higher rates of surviving pneumonia. An AI that took this correlation at face value might recommend sending all such patients home, failing to recognize that they survive precisely because human doctors immediately flag them as high risk and send them to the ICU.

So how can one correct for this type of error? One way is to introduce additional factors, such as cost. If a patient survives with only a small bill, that suggests little intervention was needed; if they survive at great cost, it suggests that far more care was required. This kind of correction is relatively easy inside a rule-based decision-making system, but neural nets are much harder to see inside.
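
Here is a minimal sketch, my own illustration rather than the study’s actual model, of how a cost signal can correct the misleading survival correlation inside a transparent, rule-based system:

```python
# Sketch of the "add cost as a signal" idea: surviving cheaply suggests the
# patient really was low risk, while surviving only after an expensive ICU
# stay suggests the opposite.
def risk_label(record):
    if not record["survived"]:
        return "high risk"
    # Survival alone is misleading; the size of the bill hints at how much
    # intervention was needed to produce that survival.
    return "high risk" if record["cost"] > 10_000 else "low risk"

records = [
    {"condition": "asthma", "survived": True, "cost": 42_000},   # survived via ICU
    {"condition": "none",   "survived": True, "cost": 1_200},    # survived at home
]
for r in records:
    print(r["condition"], "->", risk_label(r))
# asthma -> high risk, none -> low risk
```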

“Neural nets are really good, they’re accurate, but they’re completely opaque and unintelligible, and I think that’s dangerous now…my goal right now is to scare people. To terrify them.” –Rich Caruana, pages 85-87

In fact, opaque models cannot be used for certain decisions under the EU’s General Data Protection Regulation (GDPR), in force since 2018: if an automated system makes a decision about you, you have a right to know how it arrived at that decision. And sometimes even the creators of a neural net cannot say exactly how it arrives at any one decision. Christian’s message on this topic is that sometimes simpler is better. When you can see clearly where the errors are trending, and can correct them, as in the pneumonia example, you can avoid some of the pitfalls inherent in training AI.

Chapter Four: Reinforcement

Psychologist Edward Thorndike’s Law of Effect goes something like this: responses followed by satisfying consequences are more likely to be repeated, while responses followed by discomfort are less likely to recur. In particular, Thorndike observed that when chickens and other animals encountered a puzzle box with food at stake, they first interacted with it at random. Any action that produced annoyance or frustration was unlikely to be repeated, while actions that led to satisfaction or pleasure would be.

Christian connects this idea with a discussion of dopamine later in the chapter, making the case that the chemical is not just about pleasure but about the anticipation of it – something like a message that says, “This is a good direction. Keep it up!” As a side note, this is one reason why addiction can be so difficult to kick. Christian describes the elevated dopamine as “writing a check you can’t cash”: the anticipatory rush promises future pleasure, while in reality the drug is always going to wear off.

From Random to Learned Actions

Back to the main topic – the movement from random to learned actions can be reinforced with rewards due to the “hedonistic” nature of our neurons, as described by Harry Klopf. Basically, we sometimes think of ourselves as trying mainly to reduce negative things, but in reality, humans are often “maximizers,” whose main goal is to acquire as many benefits as possible. There are challenges, though, when applying this theory. For example, if a child receives a gold star in class for a particular behavior, they might not continue it when they’re out in the world, away from the original reward. Or they might begin to do that behavior incessantly and out of context. Christian breaks these difficulties down into three parts in the context of AI training. 

  1. In real life, nothing exists in a vacuum. Behaviors and consequences all have complex contexts.
  2. As a result, consequences can become unclear and confusing.
  3. In addition, consequences in real life are often delayed.

For example, a chess game might proceed move by move for hours before someone’s mistake becomes apparent. So if the only cue is winning or losing, how does the subject know what behaviors to modify? 
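
A common way reinforcement learning handles this delay is to discount a single final outcome back across all the earlier moves. Here is a minimal sketch of that idea (my own illustration, not taken from the book):

```python
# Credit a single delayed outcome (win = +1, loss = -1) back to every earlier
# move, discounted so moves near the outcome receive more of the blame or
# credit than moves made long before it.
def discounted_returns(rewards, gamma=0.95):
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A 40-move game where the only feedback is losing at the very end.
rewards = [0.0] * 39 + [-1.0]
returns = discounted_returns(rewards)
print(round(returns[-1], 3), round(returns[20], 3), round(returns[0], 3))
# The final moves carry most of the negative credit; the opening, very little.
```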

Chapter Five: Shaping Artificial Intelligence

In 1943, B.F. Skinner attempted to train pigeons to guide bombs toward selected targets. The researchers tried several different reward schedules. Rewards might arrive after a certain number of responses (ratio), or after some amount of time (interval). Either method can be fixed, occurring predictably, or variable, so the subject is never sure when the reward will come. The most effective combination turned out to be the variable-ratio schedule, where rewards come often but after an unpredictable number of responses.
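
A toy simulation of the two ratio schedules, purely illustrative, might look like this:

```python
import random

# Fixed-ratio rewards arrive every Nth response; variable-ratio rewards arrive
# on average every Nth response, but at unpredictable moments.
def fixed_ratio(n_responses, every=5):
    return [(i + 1) % every == 0 for i in range(n_responses)]

def variable_ratio(n_responses, every=5, seed=0):
    rng = random.Random(seed)
    return [rng.random() < 1 / every for _ in range(n_responses)]

print(sum(fixed_ratio(100)), "rewards on a fixed-ratio schedule")
print(sum(variable_ratio(100)), "rewards on a variable-ratio schedule")
# Similar totals, but on the variable schedule the subject never knows which
# response will pay off -- the combination Skinner found most effective.
```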

Any given subject, however, is unlikely to stumble upon the exact correct response immediately, especially in a sparse-reward environment like a chess game. So behavioral shaping must often occur gradually, with rewards at first for behaviors that are merely close to correct. Another method is to start from the final stage, for example with checkmate positions, and work backwards from there.

There are some clear pitfalls, though, when trying to shape a desired behavior. Christian gives the example of promising a child a piece of candy for every time they help their sibling use the bathroom. The child in question did help their sibling, certainly, but also began force-feeding the sibling water in order to increase the rewards! One solution is to reward the subject based on the state of the world, not on their actions: for example, a candy for each day the sibling avoids any accidents. This also addresses the possibility that the action might become harmful in a different environment.

For instance, humans feel great when they eat sugar because it was evolutionarily advantageous to seek out this source of calories whenever possible in environments with more scarcity. But now that availability has changed, we often crave sugar past the point where it is helpful to our bodies.
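
Returning to the candy example, here is a toy contrast (my own illustration of the chapter’s point) between rewarding the action and rewarding the state of the world:

```python
# Rewarding the action invites gaming the reward; rewarding the state does not.
def reward_per_action(events):
    # One candy per assisted bathroom trip -- force-feeding water earns more candy.
    return sum(1 for e in events if e == "bathroom trip")

def reward_per_state(had_accident_today):
    # One candy per accident-free day, regardless of how many trips it took.
    return 0 if had_accident_today else 1

events = ["bathroom trip"] * 9      # an over-incentivized afternoon
print(reward_per_action(events))    # -> 9 candies for gaming the metric
print(reward_per_state(False))      # -> 1 candy for the outcome we care about
```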

Chapter Six: Curiosity

In 2008, Michael Bowling set up a library of Atari games for use in reinforcement learning research. The goal was to create an AI that could play all the games in the library with just the pixels on the screen as input. A Deep Q-Network eventually learned to play many of them at or above human level, with one notorious exception. Montezuma’s Revenge, a 1984 game, demands extensive exploration, offers only sparse rewards, and is full of perils. The reason it posed such a problem has to do with the concept of shaping: if rewards are sparse, and most actions lead to losing a life, how do you motivate the AI to keep moving?

The answer researchers found was novelty. If they trained the AI simply to avoid death, it might stay forever in a safe location on the first level. So, instead, they decided to reward novelty: if the AI’s actions led to an area of the game it had never seen before, those actions were treated as having a positive outcome. Under this model, the AI went from exploring just 2 rooms of the game to 15.
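
A minimal count-based novelty bonus, sketched here as my own illustration rather than the researchers’ actual method, captures the idea: rarely visited states add an intrinsic reward on top of the game’s own sparse rewards.

```python
from collections import defaultdict

visit_counts = defaultdict(int)

def reward_with_novelty(state, game_reward, bonus_scale=1.0):
    # Track how often each state has been seen; novel states earn a bigger bonus.
    visit_counts[state] += 1
    novelty_bonus = bonus_scale / (visit_counts[state] ** 0.5)
    return game_reward + novelty_bonus

print(round(reward_with_novelty("room_1", 0.0), 2))  # first visit: big bonus
print(round(reward_with_novelty("room_1", 0.0), 2))  # repeat visit: smaller bonus
print(round(reward_with_novelty("room_2", 0.0), 2))  # brand-new room: big bonus again
```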

Christian draws a parallel with curiosity as an intrinsic motivation for human learners. Humans find motivation in novelty, surprise, and feelings of mastery even in contexts where no external reward or punishment is present, so in a sparse-reward environment such as Montezuma’s Revenge, weighting novelty can be a good solution.

Chapter Seven: Imitation

In his discussion of imitation, Christian begins with a note about the irony of the term “aping.” In fact, humans are the most prone of any primate to “ape” other beings. While imitative behavior is notably scarce in monkeys, human infants will begin imitating caregivers who stick out their tongues within just forty minutes of birth. They’ll even start doing strange actions to see if you’ll begin imitating them.

Even more fascinating is the phenomenon of overimitation, in which the learner repeats actions that are irrelevant to the goal. For example, if someone happens to sneeze while showing me how to start my car, I might try to reproduce the sneeze alongside the key turn! Christian reports interesting findings in this area. Basically, if the learner cannot determine for certain that the extra action, like the sneeze, is unnecessary, they’ll imitate it just in case, especially if the teacher presents themselves as an expert. But if the teacher appears to be experimenting, it’s less likely that the learner will overimitate. 

Advantages

Christian presents three advantages to the use of imitation in AI. 

  1. Learning from someone else’s work, instead of your own
  2. Safety
  3. Avoiding the difficulty of describing what you are trying to do

Imitation has been particularly effective in tasks like teaching AI systems to steer vehicles. In such a high-stakes setting, trial and error is not an option. But there’s a problem: what if the teacher makes a mistake? Or what if the AI makes a mistake, and has never seen the expert recover from anything like it? Researchers can address this by introducing controlled examples of mistakes, such as the view of a road when a driver has drifted out of the lane, and letting the AI observe how the expert moves the controls in response. That way, the AI does not interpret the drift outside the lanes as ideal behavior, as it might if given only random examples of driver behavior.
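
Here is a rough sketch, with hypothetical function names and values, of what augmenting imitation data with recovery examples can look like: each artificially off-center view is paired with the corrective steering an expert would apply.

```python
# Pair off-center views with corrective steering so the learner sees how to
# return to the lane rather than treating the drift itself as expert behavior.
def make_recovery_example(centered_view, lane_offset, correction_gain=0.5):
    shifted_view = {"image": centered_view, "offset": lane_offset}
    corrective_steering = -correction_gain * lane_offset  # steer back toward center
    return shifted_view, corrective_steering

dataset = []
for offset in (-1.0, -0.5, 0.5, 1.0):          # simulated lane departures
    dataset.append(make_recovery_example("frame_0042", offset))

for view, steering in dataset:
    print(f"offset {view['offset']:+.1f} -> corrective steering {steering:+.2f}")
```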

This chapter concludes with an interesting discussion of possibilism vs. actualism, which I’ll leave you to explore for yourself. In short: when you can’t perform as well as the expert, should you still aim for their best example, or only for the best that you yourself are able to do?

Chapter Eight: Inference and AI

Another helpful concept in the training of AI systems is “inverse reinforcement learning.” In humans, this might look like a child helping an adult open a cabinet that they appear to be struggling with. In order to do so, the child must infer the end goal of the set of actions they’re observing. Importantly, this means that the helper must adopt the goals and values of the person they’re observing, rather than the exact actions. 

In the context of AI training, this could look like giving the system a compilation of expert attempts at a goal. Upon analysis, the AI should be able to infer the end goal, and reasonably assume that more frequent actions are closer to success than one-off variations. This can even work for situations where human experts struggle to perform the goal, allowing for artificial intelligence to surpass the experts. 
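
Here is a toy sketch of that frequency intuition (not a full inverse reinforcement learning algorithm), with made-up demonstration traces: states that recur across many expert attempts are presumed closer to the goal than one-off detours.

```python
from collections import Counter

# Toy demonstrations of experts opening a cabinet; "detour_A" appears in only
# one trace and so is presumed less relevant to the underlying goal.
demonstrations = [
    ["start", "hallway", "kitchen", "cabinet_open"],
    ["start", "hallway", "detour_A", "kitchen", "cabinet_open"],
    ["start", "hallway", "kitchen", "cabinet_open"],
]

state_counts = Counter(state for demo in demonstrations for state in demo)
inferred_value = {s: count / len(demonstrations) for s, count in state_counts.items()}
print(sorted(inferred_value.items(), key=lambda kv: -kv[1]))
# "cabinet_open" and the states leading to it score highly; "detour_A" does not.
```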

Chapter Nine: Uncertainty

In the final chapter before the conclusion, Christian explores the problem of “brittleness” in AI. Many artificial intelligence systems perform extremely well within a narrow range of circumstances but struggle outside that arena. A memorable example is the 1983 incident in which glancing sunlight was classified as incoming missiles, an error that a human thankfully recognized and corrected.

Christian lays out several attempts to capture the level of certainty in AI decisions. For example, if an image-identification AI is performing well, running the same image through it multiple times should yield roughly the same result each time, so tracking the variation across runs can offer some insight into how certain the identification is. Another way is to train multiple models for the same task and compare their results. Finally, researchers can selectively shut down parts of an artificial intelligence to approximate uncertainty, and look for differences in the results. Whatever the method, the less certain decisions can then go through human review.
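
As a sketch of the “train multiple models and compare” idea, assuming hypothetical model scores, disagreement across an ensemble can be used to route cases to a human:

```python
import statistics

# Several independently trained models score the same input; high disagreement
# sends the case to a human reviewer.
def needs_human_review(model_scores, disagreement_threshold=0.15):
    spread = statistics.pstdev(model_scores)   # disagreement across the ensemble
    return spread > disagreement_threshold

confident_case = [0.91, 0.93, 0.90, 0.92]      # models agree: let it through
uncertain_case = [0.20, 0.75, 0.55, 0.90]      # models disagree: flag it
print(needs_human_review(confident_case))      # -> False
print(needs_human_review(uncertain_case))      # -> True
```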

Irreversible Actions

Another approach is to encourage the AI system not to take irreversible actions, such as pushing all the pieces into a corner during a game that forbids pulling pieces. However, it is difficult sometimes for an artificial intelligence to define just what a human means by “irreversible.” Any action could technically be just that, since we can’t turn back time. 

In extreme cases, it might even be best for the artificial intelligence to shut itself down, which leads Christian into a discussion of Catholic moral theology. The author calls the incentivization of shutdown “brownie points in heaven,” and discusses the laxist vs. rigorist approaches to morality. Should we prioritize doing as much good as possible? Or doing as much good as we can without doing any harm? If we ourselves cannot answer these questions easily, we must, in Christian’s view, proceed with extreme caution when training artificial intelligence systems to make decisions with any moral weight.

Conclusion

Christian ends the book with an anecdote about a faulty thermometer reading. Perhaps you’ve experienced this yourself: when a door in the house is closed, the heating system gets an inaccurate reading and continues to pump out heat until the closed room is sweltering. Christian notes that in this real-life example the situation never became deadly, thanks to the heating system’s own limitations: it could not heat the room to a dangerous temperature, just an uncomfortable one. This is the author’s final argument for taking it slow when it comes to assigning power to artificial intelligence.

See page 314 of the conclusion for a chapter-by-chapter summary of the topics.

This review of The Alignment Problem is part of the 2024 Manufactured Minds Trio. Follow me on Patreon for access to the monthly trio of selections, supporting materials, and community discussion.
