When Graphs Are a Matter of Life and Death
Where van Langren had abstracted the range of longitudinal estimates into a line, Playfair had gone further. He discovered that you could encode time by its position on the page. This idea may have come naturally to him. Friendly and Wainer describe how, when Playfair was younger, his brother had explained one way to record the daily high temperatures over an extended period: he should imagine a bunch of thermometers in a row and record his temperature readings as if he were tracing the different mercury levels; from there, it was only a small step to let the image of the thermometer fade into the background, use a dot to represent the top of the column of mercury, and line up the dots from left to right on the page. By visualizing time on the x-axis, Playfair had created a tool for making pictures from numbers, one that offered a portal to a much deeper connection with time and distance. As the industrial age emerged, this proved to be a life-saving insight.
Back when long-distance travel was provided by horse-drawn stagecoaches, departure timetables were suggestive rather than definitive. Where schedules did exist, they would often be listed alongside caveats, such as “barring accidents!” or “God permitting!” Once passenger railways started to open up, in the eighteen-twenties and thirties, train times would be advertised, but, without nationally agreed-on time and time zones, their punctuality fell well shy of modern standards. When George Hudson, the English tycoon known as the Railway King, was confronted with data showing how often his trains ran late, he countered with the data on how often his trains were early, and insisted that, in net terms, his railway ran roughly on time.
As train travel became increasingly popular, patience was no longer the only casualty of this system: head-on collisions started to occur. With more lines and stations being added, rail operators needed a way to avoid accidents. A big breakthrough came from France, in an elegant new style of graph first demonstrated by the railway engineer Charles Ibry.
In a presentation to the French Minister of Public Works in 1847, Ibry displayed a chart that could show simultaneously the locations of all the trains between Paris and Le Havre in a twenty-four-hour period. Like Playfair, Ibry used the horizontal axis to denote the passing of time. Every millimetre across represented two minutes. In the top left corner was a mark to denote the Paris railway station, and then, down the vertical axis, each station was marked out along the route to Le Havre. They were positioned precisely according to distance, with one kilometre in the physical world corresponding to two and a half millimetres on the graph.
With the axes set up in this way, the trains appeared on the graph as simple diagonal lines, sweeping from left to right as they travelled across distance and time. In the simplest sections of the rail network, with no junctions or crossings or stops, you could choose where to place the diagonal line of each train to insure that there was sufficient spacing around it. Things got complicated, however, if the trains weren’t moving at the same speed. The faster the train, the steeper the line, so a passenger express train crossed quickly from top to bottom, while slower freight trains appeared as thin lines with a far shallower angle. The problem of scheduling became a matter of spacing a series of differently angled lines in a box so that they never unintentionally crossed on the page, and hence never met on the track.
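The geometry of Ibry's diagram translates directly into arithmetic: each train is a straight line in time–distance space, its slope set by its speed, and a potential collision is simply the point where two such lines intersect. The sketch below is a toy illustration of that idea, with invented trains and round numbers, not Ibry's actual timetable:

```python
# Each train is a line in time–distance space: its position (km from Paris)
# is a linear function of time (minutes after midnight).
# A train is (departure_time, starting_km, speed); speed is in km per
# minute, signed by direction of travel. All values here are hypothetical.

def position(train, t):
    depart, origin_km, speed = train
    return origin_km + speed * (t - depart)

def crossing_time(a, b):
    """Return the time at which trains a and b occupy the same point on
    the track, or None if their lines are parallel (equal speeds)."""
    (ta, xa, va), (tb, xb, vb) = a, b
    if va == vb:
        return None
    # Solve xa + va*(t - ta) == xb + vb*(t - tb) for t.
    return (xb - vb * tb - (xa - va * ta)) / (va - vb)

# An express leaving Paris (km 0) at 9 a.m., running at 90 km/h toward
# Le Havre, and a freight leaving Le Havre (km 228) at 8 a.m., running
# at 45 km/h toward Paris.
express = (540, 0.0, 1.5)      # 9:00 a.m., 1.5 km/min
freight = (480, 228.0, -0.75)  # 8:00 a.m., -0.75 km/min

t = crossing_time(express, freight)
print(t, position(express, t))  # where the two diagonals meet on the page
```

On paper, a scheduler avoids the collision by sliding one diagonal left or right (changing its departure time) until the crossing falls inside a station rather than a tunnel; the code's job is only to find where the lines meet.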
These train graphs weren’t meant to be illustrations—they weren’t designed to persuade or to provide conceptual insight. They were created as an instrument for solving the intricate complexities of timetabling, almost akin to a slide rule. Yet they also constituted a map of an abstract conceptual space, a place where, to paraphrase the statistician John Tukey, you were forced to notice what you otherwise wouldn’t see.
Within a decade, the graphs were being used to create train schedules across the world. Until recently, some transit departments still preferred to work by hand, rather than by computer, using lined paper and a pencil, angling the ruler more sharply to denote faster trains on the line. And contemporary train-planning software relies heavily on these very graphs, essentially unchanged since Ibry’s day. In 2016, a team of data scientists worked out that a series of unexplained disruptions on Singapore’s MRT Circle Line was caused by a single rogue train. On board, the train appeared to be operating normally, but as it passed other trains in the tunnels it would trigger their emergency brakes. The pattern could not be seen by sorting the data by trains, or by times, or by locations. Only when a version of Ibry’s graph was used did the problem reveal itself.
In their early years, Friendly and Wainer tell us, the main forms of data graphics—pie charts, line graphs, and bar charts—tended to take a one-dimensional view of their data. Playfair’s line graph of Navy expenditures, for instance, was concerned only with how that one variable changed over time. But, as the nineteenth century progressed, graphs began to break free of their one-dimensional roots. The scatter plot, which some trace back to the English scientist John Herschel, and which Tufte heralds as “the greatest of all graphical designs,” allowed statistical graphs to depict two continuous variables at once—temperature, or money, or unemployment rates, or wine consumption—whether or not they had a real-world physical presence. Rather than featuring a single line joining single values as they move over time, these graphs could present clouds of points, each plotted according to two variables.
Their appearance is instantly familiar. As Alberto Cairo puts it in his recent book, “How Charts Lie,” scatter plots got their name for a reason: “They are intended to show the relative scattering of the dots, their dispersion or concentration in different regions of the chart.” Glancing at a scatter plot allows you to judge whether the data is trending in one direction or another, and to spot clusters of similar dots hiding in the numbers.
A famous example comes from around 1911, when the astronomers Ejnar Hertzsprung and Henry Norris Russell independently produced a scatter plot of a series of stars, plotting their luminosity against their color, moving across the spectrum from blue to red. (A star’s color is determined by its surface temperature; its luminosity, or intrinsic brightness, is determined both by its surface temperature and by its size.) The result, as Friendly and Wainer concede, is “not a graph of great beauty,” but it did revolutionize astrophysics. The scatter plot showed that the stars were distributed not at random but concentrated in groups, huddled together by type. These clusters would prove to be home to the blue and red giants, and also the red and white dwarfs.
In graphs like these, the distance between any two given dots on the page took on an entirely abstract meaning. It was no longer related to physical proximity; it now meant something more akin to similarity. Closeness within the conceptual space of the graph meant that two stars were alike in their characteristics. A surprising number of stars were, say, reddish and dim, because the red dwarf turned out to be a significant category of star; the way stars in this category clustered on the scatter plot showed that they were conceptually proximate, not that they were physically so.
But if you could find clusters of dots in two dimensions, why not three? Friendly and Wainer discuss a three-dimensional scatter plot that improved our understanding of Type 2 diabetes. In 1979, two scientists, Gerald M. Reaven and R. G. Miller, plotted blood-glucose levels against the production of insulin in the pancreas for a series of patients. Along a third axis, they added a metric for how efficiently insulin is used by the body. What emerged was a three-dimensional structure that looks a little like an egg with floppy wings. It allowed Reaven and Miller to split participants into three groups—those with overt diabetes, those with latent diabetes, and those who were unaffected—and to understand how patients might transition from one state to another. Previously it had been thought that overt diabetes was preceded by the latent stage, but the graph showed that the only “path” from one to the other was through the region occupied by those classified as normal. Because of this and evidence from other studies, overt and latent diabetes are now considered two separate disease classes.
If three dimensions are possible, though, why not four? Or four hundred? Today, much of data science is founded on precisely these high-dimensional spaces. They’re dizzying to contemplate, but the fundamental principles are the same as those of their nineteenth-century scatter-plot predecessors. The axes could be the range of possible answers to a questionnaire on a dating Web site, with individuals floating as dots in a vast high-dimensional space, their positions fixed by the responses they gave when they signed up. In 2012, Chris McKinlay, a grad student in applied mathematics, worked out how to scrape data from OkCupid and used this strategy—hunting for dots in a similar region, in the hope that proximity translated into romantic compatibility. (He says the eighty-eighth time was the charm.) Or the axes could relate to your reaction to a film on a streaming service, or the amount of time you spend looking at a particular post on a social-media site. Or they could relate to something physical, like the DNA in your cells: the genetic analysis used to infer our ancestry looks for variability and clusters within these abstract, conceptual spaces. There are subtle shifts in the codes for proteins sprinkled throughout our DNA; often they have no noticeable effect on our development, but they can leave clues to where our ancestors came from. Geneticists have found millions of these little variations, which can be shared with particular frequency among groups of people who have common ancestors. The only way to reveal the groups is by examining the variation in a high-dimensional space.
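The arithmetic behind these high-dimensional spaces is the same as in the two-dimensional scatter plot: distance between points stands in for similarity, and the Pythagorean formula generalizes unchanged to any number of axes. The sketch below is a minimal illustration, with three invented people and five made-up questionnaire axes rather than the hundreds a real service would use:

```python
import math
from itertools import combinations

# Hypothetical questionnaire answers, each scaled 0 to 1: every person
# is a point in a five-dimensional space. Names and values are invented.
people = {
    "ann":  [0.9, 0.1, 0.8, 0.2, 0.7],
    "ben":  [0.8, 0.2, 0.9, 0.1, 0.6],
    "cara": [0.1, 0.9, 0.2, 0.8, 0.1],
}

def distance(p, q):
    # Euclidean distance works identically however many axes there are.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Rank the pairs by proximity: closeness in the abstract space is read
# as similarity, just as with the clustered stars on a scatter plot.
pairs = sorted(combinations(people, 2),
               key=lambda pair: distance(people[pair[0]], people[pair[1]]))
print(pairs[0])  # prints ('ann', 'ben'), the most similar pair
```

Clustering algorithms such as k-means do essentially this at scale, grouping millions of points by nothing more than their mutual distances in a space no one ever draws.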
These are scatter plots that no one ever needs to see. They exist in vast number arrays on the hard drives of powerful computers, turned and manipulated as though the distances between the imagined dots were real. Data visualization has progressed from a means of making things tractable and comprehensible on the page to an automated hunt for clusters and connections, with trained machines that do the searching. Patterns still emerge and drive our understanding of the world forward, even if they are no longer visible to the human eye. But these modern innovations exist only because of the original insight that it was possible to think of numbers visually. The invention of graphs and charts was a much quieter affair than that of the telescope, but these tools have done just as much to change how and what we see. ♦