This is the blog of Vivek Haldar

Now today, in the 21st century, we have a better way to attack problems. We change the problem, often to one that is more tractable and useful. In many situations solving the exact problem is not really what a practitioner needs. If computing X exactly requires too much time, then it is useless to compute it. A perfect example is the weather: computing tomorrow’s weather in a week’s time is clearly not very useful.

The brilliance of the current approach is that we can change the problem. There are at least two major ways to do this:

Change the answer required. Allow approximation, or allow a partial answer. Do not insist on an exact answer.

Change the algorithmic method. Allow algorithms that can be wrong, or allow algorithms that use randomness. Do not insist that the algorithm is a perfect deterministic one.

Shifts in algorithmic design

“The conservative view sees this as an automotive technology, and most of them are very used to thinking about automotive technology. For the aggressive school, where I belong, this is a computer technology, and will be developed — and change the world — at the much faster pace that computer technologies do.”

The two cultures of robocars

When to do a rewrite

Spolsky’s famous screed against rewriting from scratch is now canonical. All the same, there are valid reasons to do a rewrite:

  • Your understanding of the domain, or the domain itself, has changed drastically in the process of building the first version.
  • The scale for which you built the initial system has grown by more than an order of magnitude. The right design for millions of users is very different than that for thousands.
  • Newer languages and runtimes make the product much better along some dimension: deliver better performance, be more maintainable etc.
  • You have accumulated so much technical debt that starting from scratch is easier than untangling the mess you have. This is often the most misleading reason, because engineers always prefer a system they wrote from scratch to inheriting one which works just fine. But even so, it does happen.

None of the above takes away from what Spolsky wrote. A rewrite is still a risky endeavor, even if one or more of the above conditions hold true, and should always be approached with caution and humility.

His argument was much stronger in the era of software that was shrink-wrapped to PCs, because the slow rate of change of the surrounding environment made many of the above reasons inapplicable.

But server-side software that powers online services is always in a constant state of flux. There is rarely any concept of “done”. The product might look more or less the same from the outside, while the backend code constantly churns to deal with growing scale, new stacks, and other services it depends on.

Learning your life

If you are a “regular” person living and working in modern civilization, it is likely that your life has a few recurring patterns:

  • on weekdays, you travel from home to work in the morning, and back in the evening; perhaps punctuated with a school or daycare drop-off and pick-up.
  • on weekends, you either mostly stay at home or frequent a small set of haunts (beaches, restaurants etc).
  • infrequently, you break out of this pattern by leaving the orbit of work-home to go on a vacation.

This is easy to figure out by observing someone for a short period of time. Of course, your phone already does this.

The above model of your life could be described in a few kilobytes. It is a zeroth-order machine learning problem. Is that depressing? Does that mean your whole life has a few kilobytes worth of complexity?

John Foreman, in his poignantly named piece “Data Privacy, Machine Learning and the Destruction of Mysterious Humanity”, takes a view of the field from the inside, and it sounds like that of a physicist working on the Manhattan Project worrying about the impact of what they’re building.

Our past data betrays our future actions, and rather than put us in a police state, corporations have realized that if they say just the right thing, we’ll put the chains on ourselves… Yet this loss of our internal selves to the control of another is the promise of AI in the hands of the private sector. In the hands of machine learning models, we become nothing more than a ball of probabilistic mechanisms to be manipulated with carefully designed inputs that lead to anticipated outputs… The promise of better machine learning is not to bring machines up to the level of humans but to bring humans down to the level of machines.

That is a dim, if plausible, view.

It is based on a pessimistic view of humans as always giving in to nudges to behave as our worst selves. I hope that we are better than that. We are already surrounded by advertising for junk food, and yet a substantial fraction of us manage to ignore that and eat healthy. I hope machine predictions that try to nudge us face a similar fate: predictions that look down on us will be indistinguishable from low-quality ones. The quality and tone of predictions will become an important part of the brand identity of the company emitting them. Do you really want to be the company known for peddling fatty, salty food, legal addictive substances, and predatory loans?

A tangential point: naming matters. Statistical inference, which is what machine learning really is, is a less scary name. “Statistical inference” conjures images of bean-counters with taped glasses; “machine learning” of an inevitable army of terminators.

Circling back to the question I posed above: is your life really describable in a few kilobytes in a predictive model? No. A model is a probabilistic abstraction. You, and your life, are not a sequence of probabilities, in much the same way that hiking a trail is a much richer, “realer” experience than tracing your way through a detailed trail map.

“The AP announced last month that it would use Automated Insights’ software, called Wordsmith, to produce up to 4,440 robot-written corporate-earnings reports per quarter, more than ten times the number its human reporters currently produce. When Alcoa’s earnings report hit the wire that day, the data was instantly compiled by a firm called Zacks Investment Research and passed through the AP’s proprietary algorithm, which pulled out key numbers and phrases and matched them against other contextual information. In milliseconds, the software produced a full story written in AP style, indistinguishable from one made with human hands. In short order, the story appeared on many AP member sites.”

Robot Journalism

“Here’s the tablet news app I’m hoping for in 2014: something that detects when I’m lying down in bed (it knows my device orientation and, using the ambient light sensor, can detect it’s dark) and serves me the sort of video content that I never have time to watch during the day.”

Katie Zhu

Garland Words

I define a garland word to be one which can be formed by chopping off the last few letters and “wrapping around” to the start.

For example: You can chop off the the trailing “on” from “onion” and still form the word by wrapping around to the starting “on”. In other words, you can write the letters “o”, “n”, “i” in a circle, in that order, and form the word “onion”.

I say that a garland word is of degree n if you can do the above with the last n letters of the word.

It turns out that garland words are fairly rare. In the English dictionary that came installed on my Ubuntu machine, there were 99171 words, only 199 words were garland words of degree 2 or above. The highest garland degree was 4.

The exact distribution was: there were only 3 words of degree 4, 32 of degree 3, and 164 of degree 2.

The degree 4 garland words were: “hotshots”, “beriberi” and “abracadabra”.

Lurking Smalltalk

Stephen Kell’s paper at PLOS 2013 meditates on one of CS’ oldest design questions: what should be in the operating system, and what in the programming language and runtime? The paper was a breath of fresh air; one of those opinionated design thinking papers that one rarely sees in academic CS venues. Do read the whole thing.

This question comes into stark relief if you look at what a modern language or VM runtime does: among other things, it multiplexes resources, and protect access to them. But that’s also exactly what the OS is supposed to do!

The paper riffs on an earlier idea from Dan Ingalls: “An operating system is a collection of things that don’t fit into a language. There shouldn’t be one.” The central problem tackled in the paper is that of having more generic object bindings in the OS, as opposed to the “static” filesystem bindings for most operations. Kell uses the following example:

Suppose we wish to search some object for text matching a pattern. If the object is a directory tree of text files, grep -r does exactly that. But suppose instead our directory tree contains a mixture of zipped and non-zipped files, or is actually some non-directory collection of text-searchable objects, such as a mailbox. In Smalltalk parlance, how can we make these diverse objects “understand” (via a collection of interposed objects) the messages of our grep -r (and vice versa)?

This would indeed be elegant. But I’m afraid the engineer in me isn’t terribly excited by that.

“All problems in computer science can be solved by another level of indirection” is a famous quote in computer science. The next sentence is much lesser known: “But that usually will create another problem.” What Kell is proposing adds flexibility in binding by introducing indirection.

Indirection is controlled by some sort of configuration the binds each side of it. And configuration implies a lot of added complexity. The UNIX model of “everything is a stream of bytes and can be treated like a file” might be simple and unstructured, but that is precisely what enabled a multitude of structure to be built on top of it.

Heartbeats

It’s commencement—and commencement speech—season. The one that’s been on my mind the most this year has been Paul Ford’s speech to a graduating class of interaction designers. His unifying thread is our conception of time:

The only unit of time that matters is heartbeats. Even if the world were totally silent, even in a dark room covered in five layers of foam, you’d be able to count your own heartbeats.

When you get on a plane and travel you go 15 heartbeats per mile. That is, in the time it takes to travel a mile in flight your heart beats 15 times. On a train your heart might beat 250 times per mile…

… If we are going to ask people, in the form of our products, in the form of the things we make, to spend their heartbeats—if we are going to ask them to spend their heartbeats on us, on our ideas, how can we be sure, far more sure than we are now, that they spend those heartbeats wisely?

Turns out, heartbeats are a great unit of time for another reason: all animals have approximately the same number of heartbeats in a lifetime.

With each increase in animal size there is a slowing of the pace of life. A shrew’s heart beats 1,000 times a minute, a human’s 70 times, and an elephant heart beats only 28 times a minute. The lifespans are proportional; shrew life is intense but brief, elephant life long and contemplative. Each animal, independent of size, gets about a billion heartbeats per life.

I find that deeply poetic.

The biggest winners may be lawyers who can use machine intelligence to create an automated large-scale practice. The Walton family, it’s worth recalling, got rich by effectively automating large-scale retail. More generally, there may be jobs for a new category of engineer-lawyers—those who can program machines to create legal value.

But the large number of journeyman lawyers—such as those who do routine wills, vet house closings, write standard contracts, or review documents on a contractual basis—face a bleak future.

Machines v. Lawyers, by John O. McGinnis

“Autodesk’s automated design software works through three layers. Generators populate physical space with matter, creating thousands of design suggestions in a split-second. Analyzers test each design against constraints developed by the designer: does it fit in the desired shape? Does it satisfy strength, thermal, material, and mechanical requirements? Above those, an optimization system finds the best designs, returns them to the starting point, and refines them repeatedly.”

The Automation of Design

The Eames house was everything I wanted: modern, but lived in, made out of everyday, accessible materials. No pristine white, dustless Dwell magazine show rooms here. There are plants and books and lights and seats and dust and you get the feeling it is a beautiful, functional space. There are finger prints all over it. The more I click around the web, the less and less I sense those fingerprints on people’s websites the way I used to. I’m not sure if it’s because of templates, or because those things take time, or if people just don’t think it’s worth it. But I think those traces of people are worth the effort, so I wanted to really work towards having this site feel like it was made by a person, and not some smug guy in a very tight, black turtle neck. Has anyone ever smiled in Dwell?

I wanted something homey. Better yet: homely. Americans think of homely as something that’s unhandsome, maybe even ugly. But the Brits observe the root “home,” since they invented the damn language. Homely, for them, is like a house: familiar, comforting, easy. There’s a rightness to it. For me, the Eames house is homely, because they filled it to the gills with the things they loved. How great would it be to have that online, where you would not run out of shelves? It’s an old dream, and one that’s still alive, but we’ve outsourced it. I think that shelf belongs at home.

Frank Chimero, on the design of his new site

Why technologists want to secede

Tech’s secessionist streak has recently got a lot of attention. Balaji Srinivasan thinks we should build a new society based on technology, breaking free from the “legacy” institutions we’re currently hobbled with. Larry Page hopes for experimental societies where change and experimentation can proceed much faster. Elon Musk wants to go live on Mars.

Most reactions to this have been along the lines of “look at these rich powerful elites thinking they can up and leave and do without all the rest of us!” I want to focus not on the politics behind this, but the motivation.

These ideas are getting attention now because rich and powerful titans of tech are espousing them. But these are the secret fantasies of every geek, rich or not, powerful or not.

Geeks have deep isolationist tendencies. They prefer losing themselves in sci-fi worlds to hanging out with friends. They build an early affinity with machines and programs. They sense that programming is the ultimate in world-building, and what’s more, it’s executable! They build elaborate towers of abstraction. They are drunk on the clean rationality of the world that the machine and they have built and understand.

Eventually, they secretly wish for a more ordered world. A world that is not messy or unpredictable. A world where one might have to work hard to establish the priors, but once they are, everything follows.

It is telling that one of the central activities in programming is abstraction. Abstraction is nothing but taking a jumble of intricate detail, putting it in a tightly sealed box, and then poking very small, well-defined holes in the box to interact with that jumble. So clean! So much better!

But every abstraction is leaky. Ints carry the baggage of their bit-width. In a perfect world, every int would be a BigInt with unlimited range. But we live in a world where it matters whether ints fit in a hardware register or not. Networks carry the baggage of not being perfect links. In a perfect world, a link between two machines would be an immutable, always-available channel for sending messages between them. But we live in a world where wires break.

This is the central tension in the mind of every geek. The constant need to abstract, to order, to rationalize, pitted against the necessary realization the world and entropy will eventually thwart you.

Secession is simply the same need for order and abstraction carried into the geek’s mental model for society.

I wrote previously:

The world of humans is messy, unpredictable, and—this is the part that infuriates you the most, the part you simply don’t get, the part that will forever make you an outsider—unreasonable. Humans are unreasonable. Your whole life, you’ve trusted one thing to save you. And saved you it has, again and again, without fail. That one thing is reason. Computers are a tool of reason, to be handled by people who obey the code of reason. People who are involved with computers, they’re reasonable, just like you. You all share a deep, unspoken bond. You’re all worshipers in the church of reason.

Please don’t hold that against us.

What advice would you give to someone starting out?

"You have to be prepared to give creative work 150%. I hear a lot of young people talking about life/work balance, which I think is great when you’re in your 30s. If you’re in your 20s and already talking about that, I don’t think you will achieve your goals. If you really want to build a powerful career, and make an impact, then you have to be prepared to put in blood, sweat, and tears. I don’t think everybody is willing to do that, but if you have the opportunity to do so, you should. That’s why many people go to graduate school in their late 20s: it forces them to devote intense time and focus to their work. It’s an experience that will change you forever."

Ellen Lupton, interviewed by Tina Essmaker.

Bleeding buffers

A lot has been written about Heartbleed. Here are my two cents.

The problem is that all our modern computing infrastructure is written in a language, C, that was designed to be a thin layer over assembly and hence has no memory safety. A memory-safe language, which is to say, any modern post-C/C++ language, still allows writing buggy code that can be exploited—for example, look at the raft of SQL injection attacks in Java backends—but the bar for exploitation is much higher and the space of possible attacks much smaller when an attacker can’t turn array dereferences into a way to read and write arbitrary memory addresses.

Type and memory safe languages have existed for a while now. Language based security and static analysis is still a thriving research area, but it is also mature enough that a broad set of results and techniques have been found to be generally useful, many of which have even made it into mainstream languages and tools. Why then are we still in this state?

The two immediate answers that come to mind—improving the safety of C/C++, and writing infrastructure in safe languages—are, upon closer inspection,riddled with complexities.

Making C/C++ memory safe is possible, but it is not as simple as flipping a switch. There is a price to be paid in performance, of course, but even if one were willing to pay that, there might be scores of subtle behavior changes in code made safe this way. There is probably tons of code out there that depends on unsafe behavior.

Writing SSL in a safe(r) language is possible. But everything that would want to link it (browsers and web servers) is written in C. One could use cross-language calling facilities, but that is rarely clean due to impedance mismatches, and of course, that still leaves the calling code memory-unsafe. You’d have to rewrite all the net’s critical infrastructure in a safe language and runtime. That’s wishful thinking.

When physical infrastructure decays it is visible. A new road uses the latest knowledge and practices for building roads, not those from thirty years ago. Neither is the case for our computing and network infrastructure.

In a strange inversion of the commonly held belief that hardware is hard to change and software is easily changed, some burden for safety might eventually fall on the processor. One example was when CPUs started supporting the no-execute bit to prevent code injection attacks. I wouldn’t be surprised to see more hardware features along those lines.

Relevant links:

Copyright 2008-2013 Vivek Haldar