Coding in natural language: let’s start small

The idea of writing a computer program by writing English (or another natural human language) is attractive because it might make coding easier and faster. This article tells the story of my encounter with natural language programming as a graduate student, and the small working system I built. I discuss the idea of context limiting: we can improve the user experience as well as the system’s performance by having clearly delimited boundaries within which the system operates, rather than replacing code with natural language in arbitrary contexts.

Introduction: why program in natural language?

The history of general-purpose programming language design has been a slow march from languages at very low levels of abstraction (i.e., those which expose details of the underlying machine, such as assembly languages) towards so-called ‘higher level’ languages. The purpose of programming language design, it can be argued, is to make the activity of programming as close as possible to the pure expression of intent. That is, to strip away from programming all concerns that are not related to what the programmer is trying to achieve.

Writing in a natural language, such as English, is close to a pure expression of intent. Yes, the mechanisms by which you write, the language you use, and indeed the fundamental properties of the activity of writing themselves offer resistance and shape intent. But in comparison to programming, writing down what you mean in natural language requires little or no conscious consideration of aspects unrelated to what you’re trying to express in that instant.

This raises an obvious question: can we make it possible to write computer programs in natural language? In order to do so, we would need to develop a system capable of reliably translating natural language statements into conventional programming languages. Traditionally, this problem has been seen as an insurmountable challenge, since natural language is so complex and ambiguous, and computer language so precise.

Advances in deep learning architectures and training have brought us closer than ever to realising natural language programming. OpenAI’s Codex and DeepMind’s AlphaCode are capable of generating correct programs, seeded essentially by a natural language prompt. This technology has already been commercialised as a software development tool in the form of GitHub Copilot. The tool acts as a form of advanced autocomplete, generating programs from natural language comments, completing repetitive lists, automatically generating test cases for code already written, and proposing alternative solutions. Though these systems are far from perfect and there is a lot of work to do yet, it’s awesome to see this technology entering the realm of applicability.

But this isn’t an article about the shiny new deep learning technology, interesting as it is. This article is the story of my own little exploration of building a natural language programming system, which took place nearly ten years ago.

A problem with writing statistics code

In 2013, I was in the first year of my Ph.D. on developing better interfaces for data analysis. I was steeped in data analysis myself, having recently left a full-time job as a data scientist, and fresh from the experience of writing statistics code to analyse data for several experiments I had run as a Master’s student.

While writing statistics code in the R programming language, I was frustrated that I had to constantly look up documentation to do very simple things. For example, to generate a sample of random numbers from a standard normal distribution, you need the function rnorm. I would keep forgetting this (is it norm, normal, rnormal, randn?) and have to go look it up. If I wanted to vary the parameters of the distribution (mean and standard deviation), I’d have to look it up.

Despite the fact that I knew there was a function that did exactly what I wanted it to do, and I knew all the data the function needed to do its job (the number of samples, mean, and standard deviation), I was still hindered by not knowing the specific name of the function and the names and order of its parameters (the function ‘signature’, in programming parlance). This problem plagued me ceaselessly, several times during a programming session, and each time the trip to search the web for documentation and examples would draw my attention away from my core activity, and disrupt my state of flow.

Why couldn’t I just write: “a random sample of size 100, with mean 0 and standard deviation 1”, and have the system generate the code rnorm(100, mean=0, sd=1), thus saving me a very straightforward round of documentation searching?

I had stumbled across a very specific but nonetheless common class of problem encountered by programmers. I didn’t give it a name in 2013, but I shall do so now (mostly for convenience of reference, but perhaps a little for vanity). I call it the familiar invocation problem. A programmer is facing the familiar invocation problem when their situation has the following properties:

  1. They know a function exists that will solve their needs
  2. They have all the information (arguments) the function requires
  3. However, they cannot recall the function signature (name and order of arguments)
  4. Nonetheless, they can verify by sight whether a specific bit of code is what they needed. That is, they must already be familiar with usage of the function; they have used it or looked it up before.

Criteria 1 and 2 are knowledge prerequisites; criterion 3 introduces the problem. It is criterion 4, the familiarity criterion, that really makes this entire approach plausible: being able to recognise correct solutions and identify incorrect solutions from memory is what will save the programmer the trip to the browser to look up documentation.

Rticulate: my natural language programming system from 2013

Having identified the familiar invocation problem, I set about building a proof of concept. I started precisely in the domain that had kindled my frustration: statistical programming in the R language. I named the system Rticulate, pronounced ‘articulate’.

Rticulate is a simple mechanism. I built an annotated dictionary of R functions. For each function, this dictionary contained the function’s name and the number, names, and types of its parameters, but it also contained synonyms and related terms for each. So, for example, the entry for rnorm contained the related words “random”, “normal”, and “distribution”, among others. While I initially built this dictionary by hand, I proposed that the process could be automated by mining documentation, as well as the words people use to describe the function on fora such as Stack Overflow.

Example entries from the Rticulate dictionary. Here three entries are shown, for the functions choose, rnorm and log. For each function and its arguments, the dictionary contains a type and synonyms. Arguments additionally have default values.
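To make the shape of such entries concrete, here is a minimal sketch in Python. It is not the original Rticulate data: the class and field names are assumptions, the synonyms (other than “random”, “normal”, and “distribution” for rnorm) are invented for illustration, and any default for an argument that R itself leaves without one is hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Argument:
    name: str
    type: str
    synonyms: List[str]
    default: Any = None  # inherited from R where R has one, otherwise hypothetical


@dataclass
class FunctionEntry:
    name: str
    type: str
    synonyms: List[str]
    arguments: List[Argument] = field(default_factory=list)


ENTRIES = [
    FunctionEntry("choose", "numeric", ["combinations", "binomial", "coefficient"], [
        Argument("n", "numeric", ["from", "total"], default=1),   # hypothetical default
        Argument("k", "numeric", ["pick", "select"], default=1),  # hypothetical default
    ]),
    FunctionEntry("rnorm", "numeric", ["random", "normal", "distribution"], [
        Argument("n", "numeric", ["samples", "size"], default=1),  # hypothetical default
        Argument("mean", "numeric", ["mean", "average"], default=0),  # R's own default
        Argument("sd", "numeric", ["deviation", "sd"], default=1),    # R's own default
    ]),
    FunctionEntry("log", "numeric", ["logarithm"], [
        Argument("x", "numeric", ["of", "number"], default=1),  # hypothetical default
        Argument("base", "numeric", ["base"], default=math.e),  # R's default is exp(1)
    ]),
]

# Index the entries by function name for lookup.
DICTIONARY = {entry.name: entry for entry in ENTRIES}
```

Keeping synonyms on both the function and each argument is what lets the same structure serve two lookups: finding the function, and then finding which argument a word in the query refers to.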

When the user enters a query, the system first matches the query to a function in the dictionary. This is, again, implemented quite simply: it treats both the query as well as the dictionary as a bag-of-words, and looks for the function that has the most terms in common with the user query.
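As a rough illustration of this matching step, the following Python sketch scores each function by how many of its terms also occur in the query. The term lists are hypothetical, and for simplicity the sketch treats each bag of words as a set (ignoring repeated words), which may differ from what Rticulate actually did.

```python
import re

# Hypothetical term lists; a real dictionary would be far larger.
FUNCTION_TERMS = {
    "rnorm": {"random", "normal", "normally", "distribution",
              "distributed", "sample", "samples"},
    "choose": {"choose", "combinations", "binomial", "coefficient"},
    "log": {"log", "logarithm", "base"},
}


def tokenize(text):
    """Lower-case the text and return its distinct alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def match_function(query):
    """Return the function sharing the most terms with the query, or None."""
    words = tokenize(query)
    scores = {name: len(words & terms) for name, terms in FUNCTION_TERMS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For the query “5 normally distributed samples with mean 6 and deviation 4”, the words “normally”, “distributed”, and “samples” overlap with the hypothetical rnorm terms and nothing overlaps with the other two entries, so match_function selects rnorm.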

Once the target function is found, the next challenge is to match values in the query to the intended arguments. Consider the query: “5 normally distributed samples with mean 6 and deviation 4”. This needs to be resolved to rnorm(n=5.0, mean=6.0, sd=4.0). The system scans the query to identify likely argument values (5, 6, 4) and likely references to arguments (“samples”, “mean”, “deviation”). Next, it matches likely values to likely parameters, using an optimisation algorithm to find an assignment that minimises the distance from each likely value to its likely parameters. In the example above this is straightforward: the words “mean” and “deviation” are right next to the values 6 and 4. The word “samples” is equally distant from 5 and 6, but if we assigned 6 to “samples”, we’d have to assign either 4 or 5 to “mean”, thus greatly increasing the overall distance of the assignment.

The optimisation process relies on a variant of the proximity principle, a linguistic idea which loosely states that words that are related to each other should appear close to each other in text. Here I interpret it as “argument names should appear close to their values”. It seems like a huge oversimplification, but it works quite well in practice, and the proximity principle is actually the basis of a lot of successful natural language and information retrieval systems.
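The assignment step can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the text does not say which optimisation algorithm Rticulate used, so the sketch simply tries every possible pairing and keeps the cheapest one, measuring distance in token positions. The synonym table is hypothetical, and negative numbers are not handled.

```python
import re
from itertools import combinations, permutations

# Hypothetical argument synonyms for rnorm.
ARG_SYNONYMS = {
    "n": {"samples", "sample", "size"},
    "mean": {"mean", "average"},
    "sd": {"deviation", "sd"},
}


def assign_arguments(query):
    """Pair numeric values with argument names, minimising total token distance."""
    tokens = re.findall(r"[a-z]+|\d+(?:\.\d+)?", query.lower())
    # Likely values: (token position, number).
    values = [(i, float(t)) for i, t in enumerate(tokens) if t[0].isdigit()]
    # Likely argument references: argument name -> positions of its synonyms.
    positions = {}
    for i, token in enumerate(tokens):
        for arg, synonyms in ARG_SYNONYMS.items():
            if token in synonyms:
                positions.setdefault(arg, []).append(i)

    args = list(positions)
    k = min(len(args), len(values))
    best, best_cost = {}, float("inf")
    # Exhaustive search over every one-to-one pairing of k arguments with k values.
    for chosen_args in permutations(args, k):
        for chosen_values in combinations(values, k):
            cost = sum(
                min(abs(value_pos - p) for p in positions[arg])
                for arg, (value_pos, _) in zip(chosen_args, chosen_values)
            )
            if cost < best_cost:
                best_cost = cost
                best = {arg: v for arg, (_, v) in zip(chosen_args, chosen_values)}
    return best
```

On the example query, “samples” sits three tokens from both 5 and 6, while “mean” and “deviation” each sit one token from 6 and 4 respectively. The assignment n=5, mean=6, sd=4 has a total distance of 3 + 1 + 1 = 5, whereas giving 6 to “samples” forces “mean” onto 5 or 4 and costs at least 9. Exhaustive search grows factorially with the number of arguments, which is tolerable here only because function calls rarely have more than a handful.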

If the system fails to match a parameter, it uses a default value. Some R functions already have default values for their parameters, and the Rticulate dictionary inherits these values. For others, Rticulate adds new default values. This was my attempt to make the system return useful output in more cases, and mitigate the need for criterion 2 (i.e., that the user needs to know parameter values to use the system).

Examples of natural language statements and their corresponding function invocations as generated by the Rticulate prototype. Note how the binomial distribution query does not contain a value for the size parameter, but due to universal default values, Rticulate can return a complete invocation of the function rbinom.
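A minimal sketch of the default-filling step, in Python, might look like this. The default values shown for rbinom are hypothetical (R’s rbinom has no defaults, and the values Rticulate actually used are not given here), as is the default for n in rnorm; only mean=0 and sd=1 come from R itself.

```python
# Defaults per function, in signature order.
DEFAULTS = {
    "rnorm": {"n": 1, "mean": 0, "sd": 1},        # n is hypothetical; mean and sd are R's
    "rbinom": {"n": 1, "size": 1, "prob": 0.5},   # all hypothetical
}


def complete_invocation(function, matched):
    """Render a full call, falling back to a default for each unmatched argument."""
    args = {
        name: matched.get(name, default)
        for name, default in DEFAULTS[function].items()
    }
    rendered = ", ".join(f"{name}={value}" for name, value in args.items())
    return f"{function}({rendered})"


# A query that supplies n and prob but never mentions size still yields a full call.
print(complete_invocation("rbinom", {"n": 10, "prob": 0.3}))
```

With these assumed defaults, the call above prints rbinom(n=10, size=1, prob=0.3). Whether a silently supplied default helps or misleads depends on the user being able to recognise the result by sight, which is exactly what the familiarity criterion asks for.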

In a report written early during my Ph.D., I wrote:

Rticulate is motivated by the increasingly common phenomenon of “Google engineering”, i.e., the process of programming by searching for snippets of code online. Formulating a search query that yields the appropriate result is a difficult task for novice programmers. Moreover, many novice R programmers are also learning R in conjunction with learning statistics, as a first programming language. The fact that R is not designed to be a first programming language makes the learning curve much steeper than necessary. Thus, Rticulate aims to provide a natural language interface for the R programming language.

The Rticulate prototype is currently capable of taking free-form natural language input and formulating a function invocation that it believes best represents the input, drawing on a small, manually-annotated function dictionary.


The techniques it uses to formulate this invocation are fairly simplistic, and there is no sophisticated natural language processing being used. Furthermore, it does not integrate at all with the R language; once Rticulate has produced a function invocation it must be manually copied into R. Furthermore Rticulate must be manually made aware of environmental variables and their types so that it can extract references to them from input sequences. Despite these limitations, even at the current stage the prototype provides a compelling demonstration of the potential utility of such a tool as a learning scaffold for the R language.

I was pretty excited about this line of work, but ultimately it was abandoned in pursuit of other projects. In that same report, I proposed four systems, but abandoned one entirely, took one forward only partially, and only seriously developed the other two.

Ph.D.s are often like that. It is rarely the case that you end up doing what you set out to do. You can set your sights on, and direct your attention towards, a certain goal. But then your research agenda is buffeted by practical constraints, unexpected hurdles, serendipitous opportunities, new developments in the field, and indeed, emergent findings from your research itself. This is not just a property of Ph.D.s but of all academic research. As a line often attributed to Einstein puts it, “If we knew what it was we were doing, it would not be called research, would it?”

Natural language programming: from all context to small context

During my Ph.D., I had to end my explorations of natural language programming early, both because of time constraints and because I became increasingly enchanted with the power of interactive data visualisations, which I saw as potentially solving a much wider range of problems.

Nonetheless, my interest in natural language as an alternative programming representation endures. I occasionally find myself in a position to study it. A recent investigation led by a student of mine advances the discussion on a question I always had: whether it is better to have “full” natural language in such systems, or only a reduced subset of natural language (e.g., restricted vocabulary and grammar), both to make the interpretation of the statement easier on the system and to make the experience more predictable and consistent for the user. In our experiment, we found several benefits to using a reduced subset of natural language, and we provide evidence to suggest that “full” natural language may not be ideal in many cases. If you’re interested in learning more, there’s a blog post and a paper.

What both Rticulate and my student’s work have in common is the limiting of context as a technique. Rticulate limited the scope of the system specifically to familiar function invocation. The later work experimented with limiting vocabulary and grammar. Human language is saturated with context that enables us to disambiguate between many possible interpretations of an utterance. In comparison, there’s far less common ground between us and a machine interpreter.

When systems purport to do everything and understand anything, users encounter a host of problems. Research by my colleagues finds that the experience can be “like having a really bad PA”. The new code generation systems I mentioned at the start of this article can, at present, suffer from these same problems. The power of deep learning systems like Codex makes my tiny, laboriously hand-coded attempts with Rticulate look laughably quaint by comparison, but I believe that Rticulate’s simplicity and focus on solving a very specific problem would still be a strong selling point in favour of such a system today.

The core idea behind context limiting is this: to reduce the uncertainty the user faces when dealing with a natural language system, and to improve the performance of the algorithm, start small. Start either with a limited set of contexts in which the system can operate, or with a limited language of instruction, or both. In due course, perhaps, we can find a way to create sufficient common ground between programmer and algorithm that we can “converse” with the same ease as we do with human interlocutors. However, that is still a few years (at least) away.

Closing reflection: research is a really long game

I don’t mean to give the impression that by building a system all these years before the current renaissance, I somehow invented the idea of natural language programming or of context limiting. The idea of programming in natural language has been around for as long as programmable electronic computers, perhaps longer. There has been a lot of great work on this topic. Debates about the suitability of natural language as a programming language, and ways in which we might solve the apparent technical and interactional challenges (including context limiting), have been floating around since at least 1966. Even Edsger Dijkstra waded in, in his typical curmudgeonly fashion, to state his position on the topic (surprise, he hates it).

Bill Buxton postulates the long nose of innovation: “any technology that is going to have significant impact in the next 10 years is already at least 10 years old”. The history of several important technologies, such as the mouse, RISC processors, and capacitive multitouch, involved an incubation period of 20-30 years between the invention of the technology and its first consequential application. The challenge of innovation, Buxton argues, is not (just) in the invention: it is the refinement, persuasion, and financing that follow which will determine its success. He gives the following metaphor:

The Long Nose redirects our focus from the “Edison Myth of original invention”, which is akin to an alchemist making gold. It helps us understand that the heart of the innovation process has far more to do with prospecting, mining, refining, goldsmithing, and of course, financing.

Knowing how and where to look for gold, and recognizing it when you find it is just the start. The path from staking a claim to piling up gold bars requires long-term investment, and many players. And even then, the full value is only realized after the skilled goldsmith has crafted those bars into something worth much more than their weight in gold.

Thanks to advances in the technology of generative models, the long nose of natural language programming is finally beginning to poke into the present. For researchers at the intersection of human-computer interaction, artificial intelligence, and programming languages (like myself), this is a tremendously exciting time. There are still many open questions about the experience of interacting with these generative models, and a lot of goldsmithing ahead of us.


Arawjo, Ian. “To write code: The cultural fabrication of programming notation and practice.” In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1-15. 2020.

Ong, Walter J. Orality and literacy. 1982.

Sarkar, Advait. Interactive analytical modelling. No. UCAM-CL-TR-920. University of Cambridge, Computer Laboratory, 2018.

Sarkar, Advait, Neal Lathia, and Cecilia Mascolo. “Comparing cities’ cycling patterns using online shared bicycle maps.” Transportation 42, no. 4 (2015): 541-559.

Sarkar, Advait. “The impact of syntax colouring on program comprehension.” In PPIG, p. 8. 2015.

Csikszentmihalyi, Mihaly. Flow: The psychology of optimal experience. New York: Harper & Row, 1990.

Givón, Talmy. “Iconicity, isomorphism and non-arbitrary coding in syntax.” Iconicity in syntax (1985): 187-219.

Behaghel, Otto. “Deutsche syntax, vol. 4.” Heidelberg: Winter (1932).

DiGiano, Chris, Ken Kahn, Allen Cypher, and David Canfield Smith. “Integrating learning supports into the design of visual programming systems.” Journal of Visual Languages & Computing 12, no. 5 (2001): 501-524.

Sarkar, Advait, Alan F. Blackwell, Mateja Jamnik, and Martin Spott. “Hunches and Sketches: rapid interactive exploration of large datasets through approximate visualisations.” In The 8th international conference on the theory and application of diagrams, graduate symposium (diagrams 2014), vol. 1. 2014.

Sarkar, Advait, Alan F. Blackwell, Mateja Jamnik, and Martin Spott. “Interaction with Uncertainty in Visualisations.” In EuroVis (Short Papers), pp. 133-137. 2015.

Sarkar, Advait, Mateja Jamnik, Alan F. Blackwell, and Martin Spott. “Interactive visual machine learning in spreadsheets.” In 2015 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 159-163. IEEE, 2015.

Sarkar, Advait, Martin Spott, Alan F. Blackwell, and Mateja Jamnik. “Visual discovery and model-driven explanation of time series patterns.” In 2016 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 78-86. IEEE, 2016.

Mu, Jesse, and Advait Sarkar. “Do we need natural language? Exploring restricted language interfaces for complex domains.” In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1-6. 2019.

Luger, Ewa, and Abigail Sellen. “‘Like Having a Really Bad PA’: The Gulf between User Expectation and Experience of Conversational Agents.” In Proceedings of the 2016 CHI conference on human factors in computing systems, pp. 5286-5297. 2016.

Halpern, Mark. “Foundations of the case for natural-language programming.” In Proceedings of the November 7-10, 1966, fall joint computer conference, pp. 639-649. 1966.

Sammet, Jean E. “The use of English as a programming language.” Communications of the ACM 9, no. 3 (1966): 228-230.

Dijkstra, Edsger W. “On the foolishness of ‘natural language programming’.” In Program construction, pp. 51-53. Springer, Berlin, Heidelberg, 1979.
