“Imagine one day you could write a whole page of code from scratch by simply typing in some English commands. Imagine you could develop a perfect program without memorizing every syntax rule by heart.”
This is the vision that motivates Rice University Computer Science Ph.D. student Mingchao “Charles” Jiang’s research.
“My research lies at the intersection of machine learning and program synthesis. The outcome — auto-generated source code — can be leveraged by software engineers as a base upon which to build their next layer of functionality,” he said.
When Jiang began his master’s degree in Electrical and Computer Engineering at Rice, machine learning and data science (DS) were gaining traction, and he became fascinated with the idea of combining statistical analysis with data sets (images, words, numbers) to produce knowledge with real-world impact.
Jiang said, “I probably took every DS course offered at Rice at the time and completed my ECE master’s with a concentration in data science. Then I immediately enrolled in the Ph.D. program in Computer Science to continue my journey with data science.”
Jiang’s advisor, Chris Jermaine, is one of Rice’s data science pioneers and the program director for the university’s data science initiative. With Jermaine’s guidance, Jiang immersed himself in the data-rich challenges he wanted to solve, like building the next generation of interpretable artificial intelligence (AI) systems that can automatically generate long-horizon, bug-free source code.
“The problem I am working on now is how to better use a neuro-symbolic approach in building generative language models that ‘understand’ the structure of data rather than treating it as purely syntactic objects,” said Jiang. “Thanks to advances from industry labs such as OpenAI, DeepMind, and Google Brain, there have already been some prototype models that accomplish what I described above.”
In the fall of 2021, the launch of OpenAI’s Codex offered a jaw-dropping demonstration: given instructions in English, or even just a few input/output example pairs, you could get a whole page of working functions in return. The translation was not perfect, but it was a giant leap forward, one that could empower hundreds of thousands of programmers and even the entire software development industry.
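The “input/output pairs” style of specification can be illustrated with a toy sketch. The names and the tiny candidate pool below are hypothetical, and real systems like Codex use large neural models rather than enumeration, but the core idea is the same: find a program consistent with every example.

```python
# Toy programming-by-example: enumerate a small, hand-written pool of
# candidate functions and return the first one that matches every
# input/output pair. (Illustrative only; not how Codex works internally.)

CANDIDATES = [
    ("double", lambda x: x * 2),
    ("square", lambda x: x * x),
    ("increment", lambda x: x + 1),
    ("negate", lambda x: -x),
]

def synthesize(examples):
    """Return the name of the first candidate consistent with all pairs."""
    for name, fn in CANDIDATES:
        if all(fn(inp) == out for inp, out in examples):
            return name
    return None  # no candidate in the pool explains the examples

# Two pairs are enough to rule out 'double' and pin down 'square' here.
print(synthesize([(2, 4), (3, 9)]))  # square
```

Note that a single pair like (2, 4) is ambiguous (doubling and squaring both fit), which is why example-based synthesis typically needs several pairs to converge.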
Jiang said, “The astonishing performance was built on the development of a state-of-the-art deep learning model, the Transformer, which consists of billions of parameters.” However, there is no such thing as a free lunch. To achieve that performance, leading industry teams scraped billions of web pages containing images, text, and source code repositories, and spent millions of dollars on training alone. “My goal is to further push the performance of automatic code generation, even with constrained computation power.”
“The question is: how are we going to achieve that? It hinges on one important difference between programming languages and natural languages such as English, a difference that has been overlooked for a while, even within the program synthesis field: the significance of structure in programs.

“For example, we as human beings might make hundreds of grammatical mistakes during our daily dialogues, but we can still get our points across to one another. It is a different story for a programming language. A trivial mistake, such as a missing comma, can cause compilation to fail completely, let alone execution. While much research treats source code as a purely syntactic object, my work takes a different approach, building generative language models that leverage and ‘understand’ the structural meaning of source code.”
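The brittleness Jiang describes is easy to demonstrate. In the sketch below (illustrative, using Python’s built-in `compile` as a stand-in for any compiler’s front end), dropping a single comma turns a valid statement into a hard failure:

```python
# Natural language tolerates small slips; programming languages do not.
# One missing comma turns valid Python into a SyntaxError.

good = "pair = (1, 2)"
bad = "pair = (1 2)"  # same line with the comma dropped

for src in (good, bad):
    try:
        # compile() parses the source without executing it
        compile(src, "<example>", "exec")
        print("compiles:", src)
    except SyntaxError:
        print("SyntaxError:", src)
```

A reader of the English sentence “a pair of 1 2” would shrug off the missing comma; the parser cannot, which is why models that respect program structure have an advantage over purely textual ones.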
“This type of work really excites me because I can see the direct impact of this kind of tool for software developers,” Jiang said. “We can help software engineers quickly master their code-writing skills so they can spend more time on the tasks that can’t be automated, like innovating the next feature or solving an outstanding issue.”
“Tools like Word and Excel, particularly when they were first introduced, enabled users with no background in computer science to produce sophisticated documents and spreadsheets. Those tools made it easier for users to spend more time focusing on what they wanted to write or calculate instead of wrestling with syntax. If we can do for software engineers what Office did for home and office workers, we can empower a broader audience to take advantage of programming to solve problems. That is our hope.”
Communicating the importance of his research could have been just as challenging for Jiang as the work itself. Fortunately, Rice offers resources such as the ACTIVATE Engineering Communication program directed by Dr. Tracy Volz.
Jiang met Dr. Volz when she offered a presentation skills course for graduate students. “I never realized that kind of career-impacting resource was available to engineering students at Rice,” he said. “Tracy’s workshop made me see how dramatically our talks could improve with only a few adjustments. She also helped us experience a wide variety of presentation styles by inviting our peers and our faculty members to give 30-minute research talks.
“I wanted to be one of the speakers who excites their audience, but my research is complex; my presentation could have easily been boring or hard to sit through. Tracy’s coaching and workshops made a huge difference in my presenting skills.”
Jiang said, “The biggest takeaway from Tracy’s coaching was to always keep the audience in mind. It doesn’t matter what you say. What matters is what the audience takes away. Did they get the punch line? Can they remember one to three things you said? Or do they leave having no idea what you were talking about?”
“Before you begin your delivery, study your audience. Is it filled with peers who know your work backward and forward? Or is it an audience that is primarily unfamiliar with your work, or even your general topic?” The seminar courses helped Jiang and his peers with more than their presentation skills. They also learned more about research developments in their field of study. Jiang was impressed by his peers’ work, which prompted him to reach out and get to know them better.
“I was shocked at how much I could learn from my peers in the Rice graduate programs. I knew Rice had rock stars among the faculty, but sitting in the classroom of a famous professor is just one way of learning. At a great university like Rice, you have a chance to hear really smart students talking about their work. Even if their research is a dramatically different topic than mine, I’ve learned to talk to my peers whenever I have the chance,” he said.
Jiang described a scenario in which he’d been stuck on a problem and took a break, asking another graduate student to go get coffee. They each talked a little about their own work, and a light seemed to go on for Jiang.
“When you hear someone else describe their research and then ask questions about yours, that can provide a breakthrough you weren’t able to see on your own,” said Jiang.
“So many exciting developments are going on in machine learning and AI. While cutting-edge information seems to be only a couple of clicks away, I often find that talking with and listening to my peers has helped me further my knowledge and research to a greater extent.”
Mingchao Jiang is a Computer Science Ph.D. candidate at Rice University. His advisor is Chris Jermaine, and he matriculated in Fall 2015.
This story is part of a series of profiles for the ACTIVATE Engineering Communication program.