Questions with Flair

When I started on the path to becoming a data scientist, I was a bit overwhelmed by the broad set of skills required. Granted, I made the decision at a point far enough into my software development career that I had acquired a wide swath of skills out of necessity. Math had been a favorite for as long as I can remember. Statistics came along in earnest with a couple of graduate-level degrees, as did analysis and modeling. Machine learning was a passion, and as an author and blogger I felt comfortable writing. As for business acumen, I had an M.B.A., so I could hold my own against some of the best. And intellectual curiosity? I joined that club a long time ago.

But there was one skill that was not emphasized, and it ought to be. In fact, this particular skill is probably the most important one, not only for data scientists but for software developers in general. It’s questions. Asking questions.

The ability to frame and ask the right questions is absolutely key to any software development endeavor. I dare say it is also the foundation upon which this thing we call ‘science’ is built, regardless of discipline.

Over the past few weeks I have been experimenting with a new (to me) NLP framework I stumbled across (there are so many these days!), and I decided to narrow the use case to one that is simple but important: classifying open-ended questions.

Is it possible, out of context, to accurately classify a question as either open- or close-ended? All you get is the question itself and none of the conversation, nuance, inflection or non-verbal clues that are essential for real human communication. Can a machine, using ONLY the text of the question, correctly determine if a question is open-ended?

To be 100% accurate (like us humans, eh?), context is required. There is no way we can determine whether “What’s up?” is a simple greeting or the lead-in to a long story without something in the way of context. And with NLP, sometimes the missing context makes all the difference.

We live in an age of awesome compute power and magical applications. Extracting text transcripts from audio files is as common and cheap today as internet search was the day Google launched. And to use some of those awesome NLP tools on audio files, I need words, not sounds. Okay, so text extraction is also NLP, but that’s beside the point. I want to analyze the text itself. So how can we discern the throw-away “Waddup?” from the concerned “So what is up?” without non-textual context? In truth, we can’t. But we can get close.

I used FlairNLP, a very simple framework for state-of-the-art Natural Language Processing (NLP). I took an out-of-the-box question classification dataset, modified the task (and the data) to better suit my open-ended question use case, found additional data sources, munged and wrangled like a good data scientist, and in a few days began training a model to predict whether an arbitrary question was open- or close-ended.
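The notebook has the full details, but the training run itself follows Flair’s standard text-classification recipe. Here is a minimal sketch of what that looks like; the file names, paths, and label names below are placeholders, the embedding choice is just one sensible option, and the exact API shifts a bit between Flair versions:

```python
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# Flair reads FastText-style classification files, one example per line:
#   __label__OPEN What do you think about remote work?
#   __label__CLOSED Did you finish the report?
corpus = ClassificationCorpus('data/',
                              train_file='train.txt',
                              dev_file='dev.txt',
                              test_file='test.txt')

# Pool word-level embeddings into a single document vector
embeddings = DocumentRNNEmbeddings(
    [WordEmbeddings('glove'), FlairEmbeddings('news-forward-fast')],
    hidden_size=256,
)

classifier = TextClassifier(embeddings,
                            label_dictionary=corpus.make_label_dictionary())

# Train for 10 epochs; Flair reports micro/macro F-score on the test split when done
trainer = ModelTrainer(classifier, corpus)
trainer.train('models/open-ended', max_epochs=10)
```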

It took a few tweaks. Some more wrangling. A dash more munging. But what do you know: yes, it is possible to predict, with a decent degree of accuracy and based only on the text of the question, whether a question is open-ended. It’s not perfect, but it’s not bad:

Results from 10 training epochs:

F-score (micro): 0.97
F-score (macro): 0.9636
Accuracy: 0.97

Although these results appear promising, the importance of conversational context cannot be overstated. But the model does appear to be pretty decent as a start for recognizing open-ended questions out of context.
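Once trained, classifying a new question takes only a few lines. Again, a sketch, assuming the model path from the training snippet above:

```python
from flair.data import Sentence
from flair.models import TextClassifier

# Load the trained model saved by the training run above (path is a placeholder)
classifier = TextClassifier.load('models/open-ended/final-model.pt')

# Classify a question from its text alone, no surrounding context
question = Sentence('What is the best part of your job?')
classifier.predict(question)

# Each predicted label carries a confidence score
print(question.labels)
```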

I created a GitHub repo for the notebook. There’s not much in the way of code; most of the work was finding and munging data, which is most often the case.

Now on to another question. Because that’s what we do. We ask questions.
