Background

Every company takes pride in providing excellent customer service; therefore, as part of ongoing product improvements, it’s crucial to collect and analyse feedback after each customer interaction on support channels

A top-notch analyst in the Hi-Tech industry should be capable of handling at least basic feedback analysis quickly and efficiently. Going through this article, you will be introduced to the necessary concepts and, in the end, given a base guide on how to unlock the true power of text data

Well, no more words, let’s have a look at how it’s done

Disclaimer: I’m not claiming this article is an exhaustive, everything-you-need notebook for analysing customer feedback, but these are the steps I usually follow while working with customer text data, and, based on my previous experience, 80% of your needs are covered here

Prerequisites

It’s expected that the reader has experience with Python and its main data analysis libraries

The notebook was written on Python 3.7.9, and to keep the results reproducible, here is the list of packages used in the article, specified with their versions

Code
numpy==1.21.6
pandas==1.3.5
pandas-profiling==3.1.0
matplotlib==3.4.2
unidecode==1.3.7
nltk==3.6.2
sklearn==0.24.2
wordcloud==1.9.2
shap==0.42.1
transformers==4.30.2
gensim==4.0.1

Methodology

In this notebook, feedback that comes in the form of ratings (from 1 to 5) and textual comments is considered. The general purpose is to dive deeper into this feedback, identify common themes, understand whether certain issues lead to more negative feedback than others, and outline areas for improvement

Code
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

from matplotlib import pyplot as plt
%matplotlib inline

As a general rule, the analysis should be organized in a top-down manner: simple exploration and low-hanging fruit first, and a more sophisticated approach only if you really need more in-depth expertise. Accordingly, the analysis will be organized in three steps:

  1. Data Mining - simple unsupervised data exploration to get a general idea of the nature of the data
  2. Sentiment Analysis - search for what matters the most: which reasons drive a review to be positive or negative
  3. Topic Modelling - extraction of the main feedback themes and identification of improvement focuses

Analysis Process

Data Mining

Basic EDA

First of all, let’s have a quick look at the provided data; pandas-profiling is a very useful tool to perform the basic and boring Exploratory Data Analysis in a minute. View the Data Profile by clicking here

Code
from pandas_profiling import ProfileReport

df = pd.read_csv('feedback-data-sample.csv', index_col=0)

profile = ProfileReport(
    df,
    minimal=True,
    dark_mode=True,
    title="Feedback Data Report",
)

profile.to_file("data-profile.html")

Well, what are the main data patterns?

  • the dataset consists of 1276 tickets with customer feedback made up of csat_score and comments
  • ticket_id is the primary key; given that we don’t have a timestamp column and it looks like a bigint rather than a random hash, it’s better to sort by it beforehand to be sure we are not predicting the past from future data
  • all the reviews are in English; good for us, we can forget about additional translators and about this column in general
  • csat_score has only two distinct values, 0 and 4 stars, and the majority of reviews are positive; the imbalance is not large, and it makes it even easier to transform the score into a boolean target and work with it further
  • comments has a few NULL values, let’s keep that in mind
  • from a simple frequency analysis it’s clear that the tokens not and very might be informative, so don’t forget to exclude them from the stop-words list
Code
df = df.sort_index().drop(columns=['language'])

def define_sentiment(rating: float) -> int:
    if rating < 3:
        return -1 # negative sentiment
    elif rating > 3:
        return 1 # positive sentiment
    else:
        return 0 # neutral sentiment

df['sentiment'] = df['csat_score'].apply(define_sentiment)

Text cleaning

The role of data cleaning can hardly be overestimated; it’s an incredibly important step, and if it is skipped, the rest of the analysis doesn’t make any sense.

Main cleaning that should be applied:

  • remove all the symbols and keep only words
  • remove redundant short words
  • remove general language words
  • transliterate unicode symbols to ascii
  • lowercase

The next step is tokenization; there are two main approaches here:

  • stemming - a fast process of removing prefixes and suffixes to give a word a short form, which might not be a dictionary word though
  • lemmatization - finds a meaningful word representation in the dictionary and depends on the part of speech

To summarize, stemming just searches for common ground between words and cuts off the endings, and therefore takes less time, whereas lemmatization provides better results by performing a proper morphological analysis and producing a real word, which is extremely important for some human-facing applications

It sounds like, if resources are not a problem, it’s better to use lemmatization by default, but there is an opinion that stemming works efficiently for some specific tasks such as spam classification and feedback sentiment classification. Given that this is our case, let’s apply both and make the choice in the end

Code
import re
import nltk

# If the code below doesn't work - download add-ons first
# nltk.download(['stopwords', 'wordnet'])

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# these words might be useful, better to retain them for now
for word in ['very', 'not']:
    stop_words.remove(word)

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# better version of the Porter Stemmer
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language="english")

for example in ['programming', 'quickly', 'very']:
    print(f"Example: {example}")
    print(f" - Lemma: {lemmatizer.lemmatize(example, pos='v')}")
    print(f" - Stem: {stemmer.stem(example)}")
Example: programming
 - Lemma: program
 - Stem: program
Example: quickly
 - Lemma: quickly
 - Stem: quick
Example: very
 - Lemma: very
 - Stem: veri

Unicode transliteration is needed, given that the data contains non-ASCII symbols

Code
from unidecode import unidecode

non_null = df[df.comments.notnull()]
non_null[non_null.comments != non_null.comments.apply(unidecode)].comments.sample(5)
ticket_id
43532202076873    Just asking me to show what I’m doing to give ...
43532202076331                                I haven’t had a reply
43532202117219                                    All good 👍 thanks
43532202073174    Chat went silent. After talking to someone the...
43532202202546    Hi Valéria,\nIt still does not work. There mus...
Name: comments, dtype: object

Putting it all together

Code
def clean_data(x, tokenizer, black_list=stop_words):
    """
    The method removes from a sentence `x`
     - punctuation & digits
     - too short words (less than 3 letters)
     - unicode symbols (translate to ascii)
     - words from `black_list`
    Return lowercased and tokenized text
    """
    words = re.findall(r'\w{3,}', re.sub(r'[^a-zÀ-ÿ ]', ' ', str(x).lower() if x is not np.NaN else ''))
    tokens = [tokenize(unidecode(word), tokenizer) for word in words]
    return ' '.join([word for word in tokens if word not in black_list])


def tokenize(x: str, tokenizer) -> str:
    """
    Applies either stemming or lemmatization to a token `x`
    """
    if hasattr(tokenizer, 'lemmatize'):
        return tokenizer.lemmatize(x, pos='v')
    elif hasattr(tokenizer, 'stem'):
        return tokenizer.stem(x)
    else:
        raise ValueError("tokenizer should be either Lemmatizer or Stemmer")


df["lemma_text"] = df.comments.apply(clean_data, args=(lemmatizer,))
df["stem_text"] = df.comments.apply(clean_data, args=(stemmer,))

Words Frequency

There are many different ways to tackle visual text analysis; a popular and convenient one is Word Clouds, where the size of a word reflects its frequency within the given text

P.S. In any data mining initiative it is a good idea to retain some portion of the data to validate your final findings, so let’s create a holdout piece of data to keep the approach true to life

Code
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[["lemma_text", "stem_text"]],
    df['sentiment'],
    train_size=1000,
    shuffle=False
)

from collections import Counter

def create_ngrams(tokens: list, n: int):
    ngrams = zip(*[tokens[idx:] for idx in range(n)])
    return [" ".join(sorted(ngram)) for ngram in ngrams]


def frequent_ngrams(documents: list, n: int = 1):
    """
    Use .most_common(top_n) to get top n ngrams
    """
    ngrams = []
    if n == 1:
        ngrams = " ".join(list(documents)).split()
    elif n >= 2:
        for tokens in documents:
            ngrams.extend(create_ngrams(tokens.split(), n))
    else:
        raise ValueError("n for n-grams should be a positive number")
    return Counter(ngrams)


import wordcloud

def make_word_cloud(text, stop_words=None):
    plt.figure(figsize=(12, 9))
    kwargs = {
        'width': 1600,
        'height': 900,
        'min_font_size': 10
    }
    if isinstance(text, str):
        word_cloud = wordcloud.WordCloud(stopwords=stop_words, **kwargs).generate_from_text(text)
    elif isinstance(text, list) or isinstance(text, np.ndarray):
        word_cloud = wordcloud.WordCloud(stopwords=stop_words, **kwargs).generate(" ".join(text))
    else:
        if stop_words:
            text = {word: value for word, value in text.items() if word not in stop_words}
        word_cloud = wordcloud.WordCloud(**kwargs).generate_from_frequencies(text)
    plt.imshow(word_cloud)
    plt.axis("off")
    plt.show()

First of all, as promised, the two tokenizers should be compared

Code
make_word_cloud(frequent_ngrams(X_train["stem_text"], 1))

Code
make_word_cloud(frequent_ngrams(X_train["lemma_text"], 1))

Well, at first glance both tokenizers work very similarly. One noticeable difference is that the stemmer treats word pairs like help and helpful or quick and quickly as a single token, which actually might be wrong: imagine helpful occurs more in positive sentences and help in negative ones, then they shouldn’t be merged

Code
for word in ["help", "helpful"]:
    score = df.loc[
        df["comments"].apply(
            lambda x: f" {word} " in x if x is not np.nan else False
        ), "sentiment"
    ].mean()
    print(f"for `{word}` average sentiment score = {score:.2f}")
for `help` average sentiment score = 0.31
for `helpful` average sentiment score = 0.95

Indeed, that is the case, so let’s settle on the traditional lemmatizer and take a look at word clouds separately for the different sentiments

Code
make_word_cloud(frequent_ngrams(X_train.loc[y_train < 0, "lemma_text"], 1))

Code
make_word_cloud(frequent_ngrams(X_train.loc[y_train > 0, "lemma_text"], 1))

Good news, they are very different in essence: in positive reviews customers use gratitude words like thank and helpful more often, while in negative ones customers highlight that the issue is still not resolved. In addition, it might be useful to take a look at popular collocations

Code
make_word_cloud(frequent_ngrams(X_train.loc[y_train > 0, "lemma_text"], 2))

Code
make_word_cloud(frequent_ngrams(X_train.loc[y_train < 0, "lemma_text"], 3))

N-grams appear to be very informative:

  • in positive sentences, based on bigrams, customers say that the response was quick, the problem was solved and the support was very helpful
  • in negative sentences, based on trigrams, customers claim that the issue/problem was not resolved; sometimes they add yet or still to fully express their dissatisfaction

To summarize, the basic approach has already given some meaningful insights, along with the hope that the reviews can be classified automatically quite well and that themes can then be modelled

Unsupervised TF-IDF

It’s critical to reduce the number of features, otherwise the X-matrix will be too sparse and the clusterization will fail to give a substantial result; in practice, only sufficiently frequent words should be taken into account

Code
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.cluster import KMeans

min_word_counts = range(1, 100, 2)
n_features, scores = [], []
for min_word_count in min_word_counts:
    tfidf = TfidfVectorizer(min_df=min_word_count)
    X = tfidf.fit_transform(X_train["lemma_text"])
    n_features.append(X.shape[1])
    km = KMeans(
        n_clusters=2,
        init='k-means++',
        max_iter=300,
        random_state=20231020
    )
    km.fit(X)
    score = accuracy_score(2 * km.predict(X) - 1, y_train)
    scores.append(
        max(score, 1 - score)
    )

plt.style.use('ggplot')
fig, ax1 = plt.subplots(figsize=(12, 6))
ax2 = ax1.twinx()

ax2.bar(min_word_counts, n_features, color="b", width=1.5, alpha=0.33)
ax1.plot(min_word_counts, scores, "g--", linewidth=2)

ax1.set_ylabel('Accuracy, %', color="g")
ax1.set_xlabel('Minimal Word Frequency, #')
ax2.set_ylabel('N Features, #', color="b")
plt.title('K-Means Clusterization')
plt.show()

From the chart it is clear that a larger number of features doesn’t lead to an Accuracy increase; following the principle of maximum Accuracy, the optimum might be at min_df = 69, for example, so let’s fix it and track which features the model decided to consider

Code
tfidf = TfidfVectorizer(min_df=69)
X = tfidf.fit_transform(X_train["lemma_text"])
words = np.array(tfidf.get_feature_names())

print(f"Number of features: {X.shape[1]}, namely: ")
print(*words)
Number of features: 14, namely: 
answer get help helpful issue not problem quick resolve response solve still thank very

The clusterization, based on just 14 words(!), gives an accuracy higher than 75%, but this result is only valid for the train sample, which was used to identify the min_df hyperparameter, so to get an unbiased estimate the test sample should be considered here

Code
km = KMeans(
    n_clusters=2,
    init='k-means++',
    max_iter=300,
    random_state=20231020
)

km.fit(X)

centroids_important_indexes = km.cluster_centers_.argsort()[:,::-1]

for idx in range(km.get_params()['n_clusters']):
    print("Cluster No.", idx, *words[centroids_important_indexes[idx, :7]])
Cluster No. 0 very issue thank helpful quick resolve response
Cluster No. 1 not issue resolve still solve problem get

Well, it looks like Cluster-0 caught positive feedback and Cluster-1 negative; given that the target takes values from the {-1, 1} set, a transformation of the predictions must be put in place before computing accuracy

Code
predictions_train = 2 * -km.predict(X) + 1

print(f"Accuracy of K-Means train sample = {accuracy_score(predictions_train, y_train):.1%}")
Accuracy of K-Means train sample = 76.4%
Code
predictions_test = 2 * -km.predict(
    tfidf.transform(
        X_test["lemma_text"]
    )
) + 1

print(f"Accuracy of K-Means test sample = {accuracy_score(predictions_test, y_test):.1%}")
Accuracy of K-Means test sample = 74.6%

Splendid, this toy example gives a clue that it’s a pretty good approach when you need to classify your customers’ feedback while you don’t have any ratings (only text comments). The unsupervised approach based on the TF-IDF Vectorizer and K-Means gives a nice baseline, but fortunately that’s not our case: let’s go ahead and use the rating set by the customer in addition to the texts to reach the text data’s full potential

Sentiment Analysis

The goal of this part of the article is to enhance the unsupervised model and build a powerful classifier, to eventually understand, via feature extraction, the key drivers for a review to be positive or negative

There are 2 main ways of doing Sentiment Analysis:

  • train your own model using the available data
  • use pre-trained deep learning models and fine-tune them, if needed, for the specifics of the particular texts

Both approaches are considered below

Custom Regression Model

Training of the custom model will again be carried out in 3 steps:

  • Model architecture selection
  • Selected model training and cross-validation
  • Feature analysis and general evaluation

Well, by the term model we mean a combination of a Vectorizer, which transforms text data into a vector representation, and a Classifier that is trained on these vectors to predict sentiment

Out of vectorizers we are going to try both of the most popular options: the classic Counter and TF-IDF; for classifiers let’s search among classic linear, tree-based and, in addition, naive Bayes methods, the latter of which can be extremely useful for text classification

In addition, as was shown during the unsupervised analysis, not all the words need to be taken into account to build a substantial model, and dropping some of them also alleviates the learning process; therefore applying SVD for feature-space reduction will be considered as well

Code
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

from sklearn.decomposition import TruncatedSVD

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

from typing import List, Union, Optional

def model(vectorizer, classifier, transformer=None):
    if transformer:
        return Pipeline([
            ("vectorizer", vectorizer),
            ("transformer", transformer),
            ("classifier", classifier)
        ])
    else:
        return Pipeline([
            ("vectorizer", vectorizer),
            ("classifier", classifier)
        ])


def get_entity_name(entity: Union[List[object], Optional[object]]) -> str:
    if not isinstance(entity, List):
        entity = [entity]
    return [re.sub(">|'", '', str(e)).split(".")[-1] for e in entity if e]

Using vanilla classifiers, let’s first define which one suits the data better out of the box, and afterwards tune the hyperparameters of this candidate on a cross-validation basis

Code
np.random.seed(20231024)

for vectorizer in [CountVectorizer, TfidfVectorizer]:
    for classifier in [LogisticRegression, SGDClassifier, RandomForestClassifier, LinearSVC, MultinomialNB]:
        transformers = [None]
        if vectorizer == CountVectorizer:
            transformers.append(TfidfTransformer())
        if classifier != MultinomialNB:
            transformers.extend([
                TruncatedSVD(n_components=100),
                TruncatedSVD(n_components=10),
            ])
        for transformer in transformers:
            print(get_entity_name([vectorizer, classifier, transformer]), end=": ")
            score = cross_val_score(
                model(vectorizer(), classifier(), transformer),
                X_train["lemma_text"],
                y_train,
                cv=5,
                scoring='f1'
            ).mean()
            print(f"f1-score: {score:.1%}")
['CountVectorizer', 'LogisticRegression']: f1-score: 90.2%
['CountVectorizer', 'LogisticRegression', 'TfidfTransformer()']: f1-score: 90.5%
['CountVectorizer', 'LogisticRegression', 'TruncatedSVD(n_components=100)']: f1-score: 88.7%
['CountVectorizer', 'LogisticRegression', 'TruncatedSVD(n_components=10)']: f1-score: 87.5%
['CountVectorizer', 'SGDClassifier']: f1-score: 88.0%
['CountVectorizer', 'SGDClassifier', 'TfidfTransformer()']: f1-score: 89.5%
['CountVectorizer', 'SGDClassifier', 'TruncatedSVD(n_components=100)']: f1-score: 88.0%
['CountVectorizer', 'SGDClassifier', 'TruncatedSVD(n_components=10)']: f1-score: 84.7%
['CountVectorizer', 'RandomForestClassifier']: f1-score: 90.2%
['CountVectorizer', 'RandomForestClassifier', 'TfidfTransformer()']: f1-score: 90.1%
['CountVectorizer', 'RandomForestClassifier', 'TruncatedSVD(n_components=100)']: f1-score: 89.1%
['CountVectorizer', 'RandomForestClassifier', 'TruncatedSVD(n_components=10)']: f1-score: 88.3%
['CountVectorizer', 'LinearSVC']: f1-score: 89.3%
['CountVectorizer', 'LinearSVC', 'TfidfTransformer()']: f1-score: 90.3%
['CountVectorizer', 'LinearSVC', 'TruncatedSVD(n_components=100)']: f1-score: 89.2%
['CountVectorizer', 'LinearSVC', 'TruncatedSVD(n_components=10)']: f1-score: 87.3%
['CountVectorizer', 'MultinomialNB']: f1-score: 90.6%
['CountVectorizer', 'MultinomialNB', 'TfidfTransformer()']: f1-score: 90.2%
['TfidfVectorizer', 'LogisticRegression']: f1-score: 90.5%
['TfidfVectorizer', 'LogisticRegression', 'TruncatedSVD(n_components=100)']: f1-score: 89.7%
['TfidfVectorizer', 'LogisticRegression', 'TruncatedSVD(n_components=10)']: f1-score: 86.7%
['TfidfVectorizer', 'SGDClassifier']: f1-score: 89.4%
['TfidfVectorizer', 'SGDClassifier', 'TruncatedSVD(n_components=100)']: f1-score: 88.4%
['TfidfVectorizer', 'SGDClassifier', 'TruncatedSVD(n_components=10)']: f1-score: 83.5%
['TfidfVectorizer', 'RandomForestClassifier']: f1-score: 89.9%
['TfidfVectorizer', 'RandomForestClassifier', 'TruncatedSVD(n_components=100)']: f1-score: 88.9%
['TfidfVectorizer', 'RandomForestClassifier', 'TruncatedSVD(n_components=10)']: f1-score: 88.0%
['TfidfVectorizer', 'LinearSVC']: f1-score: 90.3%
['TfidfVectorizer', 'LinearSVC', 'TruncatedSVD(n_components=100)']: f1-score: 90.7%
['TfidfVectorizer', 'LinearSVC', 'TruncatedSVD(n_components=10)']: f1-score: 87.4%
['TfidfVectorizer', 'MultinomialNB']: f1-score: 90.2%

Winners:

  • TfidfVectorizer & LinearSVC & TruncatedSVD
  • CountVectorizer & MultinomialNB
  • TfidfVectorizer & LogisticRegression

On the basis of the above analysis, the better approach is to go with the first option because:

  1. there is a clear rationale to apply TF-IDF over the plain counter for the majority of text analysis tasks
  2. a linear model is more flexible than Naive Bayes, and it is highly likely that after cross-validation it can achieve even higher accuracy
  3. reducing the feature space makes sense, just as it did for unsupervised learning
Code
from sklearn.model_selection import GridSearchCV

clf = model(
    vectorizer=TfidfVectorizer(),
    classifier=LinearSVC(random_state=20231020),
    transformer=TruncatedSVD()
)

param_grid = {
    "vectorizer__max_df": [1.0, 0.15, 0.10],
    "vectorizer__min_df": [1, 2, 3],
    "vectorizer__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "classifier__C": [0.75, 1, 1.25],
    "transformer__n_components": [100, 500, 1000]
}
search = GridSearchCV(clf, param_grid, cv=3)
search.fit(X_train["lemma_text"], y_train)
print("Best parameter (CV score = %0.3f):" % search.best_score_)
for key, value in search.best_params_.items():
    print(f"{key}: {value}")
Best parameter (CV score = 0.897):
classifier__C: 1
classifier__loss: hinge
transformer__n_components: 1000
vectorizer__max_df: 1.0
vectorizer__min_df: 1
vectorizer__ngram_range: (1, 2)
Code
# If you want to see all the results of the scoring use the following code
# for param, score in zip(
#     search.cv_results_['params'],
#     search.cv_results_['mean_test_score']
# ):
#     print(param, score)

The majority of parameters remain at their defaults, although some of them changed. It’s interesting that only bigrams turned out to be useful: trigrams, contrary to what we saw earlier in the dummy analysis, don’t give any additional info to the model from the grid-search point of view; still, from experience, it’s better to retain them

Well, given that the parameters are all defined, let’s set them up and take a closer look at the trained model

Code
from sklearn.metrics import classification_report

clf = model(
    vectorizer=TfidfVectorizer(ngram_range=(1, 3)),
    classifier=LinearSVC(random_state=20231020),
    transformer=TruncatedSVD(n_components=1000),
)

clf.fit(X_train["lemma_text"], y_train)
print(classification_report(y_test, clf.predict(X_test["lemma_text"])))
              precision    recall  f1-score   support

          -1       0.90      0.86      0.88       111
           1       0.91      0.94      0.92       165

    accuracy                           0.91       276
   macro avg       0.91      0.90      0.90       276
weighted avg       0.91      0.91      0.91       276

If we, just out of curiosity, take a look at the other candidate - Naive Bayes - then the results are

Code
clf = model(
    vectorizer=CountVectorizer(ngram_range=(1, 3)),
    classifier=MultinomialNB(),
)

clf.fit(X_train["lemma_text"], y_train)
print(classification_report(y_test, clf.predict(X_test["lemma_text"])))
              precision    recall  f1-score   support

          -1       0.90      0.82      0.86       111
           1       0.89      0.94      0.91       165

    accuracy                           0.89       276
   macro avg       0.89      0.88      0.89       276
weighted avg       0.89      0.89      0.89       276

Well, the linear model works a bit better and it’s a pretty powerful classifier, but what is really needed is feature exploration: which words the model has determined to drive the sentiment, and what customers really appreciate or complain about. In order to evaluate the features easily, the SVD step is skipped here; it doesn’t inflict tangible damage on model quality but simplifies the analysis a lot

Code
import shap

vectorizer=TfidfVectorizer(ngram_range=(1, 3))
classifier=LinearSVC(random_state=20231020)

X_train_vec = vectorizer.fit_transform(X_train["lemma_text"])
classifier.fit(X_train_vec, y_train)

explainer = shap.Explainer(
    classifier, X_train_vec, feature_names=vectorizer.get_feature_names()
)
shap_values = explainer(X_train_vec)

shap.plots.beeswarm(shap_values, max_display=30, plot_size=(12, 8))

Well, the results reveal almost no new patterns:

  • clients like quick and clear answers to their questions, they are thankful for fast and friendly support, and of course a solved problem is everything
  • clients dislike having no response (reply) and the problem being still not resolved (yet); in general they don’t like to wait for a solution

However, there are some new words here - account, pleo and sonja - and there is an easy way to check how the model works for a particular review

P.S. if visualization doesn’t work run shap.initjs() first

Code
for word in ['account', 'pleo', 'sonja']:
    print(word, *X_train["lemma_text"].apply(lambda x: word in x).values.argsort()[-3:])
account 154 375 104
pleo 386 532 242
sonja 826 943 190
Code
idx = 154
print('Positive' if y_train.values[idx] > 0 else 'Negative')
print('Text:', df.iloc[idx]["comments"])
shap.plots.force(
    explainer.expected_value,
    shap_values.values[idx],
    feature_names=vectorizer.get_feature_names(),
    matplotlib=True,
)
Negative
Text: my email account is still not connected and it does not work. tried a couple of times to reconnect

Code
idx = 242
print('Positive' if y_train.values[idx] > 0 else 'Negative')
print('Text:', df.iloc[idx]["comments"])
shap.plots.force(
    explainer.expected_value,
    shap_values.values[idx],
    feature_names=vectorizer.get_feature_names(),
    matplotlib=True,
)
Positive
Text: Fast answer and a perfect answer because the function I was looking for was in Pleo :D

Code
idx = 826
print('Positive' if y_train.values[idx] > 0 else 'Negative')
print('Text:', df.iloc[idx]["comments"])
shap.plots.force(
    explainer.expected_value,
    shap_values.values[idx],
    feature_names=vectorizer.get_feature_names(),
    matplotlib=True,
)
Positive
Text: Quick and thorough answer by support agent Sonja!

From these examples it follows that:

  • the word account usually signals a general problem with an account
  • pleo appears in formal reviews, mostly containing some claim
  • sonja seems to be a chatbot agent and it gets positive reviews

Pretrained neural network

Well, we got some new insights using the custom model approach; let’s see whether a modern NN architecture will be able to unlock even more meaningful takeaways without additional training

Code
from transformers import pipeline

sentiment_transformer_model = pipeline(
    task="sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    return_all_scores=True
)

scoring_results = sentiment_transformer_model(X_test["lemma_text"].to_list())

The drawback of this approach becomes obvious as soon as you start applying it: even inference takes a lot of time.
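One hedged aside (not run in the original notebook): the transformers pipeline accepts batching and tokenizer truncation at call time, which usually speeds up scoring of a list of comments noticeably; the batch size below is an arbitrary illustration.

Code
# assumption: the text-classification pipeline forwards `batch_size` (batched inference)
# and `truncation` (a tokenizer kwarg) when called on a list of texts
scoring_results = sentiment_transformer_model(
    X_test["lemma_text"].to_list(),
    batch_size=32,     # score comments in batches instead of one by one
    truncation=True,   # cut overly long comments to the model's maximum length
)

Fingers crossed it’s worth it; let’s look at the classification quality.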

Code
def twitter_roberta_predict(scoring_output):
    scoring_output.sort(key=lambda x: x['score'], reverse=True)
    # take the most probable label; if it's 'neutral', fall back to the second-best one
    prediction = scoring_output[0]['label']
    if prediction == 'neutral':
        prediction = scoring_output[1]['label']
    if prediction == "positive":
        return 1
    elif prediction == "negative":
        return -1
    else:
        raise ValueError("unexpected scoring results")

sentiment_transformer_predictions = [twitter_roberta_predict(scoring) for scoring in scoring_results]

print( classification_report(y_test, sentiment_transformer_predictions) )
              precision    recall  f1-score   support

          -1       0.19      0.23      0.21       111
           1       0.38      0.32      0.35       165

    accuracy                           0.29       276
   macro avg       0.29      0.28      0.28       276
weighted avg       0.31      0.29      0.29       276

It’s kind of expected, given that we used a model that was trained and fine-tuned on texts of a somewhat different nature (tweets). To unlock the true power of the neural network approach, the model should be fine-tuned to reflect the specifics of the particular data, and if you don’t have a sufficient amount of data, that should be a red flag not to waste time on overly comprehensive models
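If there were enough labelled data, a fine-tune could look roughly like the sketch below (an assumption-laden illustration that is not run in this notebook: the dataset wrapper, output_dir and epoch count are made up; it simply adapts the same checkpoint to our two sentiment classes with the Trainer API).

Code
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

checkpoint = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# re-initialize the classification head for our two classes (negative / positive)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2, ignore_mismatched_sizes=True
)

class FeedbackDataset(torch.utils.data.Dataset):
    """Wraps the cleaned comments and -1/1 sentiments into a torch Dataset."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(list(texts), truncation=True, padding=True)
        self.labels = [0 if y < 0 else 1 for y in labels]  # map -1/1 to 0/1

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-feedback", num_train_epochs=3),
    train_dataset=FeedbackDataset(X_train["lemma_text"], y_train),
)
trainer.train()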

Anyway, let’s take a look at how the language model works before pigeonholing it; below are just a few examples, and to get a better summary a larger portion of the dataset would be needed

Code
explainer = shap.Explainer(sentiment_transformer_model)

shap_values = explainer(
    X_train.loc[X_train["lemma_text"].apply(lambda x: len(x.split()) > 5), "lemma_text"].sample(3),
    silent=True
)

shap.plots.text(shap_values)


[shap.plots.text output: an interactive force plot per comment showing each token’s SHAP contribution towards the negative class for the three sampled (lemmatized) comments, roughly: “chat go silent talk someone go silent past minutes”, “customer service rep good however disappoint cannot produce payment confirmation document …”, and “unfortunately efficient solution appreciate help much feature not exist”]
Code
shap_values = explainer(
    X_train["lemma_text"].sample(10),
    silent=True
)

shap.plots.bar(
    shap_values[:, :, "positive"].mean(axis=0),
    # max_display=10,
    # order=shap.Explanation.argsort,
)

The shap library is a very powerful tool for analyzing text data: the logic hidden behind the many layers of the neural network can be reflected easily and explicitly

Given that the model isn’t very precise, there isn’t much sense in a detailed analysis of its behaviour; let’s go to the next step and keep in mind that a neural network together with shap is a very convenient tool for analyzing texts in the case of large datasets

Topic Modelling

The final section is devoted to the topic modelling problem. Topic modelling is a family of algorithms that allows you to build a topic distribution model for a corpus of texts in an unsupervised manner.

The idea is to analyze positive and negative feedback separately and retrieve, respectively, the things that customers appreciate about the support and the improvement focuses. For now the focus will be on negative feedback, because it’s more critical to identify potential process problems, whereas positive feedback isn’t as informative; it should apparently be modelled too, if time permits

One of the well-known models is used here for topic modelling - LDA

Code
from gensim import corpora, models

data = df.loc[df["sentiment"] < 0, "lemma_text"].apply(str.split)
texts = data.to_list()
indexes = data.index

dictionary = corpora.Dictionary(texts)

corpus = [dictionary.doc2bow(text) for text in texts]

Coherence is the metric that correlates best with external human assessment of topic quality, and fortunately it can be calculated without any manual intervention, hence it’s very useful for determining the optimal number of topics
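For reference (a rough restatement from the literature, not derived in this notebook): the c_uci coherence used below is essentially the average pointwise mutual information over pairs of a topic’s top $N$ words, with co-occurrence probabilities estimated from a sliding window over the texts:

$$ C_{UCI} = \frac{2}{N(N-1)} \sum_{i<j} \log \frac{P(w_i, w_j) + \epsilon}{P(w_i)\,P(w_j)} $$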

Code
np.random.seed(20231025)
topics = range(2, 8)
out = []

for t in topics:
    ldamodel = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=t, passes=20)
    print('Topic Num =', t, end = ' ')
    coherence = ldamodel.top_topics(texts=texts, window_size=10, coherence='c_uci')
    avg_coherence = 0
    for topic in range(len(coherence)):
        avg_coherence += coherence[topic][1] / len(coherence)
    print('Coherence = {:.2f}'.format(avg_coherence))
    out.append(avg_coherence)
Topic Num = 2 Coherence = -4.26
Topic Num = 3 Coherence = -3.88
Topic Num = 4 Coherence = -4.21
Topic Num = 5 Coherence = -4.72
Topic Num = 6 Coherence = -4.54
Topic Num = 7 Coherence = -5.01
Code
plt.plot(topics, out, "go-")
plt.title(f"Best Topic Num = {topics[np.array(out).argmax()]}")
plt.ylabel("Coherence")
plt.xlabel("Topics Num")
plt.show()

Well, let’s build the model for 3 topics then as it provides the highest value of coherence

Code
np.random.seed(20231025)
ldamodel = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=3, passes=20)
ldamodel.show_topics(num_topics=ldamodel.num_topics, num_words=5, formatted=True)
[(0,
  '0.021*"account" + 0.018*"pleo" + 0.018*"need" + 0.017*"not" + 0.015*"still"'),
 (1,
  '0.075*"not" + 0.045*"issue" + 0.039*"resolve" + 0.015*"get" + 0.014*"still"'),
 (2,
  '0.036*"not" + 0.030*"still" + 0.018*"problem" + 0.017*"work" + 0.014*"solve"')]

Below are methods to get the $\Phi$ and $\Theta$ matrices from the LDA model, which represent $P(word|topic)$ and $P(topic|document)$ respectively

Code
def get_phi(ldamodel):
    return pd.DataFrame(ldamodel.get_topics()).T

def get_theta(ldamodel):
    theta = []
    for bow in corpus:
        # minimum_probability=0 keeps every topic in the output,
        # so the rows stay aligned with the topic indexes
        theta.append([x[1] for x in ldamodel.get_document_topics(bow, minimum_probability=0)])
    return pd.DataFrame(theta).T

def get_word(idx):
    return dictionary.id2token[idx]

phi = get_phi(ldamodel)
theta = get_theta(ldamodel)

Let’s make a map of words broken down by topic; to do this, the matrix $\Phi$ should be turned over first. To get the topic profile of a word, Bayes’ formula might be used:

$$ P(topic|word) = \frac{P(word|topic)\,P(topic)}{\sum_{t'} P(word|t')\,P(t')} $$

Code
phi_top_words = (
    phi.assign(sum=phi.sum(axis=1))
    .sort_values(by='sum', ascending=False)
    .drop(columns=['sum'])
    .reset_index()
    .head(60)
)

phi_top_words = (
    phi_top_words
    .set_index(phi_top_words["index"].apply(get_word))
    .drop(columns=["index"])
)

topic_prior = pd.DataFrame((theta.sum(axis=1) / theta.shape[1]).values, columns=["p(t)"])

topic_profile = np.zeros(phi_top_words.shape)
word_prior = phi_top_words.values @ topic_prior.values

for word in range(topic_profile.shape[0]):
    for topic in range(topic_profile.shape[1]):
        topic_profile[word, topic] = (
            phi_top_words.iloc[word, topic] * topic_prior.iloc[topic] / word_prior[word]
        )

Once we have all the matrices, let’s visualize the data with a Word Cloud where the size represents the total probability of the word over all the themes and the color reflects the theme whose probability is maximal given that word

Code
%matplotlib inline

colors = ["white", "yellow", "green"]

def get_color_func(word, **kwargs):
    return colors[topic_profile[phi_top_words.index == word].argmax()]

plt.figure(figsize=(12, 9))
kwargs = {
    'width': 1600,
    'height': 900,
    'min_font_size': 10
}
word_cloud = wordcloud.WordCloud(
    color_func=get_color_func, **kwargs
).generate_from_frequencies(phi_top_words.sum(axis=1))
plt.imshow(word_cloud)
plt.axis("off")
plt.show()

The three extracted themes make sense at first glance:

  • Yellow - a general one, with feedback from customers whose issue was not solved
  • Green - more formal feedback with a direct claim towards Pleo about waiting too long for a solution
  • White - specific customer issues with an account, the system or an expense transaction

It’s time to name the topics and take a look at the prior distribution; note that the general theme is the most popular one, as expected from common sense

Code
topic_labels = [
    "Specific feedback - Particular account or payment problem",
    "General negative customer experience - Issue was not duly resolved",
    "Formal direct feedback - Pleo service takes too much time",
]

topic_prior.assign(
    Label = topic_labels
).reset_index().set_index("Label").sort_values(
    by = 'p(t)', ascending=False
)[['p(t)']].apply(round, args=(2,))

p(t)
Label
General negative customer experience - Issue was not duly resolved 0.42
Formal direct feedback - Pleo service takes too much time 0.35
Specific feedback - Particular account or payment problem 0.24

Here is an example review for each of the topics, each reflecting its name quite well

Code
for i, nm in enumerate(topic_labels):
    print(i, nm)
    for comment in df.loc[
        indexes[theta.T.values.argsort()[:, -1] == i], "comments"
    ].sample(1):
        print(comment)
    print()
0 Specific feedback - Particular account or payment problem
when verifying my email. The link has expired? This has been consistent for the last 24hours. Please help.

1 General negative customer experience - Issue was not duly resolved
Didn't help and it still isn't resolved

2 Formal direct feedback - Pleo service takes too much time
figured it out myself before anyone got back to me 

Conclusion

Using only the provided excerpt of the data, it is barely possible to draw conclusions about the whole feedback population (for example, that positive reviews prevail over negative ones), but other, in a way even more fruitful, insights can be highlighted

  1. The most basic one, which permeates the entire analysis, is that general customer satisfaction highly depends on whether the problem was finally resolved, and especially the speed of that resolution makes a difference

  2. In addition, from the sentiment analysis it follows that the majority of reviews where the customer mentions the chatbot sonja are positive. Of course, it doesn’t necessarily mean that this bot is very helpful; it might be that clients simply don’t mention its name in negative reviews, for example because the request to leave a comment is phrased differently after a customer rates the support with 0 stars. More domain and business knowledge is required for an ultimate judgment, but without such expertise the bot looks useful at first glance

  3. When it comes to improvement focuses, on the basis of the topic modelling approach it might be concluded that the negative feedback has 3 main categories (with examples):

  • Specific feedback - Particular account or payment problem > When verifying my email. The link has expired? This has been consistent for the last 24hours. Please help.
  • General negative customer experience - Issue was not duly resolved > Didn’t help and it still isn’t resolved
  • Formal direct feedback - Pleo service takes too much time > Figured it out myself before anyone got back to me

While the first type of negative feedback is totally reasonable, its share among negative reviews is only about a quarter; it looks like the company should put effort into enhancing its customer support process. Particular initiatives may include, but are not limited to:

  • FAQ documentation provisioning, with an automatic answer sent right after a review is submitted, to provide the customer with a potential solution quickly; not every complaint needs an operator to be resolved
  • a chatbot rollout to answer basic questions quickly and automatically
  • a simple keyword-based system could be released to catch reviews where the client uses the company name (it most likely means a formal claim, and it’s risky from the perspective of eventually losing such a customer) or wording like months or weeks (the case probably takes too much time and should be prioritized over others to mitigate the client’s discontent); a minimal sketch follows after this list
  • finally, if none of the above is applicable - an increase in support staff, as a straightforward solution
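To make the keyword idea above a bit more concrete, here is a hypothetical sketch (the escalation reasons and keyword lists are illustrative assumptions, not something derived from the data):

Code
import re

# illustrative escalation rules: the company name as a proxy for a formal claim,
# time-related words as a proxy for a case that drags on for too long
ESCALATION_KEYWORDS = {
    "formal_claim": {"pleo"},
    "long_wait": {"week", "weeks", "month", "months"},
}

def escalation_flags(comment) -> list:
    """Return the list of escalation reasons triggered by a raw comment."""
    tokens = set(re.findall(r"\w+", str(comment).lower()))
    return [reason for reason, keywords in ESCALATION_KEYWORDS.items() if tokens & keywords]

df["escalation"] = df["comments"].apply(escalation_flags)
print(df.loc[df["escalation"].str.len() > 0, ["comments", "escalation"]].head())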

Additional Information

More information on tokenizers

Transformers Applications

Shap Documentation