Surveys

Dr Charles Martin and Karla Kelly

In this class, we will practice administering one of the classic questionnaires in usability, the SUS (System Usability Scale).

Questionnaires like the SUS and the NASA TLX (Task Load Index) are widely used to assess how users perceive a user interface. They are a quick way to gather numerical data from a medium to large group of users. They are particularly useful for comparing different interactive systems, situations, or user types, because you can use statistical techniques to assess differences between samples.

In today’s tutorial, you will do a mock survey with a user interface and the SUS. You will collect data together with others in the class and calculate descriptive statistics, generate plots, and perform significance testing using Python.

NOTE: Bring your computer to class!

Pre-Class Tasks

  1. Read the following article from Nielsen Norman Group on the SUS and TLX: https://www.nngroup.com/articles/measuring-perceived-usability/
  2. Find a quantitative or rating-scale style survey somewhere in the world and take a photo or screenshot of it.
  3. Post on the forum with your image and explain: 1) what the survey is and where you found it, 2) what type of data it captures, 3) how it is similar to or different from the SUS and NASA TLX discussed in the article.
  4. In this tutorial, we will be using Python.
    You will need either:
    1. An active Google account to use Google Colab (runs in your browser, no installation required), OR
    2. Python installed on your laptop along with a notebook environment of your choice (we recommend Jupyter Notebook).

Plan for the Class

In this class, you will:

  1. Practice administering the SUS
  2. Analyse your quantitative data
  3. Discuss what you learned

In-Class Tasks

0. Discuss pre-class responses (10 minutes)

The tutor will bring up the pre-class responses on the big screen and lead you in a discussion. Some questions might be:

1. Administer the SUS (20 minutes)

Your tutor will allocate a technology for you to evaluate, along with a task to complete using that technology. For example:

Your tutor will also give you:

  - A participant identifier (participant ID)
  - A paper copy of the SUS questionnaire

In pairs:

  1. Take turns acting as the user and the researcher.
  2. As the researcher:
    - Welcome the user to the study.
    - Ask for their consent to participate.
    - Give the task instructions (e.g., “Please complete the task, then fill in the SUS questionnaire to rate your experience”).
  3. Make sure the user:
    - Completes the task.
    - Records their answers on the SUS questionnaire.
  4. Ensure the participant ID is written on the completed questionnaire; you will need this for the data entry step.

2. Collate your data (5 minutes)

Your tutor will provide a shared spreadsheet for the whole class to enter results.
This will allow us to compare SUS scores across groups to see which technology had better or worse usability.

3. Analyse your quantitative data (40 minutes)

Individually:

  1. Go to Google Colaboratory and start a New Notebook: https://colab.google/

  2. Download the class data spreadsheet as a CSV file, then drag it into the Files pane in Colab.

  3. In a new code cell, load your data into a DataFrame:

    import numpy as np
    import pandas as pd
    from scipy import stats
    import matplotlib.pyplot as plt
    
    # --- Load ---
    df = pd.read_csv("sus_dummy_two_groups.csv")  # replace with your file name
    
    # Column names of the ten SUS items: "SUS1" ... "SUS10"
    SUS = [f"SUS{i}" for i in range(1, 11)]
    df  # show the DataFrame
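
Before recoding, it can help to confirm that the file has the layout the later steps expect. The sketch below assumes one row per participant, a `group` column, and ten item columns `SUS1` to `SUS10` holding raw 1–5 responses (the names the tutorial code uses). The `participant` column, the helper `check_sus_columns`, and the two data rows are made up for illustration and are not part of the class data.

```python
import pandas as pd

SUS = [f"SUS{i}" for i in range(1, 11)]

# Made-up rows for illustration only; this is not real class data.
example = pd.DataFrame(
    [
        ["P01", "A", 4, 2, 5, 1, 4, 2, 5, 1, 4, 2],
        ["P02", "B", 3, 3, 4, 2, 3, 3, 4, 2, 3, 6],  # SUS10 = 6 is a data-entry typo
    ],
    columns=["participant", "group"] + SUS,
)


def check_sus_columns(frame):
    """Return a list of problems found in the raw (1-5) SUS responses."""
    problems = []
    missing = [c for c in ["group"] + SUS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    # Raw responses should be whole numbers from 1 to 5; blanks are allowed here
    # because the scoring step drops incomplete rows later.
    out_of_range = ~frame[SUS].isin([1, 2, 3, 4, 5]) & frame[SUS].notna()
    for col in SUS:
        bad_rows = frame.index[out_of_range[col].to_numpy()].tolist()
        if bad_rows:
            problems.append(f"{col}: out-of-range values in rows {bad_rows}")
    return problems


print(check_sus_columns(example))
```

Calling `check_sus_columns(df)` on the class data right after loading it returns an empty list when every column is present and every response is between 1 and 5.
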
  4. Recode the positively worded SUS items (items 1, 3, 5, 7, and 9) by subtracting 1 from each response, so that their values range from 0 (“Strongly Disagree”) to 4 (“Strongly Agree”):

    POS = ["SUS1","SUS3","SUS5","SUS7","SUS9"]
    df[POS] = df[POS] - 1
  5. Reverse code the negatively worded items (for the SUS, these are the even-numbered items: 2, 4, 6, 8, 10).
    This makes the scale consistent so that higher numbers always indicate better usability.

    NEG = ["SUS2","SUS4","SUS6","SUS8","SUS10"]
    df[NEG] = 5 - df[NEG]
  6. Calculate the SUS score for each participant
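    Make sure all 10 items have been recoded to the 0–4 scale before this step. Run the recoding cells in steps 4 and 5 only once: re-running them would recode values that are already recoded. We’ll remove any rows with missing items, then sum the items (0–40) and scale to 0–100.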

    # Remove rows with missing SUS items
    df = df.dropna(subset=SUS)
    
    # Sum (0–40) and scale to 0–100
    df["SUS_score"] = df[SUS].sum(axis=1) * 2.5
    
    # (Optional) quick check of the results
    print(df["SUS_score"].describe())
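
To see the scoring arithmetic from steps 4 to 6 on a single questionnaire, here is a minimal sketch in plain Python. The function name `sus_score` and the example responses are illustrative assumptions; the rule itself (odd items minus 1, even items subtracted from 5, sum multiplied by 2.5) is the one described above.

```python
def sus_score(responses):
    """Score one participant's raw SUS responses (ten values, each 1-5)."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # odd items are positively worded
        else:
            total += 5 - response   # even items are negatively worded
    return total * 2.5              # scale the 0-40 sum to 0-100


# Made-up responses for items 1 to 10
raw = [4, 2, 5, 1, 4, 2, 5, 1, 4, 2]
print(sus_score(raw))  # odd items contribute 17, even items 17, so 34 * 2.5 = 85.0
```

Scoring one or two paper questionnaires by hand this way is a quick check that the `SUS_score` column in the DataFrame matches what you expect.
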
  7. Get the descriptive statistics
    Find the minimum, maximum, mean, and standard deviation of the SUS scores for each group.

    # --- Descriptive statistics ---
    print("\nDescriptive stats by group:")
    print(df.groupby("group")["SUS_score"].describe().round(2))
  8. Plot a histogram of your data
    Look at the shape of the distribution for each group.
    Is the data evenly spread, skewed, or clustered?

    # --- Histogram ---
    df["SUS_score"].hist(by=df["group"], bins=10, edgecolor="black", layout=(1, 2))
    plt.suptitle("Distribution of SUS Scores by Group")
    plt.show()
  9. Create a boxplot
    Compare the median, quartiles, and range of SUS scores for each group.
    Look for any outliers (points that sit far from the rest of the data).

    # --- Boxplot ---
    df.boxplot(column="SUS_score", by="group")
    plt.title("SUS Scores by Group")
    plt.suptitle("")
    plt.ylabel("SUS (0–100)")
    plt.show()
  10. Compare the findings
    Use Welch’s t-test to check whether there is a statistically significant difference in SUS scores between the two groups.
    The output will also show which group had the higher average score.
    Interpretation guide:

    # --- Between-groups comparison (Welch's t-test) ---
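    # groupby sorts the group labels, so collect names and scores in the same
    # loop to keep each name aligned with its scores.
    group_names, groups = [], []
    for name, g in df.groupby("group"):
        group_names.append(name)
        groups.append(g["SUS_score"].dropna().values)
    
    if len(groups) == 2:
        g1, g2 = groups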
    
        # Welch's t-test
        t = stats.ttest_ind(g1, g2, equal_var=False)
    
        # Means for each group
        mean_g1, mean_g2 = np.mean(g1), np.mean(g2)
    
        print(f"Welch's t-test: t = {t.statistic:.2f}, p = {t.pvalue:.3f}")
        print(f"Mean SUS for {group_names[0]}: {mean_g1:.2f}")
        print(f"Mean SUS for {group_names[1]}: {mean_g2:.2f}")
    
        # Interpret significance
        if t.pvalue < 0.05:
            print("Result: Statistically significant difference (p < 0.05).")
        else:
            print("Result: No statistically significant difference (p >= 0.05).")
    
        # Which group scored higher
        if mean_g1 > mean_g2:
            print(f"{group_names[0]} had higher usability scores.")
        elif mean_g2 > mean_g1:
            print(f"{group_names[1]} had higher usability scores.")
        else:
            print("Both groups had the same average score.")
    else:
        print("Need exactly two groups for comparison.")
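
When reporting a t-test in the form t(df) = ..., you also need the degrees of freedom, which the code above does not print. For Welch's t-test these come from the Welch–Satterthwaite approximation and are usually not a whole number. The sketch below computes them by hand; the helper name `welch_df` and the two lists of scores are made up for illustration.

```python
import numpy as np
from scipy import stats


def welch_df(a, b):
    """Welch-Satterthwaite degrees of freedom for two independent samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / len(a)  # sample variance of the mean, group a
    vb = b.var(ddof=1) / len(b)  # sample variance of the mean, group b
    return (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))


# Made-up SUS scores for two groups; not real class data
group_a = [85.0, 77.5, 90.0, 82.5, 70.0, 87.5]
group_b = [72.5, 65.0, 80.0, 60.0, 75.0, 67.5]

result = stats.ttest_ind(group_a, group_b, equal_var=False)
df_welch = welch_df(group_a, group_b)
print(f"t({df_welch:.1f}) = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```

With the class data, pass `g1` and `g2` from the previous cell to `welch_df` in place of the made-up lists.
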
  11. Summarise your findings in plain language
    After running the t-test and checking your descriptive statistics, write a short summary that anyone could understand.
    Post your summary in the class thread!

    Example reporting template:

    The mean SUS score for Group 1 (Technology A) was 82.3, which falls in the “Excellent” range.
    The mean SUS score for Group 2 (Technology B) was 71.5, which falls in the “Good” range.
    A Welch's t-test found that the difference was statistically significant (t(24) = 2.30, p = 0.03), indicating that Technology A was rated as significantly more usable for the given task.
    This suggests that, for this context and task, Technology A may offer a better user experience than Technology B.

4. Discuss your key learnings (10 minutes)

Your tutor will lead you in a discussion about what you learned from using the SUS and the analysis process.

Important Notes

This activity is a quick introduction to using the System Usability Scale (SUS) and basic statistical comparison in Python.
In real usability studies, statistical tests have specific conditions and assumptions that must be checked before deciding which analysis is appropriate.

For example:

We have skipped these detailed checks in this exercise to focus on learning the mechanics of:

  1. Scoring the SUS
  2. Performing basic descriptive statistics
  3. Running a simple between-groups comparison
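
As one illustration of what an assumption check can look like, the sketch below runs a Shapiro–Wilk normality test on each group and then a Mann–Whitney U test, a rank-based comparison that does not assume normally distributed scores. This is only one possible approach and is not prescribed by this tutorial; the scores are made up, and which checks are appropriate depends on your study design and sample size.

```python
from scipy import stats

# Made-up SUS scores for two groups; not real class data
group_a = [85.0, 77.5, 90.0, 82.5, 70.0, 87.5]
group_b = [72.5, 65.0, 80.0, 60.0, 75.0, 67.5]

# Shapiro-Wilk: a small p-value suggests the sample departs from normality
shapiro_a = stats.shapiro(group_a)
shapiro_b = stats.shapiro(group_b)
print(f"Shapiro-Wilk group A: W = {shapiro_a.statistic:.3f}, p = {shapiro_a.pvalue:.3f}")
print(f"Shapiro-Wilk group B: W = {shapiro_b.statistic:.3f}, p = {shapiro_b.pvalue:.3f}")

# Mann-Whitney U: compares the two groups using ranks instead of means
mwu = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Mann-Whitney U = {mwu.statistic:.1f}, p = {mwu.pvalue:.3f}")
```

With samples as small as a single tutorial class, normality tests have little power, so treat their results as a rough guide and not as proof that an assumption holds.
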

In practice, you should:

Resources

References