Steam Review aspect dataset

Introducing the Steam Review aspect dataset for multi-label classification tasks.

28 May 2024 (Last updated: 11 June 2024)

Machine Learning

The Steam Review aspect dataset is a collection of Steam reviews annotated with 8 review aspects. It contains 1100 English Steam reviews, split into 900 train and 200 test reviews. It was initially created to identify which aspects are mentioned in English reviews, as part of the Analysis of 43 million English Steam Reviews.

Data collection and annotation

The reviews come from a snapshot of the SRec database taken on 21 February 2024. SRec obtains all reviews for all games and mods using the API provided by Steam. To reduce bias when selecting reviews to annotate, I chose reviews primarily based on these criteria:

  • Character length.
  • Helpfulness score.
  • Popularity of the reviewed game.
  • Genre or category of the reviewed game.

This dataset uses 8 aspects to describe a review. I am the only annotator for this dataset. A review is deemed to contain an aspect even if the aspect is only mentioned implicitly (e.g., "but it'd be great if there's good looking characters...") or the review only mentions the lack of it (e.g., "... essentially has no story ..."). The table below lists the 8 aspects, each with a short description and an example.

Aspect | Description | Example
Recommended | Whether the reviewer recommends the game or not. This aspect comes from the one who wrote the review. | ... In conclusion, good game
Story | Story, character, lore, world-building and other storytelling elements. | Excellent game, but has an awful-abrupt ending that comes out of nowhere and doesnt make sense ...
Gameplay | Controls, mechanics, interactivity, difficulty and other gameplay setups. | Gone are the days of mindless building and fun. Power grids? Taxes? Intense accounting and counter-intuitive path building ...
Visual | Aesthetic, art style, animation, visual effects and other visual elements. | Gorgeous graphics + 80s/90s anime artstyle + Spooky + Atmospheric ...
Audio | Sound design, music, voice acting and other auditory elements. | ... catchy music, wonderful narrator saying very kind words ...
Technical | The technical aspects of the game such as bug, performance, OS support, controller support and overall functionality. | bad doesnt fit a 1080p monitor u bastard ...
Price | Price of the game or its additional content. | Devs are on meth pricing this game at $44
Suggestion | Suggestions for the state of the game, including external factors such as game's price or publisher partnership. | ... but needs a bit of personal effort to optimize the controls for PC, otherwise ...

Table 1. A description and example for 8 aspects in this dataset
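
Because a review can carry any combination of the 8 aspects, each review's annotation can be thought of as an 8-element 0/1 vector. The snippet below is a minimal sketch of that idea. The aspect names come from Table 1, but the ordering, the function names and the vector layout are assumptions made for illustration and may not match the columns in the released files.

```python
# Aspect names are taken from Table 1; the ordering is an assumption.
ASPECTS = [
    "Recommended", "Story", "Gameplay", "Visual",
    "Audio", "Technical", "Price", "Suggestion",
]

def encode_aspects(present):
    """Turn a collection of aspect names into an 8-element 0/1 vector."""
    unknown = set(present) - set(ASPECTS)
    if unknown:
        raise ValueError(f"Unknown aspects: {sorted(unknown)}")
    return [1 if aspect in present else 0 for aspect in ASPECTS]

def decode_aspects(vector):
    """Inverse of encode_aspects: list the aspects whose flag is set."""
    return [aspect for aspect, flag in zip(ASPECTS, vector) if flag]

# A review that recommends the game and talks about its story.
print(encode_aspects({"Recommended", "Story"}))  # [1, 1, 0, 0, 0, 0, 0, 0]
```

A review that mentions no aspect at all maps to a vector of zeros, which is a valid annotation in this dataset.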

Note that a few reviews contain language and content that some people may find offensive, discriminatory, or inappropriate. I DO NOT endorse, condone or promote any such language or content.

Data format

The dataset is provided in CSV, JSON and Apache Arrow formats for convenience. The notebook in the example directory gives a bare-minimum example of how to open those files. Both raw and cleaned review text are provided. The cleaned text was preprocessed by stripping BBCode, collapsing excessive whitespace and collapsing excessive newlines.
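
To give a feel for what that kind of cleaning involves, here is a rough sketch using only Python's `re` module. It is not the preprocessing code used for this dataset: the regular expressions, the `clean_review` name and the choice to keep at most one blank line are all assumptions.

```python
import re

# Matches tags such as [b], [/b], [h1] or [url=https://example.com].
# Assumption: any bracketed word is treated as BBCode, so bracketed
# text that is not a tag (e.g. "[spoiler]" typed by hand) is removed too.
BBCODE_TAG = re.compile(r"\[/?[a-zA-Z][a-zA-Z0-9]*(?:=[^\]]*)?\]")

def clean_review(text: str) -> str:
    text = BBCODE_TAG.sub("", text)          # strip BBCode tags, keep their content
    text = text.replace("\r\n", "\n")        # normalise Windows line endings
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)     # drop spaces hugging a newline
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    return text.strip()

raw = "[h1]Great   game[/h1]\n\n\n\n[b]Story[/b] is   [i]fine[/i]."
print(clean_review(raw))
```

The order matters here: tags are removed first so that whitespace left behind by a stripped tag is collapsed in the later steps.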

Model benchmark

The model benchmark on the Steam Review aspect dataset is split into 3 categories:

  • Base: Non-attention based language model.
  • Embedding: Inspired by MTEB, embeddings obtained from the model are used to train a logistic regressor for up to 100 epochs.
  • Fine-tune.

In total, 15 models are benchmarked; a few base models appear more than once because they are used with different methods. See Appendix B for the results and visit GitHub for the source code.
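
The embedding category can be pictured as follows: the embedding model stays frozen, and a small logistic regressor is trained on top of the vectors it produces. The sketch below illustrates that with plain Python and toy 2-dimensional vectors standing in for real embeddings. Training one independent binary classifier per aspect, the learning rate, the 0.5 threshold and every function name here are assumptions for illustration, not the benchmark's actual code.

```python
import math

def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_binary(X, y, epochs=100, lr=0.5):
    """Fit one logistic regressor with full-batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - target
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def train_multilabel(X, Y, epochs=100):
    """Train one independent binary classifier per label column."""
    n_labels = len(Y[0])
    return [train_binary(X, [row[j] for row in Y], epochs) for j in range(n_labels)]

def predict(models, x, threshold=0.5):
    """Return a 0/1 flag per label for a single embedding vector."""
    return [
        int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= threshold)
        for w, b in models
    ]

# Toy stand-ins: 2-dimensional "embeddings" and two labels per sample.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0], [0.0, 0.0]]
Y = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]]
models = train_multilabel(X, Y, epochs=100)
print(predict(models, [1.0, 0.0]), predict(models, [0.0, 1.0]))
```

In practice the vectors would come from one of the embedding models listed in Appendix B, and a library implementation of logistic regression would replace the hand-written loop.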

Download

You can download Steam review aspect dataset from one of these sources,

Citation

If you wish to use this dataset in your research or project, please cite this blog post:

Sandy Khosasi. "Steam review aspect dataset". (2024).

A BibTeX entry is also provided for those who need it.

@misc{srec:steam-review-aspect-dataset,
	title        = {Steam review aspect dataset},
	author       = {Sandy Khosasi},
	year         = {2024},
	month        = {may},
	day          = {28},
	url          = {https://srec.ai/blog/steam-review-aspect-dataset},
	urldate      = {2024-05-28}
}

License

Steam Review aspect dataset is licensed under Creative Commons Attribution 4.0 International.

Appendix A - Statistics

Aspect | Train | Test
Recommended | 667 | 148
Story | 400 | 89
Gameplay | 693 | 154
Visual | 391 | 87
Audio | 227 | 51
Technical | 259 | 57
Price | 213 | 47
Suggestion | 97 | 21

Table 2. Total occurrences of each aspect

Total aspects | Train | Test
0 | 1 | 7
1 | 88 | 11
2 | 214 | 43
3 | 218 | 55
4 | 184 | 49
5 | 140 | 21
6 | 46 | 8
7 | 7 | 5
8 | 2 | 1

Table 3. Total aspects in a review
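
Tables 2 and 3 describe the same annotations from two angles, so they can be cross-checked: the per-aspect counts in Table 2 must add up to the same number of labels as Table 3 weighted by the number of aspects per review. The short script below does that arithmetic with the figures copied from both tables; the variable and function names are only illustrative.

```python
# (train, test) counts copied from Table 2.
aspect_counts = {
    "Recommended": (667, 148), "Story": (400, 89),
    "Gameplay": (693, 154), "Visual": (391, 87),
    "Audio": (227, 51), "Technical": (259, 57),
    "Price": (213, 47), "Suggestion": (97, 21),
}

# Number of aspects in a review -> (train, test) review counts, from Table 3.
reviews_by_num_aspects = {
    0: (1, 7), 1: (88, 11), 2: (214, 43), 3: (218, 55), 4: (184, 49),
    5: (140, 21), 6: (46, 8), 7: (7, 5), 8: (2, 1),
}

def summarize(idx):
    """Return (reviews, labels per Table 2, labels per Table 3) for a split."""
    n_reviews = sum(c[idx] for c in reviews_by_num_aspects.values())
    labels_t2 = sum(c[idx] for c in aspect_counts.values())
    labels_t3 = sum(k * c[idx] for k, c in reviews_by_num_aspects.items())
    return n_reviews, labels_t2, labels_t3

for split, idx in (("train", 0), ("test", 1)):
    n_reviews, labels_t2, labels_t3 = summarize(idx)
    assert labels_t2 == labels_t3  # both tables agree on the label total
    print(split, n_reviews, labels_t2, round(labels_t2 / n_reviews, 2))
# train 900 2947 3.27
# test 200 654 3.27
```

Both splits therefore average about 3.27 aspects per review.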

Total reviews for each game | Train | Test
1 | 280 | 164
2 | 301 | 18
3 | 6 | 0

Table 4. Total reviews for each game in this dataset

Q1 | 417 | 416.75 | 390 | 390
Q2 (Median) | 871 | 867.5 | 888 | 888
Q3 | 1810.5 | 1753.75 | 1629.75 | 1623.5
Average | 1408.49 | 1389.06 | 1286.12 | 1267.96

Table 5. Statistics of total characters

Appendix B - Full model benchmark

Spacy Bag of Words | 0.6203 | 0.5391 | 0.5494
FastText | 0.6284 | 0.5713 | 0.5871 | Minimum text preprocessing, use pretrained vector
FastText | 0.6933 | 0.5821 | 0.6027 | Minimum text preprocessing, choose hyperparameter based on K-5 fold autotune
Spacy Ensemble | 0.6043 | 0.6773 | 0.6299 | Choose hyperparameter based on simple grid search

Table 6. Benchmark result for base model

sentence-transformers/all-mpnet-base-v2 | 110M | 514 | 0.7074 | 0.5431 | 0.5853
jinaai/jina-embeddings-v2-small-en | 137M | 8192 | 0.7068 | 0.6075 | 0.6437
jinaai/jina-embeddings-v2-base-en | 137M | 8192 | 0.6813 | 0.6501 | 0.6618
Alibaba-NLP/gte-large-en-v1.5 | 434M | 8192 | 0.7001 | 0.6501 | 0.6729
nomic-ai/nomic-embed-text-v1.5 | 137M | 8192 | 0.7075 | 0.6498 | 0.6756
McGill-NLP/LLM2Vec-Mistral-7B-Instruct-v2-mntp-supervised | 7111M | 32768 | 0.7238 | 0.6697 | 0.6928 | NF4 double quantization, instruction
WhereIsAI/UAE-Large-V1 | 335M | 512 | 0.7245 | 0.6718 | 0.6946
mixedbread-ai/mxbai-embed-large-v1 | 335M | 512 | 0.7215 | 0.6817 | 0.6989
intfloat/e5-mistral-7b-instruct | 7111M | 32768 | 0.7345 | 0.7000 | 0.7137 | NF4 double quantization, instruction

Table 7. Benchmark result for embedding model

jinaai/jina-embeddings-v2-base-en | 137M | 8192 | 0.7485 | 0.7257 | 0.7354 | Choose hyperparameter from Ray Tune (30 trials)
Alibaba-NLP/gte-large-en-v1.5 | 434M | 8192 | 0.8403 | 0.8152 | 0.8231 | Choose hyperparameter from Ray Tune (16 trials)

Table 8. Benchmark result for fine-tuned model

Tags: Dataset, Machine Learning, Steam Review