Datasets

Aviva insurance tweets

A small dataset of 399 tweets about Aviva insurance. The CSV file covers tweets posted between 26/06/2014 and 27/06/2014.

  1. AVIVA tweets [115KB]
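
A minimal loading sketch in Python; the file name and the "text" column used here are assumptions, so check the header row first:

```python
# Minimal sketch for loading the tweets CSV with the standard library.
# The file name and the "text" column are assumptions; check the header row.
import csv

with open("aviva_tweets.csv", newline="", encoding="utf-8") as f:
    tweets = list(csv.DictReader(f))

print(len(tweets))            # expected: 399
print(tweets[0].get("text"))  # hypothetical column name
```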

Million graph data

The enriched graph used to evaluate the graph join operator. The vertices have been enriched following the LDBC Social Network Benchmark protocol. The TXT files contain the vertices and edges in comma- and tab-separated formats.

  1. Enriched Graph Data [344MB]
  2. LiveJournal edges
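
A minimal reading sketch, assuming comma-separated vertex records and tab-separated edge pairs; the file names and field order are guesses, not the dataset's documented schema:

```python
# Minimal sketch for reading the graph TXT files.
# Assumes comma-separated vertex records and tab-separated edge pairs;
# file names and field order are assumptions, not the dataset's schema.

def load_vertices(path):
    with open(path, encoding="utf-8") as f:
        # one vertex per line: id followed by its LDBC-style enrichment fields
        return [line.rstrip("\n").split(",") for line in f]

def load_edges(path):
    with open(path, encoding="utf-8") as f:
        # one edge per line: source id <TAB> target id
        return [tuple(line.rstrip("\n").split("\t")[:2]) for line in f]

vertices = load_vertices("vertices.txt")  # hypothetical file name
edges = load_edges("edges.txt")           # hypothetical file name
```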

We also provide the datasets used to benchmark the graph nesting operator: both the gMark-generated operands and the authorship subgraphs extracted from the Microsoft Academic Graph.

  1. gMark-generated operands [2GB]
  2. Microsoft Academic Authorship graph [5GB]

Public figures Facebook posts

The Facebook posts of six public figures from different social categories (i.e. politicians, journalists, and singers). Each JSON file contains one thousand posts.

  1. Amanpour [205KB]
  2. Macklemore [214KB]
  3. Obama [136KB]
  4. Renzi [569KB]
  5. Travaglio [2MB]
  6. Vasco [410KB]
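
A minimal parsing sketch, assuming each file is a JSON array of post objects; the field names follow the Facebook Graph API convention and are an assumption about this dataset's actual schema:

```python
# Minimal sketch for reading one of the public-figure JSON files.
# Field names ("created_time", "message") follow the Facebook Graph API
# convention and are assumptions about this dataset's actual schema.
import json

with open("Obama.json", encoding="utf-8") as f:
    posts = json.load(f)  # assumed: a JSON array of one thousand post objects

for post in posts[:3]:
    print(post.get("created_time"), (post.get("message") or "")[:80])
```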

PubMed abstracts for biomedical clustering evaluation

The evaluation dataset is built from PubMed abstracts. Each entry groups the articles related to a single disease.

  1. Dataset creation procedure.
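
For context, one measure such an evaluation dataset supports is cluster purity against the gold disease labels; a minimal sketch with hypothetical data structures, not the dataset's actual format:

```python
# Minimal sketch of one evaluation this dataset supports: cluster purity
# against the gold disease labels. The data structures are assumptions.
from collections import Counter

def purity(clusters, gold):
    """clusters: lists of article ids; gold: article id -> disease label."""
    majority = sum(
        Counter(gold[a] for a in cluster).most_common(1)[0][1]
        for cluster in clusters if cluster
    )
    return majority / sum(len(c) for c in clusters)

clusters = [["a1", "a2"], ["a3"]]                       # hypothetical ids
gold = {"a1": "asthma", "a2": "asthma", "a3": "lupus"}  # hypothetical labels
print(purity(clusters, gold))  # 1.0
```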

Smartphone images

The dataset includes photos taken with 19 different smartphones, from both the front and the rear camera. For each smartphone, a subset of 100 images (50 from the front camera and 50 from the rear one) was uploaded to and downloaded from the following social media platforms: Facebook, Flickr, Google+, GPhoto, Instagram, LinkedIn, Pinterest, QQ, Telegram, Tumblr, Twitter, Viber, VK, WeChat, WhatsApp, and WordPress. The Readme.csv file summarizes the smartphones' characteristics.

  1. Images [53GB]
  2. Readme
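
For scale, a back-of-the-envelope count of the platform-processed copies implied by the description above, assuming every (smartphone, platform) pair kept the full 100-image subset:

```python
# Back-of-the-envelope size of the shared subset described above, assuming
# every (smartphone, platform) pair kept the full 100-image subset.
smartphones = 19
platforms = 16    # Facebook through WordPress, as listed above
per_phone = 100   # 50 front-camera + 50 rear-camera images

print(smartphones * platforms * per_phone)  # 30400 platform-processed copies
```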

Smartphone videos

The dataset includes videos taken with 13 different smartphones, from both the front and the rear camera. The Readme.csv file summarizes the smartphones' characteristics.

  1. Videos [80GB]
  2. Readme

Time in Text

This dataset comprises the time intervals extracted and normalized from the temporal expressions (timexes) found in two large text corpora.

  1. Wikipedia - 89 Million timexes [8GB]
  2. New York Times 1987-2007 - 15 Million timexes [2GB]
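
A minimal sketch of what a normalized timex record might look like; the layout is a hypothetical illustration, not the released files' documented schema:

```python
# Minimal sketch of a normalized timex record; the layout is an assumption
# about the released files, not their documented schema.
from dataclasses import dataclass
from datetime import date

@dataclass
class TimeInterval:
    surface: str  # the temporal expression as it appears in the text
    begin: date   # normalized interval start
    end: date     # normalized interval end

# e.g. "the summer of 1999" might normalize to an interval like this:
print(TimeInterval("the summer of 1999", date(1999, 6, 21), date(1999, 9, 22)))
```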

Text Watermarking evaluation