Multimedia Phylogeny Datasets

Image Phylogeny

Dataset #1: Image Phylogeny Trees Dataset with Complete Scenarios

  • Description: To create this dataset, we selected 50 images from the Uncompressed Color Image Database (UCID), which contains a wide variety of images at 512 x 384-pixel resolution without compression artifacts. For reproducibility, we used the images with id = i x 25 for i = 1..50. For each image, we created trees with 10, 20, 30, 40, and 50 nodes (near-duplicates) from 50 different tree topologies. A tree with 50 nodes represents an original image with its forty-nine near-duplicates. For each topology, we created ten different sets of parameters, randomly selected from the predefined parameter ranges. The final dataset has 25,000 test cases for each tree size (50 images x 50 topologies x 10 sets of parameters). With five possible tree sizes, it has 125,000 test cases in total.
  • Download file: Dataset01.tar.gz (~ 942MB)
  • Publications: [1] [2]
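
The counts in the description can be checked with a short sketch. The variable names and the tuple layout below are illustrative assumptions; they do not describe how the files inside Dataset01.tar.gz are organized.

```python
from itertools import product

# Values taken from the Dataset #1 description.
NUM_IMAGES = 50
NUM_TOPOLOGIES = 50
NUM_PARAMETER_SETS = 10
TREE_SIZES = (10, 20, 30, 40, 50)

# UCID image ids used for reproducibility: id = i x 25 for i = 1..50.
ucid_ids = [i * 25 for i in range(1, NUM_IMAGES + 1)]


def enumerate_test_cases():
    """Yield one (image_id, topology, parameter_set, tree_size) tuple per test case.

    The tuple layout is hypothetical; it only mirrors the four factors
    named in the description.
    """
    return product(ucid_ids, range(NUM_TOPOLOGIES),
                   range(NUM_PARAMETER_SETS), TREE_SIZES)


cases_per_size = NUM_IMAGES * NUM_TOPOLOGIES * NUM_PARAMETER_SETS
total_cases = sum(1 for _ in enumerate_test_cases())

print(ucid_ids[:3], ucid_ids[-1])
print(cases_per_size, total_cases)
```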

Dataset #2: Image Phylogeny Trees Dataset with Missing Links

  • Description: This is a variation of Dataset #1 in which we consider missing links. In this scenario, we want to verify the performance of the phylogeny algorithms when some pieces of the evolution tree are missing. For instance, consider a node A that is the parent of B, and B that is the parent of C. In a complete scenario with no missing links, we have A, B, and C available to find the phylogeny relationships among them. In the scenario with missing links, B may not be in the set, so we must find the relationship between A and C (grandparent and grandchild) without the presence of B (the missing link in the evolutionary tree). We evaluated two possible subscenarios:
    1. Missing links preserving the root. In this case, we selected all trees with 50 nodes from Dataset #1 (25,000 trees) and randomly created subsets of them with varying size (in increments of 5). For each tree, we randomly select some nodes to remove (preserving the original root), generating a tree with some missing connections. Therefore, the dataset for this experiment has 250,000 test cases (25,000 trees x 10 missing-link setups).
    2. Missing links in which the root is always missing. In this case, we follow a similar setup (25,000 trees x 10 missing-link setups), except that for each tree we randomly select some nodes to remove and always remove the original root. The dataset for this experiment also has 250,000 test cases.
  • Download file: Dataset02.tar.gz (~ 2.31GB)
  • Publications: [1][2]
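
The A, B, C example above can be sketched in a few lines. This is a minimal illustration, assuming a tree stored as a child-to-parent dictionary; the function names `collapse_missing_links` and `pick_nodes_to_remove` are hypothetical and are not part of the dataset or of the published algorithms. Each surviving node is linked to its nearest surviving ancestor, and a node with no surviving ancestor is reported with parent None.

```python
import random


def collapse_missing_links(parent, removed):
    """Link every surviving node to its nearest surviving ancestor.

    `parent` maps each node to its parent (None for the root);
    `removed` is the set of nodes missing from the test case.
    """
    removed = set(removed)
    collapsed = {}
    for node in parent:
        if node in removed:
            continue
        ancestor = parent[node]
        # Skip over removed ancestors (the missing links).
        while ancestor is not None and ancestor in removed:
            ancestor = parent[ancestor]
        collapsed[node] = ancestor
    return collapsed


def pick_nodes_to_remove(parent, count, keep_root, seed=0):
    """Randomly choose `count` nodes to remove (count >= 1 if the root is removed).

    With keep_root=True the root is never chosen (subscenario 1);
    with keep_root=False the root is always among the removed nodes (subscenario 2).
    """
    rng = random.Random(seed)
    root = next(n for n, p in parent.items() if p is None)
    others = sorted(n for n in parent if n != root)
    if keep_root:
        return set(rng.sample(others, count))
    return {root} | set(rng.sample(others, count - 1))


tree = {"A": None, "B": "A", "C": "B"}
print(collapse_missing_links(tree, removed={"B"}))  # C is linked directly to A
print(collapse_missing_links(tree, removed={"A"}))  # B has no surviving ancestor
```

When the root is removed, several nodes may end up without a surviving ancestor; how such cases are scored is defined in the cited publications, not by this sketch.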

Dataset #3: 10 different target image groups from the Internet

  • Description: This dataset comprises images from 10 different target image groups from the Internet that became viral at the time of their publishing (Iranian missiles, Bush reading, WTC Tourist, BP oil spill, Israeli-Palestinian peace talks, Criminal record, Palin and rifle, Beatles rubber, Kerry and Fonda, O.J. Simpson). The total number of images across all target groups is 187, and evaluation is performed by creating a direct descendant for each image using five variations of parameters (considering the same transformations used to construct the controlled dataset). Each original image therefore appears with its five descendants, so the total number of images in this dataset is 187 x 6 = 1,122.
  • Download file: Dataset03.tar.gz (~ 41MB)
  • Publications: [1][2][6]

Datasets #1, #2, #3: Full package

Dataset #4: Image Phylogeny Forests (Dataset_A)

Dataset #5: Image Phylogeny Forests (Dataset_B)

  • Description: This dataset comprises 10 indoor scenes and 10 outdoor scenes, taken with 10 different cameras (Canon EOS50D, Canon PowerShot G12, Canon PowerShot SX1-IS, Fujifilm Finepix S4500, Kodak EasyShare z981, Nikon D5000, Olympus SP-800UZ, Panasonic Lumix, Sony Alpha, and Sony Cybershot DSC-HX1).
  • Complete dataset: Dataset_B.tar.gz (~2.79GB) | README
  • Dataset_B.1 (10% of Dataset B) (~81GB) | README | http://dx.doi.org/10.6084/m9.figshare.1012816
    • It comprises images randomly selected from a set of 20 different scenes, 10 different cameras, 10 images per camera, 10 different tree topologies, 10 random variations of parameters for creating the near-duplicate images, and forests with 10 trees each. In this dataset, the family of image transformations considered comprises re-sampling, cropping, affine warping, brightness and contrast adjustments, and lossy compression using the standard lossy JPEG algorithm. For each of the two cases, single camera and multiple cameras, 2,000 forests were randomly selected for each forest size |F| = 1..10. Therefore, this set comprises 2 x 2,000 x 10 = 40,000 test cases.
    • Publication: [5][6]

Dataset #6: The Situation Room

  • Description: This dataset comprises the image taken by the White House photographer Peter Souza on May 1st, 2011, and its variants collected from the Internet (this episode is also known as The Situation Room). For this experiment, 98 near-duplicate images were collected through Google Images and, after manual analysis, divided into different patterns: insertion of the Italian soccer player Mario Balotelli (ID a*), text overlay (ID b*), watermarking (ID c*), face swapping (ID d*), insertion of a joystick (ID e*) and hats (ID g*), and changes in image size without splicing (ID n*). Qualitative evaluation is performed by checking whether the groups were correctly separated in the reconstructed tree.
  • Download file: situation.zip (~19MB)
  • Publication: [3][4][6]

Dataset #7: Ellen DeGeneres’ Selfie at the 86th Academy Awards (2014)

  • Description: This dataset is composed of images related to the selfie taken by the host Ellen DeGeneres and some famous actors during the 86th Academy Awards, held on March 2nd, 2014. The original image became increasingly popular right after it was posted on her Twitter account. To this day, it has been retweeted more than three million times. In addition to the retweets, several edited versions of this picture appeared on the Internet, with cases of text overlay, face swap, and insertion of other people and animals in the picture, among others. This dataset is a clear example of a phylogeny forest, as the images are semantically similar but were taken either with different cameras or at different points in time. We manually collected 44 pictures from this event posted on Twitter, blogs, and news websites, and divided them into five groups:
    • Group a*: Edited versions of the picture posted on DeGeneres’ Twitter account (@TheEllenShow).
    • Group b*: The moment the selfie was being taken, but from the viewpoint of another camera.
    • Group c*: Similar to group b*, but with slight differences in the posture of the people in the picture. For instance, Angelina Jolie moves her arms and Brad Pitt straightens his back.
    • Group d*: Similar to groups b* and c*, but the main differences are in their facial expressions.
    • Group e*: The moment before the selfie was taken, when the people in the picture start gathering.
  • Download file: selfie_imgs.zip (~2.5MB)
  • Publication: [6]


Text Phylogeny

Dataset #1: Synthetic Dataset

  • Description: It is constructed using a subset of the Reuters_50_50 training dataset, collecting only articles whose length varies from 350 to 700 words (2,073 documents in total).
    • Reuters Mixed: It comprises 5,000 trees, equally divided among trees having 10, 20, 30, 40, and 50 nodes, corresponding to 150,000 distinct synthetic documents.
    • Download file: TPT_reuters_mixed.tar.gz (~81.3MB)
    • Reuters Progressive: It comprises 1,000 trees generated for each editing limit step, with five 200-element subsets of trees having size |T| equal to 10, 20, 30, 40, or 50 elements. In total, 10,000 trees were built, corresponding to 300,000 distinct synthetic documents.
    • Download file: TPT_reuters_progressive.tar (~158.4MB)
    • Publication: [8]
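
The document counts quoted for the two Reuters subsets follow from the tree counts, assuming one synthetic document per tree node. The sketch below reproduces that arithmetic; the number of editing limit steps (10) is not stated in the description and is inferred here from 10,000 trees at 1,000 trees per step.

```python
TREE_SIZES = (10, 20, 30, 40, 50)

# Reuters Mixed: 5,000 trees split equally among the five tree sizes.
mixed_trees = 5_000
mixed_per_size = mixed_trees // len(TREE_SIZES)
mixed_documents = sum(mixed_per_size * size for size in TREE_SIZES)

# Reuters Progressive: 1,000 trees per editing limit step,
# made of five 200-tree subsets (one subset per tree size).
progressive_trees = 10_000
trees_per_step = 1_000
subset_size = 200
steps = progressive_trees // trees_per_step  # inferred, not stated in the text
progressive_documents = steps * sum(subset_size * size for size in TREE_SIZES)

print(mixed_per_size, mixed_documents)
print(steps, progressive_documents)
```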

Dataset #2: Wikipedia Dataset

  • Description: In this dataset, we used the page histories of several featured articles from Wikipedia. Each page history shows the order in which changes were made to any editable page, and the difference between any two versions. They were obtained using Wikipedia’s export tool. The data is exported in the form of XML dumps, from which plain text was extracted using a parsing tool. After this cleaning process, we obtained 859 page histories, with up to 1,000 revisions each (metadata – featured articles).
  • Download file for |T| = {10, 20, 30, 40, 50}: TPT_wikipedia_10_50.tar.gz (~56.6MB)
  • Download file for |T| = {100, 200, 300, 400}: TPT_wikipedia_10_400.tar.gz (~858.7MB)
  • Publication: [8]


Publications

  1. Z. Dias, A. Rocha, and S. Goldenstein. Image Phylogeny by Minimal Spanning Trees.
    In IEEE Transactions on Information Forensics and Security (TIFS), v. 7, no. 2, p. 774-788, April 2012.
  2. Z. Dias, S. Goldenstein, and A. Rocha. Exploring Heuristic and Optimum Branching Algorithms for Image Phylogeny.
    In Elsevier Journal of Visual Communication and Image Representation (JVCI), v. 24, no. 7, p. 1124-1134, 2013.
  3. Z. Dias, S. Goldenstein, and A. Rocha. Toward Image Phylogeny Forests: Automatically Recovering Semantically-Similar Image Relationships.
    In Elsevier Forensic Science International (FSI), v. 231, no. 1-3, p. 178-189, 2013.
  4. Z. Dias, S. Goldenstein, and A. Rocha. Large-Scale Image Phylogeny: Tracing Image Ancestral Relationships.
    In IEEE Multimedia, v. 10, no. 3, p. 58-70, 2013.
  5. F. de O. Costa, M. A. Oikawa, Z. Dias, S. Goldenstein, and A. Rocha. Image Phylogeny Forests Reconstruction.
    In IEEE Transactions on Information Forensics and Security (TIFS), v. 9, no. 10, p. 1533-1546, 2014. [Supplementary material]
  6. M. A. Oikawa, Z. Dias, A. Rocha, and S. Goldenstein. Manifold Learning and Spectral Clustering for Image Phylogeny Forests.
    In IEEE Transactions on Information Forensics and Security (TIFS), v. 11, no. 1, p. 5-18, 2016. [Experiments details]
  7. M. A. Oikawa, Z. Dias, A. Rocha, and S. Goldenstein. Distances in Multimedia Phylogeny.
    In International Transactions in Operational Research (ITOR), v. 23, p. 921-946, Special Issue: Many Faces of Distances (5), 2016.
  8. G. D. Marmerola, M. Oikawa, Z. Dias, S. Goldenstein, and A. Rocha. Text Phylogeny.
    figshare, 2016. https://dx.doi.org/10.6084/m9.figshare.4256573.v3
