Large-Scale TV Dataset
Multimodal Named Entity Recognition
(STVD-MNER)

The STVD-MNER dataset addresses the Multimodal Named Entity Recognition (MNER) task in video. MNER in video is an established task in the Computer Vision (CV), Audio Signal Processing (ASP) and Natural Language Processing (NLP) fields. It uses visual, audio and textual information to detect Multimodal Named Entities (MNEs) and to identify their types (e.g. location, person, organization). The next figure gives examples of MNEs with different representations and types. As STVD-MNER is part of the STVD collection, the multimodality is derived from television content appearing in Electronic Program Guides (EPG) and TV audio/video (A/V) streams.

Multimodal Named Entity Recognition (MNER) in video

The STVD-MNER dataset is provided in a β version with a Hello World test set. This Hello World test set contains about 820 hours of audio/video data with ≈9K textual Named Entities (NEs). It is divided into two distinct subsets of collections called top and last. The top subset contains ≈85% of the NEs, spread over ≈⅓ of the dataset. The last subset covers the remaining ≈⅔ and has few (or no) NEs per collection. Here, the MNER task has to be driven from scratch using the transcriptions of the audio streams. A pending Full test set would cover several thousand hours of A/V data.
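
Since the transcriptions are the starting point for the MNER task on the last subset, a first step is usually to load a transcript as timed text segments. The sketch below is a minimal reader for the SubRip (.srt) layout: a cue index, a time range, one or more text lines, then a blank line. It assumes the dataset's .srt files follow this standard layout, which should be checked against the actual files. The names Cue and parse_srt and the sample text are hypothetical and not part of the dataset.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    index: int
    start: float  # seconds from the beginning of the stream
    end: float    # seconds from the beginning of the stream
    text: str


# SubRip timestamps look like 00:01:04,250 (some tools write a dot instead of a comma).
_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{1,3})")


def _to_seconds(stamp: str) -> float:
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    return (int(hours) * 3600 + int(minutes) * 60 + int(seconds)
            + int(millis.ljust(3, "0")) / 1000)


def parse_srt(content: str) -> list[Cue]:
    """Parse SubRip text into a list of cues; malformed blocks are skipped."""
    content = content.replace("\r\n", "\n").lstrip("\ufeff").strip()
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content):
        lines = block.split("\n")
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->", 1)
        text = " ".join(line.strip() for line in lines[2:])
        cues.append(Cue(int(lines[0].strip()), _to_seconds(start),
                        _to_seconds(end.split()[0]), text))
    return cues


# Invented sample text, for illustration only.
SAMPLE = "1\n00:00:01,500 --> 00:00:04,000\nhello\nworld\n\n2\n00:01:04,000 --> 00:01:05,250\nbye\n"

if __name__ == "__main__":
    for cue in parse_srt(SAMPLE):
        print(cue)
```

Keeping the start and end times in seconds makes it straightforward to align a detected NE with the matching audio or video segment later on.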

The dataset is composed of different parts provided as:

The following naming convention is applied within the dataset:

CX / NameCode / NEs_list_imdb.json
                NEs_list_stvdkgstr.json
                NEs_list_stvdkgall.json
                day / ts_epg.csv
                      ts_video.mp4
                      ts_audio.mp4
                      ts_transcript_wt.srt
                      ts_transcript_wt.txt
                      ts_transcript_wb.srt
                      ts_transcript_wb.txt
                      ts_transcript_ws.srt
                      ts_transcript_ws.txt
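
The naming convention above can be used to index the dataset once it is unpacked. The following sketch walks a local copy and groups the files of each day directory by their shared prefix. It assumes that CX, NameCode and day are nested directories and that ts is a placeholder for a common prefix, which is an interpretation of the listing and not something stated explicitly. The function names group_recordings and iter_collections are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# File suffixes copied from the naming convention.
SUFFIXES = (
    "_epg.csv", "_video.mp4", "_audio.mp4",
    "_transcript_wt.srt", "_transcript_wt.txt",
    "_transcript_wb.srt", "_transcript_wb.txt",
    "_transcript_ws.srt", "_transcript_ws.txt",
)


def group_recordings(day_dir: Path) -> dict[str, dict[str, Path]]:
    """Group the files of one day directory by the prefix before a known suffix."""
    groups = defaultdict(dict)
    for path in sorted(day_dir.iterdir()):
        for suffix in SUFFIXES:
            if path.name.endswith(suffix):
                prefix = path.name[: -len(suffix)]
                groups[prefix][suffix[1:]] = path  # key without the leading "_"
                break  # files matching no suffix are ignored
    return dict(groups)


def iter_collections(root: Path):
    """Yield (channel, collection, NE list files, day directories) tuples."""
    for channel in sorted(p for p in root.iterdir() if p.is_dir()):
        for collection in sorted(p for p in channel.iterdir() if p.is_dir()):
            ne_files = sorted(collection.glob("NEs_list_*.json"))
            days = sorted(p for p in collection.iterdir() if p.is_dir())
            yield channel.name, collection.name, ne_files, days
```

A typical use is to loop over iter_collections(Path("path/to/dataset")) and call group_recordings on each day directory; the root path here is a placeholder for wherever the archive was extracted.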

For visualization and testing purposes, some samples (NEs and their types, with the related audio and video segments) are given in the next table.

          NE           Type  Audio  Text  Video
sample 1  Gibbs        PER   1_a    1_t   1_v
sample 2  Saint-Denis  LOC   2_a    2_t   2_v

The dataset is available for non-commercial research purposes. Before downloading the dataset, get the agreement (English or French version) and sign it. Then send the scanned copy to Mathieu Delalandre by email. After verifying your request, we will send you the password to unzip the dataset.

The files constituting the dataset are listed here with general statistics. For better accessibility, the dataset is provided as separate archive packages of 16 GB each. Our storage service at the UT delivers 3-16 MB/s for downloading (from a low- / high-speed connection, respectively) with concurrent access. Downloading one package takes roughly 30 to 45 minutes. All the packages are needed to uncompress the dataset.
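
To plan a download, the time per package can be estimated from the package size and the sustained transfer rate. The sketch below assumes decimal units (1 GB = 1000 MB) and a constant rate, both simplifications; the function name download_minutes is hypothetical. Under these assumptions, the 30 to 45 minute figure corresponds to a sustained rate of about 6 to 9 MB/s, which lies inside the 3-16 MB/s range given above.

```python
def download_minutes(size_gb: float, rate_mb_s: float) -> float:
    """Minutes needed to transfer size_gb gigabytes at rate_mb_s megabytes per second.

    Assumes decimal units (1 GB = 1000 MB) and a constant transfer rate.
    """
    if rate_mb_s <= 0:
        raise ValueError("rate must be positive")
    return size_gb * 1000 / rate_mb_s / 60


if __name__ == "__main__":
    # Rates spanning the 3-16 MB/s range quoted for the storage service.
    for rate in (3, 6, 9, 16):
        print(f"{rate:>2} MB/s: {download_minutes(16, rate):5.1f} min per 16 GB package")
```

Multiplying the per-package estimate by the number of packages gives a rough total, since every package is needed to uncompress the dataset.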

Duration (h)  Channels  Collections  NEs files  NEs    EPG files  A/V files  Transcripts  Size (GB)  Packages  Link
819           7         284          439        9,256  843 x1     843 x2     843 x6       281        18        download

For clarification, we introduce here some technical / scientific aspects of the STVD-MNER dataset. Further information can be found in the research papers [1,2].

      tiny    base    small
Th    42.5    30.9    11.9
Dmax  19.3 h  26.5 h  68.8 h

  1. H.G. Vu, N. Friburger, A. Soulet and M. Delalandre. stvd-kg: A Knowledge Graph for French Electronical Program Guides. International Conference on Web Information Systems Engineering (WISE), 2025.
  2. F. Rayar, M. Delalandre and V.H. Le. A large-scale TV video and metadata database for French political content analysis and fact-checking. Conference on Content-Based Multimedia Indexing (CBMI), pp. 181-185, 2022.