Database Open Access
bigP3BCI: An Open, Diverse and Machine Learning Ready P300-based Brain-Computer Interface Dataset
Boyla Mainsah, Chance Fleeting, Thomas Balmat, Eric Sellers, Leslie Collins
Published: May 19, 2025. Version: 1.0.0
When using this resource, please cite:
Mainsah, B., Fleeting, C., Balmat, T., Sellers, E., & Collins, L. (2025). bigP3BCI: An Open, Diverse and Machine Learning Ready P300-based Brain-Computer Interface Dataset (version 1.0.0). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/0byy-ry86
Please include the standard citation for PhysioNet:
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
Abstract
Brain–computer interfaces (BCIs) have wide-ranging applications as solutions for replacing or substituting neural output that has been lost because of severe neuromuscular injury or disease, such as late-stage amyotrophic lateral sclerosis (ALS). The P300-based BCI is one of the most commonly researched BCIs for communication. This BCI dataset is curated from data originally generated in previous visual P300-based BCI speller studies, which include single- and multi-session experiments under a wide range of conditions. The BCI data are provided in an enriched and standardised format with BCI data elements that align with developing IEEE P2731 Working Group standards for BCI data to facilitate reusability. The data files, provided in the open European Data Format ‘plus’ (EDF+), contain: i) electroencephalography (EEG) signals; ii) the BCI encoder, target characters and stimulus event markers for P300 event-related potential analysis; iii) BCI spelling outcomes and feedback event markers for error-related potential analysis; and, if available, iv) self-reported demographics (age, sex, race, ethnicity); v) ALS diagnosis and a revised ALS Functional Rating Scale score obtained from medical records; and vi) eye tracker signals.
Background
Current BCIs have relatively low communication rates because the neural signal components needed to control the BCI are highly variable and must be extracted from inherently noisy data. Thus, improving BCI communication efficiency is an area of significant research interest. Part of the development process for any BCI algorithm involves performing simulations with EEG data collected from previous BCI studies to pre-assess candidate algorithms or strategies and to select promising ones for real-world testing. Acquiring BCI data is time-consuming and expensive; thus, most BCI research groups rely on publicly available datasets for these simulations rather than conducting real-time BCI studies in-house. However, current publicly available BCI datasets have a limited number of participants, under-represent target BCI end users, and mostly use a proprietary file format or a non-standardised data dictionary. There is also a lack of serial data collected over several hours and days of BCI use, which are needed for simulating long-term evaluation of BCI algorithms.
This open dataset represents a collection of data acquired from BCI research at the Applied Machine Learning Laboratory at Duke University and the Brain-Computer Interface Laboratory at East Tennessee State University. From research supported by the National Institutes of Health (NIH), we have acquired a large amount of single- and multi-session data from P300-based BCI speller [1] studies with able-bodied individuals and individuals with ALS tested under a wide range of experimental conditions. We have performed data curation, data cleaning, and data engineering to transform proprietary data files into an open and nonproprietary file format and packaged the transformed files into a machine-readable dataset.
Methods
Data Acquisition
Data were recorded using BCI2000 [2], an open-source BCI software platform supported by the NIH. EEG signals were collected non-invasively at 256 Hz using passive gel-based electrodes or active dry electrodes connected to biosignal amplifiers (g.tec medical engineering GmbH) [3]. An electrode impedance check was conducted to ensure low impedance prior to signal recording. Raw EEG signals were bandpass filtered, and sometimes notch filtered, at the biosignal amplifier stage prior to data storage [3].
For hybrid BCI use, eye gaze position, eye position, eye distance from the screen and pupil diameter were collected using a Tobii Pro X2-30 (Tobii AB) infrared eye tracker. The eye tracker was calibrated for each participant prior to BCI use. Eye tracker data were acquired via the EyeTrackerLogger filter in BCI2000 [4] and synchronised to EEG data collection during BCI use. Raw eye tracker position data are pre-processed based on the technical specifications of the Tobii EyeTrackerLogger filter in BCI2000 [4].
Experiment Setup
During an experiment session, participants performed copy-spelling of predefined tokens using the P300 speller application in BCI2000. A user is presented with a set of choices on a speller grid, one of which is assumed to be the user’s desired or target character. In copy-spelling mode, the user spells a pre-defined token (words or number sequence) and is provided a cue on which target character to spell one at a time. To select a new target character, the user focuses on that character as subsets of characters are illuminated on the screen; the illumination of a character subset represents a visual stimulus event. The BCI infers the user’s target character by: i) processing a time window of EEG data time-locked to each stimulus event; ii) using a classifier to detect P300 event-related potentials (ERPs) embedded in EEG data that are elicited in response to the presentation of the target character; and iii) estimating the user’s intended character by matching the character presentation patterns to the detected P300 ERPs with a character decoding function.
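Step iii) above can be illustrated with a minimal sketch. It assumes the classifier has already produced one score per stimulus event (higher meaning more P300-like) and uses a simple sum of scores per character as the decoding function. The function name `decode_character`, the toy flashes and the scores are all hypothetical; the studies in this dataset use their own classifiers and decoding rules.

```python
from collections import defaultdict


def decode_character(flashes, scores):
    """Pick the character whose flashes collected the highest total score.

    flashes: list of sets, the characters illuminated in each stimulus event.
    scores:  list of floats, one classifier score per stimulus event
             (assumed: higher means more P300-like).
    """
    if len(flashes) != len(scores):
        raise ValueError("need exactly one score per stimulus event")
    totals = defaultdict(float)
    for flashed, score in zip(flashes, scores):
        # Every character in the illuminated subset shares the event's score.
        for character in flashed:
            totals[character] += score
    # The estimate is the character with the largest accumulated score.
    return max(totals, key=totals.get), dict(totals)


# Toy example: four stimulus events on a four-character "grid".
flashes = [{"A", "B"}, {"C", "D"}, {"A", "C"}, {"B", "D"}]
scores = [0.9, 0.1, 0.8, 0.2]
best, totals = decode_character(flashes, scores)
print(best)
```

Summing scores is only one possible decoding rule; Bayesian updating of character probabilities is another common choice.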
In general, an experiment session consists of a calibration phase and a test phase. During the calibration phase, participants perform copy-spelling with no BCI feedback to collect labelled EEG data to train a classifier. During the test phase, the trained BCI classifier is applied and participants perform copy-spelling with BCI feedback to evaluate a new BCI algorithm or strategy. The studies in this BCI dataset and the related publications detailing study-specific experiment protocols are listed in Table 1.
Study ID | Related Publication | No. of Participants | Has Target End User? | No. of Sessions | Stimulus Paradigm(s) | Grid Size§ |
---|---|---|---|---|---|---|
A | [5] | 13 | | 1 | RC, CB, RD | 9 × 8 |
B | [6] | 18 | Yes | Var | CB | 6 × 6 |
C | [7] | 19 | | 1 | CB | 9 × 8 |
D | [8] | 17 | | 1 | RC | 9 × 8 |
E | * | 8 | | 1 | CB | 9 × 8 |
F | [9] | 10 | Yes | 3 | CB | 9 × 8 |
G | [10] | 20 | | 1 | CB | 9 × 8 |
H | [11] | 16 | | 1 | CB | 9 × 8 |
I | ꝉ | 13 | | 1 | CB, PB | 9 × 8 |
J | [12] | 20 | | 1 | RC, PB | 6 × 6 |
K | [13] | 5 | | 1 or 2 | CB, AD | 9 × 8 |
L | [14] | 11 | Yes | 1 | RC, CB, CB | 6 × 6 |
M | [15] | 21 | | 1 | CD, AD | 9 × 8 |
N | [16] | 8 | Yes | 2 | CB | 6 × 6 |
O | [17] | 18 | | 2 | CB, CB | 9 × 8 |
P | [18] | 19 | | 2 | CB | 9 × 8 |
Q | [19] | 36 | | 3 | CB, CB | 9 × 8 |
R | [20] | 20 | | 2 | CB | 9 × 8 |
S1 | [21] | 10 | | 1 | CB | 9 × 8 |
S2 | [21] | 24 | | 1 | CB | 9 × 8 |

Abbreviations: AD, Adaptive; AD, Adaptive Diffuse; ALS, Amyotrophic Lateral Sclerosis; CB, Checkerboard; CB, Checkerboard Colour; CB, Checkerboard with suppressed characters; No., number; PB, Performance-Based; RD, Random; RC, Row-Column; Var, variable. §Grid size is specified as number of rows × number of columns in a matrix layout. *The experiment protocol of study E is similar to that of study D. ꝉThe experiment protocol of study I is similar to that of study J.
Data Description
Data were extracted from native BCI2000 source files in .dat file format and stored in the open European Data Format plus (EDF+) [22]. The EDF+ data files contain, where available:
- EEG signals;
- The BCI encoder, target characters and stimulus event markers;
- BCI spelling outcomes and feedback event markers;
- Self-reported demographics (race, ethnicity, age, sex);
- ALS diagnosis and a revised ALS Functional Rating Scale score obtained from medical records, if available; and
- Eye tracker signals.
File Hierarchy
Files are organised by study (see Table 1). The base hierarchy levels are Study#\#_$$\SE@@@, where #, $$ and @@@ are the study, subject and session identifier numbers, respectively. The next hierarchy level specifies either the ~\Train or ~\Test phase of the experiment. The subsequent hierarchy levels until the data files depend on the conditions tested during the experiment. File names end with either *Train%%.edf (train phase) or *Test%%.edf (test phase), where %% is the file number.
./
├── StudyA
│   ├── A_01
│   │   └── SE001
│   │       ├── Train
│   │       │   ├── Condition1
│   │       │   │   ├── A_01_SE001CND1_Train01.edf
│   │       │   │   └── A_01_SE001CND1_Train02.edf
│   │       │   └── Condition2
│   │       │       ├── A_01_SE001CND2_Train01.edf
│   │       │       └── A_01_SE001CND2_Train02.edf
│   │       └── Test
│   │           ├── Condition1
│   │           │   ├── A_01_SE001CND1_Test01.edf
│   │           │   └── A_01_SE001CND1_Test02.edf
│   │           └── Condition2
│   │               ├── A_01_SE001CND2_Test01.edf
│   │               └── A_01_SE001CND2_Test02.edf
:   :
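The hierarchy described above can be turned into metadata with a short path parser. The sketch below assumes paths follow the layout in the example tree (a `Study#` folder, then subject, session and phase folders, then zero or more condition folders). The function name `parse_data_path` and the returned keys are hypothetical, and real paths in the dataset may contain condition levels not shown here.

```python
import re
from pathlib import PurePosixPath

# File names end with Train%%.edf or Test%%.edf, where %% is the file number.
FILE_RE = re.compile(r"(Train|Test)(\d+)\.edf$")


def parse_data_path(path):
    """Extract study, subject, session, phase and file number from a path."""
    parts = PurePosixPath(path.replace("\\", "/")).parts
    start = next((i for i, p in enumerate(parts) if p.startswith("Study")), None)
    if start is None or len(parts) < start + 5:
        raise ValueError(f"path does not match the expected layout: {path}")
    study, subject, session, phase = parts[start:start + 4]
    match = FILE_RE.search(parts[-1])
    if match is None or match.group(1) != phase:
        raise ValueError(f"unexpected file name: {parts[-1]}")
    return {
        "study": study[len("Study"):],
        "subject": subject,
        "session": session,
        "phase": phase,
        # Everything between the phase folder and the file is condition-specific.
        "conditions": list(parts[start + 4:-1]),
        "file_number": int(match.group(2)),
    }


info = parse_data_path(
    "./StudyA/A_01/SE001/Train/Condition1/A_01_SE001CND1_Train02.edf"
)
print(info)
```

Backslash-separated paths, as written in the hierarchy description, are normalised to forward slashes before parsing.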
EDF+ File Header
Conventionally, the EDF+ file header contains information about the patient identity and the technical specifications of the recorded signals [22, 23]. While our BCI dataset is anonymised, we adapted the patient identification subfields in the EDF+ header to include participant demographics. If available, participant demographics (BCI data level 0 [24]) are contained in the EDF+ file header. We also adapted the patient record identification subfields to include a study identifier, session number and equipment model. Our BCI data dictionary for the EDF+ file header is provided in Table 2.
File Header Field | File Header Subfield | Custom File Header Subfield Label | Values |
---|---|---|---|
Patient identification | Patient code | Subject number | A_01, A_02, etc. |
Patient identification | Sex (a) | | Male; Female |
Patient identification | Date of birth | | 01-JAN-YYYY, date-shifted in years based on an assumed recording start date of 01-JAN-2020. If age is not available, set to 01-JAN-YYYY. Note: All study participants are adults (at least 18 years). |
Patient identification | Patient name | <Race>_<Ethnicity>_<ALS Status> | Race (a): White; Black/African American; Asian; American Indian or Alaska Native; Native Hawaiian or Other Pacific Islander; X (if not available) |
Patient identification | Patient name | <Race>_<Ethnicity>_<ALS Status> | Ethnicity (a): Hispanic or Latino; Not Hispanic or Latino; X (if not available) |
Patient identification | Patient name | <Race>_<Ethnicity>_<ALS Status> | ALS Severity (b): NonALS; ALS_#, where # is the revised ALS Functional Rating Scale (ALSFRS-R) score ranging from 0 to 48; ALS_X (if the ALSFRS-R score is not available) |
Recording identification | Start date | | 01-JAN-2020. Note: All recording start dates are set to the same value. |
Recording identification | Hospital administration code | <Dataset name>_<Dataset version>_<Study name> | Dataset name: bigP3BCI. Dataset version: v#.#.# (following the convention for semantic versioning). Study name: StudyA, StudyB, etc. Example: bigP3BCI_v1.0.0_StudyA |
Recording identification | Technician | Session number | SE001, SE002, etc. |
Recording identification | Equipment code | Equipment model | |

(a) Sex, race and ethnicity categories are primarily based on definitions outlined by the NIH [25]. (b) The revised Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS-R) is an instrument for evaluating the degree of functional impairment in individuals with ALS; the total ALSFRS-R score ranges from 0 (worst) to 48 (best) [26].
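As a rough illustration of how these header fields could be read without a dedicated library, the sketch below parses the first 256 bytes of an EDF+ file using the byte offsets from the EDF specification (8 bytes version, 80 bytes patient identification, 80 bytes recording identification). The function names, the returned keys and the synthetic header are hypothetical. The age calculation assumes the reading that the date of birth is shifted relative to the common start date, so age is the start year minus the birth year. The patient name is returned unsplit, because how spaces and underscores inside the race and ethnicity values are encoded in the files is not verified here.

```python
def _year(date_field):
    """Return the year of a dd-MMM-yyyy field, or None if it is not numeric."""
    year = date_field.rsplit("-", 1)[-1]
    return int(year) if year.isdigit() else None


def parse_edf_plus_header(header):
    """Split the two EDF+ identification fields of a 256-byte fixed header."""
    if len(header) < 256:
        raise ValueError("an EDF header starts with 256 bytes of fixed fields")
    # EDF+ subfields are separated by spaces and contain no spaces themselves.
    patient = header[8:88].decode("ascii", errors="replace").split()
    recording = header[88:168].decode("ascii", errors="replace").split()
    patient += ["X"] * (4 - len(patient))      # pad missing subfields with X
    recording += ["X"] * (5 - len(recording))
    birth_year, start_year = _year(patient[2]), _year(recording[1])
    known = birth_year is not None and start_year is not None
    return {
        "subject": patient[0],
        "sex": patient[1],
        "race_ethnicity_als": patient[3],
        "age_years": start_year - birth_year if known else None,
        "admin_code": recording[2],   # <Dataset name>_<Dataset version>_<Study name>
        "session": recording[3],
        "equipment": recording[4],
    }


# Synthetic header built for illustration only; not taken from the dataset.
patient_id = "A_01 Female 01-JAN-1990 White_X_NonALS"
recording_id = "Startdate 01-JAN-2020 bigP3BCI_v1.0.0_StudyA SE001 X"
header = (b"0".ljust(8) + patient_id.encode("ascii").ljust(80)
          + recording_id.encode("ascii").ljust(80) + b" " * 88)
fields = parse_edf_plus_header(header)
print(fields["subject"], fields["age_years"], fields["admin_code"])
```

For a real file, the same function could be called on `open(path, "rb").read(256)`.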
EDF+ Data Records
We followed the IEEE P2731 Working Group (WG) [24, 27] Standard for a Unified Terminology for Brain-Computer Interfaces, currently under development, for data storage and sharing; it recommends providing BCI data at three levels. BCI data level 0 includes information about data acquisition, physiological signals and individual attributes. BCI data level 1 includes information about the BCI paradigm that is needed to train the BCI machine learning algorithm. BCI data level 2 includes information related to the BCI feedback. The data dictionary for the EDF+ data records, organised by the proposed IEEE P2731 BCI data levels, is presented in Table 3. Most labels for the EDF+ data records are derived from parameter definitions in the P3SpellerTask and EyeTrackerLogger modules in BCI2000 [2].
BCI Data Level | Data | Data Label (see table notes) | Values |
---|---|---|---|
Level 0: Biosignals | EEG signals | EEG_<electrode>, e.g., EEG_Cz | |
Level 0: Biosignals | Eye tracker data validity | ET<Left/Right>EyeValid | 0: Eye tracker data invalid; 1: Eye tracker data valid |
Level 0: Biosignals | Eye gaze position | ET<Left/Right>EyeGaze<X/Y> | Ratio between 0 and 1 (location on the screen the participant is looking at); (0, 0) corresponds to the top left of the screen |
Level 0: Biosignals | Eye position | ET<Left/Right>EyePos<X/Y> | Ratio between 0 and 1 (eye position relative to the camera in 2D space); (0, 0) corresponds to the top left of the camera's view |
Level 0: Biosignals | Eye distance | ET<Left/Right>EyeDist | mm (distance between the screen and the eyes) |
Level 0: Biosignals | Pupil size | ET<Left/Right>PupilSize | mm |
Level 1: BCI Training | Target character trial events | PhaseInSequence | 0: Pre-run; 1: Pre-trial; 2: During trial; 3: Post-trial |
Level 1: BCI Training | Stimulus events (Phase 2) | StimulusBegin | 0: stimulus off; 1: stimulus on |
Level 1: BCI Training | Stimulus events (Phase 2) | StimulusType | 0: non-target stimulus event when StimulusBegin = 1; 1: target stimulus event |
Level 1: BCI Training | Character events (Phase 2) | <Character>_<rowIndex>_<columnIndex> (e.g., K_2_3) | 0: character not presented; 1: character presented |
Level 1: BCI Training | Target character (Phase 2) | CurrentTarget | Index of the current target character. Character index = $(r - 1) \times C + c$, where $r$ is the row index, $c$ is the column index, $R$ is the number of rows, $C$ is the number of columns, and $R \times C$ is the grid size. E.g., the index of K_2_3 in a $9 \times 8$ grid is 11. |
Level 2: BCI Feedback | Predicted target character (Phase 3) | Selected<Target/Row/Column> | Index of predicted target character/row/column |
Level 2: BCI Feedback | Presented feedback (Phase 3) | DisplayResult | 0: Feedback not presented; 1: Feedback presented |
Level 2: BCI Feedback | Presented feedback (Phase 3) | FakeFeedback | Index of presented character overriding the predicted target character during a character trial |

Table notes: <⋯> indicates a substring within the angle brackets. <option> in italics indicates a variable substring within the angle brackets. <option1/option2/…/optionN> in solid indicates a variable substring from the fixed set {option1, option2, …, optionN}. E.g., ET<Left/Right>PupilSize indicates two options, ETLeftPupilSize and ETRightPupilSize, for the pupil size label of the eye tracker (ET).
Usage Notes
This dataset can be used for P300 ERP analysis and error-related potential analysis based on information provided in BCI data levels 1 and 2, respectively (Table 3). EDF+ is an open format and can be read using various proprietary and open platforms, such as MATLAB, R, Python and C++ [23].
The BCI studies that generated the data are listed in Table 1 above. Note that demographic data are missing from some files and that some participants have missing files.
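For P300 ERP analysis, a first step is usually to locate stimulus onsets and label them as target or non-target. The sketch below works on the StimulusBegin and StimulusType records from Table 3 once they have been loaded as plain sequences by an EDF+ reader of choice. It assumes an onset is a 0-to-1 transition in StimulusBegin and that StimulusType can be read at the onset sample; both are assumptions about the recordings. The helper `character_index` assumes a row-major, 1-based index, which is the reading consistent with the K_2_3 example (index 11 on a grid with 8 columns). All function names are hypothetical.

```python
def character_index(row, column, n_columns):
    """Row-major, 1-based grid index: (row - 1) * n_columns + column."""
    return (row - 1) * n_columns + column


def stimulus_events(stimulus_begin, stimulus_type):
    """Return (sample_index, is_target) for every stimulus onset.

    An onset is taken to be a sample where StimulusBegin rises from 0 to 1.
    """
    events = []
    previous = 0
    for i, (begin, kind) in enumerate(zip(stimulus_begin, stimulus_type)):
        if begin == 1 and previous == 0:
            # Label the event with the StimulusType value at the onset sample.
            events.append((i, kind == 1))
        previous = begin
    return events


# Toy records: two stimulus events, the first on the target character.
begin = [0, 1, 1, 0, 0, 1, 1, 0]
kind = [0, 1, 1, 0, 0, 0, 0, 0]
events = stimulus_events(begin, kind)
print(events)
print(character_index(2, 3, 8))
```

Each onset index can then serve as the start of an EEG time window; at the 256 Hz sampling rate, a window of n samples spans n / 256 seconds.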
Release Notes
Version 1.0.0: Initial release of the dataset. The dataset version number is included in the EDF+ file header, see Table 2.
Ethics
All data were recorded during online P300-based BCI studies approved by Institutional Review Boards at Duke University, Duke University Health System and East Tennessee State University. All participants gave informed consent, either by themselves or via a legally authorized representative (for some participants with ALS), prior to data collection and were compensated for their time. The dataset is anonymised to protect participant privacy: all personally identifiable information has been removed and all dates and times (e.g., data collection time, birth date) have been time-shifted.
Acknowledgements
The development of this dataset was funded by a grant supplement administered by the National Institute on Deafness and Other Communication Disorders at the National Institutes of Health (Grant R21DC018347-02S1). The authors would like to thank Mr. Thomas Napoles, Ms. Katie Kilroy and Dr. John Board at the Duke University Office of Information Technology for their technical support in preparing this BCI dataset.
Conflicts of Interest
The authors have no conflicts of interest to declare.
References
1. L. A. Farwell and E. Donchin, "Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials," Electroencephalography and Clinical Neurophysiology, vol. 70, no. 6, pp. 510-523, 1988.
2. G. Schalk and J. Mellinger, A practical guide to brain–computer interfacing with BCI2000: General-purpose software for brain-computer interface research, data acquisition, stimulus presentation, and brain monitoring. Springer Science & Business Media, 2010.
3. BCI2000 Wiki, User Reference:gUSBampADC. Available: https://www.bci2000.org/mediawiki/index.php/User_Reference:gUSBampADC
4. BCI2000 Wiki, Contributions:EyetrackerLogger. Available: https://www.bci2000.org/mediawiki/index.php/Contributions:EyetrackerLogger
5. C. S. Throckmorton, D. B. Ryan, B. Hamner, K. Caves, K. A. Colwell, E. W. Sellers, and L. M. Collins, "Towards clinically acceptable BCI spellers: Preliminary results for different stimulus selection patterns and pattern recognition techniques," presented at the 4th International BCI Meeting, Asilomar, CA, 2010.
6. N. A. Gates, C. K. Hauser, and E. W. Sellers, "A longitudinal study of P300 brain-computer interface and progression of amyotrophic lateral sclerosis," in International Conference on Foundations of Augmented Cognition, 2011, pp. 475-483.
7. B. O. Mainsah, K. D. Morton, L. M. Collins, E. W. Sellers, and C. S. Throckmorton, "Moving away from error-related potentials to achieve spelling correction in P300 spellers," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 23, no. 5, pp. 737-743, 2015.
8. B. O. Mainsah, K. A. Colwell, L. M. Collins, and C. S. Throckmorton, "Utilizing a language model to improve online dynamic data collection in P300 spellers," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 22, no. 4, pp. 837-846, 2014.
9. B. O. Mainsah, L. M. Collins, K. A. Colwell, E. W. Sellers, D. B. Ryan, K. Caves, and C. S. Throckmorton, "Increasing BCI communication rates with dynamic stopping towards more practical use: An ALS study," Journal of Neural Engineering, vol. 12, no. 1, p. 016013, 2015.
10. B. Mainsah, K. Morton, L. Collins, and C. Throckmorton, "Extending language modeling to improve dynamic data collection in ERP-based spellers," 6th International Brain-Computer Interface Conference, Graz, Austria, 2014.
11. D. Kalika, L. Collins, K. Caves, and C. Throckmorton, "Fusion of P300 and eye-tracker data for spelling using BCI2000," Journal of Neural Engineering, vol. 14, no. 5, p. 056010, 2017.
12. B. Mainsah, G. Reeves, L. Collins, and C. Throckmorton, "Optimizing the stimulus presentation paradigm design for the P300-based brain-computer interface using performance prediction," Journal of Neural Engineering, vol. 14, no. 4, p. 046025, 2017.
13. B. Mainsah, D. Kalika, L. Collins, S. Liu, and C. Throckmorton, "Information-based adaptive stimulus selection to optimize communication efficiency in brain-computer interfaces," Advances in Neural Information Processing Systems, vol. 31, 2018.
14. D. B. Ryan, K. A. Colwell, C. S. Throckmorton, L. M. Collins, K. Caves, and E. W. Sellers, "Evaluating brain-computer interface performance in an ALS population: Checkerboard and color paradigms," Clinical EEG and Neuroscience, vol. 49, no. 2, pp. 114-121, 2018.
15. X. J. Chen, D. Kalika, C. S. Throckmorton, L. M. Collins, and B. O. Mainsah, "Bayesian Adaptive Stimulus Optimization in Stimulus-driven Brain Computer Interfaces," 2024.
16. J. Clements, E. Sellers, D. Ryan, K. Caves, L. Collins, and C. Throckmorton, "Applying dynamic data collection to improve dry electrode system performance for a P300-based brain–computer interface," Journal of Neural Engineering, vol. 13, no. 6, p. 066018, 2016.
17. G. Frye, C. Hauser, G. Townsend, and E. Sellers, "Suppressing flashes of items surrounding targets during calibration of a P300-based brain–computer interface improves performance," Journal of Neural Engineering, vol. 8, no. 2, p. 025024, 2011.
18. D. B. Ryan, G. Frye, G. Townsend, D. Berry, S. Mesa-G, N. A. Gates, and E. W. Sellers, "Predictive spelling with a P300-based brain–computer interface: Increasing the rate of communication," Intl. Journal of Human–Computer Interaction, vol. 27, no. 1, pp. 69-84, 2010.
19. D. Ryan, G. Townsend, N. Gates, K. Colwell, and E. Sellers, "Evaluating brain-computer interface performance using color in the P300 checkerboard speller," Clinical Neurophysiology, vol. 128, no. 10, pp. 2050-2057, 2017.
20. M. Kellicut-Jones and E. Sellers, "P300 brain-computer interface: Comparing faces to size matched non-face stimuli," Brain-Computer Interfaces, vol. 5, no. 1, pp. 30-39, 2018.
21. M. R. Jones and E. Sellers, "Faces, locations, and tools: a proposed two-stimulus P300 brain computer interface," Journal of Neural Engineering, vol. 16, no. 3, p. 036026, 2019.
22. B. Kemp and J. Olivan, "European data format ‘plus’ (EDF+), an EDF alike standard format for the exchange of physiological data," Clinical Neurophysiology, vol. 114, no. 9, pp. 1755-1761, 2003.
23. European Data Format. Available: https://www.edfplus.info/
24. L. Bianchi, A. Antonietti, G. Bajwa, R. Ferrante, M. Mahmud, and P. Balachandran, "A functional BCI model by the IEEE P2731 working group: Data storage and sharing," Brain-Computer Interfaces, vol. 8, no. 3, pp. 108-116, 2021.
25. National Institutes of Health, Racial and Ethnic Categories and Definitions for NIH Diversity Programs and for Other Reporting Purposes, Notice Number: NOT-OD-15-089, 2015. Available: https://grants.nih.gov/grants/guide/notice-files/NOT-OD-15-089.html
26. J. M. Cedarbaum, N. Stambler, E. Malta, C. Fuller, D. Hilt, B. Thurmond, A. Nakanishi, and the BDNF ALS Study Group (Phase III), "The ALSFRS-R: A revised ALS functional rating scale that incorporates assessments of respiratory function," Journal of the Neurological Sciences, vol. 169, no. 1-2, pp. 13-21, 1999.
27. C. Easttom, L. Bianchi, D. Valeriani, C. S. Nam, A. Hossaini, D. Zapała, A. Roman-Gonzalez, A. K. Singh, A. Antonietti, and G. Sahonero-Alvarez, "A functional model for unifying brain computer interface terminology," IEEE Open Journal of Engineering in Medicine and Biology, vol. 2, pp. 91-96, 2021.
Access
Access Policy:
Anyone can access the files, as long as they conform to the terms of the specified license.
License (for files):
Creative Commons Attribution 4.0 International Public License
Discovery
DOI (version 1.0.0):
https://doi.org/10.13026/0byy-ry86
DOI (latest version):
https://doi.org/10.13026/ckte-3g83
Topics:
brain-computer interface
electroencephalography
ieee p2731 working group standard
amyotrophic lateral sclerosis
p300 speller
p300 event related potential
oddball paradigm
error-related potential
Files
Total uncompressed size: 44.6 GB.
Access the files
- Download the ZIP file (10.3 GB)
- Download the files using your terminal: wget -r -N -c -np https://physionet.org/files/bigp3bci/1.0.0/
Name | Size | Modified |
---|---|---|
bigP3BCI-data | | |
LICENSE.txt | 14.5 KB | 2024-10-24 |
README.md | 3.5 KB | 2024-07-01 |
SHA256SUMS.txt | 957.9 KB | 2025-05-29 |
bigP3BCI_v1_0_0.pdf | 900.2 KB | 2025-05-15 |
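After a download of this size, it can be worth checking the files against SHA256SUMS.txt. The sketch below assumes each line of that file has the usual `sha256sum` layout (a hex digest, whitespace, then a path relative to the folder containing SHA256SUMS.txt); this layout has not been verified against the actual file. The function names are hypothetical, and the demonstration at the end runs on a temporary folder rather than on the dataset.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large EDF+ files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(root, sums_name="SHA256SUMS.txt"):
    """Return the relative paths that are missing or whose digest differs."""
    root = Path(root)
    failures = []
    for line in (root / sums_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*")  # sha256sum marks binary mode with '*'
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(rel_path)
    return failures


# Self-contained demonstration on a temporary folder (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "README.md").write_bytes(b"demo")
    demo_digest = hashlib.sha256(b"demo").hexdigest()
    (Path(tmp) / "SHA256SUMS.txt").write_text(
        f"{demo_digest} README.md\n", encoding="utf-8"
    )
    demo_failures = verify_checksums(tmp)
print(demo_failures)
```

With the `wget` command above, the files are typically saved under `physionet.org/files/bigp3bci/1.0.0/`, so that folder would be passed as `root`.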