The ISPS Data Archive supports the sharing of quality data by ensuring that deposits meet certain accessibility and usability standards. Meeting these standards ultimately contributes to more reproducible science.
Use YARD to submit study information and replication materials directly to ISPS data curators for deposit in the ISPS Data Archive.
YARD is easy to use! Follow the four steps described here.
How to deposit in YARD
The following steps guide first-time users through the process of submitting study materials.
Step 1: Create an account or log in
Step 2: Create a new catalog record and enter study information
Once you are logged in, create a new catalog record for your project. A catalog record will contain your study information, files, and review history. Once a catalog record is created, you will be prompted to enter background information about your study and the research methods used. Required fields are marked with an asterisk, but please try to complete as many fields as possible. If you are unsure of how to answer any of the prompts, try hovering your cursor over the prompt to see additional information. You can save your entries and return later to add or update them.
Step 3: Upload files
Next you will be directed to upload study files. Please include all files necessary to replicate results in the associated publication (see What to deposit).
It can be extremely helpful to future users to include a README document that briefly describes the contents of each file, especially for studies with more than two data files. In addition, it is often appropriate to submit additional files beyond data and code. For example, you may want to include treatment materials, survey instruments, codebooks, and output files. Please use plain text or .pdf format for all supplemental material.
The ISPS Data Archive file size limit is 2GB.
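Before uploading, it can help to check file sizes locally. The sketch below is illustrative only. It assumes the limit applies to each file and that "2GB" means 2 × 1024³ bytes; the archive may define it differently. The function name `files_over_limit` is hypothetical and not part of YARD.

```python
from pathlib import Path

# Assumption: the limit applies per file and "2GB" means 2 * 1024**3 bytes.
SIZE_LIMIT_BYTES = 2 * 1024**3


def files_over_limit(folder, limit=SIZE_LIMIT_BYTES):
    """Return (relative path, size in bytes) for every file larger than limit."""
    root = Path(folder)
    oversized = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                oversized.append((path.relative_to(root).as_posix(), size))
    return oversized
```

Calling `files_over_limit("deposit")` on a hypothetical folder named `deposit` returns an empty list when every file is within the assumed limit.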
Step 4: Submit for curation
When you have fully described your project, uploaded all relevant files, and included all required metadata, you can request curation. The ISPS Data Archive will then curate and review your files before publishing them on the ISPS website. You can read about what happens with your project during this process.
Guidance and more information:
For questions, please contact firstname.lastname@example.org.
What to deposit
ISPS-affiliated authors and PIs are expected to provide raw data and other information related to ISPS-supported research (e.g., instructions, treatment manuals, questionnaires, software, and details of procedures). Deposits should include all data and documentation necessary to independently read and interpret the data collection.
To use the ISPS Data Archive, authors and PIs are required to deposit the following at minimum:
- Data File(s)
- Program File(s)
- Link to publication
Other types of files are strongly encouraged but not required:
- README file
- Output File(s)
- Study metadata
- Treatment Materials
- Supplementary Materials
See example https://isps.yale.edu/research/data/d089
Data files:
- Please submit the complete and up-to-date file(s) that you used to generate results in your paper. Please also include weights and constructed variables if applicable.
- ASCII format is preferable (system files created in older versions of statistical packages may have limited readability and usability in the future). This format maximizes the potential for use across different software packages, as well as prospects for long-term preservation.
- A comma-delimited file is easy to create: use Stat/Transfer or similar software to convert to an Excel CSV file.
- If you have a dataset in Stata or similar, please include it as well.
- If you’re working in R with an R dataset, please also generate a comma-delimited file.
- File naming conventions: The contents of the file should be easily identified from its name. Preferably, the data file name should identify the author, project or study, and year. For example: “Gerber_Green_Larimer_APSR_2008.dat”.
- If you have more than one data file (for example, if you have a data set for each experiment, or from various sources), please name each file clearly identifying either the number or type of experiment, geographic location, date, or data source in the name. For example “Gerber_Huber_APSR_2009_ExperimentA.dat”.
- Variable labels and value labels should clearly describe the information or question recorded in that variable (see more in “Codebook” below).
- When applicable, all identifying information should be removed from the records to ensure confidentiality. To prepare a data file for public access, authors and PIs should remove personal identifiers, that is, variables that allow direct or indirect identification of individuals. These include:
- Addresses, including ZIP codes
- Telephone numbers, including area codes
- Social Security numbers
- Other linkable numbers such as driver license numbers, certification numbers, etc.
- Detailed geographic information (e.g., state, county, or census tract of residence)
- Organizations (to which the respondent belongs)
- Educational institutions (from which the respondent graduated and year of graduation)
- Exact occupations
- Place where respondent grew up
- Exact dates of events (birth, death, marriage, divorce)
- Detailed income
- Offices or posts held by respondent
- Derivative, or constructed (as opposed to raw) data files should be identified as such.
- You will have an opportunity to indicate whether you have the rights to distribute the data or whether the data are deposited only for the purpose of verifying computational reproducibility. In cases where access to the data is restricted and you do not have the rights to deposit the data in the ISPS Data Archive, you’ll have the opportunity to provide information on how to access the data.
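As a rough illustration of the de-identification guidance above, the sketch below writes a public copy of a comma-delimited file with direct identifier columns left out. The column names and the function name `write_public_copy` are hypothetical. Dropping columns covers only direct identifiers; indirect identifiers such as detailed geography, exact dates, or detailed income may instead need recoding or masking.

```python
import csv

# Hypothetical column names; replace with the identifier variables in your data.
IDENTIFIER_COLUMNS = {"address", "zip_code", "phone", "ssn", "birth_date"}


def write_public_copy(source_path, public_path, drop=IDENTIFIER_COLUMNS):
    """Copy a CSV file, leaving out the columns listed in drop.

    Returns the names of the columns that were removed.
    """
    with open(source_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        all_columns = reader.fieldnames or []
        kept = [name for name in all_columns if name not in drop]
        removed = [name for name in all_columns if name in drop]
        with open(public_path, "w", newline="", encoding="utf-8") as dst:
            # extrasaction="ignore" skips the dropped columns in each row.
            writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
            writer.writeheader()
            for row in reader:
                writer.writerow(row)
    return removed
```

Keeping the original file untouched and generating the public copy from code leaves a record of exactly which variables were removed.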
Program files:
- If the code relies on libraries or outside files, put installation instructions and code at the top of the script.
- Please submit all the relevant program file(s) that accompany the data file(s). Make sure you include all syntax that produces the tables and figures and all other results that appear in the published manuscript, ideally in the order they appear in the manuscript.
- If possible, create a master file to run all your scripts in sequence (if you have multiple code files).
- Use comments to label sections and output.
- Calling data:
- Make sure your code has a command that opens/reads the data, so it is clear which dataset is being used (even if you only have one dataset).
- Use relative paths, instead of absolute paths, in your code when calling or saving files.
- Format and file naming:
- R format is preferable.
- The contents of the file should be easily identified from its name. If you have one program file, please name it to identify the author, project or study, and year. For example: “Gerber_Green_Larimer_APSR_2008.do”.
- Make sure this name corresponds to the data file.
- If you have more than one program file (for example, if you have separate .do files for each table), please name each file clearly using the main name (as above) and short identifier (e.g., “Gerber_Huber_APSR_2009_table1.do”).
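One way to combine a master file with relative paths is sketched below. It is written in Python purely for illustration; the same idea applies to a Stata or R master file. It assumes the step scripts sit in one folder and carry a hypothetical Step##_ prefix, and the function names are not part of any ISPS tool.

```python
import subprocess
import sys
from pathlib import Path


def ordered_steps(code_dir):
    """Return the Step##_*.py scripts in code_dir, sorted by file name."""
    return sorted(Path(code_dir).glob("Step[0-9][0-9]_*.py"))


def run_all(code_dir):
    """Run every step script in order, stopping at the first failure."""
    code_dir = Path(code_dir)
    completed = []
    for script in ordered_steps(code_dir):
        # cwd=code_dir lets each script open files with relative paths,
        # for example "data/survey.csv", wherever the folder is copied.
        subprocess.run([sys.executable, script.name], cwd=code_dir, check=True)
        completed.append(script.name)
    return completed
```

Because every path is resolved relative to the code folder, a curator can unpack the deposit anywhere and run the whole sequence without editing the scripts.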
README file:
- The README file should include:
- Title, author(s), link to article (if available), and a contact email address;
- A description of the computing environment used to run the original analysis:
- Operating system and version (e.g., Windows 10, Ubuntu 18.04).
- Number of CPUs/Cores.
- Size of memory.
- Statistical software package(s) used in the analysis, and their version.
- Packages, ado files, or libraries used in the analysis, and their version.
- A description of the contents of the directory and each file, especially for studies with more than two data or program files. Make sure all files associated are mentioned;
- If there are multiple code files, specify the sequence of execution (it is also recommended to prefix each file name with Step##_).
- Plain text or PDF formats are preferable.
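A README skeleton along these lines could be generated with a short script, as in the minimal sketch below. It assumes the script is run on the machine used for the original analysis. The function names and the layout of the text are illustrative, not an ISPS template. Memory size and the versions of statistical packages are not reported by the Python standard library, so they are left to fill in by hand.

```python
import os
import platform


def environment_summary():
    """Collect the environment details that the standard library can report."""
    return {
        "Operating system": f"{platform.system()} {platform.release()}",
        "CPUs/cores": os.cpu_count(),
        "Python version": platform.python_version(),
    }


def readme_text(title, authors, contact, files):
    """Build README text; files is an ordered list of (name, description) pairs."""
    lines = [
        f"Title: {title}",
        f"Author(s): {', '.join(authors)}",
        f"Contact: {contact}",
        "",
        "Computing environment:",
    ]
    for key, value in environment_summary().items():
        lines.append(f"- {key}: {value}")
    # Not available from the standard library; complete manually.
    lines.append("- Memory: (fill in)")
    lines.append("- Statistical software and packages, with versions: (fill in)")
    lines.append("")
    lines.append("Files, in order of execution:")
    for name, description in files:
        lines.append(f"- {name}: {description}")
    return "\n".join(lines) + "\n"
```

The returned string can be saved as a plain text README and then edited to add anything the script cannot detect.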
Output files:
- The output file shows the results of using the program and data files.
- Please also include summary statistics (frequency distributions, means, etc.) of all variables. Unweighted frequency distributions should show both valid and missing cases.
- Plain text or .log formats are preferable.
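An unweighted frequency distribution that reports valid and missing cases separately might be produced as in the sketch below. It assumes a comma-delimited data file with a header row. The default missing codes (blank and "NA") and the function name `frequency_table` are assumptions; substitute the missing codes documented for your own data.

```python
import csv
from collections import Counter


def frequency_table(csv_path, variable, missing_codes=("", "NA")):
    """Return unweighted counts for one variable, split into valid and missing."""
    valid = Counter()
    missing = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            value = (row.get(variable) or "").strip()
            if value in missing_codes:
                # Blank cells are shown as "(blank)" so they remain visible.
                missing[value or "(blank)"] += 1
            else:
                valid[value] += 1
    return {"valid": dict(valid), "missing": dict(missing)}
```

Passing `missing_codes=("", "9")` would count a code of 9 as missing rather than valid, which is how a variable with "9=system missing" would be tabulated.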
Codebook:
- The codebook is critical to the interpretation of your data and output files. The codebook should provide information about each variable, including variable label and value label (see more below). Each factor variable in the data collection should have a set of exhaustive, mutually exclusive, and clearly defined codes.
- Plain text format is preferable; other formats are acceptable.
- For each variable, the following information should be provided:
- Location in the data file. Ordinarily, the order of variables in the documentation will be the same as in the file; if not, the position of the variable within the file should be indicated.
- Variable name and label. For example, “g2004: Voted in the general elections of 2004.”
- The exact question wording or the exact meaning of the datum. Sources should be cited for questions drawn from previous surveys or published work. For example, “q2: political leaning (exact Q wording: “Do you lean more toward the Democratic or Republican party?” source: ANES)”
- Universe information, i.e., from whom the data was actually collected. If this is a survey, documentation should indicate exactly who was asked the question. If a filter or skip pattern indicates that data on the variable were not obtained for all observations, that information should appear together with other documentation for that variable.
- Value labels. A clear label to interpret each of the codes assigned to the variable. For example, “g2004: 1=yes, 0=no.”
- Missing data codes. Codes assigned to represent data that are missing. Different types of missing data should have distinct codes. For example: “g2004: 9=system missing.”
- Imputation and editing information. Documentation should identify data that have been estimated or extensively edited.
- Details on constructed and weight variables. Datasets often include variables constructed using other variables. Documentation should include “audit trails” for such variables, indicating exactly how they were constructed, what decisions were made about imputations, and the like. Ideally, documentation would include the exact programming statements used to construct such variables. Detailed information on the construction of weights should also be provided.
- Variable groupings. For large datasets, it is useful to categorize variables into conceptual groupings.
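The per-variable information listed above can be kept in a small structure and rendered as plain text. The sketch below is one possible layout, not an ISPS format; the class name, field names, and output layout are all illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class CodebookEntry:
    """Documentation for one variable; field names are illustrative."""
    position: int
    name: str
    label: str
    question: str = ""
    universe: str = ""
    value_labels: dict = field(default_factory=dict)
    missing_codes: dict = field(default_factory=dict)

    def to_text(self):
        """Render the entry as plain text, skipping empty fields."""
        lines = [f"{self.position}. {self.name}: {self.label}"]
        if self.question:
            lines.append(f"   Question: {self.question}")
        if self.universe:
            lines.append(f"   Universe: {self.universe}")
        if self.value_labels:
            codes = ", ".join(f"{c}={t}" for c, t in self.value_labels.items())
            lines.append(f"   Values: {codes}")
        if self.missing_codes:
            codes = ", ".join(f"{c}={t}" for c, t in self.missing_codes.items())
            lines.append(f"   Missing: {codes}")
        return "\n".join(lines)
```

For the g2004 example, `CodebookEntry(position=1, name="g2004", label="Voted in the general elections of 2004", value_labels={1: "yes", 0: "no"}, missing_codes={9: "system missing"}).to_text()` yields a three-line entry with the label, the value labels, and the missing code. The position value of 1 is made up for the example.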
Treatment and study materials:
- Electronic copies of materials used to administer the intervention (treatment). For example, mailings, transcripts of robo-calls, summary of curriculum, TV ads, audio files. Also include original instructions. The instructions should be presented in a way that, together with the design summary, conveys the protocol clearly enough that the design could be replicated by a reasonably skilled experimentalist.
- Plain text or PDF formats are preferable. For multimedia formats, please contact email@example.com.
Other supplementary documents:
- Plain text or PDF formats are preferable.
- This may include:
- Survey questionnaires, self-administered questionnaires
- include all question and response option wording, logic code (e.g., skip patterns, randomization)
- Interview schedules
- Interviewer and coder instructions
- Data collection forms for transcribing information from records
- Paper tests and scales
- Screening forms
- Call-report forms
- Final project report, project summary, or other description of the project
- Informed Consent Statement
- Deposit the original full data file; make it restricted if you need to. Allow public access to relevant data files, subject to confidentiality or other legal, policy, and ethical restrictions.
- Keep all original variables and recode variables in the syntax to create public datasets, or sub-datasets. Mark these data files clearly.
- Only include data in a data file; include figures or analyses in additional files.
- Consider aggregating data into fewer, larger files, rather than many small ones. It is more difficult and time consuming to manage many small files and easier to maintain consistency across data sets with fewer, larger files. It is also more convenient for other users to select a subset from a larger data file than it is to combine and process several smaller files. Very large files, however, may exceed the capacity of some software packages. In such cases, files might be grouped by data type, site, time period, measurement platform, investigator, method, or instrument. Alternatively, files can be compressed. Please contact firstname.lastname@example.org.
- File names should be meaningful, and ideally, describe content, date range, geographic location, and version information.
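File names can be assembled and checked programmatically. The sketch below follows the pattern of the examples given earlier in this guidance (author names, publication venue, year, and an optional short identifier). The regular expression and the function names are assumptions made for illustration, not an official ISPS rule.

```python
import re

# Hypothetical pattern: Author(s)_Venue_Year, an optional identifier, an extension.
NAME_PATTERN = re.compile(
    r"^[A-Za-z]+(?:_[A-Za-z]+)*_(?:19|20)\d{2}(?:_[A-Za-z0-9]+)?\.[A-Za-z0-9]+$"
)


def build_name(authors, venue, year, extension, identifier=None):
    """Join the parts with underscores, e.g. Gerber_Huber_APSR_2009_table1.do."""
    parts = list(authors) + [venue, str(year)]
    if identifier:
        parts.append(identifier)
    return "_".join(parts) + "." + extension


def follows_convention(file_name):
    """Return True when the name matches the assumed pattern."""
    return NAME_PATTERN.match(file_name) is not None
```

For example, `build_name(["Gerber", "Green", "Larimer"], "APSR", 2008, "dat")` gives "Gerber_Green_Larimer_APSR_2008.dat", while a name such as "data.csv" does not match the pattern.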
Additional resources on preparing data and other materials for deposit in an archive or repository:
- ICPSR (Inter-University Consortium for Political and Social Research) Guide to Social Science Data Preparation and Archiving.
- UK Data Service Guidance on Preparing Data for Deposit.
- Cornell CISER guidelines on preparing files for reproducibility.
- World Bank DIME analytic reproducibility checklist
Help at Yale:
What happens to your files
When you submit your catalog record for curation, an ISPS administrator will be notified. The administrator will assign the catalog record to a curator who will review the files.
Curators check that all data is meaningfully labeled, that no personally identifying information is included, and that sample sizes match those reported in the associated publication. Curators also check that all code executes without error and returns all results stated in the associated publication.
The ISPS Data Archive team will also create additional file formats (for example, ASCII and R), metadata describing the study and associated files, and an XML file with machine-readable study-level and variable-level information using DDI 3.2.
The Archive team might contact you with questions throughout this process. You can also follow along with the review by logging on to YARD and viewing your catalog record. Each step of the process will be marked as completed when the curator finishes that review step.
Behind the scenes
Upon deposit, a safe copy is created and deposited in a dark archive. A public copy of the files is created and begins processing, which includes generating study-level and file-level metadata, confirming all variables and values are labeled, standardizing missing values, creating and augmenting documentation, assessing and minimizing disclosure risk by applying techniques such as recoding, masking, or removal of variables, and assigning persistent links. The review of code files (statistical and other programming scripts) includes verifying that the code executes and that the published scientific results can be reproduced with the given code and data. The data and code review processes include an assessment of the quality of documentation and contextual information necessary for long-term usability (for example, a codebook, a README file, and commented code). In cases where these are found lacking or insufficient, the archive works with researchers on remedial actions. All file formats are normalized (including migrating software-specific data files to flat file formats such as ASCII, text, or comma delimited, and rewriting code written using licensed statistical software such as SPSS to open-source statistical languages such as R). All files are assigned a unique identifier (handle), and file sets have citation information. After completion of the process, materials are stored and made publicly available via the ISPS Data Archive Web portal.
For ISPS, the software will be deployed on Yale infrastructure in partnership with IT and the Yale University Library, and additional access to these materials will be provided through the Library’s Digital Collections portal.
The ISPS process, which aligns data curation with quality review, has been influential and has informed the development of similar practices in other social science data archives that recently joined together under a consortium called Curating for Reproducibility (CURE). See more about the ISPS approach.