General Purpose
The Fast Track Data Center, serving the entire multi-site Fast Track Project, is located at the Center for Child and Family Policy at Duke University. The Data Center is responsible for the following:
- Processing, archiving, and distributing all data collected for the project; creating and archiving aggregate and scored SAS datasets; providing datasets to project researchers; and processing requests from the wider research community for use of Fast Track data.
- Creating and updating all the necessary documentation for the data, including technical reports for all datasets.
- Archiving all measures used in the project, as well as the codebooks, manuals, technical reports, and outcome reports created by the project.
- Disseminating information to national and international researchers via the Fast Track Project website.
Processing Data
The Data Center receives data collected by research teams led by the Research Coordinators at each of the four Fast Track sites. The Data Center maintains an inventory file that records every stage of the processing routine, from receipt of the data through final storage. Data are collected on scan forms, plain paper forms, and laptop computers.
Data sent on scan forms are read with an optical scanner to create ASCII files containing the data. At this stage, bubble errors such as missing bubbles are noted and corrected. Once each scan sheet has been checked and corrected, it is added to the rest of its group, and the ASCII data file is then sent on to the next processing stage. Data collected on paper forms are entered (either at the sites or by the Data Center) into computerized data entry versions of the forms to create ASCII files. Data that are entered manually go through an additional verification step using a dual-entry system. The original scan and paper forms are shipped to the Data Center, and photocopies of these forms are stored at the research sites.
Data collected via laptop, already in ASCII format, are sent to the Data Center by FTP (File Transfer Protocol). Laptop data are first given a visual check to identify incomplete files or other structural errors so that these problems can be corrected before processing. Copies of the laptop data are stored by the Research Coordinators.
All data shipped to the Data Center are backed up: one copy is placed on the hard drives of at least two Data Center computers, and another copy is burned to CD.
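The dual-entry verification step mentioned above can be illustrated with a short sketch. This is not the Data Center's actual program (the project's processing is done in SAS); it is a minimal Python illustration that assumes each of the two keying passes produces a mapping from a record identifier to its field values. The function name `compare_entries` and the field names are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def compare_entries(
    first: Dict[str, Record], second: Dict[str, Record]
) -> List[Tuple[str, str, str, str]]:
    """Return (record_id, field, first_value, second_value) for each disagreement.

    A record or field present in only one pass is reported with an empty
    string standing in for the missing value.
    """
    mismatches = []
    for record_id in sorted(set(first) | set(second)):
        a = first.get(record_id, {})
        b = second.get(record_id, {})
        for field in sorted(set(a) | set(b)):
            value_a = a.get(field, "")
            value_b = b.get(field, "")
            if value_a != value_b:
                mismatches.append((record_id, field, value_a, value_b))
    return mismatches


# Hypothetical example: the two passes disagree on one item of one form.
first_pass = {"1001": {"q1": "2", "q2": "3"}, "1002": {"q1": "1", "q2": "4"}}
second_pass = {"1001": {"q1": "2", "q2": "5"}, "1002": {"q1": "1", "q2": "4"}}
discrepancies = compare_entries(first_pass, second_pass)
```

Each reported discrepancy would then be resolved by consulting the original paper form.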
The next stage of processing is to create SAS datasets from the ASCII files, also known as “raw data”, using SAS programs written specifically for each measure. Many measures have been created, modified, or converted from scan or paper forms to computerized measures, especially since the eighth year of the study. The Data Center keeps documentation of all changes made to any of the measures used in the project. During this stage of processing, the raw data are read into SAS, the variables are formatted and labeled, records are checked so that no subject has more than one record, outliers are resolved, and several more levels of error checking take place. The result of this processing is called the “unscored dataset”, and it is this dataset that goes on to the next level of processing. These datasets contain all of the variables read from the scan sheets as well as the results of other manipulations (for example, corrections made within the SAS program) that might be needed for that particular instrument. Recently, the programs were updated to take into account the conversion from paper-and-pencil forms or scan sheets to computerized measures.
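The check for more than one record per subject can be sketched as follows. This is a Python illustration only, not the measure-specific SAS programs described above, and it assumes each record carries a subject identifier in a field here called `subject_id` (a hypothetical name).

```python
from collections import Counter
from typing import Dict, Iterable


def find_duplicate_subjects(
    records: Iterable[dict], id_field: str = "subject_id"
) -> Dict[str, int]:
    """Return {subject identifier: record count} for subjects with more than one record."""
    counts = Counter(record[id_field] for record in records)
    return {subject: n for subject, n in counts.items() if n > 1}


# Hypothetical raw records: subject "A" appears twice.
raw_records = [
    {"subject_id": "A", "q1": "2"},
    {"subject_id": "B", "q1": "1"},
    {"subject_id": "A", "q1": "3"},
]
duplicates = find_duplicate_subjects(raw_records)
```

The sketch only flags duplicates. Deciding which record to keep is left to a reviewer, since the source does not describe a resolution rule.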
Verifying Data
SAS datasets are checked again to verify that the data have completed all stages of data processing and are ready to be posted to the server. This verification is carried out by an administrative checking program that compares the dataset in question against the master database containing the identification information for each intervention, control, and normative child. The administrative checking program identifies children with missing records or incorrect TCIDs (Target Child Identification Numbers). Missing records identified by the administrative program are cross-checked against the exception reports that are sent with each shipment of data. Exception reports list the records included in the data files, as well as the records missing from the data files and the reason for each missing record. The results of the administrative check are compared with the verified list of missing records in the exception report. Any discrepancies between the two are resolved through communication with the sites. At this point, the datasets are considered to be “clean”, and these “unscored datasets” are placed on the FTP server for the data analysts to access. It is in this stage of the process that the “scored datasets” are created.
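The logic of the administrative check and its cross-check against the exception report can be sketched with set operations. This is a Python illustration under stated assumptions, not the Data Center's actual program: it treats the master database, the dataset, and the exception report's missing-record list as plain collections of TCIDs, and the function name and result keys are hypothetical.

```python
from typing import Dict, Iterable, List


def administrative_check(
    master_ids: Iterable[str],
    dataset_ids: Iterable[str],
    exception_missing: Iterable[str],
) -> Dict[str, List[str]]:
    """Compare a dataset's TCIDs with the master list and the exception report."""
    master = set(master_ids)
    present = set(dataset_ids)
    documented = set(exception_missing)

    missing = master - present  # children with no record in the dataset
    return {
        # TCIDs in the dataset that do not exist in the master database
        "unknown_ids": sorted(present - master),
        # missing records that the exception report does not account for
        "undocumented_missing": sorted(missing - documented),
        # records reported as missing that are nonetheless in the dataset
        "documented_but_present": sorted(documented & present),
    }


# Hypothetical identifiers for illustration.
report = administrative_check(
    master_ids=["T1", "T2", "T3", "T4"],
    dataset_ids=["T1", "T2", "T9"],
    exception_missing=["T3", "T1"],
)
```

Any non-empty list in the result corresponds to a discrepancy that would be taken up with the site.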
Data Analysis
SAS datasets are created for each measure for each year, site, and cohort in which the measure was administered. In 2000, the Data Center began creating aggregate datasets (combining across sites and cohorts) to make it easier for data analysts to download datasets. Errors found in the aggregate- or scored-level data are corrected at the unscored level, and all datasets are then corrected and replaced. Analysts at the Data Center and at each site continue to prepare technical reports for each measure and each year in which the measure is administered, as well as to develop scoring procedures and scoring programs. The technical reports, scoring procedures, scoring programs, and scored datasets are archived and distributed through the Data Center.
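Combining per-site, per-cohort datasets into one aggregate dataset can be sketched as below. This is a Python illustration only. It assumes each dataset is a list of rows keyed by a `(site, cohort)` pair, and the site labels, field names, and function name are hypothetical.

```python
from typing import Dict, List, Tuple


def aggregate(datasets: Dict[Tuple[str, int], List[dict]]) -> List[dict]:
    """Stack all rows into one list, tagging each row with its site and cohort."""
    combined = []
    for (site, cohort), rows in sorted(datasets.items()):
        for row in rows:
            # Copy the row so the per-site source datasets are left unchanged.
            combined.append({**row, "site": site, "cohort": cohort})
    return combined


# Hypothetical per-site, per-cohort datasets for one measure and year.
per_site = {
    ("site_b", 2): [{"tcid": "T2", "score": 5}],
    ("site_a", 1): [{"tcid": "T1", "score": 3}],
}
aggregate_rows = aggregate(per_site)
```

Because the aggregate is derived from the per-site datasets, rerunning the combination after a correction at the lower level regenerates the aggregate rather than patching it by hand.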
The Data Center and the research sites also share responsibility for data analysis on the research questions specified by the principal investigators and other project researchers. Analysts at the Data Center lead and coordinate analyses testing the outcomes of Fast Track interventions in different domains, as well as other analyses central to the project’s key aims.
Documentation and Data Archive
Documentation of the Fast Track Project is extensive, encompassing a variety of domains:
- Codebooks are created each year to provide information about the measures administered during that data collection year. These codebooks provide a historical background of the measure, a description of any changes or modifications made to the measure, a copy of the measure with variable labels added, and a SAS Proc Contents listing of all variables with their descriptions and characteristics.
- The Data Center produces a weekly report of any additions or changes made to the data on the server, as well as any documentation that is newly available. This weekly report is a valuable resource that allows the Data Center to keep researchers informed of updates and changes, and of procedural issues that may need attention.
- Data analysts must document their research and send to the Data Center copies of all analyses performed for Fast Track papers and presentations. Analysts also must send commented code for each analysis so that the programs can be replicated and archived for future reference.
- The Data Center is a repository of all measures used in the project and of all codebooks, manuals, technical reports, and outcome reports created by the project. In addition, all data collected in the study are stored at the Data Center. The four most recent years of data are kept on-site, and the remaining data are stored in a secured, climate-controlled storage facility.