Single-Ancestry Analysis

Online Single-Ancestry PRS Training

Quick-Start Working Example

In this section, we provide a working example of the PennPRS single-ancestry data analysis pipeline. More detailed instructions can be found in the section below.

Suppose a user wants to train a PRS model for human height in the Admixed American population. The user can navigate to our GWAS Queryable Database and search for 'Height'.

The study 'GCST90095033' is a suitable input for PRS training, providing GWAS summary statistics from 59,771 Hispanic or Latin American individuals. Users can view more details about this study by clicking the 'View Study' link, which redirects to the corresponding GWAS Catalog page.

To begin PRS training using this dataset from the GWAS Catalog, the user can navigate to our single-ancestry analysis page.

Next, select 'Query Data' and enter the relevant information, which can be obtained from the corresponding GWAS Catalog page. Then click 'Save & Continue'.

In this step, the user selects the training mechanism. There are two options available, and we recommend 'Pseudo-Training (Recommended)' if the user does not have a specific preference.

Next, the user selects the specific PRS training methods. For general users, we recommend selecting all three methods and enabling 'Run Ensemble' to generate two additional ensemble models based on the three methods. We strongly recommend using the default settings for these methods, although users have the option to modify them if needed.

Next, the user names the job. We recommend enabling email notifications and double-checking the job details before submission. Then click 'Submit'.

The user will then be directed to a page confirming successful job submission. If email notifications are enabled, the user will also receive updates on the job status.

A typical job takes approximately 1 to 5 hours to complete, depending on server load as well as the nature of the data and method selected. The user will receive an email notification once the job is finished. If no errors occur, the user can then download the trained PRS models (five in this case) and the log file in a zip file from the 'Job Center'.

If the job fails, the user can check the returned log files (available via 'Download Error Log'), browse the FAQ Section, or contact the PennPRS team directly for support.

Detailed Steps of Job Submission

Our cloud-based, end-to-end single-ancestry PRS model training job consists of five steps:

Step 1. Build the input GWAS summary data file using one of three options: (i) upload local GWAS summary data; (ii) query data from the GWAS Catalog; or (iii) reuse previously uploaded data.

Step 2. Select the PRS model training mechanism from one of two options: (i) pseudo-training using summary data; or (ii) tuning-parameter-free training.

Step 3. Select the specific PRS training methods.

Step 4. Configure and submit the job.

Step 5. Monitor job status and download results.

Each single-ancestry analysis job allows uploading or querying one GWAS summary data file. If you would like to try different GWAS training datasets for the same trait, please submit separate jobs for the different GWAS data files.

Below we provide details for each step.

Step 1. Build input GWAS summary data file

To build the input GWAS summary data file, users have the following three options.

Option 1. Upload local GWAS summary data file

Users can upload a local GWAS summary data file in one of the following formats: .txt, .tsv, .csv, .zip, .gz, .gzip, or .tar.gz, with fields separated by tabs, commas, or spaces. PennPRS will then run a unified harmonization step that converts the uploaded data into a standard format. The uploaded file must meet the following minimum requirements:

Upload one file per job (maximum file size allowed: 800MB). If you upload a .zip file, please make sure that it only contains one file.
Select trait type (binary or continuous phenotype).
Specify the trait name when submitting the job.
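
The file-level requirements above can be checked locally before uploading. The following is a minimal sketch, not part of PennPRS: the function name check_upload is hypothetical, and it assumes that "800MB" means 800 x 1024 x 1024 bytes.

```python
import os
import zipfile

# Assumption: the 800MB limit is interpreted as 800 * 1024**2 bytes.
MAX_UPLOAD_BYTES = 800 * 1024 * 1024
ALLOWED_SUFFIXES = (".txt", ".tsv", ".csv", ".zip", ".gz", ".gzip", ".tar.gz")


def check_upload(path):
    """Return a list of problems found with a file before uploading it."""
    problems = []
    if not path.lower().endswith(ALLOWED_SUFFIXES):
        problems.append("unsupported file extension")
    if os.path.getsize(path) > MAX_UPLOAD_BYTES:
        problems.append("file is larger than 800MB")
    if path.lower().endswith(".zip"):
        with zipfile.ZipFile(path) as archive:
            # Count regular files only; directory entries end with "/".
            members = [n for n in archive.namelist() if not n.endswith("/")]
        if len(members) != 1:
            problems.append(
                f"zip archive contains {len(members)} files, expected 1"
            )
    return problems
```

An empty list means the file passes these local checks; it does not guarantee that the job itself will succeed.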

In addition, the user needs to select a genetic population label for the uploaded data. PennPRS supports the five super populations defined by the 1000 Genomes Project:

AFR (African, Admixed African, or African American)
AMR (Admixed American, Hispanic/Latino)
EAS (East Asian)
EUR (European)
SAS (South Asian)

The uploaded GWAS data file should contain the following columns: CHR, SNP, A1, A2, BETA, SE, P, MAF, N. Please prepare your data in the following format with the required columns (for a continuous trait):

CHR SNP A1 A2 MAF BETA SE P N
1 rs3131969 A G 0.129396 -0.00478692 0.0105404 0.64972 32586
1 rs2286139 C T 0.136484 0.0018321 0.0105153 0.861684 31359
1 rs12562034 A G 0.103985 -0.0016805 0.0114914 0.883732 32827

One example data file satisfying the format requirement can be downloaded here.
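
As an illustration of the format requirement, the sketch below checks a space-separated file for the required columns and runs a few basic sanity checks on each row. It is not the PennPRS harmonization code: the function name check_summary_stats is hypothetical, and the range checks on P, SE, and MAF are our own assumptions about what a well-formed file looks like.

```python
import csv
import io

REQUIRED_COLUMNS = ["CHR", "SNP", "A1", "A2", "BETA", "SE", "P", "MAF", "N"]

# The three example rows from the format description above.
EXAMPLE = """\
CHR SNP A1 A2 MAF BETA SE P N
1 rs3131969 A G 0.129396 -0.00478692 0.0105404 0.64972 32586
1 rs2286139 C T 0.136484 0.0018321 0.0105153 0.861684 31359
1 rs12562034 A G 0.103985 -0.0016805 0.0114914 0.883732 32827
"""


def check_summary_stats(text, delimiter=" "):
    """Return a list of problems found in delimited summary statistics."""
    rows = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    missing = [c for c in REQUIRED_COLUMNS if c not in (rows.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    # The header is line 1, so the first data row is line 2.
    for line_no, row in enumerate(rows, start=2):
        # Assumed sanity checks, not documented PennPRS rules.
        if not 0.0 <= float(row["P"]) <= 1.0:
            problems.append(f"line {line_no}: P outside [0, 1]")
        if float(row["SE"]) <= 0.0:
            problems.append(f"line {line_no}: SE must be positive")
        if not 0.0 <= float(row["MAF"]) <= 1.0:
            problems.append(f"line {line_no}: MAF outside [0, 1]")
    return problems


print(check_summary_stats(EXAMPLE))
```

For tab- or comma-separated files, pass delimiter="\t" or delimiter="," instead.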

PennPRS is able to automatically detect the required columns by checking their names/headers. Definitions and alternative options for the column headers are given below. For example, users are free to use "CHR", "Chr", "chr", or "Chromosome", all of which will be interpreted as the "chromosome" column.

CHR | Chr | chr | Chromosome: chromosome
RSID | snp | Snpid | Snpid_UKB | RS | Rsid | Rs_id: SNP RSID
A1 | a1 | Effect_allele | allele1 | allele_1 | alt_allele | EA: Effect allele (the allele which BETA corresponds to)
A2 | a2 | Allele2 | Allele_2 | Ref_allele | Other_allele | NEA: Alternative/Other allele
MAF | Maf | maf | Effect_allele_frequency | Eaf | FRQ | FRQ_U | F_U: frequency of the minor allele, effect allele, or either one of A1 and A2
BETA | Beta | beta: SNP effect size estimate
ODDS | odds | Odds_ratio: odds ratio of the SNP for a binary trait
SE | se | Stderr | Std_Error | Stderr_Beta | SE_Beta | Stderr_B: standard error of the BETA estimate
P | Pvalue | P_value | pvalue | Pval | P_val | GC_Pvalue: p-value
N | n: GWAS sample size (allowed to vary by SNP) for continuous traits and the total (case + control) sample size for binary traits
Neff | N_eff: effective sample size, same as the total sample size for continuous traits and 4/(1/NCase + 1/NControl) for binary traits
Ncase | N_case: number of cases
Ncontrol | N_controll: number of controls

For continuous traits, we require N (sample size) or Neff (effective sample size, equivalent to N for continuous traits).

For binary traits, we require either Neff (effective sample size, not equivalent to N for binary traits) or Ncase + Ncontrol.
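
For binary traits, the effective sample size follows the formula given in the column definitions, Neff = 4/(1/NCase + 1/NControl). A minimal sketch of that calculation (the function name is hypothetical):

```python
def effective_sample_size(n_case, n_control):
    """Neff = 4 / (1/Ncase + 1/Ncontrol) for a binary trait."""
    if n_case <= 0 or n_control <= 0:
        raise ValueError("case and control counts must be positive")
    return 4.0 / (1.0 / n_case + 1.0 / n_control)


# A balanced study (5,000 cases, 5,000 controls) has Neff equal to the
# total sample size; an unbalanced study with the same total has a
# smaller Neff (about 3,600 for 1,000 cases and 9,000 controls).
print(effective_sample_size(5000, 5000))
print(effective_sample_size(1000, 9000))
```

This is why Neff is not equivalent to N for binary traits: it shrinks as the case/control ratio moves away from 1:1.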

There is no need to manually remove extra columns from the data file; PennPRS automatically ignores any columns that are not needed for the PRS analysis.

Update September 15, 2025: We have updated our code to require an RSID column in the uploaded data to prevent unexpected errors during the quality control step. If only position information is available, please impute RSIDs using reference genotype data from the same genome build.

If the column names in the uploaded data file do not match any of the options listed here, there is no need to rename them in the data file. Instead, simply enter the names for the corresponding columns when submitting the job.
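
The header handling described in this section can be pictured as a lookup from accepted aliases to canonical column names. The sketch below is illustrative only and is not the PennPRS implementation: the names ALIASES and normalize_header are hypothetical, the alias table covers only a subset of the columns listed above, exact (case-sensitive) matching is an assumption, and "SNP" is included as an alias because it appears in the required-format example.

```python
# Canonical column name -> accepted headers (subset of the list above).
ALIASES = {
    "CHR": ["CHR", "Chr", "chr", "Chromosome"],
    "SNP": ["SNP", "RSID", "snp", "Snpid", "Snpid_UKB", "RS", "Rsid", "Rs_id"],
    "A1": ["A1", "a1", "Effect_allele", "allele1", "allele_1", "alt_allele", "EA"],
    "A2": ["A2", "a2", "Allele2", "Allele_2", "Ref_allele", "Other_allele", "NEA"],
    "MAF": ["MAF", "Maf", "maf", "Effect_allele_frequency", "Eaf", "FRQ", "FRQ_U", "F_U"],
    "BETA": ["BETA", "Beta", "beta"],
    "SE": ["SE", "se", "Stderr", "Std_Error", "Stderr_Beta", "SE_Beta", "Stderr_B"],
    "P": ["P", "Pvalue", "P_value", "pvalue", "Pval", "P_val", "GC_Pvalue"],
    "N": ["N", "n"],
    "Neff": ["Neff", "N_eff"],
}

# Flatten into alias -> canonical name for direct lookup.
LOOKUP = {
    alias: canonical
    for canonical, aliases in ALIASES.items()
    for alias in aliases
}


def normalize_header(columns, overrides=None):
    """Map file column names to canonical names.

    `overrides` plays the role of the column names a user enters at job
    submission when the file uses headers outside the alias table.
    Columns that match nothing are left out, mirroring how redundant
    columns are ignored.
    """
    overrides = overrides or {}
    mapped = {}
    for col in columns:
        canonical = overrides.get(col, LOOKUP.get(col))
        if canonical is not None:
            mapped[col] = canonical
    return mapped


print(normalize_header(["Chromosome", "Rsid", "EA", "NEA", "INFO"]))
print(normalize_header(["effect"], overrides={"effect": "BETA"}))
```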

Option 2. Query data from the public GWAS summary database provided by PennPRS

The user can directly query summary data from our public GWAS summary database, which is built from over 27,000 harmonized datasets from the GWAS Catalog.

To query a dataset for PRS analysis, the user just needs to specify the following information:

Trait ID in GWAS Catalog (e.g., GCST009979).
Trait type (binary or continuous phenotype, which can be obtained from the original GWAS Catalog page, by clicking the "View Study" link).
Sample size: N (sample size) for continuous traits; Neff (effective sample size) or Ncase + Ncontrol for binary traits. This information can be obtained from the original GWAS Catalog page, by clicking the "View Study" link.
Trait name when submitting the job.
Genetic population label (one of the five super populations defined by the 1000 Genomes Project). Again, it can be obtained from the original GWAS Catalog page, by clicking the "View Study" link.
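
The information required for a query can be collected and checked before filling in the form. The sketch below is a hypothetical helper, not a PennPRS API: the class QuerySpec, its field names, and the "GCST followed by digits" pattern for GWAS Catalog IDs are all assumptions made for illustration. The example values are taken from the quick-start study above.

```python
import re
from dataclasses import dataclass
from typing import Optional

SUPER_POPULATIONS = {"AFR", "AMR", "EAS", "EUR", "SAS"}


@dataclass
class QuerySpec:
    study_id: str
    trait_name: str
    trait_type: str  # "binary" or "continuous"
    population: str  # one of the five 1000 Genomes super populations
    n: Optional[int] = None
    neff: Optional[float] = None
    n_case: Optional[int] = None
    n_control: Optional[int] = None

    def problems(self):
        """Return a list of missing or inconsistent fields."""
        found = []
        # Assumed pattern for GWAS Catalog IDs such as GCST009979.
        if not re.fullmatch(r"GCST\d+", self.study_id):
            found.append("study ID should look like GCST009979")
        if self.trait_type not in ("binary", "continuous"):
            found.append("trait type must be 'binary' or 'continuous'")
        if self.population not in SUPER_POPULATIONS:
            found.append("population must be AFR, AMR, EAS, EUR, or SAS")
        if self.trait_type == "continuous" and self.n is None:
            found.append("continuous traits need N")
        if (
            self.trait_type == "binary"
            and self.neff is None
            and (self.n_case is None or self.n_control is None)
        ):
            found.append("binary traits need Neff or Ncase + Ncontrol")
        return found


height = QuerySpec("GCST90095033", "Height", "continuous", "AMR", n=59771)
print(height.problems())
```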

Update March 12, 2025: PennPRS now supports direct querying of GWAS summary statistics for over 2,400 disease phenotypes from the FinnGen database (R12) (https://pennprs.org/data).

Option 3. Reuse previously uploaded data

If the user has already uploaded a GWAS summary dataset and successfully finished the job, the uploaded data will be saved on our server, allowing the user to reuse the data for future jobs without needing to re-upload it.

To reuse the data, the user can simply select the name of the trait on the PennPRS record.

Step 2. Select the PRS model training mechanism

After successfully inputting the GWAS summary data, the user will select one of the two PRS model training mechanisms.

Option 1. Pseudo training by using summary data

Methods under this mechanism use a pseudo-training approach that splits the full GWAS summary data into pseudo-training and pseudo-tuning summary-level datasets, which are then used for model training and parameter tuning, respectively.

Currently, PennPRS supports the following three methods, using their pseudo-training versions developed and tested by the PennPRS team.

C+T-pseudo (Clumping and thresholding-pseudo)
LDpred2-pseudo
Lassosum2-pseudo

Option 2. Tuning-parameter-free PRS training

Methods under this mechanism do not require tuning parameters and can train the PRS model directly on the full GWAS summary data, with no need for subsampling. PennPRS currently supports the following three methods: LDpred2-auto, DBSLMM, and PRS-CS-auto.

Step 3. Select the specific PRS training methods

After specifying the model training mechanism in Step 2, the user will be able to select one or more specific PRS training methods.

Pseudo PRS training by using summary data for parameter tuning

If this mechanism was selected in Step 2, then the user can choose one or more pseudo training PRS methods: C+T-pseudo, LDpred2-pseudo, and Lassosum2-pseudo.

If more than one method is selected, the user will be given the option to train an ensemble PRS that combines the selected methods.

Under this model training mechanism, the user may specify tuning parameter settings. The user can either use the default setting (highly recommended) or customize the settings for each selected method.

Tuning-parameter-free PRS training

If this mechanism was selected in Step 2, the user can choose one or more methods from three currently supported methods: LDpred2-auto, DBSLMM, and PRS-CS-auto. We recommend LDpred2-auto for computational efficiency.

Step 4. Configure and submit the job

In Step 4, the user can provide a job name and enable email notifications. The user will then review the input data and method information before submitting the job.

Step 5. Monitor job status and download results

Once a job is successfully submitted, the user will see a confirmation page and can monitor the job status by clicking "View Job Status".

If email notifications were enabled in Step 4, the user will receive separate status updates from nonreply.pennprs@gmail.com at each stage:

(i) when the job is successfully submitted

(ii) when the job starts running

(iii) when the job finishes (either completed successfully or failed)

The user can view the job status and download the results from the "Job Center".

If the job is completed successfully, the user will be able to obtain the PRS weights by clicking "Download Results".