Quanticate Blog

The Advantages of Parallel Processing Clinical Data in SAS/CONNECT

Written by Clinical Programming Team | Tue, Oct 15, 2024


As clinical trials grow in complexity, the volume of data collected and the need for advanced statistical techniques, such as Bayesian Analysis and Multiple Imputation (MI), continue to expand. These methods require substantial computational power, leading to increased processing times that can delay critical analysis and reporting stages. Such delays pose challenges in meeting tight project timelines, especially when multiple analyses or quality control (QC) tasks must be performed on large datasets.

To address these challenges, SAS/CONNECT offers a solution through parallel processing, allowing multiple tasks to be executed concurrently across different server sessions. This approach not only reduces the total time required for data processing and QC activities but also enables more thorough analyses, such as generating a higher number of imputed datasets or obtaining more samples in Bayesian frameworks. By leveraging SAS parallel processing, clinical data teams can maintain data integrity and quality while meeting stringent timelines, making it an essential tool for optimising workflow efficiency in clinical research.

 

Parallel processing

Typically, reporting and analysis programs are run in SAS by submitting code to a single server session. In this single-threaded workflow all code is executed sequentially, which can lead to long processing times, especially with large datasets or computationally intensive analyses.

SAS/CONNECT is a SAS client/server toolset that provides the ability to manage, access, and process data in a distributed and parallel SAS environment. A program can hand work to multiple server sessions at once, so independent pieces of code run in parallel. For tasks that can be split in this way, this can vastly reduce processing time.
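As a rough illustration of this pattern, the sketch below starts two sessions and runs one independent step in each. It assumes SAS/CONNECT is licensed and that extra sessions can be spawned on the same machine with SASCMD="!sascmd"; other environments may need a different sign-on method. The session names task1 and task2 are arbitrary, and the SASHELP datasets stand in for study data.

```sas
/* Assumption: additional sessions can be started on this host using the
   same command that launched the current SAS session. */
options sascmd="!sascmd";

/* Start two server sessions (the names are illustrative). */
signon task1;
signon task2;

/* WAIT=NO returns control to the client immediately, so the two
   blocks below execute concurrently rather than one after the other. */
rsubmit task1 wait=no;
  proc means data=sashelp.class n mean std;
    var height weight;
  run;
endrsubmit;

rsubmit task2 wait=no;
  proc freq data=sashelp.cars;
    tables origin*type;
  run;
endrsubmit;

/* Pause the client until both sessions have finished. */
waitfor _all_ task1 task2;

/* Bring the spooled log and output of each session back to the client,
   then close the sessions. */
rget task1;
rget task2;
signoff _all_;
```

The WAIT=NO option is what makes the submissions asynchronous; without it, each RSUBMIT block would finish before the next one started.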

 

Basic Parallel Processing with SAS: Using THREADS and CPUCOUNT

Parallel processing in SAS doesn’t always require complex setups or extensive coding. For many basic parallel processing needs, enabling multi-threading in SAS can significantly boost performance with just a few option statements. A key setting for this is the THREADS option, which allows thread-enabled procedures (for example PROC SORT and PROC MEANS) to distribute their work across multiple CPU cores, making better use of modern multi-core processors. Steps that are not thread-enabled still run on a single thread.

To activate multi-threading, simply include the following in your SAS program:

OPTIONS THREADS CPUCOUNT=ACTUAL;

The THREADS option enables multi-threading, while CPUCOUNT=ACTUAL instructs SAS to detect the number of logical cores available on the system and use them efficiently. This allows SAS to adjust to the computational capacity of the environment automatically. Running the following step writes the current CPUCOUNT value to the SAS log, confirming the number of cores detected:

PROC OPTIONS OPTION=CPUCOUNT;
RUN;

This allows users to tailor their parallel tasks based on the capacity of their server or local machine. For example, in environments with high CPU availability, this can drastically reduce the runtime of tasks such as data sorting or statistical computations.
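The sorting case can be tried with a short sketch such as the one below. The dataset work.big and its size are made up for illustration, and FULLSTIMER is switched on only so that real time and CPU time can be compared in the log. Whether threading helps in practice depends on the data volume and on the hardware.

```sas
/* Enable threading and write detailed timing statistics to the log. */
options threads cpucount=actual fullstimer;

/* Illustrative data only: one million rows of random values. */
data work.big;
  call streaminit(2024);
  do id = 1 to 1000000;
    x = rand('uniform');
    output;
  end;
run;

/* THREADS on the procedure statement requests a threaded sort for this
   step; replacing it with NOTHREADS gives a single-threaded run to
   compare against. */
proc sort data=work.big out=work.big_sorted threads;
  by x;
run;
```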

 

Balancing Multi-threaded Performance

While multi-threading can offer significant speed improvements, it’s important to balance the workload to avoid over-utilising system resources. Setting CPUCOUNT to a number lower than the available cores limits the number of threads SAS will use, which prevents it from consuming too many system resources at once. For example:

OPTIONS THREADS CPUCOUNT=2;

This setting restricts SAS to using only two threads, which can be useful in shared server environments where other users may also need resources.

 

 

Multiple imputation

Multiple imputation is a statistical technique for analysing data with missing values: the missing values are imputed multiple times, each of the resulting datasets is analysed separately, and the results are then combined. This approach helps to account for uncertainty about the missing values and is particularly valuable in clinical trials, where missing data can skew results if not properly addressed.

Using SAS/CONNECT, datasets can be imputed and analysed across multiple sessions at once, with the results pooled after all of the sessions have finished executing. For example, MI can be used to impute a dataset 1,000 times across 10 sessions, with each session imputing and analysing 100 datasets while the other sessions do the same. This can vastly reduce the time taken to perform the analysis compared with processing 1,000 datasets sequentially.
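A scaled-down sketch of this approach is given below, using two sessions of 100 imputations each rather than ten. Everything specific in it is an assumption made for illustration: the toy dataset work.trial, the session names mi1 and mi2, the seeds, and the simple regression used as the analysis model. Each session is given a different seed so that the two sets of imputations differ, and the imputation number is shifted for the second session so that it stays unique once the results are stacked.

```sas
/* Assumption: extra sessions can be spawned on the same host. */
options sascmd="!sascmd";

/* Toy data with roughly 20% of CHG set to missing (illustrative only). */
data work.trial;
  call streaminit(101);
  do usubjid = 1 to 200;
    base = rand('normal', 50, 10);
    chg  = 0.3*base + rand('normal', 0, 5);
    if rand('uniform') < 0.2 then chg = .;
    output;
  end;
run;

/* INHERITLIB exposes the client WORK library to each session as CWORK. */
signon mi1 inheritlib=(work=cwork);
signon mi2 inheritlib=(work=cwork);

/* Session 1: impute 100 datasets and fit the model to each of them. */
rsubmit mi1 wait=no;
  proc mi data=cwork.trial out=work.imp nimpute=100 seed=1001 noprint;
    var base chg;
  run;
  proc reg data=work.imp outest=cwork.est_1 covout noprint;
    model chg = base;
    by _Imputation_;
  run;
  quit;
endrsubmit;

/* Session 2: same steps with a different seed. */
rsubmit mi2 wait=no;
  proc mi data=cwork.trial out=work.imp nimpute=100 seed=1002 noprint;
    var base chg;
  run;
  proc reg data=work.imp outest=cwork.est_2 covout noprint;
    model chg = base;
    by _Imputation_;
  run;
  quit;
endrsubmit;

waitfor _all_ mi1 mi2;
signoff _all_;

/* Stack the estimates; shift session 2 so _Imputation_ runs 1 to 200. */
data work.est_all;
  set work.est_1 work.est_2 (in=second);
  if second then _Imputation_ = _Imputation_ + 100;
run;

/* Pool the 200 sets of estimates. */
proc mianalyze data=work.est_all;
  modeleffects Intercept base;
run;
```

Only the small estimate datasets are written back to the client library here; the imputed datasets stay in each session's own WORK library, which keeps data transfer between sessions low.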

 

 

Dataset processing

Clinical trials that collect vast amounts of data on each subject can see dataset processing (e.g. sorting, merging and deriving endpoints) take longer as the size of the SAS dataset increases. With these large datasets, processing time can quickly grow, making it difficult to meet tight reporting timelines.

When parallel processing a dataset with SAS/CONNECT, it may be possible to perform the vast majority of the required operations on a subset of the data. Subsets of a dataset can therefore be processed in separate sessions in parallel and combined once all sessions have finished executing. For example, when producing an analysis dataset such as ADLB in a trial where the derived variables for a given subject do not depend on information collected on other subjects, each subset of subjects can be processed in a different server session.

For instance, imagine a dataset containing data for 1,000 subjects. If processing is distributed across 10 sessions, each session can handle data for 100 subjects independently, significantly reducing the total processing time. Once each subset is processed, the results can be combined into a single dataset for further analysis or reporting.
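The split-process-combine idea might look like the sketch below, which uses two sessions of 500 subjects each to keep the code short. The dataset work.lb, its variables, the baseline rule (first visit) and the session names are all invented for illustration. The approach is only valid under the condition described above, namely that no derivation for one subject needs another subject's records.

```sas
/* Assumption: extra sessions can be spawned on the same host. */
options sascmd="!sascmd";

/* Toy lab data: 1,000 subjects with 5 visits each, already in
   SUBJID/VISITNUM order (illustrative only). */
data work.lb;
  call streaminit(7);
  do subjid = 1 to 1000;
    do visitnum = 1 to 5;
      aval = rand('normal', 100, 15);
      output;
    end;
  end;
run;

/* Each session sees the client WORK library under the libref CWORK. */
signon s1 inheritlib=(work=cwork);
signon s2 inheritlib=(work=cwork);

/* Session 1 derives change from baseline for subjects 1 to 500. */
rsubmit s1 wait=no;
  data cwork.adlb_1;
    set cwork.lb (where=(subjid <= 500));
    by subjid visitnum;
    retain base;
    if first.subjid then base = aval;  /* first visit used as baseline */
    chg = aval - base;
  run;
endrsubmit;

/* Session 2 applies the same derivation to subjects 501 to 1000. */
rsubmit s2 wait=no;
  data cwork.adlb_2;
    set cwork.lb (where=(subjid > 500));
    by subjid visitnum;
    retain base;
    if first.subjid then base = aval;
    chg = aval - base;
  run;
endrsubmit;

waitfor _all_ s1 s2;
signoff _all_;

/* Recombine the subsets; subject order is preserved. */
data work.adlb;
  set work.adlb_1 work.adlb_2;
run;
```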

Analysing multiple endpoints

Clinical trials often require several endpoints to be analysed with the same method, e.g. a mixed effects model. Using SAS/CONNECT, the analysis of each endpoint can be run in parallel rather than sequentially within a given program.

Executing multiple programs at once

Reporting a clinical trial requires many analysis and reporting programs to be executed, which can be time consuming. This is especially true when updates to one program affect the results produced by other analysis programs, or when new or unblinded data are received and everything must be rerun.

SAS/CONNECT enables you to run independent programs simultaneously while still managing dependencies between them, reducing the overall processing time. For example, you can set up some programs to start only after others have completed, ensuring that downstream tasks have access to updated results.
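One way to express such a dependency is to place a WAITFOR statement between stages, as in the sketch below. The inline steps are stand-ins: in practice each RSUBMIT block would more likely run a complete study program. The session names p1 and p2, the dataset names and the BMI derivation on SASHELP.CLASS are assumptions chosen only to make the example self-contained.

```sas
/* Assumption: extra sessions can be spawned on the same host. */
options sascmd="!sascmd";

/* Both sessions can read and write the client WORK library as CWORK. */
signon p1 inheritlib=(work=cwork);
signon p2 inheritlib=(work=cwork);

/* Stage 1: the upstream dataset that later steps depend on. */
rsubmit p1 wait=no;
  data cwork.adsl;
    set sashelp.class;
    bmi = 703 * weight / height**2;  /* weight in lb, height in inches */
  run;
endrsubmit;

/* Hold the downstream work until stage 1 has completed. */
waitfor p1;

/* Stage 2: two summaries that both need ADSL but not each other,
   so they can run side by side. */
rsubmit p1 wait=no;
  proc means data=cwork.adsl noprint nway;
    class sex;
    var bmi;
    output out=cwork.bmi_by_sex mean=mean_bmi;
  run;
endrsubmit;

rsubmit p2 wait=no;
  proc freq data=cwork.adsl noprint;
    tables sex*age / out=cwork.age_by_sex;
  run;
endrsubmit;

waitfor _all_ p1 p2;
signoff _all_;
```

The first WAITFOR is the dependency: removing it would let stage 2 start before ADSL exists, which is exactly the failure the ordering is meant to prevent.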

Disadvantages

While parallel processing with SAS/CONNECT offers significant time savings and efficiency, it is not suitable for all tasks. Several limitations and potential challenges can impact its effectiveness, particularly in shared environments.

Dependency Between Tasks: Not all tasks can be executed in parallel. Many analysis workflows involve dependencies where the results of one step are needed before the next can begin. For example, if a dataset needs to be cleaned or transformed before it can be analysed, the cleaning step must be completed first. In such cases, parallel processing may not be possible, as running these steps concurrently would lead to incorrect results. In these scenarios, a sequential approach is required, reducing the potential for time savings.

Resource Contention: Using multiple server sessions simultaneously can lead to high resource consumption, particularly in shared computing environments. When a user initiates several parallel sessions, each session consumes CPU, memory, and I/O resources. If this is not properly managed, it can lead to a situation where a single user monopolises server resources, causing performance issues for others using the same server. This can result in slower response times for other users and may even impact the stability of the server.

 

Conclusion

SAS/CONNECT is a powerful tool that enables SAS code to be executed in parallel with little additional programming, which can vastly reduce processing times for computationally intensive tasks. This efficiency makes SAS/CONNECT particularly valuable in clinical trials, where timely data analysis and reporting are crucial. However, the additional processing carried out by an individual user can put strain on servers shared with others. A compromise is therefore needed: reducing the number of sessions used increases processing time but leaves enough processing power for other users on the system.

Quanticate's statistical programming team can support you with laboratory datasets, CDISC mapping, and SDTM conversions and domains. Our team of experts would be happy to provide support and guidance for your development programme. If you have a need for these types of services, please submit an RFI and a member of our Business Development team will be in touch with you shortly.