24-CORE MACHINE FOR 4K+

Mystery Box is starting off the year with a brand new computer designed for high-powered After Effects and DaVinci Resolve rendering. While establishing our new performance benchmarks, we took the opportunity to optimize After Effects using its Memory and Multiprocessing settings.

Why Optimize?

A recent high-definition animation project we completed for a client took 43 minutes 48 seconds to render using the default After Effects settings on our 8-core Mac Pro (3.0GHz, 64GB RAM, dual AMD FirePro D700). On the same computer, with the same project files and optimized After Effects multiprocessing settings, it took 17 minutes 54 seconds. That is roughly 2.45 times as fast (a 59% reduction in render time) on exactly the same hardware.
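The speed-up can be checked directly from the two render times. A minimal sketch (the helper names are illustrative, not from any tool) that converts both times to seconds and expresses the gain as a speed multiple and as time saved:

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a render time given in minutes and seconds to seconds."""
    return minutes * 60 + seconds

def speedup(baseline_s: int, optimized_s: int) -> float:
    """How many times faster the optimized render is than the baseline."""
    return baseline_s / optimized_s

def time_saved_pct(baseline_s: int, optimized_s: int) -> float:
    """Percentage of the baseline render time eliminated by optimization."""
    return 100 * (1 - optimized_s / baseline_s)

baseline = to_seconds(43, 48)   # default After Effects settings
optimized = to_seconds(17, 54)  # optimized multiprocessing settings

print(f"{speedup(baseline, optimized):.2f}x as fast")             # 2.45x as fast
print(f"{time_saved_pct(baseline, optimized):.1f}% less time")    # 59.1% less time
```

Quoting the result as a multiple (2.45x) or as time saved (59%) avoids the ambiguity of a bare "improvement" percentage.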

Optimization can improve performance on projects whose frames render independently of one another. Some effects and plug-ins can see reduced performance when multiple frames are rendered simultaneously, because the separate streams compete for system resources. Examples include effects that reference multiple frames, such as temporal noise reduction, and third-party plug-ins that are independently optimized to make use of the GPU and CPU. However, when frames are rendered independently, running multiple processing streams can make more efficient use of your hardware.

Background

The first assumption to challenge when optimizing After Effects for multiprocessing is that throwing every physical or hyper-threaded core at the project will give the best possible performance.

This is not the case.

When you allocate cores for multiprocessing in After Effects, you choose how much RAM to allocate to each process and how many cores to reserve for other applications. The catch is that both Windows and Mac OS manage how system resources are shared among processes, so running 2-3 threads in After Effects may consume 4-6 cores' worth of resources. If you have 6 cores in total and start 6 threads, you may inadvertently starve every thread by leaving too few resources for the operating system to manage.

The amount of RAM allocated to each thread also makes a big difference in thread performance. After Effects allows allocations of 1-6GB per core. In a comparison test on a single machine, allocating 6GB of RAM per thread across 7 threads produced the 17 minute 54 second time referenced above. Reducing the allocation to 4GB per thread increased the time to 20 minutes 47 seconds, a 13.9% reduction in throughput, and reducing it to 3GB increased the time to 21 minutes 56 seconds, an 18.4% reduction. Because the maximum allocation per thread is the smaller of 6GB and the RAM available to After Effects divided by the number of threads, adding threads beyond the point where each can still receive 6GB (or 4GB) can significantly reduce rendering performance, even when it would not significantly starve other system resources.
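The two calculations in the paragraph above can be sketched as follows. This is a simplified model, assuming the per-thread allocation can take any value (it treats the setting as continuous) and that the throughput loss is measured as the drop in frames rendered per second relative to the 6GB run; the function names are hypothetical.

```python
MAX_RAM_PER_THREAD_GB = 6.0  # upper end of the 1-6GB range described above

def ram_per_thread(ram_for_ae_gb: float, threads: int) -> float:
    """Largest per-thread allocation: an even share of AE's RAM, capped at 6GB.

    Simplification: treats the allocation as a continuous value.
    """
    return min(MAX_RAM_PER_THREAD_GB, ram_for_ae_gb / threads)

def throughput_loss_pct(best_seconds: int, slower_seconds: int) -> float:
    """Percentage drop in frames rendered per second relative to the faster run."""
    return 100 * (1 - best_seconds / slower_seconds)

# 7 threads on Machine 1 at 6GB, 4GB and 3GB per thread.
six_gb, four_gb, three_gb = 17 * 60 + 54, 20 * 60 + 47, 21 * 60 + 56
print(f"{throughput_loss_pct(six_gb, four_gb):.1f}%")   # 13.9%
print(f"{throughput_loss_pct(six_gb, three_gb):.1f}%")  # 18.4%

# Machine 1 has 58GB available to After Effects (figure from the analysis).
print(ram_per_thread(58, 8))   # 6.0   (cap reached, each thread gets the maximum)
print(ram_per_thread(58, 16))  # 3.625 (more threads force a smaller share each)
```

The cap explains why adding threads is not free: past a certain count, every additional thread lowers the allocation available to all of them.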

The last consideration when optimizing a render system is multiprocessing startup time. The more threads started at the beginning of a render, the longer After Effects takes to initialize multiprocessing. Startup involves launching a subprocess for each thread and staggering the thread start times so that the threads generally finish in chronological order. If a thread finishes out of order, it must wait for the preceding threads to finish their frames before it can hand its rendered frame back to the central process, which writes frames to disk in the proper order. As a result, starting all threads can take a few minutes, and that overhead can actually hurt performance on sequences that would otherwise render in only a few minutes under the default settings.
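The ordering constraint described above behaves like a reorder buffer. The sketch below is a hypothetical model, not After Effects' internal code: frames arrive in whatever order threads finish them, and the writer releases a frame only once every earlier frame has been written. In this model early frames wait in a buffer, whereas in After Effects the thread itself waits, which is why out-of-order finishes cost throughput.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Frame = Tuple[int, str]  # (frame number, rendered data); hypothetical representation

def write_in_order(finished: Iterable[Frame]) -> Iterator[Frame]:
    """Yield frames in frame-number order, holding back any that finish early.

    `finished` lists frames in the order threads complete them. Frame numbers
    are assumed to start at 0 and be unique.
    """
    pending: List[Frame] = []  # min-heap of frames waiting on earlier ones
    next_frame = 0
    for frame in finished:
        heapq.heappush(pending, frame)
        # Release every frame that is now contiguous with what was already written.
        while pending and pending[0][0] == next_frame:
            yield heapq.heappop(pending)
            next_frame += 1

# Threads finish frames 1 and 3 early; the writer still emits 0, 1, 2, 3.
completion_order = [(1, "f1"), (0, "f0"), (3, "f3"), (2, "f2")]
print([number for number, _ in write_in_order(completion_order)])  # [0, 1, 2, 3]
```

If frame 0 never arrives, nothing is written at all, which mirrors how one slow thread can hold up every thread that finished after it.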

The Test Project

To determine the optimal multiprocessing settings, we used a recently completed project that relied on a small number of internal and external vector shapes and objects, a couple of external raster images with transparency, heavy use of motion blur and gradients, and both applied and transitional Gaussian blurs. Puppet warps and many-element motion slowed frame rendering overall while keeping the bottleneck within the After Effects rendering process itself rather than in any associated storage hardware. Similarly, the project was rendered to a lightly compressed raster format (Apple Animation) to minimize the compression load on the core process and free up system resources for the multiprocessing threads. These settings should provide an adequate performance benchmark for multiprocess After Effects rendering on our systems while controlling for other influential factors. Rendering was done at 8 bits per channel RGB, 1080p30.

Results

The starred value is optimal; values within ±5% of it show no meaningful divergence from optimal.

Machine 1: Mac Pro 8-Core, OS X Yosemite 10.10.1, 3.0GHz Intel Xeon Gen 2 8-Core with Hyper-Threading (16 virtual cores), 64GB DDR3 ECC 1866MHz RAM, 2 x AMD FirePro D700, AE Version 13.2 (CC 2014.2). Rendering to Thunderbolt 2 RAID 0 (900MB/s write).

No Multiprocessing: 43 Min 48 Sec

5 Cores @ 6GB RAM: 22 Min 45 Sec

6 Cores @ 6GB RAM: 18 Min 49 Sec

7 Cores @ 6GB RAM: 17 Min 54 Sec

7 Cores @ 4GB RAM: 20 Min 47 Sec

*8 Cores @ 6GB RAM: 17 Min 23 Sec

8 Cores @ 4GB RAM: 19 Min 45 Sec

9 Cores @ 4GB RAM: 19 Min 35 Sec

Machine 2: Custom 24-Core, Windows 8.1, 2 x 2.5GHz Intel Xeon Gen 3 12-Core with Hyper-Threading (48 virtual cores), 128GB DDR4 ECC 2133MHz RAM, 2 x NVIDIA Quadro K6000, AE Version 13.2 (CC 2014.2). Rendering to internal SSD RAID 0 (1.5GB/s write).

No Multiprocessing: 41 Min 52 Sec

8 Cores @ 6GB RAM: 12 Min 23 Sec

10 Cores @ 6GB RAM: 11 Min 34 Sec

11 Cores @ 6GB RAM: 11 Min 36 Sec

*12 Cores @ 6GB RAM: 10 Min 46 Sec

13 Cores @ 6GB RAM: 10 Min 48 Sec

14 Cores @ 6GB RAM: 12 Min 13 Sec

15 Cores @ 6GB RAM: 13 Min 34 Sec

19 Cores @ 6GB RAM: 15 Min 5 Sec
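The ±5% band can be recomputed from the times listed above. A minimal sketch, assuming "within ±5%" means a render time no more than 5% longer than the fastest multiprocessing run on the same machine (the no-multiprocessing baselines are left out):

```python
from typing import Dict, List, Tuple

Config = Tuple[int, int]      # (threads, GB of RAM per thread)
RenderTime = Tuple[int, int]  # (minutes, seconds)

# Multiprocessing runs from the results above.
RESULTS: Dict[str, Dict[Config, RenderTime]] = {
    "Machine 1": {
        (5, 6): (22, 45), (6, 6): (18, 49), (7, 6): (17, 54), (7, 4): (20, 47),
        (8, 6): (17, 23), (8, 4): (19, 45), (9, 4): (19, 35),
    },
    "Machine 2": {
        (8, 6): (12, 23), (10, 6): (11, 34), (11, 6): (11, 36), (12, 6): (10, 46),
        (13, 6): (10, 48), (14, 6): (12, 13), (15, 6): (13, 34), (19, 6): (15, 5),
    },
}

TOLERANCE = 0.05  # assumed meaning of the band: at most 5% slower than the best run

def near_optimal(runs: Dict[Config, RenderTime]) -> Tuple[Config, List[Config]]:
    """Return the fastest configuration and every configuration within TOLERANCE of it."""
    seconds = {cfg: m * 60 + s for cfg, (m, s) in runs.items()}
    best = min(seconds, key=lambda cfg: seconds[cfg])
    limit = seconds[best] * (1 + TOLERANCE)
    return best, sorted(cfg for cfg, t in seconds.items() if t <= limit)

for machine, runs in RESULTS.items():
    best, close = near_optimal(runs)
    print(machine, "best:", best, "within 5%:", close)
```

With these times, the sketch flags 7 threads at 6GB as near-optimal on Machine 1 and 13 threads at 6GB on Machine 2, alongside the starred configurations.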

Analysis

The 8-core Mac Pro performs best with 8 threads (100% of physical cores, 50% of hyper-threaded virtual cores) and 48GB of RAM (82.8% of the 58GB available), while the custom 24-core machine performs best with 12 threads (50% of physical cores, 25% of hyper-threaded virtual cores) and 96GB of RAM (78.7% of the 122GB available). A loose correlation between the two optimal settings suggests using the maximum RAM per core (6GB) until allocations reach roughly 80% of the RAM available to After Effects.

In all cases, the operating system's activity and performance monitors showed significant and consistent use of all available processing power (Activity Monitor on Mac OS reports full load as 1600%, 100% for each physical and virtual core, divided among the active processes; Task Manager on Windows reports it as 100%, 2.08% for each physical and virtual core). This suggests that the limiting factor is not the amount of processing power available, but how efficiently it is used.
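The two monitoring conventions described above scale the same load differently. A small sketch, assuming only what the paragraph states (per-core scaling on Activity Monitor, a single 100% scale on Task Manager); the function names are hypothetical:

```python
def full_load_reading(logical_cores: int, per_core_scale: bool) -> float:
    """Reading shown when every logical core is fully busy.

    per_core_scale=True: Activity Monitor style, 100% per logical core.
    per_core_scale=False: Task Manager style, 100% for the whole machine.
    """
    return 100.0 * logical_cores if per_core_scale else 100.0

def reading_per_core(logical_cores: int, per_core_scale: bool) -> float:
    """Share of the reading contributed by one fully busy logical core."""
    return full_load_reading(logical_cores, per_core_scale) / logical_cores

def cores_busy(reading_pct: float, logical_cores: int, per_core_scale: bool) -> float:
    """Convert a reading into the equivalent number of fully busy logical cores."""
    return reading_pct / full_load_reading(logical_cores, per_core_scale) * logical_cores

print(full_load_reading(16, True))               # 1600.0 (Machine 1, Activity Monitor)
print(round(reading_per_core(48, False), 2))     # 2.08   (Machine 2, Task Manager)
print(cores_busy(50.0, 48, False))               # 24.0
```

Normalizing both readings to "cores busy" makes the two machines' utilization directly comparable.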

Conclusion

Other than per-machine trial and error, there is no accurate predictor of optimal rendering performance. Under the right conditions, significant gains in rendering power can be achieved, resulting in significantly shorter render times. We suggest determining the optimal settings through experimentation; however, small differences in values around the optimal "ballpark" produce only marginally different render times, which suggests that, except in extreme cases, close enough is good enough.


Multiprocessing can be a useful tool for improving After Effects rendering performance when each frame involves large amounts of internally generated data. However, once you consider the startup overhead of multiprocessing and the potential performance loss with some plug-ins and effects, it quickly becomes apparent that the decision to use optimized multiprocessing settings should be made project by project: disable multiprocessing for short renders, for projects that rely on independently optimized third-party plug-ins, and for sequences with multi-frame dependencies, and enable it for animation and motion graphics projects whose frames render independently.
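The project-by-project decision above can be written as a short checklist. This is a sketch, not a tested rule: the `Project` fields and the five-minute cutoff for a "short render" are hypothetical, chosen only because startup can take a few minutes.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the text says only that thread startup can take a few minutes.
SHORT_RENDER_MINUTES = 5.0

@dataclass
class Project:
    default_render_minutes: float   # estimated render time with default settings
    self_optimized_plugins: bool    # third-party plug-ins that manage CPU/GPU use themselves
    multiframe_effects: bool        # effects that read other frames, e.g. temporal noise reduction
    frames_independent: bool        # every frame renders without reference to other frames

def use_multiprocessing(p: Project) -> bool:
    """Apply the project-by-project checklist from the conclusion."""
    if p.default_render_minutes < SHORT_RENDER_MINUTES:
        return False  # thread startup overhead can outweigh the gain
    if p.self_optimized_plugins or p.multiframe_effects:
        return False  # parallel frames compete for the same resources
    return p.frames_independent

motion_graphics = Project(45.0, False, False, True)
print(use_multiprocessing(motion_graphics))  # True
```

The checks are ordered from cheapest to judge (expected render length) to the one that needs knowledge of the composition (frame independence).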