
PlayStation 4 hUMA implementation and memory enhancements details - Vgleaks

As we promised in our previous article, we present new information about the enhancements in the memory system in PlayStation 4.
http://www.vgleaks.com/more-exclusi...plementation-and-memory-enhancements-details/
Bypass Bits

- If many of these sorts of compute shaders are being run simultaneously, there is “cross talk” in that one compute dispatch may force an invalidate or a premature flush of another dispatch’s SC memory

- As a result of this (and other factors), it may be optimal to bypass either the L1, or the L2, or both

Bypassing all caches for the accesses to the shared CPU-GPU memory (effectively making the data UC rather than SC) will remove the need for the invalidates and writebacks of L1 and L2
At the same time, there will be more – perhaps much more – traffic to and from system memory
- It is possible to change the V# and T# definitions on a dispatch by dispatch basis when exploring these issues and tuning the application

- However, in order to allow for a more stable and debuggable programming approach

Two override bits have been added to the draw call and dispatch controls
The L1 bypass bit specifies that operations on GC and SC memory bypass the L1 and go directly to L2
The L2 bypass bit specifies that operations on SC memory bypass the L2, using the new “Onion+” bus
This allows the application programmer to use the same shader code and V#/T# definitions, and then run the shaders with several different cache flush strategies. No recompilation or reconfiguration is required
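
To make the effect of the two override bits concrete, here is a toy Python model of which GPU caches an access would pass through. It is only a sketch of the rules quoted above, not the actual hardware logic or any real API: the names `MemType` and `caches_used` are made up for illustration, and the default paths for RO and PV (both caches) and for UC (no caches) are assumptions.

```python
from enum import Enum


class MemType(Enum):
    """Memory types named in the article (hypothetical enum for illustration)."""
    RO = "RO"
    PV = "PV"
    GC = "GC"
    SC = "SC"
    UC = "UC"


def caches_used(mem_type, l1_bypass=False, l2_bypass=False):
    """Return the GPU caches an access goes through in this toy model."""
    # Assumption: UC accesses never touch L1 or L2.
    if mem_type is MemType.UC:
        return []

    use_l1 = True
    use_l2 = True

    # L1 bypass bit: GC and SC operations skip L1 and go directly to L2.
    if l1_bypass and mem_type in (MemType.GC, MemType.SC):
        use_l1 = False

    # L2 bypass bit: SC operations skip L2 (the article says via "Onion+").
    if l2_bypass and mem_type is MemType.SC:
        use_l2 = False

    path = []
    if use_l1:
        path.append("L1")
    if use_l2:
        path.append("L2")
    return path


if __name__ == "__main__":
    # Same "shader", four cache strategies, no change to the memory types.
    for l1 in (False, True):
        for l2 in (False, True):
            print(l1, l2, caches_used(MemType.SC, l1, l2))
```

With both bits set, SC accesses touch neither cache in this model, which matches the article's remark that bypassing all caches effectively makes the data UC rather than SC.
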
Four Memory Buffer Usage Examples

1) Simple Rendering

- Vertex shader and pixel shader only; the pixel shader does not perform direct memory accesses

- Vertex buffers (RO)

- Textures (RO)

- Color and depth buffers are written using dedicated hardware mechanisms, not memory buffers

2) Raycast

- In order to compute visibility (“can the enemy see the player”) or sound effect volume (“is there a direct path from audio source to player”), sets of 64 rays are compared against large triangle databases

- Triangle databases (RO)

- Input rays (SC)

- Output collisions (SC)

- The raycast probably doesn’t use much SC data and could potentially entirely bypass L2

3) Procedural Geometry (e.g. water surface)

- The CPU maintains a high level state of the water (ripples, splashes coming from interactions with game objects). The GPU generates the detailed water mesh, which is used only for rendering

- Input: water state as maintained by CPU (SC)

- Output: detailed water surface (GC)

4) Chained compute shaders

- Compute shaders write semaphores for the CP to read, enabling other compute dispatches (and perhaps draw calls) to run. They also add packets to compute pipe queues (perhaps packets that kick off more compute dispatches)

- Various buffers (RO, PV, GC, SC)

- Semaphores (UC)

- Compute pipe queue (UC)

- NOTE that CP does not have access to the GPU L2, so semaphores and queue contents must either be assigned the SC memory type (visible to the CP only after an L2 writeback) or the UC memory type (which bypasses the L2)

- Using UC can allow for greater flexibility, e.g. a compute dispatch can have several stages that send and receive semaphores. Using SC requires the dispatch to terminate before the semaphore is visible externally
Strategies for Scalar Loads

- In addition to the “gather read” and “scatter write” loads into VGPRs (Vector GPRs), the R10xx core also supports scalar reads and writes into SGPRs (Scalar GPRs)

Typically, scalar reads are used to load T#, V#, and S# structures, as well as any other data that applies to the wavefront as a whole (as opposed to the vector reads that load data on a thread-by-thread basis)
- These read operations use the L2, but instead of the L1 they use a different cache called the “K-cache”. There is one 16 KB K-cache for every three CUs

The K-cache must be invalidated when there is the possibility that it may contain “stale” data, e.g. a later draw call or dispatch uses the same location in the T# (etc) ring buffer as an earlier call
K-cache invalidation takes 1 cycle but dumps all data, resulting in a high cost
The most straightforward way of reducing the invalidation count is to use larger ring buffers for the scalar input data to the draw calls and dispatches
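
A rough way to see why a larger ring buffer helps is to count how often the buffer wraps. The Python sketch below assumes, as a simplification, that every wrap forces exactly one K-cache invalidate and that every draw call or dispatch writes the same amount of scalar input data. The function name and the byte sizes are made up for illustration and are not from the article.

```python
def count_kcache_invalidates(num_calls, bytes_per_call, ring_bytes):
    """Count ring-buffer wraps, assuming one K-cache invalidate per wrap."""
    if bytes_per_call > ring_bytes:
        raise ValueError("a single call's data must fit in the ring buffer")

    invalidates = 0
    offset = 0
    for _ in range(num_calls):
        if offset + bytes_per_call > ring_bytes:
            # Wrapping reuses locations written by an earlier call, so the
            # K-cache may hold stale data and has to be invalidated.
            offset = 0
            invalidates += 1
        offset += bytes_per_call
    return invalidates


if __name__ == "__main__":
    # Hypothetical workload: 1000 calls, 256 bytes of scalar input each.
    for ring in (4096, 16384):
        print(ring, count_kcache_invalidates(1000, 256, ring))
```

In this toy workload, going from a 4 KB ring to a 16 KB ring drops the count from 62 invalidates to 15.
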
Performance

- Performance of the L2 cache operations is much better on Liverpool than on R10xx

- The L2 invalidate typically takes 300-350 cycles

All in-flight memory transactions must settle before the invalidate can be completed
A small overhead (about 75 cycles) is required to locate and invalidate the lines
This results in the direct cost listed above. There is also an indirect cost, in that invalidated SC data must potentially be reloaded
- The cost of an L2 writeback depends on the amount of data that must be written back to system memory

The Onion bus can support 10GB/sec, which means 12.5 bytes/cycle (0.2 lines/cycle)
If we attribute 160 GB/sec of the Garlic bus to the GPU, the bus can support 200 bytes/cycle (3.125 lines/cycle)
- If there is only a little SC dirty data present in the L2, the writeback is fairly fast

4K bytes worth of dirty Onion SC lines will take perhaps 500 cycles (Onion bottleneck PLUS small overhead to locate lines PLUS latency to system memory)
20K bytes worth of dirty Garlic SC lines will take about the same time
- Worst case L2 writeback cost is basically the Onion or Garlic cost of writing 512 KB (about 40,000 cycles and 3,000 cycles respectively)
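
The per-cycle and worst-case figures above can be reproduced with a few lines of Python. Two values here are assumptions inferred from the article's own numbers rather than stated in it: an 800 MHz GPU clock (10 GB/sec works out to 12.5 bytes/cycle) and 64-byte cache lines (12.5 bytes/cycle is roughly 0.2 lines/cycle). The sketch also takes 1 KB as 1024 bytes and only counts pure transfer time.

```python
GPU_CLOCK_HZ = 800e6   # assumption: implied by 10 GB/sec == 12.5 bytes/cycle
LINE_BYTES = 64        # assumption: implied by 12.5 bytes/cycle ~= 0.2 lines/cycle

ONION_GB_PER_S = 10    # from the article
GARLIC_GB_PER_S = 160  # share of Garlic attributed to the GPU in the article


def bytes_per_cycle(bandwidth_gb_per_s):
    """Bus throughput expressed in bytes per GPU clock cycle."""
    return bandwidth_gb_per_s * 1e9 / GPU_CLOCK_HZ


def lines_per_cycle(bandwidth_gb_per_s):
    """Bus throughput expressed in cache lines per GPU clock cycle."""
    return bytes_per_cycle(bandwidth_gb_per_s) / LINE_BYTES


def transfer_cycles(num_bytes, bandwidth_gb_per_s):
    """Cycles spent purely moving data (no lookup overhead, no memory latency)."""
    return num_bytes / bytes_per_cycle(bandwidth_gb_per_s)


if __name__ == "__main__":
    worst_case = 512 * 1024  # the 512 KB worst case from the article
    for name, bw in (("Onion", ONION_GB_PER_S), ("Garlic", GARLIC_GB_PER_S)):
        print(name, bytes_per_cycle(bw), lines_per_cycle(bw),
              transfer_cycles(worst_case, bw))
```

The worst-case transfer comes out at roughly 41,943 cycles over Onion and 2,621 cycles over Garlic, which lines up with the article's "about 40,000" and "3,000". The ~500 cycle figure for small writebacks is higher than the raw transfer time because it also includes the overhead to locate lines and the latency to system memory.
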
Additional Optimizations

- There are further optimizations in the L1 and L2 caches

- The L2 cache has dirty state tracking

If the L2 has performed no reads from SC memory since the last invalidate, it will ignore any requests to invalidate
If the L2 has performed no writes to SC memory since the last writeback, it will ignore any requests to perform a writeback
This will help performance in the situation where multiple pipes are requesting invalidates and writebacks, e.g. several compute pipes are separately dispatching compute shaders that use SC memory
- The L1 cache can be invalidated “once per CU”

A dispatch may send multiple wavefronts to a single CU
Using this option, the invalidate of GC/SC occurs only on the first wavefront of the dispatch
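
The L2 dirty state tracking described above can be pictured as two flags. The Python class below is a minimal sketch of that idea only. The class and method names are hypothetical, and the real hardware presumably tracks more than this.

```python
class L2DirtyTracker:
    """Toy model: skip invalidates/writebacks that have nothing to do."""

    def __init__(self):
        self.sc_read_since_invalidate = False
        self.sc_written_since_writeback = False
        self.invalidates_performed = 0
        self.writebacks_performed = 0

    def read_sc(self):
        # The L2 now may hold SC lines that could go stale.
        self.sc_read_since_invalidate = True

    def write_sc(self):
        # The L2 now may hold dirty SC lines.
        self.sc_written_since_writeback = True

    def request_invalidate(self):
        """Return True if the invalidate was actually performed."""
        if not self.sc_read_since_invalidate:
            return False  # no SC reads since the last invalidate: ignore
        self.invalidates_performed += 1
        self.sc_read_since_invalidate = False
        return True

    def request_writeback(self):
        """Return True if the writeback was actually performed."""
        if not self.sc_written_since_writeback:
            return False  # no SC writes since the last writeback: ignore
        self.writebacks_performed += 1
        self.sc_written_since_writeback = False
        return True


if __name__ == "__main__":
    l2 = L2DirtyTracker()
    l2.read_sc()
    # Three compute pipes each ask for an invalidate; only the first one costs anything.
    print([l2.request_invalidate() for _ in range(3)])
```

This is the multi-pipe case the article mentions: several pipes can request invalidates and writebacks back to back, and the redundant requests are ignored.
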

Edit: Added custom DirectX 11 support: can utilize the DirectX 11 feature set

Developers will be able to take advantage of Microsoft’s latest industry standard DirectX API — DirectX 11.1, but Sony has taken the time to improve upon it, pushing the feature set beyond what is available for PC games development.
Those improvements include better shader pipeline access, improved debugging support features out of the box, and much lower level access to the system hardware enabling developers to do “more cool things.” That’s achieved not only through a modified DirectX 11.1 API, but also a secondary low-level API specifically for the PS4 hardware.
http://www.geek.com/games/sony-iimprove-directx-11-for-the-ps4-blu-ray-1544364/
I also noticed that Xbox One GPU is DirectX 11.1+ & PS4 GPU is DirectX 11.2+

DirectX_11.2__PS4-pcgh.jpg



----------------------------------------------------------------------------------------------------------

Basically an update on the implementation of hUMA from the prior thread...
http://www.neogaf.com/forum/showthread.php?t=662537&highlight=vgleaks

-----------------------------------------------------------------------------------------------------------
In multiple threads some of us were debating if it was thread-worthy or not... I took the precautionary measure of making one anyway

New news, new thread. That's the rules.

lol the amount of hUMA threads it will make if we take this analogy would be ridiculous
Ha! showed you!... wait a minute
 

RoboPlato

I'd be in the dick
Article is a bit too technical for my understanding but it seems like this will provide PS4 with some significant efficiency gains and will streamline programming for certain processes by a significant margin. It seems like this could be taken advantage of by more devs than I thought if some of the stuff is as simple to utilize as the article makes it sound.
 

Oppo

Member
So what you are saying is... this is a form of... one might term it...

Blast processing?

I do love me some dirty onion & garlic. that's Cajun style
 
So uhh.... is this unique to the ps4?

No. hUMA is a tech that AMD and others will be pushing over the coming years. It's believed that the Xbox One has something similar.


It's good news though. The less time the system needs to spend swapping memory the more time it can spend on other things. It's going to help games become more graphically rich and more complex as they become more and more efficient with the system. This happens with all hardware, but hUMA should multiply that effect. If this is as powerful and simple as it sounds... the difference between launch games and end of gen games will be the largest in console history. We'll look at Killzone and Forza and laugh that we thought they looked good.
 

RoboPlato

I'd be in the dick
No. hUMA is a tech that AMD and others will be pushing over the coming years. It's believed that the Xbox One has something similar.


It's good news though. The less time the system needs to spend swapping memory the more time it can spend on other things. It's going to help games become more graphically rich and more complex as they become more and more efficient with the system. This happens with all hardware, but hUMA should multiply that effect. If this is as powerful and simple as it sounds... the difference between launch games and end of gen games will be the largest in console history. We'll look at Killzone and Forza and laugh that we thought they looked good.

I think I just shed a tear.
 

USC-fan

Banned
This is the Onion PLUS bus that is only in the PS4. This is not something AMD has rolled into their products at this time.


Very cool and in-depth info. Sony really made some very deep changes to this product. It's going to pay off years from now when they really start "coding to the metal." haha
 
No. hUMA is a tech that AMD and others will be pushing over the coming years. It's believed that the Xbox One has something similar.


It's good news though. The less time the system needs to spend swapping memory the more time it can spend on other things. It's going to help games become more graphically rich and more complex as they become more and more efficient with the system. This happens with all hardware, but hUMA should multiply that effect. If this is as powerful and simple as it sounds... the difference between launch games and end of gen games will be the largest in console history. We'll look at Killzone and Forza and laugh that we thought they looked good.
Dang... by the way, will this in any way help out the CPU's overall power, as in give it an extra boost?

Since both the PS4 and Xbox One have similar CPUs but not GPUs
 

RoboPlato

I'd be in the dick
This is the Onion PLUS bus that is only in the PS4. This is not something AMD has rolled into their products at this time.


Very cool and in-depth info. Sony really made some very deep changes to this product. It's going to pay off years from now when they really start "coding to the metal." haha

I like the approach that they took to PS4. They knew it wouldn't be a beast in raw power since it needed to be affordable, quiet, and cool so they looked down the line and made a lot of improvements in efficiency for techniques that will be getting used in a few years and made them easy to access and utilize.

This is a different article. It has examples.
 

KAL2006

Banned
So with PS4 having hUMA, GDDR5, a more powerful GPU, and some other stuff I forgot.

How wide is the gap between Xbox One and PS4 in non-technical terms?
 

Spongebob

Banned
I like the approach that they took to PS4. They knew it wouldn't be a beast in raw power since it needed to be affordable, quiet, and cool so they looked down the line and made a lot of improvements in efficiency for techniques that will be getting used in a few years and made them easy to access and utilize.


This is a different article. It has examples.
Thanks.

Edited my last post.
 

EMT0

Banned
No. hUMA is a tech that AMD and others will be pushing over the coming years. It's believed that the Xbox One has something similar.


It's good news though. The less time the system needs to spend swapping memory the more time it can spend on other things. It's going to help games become more graphically rich and more complex as they become more and more efficient with the system. This happens with all hardware, but hUMA should multiply that effect. If this is as powerful and simple as it sounds... the difference between launch games and end of gen games will be the largest in console history. We'll look at Killzone and Forza and laugh that we thought they looked good.

Is multiply really the right word? I'm asking because I honestly don't know if that was just a slip of the tongue, or if you're being literal. In the latter case...weak hardware redeemed.
 

BenouKat

Banned
the difference between launch games and end of gen games will be the largest in console history. We'll look at Killzone and Forza and laugh that we thought they looked good.

Excellent !

Pretty funny to read when you've seen people complain on every forum on the internet that this gen will reach the end of its graphics capability very quickly. :)
 

Oppo

Member
So with PS4 having hUMA, GDDR5, a more powerful GPU, and some other stuff I forgot.

How wide is the gap between Xbox One and PS4 in non-technical terms?

I've been reading a lot of this stuff, and I am far from an expert, but I will give this a shot. This is my understanding, corrections are of course welcome.

So these two systems are very very similar. More comparable than any other two consoles in the same gen in history, I'd wager.

The Sony setup is very straightforward. It has high bandwidth access to a bunch of fast RAM. Most of the philosophy behind PS4 is their own 180 from the PS3. Nothing is particularly exotic or fancy. It's a workhorse and well provisioned. It sort of falls within the general philosophy of hUMA, which is to eliminate CPU/GPU bottlenecks. PS4 games will get up to speed quickly. New tricks will be discovered as time goes on because of the new communication abilities to be had between CPU and GPU.

The Xbone approach more closely resembles something like what Sony would have maybe done before. In order to use the much cheaper RAM they opted for, they are using some hUMA type techniques, notably a tiny chunk of fast ESRAM. This helps leverage the slower RAM into a much higher performance. But it's not as simple, it needs to be accounted for. Therefore that will have an impact.

Now much like Cell, hard software problems do get solved eventually, and performance increases, so we can also expect the Xbone performance to climb over time of course. But it's not quite as straightforward as the PS4 solution, and to my untrained eye it seems that you can see a bit of the compromise that MS has opted to make, in the service of longevity/price/component suppliers/whatever.

But I still think the takeaway is that these boxes are more similar than not, and while the PS4 does indeed hold an edge, it still remains to be seen how that actually plays out in the wider market. All we know for sure is that Sony's first party will probably blow the doors off like they usually do but who knows what beyond that.
 