
PS4's memory subsystem has separate buses for the CPU (~20GB/s) and the GPU (176GB/s)

Nah there was an actual downgrade.

Price

 
So overall, I'd argue yes to your question.

The thing that puzzles me is the developers' mention of the need to allocate data correctly. When I first read about AMD's HSA, I assumed that everything would be unified and could be customized to the needs of the developers as required. I thought there wouldn't be any need at all for data separation, as with my limited knowledge it seemed to me that CPU and GPU data would be fed through a common bus and allocated dynamically as needed. Now I'm confused :)
 
The thing that puzzles me is the developers' mention of the need to allocate data correctly. When I first read about AMD's HSA, I assumed that everything would be unified and could be customized to the needs of the developers as required. I thought there wouldn't be any need at all for data separation, as with my limited knowledge it seemed to me that CPU and GPU data would be fed through a common bus and allocated dynamically as needed. Now I'm confused :)

I"m certainly not overly knowledgeable on this but wouldn't there be a learning curve for devs using unified memory?

And hence need to learn how to correctly allocate data for the new unified memory system that they might not be familiar with?

Or am reading into it incorrectly?
 

benny_a

extra source of jiggaflops
The thing that puzzles me is the developers' mention of the need to allocate data correctly. When I first read about AMD's HSA, I assumed that everything would be unified and could be customized to the needs of the developers as required. I thought there wouldn't be any need at all for data separation, as with my limited knowledge it seemed to me that CPU and GPU data would be fed through a common bus and allocated dynamically as needed. Now I'm confused :)
But there isn't any need for data separation. I think it's just the author trying to pack more facts into the article, but putting the fact that the PS3 has split memory right next to the fact that the PS4 uses more than one bus to access the same pool of RAM gives a sense of equivalence where there isn't one.

i'm so ignorant about tech stuff, i need DBZ charts pls.
Technically nothing has changed, so the old DBZ charts are still valid. What is surprising is how quickly they got something running. I don't follow DBZ, but I guess if Goku had a son he would be birthed with ultra-speed.
 

GameSeeker

Member
Xbone's low spec holding PS4 back confirmed?


Incorrect. Please read the complete article.

---
The full quote from Eurogamer is:
"The PS4's GPU is very programmable. There's a lot of power in there that we're just not using yet. So what we want to do are some PS4-specific things for our rendering but within reason - it's a cross-platform game so we can't do too much that's PS4-specific," he reveals.

"There are two things we want to look into: asynchronous compute where we can actually run compute jobs in parallel... We [also] have low-level access to the fragment-processing hardware which allows us to do some quite interesting things with anti-aliasing and a few other effects."
----

Asynchronous compute is one of the areas where the PS4 has a very significant advantage over the Xbone. The PS4 has extra CUs (18 vs. 12 for the Xbone) that can be applied to asynchronous compute. The PS4 also has custom-modified compute queues: 64 vs. the standard 2 on AMD GCN parts.

It's great that PS4 ports are already looking at taking advantage of asynchronous compute this early in the lifecycle.
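To give a rough feel for what "running compute jobs in parallel" means, here's a minimal sketch of the idea in plain C++. The real PS4 API is under NDA, so the queues here are just stand-ins modelled with threads and the job names are made up; the only point is that independent compute jobs get fed to multiple queues while the graphics work carries on.

```cpp
#include <array>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct ComputeQueue {
    std::queue<std::function<void()>> jobs;
    std::mutex m;

    void submit(std::function<void()> job) {
        std::lock_guard<std::mutex> lock(m);
        jobs.push(std::move(job));
    }

    void drain() {  // runs on its own worker thread
        for (;;) {
            std::function<void()> job;
            {
                std::lock_guard<std::mutex> lock(m);
                if (jobs.empty()) return;
                job = std::move(jobs.front());
                jobs.pop();
            }
            job();  // do the compute work outside the lock
        }
    }
};

int main() {
    // Four queues standing in for the PS4's many hardware compute queues.
    std::array<ComputeQueue, 4> queues;
    queues[0].submit([] { std::puts("compute: post-process blur"); });
    queues[1].submit([] { std::puts("compute: particle simulation"); });
    queues[2].submit([] { std::puts("compute: skinning"); });
    queues[3].submit([] { std::puts("compute: audio occlusion rays"); });

    // Compute jobs drain in parallel with the "graphics" work below.
    std::vector<std::thread> workers;
    for (auto& q : queues)
        workers.emplace_back([&q] { q.drain(); });

    std::puts("graphics: rendering the frame on the main queue...");

    for (auto& w : workers) w.join();
    return 0;
}
```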
 
Incorrect. Please read the complete article.

---
The full quote from Eurogamer is:
"The PS4's GPU is very programmable. There's a lot of power in there that we're just not using yet. So what we want to do are some PS4-specific things for our rendering but within reason - it's a cross-platform game so we can't do too much that's PS4-specific," he reveals.

"There are two things we want to look into: asynchronous compute where we can actually run compute jobs in parallel... We [also] have low-level access to the fragment-processing hardware which allows us to do some quite interesting things with anti-aliasing and a few other effects."
----

Asynchronous compute is one of the areas where the PS4 has a very significant advantage over the Xbone. The PS4 has extra CUs (18 vs. 12 for the Xbone) that can be applied to asynchronous compute. The PS4 also has custom-modified compute queues: 64 vs. the standard 2 on AMD GCN parts.

It's great that PS4 ports are already looking at taking advantage of asynchronous compute this early in the lifecycle.

Though I don't agree with the person you corrected, the part you highlighted reads exactly like what he was trying to say...
 

ElTorro

I wanted to dominate the living room. Then I took an ESRAM in the knee.
The thing that puzzles me is the developers' mention of the need to allocate data correctly. When I first read about AMD's HSA, I assumed that everything would be unified and could be customized to the needs of the developers as required.

It is unified in that it has a unified address space. What they are talking about is that you have to decide whether your memory access commands are checked against the CPU's L1/L2 caches (Onion) or not (Garlic). L1/L2 are essential for the CPU's management of latency, and latency-tolerant memory accesses from the GPU would pollute them without needing them. Hence the two pathways. As a developer, you have to decide whether CPU or GPU access performance is most crucial for a given data set. In the former case, you would issue memory access commands via Onion; in the latter, via Garlic. If the GPU wants to read/write data relevant to the CPU for asynchronous compute, it can use Onion+ to use the CPU's L1/L2 caches and, analogously, not pollute its own internal caches.

It is not a contradiction to HSA but, on the contrary, a requirement.
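To make that decision concrete, here's a hypothetical sketch in C++. The bus names are real, but mapMemory() and its Path flag are invented stand-ins for illustration, since the actual PS4 allocation API isn't public; here the "allocator" is just malloc.

```cpp
#include <cstddef>
#include <cstdlib>

// Which pathway a given allocation should be reached through.
enum class Path {
    Onion,   // checked against the CPU's L1/L2 -> for data the CPU touches often
    Garlic   // bypasses the CPU caches, very wide -> for GPU-heavy data
};

// Made-up stand-in: here it is just malloc. On the real hardware the path would
// be chosen through the platform's (non-public) memory-mapping API.
void* mapMemory(std::size_t bytes, Path /*path*/) {
    return std::malloc(bytes);
}

int main() {
    // The CPU reads/writes this every frame, so keep it cache-friendly.
    void* gameState   = mapMemory(8u * 1024 * 1024, Path::Onion);

    // The GPU streams through these; the CPU rarely looks at them.
    void* vertexData  = mapMemory(64u * 1024 * 1024, Path::Garlic);
    void* textureData = mapMemory(256u * 1024 * 1024, Path::Garlic);

    std::free(textureData);
    std::free(vertexData);
    std::free(gameState);
    return 0;
}
```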
 
It is unified in that it has a unified address space. What they are talking about is that you have to decide whether your memory access commands are checked against the CPU's L1/L2 caches (Onion) or not (Garlic). L1/L2 are essential for the CPU's management of latency, and latency-tolerant memory accesses from the GPU would pollute them without needing them. Hence the two pathways. As a developer, you have to decide whether CPU or GPU access performance is most crucial for a given data set.

It is not a contradiction to HSA but, on the contrary, a requirement.

So this would in fact give devs more options?

If they wanted to be "lazy" they could just use the Onion approach, or they could go the Garlic route and potentially gain more power/speed?

I apologize if my post asks old questions but I wasn't here in March :p
 

Nafai1123

Banned
Example of how the Onion/Garlic buses can be used.

"One's called the Onion, one's called the Garlic bus. Onion is mapped through the CPU caches... This allows the CPU to have good access to memory," explains Jenner.

"Garlic bypasses the CPU caches and has very high bandwidth suitable for graphics programming, which goes straight to the GPU. It's important to think about how you're allocating your memory based on what you're going to put in there."

"One issue we had was that we had some of our shaders allocated in Garlic but the constant writing code actually had to read something from the shaders to understand what it was meant to be writing - and because that was in Garlic memory, that was a very slow read because it's not going through the CPU caches. That was one issue we had to sort out early on, making sure that everything is split into the correct memory regions otherwise that can really slow you down."

So elements like the main system heap (containing the main store of game variables), key shader data, and render targets that need to be read by the CPU are allocated to Onion memory, while more GPU-focused elements like vertex and texture data, shader code and the majority of the render targets are kept in the ultra-wide Garlic memory.

Really interesting read. It's great news that 3rd parties are already coming this far with the tools and the lowest-level GNM API this early. Sounds like the GNMX wrapper API still needs some work but is improving.

"Most people start with the GNMX API which wraps around GNM and manages the more esoteric GPU details in a way that's a lot more familiar if you're used to platforms like D3D11. We started with the high-level one but eventually we moved to the low-level API because it suits our uses a little better," says O'Connor, explaining that while GNMX is a lot simpler to work with, it removes much of the custom access to the PS4 GPU, and also incurs a significant CPU hit.

"The other thing we did is to look at constant setting. GNMX - which is Sony's graphics engine - has a component called the Constant Update Engine which handles setting all the constants that need to go to the GPU. That was slower than we would have liked. It was taking up a lot of CPU time. Now Sony has actually improved this, so in later releases of the SDK there is a faster version of the CUE, but we decided we'd handle this ourselves because we have a lot of knowledge about how our engine accesses data and when things need to be updated than the more general-purpose implementation... So we can actually make this faster than the version we had at the time."
 

stryke

Member
These kinds of tech presentations are always interesting, really appreciate Reflections for doing this (and Sony for allowing them to). Hopefully we get more of these leading up to launch.

Ubisoft seem very chummy with Sony lately
 

ElTorro

I wanted to dominate the living room. Then I took an ESRAM in the knee.
This is good right?

Yes.

Without the two pathways, either (a) all memory access would go through the L1/L2 caches and the GPU would pollute them by caching data that is irrelevant to the CPU, which is highly dependent on L1/L2, or (b) all memory access would go directly to memory, causing the CPU to die the death of latency, since CPUs cannot hide latency the way GPUs can.
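If you want to see point (a) for yourself, here's a toy C++ microbenchmark. Nothing PS4-specific: the array sizes are assumptions about typical cache sizes, and the numbers are machine-dependent. A big streaming pass stands in for GPU-style traffic and evicts the small hot working set, which then reads slower.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

static volatile long long sink;  // keeps the compiler from discarding the sums

long long sumOver(const std::vector<int>& v) {
    long long s = 0;
    for (int x : v) s += x;
    return s;
}

double timeMs(const std::vector<int>& v) {
    auto t0 = std::chrono::steady_clock::now();
    sink = sumOver(v);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    std::vector<int> hot(64 * 1024, 1);              // ~256 KB: a cache-friendly working set
    std::vector<int> streaming(32 * 1024 * 1024, 1); // ~128 MB: a cache-thrashing stream

    timeMs(hot);                    // warm the caches
    double before = timeMs(hot);
    timeMs(streaming);              // the "GPU-style" pass evicts the hot set
    double after = timeMs(hot);

    std::printf("hot set before pollution: %.3f ms, after: %.3f ms\n", before, after);
    return 0;
}
```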
 

vpance

Member
So PS4-specific optimizations = better AA, more fx. And at launch too. Bodes well for the future of multiplats.
 

ElTorro

I wanted to dominate the living room. Then I took an ESRAM in the knee.
So if the unified memory doesn't negate the need to separate data in the way you described, what is its main benefit?

They are not talking about separating memory but about how to allocate/access it. All memory is visible to both the CPU and the GPU. They are only talking about whether to access certain regions of memory via one pathway or the other, depending on whether CPU or GPU performance is most relevant to a particular data set.
 

Xenon

Member
Something told me we were not going to go into next gen without hearing about the unlocked power these machines have. So all we need now is a hard number for the percentage of the machine's potential the first-gen games are going to be using: 50, 70, 90 or 110%.
 
It is unified in that it has a unified address space. What they are talking about is that you have to decide whether your memory access commands are checked against the CPU's L1/L2 caches (Onion) or not (Garlic). L1/L2 are essential for the CPU's management of latency, and latency-tolerant memory accesses from the GPU would pollute them without needing them. Hence the two pathways. As a developer, you have to decide whether CPU or GPU access performance is most crucial for a given data set. In the former case, you would issue memory access commands via Onion; in the latter, via Garlic. If the GPU wants to read/write data relevant to the CPU for asynchronous compute, it can use Onion+ to use the CPU's L1/L2 caches and, analogously, not pollute its own internal caches.

It is not a contradiction to HSA but, on the contrary, a requirement.

Yes.

Without the two pathways, either (a) all memory access would go through the L1/L2 caches and the GPU would pollute them by caching data that is irrelevant to the CPU, which is highly dependent on L1/L2, or (b) all memory access would go directly to memory, causing the CPU to die the death of latency, since CPUs cannot hide latency the way GPUs can.

Thanks so much for the detailed explanation, I appreciate it!
 

WolvenOne

Member
I have limited tech savviness at my disposal, but I'm reading this as a shortcut to reduce latencies and bottlenecks, i.e. when one bus is occupied, things don't suddenly come screeching to a halt in terms of performance.

It's little things like this that help consoles punch above their weight class relative to PCs. Vanilla off-the-shelf PCs tend to have slightly less optimized architectures.

Also, I'm REALLY pleased with how easy PS4 development appears to be so far. I really hope it leads to a lot of new smaller titles later on. I'm really hungry for new franchises, and creative gameplay concepts.
 

benny_a

extra source of jiggaflops
Fuc*ing lol!

But yeah, Cerny really thought about pretty much everything. We will all reap the benefits of their hard work.
We all love Cerny (except the guy with the anti-Cerny avatar), but let's not discount the multitude of hardware engineers in Japan who probably proposed a lot of these things and then implemented them.

He is calling the shots, but he is also the representative of the hardware team.
 
We all love Cerny (except the guy with the anti-Cerny avatar), but let's not discount the multitude of hardware engineers in Japan who probably proposed a lot of these things and then implemented them.

He is calling the shots, but he is also the representative of the hardware team.

I love what Cerny represents :p
 

ElTorro

I wanted to dominate the living room. Then I took an ESRAM in the knee.
I have limited tech savviness at my disposal, but I'm reading this as a shortcut to reduce latencies and bottlenecks, i.e. when one bus is occupied, things don't suddenly come screeching to a halt in terms of performance.

It is really just about whether an individual address in main memory should be cached in the CPU's L1/L2 (Onion) or not (Garlic). CPU L1/L2 is (a) of limited size and (b) highly relevant to the CPU but at the same time irrelevant to the GPU. Hence, you issue access commands for CPU-relevant data through Onion and access commands for CPU-irrelevant data through Garlic. As a result, the GPU does not bully the CPU.
 

Panajev2001a

GAF's Pleasant Genius
This is the impressive part:



And this:



This bodes really well for PS4 ports being of very high quality and the best versions on consoles.

This is really good. Months for 2-3 people for such a task at this stage of the console's lifecycle (months before launch) is quite impressive.
 

Orayn

Member
Cool stuff! Seems like this lets them sidestep the one potential drawback of the GDDR5, and choosing which bus to use sounds like a pretty trivial form of optimization compared to the PS3's split memory and Cell SPE shenanigans.

It was almost bad news.

It can still be bad news if you truly want it to be. You just need to BELIEVE! The Onion bus has a smaller number, you see, therefore fewer GokuFLOPS and bits.
 

DieH@rd

Banned
Really nice article, tons of details on the PS4 architecture and SDK tools. Sony seems to be in a very good position with the PS4 deployment.


The article also confirmed two things: 2 CPU cores are dedicated to the PS4 OS, and the PSN download speed limit is [was] 12Mbps.
 

Shikoro

Member
We all love Cerny (except the guy with the anti-Cerny avatar), but let's not discount the multitude of hardware engineers in Japan who probably proposed a lot of these things and then implemented them.

He is calling the shots, but he is also the representative of the hardware team.

Of course, they all did a wonderful job and I, just like a lot of people here, can't wait to see it in action for myself.
 
This doesn't seem that complicated to me. Developers can program it like a more standard system and pass information from the CPU to the GPU through main memory if they want, or they can pass it straight from the CPU to the GPU over the Onion bus, which goes through the CPU caches rather than bypassing them.
 