I truly believe the GameCube was the "perfect" hardware solution for Nintendo. They didn't brute-force anything, yet the real-world performance was incredible. I think one key component that made it all possible was the MoSys 1T-SRAM (the CPU and GPU were no slouches either). I remember Factor 5 praising how amazingly fast it was, saying they were caught by surprise: they were already feeding the system huge amounts of data, yet the more they fed it, the more it just kept blazing through it all. The access speed was incredibly fast, almost like cache, some said.
I'd like to see Nintendo pursue a similar solution again: embedded memory on the GPU for the framebuffer and other stuff, a sizeable, super-fast system memory for graphical/low-latency tasks, and an additional large pool of slower memory to handle slower/higher-latency stuff (streaming, sound, etc.). This strategy allowed them to create affordable yet very high-performance hardware. I know they were worried about the cost of such exotic memory going forward, but I'm sure it could be cheaper now, and they could probably get by with less total memory than current consoles (e.g. 64MB embedded, 1GB fast MEM1?, 4GB slow MEM2).
It's not like those 8GB of the current gen significantly boosted their performance; if anything, in theory it's a bottleneck, since the disc takes a long time to fill even a fraction of that. Apple's iPhone gets by with a seemingly meager amount of system memory compared to Android devices, because what it has is tuned for performance rather than brute force. I'm not sure why Nintendo abandoned GameCube's exotic memory solution when it produced such great results for a decent cost, and with a lot less silicon. I'm as curious as you about whether such a design is feasible this time around.
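Just to put a rough number on that disc point (the read speed here is my own ballpark for a 6x Blu-ray drive, not an official figure):

```python
# Back-of-envelope: time for an optical drive to fill a large RAM pool.
# ~27 MB/s is an assumed peak for a 6x Blu-ray drive; sustained reads are lower,
# which only makes the point stronger.
DISC_READ_MB_S = 27
RAM_GB = 8

seconds = RAM_GB * 1024 / DISC_READ_MB_S
print(f"Filling {RAM_GB}GB at {DISC_READ_MB_S} MB/s takes ~{seconds / 60:.0f} minutes")
# -> roughly 5 minutes, hence all the streaming and mandatory installs this gen
```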
Wii U is very much a continuation of the GameCube philosophy when it comes to memory. There's a very small, extremely fast 3MB eDRAM pool, a quite small, very fast 32MB eDRAM pool, and then a large and slow 2GB DDR3 pool. The fact that they got away with a mere 12 GB/s of bandwidth for their largest pool shows how effectively they could utilise that 32MB in particular.
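A quick back-of-envelope (my own numbers, assuming plain RGBA8 colour and a 32-bit depth buffer) shows why 32MB is plenty at Wii U's typical 720p output:

```python
# Rough framebuffer budget at 720p: double-buffered RGBA8 colour + 32-bit depth.
width, height = 1280, 720
bytes_colour, bytes_depth = 4, 4                  # per pixel

colour = width * height * bytes_colour / 2**20    # ~3.5 MB per colour buffer
depth = width * height * bytes_depth / 2**20      # ~3.5 MB

print(f"~{2 * colour + depth:.1f}MB of 32MB eDRAM")   # ~10.5MB, room to spare
# All the high-frequency read-modify-write traffic (blending, depth tests) stays
# on-die, leaving the 12 GB/s of DDR3 for textures and CPU data.
```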
The problem with continuing along that road with the NX is that they no longer have the options they had before. More specifically:
50W TDP peak, 35W for the SoC.
2 x 4-core Puma modules at 1.8-2.0GHz.
512:40:16 GCN 1.2 GPU at 800MHz (~800 GFLOPS); might be 10 CUs (~1 TFLOPS) if they clock the CPU lower.
32MB eDRAM at 256-512 GB/s.
8GB of DDR4 on a 128-bit bus (~50 GB/s).
This is the ABSOLUTE minimum that I think you can expect from NX, if Nintendo decides to cheap out again.
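For what it's worth, those GPU and memory figures do hang together arithmetically; here's my own back-of-envelope (assuming the usual 2 FLOPS per shader per clock for GCN, and DDR4-3200 as the memory speed):

```python
# GCN: 2 FLOPS per shader ALU per clock.
def gflops(shaders, mhz):
    return shaders * 2 * mhz / 1000

print(gflops(512, 800))      # 8 CUs  -> ~819 GFLOPS, matching the ~800 GFLOPS figure
print(gflops(640, 800))      # 10 CUs -> ~1024 GFLOPS, matching the ~1 TFLOPS option

# 128-bit bus = 16 bytes per transfer; DDR4-3200 is an assumption on my part.
print(3200e6 * 16 / 1e9)     # ~51 GB/s, matching the ~50 GB/s figure above
```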
The eDRAM is where the problem lies: nobody seems to be offering eDRAM on processes below 40nm (outside Intel and IBM). This means that, for a small, high-bandwidth pool of memory, their options are reduced to SRAM or HBM.
To illustrate why SRAM is unsuitable for a framebuffer at 28nm, just look at Xbox One and PS4. You've got two consoles with roughly similar costs released at the same time, but one chose a single pool of GDDR5 and the other split pools of SRAM and DDR3. MS hoped that using a small on-die pool of memory for the framebuffer, like Wii U or Xbox 360, would give them the best of both worlds, providing the GPU the bandwidth it needs while cheap DDR3 allowed them a large 8GB of main memory.
The results are now obvious. SRAM is big (it takes up far more die space than eDRAM) and therefore very expensive. They could only accommodate 32MB of it on the SoC, and even then, with a larger SoC than Sony's, there was a lot less room left for the GPU. So they ended up with an embedded pool that isn't large enough for a console targeting 1080p, and a GPU that's almost 30% less powerful than it otherwise would have been. Meanwhile, Sony upgraded PS4's memory to 8GB at the last minute, leaving MS without even an overall capacity advantage. Nintendo would have exactly the same problems if they tried to take the SRAM approach to split pools. There's no getting around the cost and die-area implications.
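To put a number on "not large enough for 1080p" (the render-target layout below is a generic deferred-rendering G-buffer of my own choosing, not any specific engine's):

```python
# A typical-ish deferred G-buffer at 1080p: four RGBA8 targets plus 32-bit depth.
width, height = 1920, 1080
mb = lambda bytes_per_pixel: width * height * bytes_per_pixel / 2**20

gbuffer = 4 * mb(4) + mb(4)
print(f"~{gbuffer:.0f}MB vs 32MB of on-die SRAM")   # ~40MB before the final colour buffer
# Hence the familiar pattern of Xbox One titles dropping to 900p or paging
# render targets in and out of the embedded pool.
```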
HBM is more of an unknown. It's obviously expensive, but on a per-MB basis much cheaper than SRAM. In theory a single 1GB stack of HBM1 would provide both the capacity (obviously) and the bandwidth necessary for a console competitive with PS4 when combined with some quantity of DDR3/4. That said, a large part of the cost of HBM is surely the packaging (similar to the reason Wii U's MCM is as expensive as it is). That packaging cost isn't any different from HBM1 to HBM2, and won't be all that much more for two or four stacks of memory than it would be for one. So, for all we know it may be 4GB or bust when it comes to using HBM.
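The bandwidth half of that claim is easy to sanity-check; the per-pin rates below are the published HBM1 and PS4 GDDR5 figures as I remember them, so treat them as approximate:

```python
# Peak bandwidth = bus width (bits) x data rate (Gbps per pin) / 8
hbm1_stack = 1024 * 1.0 / 8     # one HBM1 stack: 1024-bit @ 1 Gbps -> 128 GB/s
ps4_gddr5 = 256 * 5.5 / 8       # PS4: 256-bit GDDR5 @ 5.5 Gbps    -> 176 GB/s

print(hbm1_stack, ps4_gddr5)
# A single 1GB stack already lands in the same ballpark as PS4, with cheap
# DDR3/4 behind it supplying the rest of the capacity.
```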
I think the question comes down to how much total RAM Nintendo wants to go with. If it's 8GB or less, then I can't imagine an HBM+DDR3/4 approach being cheaper than GDDR5(X), or even LPDDR4, and either of the latter should provide enough bandwidth for a GCN 1.2 GPU. If they decide they need 12GB or more, then perhaps a small HBM pool plus a DDR3/4 pool might be the cheaper way to give themselves both the bandwidth and capacity they need.
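Rough numbers for those single-pool alternatives, with bus widths and data rates that are purely my assumptions of plausible configurations:

```python
# Peak bandwidth = bus width (bits) / 8 x data rate (Gbps per pin)
def gb_s(bus_bits, rate_gbps):
    return bus_bits / 8 * rate_gbps

print(gb_s(256, 7))     # GDDR5,  256-bit @ 7 Gbps    -> 224 GB/s
print(gb_s(256, 10))    # GDDR5X, 256-bit @ 10 Gbps   -> 320 GB/s
print(gb_s(128, 3.2))   # LPDDR4, 128-bit @ 3200 MT/s -> ~51 GB/s
print(gb_s(256, 3.2))   # LPDDR4, 256-bit @ 3200 MT/s -> ~102 GB/s
# The LPDDR4 options only look adequate if an on-die cache soaks up most of the
# framebuffer traffic, which ties back to the victim-cache idea below.
```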
An alternative, of course, is to replace the embedded memory pool with a large victim cache that acts as an L3 for both the CPU and GPU (as Apple does on many of its SoCs). It doesn't need to be large enough to hold the entire framebuffer to significantly reduce main-memory bandwidth requirements, but its effectiveness depends largely on how the GPU accesses the framebuffer (which depends both on the hardware and on the way programmers use it). The PowerVR GPUs in Apple's chips are designed specifically to conserve bandwidth, using tile-based rendering to maximise the efficiency of exactly that kind of cache. AMD's GCN is designed for desktop environments with high-bandwidth GDDR5, though, so it might take a bit of effort on the part of engine programmers to get good use out of any L3 cache.
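As a sketch of what "getting good use out of an L3" might look like from the software side (the cache size, tile maths and function names here are entirely illustrative):

```python
# Purely illustrative: size a tile so its colour + depth working set fits in a
# hypothetical shared L3 victim cache, then walk the frame tile by tile so that
# repeated per-pixel passes (blending, decals, post) hit cache-resident data
# instead of main memory.
L3_BYTES = 4 * 2**20              # assumed 4MB L3; the real size is anyone's guess
BYTES_PER_PIXEL = 4 + 4           # RGBA8 colour + 32-bit depth

tile = int((L3_BYTES // BYTES_PER_PIXEL) ** 0.5) // 64 * 64   # ~704x704 pixels

def shade_tile(x, y, w, h):
    pass                          # stand-in for the actual per-tile rendering work

def process_frame(width=1920, height=1080):
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            shade_tile(tx, ty, min(tile, width - tx), min(tile, height - ty))

process_frame()
print(f"{tile}x{tile} tiles, ~{tile * tile * BYTES_PER_PIXEL / 2**20:.1f}MB working set")
# PowerVR does this binning in hardware; on GCN the engine would have to chunk
# its passes this way itself, which is the "bit of effort" mentioned above.
```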