
WiiU technical discussion (serious discussions welcome)

nordique

Member
Is this the right thread for general questions? I'm at Target looking at the display and am deciding what to buy.

Deluxe system
MonHun
NSMBU
?????

I have the old classic controller and a Wiimote with plus, but do I need to buy anything else? Also, of what's currently available, any suggestions for what to fill the 3rd B2G1 game slot with?

depends what your tastes in gaming are

I haven't played through much of my Wii U library, but I did recently invest in NFS Most Wanted

Enjoyed Call of Duty BLOPS2 as well (perfect campaign length for the amount of free time I have, still haven't beat it though)

If you are a bigger gamer, then you could try something longer with a deeper gameplay experience, such as Mass Effect 3 (online is also a safe bet with BLOPS2, though I don't play online myself)
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
A discussion on b3d provoked me to run the matmul SIMD test on a few more CPUs. I decided to post the results here, given it all originated from this thread, and particularly since some gaffers might find it useful in the big picture of CPU FLOPS and how well modern compilers can utilize those in a tight-loop situation. I have re-run the test and picked the best time for each participant out of multiple runs. (A rough sketch of the kind of kernel being timed follows the CPU notes below.)

update 6: added dual-threaded results for ppe and bobcat
update 5: added a ppe
update 4: added a zambezi, courtesy of Rolf NB
update 3: added a sandy bridge, courtesy of Lightchris
update 2: added a dothan; added xmm intrinsics version for bobcat and nehalem
update 1: added a nehalem

So, participating CPUs:
  • IBM PowerPC 750CL Broadway 729MHz
  • AMD Brazos Bobcat 1333MHz
  • Freescale Cortex-A8 IMX535 1000MHz (Freescale's ARMv7-based SoC)
  • Freescale PowerPC MPC7447A 1250MHz (from the days of Motorola Semiconductor)
  • Intel Bloomfield Xeon W3565 3200MHz (a Nehalem through and through)
  • Intel Dothan Celeron M 630MHz
  • Intel Sandy Bridge Core i5 2500K 4000MHz
  • AMD Bulldozer Zambezi FX4100 3600MHz
  • IBM Cell PPE 3200MHz

Except for the 7447, PPE, Bloomfield, Sandy Bridge and Bulldozer, all participants are 2-way SIMD designs (despite what their individual ISAs might claim); 7447, PPE, Bloomfield and Bulldozer are 4-way; Sandy Bridge is 8-way, but in this test it's used as 4-way. Notes of interest:
  • 7447's altivec block is in-order, despite the rest of the CPU being out-of-order, and features a single SIMD MADD unit.
  • Bloomfield's FP/SSE block is out-of-order, just as the rest of the CPU, and features one SIMD MUL unit and two SIMD ADD units.
  • PPE is in-order through and through. It does SMT, though.
  • Bobcat is 1333MHz with Turbo Boost, i.e. only one core reaches that clock under full load, and only while the other core is lightly loaded. Two cores under full load max out at 1000MHz each.
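
For readers who want a concrete picture of what is being timed: the actual test-case code and build script are linked further down, so the following is only a minimal sketch under assumed details (4x4 single-precision matrices, row-major layout, a repeat count chosen by the caller), not the benchmark itself.

/* Minimal sketch only - the real test-case code is linked further down.
   Assumed details: 4x4 single-precision matrices, row-major layout.
   With -O3 and the vectorizer enabled, GCC can map the inner loops of a
   kernel like this onto the SIMD units discussed above. */
#include <stddef.h>

typedef struct { float m[4][4]; } mat4;

static void matmul(mat4 *restrict r, const mat4 *restrict a, const mat4 *restrict b)
{
    for (size_t i = 0; i < 4; ++i)
        for (size_t j = 0; j < 4; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < 4; ++k)
                acc += a->m[i][k] * b->m[k][j];
            r->m[i][j] = acc;
        }
}

void bench(mat4 *acc, const mat4 *b, long iterations)
{
    mat4 tmp;
    for (long n = 0; n < iterations; ++n) {
        matmul(&tmp, acc, b);  /* chain results so the loop can't be optimised away */
        *acc = tmp;
    }
}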

Absolute times (in seconds; best out of a couple dozens runs on each CPU):
  • 750cl: 6.09968 (paired singles; autovectorized)
  • bobcat: 4.44868 (sse3; autovectorized)
  • bobcat: 4.21585 (sse3; manual intrinsics)
  • bobcat: 5.76252 (sse3; manual intrinsics; twice the workload, spread evenly across two threads)
  • cortex-a8: 7.05067 (neon; autovectorized)
  • 7447: 1.59139 (altivec; manual intrinsics)
  • nehalem: 0.556405 (sse3; autovectorized)
  • nehalem: 0.694955 (sse3; manual intrinsics)
  • dothan: 9.09823 (sse2; autovectorized)
  • sandy bridge: 0.386927 (sse3; autovectorized)
  • sandy bridge: 0.500083 (sse3; manual intrinsics)
  • sandy bridge: 0.368508 (avx128; autovectorized)
  • bulldozer: 0.488872 (avx128; autovectorized)
  • ppe: 1.33314 (altivec; manual intrinsics)
  • ppe: 2.0651 (altivec; manual intrinsics; twice the workload, spread evenly across two threads)

Normalized per-clock (in clocks * 10^6; absolute_time * clock_in_mhz):
  • 750cl: 6.09968 * 729 = 4446.66672
  • bobcat: 4.21585 * 1333 = 5619.72805
  • cortex-a8: 7.05067 * 1000 = 7050.67
  • 7447: 1.59139 * 1250 = 1989.2375
  • nehalem: 0.556405 * 3200 = 1780.496
  • dothan: 9.09823 * 630 = 5731.8849
  • sandy bridge: 0.368508 * 4000 = 1474.032
  • bulldozer: 0.488872 * 3600 = 1759.9392
  • ppe: 1.33314 * 3200 = 4266.048
  • ppe: 2.0651 / 2 * 3200 = 3304.16 (SMT version)

GCC 4.6.3 was used in all tests (though not the same revision in all cases), sometimes cross-platform but most often self-hosted on debian. Essentially, builds were always made and run on debian, just not the same debians in the cross-build cases.

Points of note re compilers:
  • That ~3% speedup in the bobcat autovectorized case vs the old measurements (using GCC 4.6.1) is thanks to the newer compiler's scheduling support for AMD family 20 (bobcat), via -mtune=btver1. Fun fact: the even newer GCC 4.7.2 suffers a regression there - it produces ~25% slower test results when using the same bobcat tuning options. Update: manual emitting produces better results by ~5%.
  • Altivec's case (7447) is not using the same auto-vectorization approach as the rest of the CPUs due to a severe deficiency in the autovectorizer's support for altivec (more on that in comments in the code). So I had to revert to a more primitive vectorization approach, where one manually emits vector ops via intrinsics (a generic sketch of this style follows this list). As one can see in the code, though, the extra developer effort is minimal, if any at all. Yet, there's code growth.
  • The A8 is another instance where the autovectorizer does a less-than-stellar job - despite NEON's support for vector-by-scalar multiplication, the autovectorizer insists on using shuffles to get the proper vector layout for the multiplications. I've been unable to mend that, so keep that in mind. Still, even with this penalty, the new 4.6.3 compiler does improve the cortex times notably.
  • Autovectorization gives better results on Nehalem than manual emitting (by ~25%). The main difference in the sse code generated by autovectorization and manual emitting, respectively, is that in the former case single vector elements are read in from mem (read: L1) and then splat into full vectors in the same reg via a single shuffle op, whereas in the latter case an entire vector is read in from mem, and then splat into separate regs, at the cost of two ops (introducing an extra dependency) per splat. My guess is that Xeon's L1 has excellent latencies, which make the per-element loads a clear win over extra-dependent two-op splats. Just a curious fact.
  • Except for the avx autovectorized version, gcc 4.6.3 produces Sandy Bridge code identical with Nehalem's. Thus the non-avx Sandy Bridge tests were done with the Nehalem code.
  • Bulldozer has an ISA extension particularly relevant to this test (FMA4), but I have been unable to make the autovectorizer utilize that extension (-march=bdver1 -mtune=bdver1 -mavx -mfma4 -ffp-contract=fast does not seem to do the trick).
  • PPE uses the vanilla altivec manual-intrinsics code. To account for PPE's in-orderness, code was built with the tuning option for power6 (-mtune=power6). Update: I originally underestimated the latency of some of the ops in that code; while the compiler offsets the madd readouts to about 4 clocks on average, that is largely insufficient for PPE's madds, which turn out to boast 12-clock latency. Enter SMT.
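
To make the "manual intrinsics" wording above concrete, here is a generic sketch of that style for one result row, written against SSE purely as an example ISA. It is not the linked test code, and the altivec version uses the corresponding vec_madd/vec_splat ops instead. It also shows the two splat strategies contrasted in the nehalem note: per-element scalar load plus broadcast versus one full-row load plus per-lane shuffles.

/* Generic sketch of the manual-intrinsics style (not the linked test code).
   SSE is used purely as the example ISA. Computes one row of r = a * b for
   4x4 row-major float matrices. */
#include <xmmintrin.h>  /* SSE */

static void matmul_row_sse(float *restrict out_row,      /* 4 floats */
                           const float *restrict a_row,  /* 4 floats */
                           const float *restrict b)      /* 16 floats, row-major */
{
    __m128 b0 = _mm_loadu_ps(b + 0);
    __m128 b1 = _mm_loadu_ps(b + 4);
    __m128 b2 = _mm_loadu_ps(b + 8);
    __m128 b3 = _mm_loadu_ps(b + 12);

    /* Splat strategy 1: read one scalar and broadcast it - roughly the
       per-element load + single shuffle the autovectorizer produced on nehalem. */
    __m128 r = _mm_mul_ps(_mm_set1_ps(a_row[0]), b0);
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a_row[1]), b1));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a_row[2]), b2));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a_row[3]), b3));

    /* Splat strategy 2 (alternative): load the whole row once, then shuffle
       each lane out of it - two dependent ops per splat, as described in the
       nehalem note:
         __m128 a  = _mm_loadu_ps(a_row);
         __m128 a0 = _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 0, 0, 0));
         ...and so on for lanes 1..3. */

    _mm_storeu_ps(out_row, r);
}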

And here are the test-case code and the build-script; note that the latter is used as is only for the self-hosted cases; the script is modified accordingly for the cross-compiled cases. For instance, the btver1 tuning is not there as it was done in a cross-build.

Assembly listings:
  • 750cl (autovectorized; matmul loop at L103)
  • bobcat (autovectorized; matmul loop at L102)
  • bobcat (manual intrinsics; matmul loop at L102)
  • cortex-a8 (autovectorized; matmul loop at L116)
  • 7447 (manual intrinsics; matmul loop at L99)
  • nehalem (autovectorized; matmul loop at L102)
  • nehalem (manual intrinsics; matmul loop at L102)
  • dothan (autovectorized; matmul loop at L119)
  • sandy bridge (autovectorized avx128; matmul loop at L102)
  • bulldozer (autovectorized avx128; matmul loop at L102)
  • ppe (manual intrinsics; matmul loop at L102)
 
The Lego game has been released. I wonder why the loading times are so slow (both the disc and hard-disk versions). Poor coding? Or does it really require that much time to load? If the latter, can we expect the PS4 and 720 to be even worse (unless they use part of the RAM as a cache)?
Load times in PS4/Durango would be a lot better for the simple fact that a hard drive is included and the games will have the option to install.

Moreover, the latest VGleaks rumor says Durango requires mandatory installs for games (even those bought on disc) and that there will be a way to play seamlessly while the game installs. As pointed out by some people at the Wii U unveiling, Nintendo did a poor job handling storage functionality and giving the user better interface options for mass storage devices.
 
Load times in PS4/Durango would be a lot better for the simple fact that a hard drive is included and the games will have the option to install.

Moreover, the VGleaks latest rumor says Durango requires mandatory installs and that there will be a way to play while the game installs. So Nintendo did a poor job handling storage and giving the user better interface options for the mass storage devices.

Well there's nothing stopping Wii u devs having an install option
 
Well there's nothing stopping Wii u devs having an install option
This point has already been answered quite enough. Your mass storage device in the Wii U is limited by the USB 2.0 interface. It is indeed somewhat faster than the Blu ray drive but still quite slow in relative terms. Even a decent mechanical drive throughput surpasses what the USB 2.0 can handle. So Nintendo's decision leaves the end user with no alternatives for improved load times.

Ah but who im kidding... some people will try to turn this around no matter how transparent this is explained to them. Carry on...
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
Would be interesting maybe to list the theoretical floating point performance of each CPU tested.
Indeed. Essentially, all participants can do 2 FLOPs per clock per vector lane. All but the 7447 are 2-way SIMD (despite what their individual ISAs might claim); 7447 is the only true 4-way specimen.
 
This point has already been answered quite enough. Your mass storage device in the Wii U is limited by the USB 2.0 interface. It is indeed somewhat faster than the Blu ray drive but still quite slow in relative terms. Even a decent mechanical drive throughput surpasses what the USB 2.0 can handle. So Nintendo's decision leaves the end user with no alternatives for improved load times.

If it's faster than the drive then it will improve load times... I'm not sure how that isn't the case. USB 2.0 has speeds of up to 35 MB/s. The disc drive reads at up to 22.5 MB/s. That's a significant increase, over 50% faster. I think people are greatly exaggerating how slow USB 2.0 is.
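
Purely as a back-of-the-envelope check of the figures quoted above (22.5 MB/s vs 35 MB/s; the 1 GB payload is just a hypothetical round number, not a measured game):

/* Back-of-the-envelope check of the throughput figures quoted above.
   The 1 GB payload is a hypothetical round number, not a measured game. */
#include <stdio.h>

int main(void)
{
    const double disc_mbs   = 22.5;   /* quoted Wii U disc read speed, MB/s */
    const double usb2_mbs   = 35.0;   /* quoted practical USB 2.0 throughput, MB/s */
    const double payload_mb = 1024.0; /* hypothetical 1 GB of level data */

    printf("disc: %.1f s  usb2: %.1f s  speedup: %.0f%%\n",
           payload_mb / disc_mbs,               /* ~45.5 s */
           payload_mb / usb2_mbs,               /* ~29.3 s */
           (usb2_mbs / disc_mbs - 1.0) * 100);  /* ~56% faster */
    return 0;
}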
 

BaBaRaRa

Member
Indeed. Essentially, all participants can do 2 FLOPs per clock per vector lane. All but the 7447 are 2-way SIMD (despite what their individual ISAs might claim); 7447 is the only true 4-way specimen.

Excuse the ignorance here, but how 'compiler bound' are these results? I remember seeing some comparisons between GCC and intels own compiler on ix86 and the difference was vast.

So could it be that when using the native toolchain developers will squeeze more out of these architectures? Or will that only affect more general purpose code and not specific code like this?

Disclaimer: I never even run 'make test'
 

BaBaRaRa

Member
Load times in PS4/Durango would be a lot better for the simple fact that a hard drive is included and the games will have the option to install.

How much faster are the ps4/Durango hard drives, and how much more memory do they have to fill? Genuine question.

The comparison to the WiiU only stands if they are loading WiiU-level graphics, that is, about 1GB of data. As they are loading four, five, maybe six times as much data, their transfer rate needs to be a lot faster than that for any noticeable difference.

I have a hunch this will be the case, but I'm interested if anyone knows the raw data.
 

deviljho

Member
This point has already been answered quite enough. Your mass storage device in the Wii U is limited by the USB 2.0 interface. It is indeed somewhat faster than the Blu ray drive but still quite slow in relative terms. Even a decent mechanical drive throughput surpasses what the USB 2.0 can handle. So Nintendo's decision leaves the end user with no alternatives for improved load times.

Ah but who im kidding... some people will try to turn this around no matter how transparent this is explained to them. Carry on...

I think if it's been answered "quite enough," then there is no need to harp on it or reply to everyone who brings it up. If you look at "Nintendo's decision" simply as not choosing a better tech spec, then this argument will always exist and be valid. At the end of the day, there will be people who tolerate and overlook this "problem" to play games that they enjoy, and there will be others who really don't like it. Out of the latter, some will suck it up and play the game for the enjoyment, anyway. If it's so obvious that Nintendo should have used an internal hard-drive or USB 3.0 and has been brought up since the unveiling, as you say, then there's no point bringing it up anymore.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
Excuse the ignorance here, but how 'compiler bound' are these results? I remember seeing some comparisons between GCC and intels own compiler on ix86 and the difference was vast.

So could it be that when using the native toolchain developers will squeeze more out of these architectures? Or will that only affect more general purpose code and not specific code like this?

Disclaimer: I never even run 'make test'
Well, anything not written in assembly is fairly compiler-bound. That's why *the* most important part of this test has been trying to find the best publicly-available tools for the job; the results quoted are the best results among a few (at the very least two, but sometimes four or more) compilers that I've tried with each cpu. It's just that GCC 4.6.3 tends to perform the best across the board (read: all other gcc's and llvm's I've tried have produced no better results than 4.6.3 on this test). Also I've checked each compiler's output to see nothing 'apparently stupid' was done like excessive register spills, obvious optimisation benefits missed, etc. That's how many of the compiler candidates were rejected.

Now, re native toolchains - that's a bit of a fuzzy matter. First of all (and to the best of my knowledge), none of the console vendors this gen use a compiler made by the CPU vendor. Secondly, GCC and LLVM can be considered fairly 'native' by virtue of being popular open-source compilers - on many occasions the actual architecture backend for a CPU is maintained by the vendor itself. Business-wise, none of the CPU vendors have an interest in the most popular compilers mistreating their products. Of course, that's how things sit in theory; in practice there are always delays, regressions and what not, but that's why anyone who tries to run such tests should do their homework. Or write in assembly, which I've deliberately avoided for the following simple reason.

Writing in assembly is a diminishing-returns endeavor - the more mature a compiler/architecture combo is, the lower the chances that hand-written assembly can significantly outperform (or outperform at all) the compiler's output. It takes as little as designing your control path in such a way as to allow the compiler to inline your critical code and intermix it with non-critical code, and a 'less-than-stellar' compiler might already be capable of producing overall better results than what an experienced dev would achieve by carefully tuning their critical routine in assembly. Now, you could say that the actual case of this test, namely an isolated matmul loop, is a good candidate for hand-optimisation in assembly, and you'd be right. In this regard the test puts the compiler in a 'tight spot' where it really has to do its best to stay competitive with an experienced assembly developer. But it is still a good measure of what the architectures can achieve with a reasonable effort on the part of the developer. Last but not least, tuning assembly code by hand is a time-consuming business (more so than writing in high-level languages), and sometimes devs don't get the luxury to even attempt such an optimisation, let alone succeed at it.
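
A small hedged illustration of the inlining point above (the names are made up, and this is not from the test): keeping the hot routine visible to the compiler, e.g. as a static inline in the same translation unit, lets it be folded into the caller and scheduled together with the surrounding non-critical code - something a separately assembled, hand-written routine cannot benefit from.

/* Illustration of the inlining point above; names are made up. */
static inline float dot4(const float a[4], const float b[4])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

float accumulate_rows(const float *a, const float *b, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += dot4(a + 4*i, b + 4*i);  /* inlined and scheduled with the loop body */
    return sum;
}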
 

AzaK

Member
A discussion on b3d provoked me to run the matmul SIMD test on a few more CPUs. I decided to post the results here, given it all originated from this thread, and particularly since some gaffers might find it useful in the big picture of CPU FLOPS and how well modern compilers can utilize those in a tight-loop situation. I have re-run the test and picked the best time for each participant out of multiple runs.

So, participating CPUs:
  • IBM PowerPC 750CL Broadway 729MHz
  • AMD Brazos Bobcat 1333MHz
  • Freescale Cortex-A8 IMX535 1000MHz (Freescale's ARMv7-based SoC)
  • Freescale PowerPC MPC7447A 1250MHz (from the days of Motorola Semiconductor)

Absolute times (in seconds; best out of a couple dozens runs on each CPU):
  • 750cl: 6.09968 (paired singles)
  • bobcat: 4.44868 (sse3)
  • imx53: 7.05076 (neon)
  • 7447: 1.59139 (altivec)

Normalized per-clock (in clocks * 10^6; absolute_time * clock_in_mhz):
  • 750cl: 6.09968 * 729 = 4446.66672
  • bobcat: 4.44868 * 1333 = 5930.09044
  • imx53: 7.05076 * 1000 = 7050.76
  • 7447: 1.59139 * 1250 = 1989.2375

GCC 4.6.3 was used in all tests (though not the same revision in all cases), sometimes cross-platform but most often self-hosted on debian. Essentially, builds were always made and run on debian, just not the same debians in the cross-build cases.

Points of note re compilers:
  • That ~3% speedup in the bobcat timing vs the old measures is thanks to the newer compiler's scheduling support for AMD family 20 (bobcat), via -mtune=btver1. Fun fact: the even newer GCC 4.7.2 suffers a regression there - it produces ~25% slower test results when using the same bobcat tuning options.
  • Altivec's case (7447) is not using the same auto-vectorization approach as the rest of the CPUs due to a severe deficiency in the autovectorizer's support for altivec (more on that in comments in the code). So I had to revert to a more primitive vectorization approach, where one manually emits vector-ops via intrinsics. As one can see in the code, though, the extra developer effort is minimal, if any at all. Yet, there's code growth.
  • The A8 is another instance where the autovectorizer does a less-than-stellar job - despite NEON's support for vector-by-scalar multiplication, the autovectorizer insists on using shuffles to get the proper vector layout for the multiplications. I've been unable to mend that, so keep that in mind. Still, even with this penalty, the new 4.6.3 compiler does improve the cortex times notably.

And here are the test-case code and the build-script; note that the latter is used as is only for the self-hosted cases; the script is modified accordingly for the cross-compiled cases. For instance, the btver1 tuning is not there as it was done in a cross-build.

As usual, assembly listings upon request ;p

Awesome blu, I wouldn't mind asm if you have it.
 

joesiv

Member
Well there's nothing stopping Wii u devs having an install option
Do you know this? Just curious, because this is something that could easily be defined within Nintendo's Lot Checks, or even limited by the SDK setup. Locked-down consoles *have* to control the data path, to avoid things such as piracy. Xbox 360 and PS3 both had provisions made by Microsoft and Sony for game installs; the PS3 even has an allocation set up for developers to use as a cache in addition to the "install".
 
Do you know this? Just curious, because this is something that could easily be defined within Nintendo's Lot Checks, or even limited due to the SDK setup. Locked down consoles *have* to control the data path, to avoid things such as piracy. Xbox 360 and PS3 both had provisions made by Sony and Microsoft for game installs, PS3 even has an allocation setup for developers to use as a cache in addition to the "install".

Wii U games already store some data on the system memory or hard drive (saves, DLC and patches); I can't see why they wouldn't be able to install some of the disc data.
 

mrklaw

MrArseFace
How much faster are the ps4/Durango hard drives, and how much more memory do they have to fill? Genuine question.

The comparison to the WiiU only stands if they are loading WiiU level graphics; that is, 1G of data. As they are loading four, five, maybe six times as much data then their transfer rate needs to be a lot faster than that for any noticeable difference.

I have a hunch this will be the case, but I'm interested if anyone knows the raw data.


Assuming intelligent installation and ordering of data to avoid lots of random access (which will slow things down massively), you could probably get 100-150MB/s

To fill the entire 8GB on a PS4 or Durango you'd be looking at around 60-90 seconds.
 

wsippel

Banned
Assuming intelligent installation and ordering of data to avoid lots of random access (which will slow things down massively), you could probably get 100-150MB/s

To fill the entire 8GB on a PS4 or Durango you'd be looking at around 60-90 seconds.
I would expect Sony and Microsoft to use 5,400 RPM 2.5" drives, so more like 80 - 100MB/s sequential, 35 - 45MB/s random.
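
Rough arithmetic on the figures quoted in the last two posts (8 GB filled at the quoted sequential rates; nothing here accounts for seeks, decompression or CPU-side work):

/* Rough fill-time arithmetic for the sequential rates quoted above.
   Ignores seeks, decompression and any CPU-side work. */
#include <stdio.h>

int main(void)
{
    const double ram_mb = 8.0 * 1024.0;                 /* 8 GB to fill */
    const double rates_mbs[] = { 80.0, 100.0, 150.0 };  /* quoted estimates, MB/s */

    for (int i = 0; i < 3; ++i)
        printf("%.0f MB/s -> %.0f s\n", rates_mbs[i], ram_mb / rates_mbs[i]);
    /* 80 MB/s -> ~102 s, 100 MB/s -> ~82 s, 150 MB/s -> ~55 s */
    return 0;
}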
 
A general question for the techies... Is there any obvious hardware limitation in the Wii U that is preventing some games from featuring anti-aliasing? It seems to be lacking in some key games (Lego City Undercover, some of the uglier Pikmin 3 screenshots are two examples that come to mind). If the Wii U does have the power to render those graphics with AA why wouldn't developers take advantage of it? And if it doesn't, what is that attributable to? Is it likely to remain a problem for Wii U games, or is it the sort of thing that developers will be able to add when they are more comfortable with the hardware?
 
A general question for the techies... Is there any obvious hardware limitation in the Wii U that is preventing some games from featuring anti-aliasing? It seems to be lacking in some key games (Lego City Undercover, some of the uglier Pikmin 3 screenshots are two examples that come to mind). If the Wii U does have the power to render those graphics with AA why wouldn't developers take advantage of it? And if it doesn't, what is that attributable to? Is it likely to remain a problem for Wii U games, or is it the sort of thing that developers will be able to add when they are more comfortable with the hardware?

Well, Deus Ex apparently has AA that the PS3/360 versions don't, but we shall see.
 
Assuming intelligent installation and ordering of data to avoid lots of random access (which will slow things down massively), you could probably get 100-150MB/s

To fill the entire 8GB on a PS4 or Durango you'd be looking at around 60-90 seconds.

Are you assuming 7200rpm drives?! Because neither the PS3 nor the 360 had them, and I don't expect them to be in the PS4/Durango either.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
Awesome blu, I wouldn't mind asm if you have it.
Thank you. I've updated the original post with urls to the assembly listings, and I've also thrown in a Nehalem as a bonus.
 

joesiv

Member
Wii U games already store some data on the system memory or hard drive (saves, DLC and patches); I can't see why they wouldn't be able to install some of the disc data.
I just don't think we can assume it's entirely up to the developer to support installing. I've worked closely with all three major first parties and am familiar with all three's technical requirements, and things like these aren't something that the developer can just "do" and expect to pass first-party certification. If the first party (as I said, Sony/MS) supports said feature, then indeed it's up to the developer. MS and Sony both have detailed lists of features your game supports, including things like HDD cache and installation; Nintendo has never had such "features" as game installation, or even local caching.

Though I haven't seen the development documentation for the Wii U, it's possible that it's included; I'm just reluctant to assume it's there. I'd love it if it was an option for developers though, and from Nintendo's standpoint I could see why they'd want to include it (and maybe they will in the future), since from the N64 and even GameCube days they've wanted to keep load times short.
 
I just don't think we can assume it's entirely up to the developer to support installing. I've worked closely with all three major first parties and am familiar with all three's technical requirements, and things like these aren't something that the developer can just "do", and expect to pass first party certification. If the first party (as I said Sony/MS) support said feature, then indeed it's up to the developer. MS and Sony both have detailed lists of features your game supports, including things like HDD Cache, and Installation, Nintendo has never had such "features" such as game installation, or even local caching.
I've heard that the Wii had 64 MB of ROM reserved/usable for caching on the NAND; it wasn't used by most developers (if any) though due to them not knowing for sure if it was a permanent feature or something that could decrease over time.
 

M3d10n

Member
This point has already been answered quite enough. Your mass storage device in the Wii U is limited by the USB 2.0 interface. It is indeed somewhat faster than the Blu ray drive but still quite slow in relative terms. Even a decent mechanical drive throughput surpasses what the USB 2.0 can handle. So Nintendo's decision leaves the end user with no alternatives for improved load times.

Ah but who im kidding... some people will try to turn this around no matter how transparent this is explained to them. Carry on...

Did you ever notice that copying a folder full of small files in your computer takes much longer than copying a single big file? Loading in most games is not a simple "read this huge, continuous file from disk to RAM". Most games have to read data from several different files in order to load a level. Reading separate files is always slower than reading a single continuous file due to seek time. HDDs very rarely max their throughput outside from copying huge files.

The main advantage an SSD has over standard HDDs is seek time, which is much lower since there's no need to physically move a read head around to jump to a different drive sector. SSDs excel at reading/writing lots of small files, which is why they have such a drastic impact on computer boot times and software launch times.

There was a thread where someone compared the loading time of some Wii U game from disc, USB SSD, USB HDD and internal memory. The USB SSD was the fastest by far, beating even the internal flash. If USB 2.0 were the bottleneck for the game in question, SSD or mechanical wouldn't make a difference.

Now, in the case of a game where the developers somehow pack all the data necessary for a level into a single monolithic file that is simply copied as-is into RAM, SATA would bring a big advantage over USB 2.0. Such a technique duplicates tons of data, however.
 

joesiv

Member
Did you ever notice that copying a folder full of small files in your computer takes much longer than copying a single big file? Loading in most games is not a simple "read this huge, continuous file from disk to RAM". Most games have to read data from several different files in order to load a level. Reading separate files is always slower than reading a single continuous file due to seek time. HDDs very rarely max their throughput outside from copying huge files.
Just to play devil's advocate here, most games will package up their content into a few larger containers. In most cases all art will be in one big file, audio in another, scripts in another, etc. It's part of the build pipeline, and was probably implemented to assist in this type of situation (along with some semblance of protection from unwanted eyes).
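
For readers who haven't seen one, a hypothetical sketch of such a container (the field names and layout are made up, not any real engine's or Nintendo's format): assets are concatenated at build time with a directory of offsets, so loading a level becomes one open plus a few large sequential reads instead of many small per-file opens and seeks.

/* Hypothetical pack-file sketch - layout and names are made up, not any real
   engine's format. Assets are concatenated at build time; a directory of
   offsets lets a level load be one seek plus one large contiguous read per
   asset, instead of many small per-file opens and seeks. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char     name[56];   /* asset path, zero-padded */
    uint64_t offset;     /* byte offset of the asset within the pack */
    uint64_t size;       /* asset size in bytes */
} pack_entry;

typedef struct {
    uint32_t   count;     /* number of directory entries */
    pack_entry entries[]; /* directory written once by the build pipeline */
} pack_dir;

/* Read one asset with a single seek and a single contiguous read. */
static size_t pack_read(FILE *pack, const pack_dir *dir,
                        const char *name, void *dst, size_t cap)
{
    for (uint32_t i = 0; i < dir->count; ++i) {
        const pack_entry *e = &dir->entries[i];
        if (strncmp(e->name, name, sizeof e->name) == 0 && e->size <= cap) {
            fseek(pack, (long)e->offset, SEEK_SET);
            return fread(dst, 1, (size_t)e->size, pack);
        }
    }
    return 0;  /* not found or too large for the destination buffer */
}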
 

joesiv

Member
I've heard that the Wii had 64 MB of ROM reserved/usable for caching on the NAND; it wasn't used by most developers (if any) though due to them not knowing for sure if it was a permanent feature or something that could decrease over time.
That's possible, I don't recollect. Hopefully they made provisions on the Wii U. Though a single chip of NAND isn't incredibly fast either; anyone know of any benchmarks for the K9K8G08U1D (or similar) chip the Wii U has?
 
Thank you. I've updated the original post with urls to the assembly listings, and I've also thrown in a Nehalem as a bonus.

That Nehalem is just a monster for its age (circa 2009). Imagine if the test was multi threaded to use all of the Xeon's cores.

I almost wish you could add Sandy Bridge and Ivy Bridge based samples to see how much of an improvement those two are over the Nehalem in this test.

Also, couldn't you have taken advantage of the Nehalem's SSSE3/SSE4.1/4.2 extensions in your test? Was it necessary to use just SSE3?
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
That Nehalem is just a monster for its age (circa 2009). Imagine if the test was multi threaded to use all of the Xeon's cores.

I almost wish you could add Sandy Bridge and Ivy Bridge based samples to see how much of an improvement those two are over the Nehalem in this test.
Nehalem is quite potent, indeed. And I'm not done adding CPUs yet ; ) As re the test - it tests the SIMD of a single core, as that's what it was meant to do.

Also, couldn't you have taken advantage of the Nehalem's SSSE3/SSE4.1/4.2 extensions in your test? Was it necessary to use just SSE3?
SSSE3 and SSE4.2 have no relevance to this test. SSE4.1 does have something to offer there (the dpps op), but it would require a rewrite that would make the test far less portable, while not necessarily improving performance on those CPUs that support it - it would not help the nehalem; it could help the bobcat with its high-latency shuffle, but bobcat has no SSE4.1.
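
For reference, a small hedged sketch of the dpps op mentioned above - one four-lane dot product per instruction - shown in isolation rather than as part of the linked test (which sticks to SSE3):

/* Sketch of SSE4.1's dpps: one 4-lane dot product per instruction.
   Shown in isolation; not part of the linked test, which sticks to SSE3. */
#include <smmintrin.h>  /* SSE4.1 */

static float dot4_dpps(const float a[4], const float b[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    /* imm8 0xF1: multiply all four lanes, write the sum into lane 0 only */
    __m128 d  = _mm_dp_ps(va, vb, 0xF1);
    return _mm_cvtss_f32(d);
}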
 
eSOL's FAT File System is selected for Wii U

Tokyo, Japan. March 14, 2013 - eSOL, a leading developer of real-time embedded software solutions, announced today that eSOL's PrFILE2 FAT file system is used in Nintendo's Wii U (TM) game console. PrFILE2 provides file operation functions, including data reading and writing for the SD memory card attached to the Wii U.

The PrFILE2 FAT file system is part of eSOL's eCROS real-time OS-based software platform. eCROS consists of the eT-Kernel real-time OS, the eBinder integrated development environment, middleware software including PrFILE2, USB stacks, network protocols and GUI, and professional support services. PrFILE2 offers many functions for digital home appliances including a high-speed backward-seeking file pointer function and the ability to minimize data loss even when power is lost or media are unexpectedly ejected. Moreover, PrFILE2 supports the multi-language, dynamic character code conversion and UNICODE character modes necessary for use in products marketed overseas.

eSOL also offers the newly developed PrFILE2 FAT Safe as an option with applications using PrFILE2. PrFILE2 FAT Safe does not just minimize data destruction, but actually prevents data loss in the event of sudden power outages or an unexpected media ejection.

eCROS, a proven real-time OS-based software platform that ensures high reliability, has been widely used in many embedded products including the Wii (TM) (the predecessor of the Wii U), automotive information systems, digital home appliances, and office automation equipment.
"We are honored to be selected once again by Nintendo." said Hiroaki Kamikura, General Manager of the Embedded Products Division, eSOL. "I hope Nintendo likes our professional services and our ability to implement PrFILE2 effectively as much as they like PrFILE2 itself, The PrFILE series has been adopted in many digital consumer products and eSOL will continue its strong commitment to fully support embedded software application developers."

Press release dated March 14th but search returned nothing.
 

FyreWulff

Member
So what exactly does that mean? On another note, a quick development question: is it possible for developers to add off-TV gameplay in a patch? Like, just for example, could TT Games add off-TV gameplay to Lego City if they wanted, through a patch or something like that?

Monster Hunter Ultimate added off-TV in a patch
 

ikioi

Banned
This post may come across as a huge whinge, but I'd really like to bounce my thoughts of you guys and have some constructive feedback.

Iwata states Nintendo are selling the Wii U at a loss. I can only assume this price includes R&D, fabrication setup costs, manufacturing costs, shipping, taxes, marketing, the whole kit. CNN, Chipworks, and various others have done estimations, and overwhelmingly the finding is the hardware doesn't account for a significant portion of the RRP Nintendo have set for the console. Personally, I really can't see what possibly could have been expensive about it, everything looks pretty low end and cheap.

I understand the CPU is based on the PPC 750 architecture, the same architecture as the GameCube and Wii. It differs from other 750s in that it's multi-core, clocked over 1GHz, and features 3 megabytes of eDRAM. But still, surely this CPU wouldn't have been costly to engineer? It's, after all, low-wattage, low-performance, low transistor count, built on a mature 40nm fab process, and based on an over-a-decade-old architecture.

GPU: again, more of the same. Based on quite old AMD architecture, low-end performance, built on what should be a mature 40nm fab process, and while customised, again not an overly complex design. Could the 32 megabytes of eDRAM really add significantly to the chip's manufacturing costs?

MCM? I got no clue about the complexities of these. I would have thought still not significant in the scheme of things.

The eMMC flash memory seems to be the low end stuff found in smart phones etc. Poor read and write speeds, low capacity at 8 or 32GB, can't see this being expensive or even requiring any real R&D.

2GB of DDR3 on a 64bit bus. Don't think I need to elaborate here.

ROM Drive. Like previous Nintendo consoles, again don't think it'd be expensive or require any real investment into R&D or engineering to come up with.

The controller also seems to be made up of pre-existing and mature technologies: resistive touch screen, LCD display, NFC, gyro and other sensors, Broadcom Mirrorcast streaming tech, 1500mAh Li-ion battery, etc.

I again can't see how complex engineering or developing the controller would have been. I know Nintendo stated that they spent a lot of time on the Mirrorcast tech, though, as they wanted it to be reliable and have excellent latency. But that sounds more like tweaking and fine-tuning to me than millions of dollars' worth of R&D and engineering resources. I also can't see the implementation of all these technologies into one unit as expensive; smartphones and tablets contain a lot/all of the same technologies as the Wii U.

From a software POV, I also can't see the OS having cost much money, given its lack of features, poor performance, and stability issues with hard locks. It seems to be far less developed than the Xbox 360's and PS3's were at their launches.

Performance-wise, the general consensus seems to be that the Wii U is only marginally better than the Xbox 360 and PS3. If that's the case, why didn't Nintendo just go buy a mid/low-end AMD APU for <$80 USD and call it a day? If all they wanted was 300-500 gigaflops of GPU performance with DX10-11 architecture, it seems the AMD + IBM combo was a more expensive way to achieve it. Backwards compatibility? Reluctance to change from an architecture they've been using for a decade? Why stick with IBM when it looks like AMD could have done an all-in-one for less? The AMD APU would have also solved their MCM expenses.
 
I wouldn't consider the low wattage as a bullet point on why it should be inexpensive... The chips are actually quite powerful for their consumption. That's not cheap. Chipworks estimated the MCM package at ~$100. The ROM drive is proprietary; it's not a standard Blu-ray drive. And again, its low power consumption could also affect price. I dunno about any of the other components, though I recall Nintendo stating that it took a while to get the latency on the gamepad right. Time = money when paying people to work. I'd try to find quotes, but I'm on my phone.

Edit: I wonder if they're including their network infrastructure in the cost.
 

FyreWulff

Member
It has the lowest latency of any modern wireless controller. The gamepad can read input and update the screen accordingly faster than a lot of TVs.
 
I also can't see the implementation of all these technologies into the one unit as expensive, smart phones and tablets contain a lot/all of these same technologies as the Wii U.

I feel like people tend to forget a nice smartphone can cost around $400-$600 without a contract. The Wii U sells at $300. Still, the details of how it adds up are a bit of a mystery.
 

tassletine

Member
I again can't see how complex engineering or developing the controller would have been. I know Nintendo stated that they spent a lot of time on the Mirrorcast tech though, as they wanted it to be reliable and have excellent latency. But that sounds more like tweaking and fine tuning to me then millions of dollars worth of R&D and engineering resources. I also can't see the implementation of all these technologies into the one unit as expensive, smart phones and tablets contain a lot/all of these same technologies as the Wii U.

From a software POV, I also can't see the OS having cost much money. Given its lack of features, poor performance, and stability issues with hard locks. It seems to be far less developed then the Xbox 360 and PS3's were at their launches.

Performance wise, the general consensus seems to be that the Wii U is only marginally better then Xbox 360 and PS3. If that's the case why didn't Nintendo just go buy a mid/low end AMD APU for <$80 USD and call it a day. If all they wanted was 300-500 gigaflops of GPU performance with DX10-11 architecture, seems the AMD + IBM combo would have been a more expensive way to achieve it. Backwards compatibility? Reluctance to change from a architecture they've been using for a decade? Why stick with IBM when it looks like AMD could have done an all in one for less. The AMD APU would have also solved their MCM expenses.

ANY new tech is extremely costly to produce. If you're just buying components off the shelf and slotting them in, it's much cheaper, but the result would be what you get with most phones -- a service that soon slows down and stops working after a couple of years. Don't forget that the Wii U is also running two screens at once, and as with a game like ZombiU it does it pretty well (even for a launch title).
You're mistaken if you think the Wii U OS lacks features overall. It lacks some the other consoles have, but it makes up for it in others. Miiverse, I think, would be very complicated -- I bet this is the reason the OS isn't running at full speed. The lock-ups seem to occur more when you are connected online. I've only had one so far though, unlike my Xbox, which does it regularly when it overheats.

As for the OS being less developed than the others at launch: I'd say they're comparable even today. Given that there's been 7 years to work on them, the Xbox and PS3 OSes aren't good at all. The most annoying thing about the Wii U OS is that it has a giant "Please wait" message and it plays a sound. I think without that its problems would be much less noticeable. It still needs to be fixed though.
 

wsippel

Banned
I understand the CPU is based on the PPC 750 architecture, so same architecture as the Gamecube and Wii. It differs from other 750s as it's multi core, clocked over 1GHz, and features 3 megabytes of eDRAM. But still surely this CPU wouldn't have been costly to engineer? It's after all low watt, low performance, low transistor count, built on a mature 40nm fab process, and based on over decade old architecture.

GPU, again more of the same as the GPU. Based on quite old AMD architecture, low end performance, built on what should be a mature 40nm fab process, and while customised again not an overly complex design. Could the 32 megabytes of eDRAM really add significantly to the chips manufacturing costs?
You're heavily underestimating those two factors. Also, anything that increases the size of a chip considerably increases manufacturing costs, as it reduces the yield - fewer dies fit on a wafer, and there's a much higher probability that individual dies come out dead. Removing the eDRAM would probably cut the manufacturing cost in half.

By the way: The CPU is 45nm, and we don't actually know what the GPU is based on or how much of whatever it was is left. Seemingly not much, as it doesn't look anything like any known AMD GPU. It's supposedly manufactured using a modern TSMC process - maybe 40LPG, which is actually a rather new and expensive process even though it's 40nm (first used for Tegra 3 in 2011).
 

ikioi

Banned
ANY new tech is extremely costly to produce. If you're just buying components off the shelf and slotting them in it's much cheaper

Isn't that what Nintendo are effectively doing?

The gamepad's streaming tech is Broadcom's Mirrorcast, optimised at the software/firmware level (per the Iwata Asks interviews) for improved latency and resistance to interference
The processor and its architecture are IBM's
AMD customise a GPU from the R700 series
Samsung provide NAND Flash and eMMC memory
Samsung, Micron, and various others provide the DDR3
Wireless networking is provided by Broadcom 802.11n chip
NFC chip in controller provided by Broadcom

Seems to me the majority of the components in the Wii U are off the shelf. Only the GPU and CPU are not, and they're rather simplistic and dated architectures by modern standards. Neither the CPU nor the GPU is cutting edge in either architecture or performance. Old architecture, low-end performance. Is it not true that the Wii U's CPU has roughly the transistor count of a single Xenon core? Or that the GPU at best has around 400 gigaflops of processing power?


If Nintendo are spending serious coin developing and manufacturing a 400-gigaflop GPU and a multi-core PPC 750 CPU, they're fools. Everything I've seen suggests a budget AMD APU, $80 RRP, could outperform the two of them without effort.

As for the Mirrorcast software, I can't imagine Nintendo spending tens of millions adapting Broadcom's Mirrorcast tech to the Wii U.


but the result would be what you get with most phones -- a service that soon slows down and stops working after a couple of years. Don't forget that the WiiU is also running two screens as once, and as with a game like Zombie U it does it pretty well (even for a launch title).

Based on the state of the Wii U's OS, it's already stopped working a few times for me.

As for the OS being less developed than the others at launch. I'd say they're comparable even today. Given that there's been 7 years to work on them the xbox and PS3 OS aren't good at all. The most annoying thing about the WiiU os is that it has a giant "Please wait" message and it plays a sound. I think without that it's problems would be much less noticeable. It still needs to be fixed though.

I don't agree with this.

You're heavily underestimating those two factors. Also, anything that increases the size of a chip considerably increases manufacturing costs, as it reduces the yield - less dies fitting on a wafer, and a much higher probability that individual dies come out dead. Removing the eDRAM would probably cut the manufacturing cost in half.

Not sure I understand.

Are you saying Nintendo invested significant money into the fabrication setup for the CPU? Because the Wii U's CPU is unlike any other PPC 750 ever produced (multi-core, large cache, high clock speed), a whole new fab process had to be set up, and this setup would have cost Nintendo a lot of money? How much money are we talking?

The size of the Wii U's CPU is nothing. Compared to Xenon at 45nm, isn't it roughly 1/3rd the size and thus transistor count?

As for the GPU, the eDRAM I understand would cost a bit extra. But surely we're talking dollars per chip at most here?

Overall though, wsippel, I'm not really buying the argument that the Wii U hardware cost a lot more than first appearances would suggest. The tablet controller is criticised for its battery life; well, Nintendo put a very small battery in it. Why'd they do that? Money is the only reason I can see. 2GB of DDR3 on a 64-bit bus - seriously, it would have been a matter of dollars to increase that to 256-bit and, say, 4GB. The eMMC flash memory in the system is the same low-end, slow read/write stuff used in iPads and smart devices - the cheapest of the cheap available on the market; Samsung don't manufacture flash memory slower than it, for the most part, from what I can find in their product listing. The majority of the Wii U's chips, sans CPU and GPU, are off the shelf, and all cost in the single-digit-dollars range.

The impression I'm getting now is that Nintendo made poor choices for the hardware and basically shot themselves in the foot. Why have they persisted with the PPC 750 CPU architecture? It's a 15-year-old architecture, give or take, and frankly, no matter how modified it is, it's got nothing on modern x86 or IBM CPU architectures. Where else is the 750 used? Nowhere outside of Nintendo's products; IBM dumped it over a decade ago with the last of the Apple iBooks.

The only thought I have for why Nintendo persisted with the PPC 750 is backwards compatibility and a reluctance to embrace and learn a new architecture. Having used the PPC 750 since the GameCube, sticking with it means they can avoid having to reskill and can reuse a lot of the assets and tools they've developed over the years. Nintendo do seem to go out of their way to avoid having to learn or embrace new architectures - evident in their continued use of the PPC 750, fixed-function GPUs, and the same base architecture concept from the GameCube through to the Wii U. Seems to me they've spent more money trying to adapt their existing architectures and beef them up for HD gaming - like dicking around making a multi-core PPC 750 - than what that money could have bought had Nintendo invested it into the best architecture AMD and IBM could have provided.

Spend $100 beefing up a 750 CPU. Result = still pathetically bad performance.
Spend $100 buying the best CPU IBM/AMD have available. Result = very good performance, but we'd have to learn a new architecture, develop new tools and assets, upskill and retrain staff, and we'd also lose backwards compatibility.
 