
Anandtech: Intel's new Atom CPU beats AMD's Jaguar in performance

That is the fucking x86 decoder (as in the max instructions per cycle that it can decode into micro-ops); that is not the CPU's IPC when processing a load!

So what do you think this means?

The IPC of Piledriver and Jaguar CPU cores should be pretty close.

http://beyond3d.com/showpost.php?p=1667582&postcount=43

Bobcat/Jaguar core can issue up to 6 instructions per clock (it has a dual port integer scheduler, a dual port AGU scheduler and a float "coprocessor" scheduler that can issue 2 instructions per clock). It however cannot decode/retire more than 2 instructions per clock.

In comparison a BD/PD module (2 cores) can issue 2x4+4 instructions = 12 instructions per clock, and decode/retire 4 instructions. So a BD/PD module (two cores) provides exactly the same peak and sustained rates as two Bobcat/Jaguar cores.

Of course shared decode and especially shared retire are a boon for BD/PD average IPC, since the processor can temporarily boost the sustained decode/retire rate of a core when the other core stalls (pipeline bubbles, cache stalls, branch misprediction, etc). The peak (all cores running at full steam) instruction throughput of Jaguar and BD/PD cores is however exactly the same.

http://beyond3d.com/showpost.php?p=1668355&postcount=61

Yes, they are also talking about the decoder, but it's clearly part of the discussion about the IPC.
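For what it's worth, here is a minimal sketch of the arithmetic in those quotes, using only the port counts sebbi lists (nothing here comes from AMD documentation). It just shows that the quoted peak per-core widths come out identical, which is why peak width alone says nothing about achieved IPC:

```python
# A minimal sketch of the per-clock peak widths quoted above; the port counts are
# the ones from the quoted B3D posts, not from official documentation.
jaguar_core = {"issue": 2 + 2 + 2,   # 2-port int scheduler + 2-port AGU scheduler + 2-port FP
               "decode": 2,
               "retire": 2}
bd_pd_module = {"issue": 2 * 4 + 4,  # 2 cores x 4 integer/AGU issue + 4 shared FP issue
                "decode": 4,
                "retire": 4}
CORES_PER_MODULE = 2

for key in ("issue", "decode", "retire"):
    print(f"{key:6}: Jaguar core = {jaguar_core[key]}, "
          f"BD/PD per core = {bd_pd_module[key] / CORES_PER_MODULE:g}")
# issue : Jaguar core = 6, BD/PD per core = 6
# decode: Jaguar core = 2, BD/PD per core = 2
# retire: Jaguar core = 2, BD/PD per core = 2
```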
 

Durante

Member
Durante, that paper is truly amazing. I just hope you have no part in it. I mean, I read it up to this paragraph I'll quote here, for coder's gaf amusement:

"Compiler: Our toolchain is based on a validated gcc 4.4 based
cross-compiler configuration. We intentionally chose gcc so
that we can use the same front-end to generate all binaries. All
target independent optimizations are enabled (O3); machine-
specific tuning is disabled so there is a single set of ARM bi-
naries and a single set of x86 binaries. For x86 we target 32-bit
since 64-bit ARM platforms are still under development. For
ARM, we disable THUMB instructions for a more RISC-like
ISA."

Using gcc 4.4 (in research from 2013..) with no machine-specific optimisations across widely varying architectures (i.e. in-order and out-of-order) to draw conclusions about performance.. Using x86 on amd64 machines, under the excuse that 64-bit ARM is still unavailable (never mind it uses a known 32-bit op encoding and is already supported in emulated form by well-established toolchains), while refusing to use Thumb encoding on the ARM, to draw conclusions about code density.. Amazing.
I had a link to the paper sent to me and only read the abstract. Still, I don't know if you're in academia but the fact that e.g. an older version of GCC was used is not particularly surprising or unusual. Just because something was published in 2013 doesn't mean that it was written in 2013, and certainly not that all the experiments it was based on were performed in 2013. Sadly, there's a lot of time that can (and at least some time that has to) pass to get anything published. I don't think any of the points of critique you bring up completely invalidate the conclusions drawn in the work (particularly since I've not seen any evidence to the contrary).
 

ghst

thanks for the laugh
Jaguar is a low end CPU because it has abysmal IPC. You can stack as many cores as you want to get good multithread scores in benchmarks, but that doesn't change that fact.

No matter how well optimized your code is, it will only run as fast as the slowest thread. And games need good IPC across multiple threads. There are a lot of sub-routines that need good IPC, so I'm actually worried about how the next gen console CPUs will hold up to next gen game innovations beyond graphics.

FFS, dual core i3s can beat these things easily even in heavily multithreaded scenarios, so how can you claim that they are more than just low end parts? Comparing CPUs isn't about adding up GFLOPs. CPUs have to do the job in a lot of different heterogeneous workloads, and Jaguar is far from that.

i'd really appreciate a thread which went into the guts of CPU parallelism and what we can really expect from a generation built upon octo core jaguars (which are actually more like two quad core jaguars duct-taped together), but there are way too many hair trigger apologists out there right now for that to be anything but a "but does your PC run uncharted?" quagmire.

i mean, the console CPUs are coming in at half the speed of 2500k in optimal scenarios, but so many PC developers are terrified at the idea of having truly parallel code, instead preferring to run a linear "game thread" and offload shitwork to the supplementary threads. i mean, it's all well and good for something like civ 5 where latency and synchronicity aren't going to be an issue, but for something like starcraft 2 i can imagine it being a nightmare (and is cited as the main reason for its crappy threaded performance).

what should we really be expecting over the next six years? is truly optimal thread parity a realistic ambition in game development?
 

enra

Neo Member
As I said, an octocore Bulldozer-derivative CPU only has 4 FPUs.

How do you explain the huge discrepancy (a 50% difference) between the data from that graph and the results from Anandtech? It is obvious that the methodology used in that review is faulty.
 

blu

Wants the largest console games publisher to avoid Nintendo's platforms.
I had a link to the paper sent to me and only read the abstract. Still, I don't know if you're in academia but the fact that e.g. an older version of GCC was used is not particularly surprising or unusual. Just because something was published in 2013 doesn't mean that it was written in 2013, and certainly not that all the experiments it was based on were performed in 2013. Sadly, there's a lot of time that can (and at least some time that has to) pass to get anything published. I don't think any of the points of critique you bring up completely invalidate the conclusions drawn in the work (particularly since I've not seen any evidence to the contrary).
No, I'm not in academia, so I did indeed neglect the possibility that the research and publishing timeframes might differ substantially - mea culpa. I'm in the HPC field though (having been in the games field in a previous life), and evaluating compiler performance in the context of divergent ISAs and ISA implementations has been part of my daily routine for the better part of my career. The value of whatever that paper concludes (which could be largely true for the parameters of their experiment, I haven't actually gone through their figures) is significantly tarnished by their methodology setup alone - if they draw valid conclusions for the current timeframe and compiler status quo, that would be by pure chance.
 

kortez320

Member
IPC on Piledriver is also bad.

And even if Jaguar were on the same level, it's still starting from a bad place with a very low clock speed.

It's just not a fast chip and even assuming a game that will scale 100% across all cores it's still not as fast as mid-range chips from four years ago.
 
IPC on Piledriver is also bad.

And even if Jaguar were on the same level, it's still starting from a bad place with a very low clock speed.

It's just not a fast chip and even assuming a game that will scale 100% across all cores it's still not as fast as mid-range chips from four years ago.

Well, I disagree. Jaguar uses a more efficient architecture, which is intended to work at lower clock speeds. This is a more detailed post from B3D explaining the differences between Jaguar and Piledriver.

http://beyond3d.com/showpost.php?p=1715619&postcount=155

Piledriver core (CPU used in Trinity) is designed for high clock speeds (turbo up to 4.3 GHz, overclock up to 8 GHz). In order to reach such high clocks, several sacrifices had to be made. The CPU pipelines had to be made longer, because there's less time to finish each pipeline stage (as clock cycles are shorter). The cache latencies are longer, because there's less time to move data around the chip (during a single clock cycle). The L1 caches are also simpler (less associativity) compared to Jaguar (and Intel designs). In order to combat the IPC loss of these sacrifices, some parts of the chip needed to be beefed up: The ROBs must be larger (more TLP is required to fill longer pipelines, more TLP is required to hide longer cache latencies / more often occurring misses because of lower L1 cache associativity) and the branch predictor must be better (since long pipeline causes more severe branch mis-predict penalties). All these extra transistors (and extra power) are needed just to negate the IPC loss caused by the high clock headroom.

Jaguar compute unit (4 cores) has the same theoretical peak performance per clock as a two module (4 core) Piledriver. Jaguar has shorter pipelines and better caches (less latency, more associativity). Piledriver has slightly larger ROBs and a slightly better branch predictor, but these are required to negate the disadvantages in the cache design and the pipeline length. The per-module shared floating point pipeline in Piledriver is very good for single threaded tasks, but for multithreaded workloads the module design is a hindrance, because of various bottlenecks (shared 2-way L1 instruction cache and shared instruction decode). Steamroller will solve some of these bottlenecks (before the end of this year?), but it's still too early to discuss it (with the limited information available).

Jaguar and Piledriver IPC will be in the same ballpark. However, when running these chips at low clocks (<19W) all the transistors spent in the Piledriver design to allow the high clock ceiling are wasted, but all the disadvantages are still present. Thus Piledriver needs more power and more chip area to reach performance similar to Jaguar's. There's no way around this. The Jaguar core has better performance per watt.

My guess:
- Multi threaded workload (GPU stressed): Jaguar wins by +10% (because of clock difference: 1.6 GHz vs 1.815 GHz)
- Multi threaded workload (GPU idle): Tie (depends highly on how much PD can turbo clock in this case)
- Single threaded workload (GPU idle): PD wins by up to +50% (single core in PD module runs up to 20% faster when the other core is idling + PD can turbo clock to 2.4 GHz when only a single core is active).

GPU performance difference is impossible to estimate at this point, because we do not yet know exact details about the Temash/Kabini APU configurations. However Jaguar core takes (considerably) less die space and consumes slightly less power than a Piledriver core (of similar performance). This means that AMD could equip the Jaguar based APU with a more powerful GPU than Trinity at the same TDP limit. It's all about market segmentation. If AMD plans to place the high end Kabini APUs against Intel's Ultrabook chips, we might see a more powerful GPU (AMD has technology available for this already, just look at the PS4 design). However if AMD sees Kabini as a low end laptop / netbook platform, they are likely going to focus on making a small chip that is cheap to produce, and limit the GPU options to low performance models.
 
Well, you could at least try to back up your claim.

You are comparing CPUs just by how many instructions they can decode per cycle. Let's ignore branch prediction and out of order capabilities, let's forget about cache, instruction support, FPU, micro ops and shite. What the hell, let's even ignore that it isn't true that Bulldozer and Jaguar can decode the same number of instructions per cycle.

Are you aware that a PD core has around 3 times the pipeline depth of a Jaguar core? You know, that long pipeline crap that allows it to surpass 5 GHz easily when we haven't even seen a 2 GHz Jaguar.

Dunno, it is just the easiest card I can play, among many others.

Using your instruction throughput measure, someone could argue that Core 2 and Sandy Bridge have the same IPC. That gives an idea of the quality of your argument.

It is not me who has to write a wall of text explaining how a modern CPU works to prove that a Jaguar core can't match the performance of a Piledriver core. You are the one who claims that Jaguar has the same IPC as PD. You are the one who has to prove your point.

There are only 2 scenarios:

1. You are just trolling, trying to have some fun at people's inability to explain why Jaguar is a low end architecture.
2. You haven't the slightest idea of what you are talking about.

But even more hilarious is that you are trying to prove that Jaguar's IPC isn't low end by comparing it to an architecture known for its low IPC.


How do you explain the huge discrepancy (a 50% difference) between the data from that graph and the results from Anandtech? It is obvious that the methodology used in that review is faulty.

What are you talking about?

[Benchmark charts: CB1151.png, 51136.png, 51135.png]

You asked for an example of bad Cinebench scaling, and I brought you one. In this case, adding 4 more cores hurt performance because Turbo Core can't kick in with 8 cores as aggressively as it does with only 4. But you are still limited to 4 FPUs, so your performance is lower. This is what happens when you try to measure CPUs by adding up GFLOPs and multiplying by core count.
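To illustrate that point (and only that point; the clocks below are hypothetical, not measured): if the workload is bound by the four shared FP units of a 4-module chip, enabling the second core of each module adds no FP resources, it only lowers the all-core turbo.

```python
# Illustrative only: the turbo clocks are hypothetical, NOT measured values.
# If the workload is bound by the 4 shared FP units of a 4-module chip, enabling
# the second core in each module adds no FP resources but lowers the turbo clock.
FP_UNITS = 4  # one shared FP unit per module, 4 modules

def fp_bound_throughput(active_cores, all_core_turbo_ghz):
    # FP-bound throughput scales with the FP units actually usable and their clock.
    usable_fp_units = min(active_cores, FP_UNITS)
    return usable_fp_units * all_core_turbo_ghz

four_cores  = fp_bound_throughput(4, 4.2)  # hypothetical: higher turbo, one core per module
eight_cores = fp_bound_throughput(8, 4.0)  # hypothetical: lower turbo, modules fully loaded
print(four_cores, eight_cores)             # 16.8 vs 16.0 -> more cores, slightly less FP throughput
```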
 
Basically, you are taking a CPU built around clock speed, removing said clock speed and then arguing that it performs lower than a CPU built around efficiency.

I can imagine a lot of examples to mock that comparison, but I won't use any. Because I have no need to.
 

mkenyon

Banned
In the article you posted:
Jaguar compute unit (4 cores) has the same theoretical peak performance per clock as a two module (4 core) Piledriver
1.6-1.8GHz Piledriver is beyond laughable. I guess the point that you're not getting is that Piledriver isn't good. It's worse than the first-generation Core series processors when you're talking about gaming performance. The only reason it does okay, especially in multimedia creation, is the very high clock speed.
Basically, you are taking a CPU built around clock speed, removing said clock speed and then arguing that it performs lower than a CPU built around efficiency.
Exactly.
 
Intel has outperformed AMD since 2006. No one should be amazed and shocked by the results of this comparison. Sony and MS chose AMD because they could source CPU + GPU from one company, plus Intel would have charged a lot more than AMD did for the privilege of using their CPUs.
 
Basically, you are taking a CPU built around clock speed, removing said clock speed and then arguing that it performs lower than a CPU built around efficiency.

I can imagine a lot of examples to mock that comparison, but I won't use any. Because I have no need to.

You can't simply ignore that Jaguar has twice the SIMD throughput of Piledriver, and then say that it's worse than Piledriver just because it uses a lower clock speed.

Now, this is what I said in a previous post.

An 8-core Jaguar at 2GHz should score around 4, which is more than the FX-4350. However, we don't know what the PS4 CPU clock speed is.

The FX-4350 is clocked at 4.2GHz. As you can see, I am not ignoring the clock speed in my comparison.

You asked for an example of bad Cinebench scaling, and I brought you one. In this case, adding 4 more cores hurt performance because Turbo Core can't kick in with 8 cores as aggressively as it does with only 4. But you are still limited to 4 FPUs, so your performance is lower. This is what happens when you try to measure CPUs by adding up GFLOPs and multiplying by core count.

Which would not be the case with Jaguar, as each core has its own FPU. With an 8 core Jaguar you would have 8 FPUs. And it's not just the FPU that is not shared.

http://beyond3d.com/showpost.php?p=1667582&postcount=43

- Jaguar cores have their own 2-way 32 KB L1i caches. Two Piledriver cores share a 2-way 64 KB L1i cache. Sharing a 2-way cache between 2 cores is bad for performance, so Jaguar seems to win this one.
- Piledriver core has a tiny 16 KB L1 data cache, while Jaguar core has a larger 32 KB L1 data cache. Jaguar wins again.
 

joshcryer

it's ok, you're all right now
Intel has outperformed AMD since 2006.

Only because AMD was more innovative. On-die memory controller. Core 2 used an off-die memory controller. Intel invested deeply in transistor size reduction while AMD invested in architecture. In retrospect AMD was ahead of its time (on-die memory controllers didn't get adopted by Intel until later, but by then Intel had reduced their transistor size, giving them a TDP advantage over AMD which AMD has yet to overcome, unfortunately; I'll refrain from the anticompetitive behavior stuff here, as I want to keep it purely on a technological level...).

In the end Intel won the war by being ahead in IC fab tech, and AMD has lagged significantly. What AMD excels at is their graphics card architecture, which might not be as good as Nvidia's for raw compute, but it excels at what it's meant for: graphics processing (price/performance wise!).
 

Durante

Member
Intel has outperformed AMD since 2006. No one should be amazed and shocked by the results of this comparison. Sony and MS chose AMD because they could source CPU + GPU from one company, plus Intel would have charged a lot more than AMD did for the privilege of using their CPUs.
This is more or less true, but previous Atoms really sucked a lot (even compared to AMD CPUs), so that changing is the news here.
 
The real king here is Apple's ARMv8 A7 CPU:

[Benchmark chart: 58194.png]


Not a very orthodox comparison (different OS...) but very indicative of the performance they get from a CPU made on a 28nm process vs the 22nm of Silvermont, at similar or lower wattage.
 

Durante

Member
That's an almost useless comparison though, there's no way to tell how optimized for the CPU architecture those browsers are. (I'd assume that the mobile browser for ARM is roughly about 50 million times more optimized, simply by necessity -- or the lack of it on x86)
 

strata8

Member
Silvermont doesn't beat Jaguar in IPC. The 1.46 GHz listed in the chart is the minimum clock speed of the Z3770, but it turbos up to 2.4 GHz. The A4-5000 on the other hand is locked to 1.5 GHz.
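One way to make that concrete (the clocks are the ones quoted above; the benchmark scores below are made-up placeholders, not real results): normalise the result by the clock the chip actually ran at before drawing any IPC conclusion.

```python
# Illustrative: normalise a benchmark result by the clock the chip actually ran at
# before claiming an IPC advantage. The clocks are the ones quoted above; the
# scores are hypothetical placeholders, not real results.
def per_clock(score, clock_ghz):
    return score / clock_ghz

z3770_score, z3770_clock = 100.0, 2.4      # Silvermont result at its 2.4 GHz turbo (score hypothetical)
a4_5000_score, a4_5000_clock = 70.0, 1.5   # Jaguar result at its fixed 1.5 GHz (score hypothetical)

print(per_clock(z3770_score, z3770_clock))     # ~41.7 "points per GHz"
print(per_clock(a4_5000_score, a4_5000_clock)) # ~46.7 -> a raw win can hide a per-clock loss
```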
 

kitch9

Banned
The entire argument is pretty much moot with APUs, because with the new GPGPU capability you can spread general processing across the entire chip. The APU has a lot of advantages over Intel chips because of this.
 
The entire argument is pretty much moot with APUs, because with the new GPGPU capability you can spread general processing across the entire chip. The APU has a lot of advantages over Intel chips because of this.

Sorry, but most Intel laptop CPUs also have an on-die GPU.
Not sure how well Intel can compete against what AMD will put in their laptop APUs, though.
 

kartu

Banned
Sorry, but most Intel laptop CPUs also have an on-die GPU.
Not sure how well Intel can compete against what AMD will put in their laptop APUs, though.

Not sure why you mention laptop CPUs (consoles are in a 100W+ power envelope, after all).
On desktop, Intel APU (Haswell) vs AMD APU, in short: A10 spanks i7 at gaming:
http://www.bit-tech.net/hardware/cpus/2013/07/02/haswell-and-richland-gpu-testing/3

There was simply NO good alternative to AMD's product this time, no wonder both Sony & Microsoft went AMD.
 

tipoo

Banned
What I find crazier than the title's point is that Apple's A7 in the iPhone 5S comes close to, and occasionally surpasses, Intel's Bay Trail. Bay Trail itself is now rivaling the likes of the original ULV Core 2 Duos, like in the old MacBook Air. Crazy, crazy. Next year's smartphone SoCs may well be faster than standard voltage Core 2 Duos like the one I'm typing on, which still feels plenty fast to me (with an SSD at least).
 

kitch9

Banned
Sorry, but most Intel laptop CPUs also have an on-die GPU.
Not sure how well Intel can compete against what AMD will put in their laptop APUs, though.

Intel's APUs have comparatively poor GPU processing power and favour CPU grunt, especially when compared with the consoles' APUs.

Add a bit of hUMA and it's no comparison. The consoles will have plenty of general processing grunt.
 
Not sure why you mention laptop CPUs (consoles are in a 100W+ power envelope, after all).
On desktop, Intel APU (Haswell) vs AMD APU, in short: A10 spanks i7 at gaming:
http://www.bit-tech.net/hardware/cpus/2013/07/02/haswell-and-richland-gpu-testing/3

There was simply NO good alternative to AMD's product this time, no wonder both Sony & Microsoft went AMD.

Except for going with a traditional CPU + GPU instead of an APU. The key here is that they wanted a cheap console, so they chose a single-chip solution.
 

kartu

Banned
On the low end...as the article points out.
How is i7 a "low end"?


Except for going with a traditional CPU + GPU instead of an APU. The key here is that they wanted a cheap console, so they chose a single-chip solution.
Cerny said that devs pushed for unified memory, so traditional CPU+GPU wouldn't fulfill the requirements.

PS
From anandtech article:

In its cost and power band, Jaguar is presently without competition. Intel’s current 32nm Saltwell Atom core is outdated, and nothing from ARM is quick enough. It’s no wonder that both Microsoft and Sony elected to use Jaguar as the base for their next-generation console SoCs, there simply isn’t a better option today. As Intel transitions to its 22nm Silvermont architecture however Jaguar will finally get some competition. For the next few months though, AMD will enjoy a position it hasn’t had in years: a CPU performance advantage.
http://www.anandtech.com/show/6976/...wering-xbox-one-playstation-4-kabini-temash/5
 
How is i7 a "low end"?

It's clear that both AMD and Intel have made large strides in GPGPU and gaming performance with their latest round of APUs, making for vastly superior all round performance for both low-end desktops and laptops alike


On the flip side, AMD has a big advantage when it comes to gaming, though we do wonder quite how big the market is for such a lowly gaming system
See Conclusion. These chips are low-end when compared to other parts on the market.

Besides, the i7 is a good CPU with a crappy video solution, and the article states that you might as well go discrete with the i7 APU since it's basically being held back by its GPU.

It will be interesting to see how far Intel's new Atom chip has closed the low-end price/performance gap with AMD, though.
 
You can't simply ignore that Jaguar has twice the SIMD throughput of Piledriver, and then say that it's worse than Piledriver just because it uses a lower clock speed.

What nonsense are you talking about?

You are just quoting some posts from Beyond3D without having basic knowledge. Jaguar doubled SIMD performance compared to Bobcat, not Piledriver.

The FX-4350 is clocked at 4.2GHz. As you can see, I am not ignoring the clock speed in my comparison.

No. You are now just comparing an octocore Jaguar against a 2-module Piledriver, the lowest end Piledriver CPU.

Which would not be the case with Jaguar, as each core has its own FPU. With an 8 core Jaguar you would have 8 FPUs. And it's not just the FPU that is not shared.

http://beyond3d.com/showpost.php?p=1667582&postcount=43

Doesn't change the fact that octocore Jaguars aren't native octocores. We don't know yet how Jaguar will scale over 4 cores, or the penalties that come from having to go over the system bus to communicate between cores that don't share the same module.

Multiplying per-core performance by core count doesn't work.

Until now, your whole argument has been that the Jaguar core isn't low end because if you downclock low-to-mid-tier CPUs, they perform as badly as Jaguar. As I said earlier, you have to show a Jaguar CPU performing better than a low end CPU, not the other way around.
 
What nonsense are you talking about?

You are just quoting some posts from Beyond3D without having basic knowledge. Jaguar doubled SIMD performance compared to Bobcat, not Piledriver.

Nope. I am comparing Jaguar to Piledriver.

Rolf NB said:
Jaguar actually has twice the SIMD throughput per clock and core when compared to Bulldozer. A Jaguar core has the same amount of SIMD resources as a Bulldozer module (two cores with a shared FP unit).

e: http://beyond3d.com/showpost.php?p=1668532&postcount=70

http://www.neogaf.com/forum/showpost.php?p=50472948&postcount=1417

No. You are now just comparing an octocore Jaguar against a 2-module Piledriver, the lowest end Piledriver CPU.

Well, that was the whole point of the original discussion: comparing the CPU used in the next gen consoles to PC CPUs. The FX-4350 is not the lowest end Piledriver, nor the lowest end desktop CPU in that Cinebench test.

http://www.neogaf.com/forum/showpost.php?p=82527769&postcount=186

Doesn't change the fact that octocore Jaguars aren't native octocores.
We don't know yet how Jaguar will scale over 4 cores, or the penalties that come from having to go over the system bus to communicate between cores that don't share the same module.

Multiplying per-core performance by core count doesn't work.

Cinebench scales very well even with 32 cores.

Cinebench scales very easily as can be noticed from looking at the 32 core and 40 core results of the Xeon E7-4870. Increase the core count by 25% and you get a 22.4% performance increase.

http://www.anandtech.com/show/4486/server-rendering-hpc-benchmark-session/3
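Worked out explicitly from the figures quoted above (nothing else assumed), that corresponds to roughly 90% scaling efficiency on the added cores:

```python
# Scaling efficiency of the quoted Xeon E7-4870 Cinebench data point:
# 25% more cores yields a 22.4% performance increase.
core_increase = 40 / 32 - 1           # 0.25
perf_increase = 0.224                 # quoted from the linked article
print(perf_increase / core_increase)  # ~0.90 -> roughly 90% efficiency on the added cores
```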

As I said earlier, you have to show a Jaguar CPU performing better than a low end CPU, not the other way around.

I already addressed this point.

http://www.neogaf.com/forum/showpost.php?p=82535965&postcount=187
 

Durante

Member
Nope. I am comparing Jaguar to Piledriver.
Sebbi's post on B3D is correct (of course, it's sebbi), but the interpretation is wrong. Bulldozer/Piledriver FP modules have FMA support, so their FP throughput in GFLOPs (which is the standard measurement for theoretical performance) is higher than that of 2 Jaguar cores.

And I really wouldn't try to make the argument that the FX-4350 isn't a low-end CPU. You can get a 4-module (8 core) Vishera CPU which is almost twice as fast for less than €140. A CPU half as fast as a €140 CPU is low-end.
 

kortez320

Member
Not trying to be a jerk but you are just regurgitating information that you don't fully grasp.

There's not really an actual discussion going on here.
 
Sebbi's post on B3D is correct (of course, it's sebbi), but the interpretation is wrong. Bulldozer/Piledriver FP modules have FMA support, so their FP throughput in GFLOPs (which is the standard measurement for theoretical performance) is higher than that of 2 Jaguar cores.

And I really wouldn't try to make the argument that the FX-4350 isn't a low-end CPU. You can get a 4-module (8 core) Vishera CPU which is almost twice as fast for less than €140. A CPU half as fast as a €140 CPU is low-end.

The interpretation is not wrong. Jaguar has a 128-bit wide FPU per core, while Piledriver has a shared 2x128-bit FPU per module. Therefore, two Jaguar cores have 2 FPUs, while a Piledriver module (2 cores) has 1 FPU. This is why Jaguar has twice the SIMD throughput per clock and core, even with the same GFLOPs. Now, if you use FMA, the FP throughput is the same, as both Jaguar and Piledriver issue 8 flops per clock per core. However, if you don't use FMA, then Piledriver will issue fewer flops per core on floating point intensive code.

http://beyond3d.com/showpost.php?p=1668525&postcount=69

Here's the important part about FMA.

http://beyond3d.com/showpost.php?p=1668532&postcount=70

If the algorithm can be represented by series of multiplies followed by dependent additions, the FMA will help BD a lot. But no algorithm is pure FMA, and all float instructions (even the simplest ones) are using the same FMA pipelines. These pipelines can be a bottleneck in many algorithms. Sometimes it's better to have four small pipelines (2 x Jaguar cores) than two heavy hitters (one BD module).
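A back-of-the-envelope version of the peak numbers being argued about here, assuming the 128-bit (4-lane single-precision) units described in the quoted posts. This is peak arithmetic only, not achieved throughput:

```python
# Peak single-precision flops per clock, assuming 128-bit (4-lane) SIMD units as
# described in the quoted posts. Peak arithmetic only -- not achieved performance.
LANES = 4  # 32-bit lanes in a 128-bit vector

# Jaguar core: one FADD pipe + one FMUL pipe, no FMA.
jaguar_core_flops = LANES + LANES                 # 8 flops/clock per core

# Piledriver module (2 cores): two shared 128-bit FMA pipes.
pd_module_fma    = 2 * LANES * 2                  # 16 flops/clock per module with FMA
pd_module_no_fma = 2 * LANES                      # 8 flops/clock per module without FMA

print(jaguar_core_flops)        # 8 per Jaguar core
print(pd_module_fma / 2)        # 8 per PD core if the code is pure FMA
print(pd_module_no_fma / 2)     # 4 per PD core if FMA can't be used
```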
 

enra

Neo Member
What are you talking about?

I'm talking about the difference in Cinebench scores between these:

[Benchmark charts: CB1151.png vs. CB-Scaling1.png, 51136.png, 51135.png]


You asked for an example of bad Cinebench scaling, and I brought you one. In this case, adding 4 more cores hurt performance because Turbo Core can't kick in with 8 cores as aggressively as it does with only 4. But you are still limited to 4 FPUs, so your performance is lower. This is what happens when you try to measure CPUs by adding up GFLOPs and multiplying by core count.

You call a 78.3% increase for every core added bad?
Corrected for the turbo, this increases to 82%.
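For reference, one way a "% increase per added core" figure like that can be computed from two Cinebench runs (the scores below are hypothetical placeholders since the charts aren't reproduced here; only the formula is the point):

```python
# How a "% increase per added core" figure can be computed from two Cinebench runs.
# The scores below are hypothetical placeholders (the original charts are not
# reproduced here); only the formula itself is the point.
def gain_per_added_core(score_base, cores_base, score_more, cores_more):
    per_core_base = score_base / cores_base
    added_cores = cores_more - cores_base
    return (score_more - score_base) / (added_cores * per_core_base)

# Hypothetical example: 4 cores scoring 2.0, 8 cores scoring 3.6
print(gain_per_added_core(2.0, 4, 3.6, 8))  # 0.8 -> each added core contributed ~80% of a base core
```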
 