So, the Pi 4 has a "VideoCore VI" which appears to be reasonably unique. Wikipedia tells me there are no other Broadcom chips with a VC6. There's an older chip with a VC5 but it appears to be quite different to VC6.
Looking at the GPU dump (glxinfo), the VideoCore 6 appears as a "VC 4.2", whereas the VideoCore 4 as featured on the Pi 3B+ shows up as a "VC4 2.1". This, to me, suggests VC6 is an uprev of VC4, rather than a completely new core, which makes sense.
Can anyone with technical knowledge (RasPi team or otherwise) confirm or deny any of the following?
- Presence/absence of a H265 block? I tried H265 playback. OMXPlayer said no, VLC used software and stuttered. I'm sure if there is one, the software isn't there yet, which is fine, but is there hardware to support it with a roadmap for some software support? Is encode supported for H265 too? Will encode support 4K or be limited to 1080p30 H264 as on the Pi 3? Will other codec support be available, just as VC-1 and MPEG2 are supported on older Pis?
- Can the Pi 4 drive both HDMI outputs @ 4K60 plus the DSI @ 1080p30? (Obviously theoretical, I'm aware that the drivers aren't there yet. The DSI bus should be capable of 1080p30.)
- Technical information on VC. Is the VC6 an uprev of VC4 and so by-and-large compatible with VC4 software (e.g. open source compilers) or is the architecture significantly different? Are there more QPUs? More VPUs? Is the VPU instruction set similar? I notice OpenGL ES 3.0 is supported (up from 2.0), so presumably some of the limitations of VC4 have been lifted.
- MMAL and CSI seem to work as before. But, I'm curious if there's much difference in the ISP because raspiraw has difficulties with AWB/gain right now. It might be related to the ability to write I2C, but it seems to initialise fine. Is this expected?
- I saw mention of a 500MHz clock rate for VC, is this genuine? This would be a significant uplift. Will overclocking be permitted on all parts as before?
- Is the 2048x2048 texture limit of the VC4 removed? I know the older Pi 3 was capable of 4K with some hacks, but the 2K texture limit caused issues. I would hope a minimum 4Kx4K texture limit is present.
- Dynamic memory: it appears the GPU can now address at least 1.8GB (on my 2GB Pi). Is the dynamic memory allocation therefore similar to the Pi 3 but with a larger memory space available? This doesn't match up with the docs, which suggest the maximum memory for the GPU is the same as on older Pis.
Thanks,
Re: Pi 4 - full specification of VideoCore 6
Most of these questions have already been answered elsewhere, but to give more detail.
1. There is a H265 decoder block, but it is not part of the VideoCore. Use LibreELEC, which has the best support right now - we are working on better VLC support via ffmpeg.
2. No. You can have 1x4kp60, 2x4kp30 or 4kp30+1080p60 etc. HDMI 0 (any res) can be combined with DSI/DPI. Cannot currently do three displays at the same time.
3. Cannot comment on this
4. ISP is the same, there was a change to the AWB prior to the release of the Pi4, are you seeing that?
5. Clock rate for VC has increased due to being on a smaller process. Change it at your peril.
6. 7680x7680
7. GPU now has its own memory manager, so is no longer constrained to 1GB, and is no longer part of the GPU split. That is just for things accessed by the VPU (which is the same) so used for codecs and camera etc.
Principal Software Engineer at Raspberry Pi Ltd.
Working in the Applications Team.
Re: Pi 4 - full specification of VideoCore 6
raspiraw's AWB algorithm is a seriously basic grey world algorithm, so I wouldn't expect stellar results.
What issues specifically? The ISP hardware hasn't changed beyond the geometry shrink and clock speed increase. White balance has never been over I2C.
Software Engineer at Raspberry Pi Ltd. Views expressed are still personal views.
I'm not interested in doing contracts for bespoke functionality - please don't ask.
Re: Pi 4 - full specification of VideoCore 6
Thanks @jamesh for the answers.
Regarding AWB issues @6by9: on the IMX219 the image appears to have almost zero gain, i.e. only bright lamps show up and the rest of the image is dark. raspivid/still both work as expected and test patterns generated by the '219 also show up correctly.
However I had to hack at camera_i2c and enable the dtparam for the VC's I2C bus to get as far as raspiraw recognising the camera. So, there may be a problem there. It's not an issue for me now. I pulled the most recent raspiraw which doesn't appear to have any mods for the Pi 4 added yet (no rush from me!)
Re: Pi 4 - full specification of VideoCore 6
tom66 wrote: - Technical information on VC. Is the VC6 an uprev of VC4 and so by-and-large compatible with VC4 software (e.g. open source compilers) or is the architecture significantly different? Are there more QPUs? More VPUs? Is the VPU instruction set similar? I notice OpenGL ES 3.0 is supported (up from 2.0), so presumably some of the limitations of VC4 have been lifted.
jamesh wrote: 3. Cannot comment on this
I really hope you guys don't mind me digging up my own information.
Because I've been looking through the open source drivers and here are some of my observations:
- vc6 is clearly derived from vc4, but it is significantly different. vc6 is only a slight extension over vc5
- The QPU pipeline stays mostly the same: you still have an add ALU and a multiply ALU, and it can issue two ALU ops per cycle. There are still 4 SIMD lanes, interleaved over 4 cycles.
- The instruction encoding for the QPUs is different, but the core instructions are the same.
- Instructions for packed 8-bit int math have been dropped, along with most of the pack modes.
- Instructions for packed 16-bit float math have been added (2 floats in a single operation).
- The multiply ALU can now fadd, so you can issue two fadds per instruction.
- The add ALU has gained a bunch of new instructions that I don't recognise by name and haven't explored.
- The A and B register files have been merged. You still only get an A read and a B read per instruction, but they read from one big register file (which means the underlying memory block has gone from two sets of "one read port, one write port" to one "two read ports, one write port" block).
- The theoretical max FLOPs per QPU remains the same at two per cycle, other than the bump from 400MHz to 500MHz.
- But it looks like a lot of effort has been put into making better use of those theoretical FLOPs.
- vc4 could run one or two threads per QPU. When you ran in two thread mode, the available register file halved to 32 registers.
- vc5 added a four thread per QPU mode, with 16 registers per thread.
- vc6 doubled the size of the register file. You can now use all 64 registers in two thread mode and 32 registers in four thread mode. Single thread mode was removed; you always have at least two threads.
- With the threading improvements, the QPUs should spend much less time idle waiting for memory requests.
There is probably plenty more I've missed.
- Most of the design changes have gone to improving the fixed function hardware around the QPUs.
- a fixed function blend unit has been added, which should reduce load on the QPUs when doing alpha blending. Though, I really hope software blending is still possible, I have a use case which takes advantage of software blending.
- The tile buffer can now store up to 4 render targets (I think it's up to 128 bits per pixel, so if you are using 4 32-bit render targets, you can't have a depth buffer)
- Faster LPDDR4 memory.
- A MMU, allowing a much simpler/faster kernel driver.
- Many more texture formats, framebuffer formats.
- All the features needed for OpenGL ES 3.2 and Vulkan 1.1
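The threading figures in the list above amount to dividing the register file between the active threads. Here is a minimal Python sketch of that arithmetic. The register file sizes and thread modes are the ones stated in the post (the vc5 modes are assumed to include the vc4 ones), not values from Broadcom documentation, and the names are made up for illustration.

```python
# Register file size per QPU, as described in the post (not official figures).
REGISTER_FILE_SIZE = {"vc4": 64, "vc5": 64, "vc6": 128}

# Thread modes per core generation, as described in the post.
# Assumption: vc5 keeps the vc4 modes and adds four thread mode.
THREAD_MODES = {"vc4": (1, 2), "vc5": (1, 2, 4), "vc6": (2, 4)}


def registers_per_thread(core: str, threads: int) -> int:
    """Return how many registers each thread sees in the given thread mode."""
    if threads not in THREAD_MODES[core]:
        raise ValueError(f"{core} has no {threads}-thread mode")
    return REGISTER_FILE_SIZE[core] // threads


for core, modes in THREAD_MODES.items():
    for threads in modes:
        regs = registers_per_thread(core, threads)
        print(f"{core}: {threads} thread(s) -> {regs} registers each")
```

The point of the larger vc6 register file is visible here: four thread mode on vc6 gives each thread as many registers as two thread mode did on vc4.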
I've also noticed that the driver appears to be claiming to have just 8 QPUs (compared to the 12 QPUs of previous RPis). Given that the theoretical max FLOPs per QPU per cycle is the same, it looks like the theoretical max FLOPs has actually dropped from the Pi 3+ to the Pi 4: 12 x 2 x 400MHz = 9.6 GFLOPs, 8 x 2 x 500MHz = 8.0 GFLOPs.
I'm suspicious there is something wrong with my math, or the driver is not reporting the correct number of QPUs (vc6 does add multi-gpu-core support, each with their own set of QPUs, but the driver only appears to be reporting one core with 8 QPUs). My Pi 4 hasn't arrived yet so I haven't tested myself.
However, I wouldn't be too surprised if the supporting hardware has been improved enough to extract more of the theoretical QPU performance into actual realised performance, allowing rendering performance improvements with fewer QPUs.
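The back-of-envelope numbers above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic in the post: it assumes two floating point operations per QPU per cycle (one per ALU) and takes the QPU counts and clocks as quoted. The function name is made up for illustration.

```python
def peak_gflops(qpus: int, clock_mhz: float, flops_per_qpu_per_cycle: int = 2) -> float:
    """Theoretical peak GFLOPs: QPUs x FLOPs per QPU per cycle x clock."""
    return qpus * flops_per_qpu_per_cycle * clock_mhz / 1000.0


# Figures quoted in the post: 12 QPUs at 400MHz versus 8 QPUs at 500MHz.
pi3 = peak_gflops(qpus=12, clock_mhz=400)
pi4 = peak_gflops(qpus=8, clock_mhz=500)
print(f"Pi 3: {pi3:.1f} GFLOPs, Pi 4: {pi4:.1f} GFLOPs")
```

This is a peak figure only; it says nothing about how much of it the surrounding hardware lets a real workload reach.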
Re: Pi 4 - full specification of VideoCore 6
"Instructions for packed 8 bit int math has been dropped, along with most of the pack modes."
Speaking from a point of ignorance, does it seem like all of the 8bit int modes are gone? (Not knowing which ones were there in the first place, but being mildly curious due to ML)
Re: Pi 4 - full specification of VideoCore 6
Technocolour wrote: (Not knowing which ones were there in the first place, but being mildly curious due to ML)
vc4 had v8adds, v8subs, v8muld, v8min and v8max, which operated on four 8-bit uint values packed into a 32-bit register. Multiplication was in the range 0.0 to 1.0 and addition/subtraction saturated. There was also a range of unpacking/packing modes that allowed you to pack and unpack 8-bit values into 32-bit registers.
They're clearly designed for operating on 8-bit color; I'm not sure how useful they will be for machine learning. Isn't that usually done with 8-bit floats?
My understanding is that packed uint math was less than useful for GLSL workloads, as shaders typically required more than 8 bits of precision on color components. Even OpenGL ES 2.0 required 9 bits for its lowp format.
For a shader compiler to actually output v8adds/v8subs/v8muld/v8min/v8max instructions, it would have to prove that the operations only needed 8 bits of precision.
But the use of packed 8-bit ints was mandatory for fragment shaders, as they had to write to the tile buffer in 8-bit packed format. vc4 fragment shaders spent 3 or 4 instructions packing the result into a packed 32-bit vector, one component at a time, before writing to the tile buffer.
With vc5/vc6, you write two packed 16f values to the tile buffer (or four writes of 32f, if you are using the rgba32f framebuffer). And there is a handy vfpack operation which allows you to pack two f32s into a single 32-bit value in a single instruction. You can vfpack directly into the tile buffer register.
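To make the packed 8-bit behaviour concrete, here is a small Python emulation of a saturating add over four 8-bit lanes in one 32-bit word, in the spirit of v8adds as described above. It is an illustration only: the lane order (lane 0 in the low byte) and the helper names are assumptions, and it does not model the real QPU encoding.

```python
def pack_u8x4(a: int, b: int, c: int, d: int) -> int:
    """Pack four 8-bit unsigned values into one 32-bit word (lane 0 in the low byte)."""
    lanes = (a, b, c, d)
    if any(not 0 <= v <= 255 for v in lanes):
        raise ValueError("each lane must fit in 8 bits")
    return a | (b << 8) | (c << 16) | (d << 24)


def unpack_u8x4(word: int) -> tuple:
    """Split a 32-bit word back into its four 8-bit lanes."""
    return tuple((word >> (8 * i)) & 0xFF for i in range(4))


def v8adds(x: int, y: int) -> int:
    """Per-lane add that clamps at 255 instead of wrapping around."""
    lanes = [min(p + q, 255) for p, q in zip(unpack_u8x4(x), unpack_u8x4(y))]
    return pack_u8x4(*lanes)


result = v8adds(pack_u8x4(200, 10, 255, 0), pack_u8x4(100, 20, 1, 0))
print(unpack_u8x4(result))
```

Lanes 0 and 2 would overflow an 8-bit value, so they clamp to 255, which is what you want for color channels.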
Re: Pi 4 - full specification of VideoCore 6
Why thank you for your lovely expose!
And, yes, people use byte sized ints for ML, and even nibbles!
https://devblogs.nvidia.com/nvidia-turi ... -in-depth/
https://www.intel.ai/introducing-int8-q ... -openvino/
Turing Tensor Cores
Tensor Cores are specialized execution units designed specifically for performing the tensor / matrix operations that are the core compute function used in Deep Learning. Similar to Volta Tensor Cores, the Turing Tensor Cores provide tremendous speed-ups for matrix computations at the heart of deep learning neural network training and inferencing operations. Turing GPUs include a new version of the Tensor Core design that has been enhanced for inferencing. Turing Tensor Cores add new INT8 and INT4 precision modes for inferencing workloads that can tolerate quantization and don’t require FP16 precision.
But sure, there was a well-known 2018 IBM paper that used 8-bit floats. And iirc Qualcomm's Hexagon DSPs are optimized around 16-bit floats.
Re: Pi 4 - full specification of VideoCore 6
For learning people use 16-bit floats. 8-bit floats are not enough, except at IBM. There are other 8-bit formats, such as unums and posits, that look promising, but where's the hardware? At any rate, it's not machine learning without the learning.
Any news on the number of QPUs?
Re: Pi 4 - full specification of VideoCore 6
Each QPU has 2 ALUs, so Raspberry Pi 3's 12 QPU graphics processor actually has 24 ALU x 2 x 0.4GHz = 19.2GFLOPs; with a clock of 0.5GHz, you get the advertised 24GFLOPs.
About the number of QPUs reported for VideoCore 6: there are only 8 QPUs per register. With the Raspberry Pi 3, you'd have 2 registers, the first with 8 QPUs and the second with 4 QPUs. If someone didn't know this and checked for QPUs, they would come up with only 8 at any one time. That is likely what happened here. Because this is a 28nm chip with a known CPU (the A72), the board simply draws too much power to have such a small GPU, IMO. Just looking at other SoCs, the power draw here would suggest around ~100GFLOPs (give or take 25%). The Nvidia Nano, for instance, uses the less efficient A57 quad core clocked at 1.43GHz with a 235GFLOPs GPU (0.921GHz) and draws 10 watts on 20nm (very similar power draw to the low-power 28nm process node), while the Raspberry Pi 4 draws ~7.5 watts.
Re: Pi 4 - full specification of VideoCore 6
The only number I have to hand is that the VC6 on the 2711 can hit 2.4GPixels/s. IIRC.
Principal Software Engineer at Raspberry Pi Ltd.
Working in the Applications Team.
Re: Pi 4 - full specification of VideoCore 6
@jamesh
Thanks!
z0m3le wrote: ↑Mon Jul 08, 2019 12:40 pm
Each QPU has 2 ALUs, so Raspberry Pi 3's 12 QPU graphics processor actually has 24 ALU x 2 x 0.4GHz = 19.2GFLOPs, with a clock of 0.5GHz, you get the advertised 24GFLOPs.
About the number of QPUs reported for Videocore 6, there is only 8 QPUs per register, with the Raspberry Pi 3, you'd have 2 registers the first with 8 QPUs and the second with 4 QPUs. If someone didn't know this and checked for QPUs, they would come up with only 8 at any one time. That is likely what happened here; because this is a 28nm chip with a known CPU (the A72), the board simply draws too much power to have such a small GPU IMO. Just looking at other SoCs, the power draw here, would suggests around ~100GFLOPs (give or take 25%), the Nvidia Nano for instance, uses the less efficient A57 quad core clocked at 1.43GHz, with a 235GFLOPs GPU (0.921GHz) and draws 10 watts on a 20nm (very similar power draw to the low powered 28nm process node) while the Raspberry Pi 4 draws ~7.5 watts.
That's one way to think, here's another.
Looking around the web, an A72 is about 3x the size of an A53.
A naive calculation of a 40nm => 28nm jump, (40/28)^2, suggests that you'd see a 2x transistor count, assuming the same die area.
The Pi 4 is slightly more power hungry than the Pi 3, but parts of the Pi 4 uncore are significantly faster (DRAM, PCIe, Ethernet) and the CPU complex should be about 3x bigger, or 1.5 times once you scale with the node switch (the L2 cache is increased from 512kB to 1MB, so that keeps its die size).
The Pis are generally cost constrained, so, again being naive about things and assuming that the chips are about the same size, the silicon wafers are the same size etc., you wouldn't expect the total die area to grow "significantly". And since the A53 => A72 switch isn't "paid" for by the 40 => 28nm switch, I'd naively expect the VideoCore to be about the same size, and not extend the number of ALUs significantly, or at all.
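The naive scaling argument above can be written out in Python. This is a rough sketch of the same first-order reasoning and nothing more: it assumes density scales with the square of the node name ratio, and it takes the "A72 is about 3x an A53" figure from the post as given.

```python
def density_gain(old_nm: float, new_nm: float) -> float:
    """Naive transistor density gain from a node shrink: (old / new) squared."""
    return (old_nm / new_nm) ** 2


gain = density_gain(40, 28)

# Figure from the post: an A72 is roughly 3x the size of an A53 on the same node.
a72_vs_a53_same_node = 3.0

# Relative die area of the CPU complex after moving from 40nm to 28nm.
cpu_area_after_shrink = a72_vs_a53_same_node / gain

print(f"density gain 40nm -> 28nm: {gain:.2f}x")
print(f"CPU complex area relative to the old one: {cpu_area_after_shrink:.2f}x")
```

Under these assumptions the shrink roughly doubles density, which covers only part of the 3x CPU growth and leaves the CPU complex at around 1.5 times its old area.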
Re: Pi 4 - full specification of VideoCore 6
Power consumption is about 3x what the Raspberry Pi 3B's is, right? (~2.5 watts vs 7.5 watts)
Why do we assume the die size is the same? I haven't seen any measurements or anyone taking off the heat spreader. I'm looking for any avenue to discover the configuration of the VideoCore 6, but the only clues I believe we have are performance (GPU about twice as fast so far) and power consumption (about 3 times as much). The problem with performance is that the drivers for the GPU are really shot, so all we can really look at from the outside is power draw.
Re: Pi 4 - full specification of VideoCore 6
Where is your extra x2 coming from?
I had 12 QPUs x 2 ALUs x 0.4ghz = 9.6 GFLOPs which already accounts for the fact that each QPU has two ALUs.
z0m3le wrote: About the number of QPUs reported for Videocore 6, there is only 8 QPUs per register, with the Raspberry Pi 3, you'd have 2 registers the first with 8 QPUs and the second with 4 QPUs. If someone didn't know this and checked for QPUs, they would come up with only 8 at any one time.
I'm not sure which register you are talking about.
On vc4 the register V3D_IDENT1 contains the information about the number of QPUs. Bits 7:4 specify the number of slices (3 slices) and bits 11:8 specify the number of QPUs per slice (4 QPUs per slice). Multiply these two fields together and you get 12 QPUs.
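The field extraction described above can be sketched in Python. The bit positions are the ones given in the post; the example register value is made up for illustration (only those two fields are set), so it is not a dump from real hardware.

```python
def decode_ident1(value: int) -> tuple:
    """Decode slice and QPU counts from a V3D_IDENT1-style register value.

    Bits 7:4 hold the number of slices, bits 11:8 the QPUs per slice.
    Returns (slices, qpus_per_slice, total_qpus).
    """
    slices = (value >> 4) & 0xF
    qpus_per_slice = (value >> 8) & 0xF
    return slices, qpus_per_slice, slices * qpus_per_slice


# Hypothetical value with 3 slices and 4 QPUs per slice; all other bits zero.
example = (4 << 8) | (3 << 4)
print(decode_ident1(example))
```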
Sadly we don't have the same level of documentation for vc6 yet. But we do have the kernel driver, written by someone who had access to internal Broadcom documentation. It specifies the same fields in its headers, and has a handy debug function that prints out the ident registers.
Output:
Code: Select all
sudo cat /sys/kernel/debug/dri/0/v3d_ident
Revision: 4.2.14.0
MMU: yes
TFU: yes
TSY: yes
MSO: yes
L3C: no (0kb)
Core 0:
Revision: 4.2
Slices: 2
TMUs: 2
QPUs: 8
Semaphores: 0
BCG int: 0
Override TMU: 0
And that's where I'm getting the 8 QPU number from. And I'd be kind of surprised if it was wrong, because mesa also uses that number to allocate the correct amount of stack space for the QPUs to spill to, and if you allocate too little, you will get crashes.
I was wondering if it was showing up as two separate GPUs, each with 8 QPUs, as /dev/dri/card0 and /dev/dri/card1 both show up on a Raspberry Pi 4.
But my Pi 4 has arrived now and it looks like card0 is the v3d driver which has no way to output to screen, and card1 is the old vc4 driver, which is just there to output to HDMI.
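For anyone wanting to check the QPU count on their own board, here is a small Python sketch that pulls the per-core QPU count out of text in the format shown above. The sample text is copied from the output in this post; the function name is made up, and reading the real debugfs file (which needs root) is left to the caller.

```python
import re

# Output of /sys/kernel/debug/dri/0/v3d_ident as posted above.
SAMPLE = """\
Revision: 4.2.14.0
MMU: yes
TFU: yes
TSY: yes
MSO: yes
L3C: no (0kb)
Core 0:
Revision: 4.2
Slices: 2
TMUs: 2
QPUs: 8
Semaphores: 0
BCG int: 0
Override TMU: 0
"""


def qpus_per_core(text: str) -> dict:
    """Map each core index to the QPU count listed under its 'Core N:' header."""
    counts = {}
    core = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = re.fullmatch(r"Core (\d+):", line)
        if header:
            core = int(header.group(1))
            continue
        qpus = re.fullmatch(r"QPUs:\s*(\d+)", line)
        if qpus and core is not None:
            counts[core] = int(qpus.group(1))
    return counts


counts = qpus_per_core(SAMPLE)
print(counts, "total:", sum(counts.values()))
```

If a chip did report several cores, each would get its own entry, so the total would show up directly.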
z0m3le wrote: That is likely what happened here; because this is a 28nm chip with a known CPU (the A72), the board simply draws too much power to have such a small GPU IMO. Just looking at other SoCs, the power draw here, would suggests around ~100GFLOPs (give or take 25%), the Nvidia Nano for instance, uses the less efficient A57 quad core clocked at 1.43GHz, with a 235GFLOPs GPU (0.921GHz) and draws 10 watts on a 20nm (very similar power draw to the low powered 28nm process node) while the Raspberry Pi 4 draws ~7.5 watts.
Power draw is really not the best way to calculate the size of a GPU in a SoC.
Re: Pi 4 - full specification of VideoCore 6
Fortunately (or not, I suppose), the power figures haven't changed that much from the Pi 3B+.
https://www.tomshardware.com/reviews/ra ... ,6193.html
And the die area thing relates to cost, since any RPi will be cost constrained. Now, different nodes come with different cost per mm^2, yields, what sort of wafers they use and so on, so I will agree that it's a complex matter. But for a first order handwavy analysis, assuming that the cost per mm^2 stays kinda constant seems ok'ish.
https://www.icknowledge.com/news/Techno ... evised.pdf
Re: Pi 4 - full specification of VideoCore 6
Huh, yeah I was definitely wrong about the power draw from the RPi3 board. Not sure why it's drawing so much power compared to other 28nm SoCs with A57/A72 quad core CPUs, but at least the power draw makes some sense coming from Raspberry Pi 3's power draw.
phiren wrote: Where is your extra x2 coming from?
I had 12 QPUs x 2 ALUs x 0.4ghz = 9.6 GFLOPs which already accounts for the fact that each QPU has two ALUs.
Each ALU typically has 2 floating point operators, and as you pointed out in an earlier post, VideoCore 6 is no exception, with both a multiply and an additive floating point operator. Thus theoretical GFLOPs are calculated with both operators in mind. That is what the 2 in my formula represents, and it is common across any modern programmable shader, whether you calculate Nvidia, AMD, Intel, Broadcom or any other company's GPUs. Total ALUs * 2 * GHz clock = GFLOPs. In the case of the Raspberry Pi 3, it's 24 ALUs * 2 operators * 0.4GHz = 19.2GFLOPs.
If the Videocore 6 does indeed only have 16 ALUs (16 * 2 * 0.5GHz), you'd have only 16GFLOPs, this is hard to believe since the raw output seems to be double, even with bad drivers, and official claims is about 3 to 4 times the performance of a full blown Videocore 4 GPU, which is actually 32 ALUs * 2 * 0.5GHz for 32GFLOPs.
Re: Pi 4 - full specification of VideoCore 6
Does the chip also support higher frequencies?
In theory, 4K at 60Hz is roughly the same pixel rate as, for instance, 1440p at 120Hz.
I know that in the past the reverse was possible: 4K would work if the framerate was set low enough.
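To put rough numbers on that comparison, here is a quick Python sketch. It assumes 3840x2160 for "4K" and 2560x1440 for "1440p", and counts active pixels only (no blanking intervals), so it is a ballpark figure rather than an actual HDMI link calculation.

```python
def pixel_rate(width, height, refresh_hz):
    """Active pixels per second for a given mode (blanking ignored)."""
    return width * height * refresh_hz


uhd_60 = pixel_rate(3840, 2160, 60)    # assumed "4K 60Hz" mode
qhd_120 = pixel_rate(2560, 1440, 120)  # assumed "1440p 120Hz" mode

print(f"4K @ 60Hz:     {uhd_60 / 1e6:.1f} Mpixel/s")
print(f"1440p @ 120Hz: {qhd_120 / 1e6:.1f} Mpixel/s")
print(f"ratio: {qhd_120 / uhd_60:.2f}")
```

Under those assumptions 1440p at 120Hz needs somewhat fewer pixels per second than 4K at 60Hz, so the two are in the same ballpark but not identical.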
Re: Pi 4 - full specification of VideoCore 6
Ah. No.
The QPUs can do two floating point operations per cycle: one in the add ALU and one in the multiply ALU. That's why we say it has two ALUs. But each ALU can only do one operation per cycle. The add ALU doesn't have a multiplier and can't multiply; the multiply ALU doesn't have an adder and can't add.
You are correct that in modern GPUs you can multiply by the total number of operations that can be done in a single cycle, to the point that an ALU with a fused multiply/add (FMA) operation is commonly counted as two operations. But vc4 doesn't have FMA and I don't think vc6 has it either.
But for VideoCore, you either multiply the number of QPUs by two to get the number of ALUs, or you multiply by two for the operations per cycle. Not both.
z0m3le wrote: If the Videocore 6 does indeed only have 16 ALUs (16 * 2 * 0.5GHz), you'd have only 16GFLOPs. This is hard to believe since the raw output seems to be double, even with bad drivers, and the official claim is about 3 to 4 times the performance of a full blown Videocore 4 GPU, which is actually 32 ALUs * 2 * 0.5GHz for 32GFLOPs.
You are conflating theoretical FLOPs with actual realistic performance.
It is 100% possible for vc6 to get 3-4 times the rendering performance with fewer theoretical FLOPs, simply by having much better utilisation of those FLOPs. My understanding is that vc4 spent a lot of time executing NOPs or stalled waiting for texture accesses.
Re: Pi 4 - full specification of VideoCore 6
I asked around, and most of the above calcs look good, BUT it should be noted that the VC4 GPU was run at 300MHz for the majority of 3D operations, not 400MHz. So actually, the move is from 300 to 500, not 400 to 500.
There are also a lot of tweaks over the entire system. The original VC4 GPU is now quite old, and some of the stuff in it was best guess at the time. The designers now know a lot more about the whole system, and there have been lots of tweaks throughout to improve throughput.
Principal Software Engineer at Raspberry Pi Ltd.
Working in the Applications Team.
Re: Pi 4 - full specification of VideoCore 6
z0m3le wrote: Mon Jul 08, 2019 10:15 pm
Power consumption is about 3x what the Raspberry Pi 3B's is, right? (~2.5 watts vs 7.5 watts)
Why do we assume the die size is the same? I haven't seen any measurements or anyone taking off the heat spreader. I'm looking for any avenue to discover the configuration of the VideoCore 6, but the only clues I believe we have are performance (GPU about twice as fast so far) and power consumption (about 3 times as much). The problem with performance is that the drivers for the GPU are really shot, so all we can really look at from the outside is power draw.
Believe me, the die sizes of the 2837 and the 2711 are really, really close.
It cost me $35 + $50......
Re: Pi 4 - full specification of VideoCore 6
phiren wrote: Fri Jul 05, 2019 3:54 pm
Instructions for packed 16bit float math have been added (2 floats in a single operation)
Hmm, instant double performance?
How difficult would it be to make a simple GPU example which makes use of those half-precision floating point operations?
OpenGL can use it, did not know GIMP can too.
Question) Is the Mesa driver using 16bit floats?
I'm dancing on Rainbows.
Raspberries are not Apples or Oranges
Re: Pi 4 - full specification of VideoCore 6
So I was digging around a bit (for VPU documentation; I'll spare you and your sanity the further details). In doing so I stumbled upon the following, and I wonder where the following calculation comes from?
https://github.com/hermanhermitage/vide ... 5-Overview
GPU and 3d Pipeline
Low power, high performance OpenGL-ES® 1.1/2.0 VideoCore GPU. 1 Gigapixel per second fill rate.
24 Gigaflops of floating point performance (3x4x8x250MHz)
Which is all fine enough if you have a quad SIMD using 4 32bit ALUs for add and mul respectively. But looking at what phiren mentions, this is all hogwash and a QPU operates on its quads over four cycles?
Is this a classic example of why you never should trust random sources on the internet?
Re: Pi 4 - full specification of VideoCore 6
No, nope. The correct theoretical performance of the GPUs is as follows:
VideoCore IV @ 250MHz: 250 [MHz] x 3 [slice] x 4 [qpu/slice] x 4 [processor] x 2 [op/clock] = 24 Gflop/s
VideoCore IV @ 300MHz: 300 [MHz] x 3 [slice] x 4 [qpu/slice] x 4 [processor] x 2 [op/clock] = 28.8 Gflop/s
VideoCore VI @ 500MHz: 500 [MHz] x 2 [slice] x 4 [qpu/slice] x 4 [processor] x 2 [op/clock] = 32 Gflop/s
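The three lines above all follow the same product, which can be checked with a short Python sketch. The function and parameter names are mine; the factors (slices, QPUs per slice, 4 processors per QPU, 2 ops per clock) are taken directly from the figures in this post, and the result is a theoretical peak, not a measured number.

```python
def peak_gflops(clock_mhz, slices, qpus_per_slice, processors=4, ops_per_clock=2):
    """Theoretical peak in Gflop/s from the factors listed in the post."""
    mflops = clock_mhz * slices * qpus_per_slice * processors * ops_per_clock
    return mflops / 1000.0


configs = [
    ("VideoCore IV @ 250MHz", 250, 3, 4),
    ("VideoCore IV @ 300MHz", 300, 3, 4),
    ("VideoCore VI @ 500MHz", 500, 2, 4),
]
results = {name: peak_gflops(mhz, s, q) for name, mhz, s, q in configs}
for name, gflops in results.items():
    print(f"{name}: {gflops:.1f} Gflop/s")
```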