
[oc] Intel Hyper-threading and CPU design, threads etc...



Jim,

Surely people have already done this "two independent processors on one
chip"? I've already expressed my wish to load four copies of a processor core
into one FPGA; I'm just waiting for Damjan to tell me that the OpenRISC
core is small enough to fit two on a 200K Spartan II. Of course the problem
still comes down to concurrent memory access, but even that can be almost
eliminated very cheaply with a little thought. The idea won't guarantee
every CPU 100% zero-wait-state service, but I would guess at around 90+%
efficiency per processor for up to, say, 16 processors. I ran a test a while
back on a 4-way Xeon box with multithreading, then took three processors out
and ran the same test. The four processors completed the tasks 2.4 times
quicker than the single processor, so roughly 40% of clock cycles are lost
to memory conflicts. When I (evil me) rewrote the test software to be cruel,
with every processor hitting the same memory block, alternately reading from
and writing to it, the four processors managed only 1.8 times the
single-processor speed, a loss of 55% of clock cycles there. The 4-CPU box
with the extra processors was over 10 times the price of a single-CPU box,
great value for money there! lol
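The arithmetic behind those loss figures can be sketched as follows (a
minimal Python sketch; `efficiency_loss` is just an illustrative helper I
made up, not part of the test software):

```python
def efficiency_loss(speedup, n_cpus):
    """Fraction of aggregate clock cycles lost to contention:
    perfect scaling would give a speedup of n_cpus, so the loss
    is 1 - (actual speedup / n_cpus)."""
    return 1.0 - speedup / n_cpus

# Normal workload on the 4-way box: 2.4x over one CPU -> ~40% lost
print(round(efficiency_loss(2.4, 4) * 100))  # 40

# Every CPU hammering the same memory block: 1.8x -> ~55% lost
print(round(efficiency_loss(1.8, 4) * 100))  # 55
```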

The alternative is for people to use the transputer idea: each CPU has
local memory, and jobs are passed to it via messages on its high-speed
comms links. The Inmos T400 and T800 are probably the most famous examples,
but this requires people to sit down and break the actual problem up into
concurrent tasks. Too much mental effort for some.

Computer hardware development seems to be pushed by software sloppiness. I
wait a whole minute for Windows 98 to load up on an Athlon 1100, yet when I
use an old program on Win3.1, Windows loads in less than 3 seconds on an old
PII/300 that is no longer fast enough to play new game releases on. Think
back to the 8 floppies that Windows 3.1 came on and compare that to the
gigabyte-plus a Windows 2000 installation takes up; is it really worth the
extra space and loss of speed? I read the other day that over 40% of all
critical databases still run on DOS systems, some for over a decade without
a software crash. I won't mention the MS registry (the greatest step
backwards in software history).

We do need smarter processors, but surely they would work better with
smarter instructions too? I've got a lot of CPU enhancement ideas, but I
have to concentrate on Buffy-C at the moment so can't implement them. I'll
give you a quick demonstration, though.

Old Way:

	CMP AX,$2000
	JC  #4000

Two instructions, and two clock cycles, to do that? Can't we just have a
CMPJxx instruction? It would take two value/register combinations and,
based upon the result of the comparison, jump to the destination or not.
This might also eliminate quite a few status bits, which could make the ILP
much easier. The first CPU core I design will use only intelligent 64-bit
instructions rather than two or three 32-bit instructions. Memory is cheap,
so whilst my programs could be two or even four times bigger than other
people's, they will execute faster, which is the goal after all. The
instructions normally seen on RISC cores will be optimised for peak
performance, with some CISC ones there for sloppy coders if they are
willing to wait the extra clock cycles for them.
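To make the idea concrete, here is a hypothetical "New Way" in the same
style as the snippet above (CMPJC is my proposed compare-and-jump-if-carry
mnemonic, not an existing x86 instruction):

New Way:

	CMPJC AX,$2000,#4000

One instruction: compare AX with $2000 and jump to #4000 if the comparison
would set carry, with no status flags left behind for the ILP logic to
track.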

Almost 6am so I have to go work on Buffy-C now.

Paul

-----Original Message-----
From: owner-cores@opencores.org [mailto:owner-cores@opencores.org]On
Behalf Of Jim Dempsey
Sent: 08 December 2001 01:12
To: cores@opencores.org
Subject: Re: [oc] Re: Merlin Hybrid System


From Intel (http://developer.intel.com/technology/hyperthread/):

"Intel® Corporation introduced its Hyper-Threading Technology at the Fall
2001 Intel Developer Forum. Hyper-Threading Technology will enable the world
's first simultaneous multi-threaded (SMT) processor. Today's processor
exploits Instruction Level Parallelism (ILP), but mutually exclusive
hardware resources exist. However, by developing an architecture state for
two processors which share a single physical processor's resources, two
programs or threads can execute simultaneously. Thus, one physical processor
looks like two logical processors to the OS and applications."

What I am talking about is something completely different.

The technology I am talking about will use multiple processors that give
the operating system, and the applications running thereon, the appearance
of running on one processor. But in reality, what would normally be viewed
as a single thread running on one processor is in fact running in
fragments, concurrently, on multiple processors.

More from letter written to Transmeta...

When examining Multi-Processor implementations on the dominant Intel-based
systems, work is distributed amongst the processors in units called
threads.

When examining Single-Processor implementations on the dominant Intel-based
systems, you can view the system as an n-processor system where n equals 1.
On a 1-processor system only one thread can be processing at a time, and
during interrupt and other kernel-mode processing no processors are
available for thread processing.

When examining a Single-Processor operating system on an n-processor
system, only one processor is available for both the threads and the
system, e.g. Windows 95 on a dual-processor system.

When examining an application written for one thread running on an
n-processor system, at most one processor can be used for that thread's
processing.

When examining an application written using n threads, at most n processors
can be used for this application (assuming there are n available); failing
that, n-1 processors are available for this application's processing.

Additionally, the overhead of starting and stopping threads, combined with
the increased complexity of code written specifically for multi-threaded
use, makes it cumbersome at best and ineffective at worst to make full use
of the processing capability available. This is a situation of
"diminishing returns".
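The start/stop overhead is easy to see for yourself; a minimal sketch (in
Python for brevity — the letter is about native threads, but the effect is
the same in any language):

```python
import threading
import time

def noop():
    pass

N = 200

# N create/start/join cycles, each carrying a trivial work unit.
t0 = time.perf_counter()
for _ in range(N):
    th = threading.Thread(target=noop)
    th.start()
    th.join()
thread_time = time.perf_counter() - t0

# The same trivial work done by plain function calls.
t0 = time.perf_counter()
for _ in range(N):
    noop()
call_time = time.perf_counter() - t0

# Spawning a thread costs orders of magnitude more than a call,
# so work units must be coarse enough to amortise the overhead.
print(f"per-thread overhead: {(thread_time - call_time) / N * 1e6:.1f} us")
```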

In particular, when an operating system's or application's target system
has a single processor, there is no payback in writing the code for use
with multiple processors.

Software vendors have a strong disincentive to make a single-threaded
application multi-threaded, or to give a multi-threaded application finer
thread granularity (more threads). Added to this, the nature of current
(Intel) SMP design is such that the multiple processors are attached to a
single global memory system. Although each processor has a local cache,
main-memory I/O becomes saturated with a relatively low number of
processors (4-8).

The description above represents the classic "immovable object".

Problems to overcome

- Applications that are written with one thread
- Applications that are written with a very small number of threads (2-4)
- Thread context switch overhead
- Memory I/O saturation (Read/Write)
- Unwillingness to rewrite the applications to run on a new configuration
- Unwillingness to rewrite the operating system to run on a new
configuration
- Unwillingness to add compiler optimizations for multi-thread conversion

Software vendors will not convert a single threaded application to a
multi-threaded application if most of their customers are on single
processor systems.

Software vendors are not willing to rewrite applications to take advantage
of new processor technology when their install base is on old processor
configurations.

Compiler writers are not willing to write new optimizing compilers for
non-industry leading processor technology.

Operating system writers are not willing to write to a new multi-processor
design.

The above declarations become a circular argument of why things cannot
change.

The Magic Bullet

The new venture has a magic bullet that solves these problems. The magic
bullet can go through the "immovable object" and around it.

- No change in operating system software
- No change in application software
- No change in compiler design
- Reduction in memory R/W

About 25 years ago Mr. Dempsey developed a minicomputer operating system
named OMNI. About 15 years ago he was involved with a computer manufacturer
in the design of a multi-processor computer to run a multi-processor version
of this operating system. The design of this system included the invention
of an instruction set that permits a very fine-grained work unit
distribution. Overhead is so low that work units of even a few instructions
are feasible. This technology combined with my new process and your
Transmeta processor will produce a product where

- A single-processor operating system (e.g. Windows 9x) will run using
multiple processors, and do so with no code change to the operating system
or applications.
- A single-threaded application will run using multiple processors, and do
so with no code change to the application.
- A multi-processor operating system artificially restricted to run on few
processors (e.g. NT4 Workstation) can run on many processors with no code
change to the operating system.
- An n-threaded application can efficiently utilize more than n processors.
- The shared-memory Read/Write saturation of current SMP designs is
reduced, thus permitting an increase in the number of attached processors.

Jim Dempsey


----- Original Message -----
From: "David Feustel" <dfeustel@mindspring.com>
To: <cores@opencores.org>
Sent: Friday, December 07, 2001 1:45 PM
Subject: Re: [oc] Re: Merlin Hybrid System


> Are the participants in this thread familiar with
> Intel's hyperthreading as implemented (but not
> yet turned on) in the P4?
>
>
>
> --
> To unsubscribe from cores mailing list please visit
http://www.opencores.org/mailinglists.shtml
>
