IoT : Internet of Things and Market Segmentation

August 6, 2015

As a television ad once put it, we live in a modern-day society where we want to:

Share this. Share that.

How can we do that?

The current millions of smartphones (such as iPhones) can easily share voice and text conversations as well as photos.

But what’s next?

We build intelligence into devices to allow remote monitoring and other sensing applications, so that we can share even more information from devices like washing machines, refrigerators, lights, sprinklers, home security systems, and even smartwatches that monitor health conditions in real time.

This growing network of interconnected devices will expand the internet bandwidth requirements.

This new market for the intelligent internet of things (IoT) is generally segmented into the following categories. The top two already exist; the remainder are in design, in production, or still need to be built. They are ordered in roughly decreasing complexity:

  • Servers / Routers
  • Smartphones / Tablets / Home PCs / Laptops
  • Wearable infotainment
  • Wearable Fitness and Health
  • Smart Home
  • Smart Appliance
  • Safety and Security
  • Smart City / Metering
  • Commerce

Reference the Synopsys analysis of the IP components needed to build these IoT devices.


The high end of the Internet of Things is occupied by smartphones, tablets, home PCs, and laptops, and an example SOC diagram would look like the one below, with CPU, display, LPDDR, and sensors.  These usually use the latest technology process nodes, like 10nm FinFET, to make billion-transistor devices economical, and probably cost between $20 and $100 per device.



The low-end SOC for IoT has a significantly reduced IP set but maintains bare functional and communication capabilities, such as a CPU, bluetooth, and a sensor.  These would probably go into smart metering devices, use 28nm process node technology, and cost around $5 to $20.



The bottom-tier SOC for IoT can be a bare minimum of a CPU, some local RAM for storage, and a sensor.  To be economically feasible and produced in the billions, it can probably use older process nodes like 90nm and must cost $1 or less to manufacture.  These could be added as tracking devices for very expensive items shipped across international borders.


UVM Tutorial 3: Systemverilog Testbench Principles

March 6, 2015

UVM Tutorial 2 : Basic Building Blocks

February 25, 2015

UVM is generally built on a few basic blocks:

  • DUT (Device Under Test)
  • Interface (An interface between DUT and test env)
  • Test environment using classes


Simple verilog to display “Hello World”

The module is named top, but it could just as well be “xyz”.

The “initial” statement is what kicks off the run.

///////////////////////////////// VERILOG ONLY /////////////////////////////////

module top;

  initial
    $display("Hello World !!");  // verilog only

endmodule

////////////////////////////// END VERILOG ONLY /////////////////////////

Simple systemverilog code with UVM to say “Hello World”

// Test environments usually use the name top, but one can call it anything, because the environment is kicked off by the code enclosed by “initial begin” and “end”.

///////////////////// SYSTEMVERILOG WITH UVM BEGIN ////////////////////////

module top;

  import uvm_pkg::*;
  `include "uvm_macros.svh"

  initial begin
    `uvm_info("TOP", "Hello World !!", UVM_LOW)
  end

endmodule

/////////////// SYSTEMVERILOG WITH UVM END //////////////////////////

Before we can code the equivalent systemverilog version, we need to understand how simulation phases work in systemverilog. In systemverilog, multiple steps or phases occur before the final simulation execution.  See the steps in the diagram below:

For now, we avoid a complex example and only use a few essential aspects of UVM, building up to a full-fledged understanding of UVM phases in the next section.

We need to take one step backward and then two steps forward: first understand the systemverilog point of view, and then add UVM on top of that.

See the next section, a systemverilog coding of a counter that points out the main pieces of how a systemverilog testbench is constructed, coded, and implemented: UVM Tutorial 3: Systemverilog Testbench Principles

UVM Tutorial 1: Overview

February 24, 2015

What is UVM?

UVM = Universal Verification Methodology


UVM uses the verification features of the systemverilog language to build testbenches with a specific methodology. The reason for a methodology is to enforce a common “way” or “method” of building testbenches so that reuse is maximized in a standardized way.  Anyone with knowledge of UVM doesn’t need to relearn another person’s coding style, which increases the efficiency of building verification environments.  One of the major stumbling blocks for ever-increasing billion-gate designs has been verification; hence the specific effort put into creating this Universal Verification Methodology.  Various studies have shown that verification effort can increase 1000x for a 10x increase in semiconductor transistors.  Think about it: a design with n state elements can have up to 2**n state combinations to verify.

The efforts to standardize this verification methodology were pushed forward by Mentor Graphics and Cadence in the early 2000s.  Eventually, Synopsys had to get onboard too.  See the Accellera UVM standard for more detailed information about UVM.

Think of UVM as specialized systemverilog libraries (technically, base classes and macros) that are used so that this universal verification methodology can be strictly followed and quickly adopted.

For those new to systemverilog: this language upgraded Verilog-2001 with object-oriented features to support design as well as verification.

Features of SYSTEMVERILOG Verification:

  • Functional Coverage
  • Randomization of Objects (with/without constraints)
    • rand addr_t addr
  • Methodology Libraries
    • Encapsulation of data and function
  • Object Oriented language constructs (just like C++)
    • classes
    • inheritance

Features of SYSTEMVERILOG Design (which are synthesizable):

  • Processes
    • assign statements
    • always_comb, always_ff
  • Operators
    • .name and .* (implicit port connection) operators
    • Basic logic operations
  • Datatypes and Literals
    • Logic (4 state)
    • Typedef (user defined types)
    • Enumerations
    • Structures
    • Literals
  • Interfaces
    • Generic Interface
    • Interface ports
    • Interface modports
    • Parameterized Interfaces
  • For loop, Generate (with caution)
  • Disallowed SET
    • # delays
    • Initialization
    • No tasks and functions
    • Auto increment, decrement
    • Statically unknown bounds

Smartphone and Tablet Computing Requirements

July 4, 2014

Geeky details of the hottest consumer electronics

From the iPhone/iPad to the Samsung Galaxy S5/Tab, millions of consumers have purchased these devices without knowing the geeky technical details that thousands of engineers designed in to maximize the user experience.  Some of those details are revealed below.

To drive a display, which is now typically 1080p (1920 x 1080), one needs the functional logic to supply the bandwidth:

1920 x 1080 pixels x (4 bytes/pixel) = 8,294,400 bytes ≈ 8.3 MB (megabytes)

and some other overhead features and requirements bump this up to about 9 MB per frame.


and this amount of data needs to be transferred at a refresh rate of 60 Hz, i.e., once every 1/60 second:

DISPLAY TIME REQUIREMENT = 1/60 sec (0.0167 sec)
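A minimal sketch of the frame-buffer arithmetic above (the 4 bytes/pixel and 60 Hz figures are the same illustrative assumptions as in the text):

```python
# Sketch of the 1080p frame-buffer bandwidth arithmetic (illustrative numbers).
WIDTH, HEIGHT = 1920, 1080      # 1080p resolution
BYTES_PER_PIXEL = 4             # 32-bit pixel, e.g. RGBA
REFRESH_HZ = 60                 # frames per second

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
frame_mb = frame_bytes / 1e6                    # ~8.3 MB per frame
bandwidth_mb_s = frame_mb * REFRESH_HZ          # sustained bandwidth needed
frame_time_s = 1 / REFRESH_HZ                   # ~0.0167 s per frame

print(f"frame size : {frame_mb:.1f} MB")
print(f"bandwidth  : {bandwidth_mb_s:.0f} MB/s at {REFRESH_HZ} Hz")
print(f"frame time : {frame_time_s:.4f} s")
```

Note that the real sustained requirement is the frame size times the refresh rate, i.e., roughly 500 MB/s for this configuration.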

To connect the smartphone/tablet to the cell carrier tower for web browsing, the serial bitstream runs from 512 Kbits/sec for 3G up to 10 Mbits/sec for 4G LTE.  A typical web page with images is 1 MByte, so with a 1 Mbit/sec bitstream


and it takes 

1 MByte x 8 bits/1 Byte x sec/1Mbit = 8 seconds
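The link-rate arithmetic above can be sketched as follows; the page size and bit rates are the illustrative figures from the text, and download_seconds is a hypothetical helper that ignores protocol overhead:

```python
# Sketch of the page-download arithmetic (illustrative link rates).
PAGE_BYTES = 1_000_000          # ~1 MByte typical web page with images

def download_seconds(page_bytes, link_bits_per_sec):
    """Time to pull a page over a serial bitstream, ignoring protocol overhead."""
    return page_bytes * 8 / link_bits_per_sec

print(download_seconds(PAGE_BYTES, 1_000_000))    # 1 Mbit/s          -> 8.0 s
print(download_seconds(PAGE_BYTES, 512_000))      # 3G low end        -> 15.625 s
print(download_seconds(PAGE_BYTES, 10_000_000))   # 4G LTE            -> 0.8 s
```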



Low Power Design Techniques for today’s VLSI/ULSI chips

June 25, 2014

Low Power Design Techniques (from Sorin Dobre UCSD lecture)

User experience perspective

– Active Usage Time is the time interval to perform various tasks (audio play, voice calls, web browsing, video playback and game play) between two full battery charges

– Standby Time is the time interval during which the device is ready to be activated.  Normally this means only the radio is still occasionally communicating with the local cell towers to maintain time synchronization; other functional tasks are not running. Minimizing the leakage power will maximize the standby time.

Electrical Power Efficiency

– Power consumption to perform a set of tasks relative to performance targets,

– measured in (mW / MHz or mW / Perf target like mW / MIPS ) (MIPS = millions of instructions per second)

– Average power consumption

– Peak power consumption

Power consumption in digital systems

– Ptotal = Pactive + Pleakage

– Pactive = Pinternal + Pswitching

– Pswitching = aCV**2f (a = activity factor, C = capacitive load, V = voltage, f = frequency)

– Pinternal =  power consumed when input to CMOS gate changes but output does not change
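The Pswitching term can be illustrated with a quick sketch; all component values below are made up for illustration, not measured from any real chip:

```python
# Sketch of the dynamic-power formula Pswitching = a * C * V**2 * f.
# All numbers below are illustrative, not measured values.

def p_switching(a, c_farads, v_volts, f_hertz):
    """Dynamic switching power in watts: activity * capacitance * V^2 * frequency."""
    return a * c_farads * v_volts ** 2 * f_hertz

# Example: 20% activity, 1 nF total switched capacitance, 0.8 V core, 500 MHz
base = p_switching(0.2, 1e-9, 0.8, 500e6)          # watts
print(f"baseline            : {base * 1e3:.1f} mW")

# Halving the supply voltage quarters the switching power (the V**2 term),
# which is why voltage reduction is listed first among the methods above.
low_v = p_switching(0.2, 1e-9, 0.4, 500e6)
print(f"half supply voltage : {low_v * 1e3:.1f} mW")
```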

Methods to reduce dynamic power (Pswitching)

– Reduce power supply voltage ( V )

– Reduce voltage swing in all nodes

– Reduce the switching probability (transition factor)

– Reduce load capacitance


Low power implementation in today’s VLSI/ULSI chips requires a holistic and concurrent approach
that includes collaboration and methodology between:

– System level design
– Architectural design
– Software/hardware co-design
– IP design:
  – Circuit design
  – Physical implementation of the IP
– Physical design (chip/block level):
  – Power verification and modeling
  – Silicon correlation and validation

System Optimization

Power delivery network optimization:
– On die vs. on board (PCB) voltage regulators
– Voltage regulator efficiency
– Voltage rail definition

System level power management:
– Adaptive voltage scaling (AVS)
– Dynamic clock frequency and voltage scaling (DCVS)
– Static voltage scaling (SVS)

Analog vs. digital processing system level optimization:
Optimization at the system level with the goal of moving most of the signal processing
(data transformation) into the digital domain. Power consumption in the digital
domain scales with process technology and with system use-mode requirements.
Digitally assisted analog processing

Architectural Optimization

Memory hierarchy
On die vs. off die memory
Cache size (miss penalty)
Cache hierarchy (architecture)
Address space definition

Processor architecture
Von Neumann vs. Harvard
VLIW (high IPC)
16-bit, 32-bit, 64-bit instruction architecture (IA) (code compression)
In-order vs. out-of-order execution
Superscalar implementation
Multi-thread implementation
Scalability: single core vs. multi core
Application specific IA optimization
– FFT cores
– Multipliers, adders, shifters

Hardware accelerators:
Graphic 2D, 3D
Video encoder/decoder (720p, 1080p, 2160p)
Multimedia display
Audio + DSP (digital signal processing unit)
Modem baseband

Bus architecture
AHB implementation (Advanced high performance bus)
Fabric (high speed, high bandwidth interconnect):
– Bandwidth
– Latency
– Power

Clocking architecture:
Frequency planning
Clock architecture
Synchronous vs asynchronous clocks

IO interfaces
Engineering system level design and optimization (ESL):
Algorithmic driven hardware implementation and optimization
System level power modeling
Hardware software co-design and optimization


Synchronizers for Asynchronous Signals

June 8, 2014

Asynchronous signals cause the big issue with clock domains, namely metastability.  This is a situation where the flip-flop in the clock domain trying to capture the asynchronous event goes into a metastable state: is the asynchronous signal at a logic “1” or a logic “0”?  For now we ignore the voltage value because metastability is independent of voltage.

Metastability cannot be prevented, but it can be reduced.  High-speed digital circuits rely on synchronizers to create a time buffer for recovering from a metastable event, thereby reducing the possibility that metastability will cause a circuit to malfunction.
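The benefit of that time buffer is usually quantified with the standard synchronizer MTBF model, MTBF = exp(t_resolve/tau) / (T0 x f_clk x f_data).  A sketch, with purely illustrative process constants (tau and T0 are technology dependent):

```python
import math

# Sketch of the standard mean-time-between-failures (MTBF) model for a
# synchronizer.  tau and T0 are process-dependent constants; every number
# below is purely illustrative.

def synchronizer_mtbf(t_resolve, tau, t0, f_clk, f_data):
    """MTBF in seconds: exp(resolution time / tau) / (T0 * f_clk * f_data)."""
    return math.exp(t_resolve / tau) / (t0 * f_clk * f_data)

# One extra flip-flop stage adds a full clock period (2 ns at 500 MHz) of
# resolution time, multiplying MTBF by exp(T_clk / tau) -- which is why
# synchronizers use two back-to-back flops.
one_stage = synchronizer_mtbf(1e-9, 50e-12, 1e-10, 500e6, 10e6)
two_stage = synchronizer_mtbf(3e-9, 50e-12, 1e-10, 500e6, 10e6)
print(f"one stage : {one_stage:.3e} s")
print(f"two stage : {two_stage:.3e} s")
```

The exponential dependence on resolution time is what makes the second flop so effective.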

EDA companies such as Synopsys, Cadence, and Mentor Graphics create software to automatically read verilog code and detect synchronization problems.  The number one rule is to NOT synchronize the same input with more than one synchronizer: the outputs of multiple synchronizers can settle to different synchronized values.

There are two basic cases for synchronizers: 1) the asynchronous signal is wider than the clock period of the synchronizer clock domain, and 2) the asynchronous signal is narrower than the clock period of the synchronizer clock domain.

Asynchronous signal > Synchronizer clock period


If designed into an ASIC (Application Specific Integrated Circuit), this synchronizer is typically put into a special library cell to keep the two back-to-back D flip-flops physically close to each other and to minimize any clock skew in the ASIC.  In addition, as a rule of thumb, this synchronizer usually has a special cell name like “sync_ss”, meaning “synchronize slow input signal”.

Verilog code for the above synchronizer:

module sync_ss (clk, async_in, reset, sync_out);

input clk, async_in, reset;

output sync_out;

reg meta, sync_out;

always @(posedge clk)
  if (reset) begin
    meta     <= 1'b0;
    sync_out <= 1'b0;
  end
  else begin
    meta     <= async_in;   // first flop may go metastable
    sync_out <= meta;       // second flop gives it a clock period to resolve
  end

endmodule

Asynchronous signal < Synchronizer clock period


Similar to the circuit above, this synchronizer is typically put into a special library cell to keep the D flip-flops and special logic close to each other for functional purposes and to minimize any clock skew in the ASIC.  In addition, as a rule of thumb, this synchronizer usually has a special cell name like “sync_fs”, meaning “synchronize fast input signal”.





VLSI/ULSI Design Specification, Design Partition, and Design Entry

May 27, 2014

Design Specification

The design flow begins with a written specification for the design.  Usually it is a list of general feature requirements and specifications related to process technology, power consumption, essential use cases, timing, silicon area (which translates to cost), testability, fault coverage, and other design specifics.

An additional practice that began in the transition from the VLSI to the ULSI era was the use of higher-level languages such as SystemC or even C++ that could be translated and synthesized into a circuit.

An example specification may be:

  • 3D graphics processor with 512 VLIW IEEE 754 floating point processors with 50-cycle structural latency
  • 1.0W typical with 0.8V core voltage in a typical 90nm process technology
  • LPDDR3 512MB with a 32-bit memory interface running at 800MHz
  • 500 MHz core clock
  • Two USB3 ports
  • 10mm2 die
  • Cost of die, package, and testing must be less than $5
  • 362-pin BGA (Ball Grid Array)
  • 98% fault coverage with a testing cost of 50 cents

Design Partition

The design partitioning process creates an architecture: various configurations of interacting functional units, such as an ARM A7 processor with AXI, Denali DDR3 with AXI and DFI interfaces, Synopsys USB 3.0, and a JTAG 1149.1 interface.  Top-down design is the process of progressively partitioning the design into smaller and simpler functional units.  Each of these design blocks can then be easily synthesized.  In the VLSI era, 100k gate instances was a typical design block.  In the current ULSI era, EDA (Electronic Design Automation) tools can handle 1000k gate instances with relative ease.

Whole design blocks are already available as specific IP, such as CPU, DDR3, and USB3.  The common interface between them is the AXI bus, and the glue that binds them together is the NOC (Network On a Chip) developed by Arteris and Sonics.



Design Entry

Design entry means expressing the partitioned functional units in a language-based description such as systemverilog (a superset of verilog) or VHDL (VHSIC Hardware Description Language).  In the VLSI era, a verilog description could support architectural exploration with a thousand lines of code.  For ULSI, however, systemverilog is needed to allow for modular and parameterizable designs.

HDL-based designs are easier to debug than schematics (hundreds to thousands of gate instances).  Documentation may also be embedded into the design via comments and descriptive signal naming, e.g., AXI_DATA_READY, AXI_REQUEST, AXI_DATA, and CLOCK.

Behavioral modeling encourages designers to rapidly create a behavioral prototype and verify its functionality, then use a synthesis tool to optimize and map the design into a selected physical technology such as TSMC 90 nm CMOS.  If the model is written in a synthesis-ready style, the synthesis tool will remove redundant logic, perform tradeoffs between alternative architectures, and eventually achieve the area and timing constraints.

Some IP (Intellectual Property), such as the NOC (Network On a Chip) by Arteris and Sonics, can directly take the design entry settings and generate behavioral models in SystemC, thus allowing for rapid architectural exploration.  For example, since the NOC is the central switch fabric connecting the various IPs together, the traffic control (bus sizing), topology, and priorities can be quickly adjusted, simulations run, and KPI (Key Performance Indicator) data collected and analyzed.  Tradeoffs can be made very early on, allowing for shorter TTM (Time To Market).



VLSI Design Methodology Flow

May 26, 2014


Very Large Scale Integration (VLSI) became a household technology acronym when various semiconductor manufacturers such as LSI Logic, Fujitsu, Samsung, IBM, Intel, and TSMC scaled thousand-transistor chips to millions of transistors.

When I was growing up in the 70’s, I thought the LED watch was the coolest thing and I wanted to get involved with it.  My uncle Joe worked at National Semiconductor during those years, and he would bring various gadgets back to sell.  They were so fascinating and useful for the general public.  Just imagine: you could read your watch at night without turning on the light bulb.  It was miniaturized into a puny half-inch-square device you could put on your wrist.

Now Samsung, and soon Apple, will introduce wristwatches that do more than just tell you the time.  These intelligent devices will have multiple sensory inputs that can report to you: barometer, temperature, heart rate, humidity, updates on favorite sports teams, and your stock portfolio.  If you had Google Glass, you would be able to see this info visually as though through your own pair of glasses.

All these types of gadgets are due to the engineering processes and flow methodologies that accelerate the command of nature through the God-given talents and skills of a man or woman’s mental abilities.

VLSI Design Methodology Flow

A methodology flow allows one to have a feedback process of continuous refinement and analysis at each step of the process.  Mistakes made in one pass of the flow are revised in the “lessons learned” step.  This usually results in another step or series of steps to reduce time, improve quality of results, improve manufacturability, reduce area and cost, improve debugging, and eventually make a product so robust that it can withstand the harshness of an automotive engine compartment, or even the year-long trip to Mars.

  1. Design Specification
  2. Design Partition
  3. Design Entry: Verilog RTL and Behavioral Modeling
  4. Simulation/Functional Verification
  5. Design Integration and Verification
  6. Presynthesis Sign-off
  7. Synthesize and Map Gate-level Netlist
  8. Postsynthesis Design Validation
  9. Postsynthesis Timing Verification (if final timing verification jump to 14.)
  10. Test Generation and Fault Simulation
  11. Cell Placement, Scan-chain and Clock Tree Insertion, Cell Routing
  12. Verify Physical and Electrical Design Rules
  13. Extract Parasitics (feedback SDF to 9.)
  14. Design Sign-off

Production Ready Masks

 ULSI Design Verification Flow

As millions quickly progressed to billions of transistors, following Moore’s law of doubling transistors every two years, power skyrocketed, thermal issues multiplied, and manufacturability was crushed under the weight of complexity.  It’s not a simple job to design, verify, validate, and manufacture a billion-transistor device.



New methodologies and flows were added to make this leap possible.

  1. Design Specification
  2. Design Partition (add PPA : Power Performance Area analysis and early DFT)
  3. Design Entry: Verilog RTL and Behavioral Modeling
  4. Simulation/Functional Verification (add formal verification)
  5. Design Integration and Verification (add system level emulation)
  6. Presynthesis Sign-off
  7. Synthesize and Map Gate-level Netlist
  8. Postsynthesis Design Validation
  9. Postsynthesis Timing Verification (if final timing verification jump to 14.)
  10. Test Generation and Fault Simulation
  11. Cell Placement, Scan-chain and Clock Tree Insertion, Cell Routing (add UPF low-power cells)
  12. Verify Physical and Electrical Design Rules (add UPF low-power verification)
  13. Extract Parasitics (feedback SDF to 9.)
  14. Design Sign-off





Computer Architecture : QOS (Quality of Service)

May 25, 2014



Just 30 years ago, in the early 1980’s, Quality of Service applied to phone calls made locally and internationally.

Local phone calls in a small town have little traffic to interfere with them, while international calls have time delays which introduce latency into the phone call.

However, local phone calls in a metropolitan city with millions of neighbors can cause the quality of service to degrade: the call may be delayed, dropped, or even disconnected due to reduced bandwidth and fluctuating priorities in the telephone switch network.  Telephone switch networks have limited bandwidth and must constantly manage phone calls.  Thus, one needs to “adjust” calls or packets according to priorities.


The same principle applies to computers, where multiple devices want access to the limited resource of memory.  The bottleneck in the computer is that everyone wants to communicate with everyone else, and that common area of access is called main memory.  Technically it’s dynamic random access memory, or DRAM for short, a type of technology that maximizes memory density per physical area.  The components of a computer need a common area in which to access data, and that area is main memory.  If there is nothing in common, then devices are not synchronized together; that’s another topic, about data coherency and semaphores.  The computer is broken down into two main blocks which manage access to main memory: the switch fabric and the arbiter.



How to solve bandwidth limitations and latency issues?

The same principle as a phone call applies to computers: we want to maximize bandwidth and minimize latency so that we can have more calls and less delay, respectively.  This process involves the mechanisms of QoS (Quality of Service).

But before we describe these mechanisms we need to understand the devices that want access to the switch fabric and main memory.  The terminology for devices requesting service is “Initiators”.  This makes sense since the device is the one wanting to make the call and initiate the dialing.

 Initiator Types

Real-time : Devices like a digital camera require real-time access to process and store a captured image.  Any limitation of bandwidth or latency could result in a slightly corrupted or distorted image.  Similarly, the display refresh for your smartphone LCD needs immediate access to memory, or else you might see portions of the display glitch within a blink of the eye.  The black dots pointed to by the red arrow are a visual artifact that shows up in one display frame, which lasts less than the blink of an eye.  This is not acceptable.  (The photo is used for non-commercial purposes.)


Latency-sensitive : The CPU works best when it is not delayed by latencies when accessing main memory; otherwise “bubbles” will occur in the processor pipeline.  Bubbles are time periods when nothing is being processed in the pipeline.


Best-effort : Devices such as portable hard drives have no adverse visual effect if the data they need to transfer does not move at maximum speed.  It just means the user may be delayed by an extra second or tens of seconds.  That is tolerable to the user.

QoS mechanisms

  • Traffic control
  • Topology
  • Priority mechanisms

Traffic control is further segmented into these concepts

  • Splitting involves breaking up the bursts from the initiators into smaller packets.  One large burst would cause delays, or increased latencies, for the other devices.
  • Shaping limits the number of packets that can be pending.
  • Pending-transaction limits cap the number of transactions that can be outstanding (a transaction can have multiple packets).


  • Concurrency involves duplicating logic in a crossbar-switch-like structure to allow parallel paths.
  • Arbitration levels can be flat, so that all paths have equal access, or tiered/layered to emphasize certain paths and reduce access to other paths.




  • Local buffers can absorb traffic bursts from the initiator and ride through periods of reduced bandwidth.
  • FIFOs are best used for bandwidth reduction, such as when a faster clock domain transfers data to a slower clock domain.
  • Rate adapters are used to avoid long wait cycles when a slower clock domain transfers data to a faster clock domain.
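The splitting concept above can be sketched as follows; the burst and packet sizes are arbitrary illustrative numbers:

```python
# Sketch of "splitting": break one large burst from an initiator into
# fixed-size packets so other devices are not starved at the fabric.
# Packet size and burst length are illustrative.

def split_burst(burst_bytes, max_packet_bytes):
    """Split a burst into packet sizes no larger than max_packet_bytes."""
    packets = []
    remaining = burst_bytes
    while remaining > 0:
        size = min(remaining, max_packet_bytes)
        packets.append(size)
        remaining -= size
    return packets

packets = split_burst(burst_bytes=1500, max_packet_bytes=256)
print(packets)        # five full 256-byte packets plus a 220-byte tail
print(len(packets))   # 6
```

Between each small packet, the arbiter gets a chance to grant another initiator, which bounds the latency any one burst can inflict.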

Priority Mechanisms

  • Priority within the packets
  • Sideband priority at the interface channel to indicate an overall priority level of the pending packets, notifying the downstream arbiter of increased priority.
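The tiered arbitration described earlier can be sketched as a minimal model; the initiator names and priority levels here are hypothetical:

```python
# Sketch of a tiered priority grant: among pending requests, the arbiter
# grants the initiator in the highest priority tier, falling back to
# round-robin order within a tier.  Names and priorities are illustrative.

PRIORITY = {"display": 3, "camera": 3, "cpu": 2, "hdd": 1}   # higher wins

def arbitrate(pending, rr_order):
    """Return the granted initiator: highest tier, round-robin in a tie."""
    top = max(PRIORITY[p] for p in pending)
    tier = [p for p in rr_order if p in pending and PRIORITY[p] == top]
    return tier[0]

# A real-time display request beats a best-effort hard-drive request...
print(arbitrate({"hdd", "display"}, ["camera", "display", "hdd", "cpu"]))
# ...and ties inside the real-time tier fall back to round-robin order.
print(arbitrate({"display", "camera"}, ["camera", "display", "hdd", "cpu"]))
```

A flat arbiter is the degenerate case where every initiator carries the same priority, so only the round-robin order matters.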