Hardware/Software Codesign Architecture for Online Testing in Chip Multiprocessors

ABSTRACT

As the semiconductor industry continues its relentless push for nano-CMOS technologies, long-term device reliability and the occurrence of hard errors have emerged as major concerns. Long-term device reliability includes parametric degradation that results in loss of performance as well as hard failures that result in loss of functionality. It has been reported in the ITRS roadmap that the effectiveness of traditional burn-in testing in product life acceleration is eroding. Thus, to assure sufficient product reliability, fault detection and system reconfiguration must be performed in the field at runtime. Although regular memory structures are protected against hard errors using error-correcting codes, many structures within cores are left unprotected. Several proposed online testing techniques either rely on concurrent testing or periodically check for correctness. These techniques are attractive, but limited by significant design effort and hardware cost. Furthermore, the lack of observability and controllability of microarchitectural states results in long latency, long test sequences, and large storage of golden patterns. In this paper, we propose a low-cost scheme for detecting and debugging hard errors with a fine granularity within cores and keeping the faulty cores functional, with potentially reduced capability and performance. The solution includes both hardware and runtime software based on the codesigned virtual machine concept. It has the ability to detect, debug, and isolate hard errors in small noncache array structures, execution units, and combinational logic within cores. Hardware signature registers are used to capture the footprint of execution at the output of functional modules within the cores.
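
The signature-register idea in the last sentence of the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions: the abstract does not say how the signature is computed or checked, so the rotate-and-XOR compaction, the comparison against a fault-free reference run, and the names `update_signature`, `run_signature`, `good_adder`, and `stuck_at_adder` are all hypothetical and chosen for illustration.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit signature register (assumed width)


def update_signature(sig: int, value: int) -> int:
    """Fold one module output into the signature: rotate left by 1, then XOR."""
    rotated = ((sig << 1) | (sig >> 31)) & MASK32
    return rotated ^ (value & MASK32)


def run_signature(adder, operand_pairs) -> int:
    """Run a test sequence through a functional module and return its signature."""
    sig = 0
    for a, b in operand_pairs:
        sig = update_signature(sig, adder(a, b))
    return sig


def good_adder(a: int, b: int) -> int:
    """Fault-free 32-bit adder."""
    return (a + b) & MASK32


def stuck_at_adder(a: int, b: int) -> int:
    """Hypothetical hard fault: output bit 3 is stuck at 1."""
    return good_adder(a, b) | 0x8


# Hypothetical test sequence; a real one would be chosen for fault coverage.
TESTS = [(1, 2), (5, 10), (0xFFFF, 1), (7, 0)]

golden = run_signature(good_adder, TESTS)        # reference signature
observed = run_signature(stuck_at_adder, TESTS)  # signature from the faulty unit
fault_detected = observed != golden

print(hex(golden), hex(observed), fault_detected)
```

A single register value stands in for the whole output sequence, so only one reference word per test needs to be kept. The trade-off of any such compaction is aliasing: different error sequences can occasionally produce the same signature, so a match does not strictly prove the module is fault-free.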

INTRODUCTION

Transistor scaling has enabled integration of an exponentially increasing number of transistors. It is widely believed that Chip Multiprocessors (CMPs) will allow a clear path to ITRS technology scaling projections of 100 billion transistors on a single chip by 2020. In the area of computing, availability of an ever-increasing number of transistors has generally translated to additional resources. However, due to emerging device reliability and marginality problems, coupled with the lack of exhaustive testing and verification during various phases of design and operation of a chip, the susceptibility of these components to hard errors has also grown. Manufacturers are shipping more parts with incomplete testing than ever before. Silicon debug and diagnosis is now at the forefront of design constraints. A deeper perspective and analysis of silicon debug and diagnosis in the context of this paper is presented in Section 2. A system that enables speedy debug and diagnosis in today's vigorous time-to-market environment is highly desirable. Fault tolerance techniques are generally categorized into detection/isolation followed by correction/recovery phases.

Year: 2011

Publisher: IEEE

By: Omer Khan and Sandip Kundu

File Information: English language / 14 pages / size: 1.46 KB
