Projects

Introduction

Future many-core CMPs have the potential to offer unprecedented computing power and energy efficiency, but only as long as processor cores are able to access the data that they need to process. This is not straightforward due to the well-known memory wall problem (i.e., the increasing difference in speed between the cores and the memory) and the complexity of communication among so many cores. To overcome these hurdles, on-chip caches and powerful interconnection networks are crucial.

Unfortunately, the presence of private or partially shared caches creates difficulties for sharing data between cores. In order to keep the shared-memory programming model, which is familiar to most programmers, a cache coherence protocol needs to be implemented.

In addition to traditional scientific and commercial parallel workloads, currently one of the most practical uses of many-core CMPs is server consolidation (usually by means of virtualization). At the same time, new or revived programming models like transactional memory and message-passing will benefit from hardware support at the level of the cache coherence protocol.

The aims of this research line are to provide new solutions to the cache coherence problem, to study novel organizations and management policies for the on-chip caches and to propose techniques to enable the efficient execution of established and emerging parallel workloads.

Main Results Achieved

We have designed Direct Coherence, a new family of cache coherence protocols that avoids the indirection of directory-based protocols without relying on broadcast communication. It does so by shifting the responsibility for ordering requests and keeping coherence information from the home node, as in directory-based protocols, to the owner node. Direct Coherence for cc-NUMAs was presented at the HiPC 2007 conference [Ros2007], and it was later extended to CMPs and presented at the IPDPS 2008 conference [Ros2008].

We have also proposed novel mechanisms for conflict detection and resolution in Hardware Transactional Memory (HTM) systems. The first of these works proposes a directory-based conflict detection scheme, which can alleviate the performance degradation that eager systems experience when contention is high and has the potential to minimize the effect of false positives when hash signatures are used for transactional book-keeping. This work was published in the HiPC 2008 conference [Titos2008]. Later, we proposed a scheme for speculative resolution of conflicts that allows a writer transaction to continue its execution past conflicting accesses with other concurrent readers. This proposal was published in the IPDPS 2009 conference [Titos2009]. We have also designed a hybrid HTM system that is capable of selecting the most appropriate policy, eager or lazy, for managing each individual cache line, depending on the past behaviour of the line with regard to contention. This data-centric design combines the best of both worlds: it achieves truly parallel commits when contention is low, while still extracting good concurrency in situations of high contention. This proposal was published in ICS 2011 [Titos2011a]. Finally, we have thoroughly analysed the implications that common structural optimizations such as store buffering have on the performance of both eager and lazy systems, which had been ignored in the literature. Our findings confirm that when write buffering is employed, the behaviour of eager systems is lazified and both HTM designs exhibit comparable performance, debunking the generalized perception that lazy systems consistently outperform their eager counterparts. This study was published in ICPP 2011 [Titos2011b].
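To make the notion of a signature false positive concrete, the following is a minimal software sketch of a hash signature used to track a transaction's read set. The class name `Signature`, the 64-bit width and the two hash functions are illustrative assumptions and are not taken from the cited papers; real HTM signatures are hardware bit vectors.

```python
class Signature:
    """Toy read-set signature: a fixed-size bit vector (hypothetical design)."""

    def __init__(self, bits=64, line_bytes=64):
        self.bits = bits
        self.line_bytes = line_bytes
        self.vector = 0  # Python integer used as a bit vector

    def _positions(self, address):
        line = address // self.line_bytes  # track whole cache lines
        # Two simple, illustrative hash functions over the line number.
        return (line % self.bits, (line // self.bits) % self.bits)

    def insert(self, address):
        for pos in self._positions(address):
            self.vector |= 1 << pos

    def may_contain(self, address):
        # Always True for inserted lines (no false negatives), but it can
        # also be True for lines that were never inserted (false positives).
        return all((self.vector >> pos) & 1 for pos in self._positions(address))


def conflicts(writer_address, reader_signature):
    """A write is flagged as a conflict if the reader's signature reports it."""
    return reader_signature.may_contain(writer_address)


read_set = Signature()
read_set.insert(0x1000)  # line 64  sets bits 0 and 1
read_set.insert(0x2040)  # line 129 sets bits 1 and 2

real_conflict = conflicts(0x1000, read_set)   # line really was read
false_positive = conflicts(0x2000, read_set)  # line 128 maps to bits 0 and 2
no_conflict = conflicts(0x00C0, read_set)     # line 3 maps to bit 3, not set
```

In this sketch the write to `0x2000` is reported as a conflict even though that line was never read, because its bits were set by two other lines. This is the kind of spurious conflict that more precise book-keeping aims to reduce.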

In the context of consolidated servers, we proposed an operating-system-based, distance-aware round-robin mapping policy that tries to map each memory page to the cache bank belonging to the core that most frequently accesses the blocks within that page. This policy was presented in HiPC 2009 [Ros2009]. More recently, we analysed previously proposed cache coherence protocols for server consolidation using virtualization, found problems with the handling of shared memory (i.e., deduplicated pages) and presented a solution in the SBAC-PAD 2010 conference [García-Guirado2010].
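The goal of the page mapping policy described above (placing each page in the bank of the core that accesses it most) can be illustrated with a small offline sketch. This is not the policy of [Ros2009], which runs inside the operating system. The function name `choose_home_banks`, the 4 KB page size, the one-bank-per-core layout and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter, defaultdict

PAGE_BYTES = 4096  # assumed page size


def choose_home_banks(trace):
    """Map each page to the bank of the core that accesses it most often.

    `trace` is an iterable of (core_id, address) pairs. One cache bank per
    core is assumed, so a bank is identified by its core id.
    """
    per_page = defaultdict(Counter)
    for core_id, address in trace:
        per_page[address // PAGE_BYTES][core_id] += 1

    home = {}
    for page, counts in per_page.items():
        # Highest access count wins; ties go to the lowest core id.
        home[page] = min(counts, key=lambda core: (-counts[core], core))
    return home


trace = [
    (0, 0x0000), (1, 0x0040), (1, 0x0080),  # page 0: core 1 dominates
    (2, 0x1000), (3, 0x1FC0),               # page 1: tie between cores 2 and 3
]
banks = choose_home_banks(trace)
```

Here page 0 is assigned to the bank of core 1 and page 1 to the bank of core 2. A practical policy would also have to balance the load across banks, which this sketch ignores.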

References

[Ros2007] A. Ros, M. E. Acacio and J. M. García. Direct Coherence: Bringing Together Performance and Scalability in Shared-Memory Multiprocessors. In Proc. of the 14th Int’l Conference on High Performance Computing (HiPC 2007), pp. 147-160, ISBN: 3-540-77219-7, December 2007.

[Ros2008] A. Ros, M. E. Acacio and J. M. García. DiCo-CMP: Efficient Cache Coherency in Tiled CMP Architectures. In Proc. of the 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), IEEE Computer Society Press, ISBN: 978-1-4244-1693-6, April 2008.

[Ros2009] Alberto Ros, Marcelo Cintra, Manuel E. Acacio and José M. García. Distance-Aware Round-Robin Mapping for Large NUCA Caches. In Proc. of the 16th Int'l Conference on High Performance Computing (HiPC 2009), pp. 79–88, Kochi (India), December 2009.

[Titos2008] R. Titos, M. E. Acacio and J. M. García. Directory-Based Conflict Detection in Hardware Transactional Memory. In Proc. of the 15th Annual IEEE International Conference on High Performance Computing (HiPC 2008), Springer, pp. 541-554, ISBN: 978-3-540-89893-1, December 2008.

[Titos2009] R. Titos, M. E. Acacio and J. M. García. Speculation-Based Conflict Resolution in Hardware Transactional Memory. In Proc. of 23rd IEEE Int’l Parallel and Distributed Processing Symposium (IPDPS 2009), pp. 1-12, ISBN: 978-1-4244-3750-4/09, Rome (Italy), May 2009.

[Titos2011a] R. Titos, A. Negi, M. E. Acacio, J. M. García and P. Stenström. ZEBRA: A Data-Centric, Hybrid-Policy Hardware Transactional Memory Design. In Proc. of the 25th ACM International Conference on Supercomputing (ICS 2011), ISBN: 978-1-4503-0102-2, Tucson (USA), June 2011.

[Titos2011b] A. Negi, R. Titos, M. E. Acacio, J. M. García and P. Stenström. Eager meets Lazy: The Impact of Write-Buffering on Hardware Transactional Memory. To appear in the 40th International Conference on Parallel Processing (ICPP 2011), Taipei (Taiwan), September 2011.

[García-Guirado2010] Antonio García-Guirado, Ricardo Fernández-Pascual and José M. García. Analyzing Cache Coherence Protocols for Server Consolidation. In Proc. of the 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2010), Petrópolis (Brazil), October 2010.

Publications

The full list of publications can be found by visiting the personal web pages of the group members, where most of the papers can be downloaded in PDF format.

Introduction

Current technology trends are increasing the number of available transistors per chip. Nonetheless, these trends are also making these transistors more prone to permanent, intermittent and transient faults. To overcome these problems, we need to develop new architectural techniques that ensure the reliability of the chip. Traditionally, this has been achieved by adding a significant amount of redundant hardware, which increases the cost of the device and decreases its performance and energy efficiency.

Our main goal is to provide fault tolerance with minimal performance degradation. To achieve this, we propose fault-tolerance techniques both at the microarchitectural level and at the interconnection network level.

Main Results Achieved

At the microarchitectural level, we proposed REPAS, a fault-tolerant architecture based on redundant execution within SMT cores. This proposal reduces both performance degradation and area overhead compared to previous work. The results were published in the Euro-Par 2009 conference [Sánchez2009].

To minimize the added complexity and the hardware and performance overhead, we presented the Log-Based Redundant Architecture (LBRA) in HiPC 2010 [Sánchez2010], a highly decoupled redundant architecture based on a hardware transactional memory implementation. We leverage the hardware already introduced by LogTM-SE to provide a consistent view of memory between master and slave threads through a virtualized memory log, achieving both transient fault detection and recovery with better scalability, higher decoupling and lower performance overhead than previous proposals.

To handle faults that occur in the on-chip interconnection network of CMPs, we propose adding fault tolerance at the level of the cache coherence protocol instead of at the level of the interconnection network itself. We have shown the viability of our approach and have developed several fault-tolerant cache coherence protocols. These results have been published in well-known international conferences (such as HPCA [Fernández2007] and DSN [Fernández2008a]) and in the TPDS journal ([Fernández2008b] and [Fernández2010]).

Finally, we have studied the impact of hard faults on cache memories. To this end, we developed an analytical model, presented at IOLTS 2011 [Sánchez2011], for determining the implications of word/block disabling techniques due to random cell failure on cache miss rate behaviour. The proposed model is distinct from previous work in that it is an exact model rather than an approximation. Moreover, it is simpler than previous experimental frameworks, which rely on fault maps as a brute-force approach to statistically approximate the effect of random cell failure on caches.
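The effect of random cell failure on block disabling can be illustrated with elementary probability. The sketch below is not the model of [Sánchez2011]. It only assumes that cells fail independently with probability `p_cell`, so that a block is disabled with probability 1 - (1 - p_cell)^bits, and that the number of usable ways in a set therefore follows a binomial distribution. All function names and parameter values are illustrative.

```python
from math import comb


def block_fault_probability(p_cell, bits_per_block):
    """Probability that at least one cell in a block is faulty."""
    return 1.0 - (1.0 - p_cell) ** bits_per_block


def usable_ways_distribution(p_cell, bits_per_block, ways):
    """P(exactly k usable ways in a set), k = 0..ways, under block disabling."""
    p_bad = block_fault_probability(p_cell, bits_per_block)
    p_good = 1.0 - p_bad
    return [comb(ways, k) * p_good ** k * p_bad ** (ways - k)
            for k in range(ways + 1)]


def expected_usable_ways(p_cell, bits_per_block, ways):
    """Mean number of ways that remain usable in a set."""
    dist = usable_ways_distribution(p_cell, bits_per_block, ways)
    return sum(k * p for k, p in enumerate(dist))


# Assumed example parameters: 64-byte blocks, 8-way sets, p_cell = 1e-3.
BITS = 64 * 8
dist = usable_ways_distribution(1e-3, BITS, 8)
mean_ways = expected_usable_ways(1e-3, BITS, 8)
```

Turning such a distribution of effective associativity into an expected miss ratio requires additional information about the workload, which is outside the scope of this sketch.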

References

[Fernández2007] Ricardo Fernández-Pascual, José M. García, Manuel E. Acacio and José Duato. A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures. Proc. of the 13th International Symposium on High-Performance Computer Architecture (HPCA-13). February 2007.

[Fernández2008a] Ricardo Fernández-Pascual, José M. García, Manuel E. Acacio and José Duato. A Fault-Tolerant Directory-Based Cache Coherence Protocol for CMP Architectures. Proc. of the 38th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN-2008). June 2008.

[Fernández2008b] Ricardo Fernández-Pascual, José M. García, Manuel E. Acacio and José Duato. Extending the TokenCMP Cache Coherence Protocol for Low Overhead Fault Tolerance in CMP Architectures. IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 19, no. 8, pages 1044-1056, August, 2008.

[Fernández2010] Ricardo Fernández-Pascual, José M. García, Manuel E. Acacio and José Duato. Dealing with Transient Faults in the Interconnection Network of CMPs at the Cache Coherence Level. IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 21, no. 8, pages 1117-1131, August, 2010.

[Sánchez2009] D. Sánchez, J.L. Aragón and J.M. García. REPAS: Reliable Execution for Parallel Applications in Tiled-CMPs. In Proc. of the 15th Int. Conference on Parallel and Distributed Computing (Euro-Par), Delft (Netherlands), pp. 321-333, ISBN: 978-3-642-03868-6, August 2009.

[Sánchez2010] Daniel Sánchez, Juan L. Aragón and José M. García. A Log-Based Redundant Architecture for Reliable Parallel Computation. Proc. of the 17th International Conference on High Performance Computing (HiPC), Goa (India), December 2010.

[Sánchez2011] D. Sánchez, Yiannakis Sazeides, Juan L. Aragón and José M. García. An Analytical Model for the Calculation of the Expected Miss Ratio in Faulty Caches. Proc. of the 17th Int'l On-Line Testing Symposium (IOLTS), Athens (Greece), July 2011.

Introduction

Current processors integrate many simpler cores, which gives them tremendous potential in terms of peak performance. Moreover, emergent platforms such as Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), Accelerated Processing Units (APUs), etc. have become established for developing scientific applications in different areas including bioinformatics, finance, seismic processing, fluid dynamics, etc. However, it is not a trivial task to take advantage of the potential performance that these platforms provide to the scientific community. In this task, we develop scientific applications from different fields such as linear algebra, systems biology, natural computing, image processing, etc. on these emergent platforms, providing insight into the peculiarities of their programming models and architectures. Currently, we are investigating how to apply those models to challenging problems, mainly derived from bioinformatics.

Main Results Achieved

Our studies began with a performance study of the GPU as a general-purpose computing device, providing some insights into the peculiarities of the CUDA programming model [Cecilia2009][Cecilia2010a]. In addition, we discuss alternative ways of computation inspired by natural computing and their efficient design on GPUs. So far, we have worked on two of those models: P systems [Cecilia2010b] and Ant Colony Optimisation [Cecilia2011].
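As an illustration of the kind of data-parallel kernel involved, the title of [Cecilia2010a] refers to 2D stencil computations for the Jacobi method. The following is a sequential reference sketch of one Jacobi sweep, not the CUDA code from that work. The four-point averaging update (the usual form for Laplace's equation) and the fixed boundary values are assumptions made for the example.

```python
def jacobi_sweep(grid):
    """One Jacobi iteration on a 2D grid with fixed boundary values."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # boundary cells are copied unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Each interior point reads only the previous grid, so all points
            # are independent and could be updated in parallel, for example
            # by one GPU thread per point.
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


grid = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
after_one = jacobi_sweep(grid)
```

Because the sweep writes to a separate output grid, the input is left untouched. The same double-buffering idea is what makes the update order irrelevant in a parallel implementation.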

References

[Cecilia2009] José M. Cecilia, José M. García, Manuel Ujaldón. The GPU on the Matrix-Matrix Multiply: Performance Study and Contributions. Proc. of Int. Conference on Parallel Computing (ParCo2009). Lyon, France, pp 331-340, 2010. ISBN: 978-1-60750-529-7.

[Cecilia2010a] José M. Cecilia, José M. García, Manuel Ujaldón. CUDA 2D Stencil Computations for the Jacobi Method. 10th PARA'2010: State of the Art in Scientific and Parallel Computing, Reykjavik (Iceland), June 6-9, 2010.

[Cecilia2010b] José M. Cecilia, José M. García, Ginés D. Guerrero, Miguel A. Martínez-del-Amor, Ignacio Pérez-Hurtado, Mario J. Pérez-Jiménez. Simulation of P Systems with Active Membranes on CUDA. Briefings in Bioinformatics, Vol. 11, Number 3, pp. 313-322, May 2010. Impact Factor: 7.329.

[Cecilia2011] José M. Cecilia, José M. García, Manuel Ujaldón, Andy Nisbet, Martyn Amos. Parallelization Strategies for Ant Colony Optimisation on GPUs. Proc. 14th Int. Workshop on Nature Inspired Distributed Computing -NIDISC11- (in conjunction with IPDPS 2011), Anchorage (Alaska), CD-ROM, May 2011.

Introduction

In biomedical research, experimentation can be unfeasible in relevant study cases due, among other factors, to the intrinsic complexity of nature. Theoretical and computational methods, together with biophysical and biochemical modeling, can overcome these limitations, providing new understanding and solutions for world health problems. Nevertheless, they need to process huge amounts of data with high accuracy, and this can seriously limit the applicability of bioinformatics methods. GPUs and supercomputers can drastically speed up the required calculations and facilitate the inclusion of new methodological enhancements that were not feasible in the past [Pérez-Sánchez2009][Pérez-Sánchez2010][Pérez-Sánchez2011a]. Our main objectives are the development, implementation and exploitation of bioinformatics applications on massively parallel architectures such as GPUs and supercomputers, their experimental validation, and their application to relevant world health problems.

Currently:

  • We are developing new biophysical simulation methodologies from scratch on massively parallel architectures [Sánchez-Linares2011c][Cepas-Quiñonero2011a][Sánchez-Linares2012].
  • We are adding new improvements to the biomedical applications.
  • We are applying the new methodological features of these bioinformatics applications, made possible by the large increase in computational capability obtained by implementing the methods on massively parallel architectures, to biomedically relevant problems in collaboration with several experimental groups [Pérez-Sánchez2011b][Navarro-Fernández2012].

Main Results Achieved

We have worked on the implementation of the most computationally demanding kernels of the virtual screening program FlexScreen on GPUs [Guerrero2011][Sánchez-Linares2011a][Sánchez-Linares2011b], Grids/Clusters [Pérez-Sánchez2011c] and supercomputers [Pérez-Sánchez2011d][Guerrero2012].
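To give an idea of why such kernels are computationally demanding, the title of [Guerrero2011] refers to a non-bonded interactions kernel, which evaluates an energy term for every receptor-ligand atom pair. The sketch below is a sequential illustration only. The functional form (Lennard-Jones 12-6 plus Coulomb), the single `epsilon`/`sigma` pair shared by all atoms and the unit constants are assumptions, since the scoring function used by FlexScreen is not described here.

```python
from math import dist


def nonbonded_energy(receptor, ligand, epsilon=1.0, sigma=1.0, coulomb_k=1.0):
    """Sum pairwise Lennard-Jones 12-6 and Coulomb terms (assumed form).

    Each atom is a tuple (x, y, z, charge). The double loop visits every
    receptor-ligand pair, so the cost grows with len(receptor) * len(ligand).
    """
    total = 0.0
    for rx, ry, rz, rq in receptor:
        for lx, ly, lz, lq in ligand:
            r = dist((rx, ry, rz), (lx, ly, lz))
            sr6 = (sigma / r) ** 6
            total += 4.0 * epsilon * (sr6 * sr6 - sr6)  # Lennard-Jones term
            total += coulomb_k * rq * lq / r            # electrostatic term
    return total


# Two oppositely charged atoms placed 2 length units apart (made-up values).
receptor = [(0.0, 0.0, 0.0, 1.0)]
ligand = [(2.0, 0.0, 0.0, -1.0)]
energy = nonbonded_energy(receptor, ligand)
```

Every pair contributes independently to the sum, which is what makes this type of computation a natural candidate for massively parallel hardware.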

References

[Guerrero2011] Ginés D. Guerrero, Horacio Pérez Sánchez, Wolfgang Wenzel, José M. Cecilia, José M. García. Effective Parallelization of Non-bonded Interactions Kernel for Virtual Screening on GPUs. 5th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2011), Salamanca, Spain, pp. 63-69, April 2011.

[Sánchez-Linares2011a] I. Sánchez-Linares, H. Pérez-Sánchez, G.D. Guerrero, J.M. Cecilia, J.M. García, (2011). Accelerating multiple target drug screening on GPUs. CMSB 2011 (9th International Conference on Computational Methods in Systems Biology) (accepted).

[Sánchez-Linares2011b] I. Sánchez-Linares, H. Pérez-Sánchez, J.M. García. Accelerating Grid Kernels for Virtual Screening on Graphics Processing Units. Proceedings of the International Conference on Parallel Computing (Parco) (accepted).

[Sánchez-Linares2011c] I. Sánchez-Linares, H. Pérez-Sánchez, J.M. Cecilia, J.M. García, (2011). BINDSURF: a fast blind Virtual Screening methodology on GPUs. In: NETTAB 2011 workshop focused on Clinical Bioinformatics (accepted).

[Cepas-Quiñonero2011a] E. J. Cepas Quiñonero, H. Pérez-Sánchez, J.M. Cecilia, J.M. García, (2011) MURCIA: Fast parallel solvent accessible surface area calculation on GPUs and application to drug discovery and molecular visualization. In: NETTAB 2011 workshop focused on Clinical Bioinformatics (accepted).

[Pérez-Sánchez2009] H.E. Pérez Sánchez and W. Wenzel. Implementation of an effective non-bonded interactions kernel for biomolecular simulations on the Cell processor. Proceedings of Massively Parallel Computational Biology on GPUs, Jahrestagung der Gesellschaft für Informatik e.V., Lübeck, Germany, September 29, 2009. Volume 154 of Lecture Notes in Informatics, 721-729.

[Pérez-Sánchez2010] K. Klenin, H. Pérez-Sánchez, W. Wenzel. Method and system for determining the solvent accessible surface area and its derivatives of a molecule. European Patent Application 10002203.7, 2010.

[Pérez-Sánchez2011a] H. Pérez-Sánchez, W. Wenzel. Optimization methods for virtual screening on novel computational architectures. Current Computer Aided Drug Design, 7, 44-52, 2011.

[Pérez-Sánchez2011b] H. Pérez-Sánchez, I. Meliciani, W. Wenzel, I. Martínez-Martínez, J. Navarro-Fernández, J. Corral, V. Vicente. A Molecular Scaffold to Modulate Thrombin/Antithrombin Activity by Heparin Binding. European patent application 11002649.9, 2011.

[Pérez-Sánchez2011c] H. Pérez-Sánchez, I. Kondov, J.M. García, K. Klenin, W. Wenzel, (2011). A Pipeline Pilot based SOAP implementation of FlexScreen for High-Throughput Virtual Screening. In IWSG-Life 2011 (3rd International Workshop on Science Gateways for Life Sciences) (accepted for oral presentation and publication).

[Pérez-Sánchez2011d] H. Pérez-Sánchez, G. D. Guerrero, I. Sánchez-Linares, J. M. Cecilia, J. M. García, I. Martínez-Martínez, J. Navarro-Fernández, V. Vicente-García, J. Corral, I. Meliciani and W. Wenzel. Poster entitled "High Throughput Virtual Screening against flexible protein receptors; implementation on GPUs and application to the discovery of novel scaffolds for the modulation of antithrombin anticoagulant activity". XI Congreso de la Sociedad de Biofísica de España, held in Murcia (Spain), 2011.

[Navarro-Fernández2012] J. Navarro-Fernández, H. Pérez-Sánchez, I. Martínez-Martínez, I. Meliciani, J.A. Guerrero, V. Vicente, J. Corral, W. Wenzel. In-silico discovery of a compound with nanomolar affinity to antithrombin causing partial activation and increased heparin affinity. Journal of Medicinal Chemistry, 55(14), 6403-6412, 2012.

[Sánchez-Linares2012] I. Sánchez-Linares, H. Pérez-Sánchez, J.M. Cecilia, J.M. García. Fast blind Virtual Screening with BINDSURF. BMC Bioinformatics, 13(Suppl 14):S13, 2012.

[Guerrero2012] G.D. Guerrero, H. Pérez-Sánchez, J.M. Cecilia, J.M. García. Parallelization of Virtual Screening in Drug Discovery on Massively Parallel Architectures. In: 20th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2012), 588-595, 2012.
