
TechGrid

What is TechGrid?

TechGrid is a union of more than 600 networked Texas Tech University machines: Windows XP desktops in labs and offices, plus Linux servers.

What is a Grid? A Grid is a dynamic network of computing resources that work together as a single, uniform operating environment. It can span locations and administrative domains, and can flexibly support dynamically changing organizations and computing requirements. In order to accomplish this, Grids must:

  • Provide access to resources needed by users, including processing, data, and applications.
  • Provide a security infrastructure sufficient to support Grid-wide sharing of resources, yet allow for local, diverse, and changing security policies.
  • Be resilient in the face of individual system failure or changes in the composition of the Grid.
  • Incorporate heterogeneous hardware, operating systems, and system configurations.

How has the concept of a Grid evolved? The term "Grid computing" originated in the early 1990s as a metaphor for making computer power as easy to access as an electric power Grid. The first applications focused on aggregating or “harvesting” the unused processing power of desktop computers. An example is SETI@home, which aggregated PC processing power in order to execute a computing-intensive application. But comprehensive Grids go beyond PC cycle aggregation to include a range of hardware and operating systems. They also go beyond processing power to provide access to data and applications. As a result, they provide the kind of industrial-strength support that large organizations need, without requiring those organizations to purchase more supercomputers.

What type of Grid is TechGrid? According to the figure below, TechGrid is a campus-wide grid.

How does the Grid work? The Grid uses Grid middleware to distribute a compute job among the compute nodes within the Grid. The middleware that TechGrid uses is Condor.
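As a rough illustration of what "distributing a compute job among compute nodes" means, the Python sketch below splits a workload into independent tasks and hands them out to nodes in turn. It is a toy model only: the `dispatch` function, the task names, and the node names are hypothetical, and Condor's real scheduling is driven by policy rather than simple round-robin.

```python
from collections import deque


def dispatch(tasks, nodes):
    """Assign queued tasks to nodes in round-robin order.

    Returns a dict mapping each node name to the list of tasks it received.
    """
    if not nodes:
        raise ValueError("at least one compute node is required")
    queue = deque(tasks)
    assignment = {node: [] for node in nodes}
    while queue:
        for node in nodes:
            if not queue:
                break
            # Hand the next waiting task to this node.
            assignment[node].append(queue.popleft())
    return assignment


# Hypothetical example: 7 independent tasks spread over 3 nodes.
plan = dispatch([f"task-{i}" for i in range(7)], ["node-a", "node-b", "node-c"])
for node, assigned in plan.items():
    print(node, assigned)
```

A real middleware layer also has to track which nodes are currently available and move work off nodes that become busy; the sketch leaves that out.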

Where is TechGrid now? TechGrid’s compute nodes are located in the ATLC, the High Performance Computing Center at Reese Center, the Computer Science department, the Business Building, the North Computing Center, and the Math Building. Currently, TechGrid is made up of 600 compute nodes spanning several domains and various operating systems.

What is Condor? Condor is a specialized workload management system for compute-intensive jobs. Like other full-featured batch systems, Condor provides a job queuing mechanism, scheduling policy, priority scheme, resource monitoring, and resource management. Users submit their serial or parallel jobs to Condor, Condor places them into a queue, chooses when and where to run the jobs based upon a policy, carefully monitors their progress, and ultimately informs the user upon completion.

While providing functionality similar to that of a more traditional batch queuing system, Condor's novel architecture allows it to succeed in areas where traditional scheduling systems fail. Condor can be used to manage a cluster of dedicated compute nodes (such as a "Beowulf" cluster). In addition, unique mechanisms enable Condor to effectively harness wasted CPU power from otherwise idle desktop workstations. For instance, Condor can be configured to only use desktop machines where the keyboard and mouse are idle. Should Condor detect that a machine is no longer available (such as a key press detected), in many circumstances Condor is able to transparently produce a checkpoint and migrate a job to a different machine which would otherwise be idle.

Condor does not require a shared file system across machines - if no shared file system is available, Condor can transfer the job's data files on behalf of the user, or Condor may be able to transparently redirect all the job's I/O requests back to the submit machine. As a result, Condor can be used to seamlessly combine all of an organization's computational power into one resource.

The ClassAd mechanism in Condor provides an extremely flexible and expressive framework for matching resource requests (jobs) with resource offers (machines). Jobs can easily state both job requirements and job preferences. Likewise, machines can specify requirements and preferences about the jobs they are willing to run. These requirements and preferences can be described in powerful expressions, resulting in Condor's adaptation to nearly any desired policy.

Condor can be used to build Grid-style computing environments that cross administrative boundaries. Condor's "flocking" technology allows multiple Condor compute installations to work together. Condor incorporates many of the emerging Grid-based computing methodologies and protocols. For instance, Condor-G is fully interoperable with resources managed by Globus.

Condor is the product of the Condor Research Project at the University of Wisconsin-Madison (UW-Madison), and it was first installed as a production system in the UW-Madison Department of Computer Sciences nearly 15 years ago.
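The two-sided matching that ClassAds perform can be pictured with a small Python sketch. This is not Condor's ClassAd language or API; it is a toy model under stated assumptions. The dictionaries, attribute names (`os`, `memory_mb`, `keyboard_idle_s`), the 15-minute idle threshold, and the functions `matches` and `best_machine` are all hypothetical and chosen only to show requirements on both sides plus a job-side preference.

```python
def idle_desktop_policy(machine, job):
    # Hypothetical machine policy: accept jobs only after 15 minutes
    # (900 seconds) without keyboard activity.
    return machine["keyboard_idle_s"] >= 900


def dedicated_policy(machine, job):
    # Hypothetical policy for a dedicated server: always accept jobs.
    return True


def matches(job, machine):
    """Two-sided match: each side's requirements must accept the other."""
    return (job["requirements"](job, machine)
            and machine["requirements"](machine, job))


def best_machine(job, machines):
    """Among mutually acceptable machines, pick the one the job ranks highest."""
    candidates = [m for m in machines if matches(job, m)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


machines = [
    {"name": "lab-pc-1", "os": "WINDOWS", "memory_mb": 512,
     "keyboard_idle_s": 1200, "requirements": idle_desktop_policy},
    {"name": "office-pc-2", "os": "LINUX", "memory_mb": 2048,
     "keyboard_idle_s": 30, "requirements": idle_desktop_policy},
    {"name": "server-3", "os": "LINUX", "memory_mb": 1024,
     "keyboard_idle_s": 0, "requirements": dedicated_policy},
    {"name": "server-4", "os": "LINUX", "memory_mb": 4096,
     "keyboard_idle_s": 0, "requirements": dedicated_policy},
]

job = {
    "owner": "alice",
    # Job requirement: a Linux machine with at least 1024 MB of memory.
    "requirements": lambda job, m: m["os"] == "LINUX" and m["memory_mb"] >= 1024,
    # Job preference: more memory is better.
    "rank": lambda m: m["memory_mb"],
}

chosen = best_machine(job, machines)
print(chosen["name"] if chosen else "no match")
```

In this sketch `office-pc-2` satisfies the job's requirements but is rejected by its own policy because someone is using it, which mirrors the idea that machines, not just jobs, have a say in the match.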