HPC (High Performance Computing) 101 Tutorial and RQJ


Send an email to sales@kingwoodsoftware.com for pricing or questions or Click to Contact Sales



What does RQJ do?

RQJ is a highly specialized software program that takes anywhere from a few to thousands of computers and makes them function as a single large computer. RQJ enables engineers and scientists to rapidly iterate on mathematical models for whatever problem they are working on. Essentially, a few engineers can do the work of many. Drawing on our past work, one scientist was able to compute two years' worth of work in two weekends with a large cluster. It freed him up to do other, more productive work. Another scientist doing nuclear modeling for the oilfield was able to perform a new downhole tool design analysis every couple of days. What used to take years and an extremely large team was completed in less than a year by a small team. RQJ can handle tens of thousands of small jobs taking a few minutes each. It can also handle hundreds of large jobs, with each job taking up to several weeks to complete.

What is a Batch Queue?

Problems often require numerous separate computations, sometimes thousands of them. It is not cost-effective to build a cluster of computers large enough to run all of these jobs at one time. The cost-effective solution is to store all of these jobs and run them as space becomes available in the cluster. This allows the computers to run nights, weekends and holidays.
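To make the idea concrete, here is a minimal sketch of a first-in, first-out batch queue. It is an illustration only, not how RQJ is implemented; the function name, the job run times and the slot count are all assumptions made for the example.

```python
import heapq
from collections import deque


def run_batch_queue(job_minutes, total_slots):
    """Simulate a FIFO batch queue (illustrative sketch, not RQJ's scheduler).

    job_minutes: run time of each job in minutes (hypothetical values).
    total_slots: how many jobs the cluster can run at the same time.
    Returns the finish time of each job, in submission order.
    """
    if total_slots < 1:
        raise ValueError("the cluster needs at least one slot")

    waiting = deque(enumerate(job_minutes))  # stored jobs, oldest first
    running = []                             # min-heap of (finish_time, job_index)
    finish = [0] * len(job_minutes)
    clock = 0

    while waiting or running:
        # Start stored jobs for as long as there is free space in the cluster.
        while waiting and len(running) < total_slots:
            index, minutes = waiting.popleft()
            heapq.heappush(running, (clock + minutes, index))
        # Jump ahead to the next job that finishes, which frees one slot.
        clock, index = heapq.heappop(running)
        finish[index] = clock

    return finish


# Four jobs on a cluster that can run two at a time.
print(run_batch_queue([10, 5, 7, 3], total_slots=2))
```

The third and fourth jobs do not start until an earlier job finishes, which is exactly what lets a small cluster work through a long list of jobs unattended.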

How does your software help me?

Most GUI software does not allow you to set up work and have it execute in the future. The core analysis software generally can be run in the future. RQJ allows you to optimally schedule your work across all of your available computer cores. Even if you only have a single computer, this software will often triple the number of jobs run per day. If you have smaller jobs, the results will be even better.

Can't I just use a cloud environment?

First, let's define the goals of most cloud providers. Next, let's define what is required for acceptable (not great) HPC performance. Note that none of these requirements are met by most cloud providers. Ultimately, cloud providers have so much overhead that they have to grossly overcharge and massively under-deliver on performance to make a profit. If you are not prepared to spend a huge amount of money, do not use the cloud. We have witnessed two-hour mistakes that cost over $25,000 in file transfer fees alone. Your own hardware on your premises is simply safer, is easier on your budget and has higher computational throughput.

Double checking our statements about cloud providers via an Amazon analysis

What are your favorite computing environments and why?

What is HPC software?

HPC software allows one or more individuals to link multiple computers together and make them function as a single computer. HPC software will optimally schedule your entire computational load across all of your computers. The massive increase in computational power (even with just a few computers) makes running "what if we did x" studies trivial.

The different types of HPC software by programming model.

How does a Computer Cluster help my business?

Normally you have to conduct physical tests or construct mathematical models of whatever problem you are solving (Genetics, Seismic, Nuclear, Electrical, Mechanical, new Drugs, Chemical,...) in new product development. It is much safer for humans, animals and the environment if a proper mathematical model can be constructed for the problem at hand. If your team can find a much better solution numerically in months instead of guessing with numerous physical trials over years, your company will have better product-market fit at a much lower cost. RQJ allows your highly paid scientists and engineers to work much more productively.

We have centralized HPC job submission facilities and don't need this software

For most individuals and groups, having a small dedicated cluster that doesn't require IT approval and management priority arbitration will produce far more computations and results. It also removes many fights with other groups about the relative priority of your jobs versus everyone else's jobs.

High Performance Computing

This is the software technique that links computers together so that they work as a single large computer. Most of the easy scientific and business developments have already been found. New breakthrough developments will almost certainly need lots of computer modeling, since that is the most cost-effective way to prove new theories.

What is a single core job?

The easiest numerical problem models run on a single CPU core. These are the types of jobs that are initially coded. They start at the top of the computer code and linearly run down the code to the end. They will never use more than a single CPU core. These jobs typically take much longer to run (the computation load is not being shared among multiple CPU cores).

What is SMP?

Symmetric Multi-Processing (SMP) is a programming technique where two or more CPU cores are used to solve a single problem. SMP processes scale only partially with the number of CPU cores. Each numerical code has its own scaling behavior. Very few problems will run N times faster with N cores.
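One common way to reason about why N cores rarely give an N times speedup is Amdahl's law. The text above does not name it; it is brought in here purely as an illustration. The sketch below assumes you can estimate what fraction of a job's run time is parallelizable, and the 90% figure is a made-up example value.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Estimate speedup on `cores` CPU cores using Amdahl's law.

    parallel_fraction: share of the single-core run time that can be spread
    across cores, between 0.0 and 1.0 (an estimate you supply).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part never gets faster; only the parallel part is divided.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical job where 90% of the work can run in parallel.
for cores in (1, 2, 4, 8, 16):
    print(cores, "cores:", round(amdahl_speedup(0.9, cores), 2), "x faster")
```

Under this model the serial part of the code sets a hard ceiling: a job that is 90% parallel can never run more than 10 times faster, no matter how many cores you add.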

What is MPI?

Message Passing Interface (MPI) is the programming model used when you link X computers, each with N CPU cores, to work on one problem. Much larger types of problems can be solved with this setup. Some typical problem types are Weather Prediction, Large Stock Market Prediction, Nuclear, Seismic, Electromagnetic and Fluid Flow problems. MPI processes scale only partially with the number of CPU cores. Very few problems will run N times faster with N cores.

How do I size machines in the Cluster?

Often, relatively few scientists and engineers will account for the bulk of your computational needs. Each person should be able to accurately describe the type of machine (CPU speed, number of CPUs, RAM, disk space, network speeds...) that is required to solve their problems. Hopefully there is some overlap in specifications. SRS recommends that a separate batch queue be set up for each size of problem. Trying to run most large scientific codes on computers that are too small can take dramatically more wall clock time, since the CPU will spend more time in wait states than performing the actual calculations.
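As a rough sketch of the "separate batch queue for each size of problem" recommendation, the snippet below routes a job to the smallest queue whose machines can hold it. The queue names and the core and RAM limits are invented for illustration and do not come from RQJ.

```python
# Hypothetical queues, listed smallest first: (name, max cores, max RAM in GB).
QUEUES = [
    ("small", 4, 16),
    ("medium", 16, 128),
    ("large", 64, 512),
]


def pick_queue(cores_needed, ram_gb_needed):
    """Return the name of the smallest queue that fits the job."""
    for name, max_cores, max_ram_gb in QUEUES:
        if cores_needed <= max_cores and ram_gb_needed <= max_ram_gb:
            return name
    raise ValueError("no queue is large enough for this job")


print(pick_queue(2, 8))     # fits the smallest machines
print(pick_queue(2, 64))    # few cores, but needs more RAM than "small" offers
print(pick_queue(32, 256))  # needs the largest machines
```

Checking RAM as well as cores matters here: a job sent to a machine with too little memory is the case that ends up spending its time in wait states.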

Can I randomly mix and match machines?

Not really. See "How do I size machines in the Cluster?" above.

Can I use older desktops that have gone out of warranty?

Provided that they meet the problem specifications, sure. We have done this several times at several different companies with great success.

Can I just buy several new AMD Threadripper CPUs?

This option should be seriously considered. We REALLY like AMD's new CPU. Beasts!

Stages of Research and Development and Cluster Size

Here is an example based on many decades in the oilfield. In the first stage, an engineer has to tune his math model to the problem. This can take some time. In this stage, the cluster can be small. Make sure that excess capacity is available to help speed this stage along. In the second stage, the engineer will start to apply his model to an increasing number of conditions and problems. The cluster will need to be sized accordingly. In the third stage, the engineer will have created a repeatable process to model and analyze the results of the hundreds of separate computer runs. This is when the cluster should be expanded rapidly.

Political Problems between Departments

There is usually a disagreement between departments on how to divide up computer resources. Sharing also causes a chargeback and billing nightmare. It really does simplify life to build a separate cluster for each department. Been there, done that. Also, it is relatively rare for two departments to have similar computer sizing needs, so you will effectively have two clusters in practice anyway. It does not matter if the clusters are logically separated or physically separated. They still can't process the same problems.

What kind of IT staff do I need and how many?

Obviously, the larger the cluster, the more support you need. The real question is: do you choose truly complex queuing software, or a next-generation software product (RQJ) that automates as many of the queuing problems as possible? The next question is whether you need support eight hours per day Monday through Friday, on weekend days, or 24x7. If you choose the complex legacy software, plan on trying to find multiple very senior IT staff experienced in HPC. These types of individuals can normally go to cloud computing companies and have a very secure future with higher pay. In Houston, Texas, where there is a lot of HPC activity because of the oilfield, it is difficult to find and keep these employees. We can only guess at how hard it is in other parts of the US. If you choose a next-generation package like RQJ, you can get by with junior administrators.

Are you really serious about your Open Source Claims?

Absolutely! We really like Open Source a lot, but not for HPC. Our principals have 50 years of HPC experience, wrote two queuing systems from scratch along with a research paper, and have decades of experience in commercial software development and administration. We tried to compile the most likely Open Source project and, after a fair amount of time, had nothing to show for it. We even chose one that we had gotten to work in years past and that is still running at a client site. Do you have equivalent skills on your team? Is it worth the wait? Can you fix any problems? Probably not. We want to stress that these projects have not been maintained for years (often decades) and can't be compiled with current build environments and compilers. The good packages were written in Java. Java 7 and 8 have broken most Open Source Java projects. Not too long ago, 60%+ of Open Source projects could not compile under Java 8. Migrating from Java 8 to 11 is fraught with problems. See the comments from the JodaTime library author. Over 90% of Java developers are still at Java 8 and below. Here are three JVM usage surveys from developers: Survey 1, Survey 2 and Survey 3. If you are not convinced yet, a lot of the documentation is located in really old newsgroups. It is hard to find what you are looking for. If the help file is daunting (like those of the good HPC Open Source packages), then you know supporting the software is going to be bad. Your time is not free.

Windows Support

RQJ supports fully patched Windows 10 and above. It also supports all current Windows Server versions that are fully patched. A Windows High Performance Computing cluster may only contain Windows boxes as cluster members. The main RQJ server *should be* a supported Windows Server version.
We developed the code on Windows 7/10 boxes, and during development we discovered a number of issues with personal editions of Windows that cause problems with cluster members. Some of these issues also apply to Windows Server versions. It should be noted that ALL of the top 1000 HPC installations run Linux. This is not an accident. All this being said, if you need to run on Windows, or only on Windows 10, that is OK, but you will have a few extra issues from time to time.

Linux Support

We only support current Red Hat, CentOS and Ubuntu distributions. The OS must be at current patch levels. A Linux cluster may only contain Linux machines.

Licensing

All CPU cores used as compute nodes must be licensed. RQJ will refuse to assign jobs to a computer if there are not enough licenses.
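As a rough illustration of per-core licensing, the sketch below works out which machines could receive jobs under a given license count. It is a guess at the general idea only, not RQJ's actual licensing logic; the function name, the machine names, the core counts and the first-come ordering are all assumptions.

```python
def licensed_nodes(node_cores, licensed_cores):
    """Return the machines whose cores fit within the license count.

    node_cores: dict mapping a machine name to its CPU core count
    (hypothetical machines, considered in the order given).
    licensed_cores: total number of CPU core licenses owned.
    """
    allowed = []
    remaining = licensed_cores
    for name, cores in node_cores.items():
        # A machine is skipped when there are not enough licenses left for it.
        if cores <= remaining:
            allowed.append(name)
            remaining -= cores
    return allowed


cluster = {"node-a": 16, "node-b": 32, "node-c": 8}
print(licensed_nodes(cluster, licensed_cores=30))
```

With 30 licenses, the 32-core machine is left out while the 16-core and 8-core machines are used, so it pays to total the cores across every compute node before ordering licenses.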

Multiple Cluster Costs

Since RQJ is licensed by the CPU cores used as compute nodes, there are no additional charges to have any number of clusters that you want.

What if I need KWS to Create Scripts, Install the Software for us or give additional help/guidance?

We offer remote and local installation services for a fee. We can write required scripts either remotely or on site for a fee. If you need additional consulting, this can also be arranged.

What kind of Guarantee do you offer?

Within the first 30 days, you can get a full refund.


To Obtain Additional Information:


Send an email to sales@kingwoodsoftware.com for pricing or questions or Click to Contact Sales


or

Attn: Sales
Kingwood Software LLC
4321 Kingwood Dr. #57
Kingwood Texas 77339