What's the Best Statistical Software? A Comparison of R, Python, SAS, SPSS and STATA
Common statistics packages differ considerably in their strengths, weaknesses, and handling. The decision as to which system is the best fit should be made with care, because changing to a new system can result in high costs for things like new licenses and re-training. This article introduces and contrasts the market leaders - R, Python, SAS, SPSS, and STATA - to illustrate their relative pros and cons and make the decision a bit easier.
R is a popular, open-source statistics environment that can be extended by packages almost at will. R is commonly used with RStudio, a comfortable development environment that can be used locally or in a client-server installation via a web browser. R applications can also be used directly and interactively on the web via Shiny.
Advantages
- Very large range of functions (well over 2,000 packages)
- New statistical methods are quickly implemented
- Very easy to automate and integrate (for example, with Git, LaTeX, ODBC, Oracle R Enterprise, teradataR, Apache Hadoop, MicroStrategy, etc.)
- Very good community support, as well as fee-based support via third-party providers
- Extensive help resources freely available (manuals, tutorials, and so on)
- Very powerful and flexible scripting language (e.g. support of object-oriented programming)
- All common platforms are supported (Windows, Linux, MacOS...)
- Future-proof due to very large, active developer community
Disadvantages
- Getting familiar with the R syntax presents a barrier to entry
- Stability and quality of little-used packages are often not as high as those of the core distribution
- Powerful hardware is required when working with very large data sets
Licensing model and cost
R is free and open source: there are no fees for use
Originally, R was only a low-cost alternative for those who could not afford a commercial statistics program. R has outgrown this perception and now trumps the commercial competition in terms of functionality, flexibility, and integrability with other applications. Many competitors (e.g. SPSS) have reacted by integrating R into their programs. The criticism that R is much harder to learn and use than commercial competitors is less valid today with the availability of RStudio. R is a particularly good choice for frequent users who plan to deal more extensively with statistics and don’t want to be restricted by their statistical program.
Python is a fully functional, open, interpreted programming language that has become an equal alternative for data science projects in recent years. Python is particularly well-suited to the Deep Learning and Machine Learning fields, and is also practical as statistics software through the use of packages, which can easily be installed. A variety of development environments are available, such as Jupyter, Spyder, and PyCharm. Python is a widely-used language that is also popular in fields like web development.
Advantages
- Powerful, fully-functional programming language
- Offers the potential for object-oriented, structured, and functional concepts
- Mature programming language with, for example, established tooling for unit testing and debugging
- A large number of stable packages in the data science sector and beyond
- Readable, clean syntax
- Constant development by a large developer community
- Full availability of the latest Deep Learning and Machine Learning methods
- Very easy to automate (e.g. via scripts or a web server)
- Easy to integrate with other tools (Git, Teradata, PySpark, Hadoop, KNIME)
- Extremely good community support from a large and constantly-growing community
- Visualizations are appealing and easy to create
- Professional development environments are available
- Future-proof due to continued growth in use in scientific and commercial fields
Disadvantages
- Not all statistical methods are available
- Some development environments for statistics are still in their infancy
- High bar of entry due to being a “full” programming language
Licensing model and cost
There are no user fees for the use of Python. However, in some special areas (e.g. text mining) not all packages are released for commercial use.
Python stands out in this summary given that it is a complete programming language suitable for a wide range of applications. In recent years it has also developed into a serious statistics program due to a large number of high-performance packages, and is increasing in popularity. In particular, Python is indispensable for procedures that are more likely to come from the field of computer science, such as Deep Learning. Its advantages are also clear for automation, and in interaction with other programs (which can also be written in Python). Learning Python requires being prepared to learn a complete programming language, though many good tutorials and training courses are available on the subject due to the language’s popularity. A development environment specifically tailored to the data science sector on the level of RStudio, for example, does not (yet) exist.
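As a rough illustration of how a statistical routine can be scripted in Python, the sketch below computes Welch's two-sample t statistic using only the standard library. The function name `welch_t` and the sample values are invented for this example. In practice one would more likely reach for an installable package.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance computes the sample variance (n - 1 in the denominator).
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)

    # Squared standard errors of the two group means.
    se2_a, se2_b = var_a / n_a, var_b / n_b

    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


# Hypothetical measurements for two independent groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.2, 4.8, 4.4, 5.0, 4.6]

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

SciPy's `scipy.stats.ttest_ind` with `equal_var=False` performs the same test and also reports a p-value, which is why a package is usually the more practical route.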
SAS Institute offers professional statistics software that is commonly used in biometrics, clinical research, and the banking sector.
Advantages
- Fast integration of new statistical methods, very stable and reliable routines
- Very good documentation and professional support
- Numerous modules and interfaces are available, as well as SAS's own business intelligence software (for a fee)
- Well-suited for handling large data sets
- Extensive in-house training offered
Disadvantages
- Several different, partly complicated (but powerful) programming languages
- Partially obsolete interface, GUI optional
Licensing model and cost
A one-year license for SAS® Analytics Pro starts at approximately 7,500€. Typically an individual offer will be made, and special conditions exist for those in the education sector, for example.
SAS is a powerful and very stable tool that is particularly widely used by large organizations, and has become the de facto standard for many analyses in the pharmaceutical sector. The software consists of different modules, some of which follow completely different operating concepts, and the training for SAS is correspondingly complex. Compared to other commercial competitors, SAS is one of the most expensive solutions (which is partially driven by its focus on larger companies and organizations).
SPSS is considered to be particularly easy to use, and is one of the most widely-used statistics programs. The originally independent provider has since been taken over by IBM.
Advantages
- Easy to learn (though not always intuitive to use)
- Expandable via commercial modules (with prices starting around 800€)
- Extensive literature, especially on introductory topics
- Available for Windows and MacOS
Disadvantages
- Stability has suffered from the short, one-year update cycle
- Despite its syntax and scripting language, it is more difficult to automate and integrate into other applications than other solutions
Licensing model and cost
There are different license types available, starting at approximately 1,200€ per year for IBM SPSS Statistics Base, 2,700€ for the standard version, and 5,400€ for the professional version, up to about 8,000€ for the premium version (which includes unlimited use and support for one year). Inexpensive (70€) licenses with a fixed one-year term are available for students. There are also options for monthly subscriptions.
SPSS has the reputation of being the easiest statistics software to use. It is commonly used in universities, particularly in the social sciences and psychology. In recent versions, IBM has steered the software strongly toward a tool that performs largely automated evaluations and requires no special methodological knowledge from the user. This has somewhat damaged the image of SPSS in the scientific community, where the software has a reputation for being used by people who “click” together results without understanding what they are doing. In addition, the short release cycle has negatively impacted stability in the past. While SPSS comes with some more specialized modules (e.g. for direct marketing), the overall spectrum of well-supported methods is smaller than for R or SAS, for example.
STATA is commercial statistical software that is particularly favored by econometricians.
Advantages
- Wide range of functions - almost every established statistical method can be found in STATA
- Easily accessible through a GUI
- Can be automated
- Compatible with older versions
- Good support from the STATA community, and extensive literature available
- Available for Windows, MacOS, and Unix
- Comparatively inexpensive relative to commercial competitors
- Investment security ensured by a three-year release cycle
Disadvantages
- A bit sluggish in terms of incorporating new methods (version updates)
- Before version 16.0, integration with other software was cumbersome; since version 16.0, Python integration has been available
- Before version 16.0, only one data set could be open and edited at a time; since version 16.0, multiple data sets can be loaded simultaneously
Licensing model and cost
Commercial single-user license (IC) from approximately 730€. Considerable discounts are available when purchasing multiple licenses, and special conditions apply for the education sector.
Although STATA is mature, very stable, and powerful software, it is not widely used, especially in companies. For users who value a broad spectrum of methods, stability, a mature operating concept including a scripting language, and a fair price, STATA is superior to the more expensive commercial competition.
The five programs discussed above are the undisputed market leaders in the field of universally-applicable statistics programs, and they cover almost the entire spectrum of statistical methods.
There are also several programs that have specialized in certain methods, and have thus been able to establish themselves for certain applications. Some of these programs are mentioned - briefly - below:
- EViews
EViews is particularly established in econometrics, and focuses on working with time series data. It is primarily used in university economics departments and economic research institutes. A less powerful alternative for time series analysis is the free software JMulTi, which is implemented in Java.
- SPSS Amos
A relatively easy-to-use program for modeling and estimating structural equation models. Alternatives to Amos include LISREL, Mplus, and SmartPLS (for partial least squares).
- WinBUGS and OpenBUGS
WinBUGS, and the related open source project OpenBUGS, are tools designed specifically for Bayesian statistics. While OpenBUGS was developed from the commercial software WinBUGS, the efforts for further development are now concentrated on OpenBUGS. The R packages BRugs and R2OpenBUGS make this functionality available from within R. An alternative to BUGS is JAGS ("Just Another Gibbs Sampler").
- Mathematica and MATLAB
These two commercial programs are used for more numerically-oriented problems.
In addition, there are a number of commercial and open source programs that specialize in data mining methods.