Researchers Use Inter-rater Reliability Calculations To Propose mHealth App Review Standards


The digital health industry has largely balked at embracing a set of vendor-neutral mobile health app standards, despite an overwhelming body of evidence showing that app stores are littered with clinically questionable health and wellness apps. The first attempt to introduce a neutral review process came from Happtique, a startup focused on creating a formulary of clinically vetted mobile health apps that doctors could trust and prescribe to patients. The startup ultimately failed for several reasons. First, it attempted to monetize by charging app developers to have their products reviewed, a fee most developers were unwilling to pay, especially since Happtique was a new startup whose seal of approval carried no weight in the market. Second, Happtique’s own validation process was grossly inadequate: shortly after it published its long-awaited formulary, the digital health industry found that many of its approved apps lacked basic security and contained clinically questionable content.

Since Happtique’s fall, other organizations have stepped in to bring credibility to app stores. In England, the NHS is building its own app formulary that it hopes will help local doctors and patients steer clear of untrustworthy apps. Others in the industry question the need for ratings at all, such as Paul Sonnier, a digital health strategist who notes, “I continue to be amazed at the patronizing stance of those in the medical establishment who feel it’s their duty to say which apps are good or bad. They’re not rating gyms, fitness classes, or even personal fitness trainers, for example, so why the fixation with consumer digital health apps?”

While Sonnier questions the need for oversight, others working in the digital health space continue to search for a set of standards that could be used to objectively evaluate health apps in a way that would succeed where Happtique failed. To that end, researchers from Harvard University, Brigham and Women’s Hospital, and the UC Davis School of Medicine teamed up to determine which measures generate the most consensus among reviewers. The researchers asked six reviewers to rate the top 10 depression apps and top 10 smoking cessation apps in the iTunes App Store against 22 predefined criteria, then compared the scores to see which criteria produced the most agreement among reviewers and which produced wide variation.
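The comparison described above, measuring how consistently multiple raters score the same items, is the kind of inter-rater reliability calculation the headline refers to. The article does not say which statistic the researchers used, so as a hypothetical illustration only, here is a minimal sketch of Fleiss’ kappa, a common chance-corrected agreement measure for a fixed panel of raters; the `fleiss_kappa` function name and the toy ratings are assumptions, not taken from the study.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a ratings table.

    counts: one row per rated item (e.g., per app/criterion pair);
    each row holds how many raters assigned that item to each
    category. Every row must sum to the same number of raters.
    Returns 1.0 for perfect agreement, ~0 for chance-level agreement.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Proportion of all assignments that fall in each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_categories)]

    # Per-item agreement: fraction of rater pairs that agree.
    p_i = [(sum(c * c for c in row) - n_raters) /
           (n_raters * (n_raters - 1)) for row in counts]

    p_bar = sum(p_i) / n_items        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # expected agreement by chance
    return (p_bar - p_e) / (1 - p_e)


# Toy example: six raters scoring two items into two categories.
# All six raters agree on both items, so kappa is 1.0.
print(fleiss_kappa([[6, 0], [0, 6]]))
```

A kappa near zero on a criterion would correspond to the study’s finding that reviewers could not reach meaningful consensus on it, even when their raw scores look superficially similar.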

Researchers found the most agreement among reviewers when they evaluated interactiveness and feedback, and the most disagreement when they scored performance issues and errors. Overall, consensus among the reviewers’ scores was low. As expected, criteria that left little room for observational discretion produced the most consensus, while more subjective criteria produced wider differences of opinion. Clinical measures such as effectiveness, ease of use, and performance varied widely among reviewers. If even educated reviewers cannot reach consensus when evaluating apps, then, as Sonnier suggests, independent review of mHealth apps may be unnecessary.

Enjoy HIStalk Connect? Sign up for update alerts, or follow us at @HIStalkConnect.

