Static analysis is good at finding things you already know about. You write a signature for Facebook’s ad SDK, you match it against DEX bytecode, you report it. It works. It’s what AppXpose has done since day one, and it catches the majority of embedded trackers on any given phone.
But static analysis has a ceiling. It only finds what you told it to look for. An obfuscated SDK that renames its classes every release will slip through. A repackaged APK that swaps the signing certificate but keeps the same code will look identical to a matcher that only inspects bytecode. A network of apps that share data through a common backend, not a common SDK, won’t trigger any signature at all.
That’s where machine learning comes in. Not as a replacement for static analysis, but as a second layer that learns from patterns the signature matcher cannot see. Traditional spyware detection on Android remains the baseline that the ML layer builds on.
What we shipped this week
AppXpose now includes five detection systems that sit alongside the existing DEX tracker scanner:
APK Integrity Checker runs entirely on your device, in parallel with the tracker scan. It inspects DEX headers (magic bytes, endian tags, link fields), signing blocks, native libraries, and file structures. It detects hooking frameworks (Cydia Substrate, Whale/LSPosed, ADBI, BDHook), root tools (su, supersu, busybox, resetprop), debug artifacts (gdbserver, lldb-server, ida.key), repackaging indicators (patched DEX files, backups of the original classes), and known packer signatures (Jiagu, DexProtector, Bangcle). Each finding is assigned a severity from LOW to CRITICAL.
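To make the header checks concrete, here is a minimal Python sketch of the DEX header portion only. The field offsets and constants come from the public DEX format; the function name `check_dex_header` and the severity assigned to each finding are assumptions made for illustration, not AppXpose’s actual rules.

```python
import struct

DEX_MAGIC_PREFIX = b"dex\n"
ENDIAN_CONSTANT = 0x12345678          # standard little-endian marker
REVERSE_ENDIAN_CONSTANT = 0x78563412  # byte-swapped marker
HEADER_SIZE = 0x70                    # length of a standard DEX header


def check_dex_header(data: bytes) -> list[tuple[str, str]]:
    """Return (severity, message) findings for one DEX header.

    Severity labels are illustrative assumptions.
    """
    if len(data) < HEADER_SIZE:
        return [("HIGH", "file is shorter than a DEX header")]

    findings = []

    # Magic is "dex\n", three ASCII version digits, then a NUL byte.
    magic = data[:8]
    if (magic[:4] != DEX_MAGIC_PREFIX
            or not magic[4:7].isdigit()
            or magic[7:8] != b"\x00"):
        findings.append(("HIGH", "unexpected magic bytes"))

    # endian_tag lives at offset 40.
    (endian_tag,) = struct.unpack_from("<I", data, 40)
    if endian_tag == REVERSE_ENDIAN_CONSTANT:
        # The remaining fields would need big-endian parsing, so stop here.
        findings.append(("MEDIUM", "byte-swapped endian tag"))
        return findings
    if endian_tag != ENDIAN_CONSTANT:
        findings.append(("HIGH", "invalid endian tag"))
        return findings

    # header_size at offset 36; link_size and link_off at offsets 44 and 48.
    (header_size,) = struct.unpack_from("<I", data, 36)
    link_size, link_off = struct.unpack_from("<2I", data, 44)
    if header_size != HEADER_SIZE:
        findings.append(("MEDIUM", "non-standard header size"))
    if link_size != 0 or link_off != 0:
        findings.append(("LOW", "link section present"))
    return findings
```

An empty list means the header passed these particular checks; it says nothing about signing blocks, native libraries, or the rest of the APK.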
MalwareBazaar Hash Lookup takes the SHA-256 hash of every scanned APK and checks it against abuse.ch’s open malware database. If the hash matches a known malicious sample, the risk score jumps to CRITICAL. This is a simple but powerful check: it is free to query, adds no noticeable latency because it runs in parallel with the other server calls, and catches any APK whose exact hash has already been reported to the database by the security community.
Koodous Threat Intelligence checks the same APK hash against Koodous, a community-driven Android threat intelligence platform with millions of analyzed samples. It runs in parallel with MalwareBazaar, so two independent malware databases are queried simultaneously without adding latency. A Koodous detection adds 13 points to the risk score.
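The hash-and-lookup flow can be sketched in a few lines of Python. The sketch below deliberately does not call the real MalwareBazaar or Koodous APIs: each lookup is passed in as a plain callable, and the names `sha256_of_file`, `query_all`, and `apply_results` are hypothetical. Only the two scoring rules (a MalwareBazaar match means CRITICAL, a Koodous match adds 13) come from the text above.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large APKs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def query_all(apk_hash: str,
              lookups: dict[str, Callable[[str], bool]]) -> dict[str, bool]:
    """Run every lookup concurrently; each callable returns True on a match."""
    with ThreadPoolExecutor(max_workers=len(lookups) or 1) as pool:
        futures = {name: pool.submit(fn, apk_hash)
                   for name, fn in lookups.items()}
        return {name: fut.result() for name, fut in futures.items()}


def apply_results(base_score: int, results: dict[str, bool]) -> tuple[int, bool]:
    """Return (score, is_critical) after applying the two rules from the text."""
    score = base_score + (13 if results.get("koodous") else 0)
    is_critical = bool(results.get("malwarebazaar"))
    return score, is_critical
```

Because the lookups run in a thread pool, total wait time is roughly that of the slowest service rather than the sum of both.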
AppXpose CertNet is our own crowd-sourced database of signing certificates. Every scan that passes through AppXpose stores the app’s signing cert hash in a central ledger. When a second user scans the same app, we compare certs. If they don’t match, one of the two versions was re-signed, which is one of the strongest indicators of a repackaged or counterfeit APK. We seeded CertNet with 4,700+ verified certs from F-Droid, and it grows with every scan.
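The comparison logic is simple enough to sketch. The class below is an in-memory stand-in for the central ledger, written under assumptions: the `CertLedger` name, its methods, and the three verdict strings are all hypothetical, and a mismatching certificate is never added to the trusted set so that a re-signed copy cannot whitelist itself.

```python
from collections import defaultdict


class CertLedger:
    """Toy in-memory ledger mapping a package name to trusted cert hashes."""

    def __init__(self) -> None:
        self._certs: dict[str, set[str]] = defaultdict(set)

    def seed(self, package: str, cert_hash: str) -> None:
        """Preload a verified certificate hash, e.g. from a trusted catalog."""
        self._certs[package].add(cert_hash.lower())

    def check(self, package: str, cert_hash: str) -> str:
        """Return 'first-seen', 'match', or 'mismatch' for a scanned app."""
        cert_hash = cert_hash.lower()
        known = self._certs[package]
        if not known:
            # Nothing to compare against yet; record this cert as the reference.
            known.add(cert_hash)
            return "first-seen"
        if cert_hash in known:
            return "match"
        # Deliberately not recorded: an unknown cert stays untrusted.
        return "mismatch"
```

A production ledger would also have to handle legitimate signing-key rotation, which this sketch would report as a mismatch.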
Community APK Hash Verification is a crowd-sourced APK hash database. Every scan contributes the app’s SHA-256 hash. Once seven or more distinct devices report the same hash for a package version, that hash becomes the verified baseline. If your APK hash differs from the community consensus, the app was likely tampered with or repackaged.
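A minimal sketch of the consensus rule, assuming each report carries an anonymous device token (the `device_id` parameter here is hypothetical, as are the class and method names). Only the threshold of seven distinct devices comes from the description above. If more than one hash reaches the threshold for the same version, this sketch picks the one reported by the most devices, which is an assumption rather than documented behaviour.

```python
from collections import defaultdict

MIN_DEVICES = 7  # threshold stated in the text


class HashConsensus:
    """Toy store: (package, version) -> apk hash -> set of reporting devices."""

    def __init__(self, min_devices: int = MIN_DEVICES) -> None:
        self.min_devices = min_devices
        self._reports = defaultdict(lambda: defaultdict(set))

    def report(self, package: str, version: str,
               apk_hash: str, device_id: str) -> None:
        # A set ignores repeat reports from the same device.
        self._reports[(package, version)][apk_hash].add(device_id)

    def baseline(self, package: str, version: str):
        """Return the verified hash, or None if no hash has enough reporters."""
        hashes = self._reports.get((package, version), {})
        verified = {h: len(devs) for h, devs in hashes.items()
                    if len(devs) >= self.min_devices}
        if not verified:
            return None
        return max(verified, key=verified.get)

    def verify(self, package: str, version: str, apk_hash: str) -> str:
        base = self.baseline(package, version)
        if base is None:
            return "unverified"
        return "match" if apk_hash == base else "mismatch"
```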
These five systems are deterministic. They don’t use machine learning. They use structural analysis, hash comparison, and certificate verification. They’re the foundation.
What we’re training next
Two ML models are in active development on the corpus of scans we already have:
Tracker Permission Linker (M01) learns which Android permissions statistically co-occur with which tracker SDKs. The hypothesis: certain permission-SDK combinations are suspicious even when the SDK itself is obfuscated beyond recognition. If an app requests READ_CONTACTS and RECORD_AUDIO and contains a class structure that matches the shape of an attribution SDK (even without a recognizable package name), the model flags it. This is useful precisely in the cases where the static signature matcher fails.
The training data comes from AppXpose’s own scan corpus. Every scan records the full permission list, the matched trackers, the unmatched suspicious classes, and the obfuscation status. Over time, the model learns which combinations are normal (a messaging app with contacts + camera + a known analytics SDK) and which are anomalous (a flashlight app with contacts + audio + an unrecognized SDK that structurally resembles an ad network).
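The statistical idea of co-occurrence can be illustrated with a lift score: how much more often a permission and a tracker appear together than they would if they were independent. This is a simple statistic chosen for illustration, not a description of how M01 is built, and the function name and input shape (a list of permission-set and tracker-set pairs) are assumptions.

```python
from collections import Counter
from itertools import product


def cooccurrence_lift(scans):
    """Compute lift for every (permission, tracker) pair seen together.

    scans: list of (permissions, trackers) pairs, one per scanned app.
    lift = P(perm and tracker) / (P(perm) * P(tracker)); values above 1
    mean the pair shows up together more often than chance would predict.
    """
    n = len(scans)
    perm_counts, tracker_counts, pair_counts = Counter(), Counter(), Counter()
    for perms, trackers in scans:
        perms, trackers = set(perms), set(trackers)
        perm_counts.update(perms)
        tracker_counts.update(trackers)
        pair_counts.update(product(perms, trackers))
    return {
        (p, t): (c / n) / ((perm_counts[p] / n) * (tracker_counts[t] / n))
        for (p, t), c in pair_counts.items()
    }


if __name__ == "__main__":
    # Made-up scans, for illustration only.
    demo = [
        ({"READ_CONTACTS", "RECORD_AUDIO"}, {"tracker_a"}),
        ({"READ_CONTACTS"}, {"tracker_a"}),
        ({"CAMERA"}, {"tracker_b"}),
        ({"CAMERA"}, set()),
    ]
    for pair, lift in sorted(cooccurrence_lift(demo).items()):
        print(pair, round(lift, 2))
```

Lift is unstable for rare pairs, so anything built on it would need a minimum-support cutoff before a score is trusted.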
Behavioural Pattern Recognizer (M02) works at a higher level. Instead of analyzing one app at a time, it looks across the entire corpus for non-obvious relationships: apps from different developers that share the same backend servers, apps that use the same rare SDK combination, and apps whose permission profiles are statistically identical despite claiming to be in different categories. The output is a graph of the surveillance ecosystem that no single-app scanner can produce.
This is the model we’re most excited about, and the one that needs the most data to be useful. Every scan makes it marginally smarter.
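The first of those relationships, shared backend servers across developers, maps naturally onto a graph. The sketch below builds the edges under stated assumptions: the input shape (package name mapped to a developer and a set of backend hosts) and the function name are hypothetical, and how hosts would be extracted from an app is not covered here.

```python
from collections import defaultdict
from itertools import combinations


def shared_backend_edges(apps):
    """Link apps from different developers that contact the same host.

    apps: dict of package -> {"developer": str, "hosts": set of hostnames}
    Returns a dict mapping (package_a, package_b) -> set of shared hosts,
    with each pair ordered alphabetically.
    """
    # Invert the mapping: which packages talk to each host?
    by_host = defaultdict(set)
    for package, info in apps.items():
        for host in info["hosts"]:
            by_host[host].add(package)

    edges = defaultdict(set)
    for host, packages in by_host.items():
        for a, b in combinations(sorted(packages), 2):
            # Same-developer overlap is expected, so only keep cross-developer pairs.
            if apps[a]["developer"] != apps[b]["developer"]:
                edges[(a, b)].add(host)
    return dict(edges)
```

In practice, widely used hosts such as CDNs and cloud API gateways would connect almost everything, so a real pipeline would need to filter or down-weight them before the graph says anything useful.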
Why this matters for users
Most Android security tools today are reactive. They maintain a list of known threats and check against it. If a threat isn’t on the list, then as far as the tool is concerned it doesn’t exist. ML changes that dynamic by introducing pattern recognition that can generalize to threats nobody has catalogued yet.
A concrete example: if a new attribution SDK appears on the market next month with a fresh package name and no public documentation, the static matcher will miss it entirely. But if that SDK requests the same permissions, uses the same class structure, and co-occurs with the same companion libraries as three known attribution SDKs, the Tracker Permission Linker will flag it as “structurally similar to known trackers” before any human analyst writes a signature for it.
That’s not a replacement for signatures. It’s an early warning system that buys time until the signatures catch up.
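One simple way to express “structurally similar to known trackers” is set overlap. The sketch below uses Jaccard similarity over a flat feature set (permissions, companion libraries, class-shape tokens). It stands in for whatever the trained model actually does; the feature encoding, the function names, and the 0.6 threshold are all assumptions.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity of two feature collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_known_trackers(candidate, known, threshold: float = 0.6):
    """Return (tracker_name, score) pairs at or above the threshold, best first.

    candidate: features of the unrecognized SDK
    known:     dict of tracker name -> feature set for catalogued SDKs
    threshold: hypothetical cutoff; a real system would tune it on labelled data
    """
    scores = {name: jaccard(candidate, feats) for name, feats in known.items()}
    hits = [(name, score) for name, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

A new SDK that overlaps heavily with several catalogued attribution SDKs would show up in the result list even though no signature names it.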
The data question
ML models are only as good as their training data, and training data in the mobile security space is surprisingly hard to come by. Most datasets are either:
- Academic (large but restricted, like AndroZoo, which is limited to research institutions and prohibits commercial use)
- Commercial (VirusTotal, which is owned by Google and gated behind expensive enterprise tiers)
- Outdated (public malware sample collections that represent threats from 3+ years ago)
AppXpose is in an unusual position: every user who scans an app contributes a data point to the corpus. Not personal data (we don’t collect any), but structural data: which classes exist, which permissions are requested, which trackers are present, whether the APK is obfuscated, what the signing certificate looks like. Our full research data from 3,745 analyzed apps shows the scale of what we’re working with. That structural data, aggregated across thousands of scans, is exactly what the ML models need.
The more people scan, the smarter the models get. And the smarter the models get, the more useful the scans become. It’s a flywheel that doesn’t require us to buy datasets or scrape third-party sources. The data comes from the product itself. This is also why we’re committed to making our detection pipeline open source.
What’s next
The Tracker Permission Linker is closest to shipping. We expect it to be in production within the next few app updates, initially as a secondary signal alongside the static matcher (not replacing it). The Behavioural Pattern Recognizer needs more data before it produces actionable results. We’re giving it time rather than rushing out a model that produces too many false positives.
Both models will be documented on our methodology page before they ship, the same way we documented the 10 pre-score factors and the detection pipeline. If you can’t read how the model works, you shouldn’t trust its output.
Every scan helps. If you haven’t scanned your phone recently, now would be a good time. Try starting with a popular app like WhatsApp to see the detection pipeline in action.