JVM Performance Tuning in Practice: Diagnosing and Optimizing Java Applications with Arthas

2026-06-13

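Preface

In production, Java applications occasionally suffer CPU spikes, slow responses, or out-of-memory errors. The traditional approach of adding logs and restarting the application is slow, and the restart often destroys the evidence you need.

Arthas is an open-source Java diagnostic tool from Alibaba that diagnoses live problems in real time without restarting the application. This article takes a hands-on approach, walking through the most common performance scenarios and showing how to locate and resolve each one with Arthas.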

Quick Install and Startup

Download and Start

Arthas is simple to install: download the boot jar and run it.

# Download and start Arthas
curl -O https://arthas.aliyun.com/arthas-boot.jar
java -jar arthas-boot.jar

On startup, arthas-boot lists the Java processes running on the current machine; enter the number of the target process to attach:

$ java -jar arthas-boot.jar
* [1]: 12345 com.example.myapp.Application
  [2]: 67890 org.apache.catalina.startup.Bootstrap

Please choose an application number (1 to 2) :

Once attached, you enter the Arthas interactive console and see a banner like this:

[INFO] Try to attach success.
[INFO] Attach success.
         ,---.
        ,-----.    _______
       / _\   \   /  _   \
      | (    `. |  | (_)  |
       \ `-.  `_ |  |     /
        `----'  \ |  (\_/
                 |   \
       __        |    \  ___
     /'__`\     |     \/   \
    (  (  )     |  (\_/\  / |
    /`    \     /   \___/  / |
    \_/\  /   /         \/'  |
     // / /  /  \ \     \    \
    /' /'  / \  \ \     \    \
       v  v    v  v      v    v
       Arthas v3.7.2

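Basic Commands

Some common commands to get started:

# Show the state of all threads in the target JVM
thread

# Show basic JVM information (version, GC, memory, etc.)
jvm

# Show system properties
sysprop

# Show environment variables
sysenv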

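Scenario 1: Diagnosing a CPU Spike

A sudden CPU spike is one of the most common production problems. Arthas can quickly narrow it down to the specific thread and code responsible.

Step 1: Find the busiest threads

Use the thread command to find the threads consuming the most CPU:

# Print the stacks of the 3 threads with the highest CPU usage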
thread -n 3

Sample output:

"pool-3-thread-2" Id=25 cpuUsage=85% deltaTime=1700ms time=5600ms RUNNABLE
    at com.example.service.DataProcessor.processBatch(DataProcessor.java:142)
    at com.example.service.DataProcessor.process(DataProcessor.java:98)
    at com.example.controller.ApiController.handleRequest(ApiController.java:56)
    ...
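
According to the Arthas documentation, cpuUsage is a thread's CPU-time increase divided by the sampling interval. As a rough illustration of that calculation (not Arthas's actual code), the sketch below samples per-thread CPU time twice through the JDK's ThreadMXBean and ranks threads by the delta. The class name BusyThreads and the 200 ms window are assumptions made for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Arthas.
final class BusyThreads {

    /** CPU usage of one thread over a sampling window. */
    static final class Usage {
        final long threadId;
        final double cpuPercent;

        Usage(long threadId, double cpuPercent) {
            this.threadId = threadId;
            this.cpuPercent = cpuPercent;
        }
    }

    /** Cumulative CPU time in nanoseconds for each live thread ID. */
    static Map<Long, Long> snapshot(ThreadMXBean mx) {
        Map<Long, Long> cpu = new HashMap<>();
        for (long id : mx.getAllThreadIds()) {
            long nanos = mx.getThreadCpuTime(id); // -1 if the thread died or timing is disabled
            if (nanos >= 0) {
                cpu.put(id, nanos);
            }
        }
        return cpu;
    }

    /** Ranks threads by (CPU-time delta / wall-clock interval), highest first. */
    static List<Usage> topN(Map<Long, Long> before, Map<Long, Long> after,
                            long intervalNanos, int n) {
        List<Usage> usages = new ArrayList<>();
        for (Map.Entry<Long, Long> e : after.entrySet()) {
            // A thread missing from the first snapshot started inside the window.
            long delta = e.getValue() - before.getOrDefault(e.getKey(), 0L);
            usages.add(new Usage(e.getKey(), 100.0 * delta / intervalNanos));
        }
        usages.sort((a, b) -> Double.compare(b.cpuPercent, a.cpuPercent));
        return new ArrayList<>(usages.subList(0, Math.min(n, usages.size())));
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Long, Long> before = snapshot(mx);
        long start = System.nanoTime();
        Thread.sleep(200); // sampling window (assumed value)
        Map<Long, Long> after = snapshot(mx);
        long interval = System.nanoTime() - start;
        for (Usage u : topN(before, after, interval, 3)) {
            System.out.printf("Id=%d cpuUsage=%.1f%%%n", u.threadId, u.cpuPercent);
        }
    }
}
```

Keeping topN free of JMX calls means the ranking logic can be checked with synthetic numbers, while snapshot does the actual sampling.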

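Step 2: Locate the hot method

Use the trace command to time the calls made inside a method:

# Trace the internal call costs of DataProcessor.processBatch, stopping after 5 invocations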
trace com.example.service.DataProcessor processBatch -n 5

The output is a tree showing the call chain and the time spent at each step:

+---ts=15683ms, cost=2300ms, avg=460ms, timestamp=2026-06-13 10:23:45
|  +---ts=15684ms, cost=1800ms, avg=360ms, timestamp=2026-06-13 10:23:45
|  |  +---[1200ms] com.example.dao.BatchMapper.insertBatch():1200ms
|  |  +---[580ms]  com.example.dao.BatchMapper.updateStats():580ms
|  +---[500ms]  com.example.cache.RedisCache.set():500ms

Here the database batch insert takes 1200 ms, about 52% of the 2300 ms total, which makes it the performance bottleneck.
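
The 52% figure is simple arithmetic over the trace costs. As a small worked example, the sketch below computes each call's share of the total and picks the slowest one. The class name TraceShare and the shortened call labels are assumptions for illustration; the costs are the ones from the sample output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; the numbers come from the sample trace output.
final class TraceShare {

    /** Share of the total cost, as a percentage. */
    static double sharePercent(long costMs, long totalMs) {
        return 100.0 * costMs / totalMs;
    }

    /** Name of the call with the highest cost, or null for an empty map. */
    static String slowest(Map<String, Long> costsMs) {
        String worst = null;
        long max = -1;
        for (Map.Entry<String, Long> e : costsMs.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                worst = e.getKey();
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        long totalMs = 2300; // cost of the traced processBatch call
        Map<String, Long> costsMs = new LinkedHashMap<>();
        costsMs.put("BatchMapper.insertBatch", 1200L);
        costsMs.put("BatchMapper.updateStats", 580L);
        costsMs.put("RedisCache.set", 500L);

        for (Map.Entry<String, Long> e : costsMs.entrySet()) {
            System.out.printf("%-26s %5d ms  %5.1f%%%n",
                    e.getKey(), e.getValue(), sharePercent(e.getValue(), totalMs));
        }
        System.out.println("Bottleneck: " + slowest(costsMs));
    }
}
```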

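Step 3: Generate a flame graph (optional)

For a more visual view of CPU hot spots, use the profiler command to generate a flame graph:

# Start sampling (runs until stopped; pass --duration <seconds> to stop automatically)
profiler start

# Check the sampling status
profiler status

# Stop sampling and write the flame graph as HTML
profiler stop --format html --file /tmp/flame.html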

Open the generated flame.html in a browser to see the full call-stack flame graph; the wider a method's frame, the more CPU samples it accounts for.

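Scenario 2: Analyzing Method Latency

When an endpoint slows down, you need to find which method is dragging down the overall response. The monitor and watch commands are the main tools for this.

monitor: collect invocation statistics

# Report statistics for UserController.getUser every 5 seconds, 10 times in total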
monitor -c 5 com.example.controller.UserController getUser -n 10

Output:

 timestamp            class           method   total  success  fail  avg-rt(ms)  fail-rate
------------------------------------------------------------------------------------------
 2026-06-13 10:30:05  UserController  getUser  120    118      2     850.50      1.67%

The statistics show that getUser averages 850.50 ms per call with a failure rate of 1.67% (2 of 120 calls).
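
To make the columns concrete, here is a minimal sketch of how one reporting window could be aggregated into total, success, fail, avg-rt, and fail-rate. It assumes the average covers every call in the window, failed ones included; the class and field names (MonitorWindow, Call, Stats) are hypothetical and not taken from Arthas.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical aggregation sketch; not Arthas's implementation.
final class MonitorWindow {

    /** One observed invocation: elapsed time and whether it threw. */
    static final class Call {
        final double costMs;
        final boolean failed;

        Call(double costMs, boolean failed) {
            this.costMs = costMs;
            this.failed = failed;
        }
    }

    /** Aggregated figures for one reporting window. */
    static final class Stats {
        final int total;
        final int success;
        final int fail;
        final double avgRtMs;
        final double failRatePercent;

        Stats(int total, int success, int fail, double avgRtMs, double failRatePercent) {
            this.total = total;
            this.success = success;
            this.fail = fail;
            this.avgRtMs = avgRtMs;
            this.failRatePercent = failRatePercent;
        }
    }

    /** Folds the calls of one window into the monitor-style columns. */
    static Stats aggregate(List<Call> calls) {
        int fail = 0;
        double sumMs = 0;
        for (Call c : calls) {
            sumMs += c.costMs;
            if (c.failed) {
                fail++;
            }
        }
        int total = calls.size();
        double avg = total == 0 ? 0.0 : sumMs / total;
        double rate = total == 0 ? 0.0 : 100.0 * fail / total;
        return new Stats(total, total - fail, fail, avg, rate);
    }

    public static void main(String[] args) {
        List<Call> calls = new ArrayList<>();
        for (int i = 0; i < 120; i++) {
            calls.add(new Call(850.5, i < 2)); // the first two calls fail
        }
        Stats s = aggregate(calls);
        System.out.printf("total=%d success=%d fail=%d avg-rt=%.2f fail-rate=%.2f%%%n",
                s.total, s.success, s.fail, s.avgRtMs, s.failRatePercent);
    }
}
```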

watch: observe arguments and return values

# Observe the arguments, return value, and thrown exception of getUser
watch com.example.controller.UserController getUser '{params, returnObj, throwExp}' -x 2 -n 10

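-x 2 sets how many levels deep objects are expanded. The output looks like this (abridged; the // annotations are added for explanation and are not part of the real output):

ts=2026-06-13 10:32:15; [cost=1203ms] result=@ArrayList[
    @Object[][
        @Long[10086],      // userId argument
        @Boolean[true],    // useCache argument
    ],
    @Result[
        code=200,
        data=@UserDTO[
            id=@Long[10086],
            name=@String["张三"],
        ],
    ],
    null,  // no exception thrown
]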

stack: view a method's call stack

To find out where a method is being called from:

# Print the call path into UserService.findUser, for up to 3 invocations
stack com.example.service.UserService findUser -n 3

ts=2026-06-13 10:35:22; cost=0.5ms
    @com.example.service.UserService.findUser()
    at com.example.controller.UserController.getUser(UserController.java:78)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:897)
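
For intuition about what the stack command reports, the sketch below reads the caller of a method from a plain JDK stack trace. It only illustrates the idea of a call path; Arthas obtains it by instrumenting the target method rather than by code like this. The class CallerDemo and its methods are hypothetical stand-ins for the UserController and UserService in the example.

```java
// Hypothetical demo class; it mimics getUser -> findUser from the example above.
final class CallerDemo {

    /** Returns "Class.method" for the caller of the method that invoked this helper. */
    static String whoCalledMe() {
        StackTraceElement[] frames = new Throwable().getStackTrace();
        // frames[0] = whoCalledMe, frames[1] = the inspected method, frames[2] = its caller
        StackTraceElement caller = frames[2];
        return caller.getClassName() + "." + caller.getMethodName();
    }

    /** Stand-in for UserService.findUser: reports who called it. */
    static String findUser() {
        return whoCalledMe();
    }

    /** Stand-in for UserController.getUser: the caller we expect to see. */
    static String getUser() {
        return findUser();
    }

    public static void main(String[] args) {
        System.out.println("findUser was called from " + getUser());
    }
}
```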

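Scenario 3: Diagnosing Memory Leaks

Memory leaks are among the hardest problems to spot in Java applications. Arthas offers several ways to inspect memory.

Heap memory overview

# Live overview of threads, memory, and GC: refresh every 2000 ms, 5 times
dashboard -i 2000 -n 5

# Or use the memory command for a per-region breakdown
memory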

Output of the memory command:

Memory                     used     total    max      usage
heap                       512M     1024M    2048M    25.00%
eden_space                 128M     512M     512M     25.00%
old_gen                    384M     512M     1408M    27.27%
nonheap                    96M      128M     -1       75.00%
metaspace                  80M      96M      -1       83.33%
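
The usage column in this sample is consistent with used divided by max, or by total when max is reported as -1 (undefined). The sketch below prints a similar table from the JDK's own MemoryMXBean and MemoryPoolMXBean, under that assumption about the percentage. The class name MemoryOverview and the column widths are choices made for this example, and pool names vary by garbage collector.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration; not Arthas's implementation.
final class MemoryOverview {

    /** Usage relative to max, falling back to committed ("total") when max is undefined. */
    static double usagePercent(long used, long committed, long max) {
        long base = max > 0 ? max : committed;
        return base > 0 ? 100.0 * used / base : 0.0;
    }

    /** Formats one table row with sizes in MiB. */
    static String row(String name, MemoryUsage u) {
        String max = u.getMax() < 0 ? "-1" : (u.getMax() >> 20) + "M";
        return String.format("%-36s %8dM %8dM %9s %7.2f%%",
                name, u.getUsed() >> 20, u.getCommitted() >> 20, max,
                usagePercent(u.getUsed(), u.getCommitted(), u.getMax()));
    }

    public static void main(String[] args) {
        System.out.printf("%-36s %9s %9s %9s %8s%n", "Memory", "used", "total", "max", "usage");
        System.out.println(row("heap", ManagementFactory.getMemoryMXBean().getHeapMemoryUsage()));
        System.out.println(row("nonheap", ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage()));
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage(); // null if the pool is no longer valid
            if (u != null) {
                System.out.println(row(pool.getName(), u));
            }
        }
    }
}
```

When hunting a leak, the figure to watch is old-generation usage after full GCs: if it keeps climbing across collections, a heap dump is the next step.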

heapdump: export a heap dump

# Export a heap dump to the given path
heapdump /tmp/heapdump.hprof

# Dump only live (reachable) objects, which produces a smaller file
heapdump --live /tmp/heapdump_live.hprof

The exported .hprof file can be analyzed offline with Eclipse MAT or VisualVM to find large objects and the reference chains that keep leaked objects alive.
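
If you would rather have the application write a dump itself (for example from an admin endpoint), HotSpot JVMs expose the same kind of dump through the JDK's HotSpotDiagnosticMXBean. The following is a minimal sketch assuming a HotSpot-based JDK; the class name HeapDumper and the temporary output path are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; requires a HotSpot-based JVM.
final class HeapDumper {

    /**
     * Writes a heap dump to target (the name should end in .hprof).
     * With liveOnly=true only reachable objects are dumped, like heapdump --live.
     */
    static Path dump(Path target, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean mx =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        Files.deleteIfExists(target); // dumpHeap refuses to overwrite an existing file
        mx.dumpHeap(target.toString(), liveOnly);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heapdump");
        Path out = dump(dir.resolve("heapdump_live.hprof"), true);
        System.out.println("Wrote " + out + " (" + Files.size(out) + " bytes)");
    }
}
```

A live-only dump triggers a full GC first, so expect a pause on large heaps regardless of which tool requests the dump.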

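Inspecting objects with ognl expressions

# Read a static field (here, an instance counter that the class maintains itself)
ognl '@com.example.service.CacheService@instanceCount'

# Call a method on a Spring bean. getBean is an instance method, so the static
# @Class@method syntax cannot reach it; fetch a live ApplicationContext with vmtool instead
vmtool --action getInstances --className org.springframework.context.ApplicationContext --express 'instances[0].getBean("userService").cacheSize()'

# Call a static method and chain a getter on its result
ognl '@com.example.config.AppConfig@getGlobalConfig().getMaxRetryCount()'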
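
The @Class@member syntax only reaches static members, so the instanceCount expression works only if the application keeps such a counter itself. A minimal sketch of what that class might look like follows; the article does not show the real CacheService, so this shape is an assumption, and it sits in the default package here for brevity.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Assumed shape of the class behind '@...CacheService@instanceCount'.
final class CacheService {

    /** Static counter: this is the member the OGNL expression reads. */
    static final AtomicInteger instanceCount = new AtomicInteger();

    CacheService() {
        instanceCount.incrementAndGet(); // count every constructed instance
    }

    public static void main(String[] args) {
        new CacheService();
        new CacheService();
        new CacheService();
        System.out.println("instances created so far: " + instanceCount.get());
    }
}
```

Note that a counter like this counts constructions, not objects still alive; for live instance counts, a heap dump histogram is the reliable source.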

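Scenario 4: Class Loading and Hot Update

Arthas can also replace a class's bytecode at runtime, which makes it possible to fix a bug without stopping the application.

Inspecting loaded classes

# Search loaded classes whose name contains "Controller"
sc *Controller*

# Show class details (code source, class loader, annotations, superclasses); add -f to list fields
sc -d com.example.controller.UserController

# Show the class loader hierarchy as a tree
classloader -t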

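Hot-updating a class

When a production bug needs an urgent fix:

# Step 1: decompile the target class to source (--source-only omits the class loader header)
jad --source-only com.example.service.UserService > /tmp/UserService.java

# Step 2: edit /tmp/UserService.java to fix the bug

# Step 3: compile the modified file inside the target JVM
# (if dependencies fail to resolve, pass the class loader hash from sc -d with -c <hash>)
mc /tmp/UserService.java -d /tmp/compiled/

# Step 4: load the compiled class into the JVM
redefine /tmp/compiled/com/example/service/UserService.class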

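⚠️ Note: hot update has limits. You cannot add or remove fields or methods, change method signatures, or rename the class; only method bodies can change. The Arthas documentation recommends the newer retransform command over redefine. Use hot update for emergency bug fixes, not as a routine release mechanism.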

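Scenario 5: Detecting Thread Deadlocks

When an application hangs, a likely cause is threads blocked on each other's locks.

# Find the thread that is blocking other threads
thread -b

# For comparison, a JVM thread dump (jstack format) reports a deadlock like this:
# Found one Java-level deadlock:
# =============================
# "thread-1":
#   waiting to lock monitor 0x00007f8b4c006218 (object 0x00000000aab1f0a0, a com.example.service.UserService),
#   which is held by "thread-2"
# "thread-2":
#   waiting to lock monitor 0x00007f8b4c003a28 (object 0x00000000aab1f070, a com.example.service.OrderService),
#   which is held by "thread-1"

thread -b prints the stack of the thread that holds a lock other threads are waiting for and marks that lock with "<---- but blocks N other threads!", which points directly at the culprit. It currently covers only synchronized blocks, not java.util.concurrent locks.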
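
The JDK can also report deadlocks programmatically through ThreadMXBean.findDeadlockedThreads(), which is handy for a health check. The sketch below deliberately deadlocks two daemon threads by taking two monitors in opposite order and then asks the JVM to find them. The class name DeadlockProbe, the thread names, and the 5-second timeout are assumptions for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo; the deadlock is created on purpose with daemon threads.
final class DeadlockProbe {

    /** Starts two daemon threads that acquire two monitors in opposite order. */
    static void createDeadlock() {
        Object lockA = new Object();
        Object lockB = new Object();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        start("thread-1", lockA, lockB, bothHoldFirst);
        start("thread-2", lockB, lockA, bothHoldFirst);
    }

    private static void start(String name, Object first, Object second, CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    latch.await(); // wait until the other thread holds its first lock
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds this monitor
                }
            }
        }, name);
        t.setDaemon(true); // lets the JVM exit despite the deadlock
        t.start();
    }

    /** Polls for deadlocked threads until some are found or the timeout elapses. */
    static long[] awaitDeadlock(long timeoutMillis) throws InterruptedException {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            long[] ids = mx.findDeadlockedThreads(); // null when no deadlock exists
            if (ids != null) {
                return ids;
            }
            Thread.sleep(50);
        }
        return new long[0];
    }

    public static void main(String[] args) throws InterruptedException {
        createDeadlock();
        long[] ids = awaitDeadlock(5000);
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info != null) {
                System.out.printf("\"%s\" waits for %s held by \"%s\"%n",
                        info.getThreadName(), info.getLockName(), info.getLockOwnerName());
            }
        }
    }
}
```

Unlike thread -b, findDeadlockedThreads also covers ownable synchronizers such as ReentrantLock, so the two checks complement each other.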

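Production Best Practices

Keep the following in mind when using Arthas in production:

1. Resource control

# Arthas itself consumes a small amount of CPU and memory. In production:
# - Keep -n small for each trace/watch
# - Exit as soon as you are done and close any tunnel connection
# - Avoid running several diagnostic commands at once

# Leave the current session (the Arthas server keeps running in the target JVM)
quit

# Or shut down the Arthas server entirely
stop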

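2. Security

Arthas is powerful, and that power carries risk. In production:

# Bind the Arthas server to the loopback address only
java -jar arthas-boot.jar --target-ip 127.0.0.1

# Use arthas-tunnel-server for centralized management
java -jar arthas-tunnel-server.jar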

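3. Command quick reference

Command	Purpose	Scenario
`thread -n 3`	Show the busiest threads	CPU spike
`trace`	Trace call-chain timings inside a method	Slow endpoint
`watch`	Observe arguments and return values	Unexpected data
`monitor`	Collect invocation statistics	Slow endpoint triage
`memory`	Show memory usage	Memory problems
`heapdump`	Export a heap dump	Memory leak
`thread -b`	Find the thread blocking others	Hang or deadlock
`jad`	Decompile a class	Inspect deployed code
`redefine`	Hot-update a class	Emergency fix
`profiler`	Generate a flame graph	CPU hot-spot analysis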

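Summary

Arthas is an essential production diagnostic tool for Java developers, and its core value is diagnosis without downtime. The five scenarios below cover most common production problems:

CPU spike: thread -n 3 → trace → profiler
Slow endpoint: monitor → watch → trace
Memory leak: memory → heapdump → analysis in MAT
Emergency fix: jad → mc → redefine
Hang or deadlock: thread -b to find the blocking thread

Make Arthas part of your routine operations workflow: when a production problem appears, use it first to capture the evidence rather than restarting blindly. Remember that a restart only hides the problem; it does not solve it.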

🔗 Arthas documentation: https://arthas.aliyun.com/doc/
🔗 Arthas GitHub: https://github.com/alibaba/arthas