Eachlabs | AI Workflows for app builders
elevenlabs-text-to-speech-with-timestamp

ELEVENLABS

Converts written text into natural, lifelike speech with precise timestamps. Offers clear pronunciation, smooth pacing, and expressive delivery, making it ideal for voiceovers, narration, and time-synchronized audio content.

Avg Run Time: 7.000s

Model Slug: elevenlabs-text-to-speech-with-timestamp

Playground

Input

Aria
Roger
Sarah
Laura
Charlie
George
Callum
River
Liam
Charlotte
Alice
Matilda
Will
Jessica
Eric
Chris
Brian
Daniel
Lily
Bill

Output

Example Result

{
  "output": {
    "alignment": {
      "character_end_times_seconds": [
        0.081, 0.139, 0.163, 0.221, 0.267, 0.325, 0.383, 0.418, 0.453, 0.499,
        0.522, 0.546, 0.569, 0.592, 0.627, 0.65, 0.673, 0.697, 0.731, 0.755,
        0.789, 0.824, 0.871, 0.917, 0.952, 0.975, 0.998, 1.033, 1.08, 1.103,
        1.126, 1.149, 1.196, 1.242, 1.289, 1.335, 1.393, 1.451, 1.474, 1.509,
        1.533, 1.556, 1.579, 1.614, 1.637, 1.66, 1.683, 1.707, 1.753, 1.788,
        1.834, 1.881, 1.904, 1.962, 1.997, 2.043, 2.101, 2.229, 2.38, 2.589,
        2.914, 2.961, 3.042, 3.1, 3.135, 3.158, 3.193, 3.274, 3.332, 3.413,
        3.564, 3.738, 4.122, 4.191, 4.238, 4.284, 4.307, 4.342, 4.365, 4.435,
        4.47, 4.528, 4.574, 4.644, 4.702, 4.853, 5.062, 5.875, 5.921, 5.968,
        5.991, 6.037, 6.084, 6.142, 6.2, 6.246, 6.316, 6.362, 6.478, 6.548,
        6.606, 6.664, 6.711, 6.769, 6.792, 6.838, 6.873, 6.908, 6.943, 6.978,
        7.001, 7.047, 7.07, 7.094, 7.117, 7.163, 7.21, 7.279, 7.338, 7.384,
        7.407, 7.454, 7.477, 7.5, 7.546, 7.581, 7.639, 7.686, 7.744, 7.79,
        7.814, 7.837, 7.883, 7.918, 8.011, 8.069, 8.127, 8.22, 8.464, 8.742,
        8.812, 8.87, 8.905, 8.975, 9.009, 9.056, 9.102, 9.149, 9.172, 9.23,
        9.276, 9.358, 9.427, 9.497, 9.543, 9.625, 9.648, 9.694, 9.776, 9.81,
        9.845, 9.88, 9.927, 9.973, 9.996, 10.031, 10.066, 10.089, 10.159, 10.194,
        10.252, 10.298, 10.356, 10.414, 10.472, 10.542, 10.6, 10.658, 10.716, 10.809,
        11.053, 11.865, 11.912, 11.97, 11.993, 12.063, 12.121, 12.214, 12.341, 12.434,
        12.504, 12.597, 12.806, 13.015, 13.084, 13.131, 13.177, 13.212, 13.259, 13.282,
        13.317, 13.34, 13.363, 13.386, 13.444, 13.479, 13.56, 13.595, 13.653, 13.7,
        13.735, 13.769, 13.816, 13.862, 13.886, 13.955, 13.978, 14.002, 14.025, 14.095,
        14.141, 14.257, 14.315, 14.431, 14.454, 14.501, 14.547, 14.605, 14.64, 14.71,
        14.756, 14.826, 14.872, 14.919, 14.965, 15, 15.058, 15.116, 15.186, 15.267,
        15.302, 15.395, 15.673, 16.115, 16.173, 16.208, 16.242, 16.289, 16.335, 16.358,
        16.393, 16.463, 16.509, 16.567, 16.614, 16.649, 16.684, 16.73, 16.8, 16.858,
        16.904, 16.939, 16.985, 17.02, 17.067, 17.101, 17.148, 17.183, 17.218, 17.252,
        17.299, 17.334, 17.392, 17.45, 17.485, 17.519, 17.543, 17.566, 17.624, 17.659,
        17.705, 17.752, 17.81, 17.903, 17.972, 18.03, 18.1, 18.181, 18.53
      ],
      "character_start_times_seconds": [
        0, 0.081, 0.139, 0.163, 0.221, 0.267, 0.325, 0.383, 0.418, 0.453,
        0.499, 0.522, 0.546, 0.569, 0.592, 0.627, 0.65, 0.673, 0.697, 0.731,
        0.755, 0.789, 0.824, 0.871, 0.917, 0.952, 0.975, 0.998, 1.033, 1.08,
        1.103, 1.126, 1.149, 1.196, 1.242, 1.289, 1.335, 1.393, 1.451, 1.474,
        1.509, 1.533, 1.556, 1.579, 1.614, 1.637, 1.66, 1.683, 1.707, 1.753,
        1.788, 1.834, 1.881, 1.904, 1.962, 1.997, 2.043, 2.101, 2.229, 2.38,
        2.589, 2.914, 2.961, 3.042, 3.1, 3.135, 3.158, 3.193, 3.274, 3.332,
        3.413, 3.564, 3.738, 4.122, 4.191, 4.238, 4.284, 4.307, 4.342, 4.365,
        4.435, 4.47, 4.528, 4.574, 4.644, 4.702, 4.853, 5.062, 5.875, 5.921,
        5.968, 5.991, 6.037, 6.084, 6.142, 6.2, 6.246, 6.316, 6.362, 6.478,
        6.548, 6.606, 6.664, 6.711, 6.769, 6.792, 6.838, 6.873, 6.908, 6.943,
        6.978, 7.001, 7.047, 7.07, 7.094, 7.117, 7.163, 7.21, 7.279, 7.338,
        7.384, 7.407, 7.454, 7.477, 7.5, 7.546, 7.581, 7.639, 7.686, 7.744,
        7.79, 7.814, 7.837, 7.883, 7.918, 8.011, 8.069, 8.127, 8.22, 8.464,
        8.742, 8.812, 8.87, 8.905, 8.975, 9.009, 9.056, 9.102, 9.149, 9.172,
        9.23, 9.276, 9.358, 9.427, 9.497, 9.543, 9.625, 9.648, 9.694, 9.776,
        9.81, 9.845, 9.88, 9.927, 9.973, 9.996, 10.031, 10.066, 10.089, 10.159,
        10.194, 10.252, 10.298, 10.356, 10.414, 10.472, 10.542, 10.6, 10.658, 10.716,
        10.809, 11.053, 11.865, 11.912, 11.97, 11.993, 12.063, 12.121, 12.214, 12.341,
        12.434, 12.504, 12.597, 12.806, 13.015, 13.084, 13.131, 13.177, 13.212, 13.259,
        13.282, 13.317, 13.34, 13.363, 13.386, 13.444, 13.479, 13.56, 13.595, 13.653,
        13.7, 13.735, 13.769, 13.816, 13.862, 13.886, 13.955, 13.978, 14.002, 14.025,
        14.095, 14.141, 14.257, 14.315, 14.431, 14.454, 14.501, 14.547, 14.605, 14.64,
        14.71, 14.756, 14.826, 14.872, 14.919, 14.965, 15, 15.058, 15.116, 15.186,
        15.267, 15.302, 15.395, 15.673, 16.115, 16.173, 16.208, 16.242, 16.289, 16.335,
        16.358, 16.393, 16.463, 16.509, 16.567, 16.614, 16.649, 16.684, 16.73, 16.8,
        16.858, 16.904, 16.939, 16.985, 17.02, 17.067, 17.101, 17.148, 17.183, 17.218,
        17.252, 17.299, 17.334, 17.392, 17.45, 17.485, 17.519, 17.543, 17.566, 17.624,
        17.659, 17.705, 17.752, 17.81, 17.903, 17.972, 18.03, 18.1, 18.181
      ],
      "characters": [
        "S", "h", "e", " ", "s", "t", "o", "p", "p", "e",
        "d", " ", "i", "n", " ", "t", "h", "e", " ", "m",
        "i", "d", "d", "l", "e", " ", "o", "f", " ", "t",
        "h", "e", " ", "s", "t", "r", "e", "e", "t", " ",
        "w", "h", "e", "n", " ", "t", "h", "e", " ", "r",
        "a", "i", "n", " ", "b", "e", "g", "a", "n", ",",
        " ", "n", "o", "t", " ", "t", "o", " ", "r", "u",
        "n", ",", " ", "b", "u", "t", " ", "t", "o", " ",
        "l", "i", "s", "t", "e", "n", ".", " ", "T", "h",
        "e", " ", "c", "i", "t", "y", " ", "s", "o", "f",
        "t", "e", "n", "e", "d", " ", "u", "n", "d", "e",
        "r", " ", "t", "h", "e", " ", "s", "o", "u", "n",
        "d", " ", "o", "f", " ", "f", "a", "l", "l", "i",
        "n", "g", " ", "w", "a", "t", "e", "r", ",", " ",
        "a", "n", "d", " ", "f", "o", "r", " ", "a", " ",
        "m", "o", "m", "e", "n", "t", ",", " ", "e", "v",
        "e", "r", "y", "t", "h", "i", "n", "g", " ", "f",
        "e", "l", "t", " ", "s", "l", "o", "w", "e", "r",
        ".", " ", "S", "h", "e", " ", "s", "m", "i", "l",
        "e", "d", ",", " ", "t", "u", "c", "k", "e", "d",
        " ", "h", "e", "r", " ", "h", "a", "n", "d", "s",
        " ", "i", "n", "t", "o", " ", "h", "e", "r", " ",
        "c", "o", "a", "t", ",", " ", "a", "n", "d", " ",
        "k", "e", "p", "t", " ", "w", "a", "l", "k", "i",
        "n", "g", ",", " ", "k", "n", "o", "w", "i", "n",
        "g", " ", "s", "o", "m", "e", " ", "m", "o", "m",
        "e", "n", "t", "s", " ", "d", "o", "n", "", "t",
        " ", "n", "e", "e", "d", " ", "t", "o", " ", "b",
        "e", " ", "r", "u", "s", "h", "e", "d", "."
      ]
    },
    "audio_url": "https://storage.googleapis.com/magicpoint/outputs/elevenlabs_tts_w_timestamp_output.mp3",
    "normalized_alignment": {
      "character_end_times_seconds": [
        0.046, 0.081, 0.139, 0.163, 0.221, 0.267, 0.325, 0.383, 0.418, 0.453,
        0.499, 0.522, 0.546, 0.569, 0.592, 0.627, 0.65, 0.673, 0.697, 0.731,
        0.755, 0.789, 0.824, 0.871, 0.917, 0.952, 0.975, 0.998, 1.033, 1.08,
        1.103, 1.126, 1.149, 1.196, 1.242, 1.289, 1.335, 1.393, 1.451, 1.474,
        1.509, 1.533, 1.556, 1.579, 1.614, 1.637, 1.66, 1.683, 1.707, 1.753,
        1.788, 1.834, 1.881, 1.904, 1.962, 1.997, 2.043, 2.101, 2.229, 2.38,
        2.589, 2.914, 2.961, 3.042, 3.1, 3.135, 3.158, 3.193, 3.274, 3.332,
        3.413, 3.564, 3.738, 4.122, 4.191, 4.238, 4.284, 4.307, 4.342, 4.365,
        4.435, 4.47, 4.528, 4.574, 4.644, 4.702, 4.853, 5.062, 5.875, 5.921,
        5.968, 5.991, 6.037, 6.084, 6.142, 6.2, 6.246, 6.316, 6.362, 6.478,
        6.548, 6.606, 6.664, 6.711, 6.769, 6.792, 6.838, 6.873, 6.908, 6.943,
        6.978, 7.001, 7.047, 7.07, 7.094, 7.117, 7.163, 7.21, 7.279, 7.338,
        7.384, 7.407, 7.454, 7.477, 7.5, 7.546, 7.581, 7.639, 7.686, 7.744,
        7.79, 7.814, 7.837, 7.883, 7.918, 8.011, 8.069, 8.127, 8.22, 8.464,
        8.742, 8.812, 8.87, 8.905, 8.975, 9.009, 9.056, 9.102, 9.149, 9.172,
        9.23, 9.276, 9.358, 9.427, 9.497, 9.543, 9.625, 9.648, 9.694, 9.776,
        9.81, 9.845, 9.88, 9.927, 9.973, 9.996, 10.031, 10.066, 10.089, 10.159,
        10.194, 10.252, 10.298, 10.356, 10.414, 10.472, 10.542, 10.6, 10.658, 10.716,
        10.809, 11.053, 11.865, 11.912, 11.97, 11.993, 12.063, 12.121, 12.214, 12.341,
        12.434, 12.504, 12.597, 12.806, 13.015, 13.084, 13.131, 13.177, 13.212, 13.259,
        13.282, 13.317, 13.34, 13.363, 13.386, 13.444, 13.479, 13.56, 13.595, 13.653,
        13.7, 13.735, 13.769, 13.816, 13.862, 13.886, 13.955, 13.978, 14.002, 14.025,
        14.095, 14.141, 14.257, 14.315, 14.431, 14.454, 14.501, 14.547, 14.605, 14.64,
        14.71, 14.756, 14.826, 14.872, 14.919, 14.965, 15, 15.058, 15.116, 15.186,
        15.267, 15.302, 15.395, 15.673, 16.115, 16.173, 16.208, 16.242, 16.289, 16.335,
        16.358, 16.393, 16.463, 16.509, 16.567, 16.614, 16.649, 16.684, 16.73, 16.8,
        16.858, 16.904, 16.939, 16.985, 17.02, 17.067, 17.101, 17.148, 17.183, 17.218,
        17.252, 17.299, 17.334, 17.392, 17.45, 17.485, 17.519, 17.543, 17.566, 17.624,
        17.659, 17.705, 17.752, 17.81, 17.903, 17.972, 18.03, 18.1, 18.181, 18.355,
        18.53
      ],
      "character_start_times_seconds": [
        0, 0.046, 0.081, 0.139, 0.163, 0.221, 0.267, 0.325, 0.383, 0.418,
        0.453, 0.499, 0.522, 0.546, 0.569, 0.592, 0.627, 0.65, 0.673, 0.697,
        0.731, 0.755, 0.789, 0.824, 0.871, 0.917, 0.952, 0.975, 0.998, 1.033,
        1.08, 1.103, 1.126, 1.149, 1.196, 1.242, 1.289, 1.335, 1.393, 1.451,
        1.474, 1.509, 1.533, 1.556, 1.579, 1.614, 1.637, 1.66, 1.683, 1.707,
        1.753, 1.788, 1.834, 1.881, 1.904, 1.962, 1.997, 2.043, 2.101, 2.229,
        2.38, 2.589, 2.914, 2.961, 3.042, 3.1, 3.135, 3.158, 3.193, 3.274,
        3.332, 3.413, 3.564, 3.738, 4.122, 4.191, 4.238, 4.284, 4.307, 4.342,
        4.365, 4.435, 4.47, 4.528, 4.574, 4.644, 4.702, 4.853, 5.062, 5.875,
        5.921, 5.968, 5.991, 6.037, 6.084, 6.142, 6.2, 6.246, 6.316, 6.362,
        6.478, 6.548, 6.606, 6.664, 6.711, 6.769, 6.792, 6.838, 6.873, 6.908,
        6.943, 6.978, 7.001, 7.047, 7.07, 7.094, 7.117, 7.163, 7.21, 7.279,
        7.338, 7.384, 7.407, 7.454, 7.477, 7.5, 7.546, 7.581, 7.639, 7.686,
        7.744, 7.79, 7.814, 7.837, 7.883, 7.918, 8.011, 8.069, 8.127, 8.22,
        8.464, 8.742, 8.812, 8.87, 8.905, 8.975, 9.009, 9.056, 9.102, 9.149,
        9.172, 9.23, 9.276, 9.358, 9.427, 9.497, 9.543, 9.625, 9.648, 9.694,
        9.776, 9.81, 9.845, 9.88, 9.927, 9.973, 9.996, 10.031, 10.066, 10.089,
        10.159, 10.194, 10.252, 10.298, 10.356, 10.414, 10.472, 10.542, 10.6, 10.658,
        10.716, 10.809, 11.053, 11.865, 11.912, 11.97, 11.993, 12.063, 12.121, 12.214,
        12.341, 12.434, 12.504, 12.597, 12.806, 13.015, 13.084, 13.131, 13.177, 13.212,
        13.259, 13.282, 13.317, 13.34, 13.363, 13.386, 13.444, 13.479, 13.56, 13.595,
        13.653, 13.7, 13.735, 13.769, 13.816, 13.862, 13.886, 13.955, 13.978, 14.002,
        14.025, 14.095, 14.141, 14.257, 14.315, 14.431, 14.454, 14.501, 14.547, 14.605,
        14.64, 14.71, 14.756, 14.826, 14.872, 14.919, 14.965, 15, 15.058, 15.116,
        15.186, 15.267, 15.302, 15.395, 15.673, 16.115, 16.173, 16.208, 16.242, 16.289,
        16.335, 16.358, 16.393, 16.463, 16.509, 16.567, 16.614, 16.649, 16.684, 16.73,
        16.8, 16.858, 16.904, 16.939, 16.985, 17.02, 17.067, 17.101, 17.148, 17.183,
        17.218, 17.252, 17.299, 17.334, 17.392, 17.45, 17.485, 17.519, 17.543, 17.566,
        17.624, 17.659, 17.705, 17.752, 17.81, 17.903, 17.972, 18.03, 18.1, 18.181,
        18.355
      ],
      "characters": [
        " ", "S", "h", "e", " ", "s", "t", "o", "p", "p",
        "e", "d", " ", "i", "n", " ", "t", "h", "e", " ",
        "m", "i", "d", "d", "l", "e", " ", "o", "f", " ",
        "t", "h", "e", " ", "s", "t", "r", "e", "e", "t",
        " ", "w", "h", "e", "n", " ", "t", "h", "e", " ",
        "r", "a", "i", "n", " ", "b", "e", "g", "a", "n",
        ",", " ", "n", "o", "t", " ", "t", "o", " ", "r",
        "u", "n", ",", " ", "b", "u", "t", " ", "t", "o",
        " ", "l", "i", "s", "t", "e", "n", ".", " ", "T",
        "h", "e", " ", "c", "i", "t", "y", " ", "s", "o",
        "f", "t", "e", "n", "e", "d", " ", "u", "n", "d",
        "e", "r", " ", "t", "h", "e", " ", "s", "o", "u",
        "n", "d", " ", "o", "f", " ", "f", "a", "l", "l",
        "i", "n", "g", " ", "w", "a", "t", "e", "r", ",",
        " ", "a", "n", "d", " ", "f", "o", "r", " ", "a",
        " ", "m", "o", "m", "e", "n", "t", ",", " ", "e",
        "v", "e", "r", "y", "t", "h", "i", "n", "g", " ",
        "f", "e", "l", "t", " ", "s", "l", "o", "w", "e",
        "r", ".", " ", "S", "h", "e", " ", "s", "m", "i",
        "l", "e", "d", ",", " ", "t", "u", "c", "k", "e",
        "d", " ", "h", "e", "r", " ", "h", "a", "n", "d",
        "s", " ", "i", "n", "t", "o", " ", "h", "e", "r",
        " ", "c", "o", "a", "t", ",", " ", "a", "n", "d",
        " ", "k", "e", "p", "t", " ", "w", "a", "l", "k",
        "i", "n", "g", ",", " ", "k", "n", "o", "w", "i",
        "n", "g", " ", "s", "o", "m", "e", " ", "m", "o",
        "m", "e", "n", "t", "s", " ", "d", "o", "n", "'",
        "t", " ", "n", "e", "e", "d", " ", "t", "o", " ",
        "b", "e", " ", "r", "u", "s", "h", "e", "d", ".",
        " "
      ]
    }
  }
}
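
The example result reports timing per character. For subtitles or highlighting it is often more convenient to work with words. The following is a minimal sketch, not part of the Eachlabs or ElevenLabs SDKs: `words_from_alignment` is a hypothetical helper name, and it assumes the three parallel arrays shown in the example result (`characters`, `character_start_times_seconds`, `character_end_times_seconds`). Punctuation stays attached to the word it follows.

```python
from typing import Dict, List


def words_from_alignment(alignment: Dict[str, list]) -> List[dict]:
    """Group character-level timings into word-level spans (hypothetical helper)."""
    chars = alignment["characters"]
    starts = alignment["character_start_times_seconds"]
    ends = alignment["character_end_times_seconds"]

    words: List[dict] = []
    current, word_start, word_end = "", None, None
    for ch, start, end in zip(chars, starts, ends):
        if ch.isspace():
            # Whitespace closes the word collected so far.
            if current:
                words.append({"word": current, "start": word_start, "end": word_end})
                current = ""
            continue
        if not current:
            word_start = start  # first character of a new word
        current += ch
        word_end = end
    if current:
        words.append({"word": current, "start": word_start, "end": word_end})
    return words


# First eleven characters of the example result: "She stopped"
sample = {
    "characters": ["S", "h", "e", " ", "s", "t", "o", "p", "p", "e", "d"],
    "character_start_times_seconds": [0, 0.081, 0.139, 0.163, 0.221, 0.267,
                                      0.325, 0.383, 0.418, 0.453, 0.499],
    "character_end_times_seconds": [0.081, 0.139, 0.163, 0.221, 0.267, 0.325,
                                    0.383, 0.418, 0.453, 0.499, 0.522],
}
words = words_from_alignment(sample)
for w in words:
    print(w)
```

With the sample above, "She" spans 0 to 0.163 seconds and "stopped" spans 0.221 to 0.522 seconds.
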
Calculated using formula: len(text) * 0.00005
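
As a quick illustration of the formula above, the sketch below applies it to the first sentence of the example text. The page does not state the unit of the result, so none is assumed here; `estimated_cost` is a hypothetical helper name.

```python
def estimated_cost(text: str, rate_per_char: float = 0.00005) -> float:
    """Apply the page's formula: len(text) * 0.00005 (unit not stated on the page)."""
    return len(text) * rate_per_char


sentence = ("She stopped in the middle of the street when the rain began, "
            "not to run, but to listen.")
cost = estimated_cost(sentence)
print(len(sentence), round(cost, 5))
```
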

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
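
The create-then-poll flow described above can be sketched as a small loop. This is an illustration under stated assumptions, not the Eachlabs SDK: the page does not show the endpoint URLs or response schema, so the HTTP call is injected as a `fetch` callable. The `"status"` key and the failure status names are assumptions; only the success status is mentioned in the text.

```python
import time
from typing import Any, Callable, Dict


def wait_for_prediction(
    fetch: Callable[[str], Dict[str, Any]],
    prediction_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
) -> Dict[str, Any]:
    """Call fetch(prediction_id) until it reports success, failure, or a timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch(prediction_id)
        status = result.get("status")  # assumed field name
        if status == "success":
            return result
        if status in ("error", "failed", "canceled"):  # assumed failure names
            raise RuntimeError(f"prediction {prediction_id} ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"prediction {prediction_id} not ready after {timeout_s}s")
        time.sleep(interval_s)


# Stand-in for the real HTTP call: reports "processing" once, then "success".
_responses = iter([
    {"status": "processing"},
    {"status": "success", "output": {}},
])


def fake_fetch(prediction_id: str) -> Dict[str, Any]:
    return next(_responses)


result = wait_for_prediction(fake_fetch, "demo-id", interval_s=0.0)
print(result["status"])
```

In real use, `fetch` would wrap an authenticated GET request to the prediction endpoint, and the ID would come from the response to the initial POST request.
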

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

The model name "elevenlabs-text-to-speech-with-timestamp" refers to integrations built on top of ElevenLabs’ text-to-speech (TTS) API that generate natural, human-like speech and expose precise timestamps alongside the audio. The example result on this page reports timing per character; other integrations may report it per word or per chunk. These integrations are usually maintained by third-party developers (for example, in open-source toolchains and agent frameworks) rather than being a separately branded core model from ElevenLabs itself. They wrap ElevenLabs’ production TTS models and add structured timing metadata for downstream synchronization tasks such as subtitles, karaoke-style highlights, or aligning visuals with spoken content.

The underlying technology is ElevenLabs’ neural TTS stack, which practitioners and reviewers widely report as delivering state-of-the-art naturalness, emotional expressiveness, and multilingual support compared to most other commercial and open-source systems. Users frequently compare ElevenLabs’ output quality favorably against open-source models, noting smoother prosody, fewer pronunciation glitches, and better expressive range for narration and character voices. The “with timestamp” variants augment this with alignment information (timestamps per character, word, token, or sentence) for time-synchronized audio applications like interactive agents, media automation, or programmatic content creation.

Technical Specifications

  • Architecture:
      • Proprietary neural text-to-speech architecture by ElevenLabs, commonly described by users as multi-speaker and capable of voice cloning and expressive prosody control.
      • “With timestamp” wrappers typically implement alignment and metadata extraction around the core TTS inference.
  • Parameters:
      • Exact parameter counts are not publicly disclosed for ElevenLabs’ TTS models.
      • Community sources and reviewers classify it as a large, production-grade neural TTS system optimized for cloud inference rather than a lightweight edge model.
  • Resolution:
      • Audio sample rates are typically reported as 22.05 kHz or 44.1 kHz for high-quality output; some integrations also support 8 kHz (8000 Hz) output for telephony and other low-bandwidth pipelines.
      • Timestamp granularity depends on the integration: the example result on this page is per character, while some toolchains target word-level timestamps suitable for lip-sync or highlighting.
  • Input/Output formats:
      • Input: plain text (UTF-8), with support for multiple languages and punctuation-based prosody; some wrappers also support SSML-like or custom markup for pauses/emphasis where exposed.
      • Output: raw audio such as WAV/PCM or a common compressed format (the example result links to an MP3 file), depending on the integration; plus structured metadata (JSON or similar) containing timestamps and sometimes segmentation boundaries.
      • “With timestamp” implementations generally output:
          • Audio stream or file.
          • Timing metadata: either an array of segments with start/end times and the associated text (word, token, or sentence) or, as in the example result on this page, parallel arrays of characters with per-character start and end times.
  • Performance metrics:
      • No official public benchmarks specific to “elevenlabs-text-to-speech-with-timestamp” as a named model.
      • Independent TTS comparison articles and guides often rate ElevenLabs ahead of prominent open-source alternatives in perceived naturalness, emotional range, and pronunciation accuracy. The GLM-TTS authors, for example, explicitly note that ElevenLabs still leads in overall naturalness and emotional expressiveness, while their open-source model is competitive on character error rate.
      • Latency is generally regarded as low enough for interactive applications (e.g., voice assistants and agents) when streamed, based on developer reports integrating ElevenLabs in real-time pipelines.

Key Considerations

  • The “with timestamp” naming usually indicates a wrapper or integration that adds timing metadata around ElevenLabs’ TTS, not a fundamentally different core acoustic model.
  • For accurate timestamps, ensure that the integration’s alignment logic is configured correctly (e.g., consistent text normalization between input text and what is used for alignment).
  • Long passages of text can lead to slightly drifting timestamps if the wrapper segments text poorly; chunking text into manageable segments (e.g., sentences or paragraphs) often yields more reliable timing.
  • Prosody and timing are influenced by punctuation and capitalization; clear sentence boundaries improve both naturalness and alignment.
  • There is a practical trade-off between speed and quality when requesting higher sample rates or more expressive/complex voices; some users report higher latency with more expressive settings, especially in real-time agent contexts.
  • Network latency and streaming configuration significantly affect perceived responsiveness in real-time use; local buffering strategy for audio and timestamps should be tuned carefully.
  • When using voice cloning or highly expressive voices, ensure consistent text style; abrupt switches in register, all-caps, or excessive punctuation can lead to prosody artifacts that make timestamps feel visually “off” when synchronized to visuals.
  • Some users report edge cases where rapid code-switching (multiple languages in one sentence) affects pronunciation and rhythm; this can slightly distort practical synchronization with fine-grained visual cues.
  • In multi-component systems (LLM + TTS + timestamp wrapper), failures are often in the plumbing (buffering, chunk boundaries, encoding) rather than in the TTS model itself; robust error handling and logging for the timestamp pipeline are important.
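
Several of the considerations above come down to checking that the timing metadata still matches the text being displayed. A minimal sanity check might look like the sketch below. It assumes the three parallel arrays from the example result; `check_alignment` is a hypothetical helper, not a library function.

```python
from typing import Dict, List


def check_alignment(text: str, alignment: Dict[str, list]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    chars = alignment["characters"]
    starts = alignment["character_start_times_seconds"]
    ends = alignment["character_end_times_seconds"]

    problems: List[str] = []
    if not (len(chars) == len(starts) == len(ends)):
        problems.append("arrays differ in length")
    if "".join(chars) != text:
        # Typical cause: the text was normalized differently before alignment.
        problems.append("characters do not reproduce the input text")
    for i, (start, end) in enumerate(zip(starts, ends)):
        if end < start:
            problems.append(f"character {i} ends before it starts")
        if i > 0 and start < ends[i - 1] - 1e-9:
            problems.append(f"character {i} starts before character {i - 1} ends")
    return problems


good = {
    "characters": ["H", "i"],
    "character_start_times_seconds": [0.0, 0.1],
    "character_end_times_seconds": [0.1, 0.2],
}
print(check_alignment("Hi", good))   # no problems reported
print(check_alignment("Hi!", good))  # text mismatch reported
```

Running a check like this per chunk, and logging the result, makes it easier to tell plumbing failures apart from issues in the TTS output itself.
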

Tips & Tricks

  • Use clear segmentation:
      • Split long input text into sentences or logical narration units and generate audio per chunk, then stitch audio and timestamps. This often yields more stable alignment and easier debugging.
  • Normalize text consistently:
      • Apply the same text normalization rules (numbers to words, abbreviation expansion) both before sending to TTS and for any downstream alignment logic so that word-level timestamps match the displayed text.
  • Control pacing with punctuation:
      • Add commas and periods where natural pauses should be; this both improves naturalness and creates natural anchor points for timestamps to align with cuts or scene changes.
  • Optimize for specific use cases:
      • For telephony or bandwidth-limited environments, use 8 kHz or similarly low sample rates where supported, trading some fidelity for faster transfer and decoding; this is often sufficient for IVRs and callbots.
      • For audiobooks or high-end narration, choose higher sample rates and “warmer” or more expressive voices and allow slightly higher latency.
  • Iterative refinement:
      • Generate a small sample of the script and inspect timestamps visually (e.g., overlay on subtitles) before running large batches; adjust punctuation, wording, or chunk sizes based on observed drift or misalignment.
      • Where minor misalignments occur, adjust text slightly (e.g., breaking up long compound sentences) and regenerate that segment only.
  • Prompt structuring:
      • Avoid excessive use of emojis, all caps, or repeated punctuation ("!!!") when precise timing matters; these can influence prosody in ways that make alignment slightly less predictable.
      • For character dialogue or dramatic narration, include stage directions in brackets and remove them from the text sent to TTS or treat them separately; otherwise they may be spoken and break intended timing.
  • Advanced techniques:
      • Use a separate alignment or diarization step if you need extremely tight word-level timestamps (e.g., for lip-sync); some pipelines generate audio via TTS and then run a forced aligner or ASR with word timestamps to refine timing.
      • In interactive agents, stream both audio and interim timestamps; let the UI or client progressively refine subtitles as more precise segment boundaries become available.
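
The segmentation tip (split, generate per chunk, then stitch) can be sketched as follows. This is a minimal illustration with hypothetical helper names. It assumes each chunk's alignment uses the parallel-array shape from the example result, and that a chunk's audio lasts exactly until its last character end time. Real audio may carry trailing silence, in which case the measured audio duration is the better offset.

```python
import re
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows '.', '!' or '?' (a deliberately simple rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def stitch_alignments(chunks: List[Dict[str, list]]) -> Dict[str, list]:
    """Concatenate per-chunk alignments, shifting each chunk by the audio before it."""
    merged: Dict[str, list] = {
        "characters": [],
        "character_start_times_seconds": [],
        "character_end_times_seconds": [],
    }
    offset = 0.0
    for chunk in chunks:
        starts = chunk["character_start_times_seconds"]
        ends = chunk["character_end_times_seconds"]
        merged["characters"].extend(chunk["characters"])
        merged["character_start_times_seconds"].extend(t + offset for t in starts)
        merged["character_end_times_seconds"].extend(t + offset for t in ends)
        if ends:
            offset += ends[-1]  # assumption: chunk duration == last end time
    return merged


sentences = split_sentences("One. Two! Three?")

# Two made-up chunk alignments, each starting at t = 0.
chunk_a = {"characters": ["A", "."],
           "character_start_times_seconds": [0.0, 0.5],
           "character_end_times_seconds": [0.5, 1.0]}
chunk_b = {"characters": ["B", "."],
           "character_start_times_seconds": [0.0, 0.25],
           "character_end_times_seconds": [0.25, 0.5]}
merged = stitch_alignments([chunk_a, chunk_b])
print(sentences)
print(merged["character_start_times_seconds"])
```

Because each chunk keeps its own local timeline until the final merge, a misaligned segment can be regenerated and re-stitched without touching the others.
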

Capabilities

  • Converts written text into high-quality, natural-sounding speech suitable for narration, voiceovers, dialogue, and interactive speech interfaces.
  • Provides time-aligned metadata (timestamps) enabling synchronization with subtitles, on-screen text, animations, or other timed media.
  • Supports multiple voices and styles (e.g., neutral narration, character voices, more expressive tones) depending on the underlying ElevenLabs voice configuration exposed through the wrapper.
  • Handles relatively long texts for use cases such as podcasts, audiobooks, and long-form video narration, especially when chunked appropriately.
  • Works in real-time or near-real-time contexts when coupled with streaming pipelines, enabling responsive conversational agents and live applications.
  • Adaptable across multiple languages and accents based on ElevenLabs’ language support, with generally strong pronunciation and prosody noted in user comparisons.
  • Technically robust enough to integrate into complex, multi-service agent frameworks that combine TTS, STT, and LLM reasoning, as shown in open-source changelogs that treat ElevenLabs TTS as a first-class service for production-grade voice agents.
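
To show how time-aligned metadata drives synchronization, the sketch below looks up which character is being spoken at a given playback position. It assumes sorted per-character start and end arrays as in the example result; `active_index` is a hypothetical helper name.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def active_index(starts: Sequence[float], ends: Sequence[float], t: float) -> Optional[int]:
    """Return the index of the character spoken at time t, or None if there is none."""
    i = bisect_right(starts, t) - 1  # last character that started at or before t
    if i < 0 or t >= ends[i]:
        return None
    return i


# First four characters of the example result: "She "
starts = [0, 0.081, 0.139, 0.163]
ends = [0.081, 0.139, 0.163, 0.221]

print(active_index(starts, ends, 0.1))    # inside the second character ("h")
print(active_index(starts, ends, 0.221))  # past the last end time
```

A client would typically call this with the audio player's reported position on each UI tick, so that highlights follow actual playback rather than wall-clock time.
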

What Can I Use It For?

  • Professional applications:
      • Automated video narration for educational content, explainer videos, and corporate training where subtitles or visual elements must be synchronized to spoken text.
      • Audiobook and podcast production pipelines where developers want to automatically generate both audio and aligned transcripts or chapter markers.
      • Customer support and sales agents that speak responses generated by language models, with timestamps enabling synchronized on-screen hints, suggested replies, or call-center dashboards.
      • Media localization workflows that generate localized audio plus timing information for dubbing or subtitling.
  • Creative projects:
      • Storytelling, audio dramas, and game dialogue where creators want to drive in-game animations or UI elements from word- or line-level timestamps.
      • Indie video production where individual creators use TTS to generate voiceovers and then auto-align captions and visual effects to the narration.
  • Business use cases:
      • Automated content repurposing: turning blog posts or documentation into narrated videos or podcasts, with timestamps used to generate chaptered content or scrubbable players.
      • Interactive product demos and onboarding flows that speak explanations while highlighting relevant UI areas at timecodes derived from the timestamps.
      • Internal knowledge agents that read out knowledge-base answers while highlighting key sentences in sync for training or accessibility.
  • Personal and community projects:
      • GitHub-hosted chatbots and personal agents that speak responses and show subtitles aligned with timestamps, leveraging ElevenLabs TTS service wiring visible in open-source frameworks.
      • Hobbyist tools that generate language-learning materials, such as spoken sentences with synchronized text for karaoke-style reading practice.
  • Industry-specific applications:
      • E-learning and edtech systems that auto-generate instruction audio plus synchronized on-screen text for accessibility and engagement.
      • Call center QA tools that replay synthesized prompts and capture alignment data as part of scenario simulations.
      • Accessibility tools for visually impaired users where timestamps help coordinate audio cues with haptic feedback or limited visual elements.
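
Subtitling is the most common of these uses. As a minimal sketch, the code below turns sentence-level cues into SubRip (SRT) text. The cue times are taken from the example result on this page (the first sentence ends at 5.062 s; the second runs from 5.875 s to 11.053 s). The helper names are hypothetical.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp form used by SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues: Iterable[Tuple[float, float, str]]) -> str:
    """Build SRT text from (start_seconds, end_seconds, text) cues."""
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)


cues = [
    (0, 5.062, "She stopped in the middle of the street when the rain began, "
               "not to run, but to listen."),
    (5.875, 11.053, "The city softened under the sound of falling water, "
                    "and for a moment, everything felt slower."),
]
srt_text = to_srt(cues)
print(srt_text)
```

Cue boundaries here follow sentence-ending punctuation; shorter cues (for example, per clause) can be derived the same way from the per-character times.
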

Things to Be Aware Of

  • Experimental integration behavior:
      • Some open-source toolchains that wire ElevenLabs TTS with timestamp-like functionality evolve rapidly; changelogs show ongoing adjustments to parameters such as sample rate support (e.g., adding 8000 Hz support) and runtime-configurable model/language/voice settings, indicating that timestamp-enabled pipelines may change behavior across versions.
  • Known quirks and edge cases:
      • Developers report occasional errors when forwarding generated audio into downstream messaging or telephony systems if the audio format, sampling rate, or headers are not exactly as expected; these issues typically arise in the glue code rather than in the TTS itself but affect end-to-end reliability.
      • Code changes in libraries integrating ElevenLabs sometimes fix TTS-related argument or parameter issues (e.g., restoring full voice listing functionality after a missing argument in a text-to-speech helper), which can indirectly affect available voices and settings used for timestamped TTS flows.
  • Performance considerations:
      • Streaming setups depend on network stability and server response times; intermittent latency spikes can desynchronize timestamps from user perception if the client assumes perfectly linear playback.
      • High sample rate audio and very expressive voices may incur slightly higher compute cost and latency than simpler, lower-fidelity configurations, which matters in large-scale or real-time deployments.
  • Resource requirements:
      • As a cloud-style neural TTS, most resource requirements fall on the remote inference side; on the client or integration side, developers should account for buffering, decoding, and handling timestamp metadata (potentially large JSON structures for long texts).
  • Consistency factors:
      • Minor variations in prosody between runs (especially with expressive voices) can change exact word timings slightly, which is usually acceptable for subtitles but may need consideration for frame-perfect synchronization scenarios.
      • Multi-language and code-switching text may lead to occasional pronunciation or rhythm anomalies that slightly shift timestamp expectations in those segments.
  • Positive user feedback themes:
      • Users and technical reviewers consistently praise ElevenLabs-based TTS for its naturalness, emotional range, and low error rates compared with open-source models; some open-source authors explicitly recommend commercial systems like ElevenLabs when ultimate quality is critical.
      • Community comments highlight that it is particularly strong for narration, marketing videos, and character voices, with minimal robotic artifacts.
  • Common concerns or negative feedback:
      • Some users note reliance on proprietary infrastructure and lack of low-level control over the core model architecture and parameters.
      • Occasional integration bugs (e.g., handling of voice listing, argument mismatches, or sample-rate mismatches) require developers to track upstream library changelogs and adjust accordingly.
      • For extremely tight lip-sync or phoneme-perfect animation, relying solely on TTS-generated timestamps may not be sufficient; developers often add a separate forced alignment step.

Limitations

  • The exact internal architecture, parameter counts, and training details are proprietary and not publicly documented, which limits deep customization or on-premise replication of the core TTS model.
  • Timestamp precision, while generally sufficient for subtitles and synchronized UI elements, may not always meet the strictest requirements for frame-perfect lip-sync or phoneme-level animation without supplementary alignment tools.
  • In highly constrained network or real-time conditions, the combination of high-fidelity audio and expressive voices can introduce latency that may be noticeable in low-latency conversational interfaces unless carefully engineered.